I Gave 6 AI Coding Tools the Same Codebase. The Most Interesting Finding Had Nothing to Do With the Scores.
It’s no secret that I’ve had the itch to go back to where it all began. I love building with AI-assisted engineering. I’ve shipped seven small production systems in the past year. But I’ve had this pull to go deeper. To truly understand the technology I’ve fallen in love with. I want to know what machines think and how people interact with them.
After my adventures building MicroGPT, I went back to studying. Andrew Ng’s Machine Learning Specialization for the foundations, and Chip Huyen’s AI Engineering for the practical engineering layer. I even sent Andrew a thank-you through the course feedback form — he gave a history graduate a new passion, and I wanted him to know.
But I wouldn’t be me if I didn’t start experimenting while reading. Chapters 3 and 4 are about evaluations — how do you measure whether a model is actually good? So I decided to find out for myself.
The Experiment: One Prompt, Six Tools, One Codebase
The idea was simple. I took AIropa — my own European AI news aggregator, built with Python, FastAPI, SQLAlchemy, and a multi-agent LLM pipeline — and gave six different AI coding tools the same prompt:
Please analyse the codebase and explain the architecture, the logic of the project to me. Provide positive arguments about this approach, provide negatives, and provide suggestions on how to make the system better.
I ran it on three tools natively, then three more times through Kilo Code’s CLI to see if the tool wrapper mattered. Six runs total, timed and recorded.
Runs tested:
- Native CLI: Claude Code, Codex (GPT-5.3), Devstral 2
- Kilo Code: Claude Opus 4.6, Codex (GPT-5.2), Devstral 2
Note: Kilo Code didn’t offer Codex 5.3, so the Codex runs aren’t identical models — native = GPT-5.3, Kilo = GPT-5.2.
Claude came back with a whole methodology — rubrics, weights, a neat little evaluation machine. I recognized the approach from Chip’s book. It was sound. I let it run. I trusted the shape of rigor more than the substance.
That was my first mistake.
What surprised me wasn’t who won — it was that the same model could “see” different problems depending on what the tool decided to read. But I’m getting ahead of myself.
Enter Bitty: “You’re Measuring Impressiveness, Not Correctness”
I took Claude’s results to BittyGPT — my “anti-offloading” sparring partner, designed to challenge me instead of finishing my thoughts. Bitty’s response was blunt: this experiment measured “who writes the nicest-sounding review,” not “who is correct.”
Three gaps:
- No groundedness check — models could hallucinate with confidence and I’d score them well for it.
- No truth set — I had no way to measure accuracy, only impressiveness.
- Two different tasks conflated — architecture understanding and bug detection are different skills.
This is the moment the experiment became real. Not when I ran the models, but when someone pushed back on my method and I recognized the critique was valid.
But here’s the uncomfortable part: these critiques didn’t come from me. They came from yet another AI model. For someone obsessed with training critical thinking whenever possible, the temptation to just accept its suggestions is… uncomfortably easy. I’ll come back to this.
Verification: Where Devstral Fell Apart
Bitty suggested I verify claims against the actual codebase. So I picked 3 claims per model — 18 total — and checked them with Claude Code against the real source files.
This is where Devstral fell apart. It had praised AIropa’s “comprehensive testing” — but CI actually runs only 2 of 6 test files. It praised “good documentation” — but the README was wildly outdated. A model that praises broken things is worse than silence — it manufactures confidence.
Meanwhile, Codex (GPT-5.2 via Kilo Code) — which I’d initially scored lower — turned out to be 100% accurate on every claim it made. It found fewer things, but everything it found was real. That mattered. Accuracy-per-claim is a different metric than total-claims-found, and for anyone who cares about hallucinations, it’s the one that counts.
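The distinction can be made concrete. This is a toy sketch, not the experiment’s actual scoring code: treat each model’s report as a set of claimed issues and compare it against the confirmed bug list. The issue IDs below are invented for illustration.

```python
# Accuracy-per-claim is precision; total-real-bugs-found is recall.
# A model can win on one and lose badly on the other.

def score_model(claims: set, gold_bugs: set) -> tuple[float, float]:
    """claims: issue ids the model reported.
    gold_bugs: confirmed issue ids (the truth set)."""
    true_positives = claims & gold_bugs
    precision = len(true_positives) / len(claims)   # accuracy-per-claim
    recall = len(true_positives) / len(gold_bugs)   # coverage of real bugs
    return precision, recall

GOLD = {"b1", "b2", "b3", "b4", "b5"}

# Codex-style profile: few claims, all of them real.
print(score_model({"b1", "b2", "b3"}, GOLD))        # high precision, modest recall
# Devstral-style profile: many claims, most invented.
print(score_model({"b1", "x1", "x2", "x3"}, GOLD))  # low precision, low recall
```

For hallucination-sensitive work, the first profile is the one you want, even though it “found less.”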
Bitty suggested adding “Groundedness” as a seventh criterion, weighted at 2x. This was the most valuable suggestion of the entire experiment. Devstral dropped from merely weak to actively misleading. Codex earned more respect. This is what Chip means when she talks about evaluation criteria evolving through iteration — you don’t start with perfect criteria. You discover what matters by running the eval.
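Here’s roughly what a double-weighted criterion does to a final score. The criteria names and the 1–5 scale are my reconstruction, not the exact rubric Claude built:

```python
# Illustrative rubric: a confident hallucinator tanks on the one
# criterion that counts double, even if it charms everywhere else.
WEIGHTS = {
    "architecture_understanding": 1,
    "bug_detection": 1,
    "clarity": 1,
    "positives": 1,
    "negatives": 1,
    "suggestions": 1,
    "groundedness": 2,  # weighted 2x: penalizes ungrounded claims twice as hard
}

def weighted_score(scores: dict) -> float:
    """Weighted average of per-criterion scores on a 1-5 scale."""
    total = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    return total / sum(WEIGHTS.values())

fluent_hallucinator = {c: 5 for c in WEIGHTS} | {"groundedness": 1}
print(weighted_score(fluent_hallucinator))  # drops a full point off a "perfect" run
```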
Round 2: Can a Better Prompt Fix a Weaker Model?
The first method had flaws, and the whole thing had been fun, so I ran it again.
With verification done and a gold standard bug list established (10 confirmed issues with file paths and line numbers), I designed a second round. The hypothesis: “A structured anti-hallucination prompt will improve weaker models more than stronger ones.”
The new prompt forced models to include evidence for every claim, label uncertainty as INFERRED or UNKNOWN, and separate what the code does from why it does it.
Why does forcing INFERRED/UNKNOWN labels actually work?
This is something I didn’t fully understand at first. But it clicked when I thought about it through my historian brain.
When I wrote history papers, I couldn’t just say “Napoleon lost at Waterloo because he was overconfident.” I had to source the claim. Primary source? Secondary interpretation? My own inference from the evidence? If I couldn’t source it, I either dropped the claim or flagged it as interpretation.
The INFERRED/UNKNOWN labels do the same thing to a language model.
Without labels, the model has one mode: assert confidently. That’s what RLHF training rewards — fluent, authoritative-sounding text. The model doesn’t naturally distinguish between “I read this in the code” and “this seems like it would be true based on patterns I’ve seen in other codebases.” It all comes out sounding the same.
When you force the model to categorize each claim, you insert a reasoning step between “generate claim” and “output claim.” The model has to evaluate: do I actually have evidence for this, or am I pattern-matching?
For models with strong reasoning (Claude, Codex), that checkpoint works. They can genuinely evaluate their own certainty. That’s why Codex went from messy to disciplined — the capability was already there, the prompt gave it structure to express it.
For Devstral, the checkpoint fails because the model can’t reliably evaluate its own certainty. It flagged a nonexistent SQL injection risk at S2 severity: it adopted the format, but it couldn’t tell the difference between “SQLAlchemy uses parameterized queries, so this isn’t a risk” and “I’ve seen SQL injection in other projects, so it’s probably here too.”
The mechanism is forced self-evaluation before output. Strong models benefit because they can do accurate self-evaluation. Weak models attempt it but still confuse inference with evidence. That’s why the prompt is a floor-raiser, not a ceiling-raiser.
Round 2 Results
Codex (GPT-5.3 native) improved the most. It went from unstructured-but-accurate to disciplined-and-accurate — 39/45 in 37 seconds. That’s a genuinely useful tool for quick architectural scans.
Devstral stopped praising broken things. The prompt prevented its worst failure mode. But it couldn’t make Devstral find bugs it doesn’t have the capability to find. It just dressed up mediocre analysis in structured output.
Claude barely changed. Already at the ceiling.
The Title Payoff: Model Brain vs. Tool Eyes
Claude Opus via Kilo Code found a Groq API key in the .env file — sensitive, and potentially catastrophic if that file was ever committed or shared. No other model flagged it. This happened because Kilo Code’s file selection strategy included .env, while Claude Code’s native CLI didn’t look there.
Same LLM vendor. Same prompt. Different findings… because the tool decided to look at different files.
Bitty named this perfectly: “model brain vs. tool eyes.” You’re not evaluating the model in isolation. You’re evaluating model + tool context strategy. This turned out to be one of the most interesting findings of the whole experiment, and it had nothing to do with the scores.
If you’re using AI tools to audit a codebase, you’re not just picking a model — you’re picking a file selection policy.
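A “file selection policy” sounds abstract, but it mostly reduces to include/ignore rules. A minimal sketch — the file list and patterns are invented, not any real tool’s defaults:

```python
# Same repo, different model context, depending on the tool's ignore rules.
import fnmatch

FILES = ["app/main.py", "app/agents.py", "tests/test_feed.py", ".env", "README.md"]

def select(files: list, ignore_patterns: list) -> list:
    """Return the files a tool would actually feed to the model."""
    return [f for f in files
            if not any(fnmatch.fnmatch(f, p) for p in ignore_patterns)]

# A cautious tool that skips dotfiles never shows the model the secret:
print(select(FILES, [".*"]))       # .env is filtered out before the model sees it
# A broader policy surfaces it for audit:
print(select(FILES, ["tests/*"]))  # .env reaches the model; the key gets flagged
```

Two runs of the same model over these two selections would “disagree” about the leaked key, exactly as the Claude runs did.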
The Realization: What I Actually Didn’t Do
After both rounds were done, I brought Bitty’s analysis and Claude’s analysis together. They agreed on everything. All three of us — me, Bitty, Claude — were aligned on findings, methodology, and recommendations.
It felt comfortable. And that’s when it hit me.
I had run the entire experiment to see what would happen. But the most fun part — the sparring, the critical thinking — I didn’t do. I let AI handle it.
I’ve spent months building BittyGPT specifically to prevent cognitive offloading: the pattern where you let AI do your thinking and just consume the output. I wrote a whole blog post about it. I designed a “struggle budget” and “memory echo” system. This is my thing.
And here I was, reading two AI analyses of an experiment where an AI did the scoring, an AI did the verification, and an AI wrote the report — and my response was “good to see everyone’s aligned.”
I flagged it. Not because the experiment was invalid — the design decisions, the methodology iterations, the decision to add a second round, the choice to bring peer review, those were mine. But the moment of comfortable consensus was exactly the kind of thing BittyGPT exists to disrupt. Three sources agreeing feels like validation. It can also be an echo chamber with extra steps.
Claude then overcorrected — challenged the scoring methodology I’d asked it to build — and I had to push back on that too. The experiment was a learning exercise, not a peer-reviewed paper. Different standards apply.
But the exchange was useful. It demonstrated that the cognitive offloading risk isn’t binary. It’s not “you thought” or “AI thought.” It’s a spectrum, and the line moves depending on what you’re trying to learn. I had designed the experiment, iterated the method, and made the key decisions. But I’d handed off the analytical work that I enjoy most.
I don’t learn by reading about things. I learn by building things and then reading about them to understand what I built. I read Chapter 3, then immediately turned it into an experiment using my own codebase. The theory made sense because I had hands-on experience to anchor it to. I run first, understand second. That’s not a bug — that’s how my brain works. And this time, I need to make sure the “understand” part is actually mine.
Note to self: build a Claude anti-cognitive-offloading prompt too. Bitty can’t be the only one holding me accountable.
What I’d Do Differently
- Run each model at least twice. Model output is nondeterministic, so a single run can’t show variance — one bad inference can swing the score. Even two runs would reveal how stable each model is.
- Get a second human scorer. My rubric, my scores, my verification — all validated by AIs working from my framework. A second human would introduce actual disagreement.
- Separate tasks more cleanly. “Analyze the architecture” and “find bugs” are different capabilities. Purpose-built prompts for each would be cleaner.
- Track token usage and cost. I measured time but not cost. For practical recommendations, cost-per-insight matters.
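Cost-per-insight is cheap to compute once you log token usage. A sketch with invented token counts and prices — the point is the shape of the metric, not the numbers:

```python
# Dollars spent per claim that survived verification.
# All figures below are illustrative, not measured from the experiment.

def cost_per_insight(tokens: int, price_per_mtok: float, verified_insights: int) -> float:
    """Run cost divided by the number of claims that checked out."""
    return tokens / 1_000_000 * price_per_mtok / verified_insights

# A deep, expensive run with many verified findings can still beat
# a cheap run whose findings mostly fail verification.
print(cost_per_insight(180_000, 15.0, 9))  # deep-review profile
print(cost_per_insight(120_000, 3.0, 1))   # cheap-but-sloppy profile
```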
What I’ll Take Forward
- Codex at 37 seconds is a real tool. For quick scans before merging or onboarding to a new codebase, it’s the right choice.
- Claude for deep reviews. When you need to understand why something was built, not just what exists.
- Run complementary prompts, not just one. Different prompts find different bugs. Coverage beats optimization.
- Devstral is unreliable for trust-dependent work. It improved with structure but still invents plausible-sounding issues.
- The tool layer matters as much as the model. Which files the CLI reads determines what the model can find.
- Evaluation criteria should evolve. Start loose, discover what matters, tighten. Don’t design the perfect rubric upfront.
The Meta-Lesson
I designed a two-round evaluation experiment with ground truth verification, tested a hypothesis about prompt design, iterated methodology based on peer review, and produced actionable findings for my own codebase — on a Monday afternoon, for fun, while also posting memes about CAPTCHAs on LinkedIn.
This isn’t what I expected when I opened Chip’s book that morning. But it’s exactly the kind of learning I came to AI engineering to do. Not consuming knowledge. Producing it. Messy, iterative, curiosity-driven, and mine.
If you want to see the full experiment data (scoring rubrics, model outputs, gold standard bug list, Round 1 vs Round 2 comparison), I’ll link the complete write-up in the comments.
And if you’re curious about BittyGPT and anti-cognitive-offloading design — that’s a whole other post. But the short version is: if three AI systems agree with you and it feels comfortable, that’s exactly when you should be suspicious.