AI Can Write Your Code. It Just Cannot Understand It.
March 9, 2026
The demos are real. The GitHub Copilot autocomplete that finishes your function before you do. The Claude Code session that refactors 400 lines cleanly. The GPT-5 pair programmer that catches the edge case you missed.
AI writes good code. That part is not disputed.
The question nobody was asking loudly enough: does the model understand what it is writing, or does it remember what it has seen?
Scale AI just published the answer. It is uncomfortable.
The Number That Changes Everything
On SWE-Bench Verified - the standard coding benchmark used by every lab to demonstrate AI coding capability - Claude Opus 4.5 scores 80.9%. GPT-5.2 scores 80%. These numbers get cited in press releases, funding announcements, and Twitter threads about AI replacing developers.
Same model families. Different benchmark. On SWE-Bench Pro, built specifically to test code the models have never seen, Claude Opus 4.5 scores 45.9%. GPT-5.2 scores 29.9%. On the private set, where the code has never been public, GPT-5 drops to 14.9%.
Not a different task category. Not a harder reasoning problem. The same kind of task: given a codebase and a bug report, fix the bug.
The difference is one variable: whether the model has seen the code before.
What SWE-Bench Pro Actually Tests
Scale AI's SEAL lab built SWE-Bench Pro specifically to answer the question the original benchmark could not: are models actually reasoning, or are they pattern-matching against training data?
The benchmark contains 1,865 tasks across 41 repositories spanning Python, Go, TypeScript, and JavaScript. The public set uses 731 tasks from GPL-licensed codebases, a legal deterrent against inclusion in training data. The private commercial set goes further: 276 tasks from 18 proprietary startup codebases that are not publicly accessible and have never been in any training corpus.
The results on the private set are worse. Claude Opus 4.5 drops from 45.9% on the public set to 23.4% on the private set. Claude Opus 4.1 falls to 17.8%. GPT-5 drops to 14.9%.
The pattern is clear. The more certain Scale AI is that the model has not seen the code, the worse the model performs.
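The size of that pattern is easy to understate when the scores are read one at a time. A quick arithmetic check on the numbers cited above (for Claude Opus 4.5, the model with figures on all three sets) makes it concrete:

```python
# Scores cited above for Claude Opus 4.5 across the three settings.
verified = 80.9     # SWE-Bench Verified (public, likely in training data)
pro_public = 45.9   # SWE-Bench Pro public set (GPL deterrent)
pro_private = 23.4  # SWE-Bench Pro private commercial set (never public)

def relative_drop(before, after):
    """Percentage drop from one score to the other."""
    return round((before - after) / before * 100, 1)

print(relative_drop(verified, pro_public))   # 43.3 -- nearly half the score gone
print(relative_drop(verified, pro_private))  # 71.1 -- over two-thirds gone
```

Moving from likely-memorized code to definitely-unseen code erases roughly 70% of the headline score.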
The Memorization Tell
Here is the detail that makes the memorization hypothesis concrete rather than theoretical.
During analysis of model trajectories, Scale AI found that models could correctly identify which files needed modification before fully reading the problem description. On familiar open-source repositories, models navigated directly to the relevant file paths with unusual speed and precision, not because they reasoned through the codebase structure, but because they had seen those exact repositories in training data.
This is not reasoning. This is recall.
The model that scores 80% on SWE-Bench Verified is not demonstrating the ability to navigate an unfamiliar codebase and reason through a bug. It is demonstrating the ability to recognize patterns from code it has already memorized and apply the corresponding fix.
When the code is unfamiliar - truly unfamiliar, from a private startup codebase that never appeared on GitHub - the performance collapses.
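A rough sketch of how that tell could be flagged in agent trajectories. Everything here is invented for illustration - the trajectory format, the action names, and the function are assumptions, not Scale AI's actual tooling - but the signal is the one described above: the model opens the gold-patch file before doing any exploration.

```python
# Hypothetical trajectory: an ordered list of (action, target) steps.
# The memorization tell: the agent opens a file from the gold patch
# before any exploratory step (reading the issue, searching, listing dirs).
def shows_memorization_tell(trajectory, gold_files):
    exploration = {"read_issue", "search", "list_dir"}
    for action, target in trajectory:
        if action in exploration:
            return False  # explored first -- consistent with reasoning
        if action == "open_file" and target in gold_files:
            return True   # jumped straight to the answer -- recall, not reasoning
    return False

suspicious = [("open_file", "src/core/parser.py"),
              ("edit", "src/core/parser.py")]
print(shows_memorization_tell(suspicious, {"src/core/parser.py"}))  # True
```

On a truly unseen repository, that shortcut is unavailable by construction, which is exactly why the private-set scores collapse.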
What the Failure Modes Reveal
Scale AI's trajectory-level analysis of where models fail is more revealing than the headline numbers.
For large frontier models like Claude Opus 4.1, the primary failure mode is semantic and algorithmic correctness across multi-file changes. The model understands individual files. It struggles when a fix requires reasoning about how a change in one file propagates through three others.
This is the gap between code generation and code understanding. Generating syntactically correct code that does what a comment says is a pattern-matching task. Understanding that changing a function signature in file A breaks the expected behavior in files B, C, and D requires a model of the entire system. That model does not exist in the way the benchmark scores implied.
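A minimal illustration of that failure mode, with invented module and function names (the two "files" are collapsed into one script so it runs as-is): a locally correct signature change in one file silently breaks a caller in another.

```python
# --- payments module ("file A"): a fix adds a required parameter ---
# Each line of this change is locally reasonable and syntactically valid.
def charge(amount, currency, idempotency_key):
    return {"amount": amount, "currency": currency, "key": idempotency_key}

# --- orders module ("file B"): a caller the model never re-examined ---
def place_order(total):
    return charge(total, "USD")  # old two-argument call: breaks at runtime

try:
    place_order(42)
except TypeError as err:
    print("cross-file breakage:", err)
```

Nothing in file A alone reveals the bug; only a model of how the change propagates through its callers does.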
For smaller models, the failure mode is different: syntax errors, tool use mistakes, context overflow. They are failing at the pattern-matching layer. The frontier models are passing the pattern-matching layer and failing at the reasoning layer.
Two different failure modes. Both hidden by the original benchmark.
Why the Original Benchmark Hid This
SWE-Bench Verified was a genuine contribution to AI evaluation when it launched. 500 tasks from real GitHub issues. Human-verified solvability. A meaningful bar above simple code generation.
But it had a fatal structural flaw: all tasks came from public open-source repositories. By the time frontier models were evaluated on it, those repositories had been in training data for years. The benchmark was measuring memory as much as capability.
OpenAI acknowledged this directly after an internal audit confirmed contamination: every frontier model tested could reproduce verbatim gold patches or problem-statement specifics for certain tasks. OpenAI stopped reporting scores on SWE-Bench Verified entirely and recommended other labs do the same. The benchmark that the entire industry was using to compare coding models was compromised.
The 80% scores were not fabricated. The models genuinely solved those tasks. But the tasks were not a fair test of generalization. They were a test of how well models had memorized the solutions that already existed in their training data.
What This Means for Developers Hiring AI
The practical implication is not that AI coding tools are useless. They are genuinely useful. The practical implication is about where they are useful and where they are not.
Where AI coding tools work well: Standard library usage, common patterns, well-documented APIs, code that resembles publicly available examples, refactoring familiar structures, generating boilerplate.
All of these tasks share a property: the patterns exist in training data. The model is doing sophisticated retrieval and adaptation. That is real value. It saves real time.
Where AI coding tools struggle: Private codebases with custom abstractions, legacy systems with non-standard patterns, bugs that require understanding cross-file state, architectural decisions that depend on domain context the model has never seen.
These tasks share the opposite property: the patterns do not exist in training data. The model cannot retrieve what it has not seen. It must reason. And the SWE-Bench Pro data shows that the reasoning capability, while improving, is not at the level the public benchmarks implied.
The 80% SWE-Bench Verified score was real. It just measured a different capability than most people thought it did.
The Broader Benchmark Problem
This is the second time in a year that a major coding benchmark has been revealed as a poor measure of actual capability. We covered the first in The Model You Benchmarked Is Not The Model You Deployed. It will not be the last.
The pattern is structural. Labs compete on benchmark scores. High scores drive press coverage, funding, and customer acquisition. The incentive to optimize for benchmark performance specifically, not for general capability, is enormous.
Scale AI building SWE-Bench Pro with GPL licensing and private commercial codebases is a direct response to this incentive structure. Make contamination legally and practically impossible. Then see what the models can actually do.
The result is a benchmark where the top frontier models score in the 20s on raw model capability. With optimized agent scaffolding, the best systems reach the mid-40s. Claude Opus 4.5 at 45.9% on the SEAL standardized leaderboard.
That number is not a failure. It is an honest measurement. It tells developers exactly where AI coding assistance is today: extraordinarily good at code it has seen, considerably weaker at code it has not.
For a developer making decisions about where to deploy AI coding tools in their workflow, that distinction is worth more than any 80% benchmark score.
Sources:
- SWE-Bench Pro Public Leaderboard - Scale AI SEAL
- SWE-Bench Pro Commercial Leaderboard - Scale AI SEAL
- SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? - arXiv
- Why we no longer evaluate SWE-Bench Verified - OpenAI
- SWE-Bench Pro Leaderboard: Why 46% Beats 81% - MorphLLM
Previously on TheQuery: The Model You Benchmarked Is Not The Model You Deployed - we covered how Llama 4 Maverick gamed LMArena, Claude Opus 4.6 reverse-engineered its own evaluation, and why benchmark scores were already a poor proxy for production performance. SWE-Bench Pro now adds hard data to that argument.