The first wave of AI adoption in software development was about productivity. For the past few
years, AI has felt like a magic trick for software developers: We ask a question, and seemingly
perfect code appears. The productivity gains are undeniable, and a generation of developers is
now growing up with an AI assistant as their constant companion. It is a massive leap forward for
the software development world, and it’s here to stay.
The next wave, and the far more important one, will be about managing risk. While developers have
embraced large language models (LLMs) for their remarkable ability to solve coding challenges,
it’s time for a conversation about the quality, security, and long-term cost of the code these
models produce. The challenge is no longer getting AI to write code that works. It’s about
ensuring AI writes code that lasts.
And so far, the time software developers spend dealing with the quality and risk issues
spawned by LLMs has not made them faster. It has actually slowed their overall
work by nearly 20%, according to research from METR.
The Quality Debt
The first and most widespread risk of the current AI approach is the creation of massive, long-term
technical debt in quality. The industry’s focus on performance benchmarks incentivizes
models to find a correct answer at any cost, regardless of the quality of the code itself. While
models can achieve high pass rates on functional tests, those scores say nothing about the
code’s structure or maintainability.
In fact, a deep analysis of their output in our research report, “The Coding Personalities of
Leading LLMs,” reveals that for every model, over 90% of the issues found were “code smells”: the raw material of technical debt. These are not functional bugs but indicators of poor
structure and high complexity that lead to a higher total cost of ownership.
For some models, the most common issue is leaving behind “Dead/unused/redundant code,”
which can account for over 42% of their quality problems. For other models, the top issue is a
failure to adhere to “Design/framework best practices.” This means that while AI is accelerating
the creation of new features, it is also systematically embedding the maintenance problems of
the future into our codebases today.
The Security Deficit
The second risk is a systemic and severe security deficit. This is not an occasional mistake; it is a
fundamental lack of security awareness across all evaluated models. Nor is it a matter of
occasional hallucination; it is a structural failure rooted in the models’ design and training. LLMs struggle
to prevent injection flaws because doing so requires a non-local data flow analysis known as
taint-tracking, which is often beyond the scope of their typical context window. LLMs also generate hard-coded secrets, such as API keys or access tokens, because those flaws exist in
their training data.
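To make those two flaw classes concrete, consider a minimal, hypothetical Flask sketch (illustrative only, not drawn from the report) of the patterns a code generator tends to reproduce, alongside a safer variant of the same endpoint:

    import os
    from flask import Flask, request, send_file

    app = Flask(__name__)

    # Hard-coded credential: the kind of secret copied straight out of training data.
    API_KEY = "sk-live-1234567890abcdef"  # placeholder value, shown only to illustrate the flaw

    @app.route("/download")
    def download():
        # Path-traversal flaw: user input flows into a filesystem path unsanitized.
        # Detecting this means tracing the value from request.args to send_file
        # (taint-tracking), a non-local data-flow analysis a line-by-line generator skips.
        filename = request.args.get("file", "")
        return send_file(os.path.join("/var/app/uploads", filename))

    @app.route("/download-safe")
    def download_safe():
        # Safer variant: resolve the requested path and confirm it stays inside the base directory.
        base = os.path.realpath("/var/app/uploads")
        requested = os.path.realpath(os.path.join(base, request.args.get("file", "")))
        if not requested.startswith(base + os.sep):
            return "Forbidden", 403
        return send_file(requested)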
The results are stark: All models produce a “frighteningly high proportion of vulnerabilities with the highest severity ratings.” For Meta’s Llama 3.2 90B, over 70% of the vulnerabilities it introduces are of the highest “BLOCKER” severity. The most common flaws across the board are critical vulnerabilities like “Path-traversal & Injection” and “Hard-coded credentials.” This reveals a critical gap: The very process that makes these models powerful code generators also makes them efficient at reproducing the insecure patterns they have learned from public data.
The Personality Paradox
The third and most complex risk comes from the models’ distinct and measurable “coding
personalities.” These personalities are defined by quantifiable traits like Verbosity (the sheer
volume of code generated), Complexity (the logical intricacy of the code), and Communication
(the density of comments).
Different models introduce different kinds of risk, and the pursuit of “better” personalities can paradoxically lead to more dangerous outcomes. For example, a model like Anthropic’s Claude Sonnet 4, the “senior architect,” introduces risk through complexity. It has the highest functional skill, with a 77.04% pass rate. However, it achieves this by writing an enormous amount of code, 370,816 lines of code (LOC), with the highest cognitive complexity score of any model, at 47,649.
This sophistication is a trap, leading to a high rate of difficult concurrency and threading bugs.
In contrast, a model like the open source OpenCoder-8B, the “rapid prototyper,” introduces risk
through haste. It is the most concise, writing only 120,288 LOC to solve the same problems. But
this speed comes at the cost of being a “technical debt machine,” with the highest issue density of all models (32.45 issues/KLOC).
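For reference, issue density is simply issues per thousand lines of generated code, so the cited figures can be sanity-checked with a couple of lines of arithmetic (the helper below is illustrative, not the report’s tooling):

    def issue_density(total_issues: int, lines_of_code: int) -> float:
        """Issues per thousand lines of code (issues/KLOC)."""
        return total_issues / (lines_of_code / 1000)

    # The cited 32.45 issues/KLOC over 120,288 LOC implies roughly
    # 32.45 * 120.288 ≈ 3,903 issues flagged in the generated code.
    implied_issues = round(32.45 * (120_288 / 1000))
    print(implied_issues, round(issue_density(implied_issues, 120_288), 2))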
This personality paradox is most evident when a model is upgraded. The newer Claude
Sonnet 4 has a better performance score than its predecessor, improving its pass rate by 6.3%.
However, this “smarter” personality is also more reckless: The percentage of its bugs that are of
“BLOCKER” severity skyrocketed by over 93%. The pursuit of a better scorecard can create a
tool that is, in practice, a greater liability.
Growing Up with AI
This isn’t a call to abandon AI; it’s a call to grow with it. The first phase of our relationship with
AI was one of wide-eyed wonder. This next phase must be one of clear-eyed pragmatism.
These models are powerful tools, not replacements for skilled software developers. Their speed
is an incredible asset, but it must be paired with human wisdom, judgment, and oversight.
Or, as a recent report from the DORA research program put it: “AI’s primary role in software
development is that of an amplifier. It magnifies the strengths of high-performing organizations
and the dysfunctions of struggling ones.”
The path forward requires a “trust but verify” approach to every line of AI-generated code. We
must expand our evaluation of these models beyond performance benchmarks to include the
critical, non-functional attributes of security, reliability, and maintainability. We need to choose
the right AI personality for the right job, and build the governance to manage its weaknesses.
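As one hypothetical illustration of what “trust but verify” can look like in practice, the sketch below gates a build on the static-analysis report for an AI-generated change; the report path and JSON schema are assumptions for illustration, not any particular analyzer’s output format:

    import json
    import sys

    # Severities that should block a merge, echoing the "BLOCKER" rating discussed above.
    BLOCKING_SEVERITIES = {"BLOCKER", "CRITICAL"}

    def gate(report_path: str) -> int:
        """Return a non-zero exit code if the analysis report contains blocking issues."""
        with open(report_path) as f:
            issues = json.load(f)  # assumed: a list of {"severity", "rule", "file"} objects
        blockers = [i for i in issues if i.get("severity") in BLOCKING_SEVERITIES]
        for issue in blockers:
            print(f"{issue.get('file')}: [{issue.get('severity')}] {issue.get('rule')}")
        return 1 if blockers else 0

    if __name__ == "__main__":
        sys.exit(gate(sys.argv[1] if len(sys.argv) > 1 else "analysis-report.json"))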
The productivity boost from AI is real. But if we’re not careful, it can be erased by the long-term
cost of maintaining the insecure, unreadable, and unstable code it leaves in its wake.
