Technology · June 19, 2026

AI Coding Agents Are Not Making Your Codebase Healthier — The CMU Evidence on Speed vs. Quality

By Arthur Jenkins · Technical Writer

The Disconnect Between Perception and Measured Productivity

Every developer productivity survey in 2025 and 2026 reports the same headline: engineers using AI coding tools feel dramatically faster. Anthropic’s internal survey measured a self-reported 50% productivity boost. Stack Overflow’s survey reported 52%. LinkedIn commentary calls this the biggest productivity shift since the compiler.

Then METR (Model Evaluation & Threat Research) ran a randomized controlled trial with 16 experienced open-source maintainers on 246 real tasks across mature codebases averaging over a million lines of code. The measured result: developers using Cursor Pro with Claude 3.5 and 3.7 Sonnet took 19% longer to complete tasks than when working unaided. The developers predicted a 24% speedup beforehand and still believed after the trial that they had been 20% faster, but the clock said otherwise. That is a 39-point gap between perceived and measured performance, and it undermines any productivity claim that rests on self-report alone (METR, arXiv:2507.09089, July 2025).
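The 39-point figure is simple arithmetic, sketched below. The function name is hypothetical, and the sign convention (entering "19% longer" as a speedup of -19) follows this article's percentage-point framing rather than anything METR prescribes.

```python
def perception_gap(perceived_speedup_pct: float, measured_speedup_pct: float) -> float:
    """Gap, in percentage points, between believed and measured speedup."""
    return perceived_speedup_pct - measured_speedup_pct


# Figures quoted above. "Took 19% longer" is entered as a -19 speedup,
# which is this article's convention, not a formal definition.
believed_after_trial = 20.0
measured = -19.0

gap = perception_gap(believed_after_trial, measured)
print(f"perception gap: {gap:.0f} points")
```

The same function can be pointed at an internal pilot: feed it the survey average and the measured change in cycle time, and compare the two.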

The METR finding does not mean AI tools are useless. It means the question “are developers faster?” is the wrong question. The real question — the one the peer-reviewed evidence from 2026 answers — is: what kind of code are AI agents producing, and what does that code cost your codebase over time?

What the CMU Studies Actually Found

Two papers from Carnegie Mellon University’s STRUDEL group, both accepted at the 23rd International Conference on Mining Software Repositories (MSR ’26), provide the clearest causal evidence yet. Both use staggered difference-in-differences with matched controls, a widely used design for causal inference from observational data, and both converge on the same uncomfortable conclusion.
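For readers unfamiliar with the method, here is a minimal sketch of the basic idea behind difference-in-differences. It is the textbook two-group, two-period version, not the staggered, matched design the papers use, and the commit counts are made up for illustration.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences on group means.

    The control group's change stands in for what would have happened
    to the treated group without adoption (the parallel-trends assumption).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up monthly commit counts, for illustration only.
treated_pre = [40, 42, 38]    # adopters, before adoption
treated_post = [55, 57, 53]   # adopters, after adoption
control_pre = [41, 39, 40]    # matched non-adopters, same months
control_post = [44, 42, 43]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(f"estimated effect: {effect} commits per month")
```

Subtracting the control group's change removes trends that affected every repository in the same period, which is why a raw before/after comparison of adopters alone would overstate the effect here.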

Paper 1: “AI IDEs or Autonomous Agents?” (Agarwal et al.)

This study separates repositories into two groups: those that adopted AI agents as their first AI coding tool (“agent-first”), and those that already used AI IDEs like Copilot before adopting agents (“IDE-first”). The distinction matters because it isolates the effect of agents from the effect of AI tools generally.

| Outcome | Agent-First (% Change) | IDE-First (% Change) |
|---|---|---|
| Monthly Commits | +36.25% (p<0.001) | +3.06% (not significant) |
| Lines Added | +76.59% (p<0.001) | -6.34% (not significant) |
| Static Analysis Warnings | +17.73% (p<0.001) | +19.00% (not significant) |
| Code Complexity | +34.85% (p<0.001) | +42.87% (p<0.01) |

Source: Agarwal et al., MSR ’26

Agent-first repos saw a massive velocity spike: lines added jumped 77% and commits rose 36%. But complexity and static-analysis warnings rose right alongside them — and the IDE-first group, which showed no significant velocity gain, still saw complexity rise 43%. The headline finding: agent-induced complexity is not a side effect of shipping more code faster. It is an independent effect of the way agents write code.

The dynamic effects chart in the paper is even more revealing. Agent-first repos saw a +216% spike in lines added in the month of adoption, tapering to around +50% by month six. Complexity, by contrast, rose steadily: +21% at adoption to +49% by month five. Velocity gains faded; complexity debt did not.

Paper 2: “Speed at the Cost of Quality” (He et al.)

This study focuses on one tool — Cursor — across 806 treated repositories matched to 1,380 controls. The design is the same staggered DiD with the Borusyak imputation estimator, and the results mirror the first paper with finer granularity:

| Outcome | Cursor ATT (% Change) | Temporal Pattern |
|---|---|---|
| Lines Added | +28.58% | Transient: +281% in month 1, +48% in month 2, then non-significant |
| Static Analysis Warnings | +30.26% | Persistent across the entire observation window |
| Code Complexity | +41.64% | Persistent across the entire observation window |

Source: He et al., MSR ’26

The +281% spike in lines added during month one is the kind of number that gets shared in executive dashboards. What does not make it into the dashboard is what happens next. By month three, the velocity advantage is gone. The complexity and warnings are not. They persist for the entire observation window, and there is no evidence they ever decay.

The second paper goes further: it uses dynamic panel GMM (Arellano-Bond estimation) to test the causal relationship between complexity and future velocity. The result is stark:

A 100% increase in code complexity causes a ~65% decrease in future lines added. A 100% increase in static-analysis warnings causes a ~50% decrease.

This is the self-reinforcing cycle that surveys never capture: agents ship complex code → that complexity slows future work → teams compensate by using more agents → more complex code. The initial velocity is borrowed from future maintainability.
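To get a feel for the size of that drag, the sketch below converts the "100% increase leads to a ~65% decrease" figure into an elasticity and applies it to a smaller increase. The constant-elasticity (log-log) form is an assumption made here for illustration, not a functional form this article attributes to the paper, and the 50% input is hypothetical.

```python
import math


def elasticity_from_doubling(decrease_when_doubled: float) -> float:
    """Elasticity beta such that doubling x multiplies y by (1 - decrease).

    Assumes y is proportional to x**beta (an illustrative assumption).
    """
    return math.log(1.0 - decrease_when_doubled) / math.log(2.0)


def implied_change(increase: float, beta: float) -> float:
    """Fractional change in y for a fractional increase in x, under y ~ x**beta."""
    return (1.0 + increase) ** beta - 1.0


# ~65% drop in future lines added per doubling of complexity (figure quoted above).
beta = elasticity_from_doubling(0.65)

# Hypothetical: what a 50% rise in complexity would imply under the same assumption.
change = implied_change(0.50, beta)
print(f"elasticity: {beta:.2f}")
print(f"implied change in future lines added: {change:.1%}")
```

The output is an extrapolation under a stated assumption, useful for rough planning but not a result from either study.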

Why Agent-Generated Code Is Inherently More Complex

The He et al. paper controls for the possibility that complexity rises simply because more code is being written. After controlling for velocity, Cursor adoption still causes a 9% baseline increase in code complexity. AI-generated code is not just more code — it is structurally more complex code per line.

Multiple mechanisms explain this. Agents optimize for completing the immediate task, not for the long-term structure of the codebase they are modifying. An agent ingests a user prompt, generates a solution, and moves on. It has no concept of the architecture’s evolutionary trajectory, no understanding of which abstractions the team plans to consolidate, and no memory of the tacit design decisions that live in the team’s shared mental model. Without a persistent understanding of the codebase’s history and direction, every agent contribution is a local optimum that degrades the global structure.

Addy Osmani captured the corollary problem: agent reasoning is discarded after code generation. A human reviewer must reconstruct intent that was never written down — and agents tend to “ghost” the moment they receive subjective feedback. A companion 2026 paper found that reviewer abandonment accounted for 38% of rejected agent PRs (Addy Osmani, Elevate, June 2026). The agent writes code it cannot defend.

The Review Bottleneck Becomes the New Ceiling

The Faros AI study of 22,000 developers across 4,000 teams (March 2026) documents what happens when agent-scaled production meets human-scaled review:

| Metric | Change After AI Adoption |
|---|---|
| Code Churn | +861% |
| Median Review Duration | +441.5% |
| Incidents per PR | +242.7% |
| PRs Merged with Zero Review | +31.3% |

Source: Faros AI, 22K developers, 4K teams (March 2026)

The last row is the most alarming. A 31% increase in PRs merged without review suggests that teams, overwhelmed by the review backlog, are bypassing quality gates entirely. This is less a process failure than a capacity failure. Human review throughput is roughly fixed (about 200–400 lines of code per hour for a thorough review), while agent output scales with however many agents a team runs. The gap between the two curves is where quality falls through.
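A back-of-the-envelope version of that capacity gap, using the 200–400 lines-per-hour range quoted above. The weekly output figure and the function names are hypothetical.

```python
def review_hours(lines_changed: int, lines_per_hour: float) -> float:
    """Hours of thorough review needed for a given volume of changed lines."""
    return lines_changed / lines_per_hour


def review_hours_range(lines_changed: int, slow: float = 200.0, fast: float = 400.0):
    """(best case, worst case) review hours for the quoted reading-speed range."""
    return (review_hours(lines_changed, fast), review_hours(lines_changed, slow))


# Hypothetical: a team's agents produce 12,000 changed lines in a week.
weekly_agent_output = 12_000
low, high = review_hours_range(weekly_agent_output)
print(f"thorough review needs {low:.0f} to {high:.0f} reviewer-hours per week")
```

Comparing that range to the review hours a team actually has available gives a quick read on whether zero-review merges are likely to creep in.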

CodeRabbit’s analysis of 470 open-source PRs confirms the stakes: AI-coauthored changes contain 1.7x more issues overall, with logic and correctness problems up 75% and security issues 1.5–2x more common compared to human-authored PRs (CodeRabbit, Dec 2025, via Addy Osmani analysis).

When It Works: The Conditions for Sustainable Agent Adoption

None of this means coding agents are a bad investment. The Anthropic 2026 Agentic Coding Trends Report documents genuine wins at scale: Rakuten engineers used Claude Code to implement a feature on the 12.5-million-line vLLM codebase in 7 hours of autonomous work with 99.9% numerical accuracy. CRED, a fintech with 15 million users, doubled execution speed by shifting developers to higher-value work rather than eliminating headcount. TELUS created 13,000 custom AI solutions and shipped code 30% faster, saving over 500,000 engineering hours (Anthropic 2026 Agentic Coding Trends Report).

Three patterns distinguish these success stories from the CMU study populations:

  1. Strong engineering fundamentals. DORA’s 2025 survey of ~5,000 professionals found that AI is an amplifier, not a fix: teams with CI/CD, small batches, real testing, and clear platform abstractions saw AI amplify throughput while holding stability. Teams with weak practices saw incidents per PR triple and review time double. As DORA lead Nathen Harvey put it: “You might be listening to a high school band full of amateurs, and you just made it a lot louder.”

  2. Quality infrastructure scaled alongside agent adoption. The successful organizations did not add agents and hope quality somehow held. They added automated review gates, circuit-breaker triage for agent PRs, tiered review policies by risk level, and monitoring for complexity metrics — the same infrastructure the CMU papers recommend as “first-class citizens” in agent tooling design.

  3. Selective deployment, not wholesale replacement. Anthropic’s own data shows that developers use AI in roughly 60% of their work but can fully delegate only 0–20% of tasks. The teams that succeed treat agents as high-throughput contributors that require active supervision, not as replacements for engineering judgment.
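The tiered review policy mentioned in the second pattern can be written down as a small lookup table. The sketch below borrows its three tiers from the framework described in the playbook that follows; the path patterns, check names, and function names are hypothetical and would need to match a real repository layout.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class ReviewTier:
    name: str
    required_checks: tuple


LOW = ReviewTier("low", ("lint", "human-glance"))
MEDIUM = ReviewTier("medium", ("tests", "ai-reviewer"))
HIGH = ReviewTier(
    "high",
    ("types", "tests", "ai-reviewer-a", "ai-reviewer-b", "human-owner", "security-pass"),
)

# Hypothetical path patterns; the first match wins, so high-risk rules come first.
RULES = [
    ("src/core/*", HIGH),
    ("src/auth/*", HIGH),
    ("*.yaml", LOW),
    ("docs/*", LOW),
]
_RANK = {"low": 0, "medium": 1, "high": 2}


def tier_for_path(path: str) -> ReviewTier:
    """Tier for one changed file; anything unmatched defaults to medium."""
    for pattern, tier in RULES:
        if fnmatch(path, pattern):
            return tier
    return MEDIUM


def tier_for_pr(paths) -> ReviewTier:
    """A PR is reviewed at the tier of its riskiest changed file."""
    return max((tier_for_path(p) for p in paths), key=lambda t: _RANK[t.name], default=MEDIUM)


print(tier_for_pr(["docs/readme.md", "src/core/ledger.py"]).required_checks)
```

Defaulting unmatched paths to the medium tier is a deliberate choice in this sketch: new directories get tests and a reviewer until someone classifies them.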

A Practical Playbook for Engineering Leaders

The evidence points to a specific set of actions, not a general “use agents” or “don’t use agents” conclusion:

1. Baseline your metrics before provisioning any licenses. Measure current cycle time, review duration, change failure rate, defect density, and code complexity (SonarQube cognitive complexity is a reasonable default). Run a 30–60 day pilot on one team. Compare what developers say to what the data shows; without a baseline, the perception gap METR documented will dominate any post-rollout debate.

2. Tier your review policy by blast radius. Addy Osmani’s framework is the most practical available: config changes and boilerplate get lint + glance; medium-risk changes get tests + one AI reviewer; high-risk core logic gets types + tests + two AI reviewers + human owner + security pass. The tiered approach acknowledges that you cannot review everything at the same depth.

3. Do not trust a single agent to review agents. The CodeRabbit analysis found that 93.4% of flagged issues across four different AI review tools were caught by exactly one tool; almost none overlapped. Run two different AI reviewers built on different underlying models rather than multiple passes of the same model.

4. Surface complexity metrics in agent feedback loops. The CMU papers explicitly recommend that maintainability metrics be surfaced directly in agent planning and prompting. If your agent tooling does not expose complexity deltas per PR, it is flying blind. Push your vendor to provide this or build it yourself.
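If a vendor does not expose complexity deltas, a rough version is easy to build for Python code. The sketch below counts branch points with the standard-library `ast` module as a crude stand-in for cyclomatic complexity; it is not SonarQube's cognitive complexity, and the function names and sample snippets are hypothetical.

```python
import ast

# Node types counted as decision points in this rough proxy.
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler,
    ast.BoolOp, ast.IfExp, ast.comprehension,
)


def rough_complexity(source: str) -> int:
    """Decision points + 1: a rough proxy for cyclomatic complexity."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, _BRANCH_NODES) for node in ast.walk(tree))


def complexity_delta(before: str, after: str) -> int:
    """Change in rough complexity between two versions of a file."""
    return rough_complexity(after) - rough_complexity(before)


before = """
def price(qty):
    return qty * 10
"""

after = """
def price(qty, vip=False):
    if qty > 100 and vip:
        return qty * 8
    for tier in (50, 20):
        if qty > tier:
            return qty * 9
    return qty * 10
"""

print(f"complexity delta for this change: {complexity_delta(before, after):+d}")
```

Run over the before and after versions of each file in a PR, a number like this can be posted as a review comment or fed back into the agent's prompt, which is the feedback loop the papers argue for.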

The Trade-Off Is Real — and Manageable

The 2026 evidence establishes an empirical fact that the 2024 and 2025 hype cycle papered over: AI coding agents produce a persistent increase in code complexity that compounds over time, and the velocity gains that make the headlines are front-loaded and transient. The CMU findings, the METR RCT, the Faros AI data, and Addy Osmani’s synthesis all point in the same direction.

But the same evidence also shows that the trade-off is manageable. The organizations that succeed are not those that adopt the most advanced agents or the most expensive enterprise licenses. They are the ones that invest in the boring stuff: CI/CD pipelines, automated review tooling, complexity monitoring, tiered review policies, and the human judgment to know when to trust generated code and when to dig in.

The companies that treat AI tools as a shortcut around engineering discipline will learn the lesson the CMU data already teaches: you cannot outrun technical debt by generating it faster.
