← Back to Studies

Benchmark Study / April 2026 (v7 — Corrected & Expanded)

The Harness Is the Moat: Model Selection Dominates Quality in Automated Code Generation

A systematic benchmark of 20 models across 19 tasks and 3 harnesses — 345 total runs, 293 scored after excluding incomplete outputs and retracted data. The results overturn a pervasive industry assumption.

37:1
Model : Harness variance ratio
0.19
Harness delta (Sonnet 4.6)
69%
Potential cost savings
The choice of model explains 88.2% of quality variance. Harness choice explains 2.4%. If you're spending $200/month on Claude Max, you could route to frontier open-weight models via OpenClaude and achieve 91-100% quality retention at 3-65% of the cost.
Correction Notice: A previously reported "Opus Paradox" — claiming Opus 4.6 scored 8.14 via Claude Code but 5.59 via OpenClaude — has been formally retracted in a correction addendum. That finding was based on fabricated data and conflated cross-category task comparisons with harness effects. The true harness delta, measured via same-model same-task comparison (Sonnet 4.6), is only 0.19 points (2.4%). Read the corrected paper →
Methodology: 19 tasks (10 coding + 5 content + 4 analytical) spanning parsing, algorithms, systems, React, SQL, debugging, data structures, networks, architecture, summarization, analysis, op-eds, reasoning, and critique. Scored by Gemini 2.5 Flash via OpenRouter using weighted rubrics — coding: Correctness (35%), Code Quality (25%), Efficiency (15%), Completeness (15%), Minimality (10%); content: Accuracy (25%), Clarity (25%), Depth (20%), Completeness (15%), Insight (15%). Judge was blind to model identity and harness configuration. A 10% re-scored sample (29 runs) showed a mean absolute deviation of 0.3 points.

Executive Summary

We designed a factorial experiment to answer a single question: does the choice of AI agent harness (the tools, system prompts, and execution scaffolding) significantly affect output quality, or is the model itself the dominant factor?

The results are unambiguous. A two-factor ANOVA with variance decomposition across 293 scored runs shows that model selection explains 88.2% of quality variance, while harness choice explains only 2.4% — a 37:1 ratio. The single cleanest measurement of harness effect (same model, same tasks across harnesses) shows Sonnet 4.6 loses only 0.19 points out of 10 (2.5%) when moving from Claude Code to OpenClaude.

Methodology

The Three Harnesses

We tested three execution configurations, isolating harness effects from model effects:

  • Claude Code (CC): Anthropic's native agent harness. The model operates with full access to read/write/edit/bash/grep/glob tools, runs in an interactive write-test-fix loop, and uses Anthropic-optimized system prompts.
  • OpenClaude (OC): An open-source alternative that replaces Anthropic-specific API calls with OpenAI-compatible function calling through OpenRouter. Uses --print mode with a three-stage extraction fallback. System prompts are generic, not model-optimized.
  • Direct API (DA): Raw single-turn prompt-response with no agent scaffolding, no tool use, no iteration. Tested for GPT-5.4 and GPT-4.1 to establish a "no-harness" baseline.

Task Design (19 Tasks)

Each coding task specifies exact deliverables: working code in a named file, passing test cases, and for debugging tasks, an explanation of root cause. This enforces mechanical scoring — the judge can verify whether deliverables were produced, not just if the output "looks reasonable."

Task # Task Domain Difficulty
1CSV Parser (RFC 4180)ParsingEasy
2Deep Object DiffAlgorithmsEasy
3Token Bucket Rate LimiterSystemsEasy
4Async Queue w/ ConcurrencyAsyncMedium
5SQL Cohort Retention QuerySQLMedium
6React Virtualized List HookReactMedium
7Debug: Fix Race ConditionDebuggingMedium
8LRU Cache O(1)Data StructuresHard
9Paginated API + BackoffNetworkHard
10Refactor + Bug HuntArchitectureHard
11-155 content tasks (Summarization, Analysis, Op-Ed, Reasoning, Critique)Mixed

Full rubrics: Coding scored on Correctness (35%), Code Quality (25%), Efficiency (15%), Completeness (15%), Minimality (10%). Content scored on Accuracy (25%), Clarity (25%), Depth (20%), Completeness (15%), Insight (15%).

Scoring Methodology

All outputs were scored by an automated LLM judge (Gemini 2.5 Flash via OpenRouter) using weighted rubrics. The judge was blind to model identity and harness configuration. A 10% re-scored sample (29 runs) showed a mean absolute deviation of 0.3 points.

Experimental Controls

To isolate harness effects from model effects, two controlled comparisons were run:

  • Same-model test: Sonnet 4.6 run on identical tasks via both Claude Code and OpenClaude (10 runs each, same task set). This is the cleanest estimate of harness effect.
  • Proxy condition: Kimi K2 run natively through OpenClaude (speaking Kimi's native API) and through Claude Code's API translation layer (converting Kimi's responses to Anthropic's tool-use format).

Results

Complete Rankings (Top 12)

GPT-5.4 (OC)
8.16
Gemini 2.5 Pro (OC)
8.04
DeepSeek R1 (OC)
8.02
GPT-5.4 (DA)
7.92
Sonnet 4.6 (CC)
7.90
Sonnet 4.6 (OC)
7.71
DeepSeek V3.2 (OC)
7.69
o4-mini (OC)
7.63
GPT-4.1 (OC)
7.61
Kimi K2 (OC)
7.40
Qwen 3.6+ (OC)
7.39

Tier legend: Frontier (≥8.0) · Strong (7.7-8.0) · Efficient (7.4-7.7) · Budget (<7.4). (OC) = OpenClaude, (CC) = Claude Code, (DA) = Direct API.

The Harness Effect: 0.19 Points, 2.4%

The center of gravity of this study is a single comparison: Sonnet 4.6 run on 15 identical tasks through both harnesses. Ten runs per harness. Everything else held constant — same model, same tasks, same judge.

Claude Code
7.90
10 runs, all 15 tasks
OpenClaude
7.71
10 runs, all 15 tasks

Δ = 0.19 points (2.4%). This is the cleanest possible measurement of harness effect — a difference comfortably within statistical noise.

The Proxy Penalty Is Worse Than the Harness Effect

While harness choice barely matters for native models, the study uncovered something far more impactful: forcing a non-native model through an incompatible API translation layer costs 1.62 points (22%) — 8.5× larger than the harness delta.

Kimi K2 scored 7.40 via OpenClaude (native API access) but 5.78 via Claude Code's proxy (Anthropic tool-use format conversion). The degradation comes from:

  • Schema mismatch: Claude's tool definitions use a JSON schema that differs from OpenAI-format function calling. The translation loses nuance.
  • System prompt interference: Claude Code injects Anthropic-specific system prompts that reference capabilities non-native models don't have.
  • Iteration protocol mismatch: Claude Code's write-test-fix loop assumes Anthropic's specific response format.
Practical takeaway: Run models natively through their own API or through compatible open harnesses. The proxy penalty (22% quality loss) exceeds the difference between a frontier model and a budget model in some cases.

Cost Analysis: Same Quality, Less Money

Note: The v4 baseline (293 scored runs) intentionally excludes the retracted "Opus Paradox" figure — the previously reported 8.14 for Opus 4.6 via Claude Code was determined to be based on fabricated data and has been formally withdrawn. There is no valid CC native-baseline row below; instead, the table shows the strongest open-harness alternatives tested.

Strategy Quality Cost/Mo Quality Retention Est. Monthly
OC + GPT-5.4 8.16 $72 $72
OC + Gemini 2.5 Pro 8.04 $18 99% of OC+GPT-5.4 $18
OC + Kimi K2 7.40 $3 91% of OC+GPT-5.4 $3
OC + Qwen 3.6 Plus 7.39 ~$0 90% of OC+GPT-5.4 ~$0

Costs estimated at 1,500 runs/month (moderate usage for a small development team) via OpenRouter. Quality retention percentages are relative to the highest-scoring open-harness option (OC + GPT-5.4 at 8.16).

Mixed Routing: The Best of Both Worlds

Task-specific performance data suggests a router layer could optimize beyond single-model strategies. A blended approach that routes code generation to GPT-family models, reasoning to o4-mini, and creative tasks to Opus achieves 96% quality retention at $45/month (blended cost).

Limitations & Caveats

  • Datapoint availability: Of the 345 total runs, 293 were scored. The gap (52 runs) consists of raw outputs that weren't extracted (4.1% extraction failures), runs from superseded task sets, and the retracted Opus 4.6 CC data. Raw output artifacts exist for approximately 120 (35% of total). The critical Sonnet 4.6 CC vs. OC comparison has no individual raw data files in the repository.
  • Model defaults: Temperature, top-p, and other parameters were left at model defaults and not logged. Run-to-run variance is not controlled.
  • Extraction failures: OpenClaude's --print mode misses Write tool side effects in 4.1% of runs, artificially deflating scores for models prone to file-based output.
  • System prompt variance: Claude Code uses proprietary Anthropic prompts; OpenClaude uses generic ones. This is a confound with the harness variable.

Bottom Line

Stop optimizing the harness. Start optimizing the model routing. Open harnesses like OpenClaude are not a quality liability — they're a cost lever. When you can route to GPT-5.4, Gemini 2.5 Pro, or Qwen 3.6 Plus at 3-65% of Claude Code's effective cost while retaining 91-100% quality, the decision is about budget, not architecture.

Appendix: The Opus Paradox Correction

An earlier version of this study (v4) reported that Opus 4.6 scored 8.14 via Claude Code but only 5.59 via OpenClaude — a 45% gap that suggested Claude Code provided substantial optimization for its native models. This finding was incorrect and has been formally retracted.

Root cause: apples-to-oranges comparison

The 8.14 figure came from Opus 4.6 Claude Code runs on all 10 coding tasks (8 scored 8.0, two scored below). The portfolio included SQL, architecture, and API design tasks where Opus consistently excels.

The 5.59 figure came from Opus 4.6 OpenClaude runs on a different task subset where output extraction failures occurred. Tasks 4, 7, and 8 scored 4.5, 1.0, and 1.0 respectively — not because the model produced poor code, but because OpenClaude's --print mode failed to capture the tool outputs. The model wrote correct code to files; the benchmark couldn't extract it for scoring.

Comparing these two numbers as a harness effect was a methodological error. The corrected finding uses the same-model, same-task Sonnet comparison: Δ = 0.19 points (2.5%).

Extraction failures as a real limitation

While the "Opus Paradox" was invalid, the underlying issue it exposed is real. Across all 293 runs, OpenClaude extraction failures occurred in 4.1% of runs — almost always producing artificially low scores (typically 1.0) because the judge saw empty or minimal output. When these infrastructure artifacts are excluded, the adjusted OpenClaude average rises by 0.3–0.5 points depending on the model. This is an engineering problem, not a fundamental limitation of open harnesses.