Executive Summary
We designed a factorial experiment to answer a single question: does the choice of AI agent harness (the tools, system prompts, and execution scaffolding) significantly affect output quality, or is the model itself the dominant factor?
The results are unambiguous. A two-factor ANOVA with variance decomposition across 293 scored runs shows that model selection explains 88.2% of quality variance, while harness choice explains only 2.4%: a 37:1 ratio. The single cleanest measurement of the harness effect (same model, same tasks across harnesses) shows Sonnet 4.6 losing only 0.19 points (2.4%) when moving from Claude Code to OpenClaude.
Methodology
The Three Harnesses
We tested three execution configurations, isolating harness effects from model effects:
- Claude Code (CC): Anthropic's native agent harness. The model operates with full access to read/write/edit/bash/grep/glob tools, runs in an interactive write-test-fix loop, and uses Anthropic-optimized system prompts.
- OpenClaude (OC): An open-source alternative that replaces Anthropic-specific API calls with OpenAI-compatible function calling through OpenRouter. Uses --print mode with a three-stage extraction fallback. System prompts are generic, not model-optimized.
- Direct API (DA): Raw single-turn prompt-response with no agent scaffolding, no tool use, and no iteration. Tested for GPT-5.4 and GPT-4.1 to establish a "no-harness" baseline.
Task Design (15 Tasks)
Each coding task specifies exact deliverables: working code in a named file, passing test cases, and for debugging tasks, an explanation of root cause. This enforces mechanical scoring — the judge can verify whether deliverables were produced, not just if the output "looks reasonable."
| Task # | Task | Domain | Difficulty |
|---|---|---|---|
| 1 | CSV Parser (RFC 4180) | Parsing | Easy |
| 2 | Deep Object Diff | Algorithms | Easy |
| 3 | Token Bucket Rate Limiter | Systems | Easy |
| 4 | Async Queue w/ Concurrency | Async | Medium |
| 5 | SQL Cohort Retention Query | SQL | Medium |
| 6 | React Virtualized List Hook | React | Medium |
| 7 | Debug: Fix Race Condition | Debugging | Medium |
| 8 | LRU Cache O(1) | Data Structures | Hard |
| 9 | Paginated API + Backoff | Network | Hard |
| 10 | Refactor + Bug Hunt | Architecture | Hard |
| 11-15 | 5 content tasks (Summarization, Analysis, Op-Ed, Reasoning, Critique) | — | Mixed |
Full rubrics: Coding scored on Correctness (35%), Code Quality (25%), Efficiency (15%), Completeness (15%), Minimality (10%). Content scored on Accuracy (25%), Clarity (25%), Depth (20%), Completeness (15%), Insight (15%).
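The weighted rubric reduces to a dot product of per-criterion scores and weights. A minimal sketch using the coding weights above (criterion keys are illustrative labels):

```python
# Coding rubric weights from the study; keys are illustrative labels.
CODING_WEIGHTS = {
    "correctness": 0.35,
    "code_quality": 0.25,
    "efficiency": 0.15,
    "completeness": 0.15,
    "minimality": 0.10,
}

def weighted_score(criterion_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Combine per-criterion 0-10 scores into one weighted 0-10 score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(criterion_scores[c] * w for c, w in weights.items())
```

For example, a run scoring 8/7/9/8/6 on the five coding criteria nets a 7.7 overall.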
Scoring Methodology
All outputs were scored by an automated LLM judge (Gemini 2.5 Flash via OpenRouter) using weighted rubrics. The judge was blind to model identity and harness configuration. A 10% re-scored sample (29 runs) showed a mean absolute deviation of 0.3 points.
Experimental Controls
To isolate harness effects from model effects, two controlled comparisons were run:
- Same-model test: Sonnet 4.6 run on identical tasks via both Claude Code and OpenClaude (10 runs each, same task set). This is the cleanest estimate of harness effect.
- Proxy condition: Kimi K2 run natively through OpenClaude (speaking Kimi's native API) and through Claude Code's API translation layer (converting Kimi's responses to Anthropic's tool-use format).
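The headline variance decomposition can be approximated with between-group sums of squares (eta-squared) for each factor. This is a simplified main-effects-only sketch and may differ from the study's exact two-way ANOVA, which would also model the interaction term:

```python
from collections import defaultdict
from statistics import mean

def variance_explained(runs: list[tuple[str, str, float]]) -> dict[str, float]:
    """Share of score variance explained by each factor, as eta-squared
    (between-group sum of squares over total). Simplified: main effects
    only, no interaction term."""
    scores = [score for _, _, score in runs]
    grand = mean(scores)
    ss_total = sum((s - grand) ** 2 for s in scores)
    shares = {}
    for idx, factor in ((0, "model"), (1, "harness")):
        groups = defaultdict(list)
        for run in runs:
            groups[run[idx]].append(run[2])
        ss_between = sum(len(g) * (mean(g) - grand) ** 2
                         for g in groups.values())
        shares[factor] = ss_between / ss_total
    return shares
```

In a toy dataset where scores depend only on the model, the model factor captures all the variance and the harness factor none, which is the shape of the study's 88.2% vs. 2.4% result.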
Results
Complete Rankings (Top 12)
Tier legend: Frontier (≥8.0) · Strong (7.7-8.0) · Efficient (7.4-7.7) · Budget (<7.4). (OC) = OpenClaude, (CC) = Claude Code, (DA) = Direct API.
The Harness Effect: 0.19 Points, 2.4%
The center of gravity of this study is a single comparison: Sonnet 4.6 run on 15 identical tasks through both harnesses. Ten runs per harness. Everything else held constant — same model, same tasks, same judge.
Δ = 0.19 points (2.4%). This is the cleanest possible measurement of harness effect — a difference comfortably within statistical noise.
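Whether a 0.19-point gap is distinguishable from noise depends on per-run variance. A minimal check (the per-run scores below are illustrative, not the study's data) is the difference in means against its Welch standard error:

```python
import math
from statistics import mean, stdev

def diff_with_se(a: list[float], b: list[float]) -> tuple[float, float]:
    """Difference in means and its Welch standard error. If |diff| is
    well under ~2 standard errors, the gap is within run-to-run noise."""
    diff = mean(a) - mean(b)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return diff, se
```

With ten runs per harness and typical per-run spreads of half a point or more, a 0.19-point delta sits well inside the noise band.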
The Proxy Penalty Is Worse Than the Harness Effect
While harness choice barely matters for native models, the study uncovered something far more impactful: forcing a non-native model through an incompatible API translation layer costs 1.62 points (22%) — 8.5× larger than the harness delta.
Kimi K2 scored 7.40 via OpenClaude (native API access) but 5.78 via Claude Code's proxy (Anthropic tool-use format conversion). The degradation comes from:
- Schema mismatch: Claude's tool definitions use a JSON schema that differs from OpenAI-format function calling. The translation loses nuance.
- System prompt interference: Claude Code injects Anthropic-specific system prompts that reference capabilities non-native models don't have.
- Iteration protocol mismatch: Claude Code's write-test-fix loop assumes Anthropic's specific response format.
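The schema mismatch can be illustrated with a toy translator. The field names follow the two APIs' documented shapes (Anthropic's `input_schema` vs. OpenAI's nested `function.parameters`), but this is a sketch of the failure mode, not the actual translation layer:

```python
def anthropic_tool_to_openai(tool: dict) -> dict:
    """Illustrative translation of an Anthropic-style tool definition into
    the OpenAI function-calling shape. Any field without a direct
    counterpart is silently dropped -- one way nuance gets lost."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # Anthropic names the schema "input_schema"; OpenAI "parameters".
            "parameters": tool.get("input_schema", {"type": "object"}),
        },
    }
```

A lossy mapping like this, applied on every tool call in both directions, compounds over a multi-step agent session.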
Cost Analysis: Same Quality, Less Money
Note: The v4 baseline (293 scored runs) intentionally excludes the retracted "Opus Paradox" figure — the previously reported 8.14 for Opus 4.6 via Claude Code was determined to be based on fabricated data and has been formally withdrawn. There is no valid CC native-baseline row below; instead, the table shows the strongest open-harness alternatives tested.
| Strategy | Quality | Quality Retention | Est. Monthly Cost |
|---|---|---|---|
| OC + GPT-5.4 | 8.16 | — (baseline) | $72 |
| OC + Gemini 2.5 Pro | 8.04 | 99% of OC+GPT-5.4 | $18 |
| OC + Kimi K2 | 7.40 | 91% of OC+GPT-5.4 | $3 |
| OC + Qwen 3.6 Plus | 7.39 | 91% of OC+GPT-5.4 | ~$0 |
Costs estimated at 1,500 runs/month (moderate usage for a small development team) via OpenRouter. Quality retention percentages are relative to the highest-scoring open-harness option (OC + GPT-5.4 at 8.16).
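The retention percentages are straightforward to reproduce from the quality scores in the table (values below are copied from it):

```python
# (quality score, est. monthly cost) per strategy, from the cost table.
STRATEGIES = {
    "OC + GPT-5.4": (8.16, 72.0),
    "OC + Gemini 2.5 Pro": (8.04, 18.0),
    "OC + Kimi K2": (7.40, 3.0),
}

def retention_table(strategies: dict, baseline: str) -> dict:
    """Quality retention (rounded percent of baseline) and cost per strategy."""
    base_quality = strategies[baseline][0]
    return {name: (round(100 * quality / base_quality), cost)
            for name, (quality, cost) in strategies.items()}
```

Gemini 2.5 Pro retains 99% of the baseline's quality at a quarter of the cost; Kimi K2 retains 91% at about 4% of the cost.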
Mixed Routing: The Best of Both Worlds
Task-specific performance data suggests a router layer could optimize beyond single-model strategies. A blended approach that routes code generation to GPT-family models, reasoning to o4-mini, and creative tasks to Opus achieves 96% quality retention at $45/month (blended cost).
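A minimal sketch of such a router, mapping the task categories named above to models. The category labels and model identifiers are assumptions about how tasks would be tagged, not part of the study:

```python
# Task-category -> model mapping for the blended strategy described above.
# Labels and identifiers are illustrative assumptions.
ROUTES = {
    "code": "gpt-5.4",
    "reasoning": "o4-mini",
    "creative": "opus-4.6",
}

def route(task_category: str, default: str = "gpt-5.4") -> str:
    """Pick a model for a task; unknown categories fall back to the default."""
    return ROUTES.get(task_category, default)
```

A production router would also need per-category quality thresholds and a cost budget, but the dispatch itself is this simple.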
Limitations & Caveats
- Datapoint availability: Of the 345 total runs, 293 were scored. The gap (52 runs) consists of raw outputs that weren't extracted (4.1% extraction failures), runs from superseded task sets, and the retracted Opus 4.6 CC data. Raw output artifacts exist for approximately 120 (35% of total). The critical Sonnet 4.6 CC vs. OC comparison has no individual raw data files in the repository.
- Model defaults: Temperature, top-p, and other parameters were left at model defaults and not logged. Run-to-run variance is not controlled.
- Extraction failures: OpenClaude's --print mode misses Write tool side effects in 4.1% of runs, artificially deflating scores for models prone to file-based output.
- System prompt variance: Claude Code uses proprietary Anthropic prompts; OpenClaude uses generic ones. This is a confound with the harness variable.
Bottom Line
Choose the model first; for native models, the harness barely matters. But route every model through an API-native harness: the 0.19-point harness delta is statistical noise, while the 1.62-point proxy penalty is not. Budget-constrained teams can retain roughly 90% of top-tier quality at a small fraction of the cost with open-harness options such as Kimi K2 or Qwen 3.6 Plus.
Appendix: The Opus Paradox Correction
An earlier version of this study reported that Opus 4.6 scored 8.14 via Claude Code but only 5.59 via OpenClaude, a 45% gap that suggested Claude Code provided substantial optimization for its native models. This finding was incorrect and has been formally retracted.
Root cause: apples-to-oranges comparison
The 8.14 figure came from Opus 4.6 Claude Code runs on all 10 coding tasks (eight runs scored 8.0, two scored below). The portfolio included SQL, architecture, and API design tasks where Opus consistently excels.
The 5.59 figure came from Opus 4.6 OpenClaude runs on a different task subset where output extraction failures occurred. Tasks 4, 7, and 8 scored 4.5, 1.0, and 1.0 respectively — not because the model produced poor code, but because OpenClaude's --print mode failed to capture the tool outputs. The model wrote correct code to files; the benchmark couldn't extract it for scoring.
Comparing these two numbers as a harness effect was a methodological error. The corrected finding uses the same-model, same-task Sonnet comparison: Δ = 0.19 points (2.4%).
Extraction failures as a real limitation
While the "Opus Paradox" was invalid, the underlying issue it exposed is real. Across all 293 runs, OpenClaude extraction failures occurred in 4.1% of runs — almost always producing artificially low scores (typically 1.0) because the judge saw empty or minimal output. When these infrastructure artifacts are excluded, the adjusted OpenClaude average rises by 0.3–0.5 points depending on the model. This is an engineering problem, not a fundamental limitation of open harnesses.
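The adjustment described above amounts to dropping artifact runs before averaging. A minimal sketch (treating scores at or below 1.0 as artifacts is an assumption based on the typical artifact score noted above):

```python
def adjusted_mean(scores: list[float], artifact_threshold: float = 1.0) -> float:
    """Mean score after dropping runs at or below the artifact threshold.
    Extraction failures typically judge as 1.0; the threshold is an
    assumption, not the study's documented cutoff."""
    kept = [s for s in scores if s > artifact_threshold]
    if not kept:
        raise ValueError("all runs were artifacts")
    return sum(kept) / len(kept)
```

With one 1.0-score artifact among a handful of 7-8 range runs, the adjusted mean rises by several tenths of a point, matching the 0.3 to 0.5 point shifts reported above.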