Benchmark Study Versions

Complete version history of the OpenClaude Benchmark Study. All versions linked below.

Study Versions

v5 (Current) · 5 pages · April 7, 2026

The Harness Is the Moat (v5 — Corrected)

Corrected findings: Sonnet via Claude Code (7.90) vs OpenClaude (7.71) shows Δ = 0.19 points (2.4%) — within statistical noise. Opus comparison invalidated due to task-type mismatch. Model selection produces 37× more variance than harness choice (7.04 vs 0.19 points). IEEE two-column format. Includes Section IV: Correcting the Opus Paradox and formal erratum. 293 scored runs out of 345 total.

v4 (Superseded) · April 6, 2026

The Harness Is the Moat (v4)

First corrected release. Superseded by v5, which adds IEEE formatting, a formal erratum, and an expanded methodology section. The v4 URL now redirects to v5 content.

v3 · 3 pages · April 6, 2026

Overview Brief (v3)

Short-form summary of key findings. 19 tasks, 19 models. 362 total runs before final corrections.

v2 (Preliminary) · April 6, 2026

Early Results Brief (v2)

Preliminary findings covering production-task results only; the data were incomplete because the coding tasks had not yet been added.

v1 · 3 pages · April 5, 2026

Initial Release (v1)

First public release. Summary of production task findings (TikTok scripts, blog posts) with 10 budget/free models.

Corrections Log

April 7, 2026 — v5 Release

Regenerated study as IEEE two-column PDF with corrected data throughout. Added Section IV "Correcting the Opus Paradox" explaining the methodology error in detail. Added formal erratum. All site references updated from v4 to v5. The v4 PDF URL now serves v5 content for backward compatibility.

April 6, 2026 — v4 Correction

An earlier analysis compared Opus 4.6 across harnesses and found a large apparent gap. This comparison was invalid: the Claude Code Opus runs were primarily coding tasks, while the OpenClaude Opus runs disproportionately included harder production tasks (IEEE paper generation, study pre-registration). Different task types, different difficulty levels.

The valid harness comparison uses Sonnet 4.6 with identical task assignments: 7.90 via Claude Code vs 7.71 via OpenClaude (Δ = 0.19 points, 2.4%) — within statistical noise.

Corrected conclusion: OpenClaude's architecture is not the source of quality differences. Model selection (a 7.04-point spread, 37× the 0.19-point harness gap) matters far more than harness choice.
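For readers checking the figures, the headline numbers reduce to simple arithmetic. Below is a minimal Python sketch using the scores quoted above; it assumes the 2.4% figure is computed relative to the Claude Code score (0.19 / 7.90), which the study text does not state explicitly.

```python
# Arithmetic behind the v5 headline comparison (scores from the summary above).
sonnet_claude_code = 7.90  # mean score: Sonnet 4.6 via Claude Code
sonnet_openclaude = 7.71   # mean score: Sonnet 4.6 via OpenClaude
model_spread = 7.04        # point spread attributed to model selection

harness_gap = sonnet_claude_code - sonnet_openclaude  # 0.19 points
relative_gap = harness_gap / sonnet_claude_code       # assumed baseline: 7.90
variance_ratio = model_spread / harness_gap           # ~37

print(f"harness gap: {harness_gap:.2f} points ({relative_gap:.1%})")
print(f"model selection vs harness: {variance_ratio:.0f}x")
```

Run as-is, this prints "0.19 points (2.4%)" and "37x", matching the v5 figures.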