Benchmark Study Versions

Complete version history of the OpenClaude Benchmark Study. All versions linked below.

Study Versions

v5 (Current) · 5 pages · April 7, 2026

The Harness Is the Moat (v5 — Corrected)

Corrected findings: Sonnet via Claude Code (7.90) vs OpenClaude (7.71) shows Δ = 0.19 points (2.4%) — within statistical noise. Opus comparison invalidated due to task-type mismatch. Model selection produces 37× more variance than harness choice (7.04 vs 0.19 points). IEEE two-column format. Includes Section IV: Correcting the Opus Paradox and formal erratum. 293 scored runs out of 345 total.

v4 (Superseded) · April 6, 2026

The Harness Is the Moat (v4)

First corrected release. Superseded by v5, which adds IEEE formatting, a formal erratum, and an expanded methodology section. The v4 URL now redirects to v5 content.

v3 · 3 pages · April 6, 2026

Overview Brief (v3)

Short-form summary of key findings. 19 tasks, 19 models. 362 total runs before final corrections.

v2 (Preliminary) · April 6, 2026

Early Results Brief (v2)

Preliminary findings covering production-task results only; the data were incomplete because the coding tasks had not yet been added.

v1 · 3 pages · April 5, 2026

Initial Release (v1)

First public release. Summary of production task findings (TikTok scripts, blog posts) with 10 budget/free models.

Corrections Log

April 7, 2026 — v5 Release

Regenerated study as IEEE two-column PDF with corrected data throughout. Added Section IV "Correcting the Opus Paradox" explaining the methodology error in detail. Added formal erratum. All site references updated from v4 to v5. The v4 PDF URL now serves v5 content for backward compatibility.

April 6, 2026 — v4 Correction

An earlier analysis compared Opus 4.6 across harnesses and found a large apparent gap. This comparison was invalid: the Claude Code Opus runs were primarily coding tasks, while the OpenClaude Opus runs disproportionately included harder production tasks (IEEE paper generation, study pre-registration). Different task types, different difficulty levels.

The valid harness comparison uses Sonnet 4.6 with identical task assignments: 7.90 via Claude Code vs 7.71 via OpenClaude (Δ = 0.19 points, 2.4%) — within statistical noise.

Corrected conclusion: OpenClaude's architecture is not the source of quality differences. Model selection (a 7.04-point spread, 37× the 0.19-point harness gap) matters far more than harness choice.
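For readers checking the figures, the headline numbers reduce to simple arithmetic. Below is a minimal Python sketch using the scores quoted above; it assumes the 2.4% figure is computed relative to the Claude Code score (0.19 / 7.90), which the study text does not state explicitly.

```python
# Arithmetic behind the v5 headline comparison (scores from the summary above).
sonnet_claude_code = 7.90  # mean score: Sonnet 4.6 via Claude Code
sonnet_openclaude = 7.71   # mean score: Sonnet 4.6 via OpenClaude
model_spread = 7.04        # point spread attributed to model selection

harness_gap = sonnet_claude_code - sonnet_openclaude  # 0.19 points
relative_gap = harness_gap / sonnet_claude_code       # assumed baseline: 7.90
variance_ratio = model_spread / harness_gap           # ~37

print(f"harness gap: {harness_gap:.2f} points ({relative_gap:.1%})")
print(f"model selection vs harness: {variance_ratio:.0f}x")
```

Run as-is, this prints "0.19 points (2.4%)" and "37x", matching the v5 figures.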