Last Tuesday at 2am, my orchestration engine dispatched job #6,300. A Codex worker picked it up - a straightforward file refactor across three modules. Nine minutes later it came back clean. The same engine had dispatched a nearly identical task to a Claude Code worker the day before. That one took six minutes but produced code that needed zero follow-up. A week earlier, I'd tried the same class of task on OpenCode routing through a free model. It took four minutes but the output needed two rounds of fixes before it was usable.
That's the whole comparison in miniature. Not which tool is "best" - which tool is best for what, and what does "best" actually cost you when you're running hundreds of these a week.
I've been running all of these tools in the same production job queue since January. Not benchmarks. Not weekend experiments. A real system that now averages 138 jobs per day across Claude Code, Codex, OpenCode, Gemini, and various LLM backends—though it ramped slowly from about 15 jobs/day in January to this pace by late February. The database has 6,363 completed jobs as of this morning. Here's what I actually know.
What Each Tool Actually Is
Strip the marketing and you get three very different things.
Claude Code is Anthropic's agentic coding CLI. It reads your codebase, uses tools (file edits, bash, web search, agent spawning), and maintains context across long sessions. The killer feature isn't the model - it's CLAUDE.md, a markdown file that acts as persistent configuration across every session and subprocess. If you've built anything serious with it, you know that file becomes your operating system. My CLAUDE.md carries triage rules, incident-encoded constraints, workspace maps, and routing logic that took months to develop.
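If you haven't seen one, CLAUDE.md is just markdown that the agent reads at session start. A trimmed, purely illustrative fragment - these rules are examples of the genre, not my actual file:

```markdown
# CLAUDE.md (illustrative excerpt)

## Triage rules
- Jobs tagged `incident` preempt everything else in the queue.
- Never run destructive git commands without an explicit instruction.

## Workspace map
- The orchestration engine lives in `engine/`; workers must not edit it directly.
```

Because every session and subprocess inherits it, each rule you add compounds across hundreds of jobs.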
Codex CLI is OpenAI's answer. Open-sourced under Apache 2.0, it routes through o3 (or whatever frontier model is current in your ChatGPT tier) and runs in a kernel-level sandbox — macOS Seatbelt, Linux Landlock plus seccomp. It doesn't try to be an orchestrator. It tries to be a reliable pair of hands: give it a task, it executes in a sandbox, returns a diff. It uses AGENTS.md instead of CLAUDE.md, which is an open standard but has less configuration surface.
The open-source field - Aider, Continue, OpenCode, and a growing list of others - each solves a different slice of the problem. Aider is the most mature single-agent coding tool. Continue is an IDE extension that embeds AI into your editor workflow. OpenCode tried to be the universal frontend for any model, any provider - and walked into a firestorm in the process.
Where Claude Code Wins (And It's Not Close)
Context Management
I have 1,612 Claude workers in my database. Some of them are persistent - they survive job completion and pick up related work with warm context. My email triage worker has handled dozens of jobs in a single session. It knows inbox patterns, recurring senders, what to archive versus what to flag. That accumulated context would cost thousands of tokens to rebuild from scratch every time.
Claude Code's session persistence is genuine and durable. I've compacted and resumed sessions dozens of times. Conversation history, tool calls, todos - all persisted to disk. When a worker picks up where it left off, it actually picks up where it left off.
Codex doesn't do this. Every cloud task starts fresh. OpenCode has SQLite-backed sessions, which helps, but the coordination layer is entirely on you.
Agentic Orchestration
Here's the real code that routes jobs to workers in my system:
```js
// capacity.mjs - real production config
export const CAPACITY_DEFAULTS = {
  max_slots: 20,
  reserved_danny_slot: 1,
  token_budget_5min: 2_000_000,
  backpressure_queue_depth: 20,
  backpressure_token_pct: 0.8,
};

export const TOKEN_ESTIMATES = {
  llm: 50_000,
  'coding-agent': 200_000,
  codex: 200_000,
  'writing-review': 100_000,
};
```
That reserved_danny_slot is important. Even when all 20 worker slots are full with background jobs, a direct message from me always gets through. I learned this the hard way - the system was once so busy doing maintenance that it couldn't respond to a simple question for ten minutes.
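A minimal sketch of how those defaults might gate dispatch. The function name and state shape here are my illustration, not the actual engine's API - the constants are the real ones from capacity.mjs:

```javascript
// Values from capacity.mjs above, inlined so this sketch is self-contained.
const CAPACITY_DEFAULTS = {
  max_slots: 20,
  reserved_danny_slot: 1,
  token_budget_5min: 2_000_000,
  backpressure_queue_depth: 20,
  backpressure_token_pct: 0.8,
};
const TOKEN_ESTIMATES = { llm: 50_000, 'coding-agent': 200_000, codex: 200_000 };

// Hypothetical dispatch gate.
function canDispatch(job, state) {
  const cfg = CAPACITY_DEFAULTS;
  // Background jobs compete for max_slots minus the reserve;
  // a direct message may use every slot, including the reserved one.
  const slotCap = job.direct ? cfg.max_slots : cfg.max_slots - cfg.reserved_danny_slot;
  if (state.activeSlots >= slotCap) return false;

  // Backpressure: shed background load when the queue is deep.
  if (!job.direct && state.queueDepth >= cfg.backpressure_queue_depth) return false;

  // Token budget: refuse work that would push past 80% of the 5-minute budget.
  const estimate = TOKEN_ESTIMATES[job.backend] ?? TOKEN_ESTIMATES.llm;
  const ceiling = cfg.token_budget_5min * cfg.backpressure_token_pct;
  if (state.tokensUsed5min + estimate > ceiling) return false;

  return true;
}

// Even with 19 of 20 slots busy and a deep queue, a direct message gets through:
console.log(canDispatch(
  { direct: true, backend: 'llm' },
  { activeSlots: 19, queueDepth: 25, tokensUsed5min: 0 },
)); // true
```

The same call with `direct: false` returns false: the 20th slot exists, but background work can never claim it.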
Claude Code's Agent Teams let me run 2 to 16 independent instances that coordinate through a shared task list and peer-to-peer messaging. Codex has cloud parallel tasks, but they don't talk to each other. OpenCode is strictly single-agent - you can run multiple instances, but coordination is your problem.
The Failure Record
Here's what the database actually says:
| Backend | Jobs | Success | Failed | Rate |
|---|---|---|---|---|
| coding-agent | 736 | 643 | 48 | 87.4% |
| codex | 485 | 431 | 47 | 88.9% |
| opencode | 507 | 493 | 8 | 97.2% |
| claude | 497 | 476 | 19 | 95.8% |
| gemini | 89 | 84 | 4 | 94.4% |
Those numbers need context. OpenCode's 97.2% success rate looks amazing until you realize it handles the simplest tasks - formatting, boilerplate, commit messages. Feed it the same complex refactoring that Claude Code handles, and those numbers would look very different. Codex and Claude Code's coding-agent workers handle comparable complexity, and their failure rates are close enough to call it a draw.
But the type of failure differs. Codex failures tend to be clean - the sandbox catches them, the task terminates, you get a clear error. Claude Code failures are more subtle. The task "succeeds" but the output needs revision. My multi-model review pipeline catches these, but they don't show up as failures in the database. If I counted revision-needed as failures, Claude Code's real rate would be lower.
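One note on reading the Rate column: it works out to success divided by total jobs, so jobs that ended in neither state (cancelled, still in flight) quietly drag the rate down. Counting only terminal jobs gives a rosier number:

```javascript
// Rate as reported in the table (success / total jobs) versus a
// terminal-only rate (success / (success + failed)).
const rows = {
  'coding-agent': { jobs: 736, success: 643, failed: 48 },
  codex: { jobs: 485, success: 431, failed: 47 },
};
const rate = (r) => (100 * r.success / r.jobs).toFixed(1);
const terminalRate = (r) => (100 * r.success / (r.success + r.failed)).toFixed(1);

console.log(rate(rows['coding-agent']));         // "87.4" - matches the table
console.log(terminalRate(rows['coding-agent'])); // "93.1" - terminal jobs only
```

Either metric tells the same story for the comparison at hand; just don't mix them when quoting numbers.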
Where Codex Surprises
Speed Per Dollar
Average completion time from my database:
- coding-agent (Claude Code): 376 seconds (~6 min)
- codex: 561 seconds (~9 min)
Wait - Codex is slower? Yes, but that's misleading. Codex tasks in my system tend to be meatier. When I control for comparable tasks, Codex finishes in about 70% of Claude Code's wall-clock time while using roughly 4x fewer tokens. On a per-token basis, it's dramatically more efficient.
Both tools are effectively free for me - Claude Max ($200/mo) and Codex through ChatGPT Plus ($20/mo) are subscriptions I'd have anyway. But if you're on API billing, that 4x token efficiency matters.
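Here's what that efficiency gap looks like as back-of-envelope arithmetic. The per-million-token price below is a placeholder I made up for illustration, not a current Anthropic or OpenAI rate; the token counts follow the ~4x ratio above:

```javascript
// Hypothetical API-billing comparison at 4x token efficiency.
const PRICE_PER_MTOK = 15; // assumed $/1M tokens, same price for both tools
const tokensPerTask = { claudeCode: 200_000, codex: 50_000 }; // ~4x fewer for Codex
const costPerTask = (tokens) => (tokens * PRICE_PER_MTOK) / 1_000_000;

console.log(costPerTask(tokensPerTask.claudeCode)); // 3
console.log(costPerTask(tokensPerTask.codex));      // 0.75
// At ~138 jobs/day, a $2.25-per-job gap is roughly $310/day on API billing.
```

Swap in your real prices and volumes; the point is that a 4x token ratio dominates the math long before model quality enters it.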
The Sandbox Is Real
Codex's sandboxing isn't application-layer permissions like Claude Code's hook system. It's kernel-level isolation - Seatbelt on macOS, Landlock and seccomp on Linux. Network disabled by default. Protected paths. The agent can't escape its box even if the model hallucinates a creative attempt.
Claude Code gives you 17 hook events for fine-grained control, which is more flexible. But Codex's isolation is enforced at a lower layer of the stack, which makes it fundamentally harder to escape. If I'm running untrusted code generation, Codex is where it goes.
Different Failure Modes
When Claude Code fails, it tends to fail creatively - it'll try an approach that's technically interesting but wrong, and because it's good at explaining itself, the failure looks plausible. You have to actually test the output to catch it.
When Codex fails, it fails simply. Wrong variable name. Missed import. Straightforward bugs that a review pass catches immediately. I'd rather debug simple failures than plausible-looking ones.
The Open-Source Gap
Aider Is Good But Different
Aider is the most mature open-source coding agent, and I respect it. It does one thing well: AI-assisted code editing in your terminal. It has smart context management, git integration, support for dozens of models.
But Aider is a coding tool. It's not trying to be an orchestration platform. You can't build a job queue around it. You can't persist sessions across workers. You can't spawn sub-agents that coordinate. If your use case is "I want AI help while I code," Aider is excellent. If your use case is "I want AI to code while I sleep," you need something else.
Continue Is IDE-Bound
Continue embeds AI into VS Code and JetBrains. It's good at what it does - autocomplete, inline edits, chat with your codebase. But it's fundamentally a human-in-the-loop tool. The AI assists; you drive.
For an individual developer writing code, Continue is arguably the best daily-driver experience. For autonomous agent systems, it's not in the running. Different tool for a different job.
OpenCode and the Controversy
OpenCode needs a fair look before you write it off.
The tool itself is solid: support for 75+ providers, a Bubble Tea TUI, SQLite-backed sessions, LSP integration. It's the only tool in this comparison that runs completely offline with Ollama. For privacy-sensitive environments or air-gapped systems, it's the only option.
The controversy: in January 2026, Anthropic blocked OpenCode from using Claude Max subscriptions. The community was furious. DHH called it "very hostile to users." OpenCode gained 18,000 stars in two weeks from the backlash.
My take: OpenCode's architecture mirrors Claude Code's pretty closely—the AGENTS.md convention, the tool-use patterns, the session management. The line between "inspired by" and "wrapping" is blurry. Anthropic had a business reason to block it, and users had a legitimate reason to be angry about it. Both things are true.
What matters for practitioners: OpenCode works great with OpenAI, open-source models, and Anthropic's API (not subscription). I use it in my job queue for tasks that don't need frontier intelligence - routing through free models on OpenRouter. For that use case, nothing else comes close on cost.
But I wouldn't build my core system on it. The project moves fast, the controversy introduced uncertainty about provider relationships, and the coordination layer you'd need to build around it is the same work you'd do with Claude Code but without the ecosystem.
The Real Test: Same Task, Three Tools
I ran a non-trivial task across all three: refactor a 200-line Node.js module to extract a reusable utility, update all call sites, and add basic error handling. Real code from my system, not a contrived example.
Claude Code (coding-agent, Opus):
```
$ claude -p "Refactor capacity.mjs: extract estimateJobTokens
  and computeAvailableSlots into a shared utils module.
  Update all imports." --model opus
> Read 4 files, identified 7 import sites
> Created shared/capacity-utils.mjs
> Updated imports in engine.mjs, dispatch.mjs,
  backpressure.mjs, worker-pool.mjs, 3 test files
> Added JSDoc comments to exported functions
> Ran existing tests - all passing
```

Time: 4m 12s | Files touched: 9 | Follow-up needed: No
Codex:
```
$ codex "Refactor capacity.mjs: extract
  estimateJobTokens and computeAvailableSlots into shared
  utils. Update all imports."
> Created utils/capacity.mjs
> Updated 5 of 7 import sites
> Missed 2 test files
> Tests: 2 failures (unresolved imports)
```

Time: 3m 48s | Files touched: 6 | Follow-up needed: Yes (2 files)
OpenCode (via free-tier model on OpenRouter):
```
$ opencode "Refactor capacity.mjs: extract estimateJobTokens
  and computeAvailableSlots into shared utils. Update imports."
> Created utils/capacity-helpers.mjs
> Updated 3 of 7 import sites
> Naming inconsistency (capacity-helpers vs capacity-utils)
> Tests: 4 failures
```

Time: 2m 15s | Files touched: 4 | Follow-up needed: Yes (4 files + naming)
Claude Code was slower but produced a complete, correct result with zero follow-up. Codex was faster and got 70% there - the missed test files are a pattern I see regularly. OpenCode was fastest but needed the most cleanup, which is exactly what you'd expect from a free-tier model on a task this complex.
The thing benchmarks miss: that follow-up work has a cost. Two rounds of fixes on the Codex result took another 3 minutes. The OpenCode result needed 8 minutes of manual cleanup. Total wall-clock time to a correct result was shortest for Claude Code.
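Tallying it up, using the agent times from the transcripts and the follow-up estimates above:

```javascript
// Total wall-clock time to a correct result: agent time plus human follow-up.
const toMinutes = (m, s) => m + s / 60;
const totalMinutes = {
  claudeCode: toMinutes(4, 12),    // 4m 12s, no follow-up
  codex: toMinutes(3, 48) + 3,     // 3m 48s + ~3 min of fixes
  opencode: toMinutes(2, 15) + 8,  // 2m 15s + ~8 min of manual cleanup
};
console.log(totalMinutes); // { claudeCode: 4.2, codex: 6.8, opencode: 10.25 }
```

The "fastest" tool by agent time finishes last once you count the cleanup.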
Decision Framework: What I'd Tell a Team Choosing Today
Don't ask "which is best." Ask "what am I building?"
Building an autonomous agent system? Claude Code. Nothing else gives you session persistence, agent coordination, and CLAUDE.md-driven configuration in one package. The ecosystem is proprietary, which is a real cost, but the alternative is building all of that infrastructure yourself.
Adding AI to a CI/CD pipeline? Codex. The sandbox model fits naturally into automated workflows where you want contained execution. The cloud task API is built for this.
Cost-sensitive with simple tasks? OpenCode with free-tier models. Zero marginal cost for boilerplate, formatting, and commoditized work. Verify the output and move on.
Individual developer wanting a copilot? Aider or Continue, depending on whether you prefer terminal or IDE. These are better daily-driver experiences than either Claude Code or Codex for interactive coding.
Compliance or air-gapped environment? OpenCode with Ollama. It's the only option that runs fully offline with no data leaving your machine.
Building something serious? Use multiple tools. I run Claude Code as the brain, Codex for sandboxed coding work, and OpenCode for commodity tasks. The operational overhead of managing three backends is lower than the cost of forcing one tool to do everything.
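The routing itself doesn't need to be clever. A sketch of the kind of backend selection I mean - the task taxonomy and thresholds here are illustrative, not my actual triage rules:

```javascript
// Hypothetical backend router: sandboxed Codex for untrusted code,
// frontier models for complex work, free-tier models for commodity tasks.
function pickBackend(task) {
  if (task.untrustedCode) return 'codex';        // kernel-level sandbox
  if (task.kind === 'refactor' || task.multiFile) return 'coding-agent'; // Claude Code
  if (['format', 'boilerplate', 'commit-message'].includes(task.kind))
    return 'opencode';                           // free model via OpenRouter
  return 'llm';                                  // plain completion, no tools
}

console.log(pickBackend({ kind: 'refactor', multiFile: true })); // "coding-agent"
console.log(pickBackend({ kind: 'commit-message' }));            // "opencode"
```

Twenty lines of routing is cheap; the expensive part is the operational discipline of verifying each backend's output at the level of trust it deserves.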
The Elephant: Lock-In and What Happens Next
My system has 1,612 Claude workers and months of CLAUDE.md configuration. If Anthropic changes their pricing, deprecates an API, or decides Claude Max isn't sustainable, I'm in trouble. That's not theoretical - they already changed rate limits on March 26, and I felt it.
Codex being open-source helps, but you're still dependent on OpenAI's API for the model. Open-source the client all you want - the intelligence is still a service call away.
OpenCode's provider flexibility is the correct architectural answer to this problem. Being able to swap models without changing your workflow is the only real hedge against vendor risk. I just wish the tool were more mature.
Here's what I think happens next. Within a year, the configuration standards converge. CLAUDE.md and AGENTS.md merge or one wins. Within two years, the orchestration layer becomes a commodity - there'll be five open-source engines that do what my custom system does. The model intelligence is the moat, not the tool around it.
The question isn't which tool to pick today. It's how much migration debt you're willing to carry when the ground shifts. And it will shift.
I've run 6,363 jobs across these tools. The system works. But I'm not under any illusions that the architecture I built this year will be the architecture I'm running next year. The tools are moving too fast for anyone to be certain. Anyone who tells you otherwise is selling something.
The only honest advice: pick the tool that matches your use case today, keep your configuration portable where you can, and budget time for the migration you'll inevitably need to do.