Tooling Review · 2026-03-29

Claude Code in production: what breaks after 5,000 jobs

Five thousand Claude Code jobs on a home server. The model was rarely the thing that failed. Rate limits, context bloat, and session state were, and every fix lived outside the model.

Figure: a hand-drawn editorial sketch of a single small rust-colored engine held at the center of an elaborate charcoal scaffold of struts, pipes, gauges, and a support cradle. The metaphor: the model runs reliably in production only because of the control plane built around it.

At 3:17 AM last Thursday, my home server hit the Anthropic rate limit. A batch of jobs writing documentation for new functions stalled all at once. After roughly 5,000 Claude Code jobs over three months, I have stopped treating that as an incident. It is weather. You build for it.

The system runs Claude Code, with Opus through the CLI on the Max plan, as its main coding agent. It refactors code, drafts project plans, edits files, and runs bash. Over 90 days it executed 5,112 jobs that touched the filesystem and shell, at a 92.8% completion rate. That number hides where the work actually went.

5,112 Claude Code jobs · 92.8% completion rate

The model was almost never the failure. Anthropic ships a genuinely strong coding agent. What breaks is everything around it: rate limits, context that grows until it gets slow and dull, and session state the model does not carry between calls. The three sections below are the three things that cost me real engineering time, in the order they bit me.

Rate limits are operational, not exceptional

The Max plan has high limits. They are not infinite. You get bursts, then you get throttled. During the 90-day run, in 24-hour windows where I ran more than 50 Claude Code jobs, the system averaged 1.4 rate-limit events. Each one stalled the affected jobs for 10 to 15 minutes.

My first instinct was the wrong one. I built a fallback chain that silently rerouted a throttled job to a second model, then tore it out. Silent degradation is worse than a clean stop, because you stop trusting any output you cannot tell came from the backend you asked for. The model that finished the job is part of the result.

So the rule now is the opposite. One runner per job, chosen explicitly. If it gets rate-limited, the job does not quietly migrate to something cheaper. It backs off, retries against the same backend, and if that fails too, it surfaces the failure to me with the runner named. Honest failure beats a quiet swap I did not ask for.

The retry loop is mundane, which is the point. The interesting decision is what it refuses to do:

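// dispatch(runner, job) and isRateLimit(err) are assumed to be defined
// elsewhere in the orchestrator.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runJob(job, runner) {
  const maxAttempts = 4;
  let wait = 15_000; // first backoff in ms, doubled after each throttled attempt

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await dispatch(runner, job);
    } catch (err) {
      if (!isRateLimit(err) || attempt === maxAttempts) {
        // no silent fallback: report which runner failed, and why,
        // keeping the original error attached as the cause
        throw new Error(`${runner} failed: ${err.message}`, { cause: err });
      }
      await sleep(wait);
      wait *= 2; // back off against the same backend
    }
  }
}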
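As a quick check on what that loop commits to, the waits can be tabulated without running any jobs. The helper below is a hypothetical addition (`backoffSchedule` is not part of the system described here). It only reproduces the arithmetic of `runJob`: a first wait of 15 seconds, doubled after each throttled attempt, and no wait after the last one.

```javascript
// Hypothetical helper: lists the sleeps runJob would perform if every
// attempt hit a rate limit. Defaults mirror its constants (4 attempts, 15 s).
function backoffSchedule(maxAttempts = 4, firstWaitMs = 15_000) {
  const waits = [];
  let wait = firstWaitMs;
  // The final attempt throws instead of sleeping, so there are
  // maxAttempts - 1 waits in total.
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    waits.push(wait);
    wait *= 2;
  }
  return waits;
}

const waits = backoffSchedule();
const totalSeconds = waits.reduce((sum, ms) => sum + ms, 0) / 1000;
console.log(waits, totalSeconds); // waits of 15 s, 30 s, 60 s; 105 s in total
```

With these constants the loop gives up after about 105 seconds of waiting, which is shorter than the 10 to 15 minute stalls described earlier. Under those assumptions a throttled job would surface as a named failure rather than ride out the whole stall. Whether the production values differ from the ones shown is not something the snippet can say.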

Anthropic's own guidance points the same way. Their Claude Code best-practices writeup spends more words on environment, tooling, and iteration loops than on prompts, because that is where production reliability actually comes from.

Context bloat is the slow tax

Claude Code handles large contexts well. But a big context window is a budget, not a free lunch. Past a point you pay for it in latency and in dilution, the model spending attention on things that no longer matter.

My persistent session builds a system prompt that averages 197 KB before any user input or retrieved memory: skill definitions, config, dynamic context. The naive way to keep that bounded is truncation. Cut the oldest messages. For coding tasks that is a bad trade, because the oldest message is often the one that explains why a file looks the way it does.

What worked better was active compaction. Before a new request, a smaller, cheaper model (Haiku) rewrites the history instead of trimming it:

  • Summarize files that were read but not changed. Key points, not the full blob.
  • Condense settled turns. A back-and-forth that ended in a working patch collapses to "requested X, implemented Y."
  • Swap static context in and out. A skill definition rides along in full only while that skill is in use; otherwise it sits condensed.

That cut the average request size for multi-turn coding tasks by 38% without losing the thread.

Context strategy | Avg. tokens | Avg. cost / request (Opus) | Avg. latency
Simple truncation (baseline) | ~75,000 | ~$0.63 | ~45 s
Active compaction (Haiku pre-process) | ~46,500 | ~$0.49 + Haiku | ~30 s

Costs are approximate and track current Anthropic pricing. The Haiku pre-process adds roughly $0.005 and about 3 seconds per request.
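The three compaction rules listed earlier can be sketched as a single pass over the history. This is a minimal illustration under stated assumptions, not the pipeline described here: the entry shapes, the field names, and `compactHistory` itself are hypothetical, and `summarize` stands in for whatever call is made to the smaller model.

```javascript
// Sketch only. Entry shapes and field names are hypothetical:
//   { kind: "file_read", path, content, modified }
//   { kind: "turn", settled, request, outcome, text }
//   { kind: "skill", name, full, condensed }
// `summarize` is an injected async function standing in for the cheaper model.
async function compactHistory(history, { summarize, activeSkills }) {
  return Promise.all(
    history.map(async (entry) => {
      switch (entry.kind) {
        case "file_read":
          // Read-but-unchanged files shrink to key points; edited files stay whole.
          return entry.modified
            ? entry
            : { ...entry, content: await summarize(entry.content) };
        case "turn":
          // A settled exchange collapses to a one-line record.
          return entry.settled
            ? { ...entry, text: `requested ${entry.request}, implemented ${entry.outcome}` }
            : entry;
        case "skill": {
          // The full definition rides along only while the skill is in use.
          const inUse = activeSkills.has(entry.name);
          return { kind: "skill", name: entry.name, text: inUse ? entry.full : entry.condensed };
        }
        default:
          return entry;
      }
    })
  );
}
```

Passing `summarize` in as a parameter keeps the rewrite logic testable with a stub, and keeps the choice of model out of the compaction code.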

You cannot throw raw history at a long-running session and expect it to stay sharp. Compaction is not a tuning knob. It is the difference between a session that holds a thread and one that forgets why it started.

Session state lives outside the model

The hardest part is state. Claude Code is stateless between calls. Coding is not. You open files, edit them, run tests, watch them fail, and try again. That whole arc has to live somewhere the model does not.

So every coding task gets a session_id. File edits, shell commands, and model turns all log against it. When a job dies mid-flight (on a compile error, or on a rate limit caught between generations), the system can rebuild the environment and retry from a known point. Three things make that possible:

  • Transactional file ops. Edits stage, and only commit when the model reports success and the tests pass.
  • Rollback. A job that introduces breaking changes reverts to the last clean state. Over 90 days, 38 jobs needed a partial or full rollback.
  • Heartbeats. Each active session pings every 30 seconds. Silent for 5 minutes, it gets flagged as stalled and killed, which frees the slot.
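The first two items can be illustrated with a staging area that never touches committed state until both conditions hold. This is a sketch under assumptions: `StagedEdits` is a hypothetical name, and an in-memory `Map` stands in for the working tree so the example stays self-contained. A real version would stage against the filesystem or version control.

```javascript
// Sketch only: `files` is an in-memory Map standing in for the working tree.
class StagedEdits {
  constructor(files) {
    this.files = files;      // path -> content (the committed state)
    this.staged = new Map(); // path -> pending content
  }

  // Record an edit without touching the committed state.
  stage(path, content) {
    this.staged.set(path, content);
  }

  // Apply staged edits only if the model reported success AND the tests pass.
  // `runTests` receives a preview of the tree with the edits applied.
  async commit({ modelReportedSuccess, runTests }) {
    const preview = new Map([...this.files, ...this.staged]);
    const ok = modelReportedSuccess && (await runTests(preview));
    if (ok) {
      for (const [path, content] of this.staged) this.files.set(path, content);
    }
    // The staging area is cleared either way. On failure that is the rollback,
    // because the committed state was never modified.
    this.staged.clear();
    return ok;
  }
}
```

In this shape, rollback of an uncommitted change costs nothing. Reverting a change that was already committed and later found to be breaking would need a snapshot history, which the sketch leaves out.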

That durable layer is what the 92.8% completion rate actually measures. The model can write a flawless patch. Without infrastructure to hold the lifecycle of that patch, the patch is just a string in a log.

Treat coding-agent sessions as stateless transactions and you will pay for it in rework. The state has to live in a durable layer you own, not in the context window.
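The heartbeat rule described above (a ping every 30 seconds, flagged after 5 minutes of silence) reduces to a timestamp comparison. A minimal sketch, assuming an in-memory registry keyed by session_id; `heartbeat`, `reapStalled`, and `lastSeen` are hypothetical names, and killing the worker process is left to the caller.

```javascript
const HEARTBEAT_INTERVAL_MS = 30_000;  // each active session pings this often
const STALL_THRESHOLD_MS = 5 * 60_000; // this much silence counts as stalled

// Hypothetical registry: session_id -> timestamp (ms) of the last ping.
const lastSeen = new Map();

// Called by a worker, for example on a timer:
//   setInterval(() => heartbeat(sessionId), HEARTBEAT_INTERVAL_MS);
function heartbeat(sessionId, now = Date.now()) {
  lastSeen.set(sessionId, now);
}

// Returns the stalled session ids and drops them from the registry.
// Dropping the entry is what frees the slot in this sketch.
function reapStalled(now = Date.now()) {
  const stalled = [];
  for (const [sessionId, seenAt] of lastSeen) {
    if (now - seenAt >= STALL_THRESHOLD_MS) {
      stalled.push(sessionId);
      lastSeen.delete(sessionId);
    }
  }
  return stalled;
}
```

Taking `now` as a parameter makes the sweep deterministic under test. A threshold of ten missed pings leaves room for a worker that is busy but alive.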

The costs that do not show up on the bill

Tokens are the obvious cost. They are not the only one. Spawning hundreds of Claude Code workers over time for filesystem and bash work has real overhead, even when they are not all running at once. Each one wants resources transiently. I had to dedicate at least 4 CPU cores and 8 GB of RAM to the orchestration engine to keep peak coding activity from bottlenecking.
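One common way to bound that transient load is a fixed pool of slots, so that only a set number of workers run at once. The sketch below is an assumption about how such a bound could look, not a description of the orchestration engine: `createSlotPool` and `withSlot` are hypothetical names, and the slot count is a parameter rather than a measured value.

```javascript
// Sketch only: a fixed number of slots bounds how many tasks run at once.
function createSlotPool(maxSlots) {
  let active = 0;
  const waiting = []; // resolvers for tasks parked until a slot frees up

  return async function withSlot(task) {
    if (active < maxSlots) {
      active++;
    } else {
      // Park until a finishing task hands its slot over directly.
      await new Promise((resolve) => waiting.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // the slot passes straight to the next waiter
      else active--;    // nobody waiting: the slot becomes free
    }
  };
}
```

Handing the slot to the next waiter without decrementing first avoids a window in which a newly arriving task could take the slot before the parked one resumes.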

Then there is review. The goal is autonomy, but around 15% of coding tasks still want a human, either because the change is sensitive or because the model flagged its own uncertainty. That is not a defect. It is what safe deployment looks like, and it means building the UI and the workflow for a person to step in cleanly.
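The two triggers named above, a sensitive change or a model that flagged its own uncertainty, can be expressed as a small routing check that runs before a job's result is accepted. This is a sketch under assumptions: the path patterns, the `changedPaths` and `flaggedUncertain` fields, and `needsHumanReview` are all hypothetical examples.

```javascript
// Sketch only: example patterns for paths where a change counts as sensitive.
const SENSITIVE_PATHS = [/^\.env/, /(^|\/)secrets\//, /(^|\/)migrations\//];

function needsHumanReview(job) {
  const touchesSensitive = job.changedPaths.some((path) =>
    SENSITIVE_PATHS.some((pattern) => pattern.test(path))
  );
  if (touchesSensitive) return { review: true, reason: "sensitive change" };
  if (job.flaggedUncertain) return { review: true, reason: "model flagged uncertainty" };
  return { review: false, reason: null };
}
```

Returning the reason alongside the decision gives the review UI something to show the person stepping in.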

And generated code is only as good as the tests it is held to. Claude Code will happily pass a weak test suite. Of the 5,112 jobs, 420 existed only to write or harden tests, which tells you how continuous that work is. This is the part the field underrates: the payoff is in the evaluation, not the generation. Hamel Husain makes the case better than I can in his piece on why evals are the bottleneck, and running an agent at volume turned that argument into a daily fact for me.

Figure: operational blind spots. Panels: computational overhead, human in the loop, test quality and generation bottlenecks.

What I would tell anyone doing this

None of the gains came from making the model smarter. They came from the scaffold. If you are putting Claude Code, or any coding agent, into production, the work is here:

  • Fail honestly. No silent model swaps. Retry the runner you chose, then report which one failed and why.
  • Compact, do not truncate. Treat the context window as a budget. Summarize and condense before you hit the wall.
  • Own the state. Durable sessions, transactional file ops, rollback. The model will not do this for you.
  • Plan for the compute. Tool-using agents cost CPU and RAM, not just tokens.
  • Build the review path. A clean way for a human to step in is a feature.
  • Invest in tests. Your agent is only as trustworthy as the suite it has to satisfy.

This matches the through-line in Anthropic's Building Effective Agents. The durable wins come from simple, composable infrastructure around the model, not from a cleverer model. After 5,000 jobs I would put it more bluntly. The model is the easy part. The control plane is the work.