On Friday afternoon, Anthropic sent an email to subscribers. Starting immediately, third-party harnesses - OpenClaw, OpenCode, anything not built by Anthropic - would no longer draw from subscription credits. Pay-per-token or get out.
On April 4, 2026, Anthropic took 135,000 OpenClaw instances offline. That was the headline. The deeper story was optionality: if your harness is a wrapper around one provider, you're not running infrastructure - you're renting it.
By Saturday morning, Hacker News had 400+ comments. Reddit was on fire. OpenClaw's creator called it anti-competitive. Community sentiment split roughly 63% angry, 24% sympathetic, 13% confused. The angry crowd had a point - some of them were paying $200/month for Max and running legitimate automation through OpenClaw. That automation now costs $1,000 to $5,000 per month at API rates.
I read the thread over coffee. Then I checked my dashboard. 147 jobs had run overnight. Email triage, research tasks, a TikTok video render, two article reviews, a trading analysis. All on native Claude Code. All on my $200 Max subscription. Nothing changed. Nothing broke.
That wasn't luck. That was eight months of architectural decisions paying off.
The harness doesn't win the performance race. It wins the optionality race. That distinction is the point of this article.
## Why This Matters Beyond the Drama
The surface story is about pricing. Anthropic couldn't sustain 135,000 OpenClaw instances burning $1,000 to $5,000 worth of compute each on $200 flat subscriptions. Boris Cherny, Head of Claude Code, said it plainly: "Third party services are not optimized in this way, so it's really hard for us to do sustainably."
But the real story is architectural. And it's been obvious for months to anyone running AI at production scale.
Third-party harnesses sit outside the provider's optimization path. They can't benefit from prompt caching the way native tools can. Claude Code's context management keeps warm caches that reduce token burn by roughly 50%. OpenClaw, polling every 30 seconds with cold-start cron jobs, burns through that budget blind. Cherny himself submitted PRs to OpenClaw to improve cache hit rates. Think about that. The Head of Claude Code was trying to fix someone else's harness because the waste was that bad.
The cache efficiency gap isn't a bug. It's a structural property of the architecture. And it's why I never went down that road.
## What I Built Instead
My system is an orchestrator - a hub-and-spoke architecture where a central coordinator dispatches work to specialized workers running across multiple model backends. Not a wrapper around one model. Not a harness that pipes prompts through someone else's API. A production operating system for AI work.
46 days of production data:
| Metric | Number |
|---|---|
| Total jobs executed | 6,442 |
| Completion rate | 93.9% (6,051 succeeded) |
| Failed jobs | 258 (4%) |
| Active projects | 47 |
| Concurrent execution slots | 20 |
| Peak daily throughput | 370 jobs (March 28) |
| Average daily throughput | 209 jobs |
| Monthly cost | ~$205 |
| Cost per job | $0.032 |
That $205 breaks down simply. $200 for Claude Max (unlimited Opus 4.6, Sonnet 4.6, Haiku 4.5 via CLI). About $3 to $5 on OpenRouter for DeepSeek, Kimi, and free-tier models. A couple bucks on OpenAI for embeddings. Everything else is free tier.
The jobs aren't toy demos. Research papers, website deployments, email triage, algorithmic trading decisions, Reddit engagement, TikTok video production, article writing and review pipelines, and a hundred maintenance tasks that keep the system coherent without my involvement.
## The Architecture Nobody Talks About
Every AI demo shows the happy path. User types prompt, model returns answer, audience claps. Nobody shows what happens when you need to run 370 jobs in a day across multiple model providers while keeping costs under $7 for the day.
The core is a job queue backed by SQLite. Every piece of work becomes a row in a jobs table with 60+ columns tracking everything from token consumption to failure class to routing history. The engine ticks every 10 seconds, picks up pending work, routes it to the right backend, and handles whatever goes wrong.
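The tick loop itself is simple. Here's a minimal sketch of one tick, with a plain in-memory array standing in for the SQLite jobs table and a stub in place of real model dispatch:

```javascript
// In production the queue is a SQLite jobs table; an array stands in here.
const queue = [
  { id: 1, status: 'pending' },
  { id: 2, status: 'pending' },
];

// Stub dispatch: the real system routes to a model backend (and is async).
function dispatchStub(job) {
  return { ok: true };
}

// One engine tick: pick up pending work, run it, record the outcome.
function tick() {
  for (const job of queue) {
    if (job.status !== 'pending') continue;
    job.status = 'running';
    try {
      const result = dispatchStub(job);
      job.status = result.ok ? 'done' : 'failed';
    } catch (err) {
      job.status = 'failed';
      job.failure_class = String(err);
    }
  }
}

tick(); // the real engine runs this every 10 seconds via setInterval
```

The shape matters more than the details: every state transition is written back to the row, so a crash mid-tick loses at most one job's progress, never the queue.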
```javascript
const CAPACITY_DEFAULTS = {
  max_slots: 20,
  reserved_danny_slot: 1,
  token_budget_5min: 2_000_000,
  backpressure_queue_depth: 20,
  backpressure_token_pct: 0.8,
};
```
That reserved_danny_slot exists because I learned the hard way. March 14 - I tried to ask Penny a question and got queued behind 19 running jobs. Ten minutes before I could interact with my own system. Now one slot is always reserved for the operator. Production teaches you things that architecture diagrams never will.
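Those defaults drive a simple admission check. A sketch, assuming a hypothetical `state` snapshot of running slots, queue depth, and trailing 5-minute token spend (the `canDispatch` helper is illustrative, not the production code):

```javascript
const CAPACITY_DEFAULTS = {
  max_slots: 20,
  reserved_danny_slot: 1,
  token_budget_5min: 2_000_000,
  backpressure_queue_depth: 20,
  backpressure_token_pct: 0.8,
};

// Admission check: can a new background job start right now?
function canDispatch(state, cfg = CAPACITY_DEFAULTS) {
  // The operator slot is never available to background work.
  const usableSlots = cfg.max_slots - cfg.reserved_danny_slot;
  if (state.running >= usableSlots) return false;
  // Backpressure: too many queued jobs, stop accepting more.
  if (state.queued >= cfg.backpressure_queue_depth) return false;
  // Backpressure: trailing 5-minute token spend near budget.
  const tokenPct = state.tokens_5min / cfg.token_budget_5min;
  if (tokenPct >= cfg.backpressure_token_pct) return false;
  return true;
}
```

With 19 jobs running, `canDispatch` already says no — the twentieth slot belongs to the operator.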
### Model Routing
Not every task needs Opus. The system routes by tier:
- **Tier 3 (Brain):** Opus 4.6 - judgment, planning, complex code
- **Tier 2 (Worker):** Sonnet 4.6 - routine work, analysis, drafting
- **Tier 1 (Grunt):** Haiku 4.5 - status checks, triage, health pings
In practice, the distribution across 6,442 jobs: Opus handles 51.5% (the hard stuff), Sonnet takes 21.9% (the bulk work), Haiku gets 10.8% (the cheap scans). The remaining 16% routes to external models - GPT-4.1, Kimi K2.5, Google Gemini, DeepSeek - through OpenRouter or direct API calls.
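Tier selection itself can be a small lookup. The job-type names below are illustrative, not the production routing table:

```javascript
// Tier-to-model map, matching the tiers described above.
const TIER_MODELS = {
  3: 'opus-4.6',   // Brain: judgment, planning, complex code
  2: 'sonnet-4.6', // Worker: routine work, analysis, drafting
  1: 'haiku-4.5',  // Grunt: status checks, triage, health pings
};

// Hypothetical job types; the real table covers 12 of them.
const JOB_TIERS = {
  'code.feature': 3,
  'article.draft': 2,
  'email.triage': 1,
  'health.ping': 1,
};

function modelFor(jobType) {
  const tier = JOB_TIERS[jobType] ?? 2; // unknown work defaults to the Worker tier
  return TIER_MODELS[tier];
}
```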
The routing isn't static. A coding job starts on Opus. If Opus is rate-limited, it drops to Codex. If Codex chokes, it queues for retry with a cached backoff window. If the Anthropic API itself goes down - and it did, at 2 AM one night in March - the system waterfalls through OpenRouter to whatever's available.
That night, 23 jobs rerouted automatically. Three fell back to DeepSeek R1. One stubborn coding task cascaded all the way down to Mixtral before completing. I was asleep for all of it.
```javascript
async function routeJob(job) {
  for (const backend of FALLBACK_CHAIN) {
    // Skip any backend still inside a known rate-limit window
    const resetAt = rateLimitCache.get(backend.name);
    if (resetAt && Date.now() < resetAt) continue;
    try {
      const response = await dispatch(backend, job);
      if (response.status === 429) {
        // Parse Retry-After (seconds) and remember when this backend reopens
        const retryAfterSec = Number(response.headers['retry-after']) || 60;
        rateLimitCache.set(backend.name, Date.now() + retryAfterSec * 1000);
        continue;
      }
      job.last_routed_via = backend.name;
      return response;
    } catch (err) {
      log.warn(`${backend.name} failed: ${err.message}`);
    }
  }
  throw new Error('All backends exhausted');
}
```
Cache the rate limit reset times. Don't retry blindly after a 429. Parse the header, store it, skip that backend until the window resets. This one change cut my failed-job rate from 8% to under 4%.
### Worker Lifecycle
Each job spawns a fresh claude -p session with context injected via --append-system-prompt. The worker gets a CLAUDE.md file that tells it who it is, what project it belongs to, where to write checkpoints, and how to report back. Workers don't talk to each other. They don't talk to me. They report to the orchestrator.
```javascript
const claudeMdParts = [
  '# Job Context (compaction-proof)',
  `- **Job ID:** ${job.id}`,
  `- **Session:** ${sessionName} (${sessionId})`,
  `- **Workspace:** /home/node/.pennybot/workspace`,
  '',
  '## Checkpoint Directory',
  `All progress files go in: \`${jobDir}/\``,
  `- \`${jobDir}/progress.md\` - current state`,
  `- \`${jobDir}/step-N.md\` - per-step deliverables`,
];
```
This is the context engineering that makes native Claude Code work. Every worker session gets enough context to operate independently but not so much that it loses focus. The system prompt is the environment. The worker is the intelligence that emerges within it.
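For illustration, the launch reduces to assembling the argv for a fresh session. The `-p` and `--append-system-prompt` flags are the ones described above; the `workerArgv` helper and its fields are hypothetical:

```javascript
// Build the command line for a fresh worker session. Assembling rather
// than spawning keeps this sketch testable; production would hand the
// array to child_process.spawn. Helper and field names are illustrative.
function workerArgv(job, claudeMd) {
  return ['claude', '-p', job.prompt, '--append-system-prompt', claudeMd];
}

const argv = workerArgv(
  { prompt: 'Review the article draft and report back.' },
  '# Job Context (compaction-proof)\n- **Job ID:** job-123'
);
```

Because each worker is a separate process with its own injected context, a crashed or confused worker can be killed and respawned without touching any other job.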
## The Benchmark That Changed My Mind About Models
I ran a formal study across 197 coding and writing tasks to figure out which model I should route work to. What I found changed how I think about this entire system.
Sonnet matched Opus on coding. Across 10 coding tasks of varying difficulty, Sonnet 4.6 averaged 7.90/10 while Opus 4.6 averaged 8.14/10. Sonnet beat Opus on 4 tasks outright, tied on 2, lost on 4. On the LRU Cache implementation - a hard algorithms problem - Sonnet scored 9.15 to Opus's 9.0. The cheaper model won the harder task.
Free models failed completely. Qwen3 Coder, a free-tier model through OpenRouter, scored 1.0 - the floor of the scale - on every task. Those weren't merely low scores: literally zero usable implementations were produced. The model accepted the prompt, thought about it for three minutes, and returned nothing. Free isn't cheap when it produces nothing.
The premium writing tier told a different story. I tested five premium models on research design, blog writing, academic papers, and critical analysis:
| Model | Avg Score | Cost (5 tasks) | vs Claude Code |
|---|---|---|---|
| o4-mini | 8.28/10 | $0.10 | 102% of baseline |
| GPT-4.1 | 8.24/10 | $0.10 | 101% |
| DeepSeek R1 | 8.02/10 | $0.04 | 98% |
| Claude Sonnet 4.6 (OpenRouter) | 7.72/10 | $0.37 | 95% |
| Gemini 2.5 Pro | 7.36/10 | $0.29 | 90% |
DeepSeek R1 scored 9.90 on a blog post and 9.25 on a research pre-registration design. Those were the two highest individual scores in the entire study. Total cost for five tasks: four cents.
But Gemini 2.5 Pro, Google's flagship at $1.25/$10 per million tokens, scored 2.2 out of 10 on an IEEE paper. It fabricated every citation. Premium pricing does not mean premium academic rigor. My routing table now sends academic work to o4-mini, the only model in the study that avoided citation fabrication entirely.
The most expensive model tested wasn't the best. Claude Sonnet via OpenRouter cost $0.37 for five tasks - 3.7 times more than o4-mini - and scored half a point lower.
## The Real Finding: The Harness Is the Product
None of this benchmark data matters as much as one observation that keeps showing up in every study I've seen.
LangChain's coding agent jumped from 52.8% to 66.5% on Terminal Bench 2.0 by changing nothing but the execution environment. Same model. Same tasks. Different harness. The environment was the entire difference.
I saw something similar in my own data. OpenClaw had 247,000 GitHub stars and 135,000 active instances. It solved a real problem. But the architecture was wrong for production in ways that compound over time.
No memory continuity. Every OpenClaw session starts cold. My system has three-tier memory: core memory (always present, 6KB identity file), semantic memory (LanceDB vectors scored across similarity, recency, importance, and frequency), and auto-captured episodic memory that persists across sessions.
```javascript
// Memory scoring formula
score = 0.45 * similarity
      + 0.25 * recency
      + 0.20 * importance
      + 0.10 * frequency
```
When a worker picks up a task, it knows what happened yesterday because the environment tells it.
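The formula runs as a one-liner. A minimal sketch, assuming each component is pre-normalized to [0, 1] before scoring:

```javascript
// Weights from the scoring formula above; they sum to 1.0,
// so a memory perfect on every axis scores exactly 1.0.
const WEIGHTS = { similarity: 0.45, recency: 0.25, importance: 0.20, frequency: 0.10 };

function scoreMemory(m) {
  return (
    WEIGHTS.similarity * m.similarity +
    WEIGHTS.recency * m.recency +
    WEIGHTS.importance * m.importance +
    WEIGHTS.frequency * m.frequency
  );
}
```

Similarity dominates on purpose: a highly relevant memory from last month should outrank a vaguely relevant one from this morning.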
No routing intelligence. OpenClaw talks to one provider. If that provider is down or rate-limited, you're stuck. My system routes across 12 job types and 8+ model providers with automatic fallback. The 2 AM Anthropic outage would have killed OpenClaw dead. For me, it was a log entry I read at breakfast.
No quality gates. OpenClaw trusts the model's output. My system doesn't trust anything that matters. Every significant piece of work goes through a second-pass review by a different model family. The maker-checker pattern. The model that wrote it is never the model that scores it.
Security. CVE-2026-25253 - a critical remote code execution vulnerability, CVSS 8.8 - affected 135,000 OpenClaw instances on the public internet. About 12% of community-contributed skills on ClawHub were found to be malicious in a security audit. Native Claude Code runs in an isolated session with scoped permissions. No HTTP server exposed to the internet. No community plugins downloaded from strangers.
## Three Things That Broke and What They Taught Me
Theory is cheap. Here's what the production system actually looked like when it failed.
### The Quality Gate Disaster (March 11)
Workers were closing tasks without second-pass reviews. The system was optimized for throughput, and throughput was winning over verification. A children's educational app went through the pipeline and shipped with grammar errors - "Here is airplane" instead of "Here is the airplane" - and duplicate concept cards that nobody caught because nobody looked.
I caught it manually. Furious at my own system.
The fix was architectural, not prompting. I added a hard rule: no task gets marked done without a second-pass review by a different model family than the one that wrote the draft, and a test run with viewable output. Not a suggestion in the prompt. A gate in the engine that blocks the status transition.
That rule now catches roughly 15% of first-pass work that doesn't meet standard. The system is slower. The output is better. That trade-off is the whole job.
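The gate reduces to a status-transition guard. A sketch - the model-family map and job fields are illustrative, not the production schema:

```javascript
// Map models to families so the maker-checker rule can compare them.
// Illustrative subset; the real table covers every routed model.
const FAMILY = {
  'opus-4.6': 'anthropic',
  'sonnet-4.6': 'anthropic',
  'gpt-4.1': 'openai',
  'deepseek-r1': 'deepseek',
};

// A job may transition to 'done' only if a second-pass review exists,
// came from a different model family than the author, and a test run
// produced viewable output.
function canMarkDone(job) {
  if (!job.review) return false;
  if (FAMILY[job.review.model] === FAMILY[job.author_model]) return false;
  if (!job.test_run_output) return false;
  return true;
}
```

The point is that this lives in the engine, not the prompt: a worker can claim it's finished all it wants, and the status transition still won't happen.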
### The OAuth Chain Reaction (March 11)
An OAuth token expired at 11:02 AM. Four consecutive Pulse jobs - the 30-minute maintenance cycle that handles email triage, status snapshots, and project advancement - failed with 401 auth errors. The system kept dispatching them because auth errors weren't in the failure taxonomy. It wasn't a rate limit. It wasn't a model failure. The existing categories simply didn't cover it.
By 12:35, the token auto-renewed and everything recovered. But for 93 minutes, maintenance was blind. No email checks, no health snapshots, no project advancement. Silent degradation.
The lesson: your failure taxonomy is never complete. Every new failure class that doesn't fit existing categories will cascade until you notice or it resolves itself. I added auth failures as a first-class type and built proactive token renewal checks.
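In code, "first-class" just means the taxonomy names the case explicitly so the engine can act on it. A sketch with illustrative class names:

```javascript
// Classify a failure so the engine can respond differently per class:
// auth triggers token renewal, rate_limit sets a backoff window,
// backend_down falls through to the next backend in the chain.
function classifyFailure(err) {
  if (err.status === 401 || err.status === 403) return 'auth';
  if (err.status === 429) return 'rate_limit';
  if (err.status >= 500) return 'backend_down';
  if (err.code === 'ETIMEDOUT') return 'timeout';
  return 'unknown'; // anything landing here needs a new category
}
```

The 'unknown' bucket is the real lesson: anything that lands there gets surfaced instead of silently retried.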
### The Silent System (March 15)
Me, to my own system: "I feel like you should be updating me without me having to ask. What kind of results were you seeing? Shouldn't you be telling me that kind of stuff?"
The system was doing research. Running jobs. Producing results. And sitting on them. Everything worked technically. Nothing was being communicated. The system had become competent and silent - the AI equivalent of an employee who does solid work but never sends a status update.
This is the failure mode nobody warns you about. Not crashes. Not hallucinations. Silence. The system gets good enough to work autonomously and bad enough at communication that the operator loses trust.
I added proactive update rules to core memory: push progress on long-running projects, surface uncertainty early. Trust isn't a feature you ship once. It's a property you maintain through communication discipline, every day.
## The Economics Were Always Clear
An OpenClaw instance running autonomously consumes $1,000 to $5,000 per month in equivalent API costs. Under a $200 Max subscription, that was a 5 to 25x cost overrun per user. Anthropic was subsidizing this because subscriptions were designed for human-speed interaction, not autonomous agent loops polling every 30 seconds.
My system runs 209 jobs per day on average. Peak was 370. Total monthly cost: $205.
| Approach | Monthly Cost | Jobs/Month | Cost per Job |
|---|---|---|---|
| My system (native) | $205 | ~6,400 | $0.032 |
| OpenClaw on API (post-ban) | $1,000-5,000 | ~6,400 | $0.16-$0.78 |
| OpenClaw on Max (pre-ban) | $200 | ~6,400 | $0.031 |
Pre-ban, the costs looked similar on paper. But pre-ban OpenClaw was running on borrowed time - a pricing arbitrage that Anthropic was always going to close. Building production infrastructure on pricing arbitrage is a bet that the house won't notice. The house always notices.
## What Comes Next
The third-party harness era is over. Not just for Claude. OpenAI, Google, and every other provider will make similar moves as autonomous agent workflows consume orders of magnitude more compute than human-speed usage. The economics don't work any other way.
What replaces it isn't a better harness. It's a different architecture.
Build on the native platform for your primary provider. Use direct API integrations for fallback and specialized routing. No intermediary layer that can be killed by a policy change you learn about on a Friday afternoon.
Memory as infrastructure, not conversation history. Vectors, composite scoring, decay functions. The system that remembers what happened yesterday makes fundamentally different decisions than the one starting fresh every session.
Quality gates as load-bearing walls. Second-pass review by a different model family. Score thresholds with automatic revision queuing. When you're running hundreds of jobs autonomously, the maker-checker pattern is what keeps the output trustworthy.
The people who build these systems now - while the Hacker News thread is still arguing about whether the ban was fair - will have a structural advantage that compounds monthly. Not because they picked the right vendor. Because they built the right architecture.
I've been operating mine for 46 days, 6,442 jobs, and $205 a month. The harness ban didn't touch it because there was nothing to touch.
Danny Nakhla is an AI Solutions Manager with 20 years in tech. He builds production AI orchestration systems, consults on multi-model orchestration, context engineering, and failure recovery, and writes about what actually works when the demos stop and the real work starts.