Architecture · 2026-03-28

Inside one production AI agent: routing and the failure log

What one production AI agent actually looks like after 5,252 jobs: multi-model routing, an explicit fallback chain, a durable SQLite job engine, and the five failure classes that account for the breakage.

A hand-drawn ink sketch of a single central dispatcher hub feeding stacked work cards onto six parallel tracks that fan out to different machine backends, with one blocked track rerouting along a rust-colored path to an alternate machine.

This is one agent. Not a fleet. One loop running 24/7 on a home server in Philadelphia that has dispatched 5,252 jobs, routing each one to whichever model backend the task actually needs. It handles email triage, research pipelines, content scheduling, project management, and home automation. It runs as a Docker container, takes input over Telegram, and executes work through a job engine while I sleep.

This essay is the routing and reliability half of the teardown: how work gets dispatched, how the system behaves when a model is rate limited, and what actually breaks across five thousand jobs. The memory system, composite scoring, and the identity layer are the other half, covered in part two, the memory system and composite scoring. Every number here is from the production database. Every code sample is from the live system.

The framing matters because the consensus has settled. Anthropic's Building Effective Agents argues for augmented single-LLM loops over orchestration diagrams; Cognition's Walden Yan made the same case in Don't Build Multi-Agents. This system is one operating instance of that shape. The hard part was never the model. It was the machine around it.

The numbers first

Before architecture, context. Here is what the production database shows after operating at scale:

Metric	Count
Total jobs dispatched	5,252
Jobs completed successfully	4,918
LLM inference jobs	2,922
Coding-agent + tool-use jobs	1,556

The job type breakdown tells a story about what AI systems actually spend their time doing:

Job Type	Count	What It Does
`llm`	2,922	Pure inference - classify, summarize, judge, extract
`coding-agent`	626	Claude Code with full tool access - file edit, bash, git
`codex`	471	OpenAI Codex agent - alternative coding backend
`opencode`	459	Open-source coding agent fallback
`shell`	449	Direct bash - scripts, CLI tools, data processing
`claude`	167	Claude CLI sessions with custom personas
`gemini`	79	Gemini for image gen, long-context tasks
`agentic`	47	Multi-step autonomous pipelines

The most important thing about that table: LLM jobs are 56% of total volume. The boring inference work - classifying an email, scoring a memory candidate, summarizing a research doc - dwarfs the flashy agentic work. Design your system around the boring 56% before you optimize the exciting 1%.
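As a quick check on the arithmetic, the shares can be recomputed from the table. This is a minimal sketch: the counts are copied from the table above, while the helper name `shareOfTotal` is illustrative. The eight listed rows sum to 5,220, so the remaining 32 dispatched jobs presumably fall under job types the table does not list.

```javascript
// Counts copied from the job type table above.
const jobCounts = {
  llm: 2922,
  'coding-agent': 626,
  codex: 471,
  opencode: 459,
  shell: 449,
  claude: 167,
  gemini: 79,
  agentic: 47,
};
const TOTAL_DISPATCHED = 5252;

// Share of total volume as a percentage, rounded to one decimal place.
function shareOfTotal(count, total = TOTAL_DISPATCHED) {
  return Math.round((count / total) * 1000) / 10;
}

// Sum of the rows that the table actually lists.
const listedTotal = Object.values(jobCounts).reduce((sum, n) => sum + n, 0);

// The three coding backends make up the "coding-agent + tool-use" figure.
const codingBackends =
  jobCounts['coding-agent'] + jobCounts.codex + jobCounts.opencode;

console.log(shareOfTotal(jobCounts.llm));     // 55.6
console.log(shareOfTotal(jobCounts.agentic)); // 0.9
console.log(codingBackends);                  // 1556
console.log(listedTotal);                     // 5220
```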

The multi-model architecture

The number one mistake I see in AI system design is treating model selection as a branding decision. "We use GPT-5." "We're a Claude shop." That thinking collapses the moment you hit a rate limit, a cost ceiling, or a task that genuinely needs a different capability profile.

The actual model routing in this system follows four tiers, each with a clear job description:

Tier	Model	Use Case	Access
Persistent Session	Claude Opus 4.6	Telegram conversations, planning, complex reasoning	Max plan OAuth (unlimited)
Mid	Claude Sonnet 4.6	Engine jobs, synthesis, review passes	Local proxy (Max plan)
Cheap	Claude Haiku 4.5	Classification, routing decisions, quick checks	Local proxy (Max plan)
Embed	text-embedding-3-small	Memory vectors, semantic search	OpenAI API

The unified LLM layer looks like this - every other module calls into here and stays ignorant of providers:

// llm.mjs - all model calls route through here
// No other module imports HTTP or creates API clients directly

export async function mid(messages, opts = {}) {
  return callProxy('sonnet', messages, {
    temperature: opts.temperature ?? 0.2,
    max_tokens: opts.max_tokens ?? 2000,
    timeout: opts.timeout || 30_000,
  });
}

export async function cheap(messages, opts = {}) {
  return callProxy('haiku', messages, {
    temperature: opts.temperature ?? 0.1,
    max_tokens: opts.max_tokens ?? 1000,
    timeout: opts.timeout || 30_000,
  });
}

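// Council entry point: shaped for fan-out to N models with a synthesized consensus.
// As written it makes a single Sonnet pass and returns it in the council's array form.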
export async function bigDecision(messages, opts = {}) {
  const text = await mid(messages, { ...opts, timeout: 60_000 });
  return text ? [{ model: 'sonnet', text }] : [];
}

The proxy architecture matters here. All Anthropic calls route through a local proxy service. This means engine jobs consume the same Max plan quota as the persistent session - no separate API spend for background work. The proxy translates between Anthropic Messages API format and OpenAI Chat Completions, so any model or tool that speaks one format speaks both.
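To make the format translation concrete, here is a minimal sketch of one direction of it: an Anthropic Messages request body converted to an OpenAI Chat Completions body. It is not the proxy's actual code. The function names are hypothetical, only text content is handled, and the model name is passed through unchanged, which a real proxy would likely need to map.

```javascript
// Anthropic content is either a plain string or an array of typed blocks.
// This sketch keeps only the text blocks and joins them with newlines.
function flattenContent(content) {
  if (typeof content === 'string') return content;
  return content
    .filter(block => block.type === 'text')
    .map(block => block.text)
    .join('\n');
}

// Convert an Anthropic Messages API request body into an OpenAI
// Chat Completions request body. Tool calls, images, and streaming
// are deliberately out of scope.
function anthropicToOpenAI(body) {
  const messages = [];

  // Anthropic carries the system prompt as a top-level field;
  // Chat Completions expects it as the first message.
  if (body.system) {
    messages.push({ role: 'system', content: flattenContent(body.system) });
  }
  for (const msg of body.messages) {
    messages.push({ role: msg.role, content: flattenContent(msg.content) });
  }

  const out = { model: body.model, messages };
  if (body.max_tokens !== undefined) out.max_tokens = body.max_tokens;
  if (body.temperature !== undefined) out.temperature = body.temperature;
  return out;
}
```

The structural difference that matters most is the system prompt: a top-level field on one side, the first message on the other. Everything else in the text-only case is a near one-to-one copy.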

The fallback chain

When the Anthropic API is rate limited, the system does not just fail. It falls through to a secondary routing layer - a proxy that translates CLI calls to OpenRouter, which serves alternative models like MiniMax. From the perspective of the job running in the engine, the fallback is invisible. The tool calls still work. The structured outputs still parse. The job completes.

The fallback chain implementation handles cross-attempt rotation - if a backend failed on the last try, it moves to the end of the chain next time:

async function runWithFallback(job, backend, jobId) {
  const chain = buildFallbackChain(backend);

  // If the last attempt failed on a specific backend, demote it:
  // move it to the end of the chain so the other runners are tried first
  const lastFailed = job.last_routed_via;
  if (lastFailed && chain.length > 1) {
    const failedIdx = chain.findIndex(r =>
      r.name.toLowerCase().replace('run', '') === lastFailed
    );
    if (failedIdx >= 0) {
      const [failed] = chain.splice(failedIdx, 1);
      chain.push(failed); // move to end
    }
  }

  let result;
  for (const runner of chain) {
    result = await runner(job);
    if (result.status === 'done' || result.status === 'failed') {
      return result; // success or hard failure
    }
    // rate-limited → try next runner
  }
  // Every runner was rate limited: return the last result instead of undefined
  return result;
}

Rate limit resilience is not a nice-to-have. At 5,000+ jobs, you will hit limits. The question is whether your system degrades gracefully or crashes spectacularly.
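The rotation is easier to see with stub runners. The sketch below is a self-contained toy, not the live code: `runClaude`, `runOpenrouter`, the `FALLBACKS` table, and the `rate_limited` status string are all assumptions standing in for pieces the sample above does not show.

```javascript
// Hypothetical runners. Each returns { status, via }.
// 'rate_limited' is an assumed status meaning "try the next backend".
async function runClaude(job) {
  return { status: 'rate_limited', via: 'claude' };
}
async function runOpenrouter(job) {
  return { status: 'done', via: 'openrouter' };
}

// Hypothetical chain table: primary backend first, then fallbacks.
const FALLBACKS = { claude: [runClaude, runOpenrouter] };

function buildFallbackChain(backend) {
  return [...(FALLBACKS[backend] ?? [])];
}

// Return a copy of the chain with the runner that failed last time moved to the end.
function rotateChain(chain, lastFailed) {
  if (!lastFailed || chain.length < 2) return chain;
  const idx = chain.findIndex(
    r => r.name.toLowerCase().replace('run', '') === lastFailed
  );
  if (idx < 0) return chain;
  const rotated = [...chain];
  rotated.push(...rotated.splice(idx, 1));
  return rotated;
}

async function runChain(job, backend) {
  const chain = rotateChain(buildFallbackChain(backend), job.last_routed_via);
  let result;
  for (const runner of chain) {
    result = await runner(job);
    if (result.status === 'done' || result.status === 'failed') return result;
  }
  return result; // every runner was rate limited
}

// A job that last failed on 'claude' starts on the fallback instead.
runChain({ last_routed_via: 'claude' }, 'claude').then(r => console.log(r.via));
```

Matching runners by function name keeps the chain a plain array of functions, at the cost of coupling the stored `last_routed_via` value to a naming convention.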

The job engine

Every piece of work in the system is a row in a SQLite database. That sounds boring. It is, deliberately. Durable queues in a database give you exactly what you need: persistence across restarts, queryable state, transactional updates, and a single audit trail.

The engine's tick loop is simple: find pending jobs, claim them, dispatch to the right backend, mark complete. The complexity is in the edges.

// engine.mjs - simplified tick loop
async function tick() {
  // 1. Heal stuck jobs (>15 min since last heartbeat)
  await healStuckJobs();

  // 2. Check promise groups (parallel job coordination)
  const pending = queryAll("SELECT group_id FROM promise_groups WHERE status='pending'");
  for (const row of pending) await checkGroup(row.group_id);

  // 3. Advance active plans (auto-dispatch next steps)
  const activePlans = queryAll("SELECT id FROM plans WHERE status IN ('approved','active')");
  for (const p of activePlans) await dispatchPlanSteps(p.id);

  // 4. Claim and dispatch pending jobs
  const jobs = claimPendingJobs({ limit: maxConcurrent });
  await Promise.allSettled(jobs.map(j => dispatchJob(j)));
}

Job workers run as forked child processes - not threads, not async functions in the main loop. Each worker gets its own IPC channel, sends heartbeats every 30 seconds, and writes results directly to the database. If the parent process dies, in-flight jobs are recoverable from state.

The stuck job healer is worth naming explicitly: any job that stops sending heartbeats for 15 minutes gets auto-failed and is eligible for retry. This single mechanism has saved the system from silent hangs more times than any other reliability feature.
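The healer's core decision is small enough to write as a pure function. A minimal sketch, assuming each job row carries a `status` and a `last_heartbeat_at` timestamp in epoch milliseconds; those field names and the `running` status string are assumptions, and the real healer works against the database rather than an in-memory array.

```javascript
const STUCK_AFTER_MS = 15 * 60 * 1000; // 15 minutes without a heartbeat

// Selection step: which running jobs have gone silent?
function findStuckJobs(jobs, now = Date.now(), thresholdMs = STUCK_AFTER_MS) {
  return jobs.filter(
    job => job.status === 'running' && now - job.last_heartbeat_at > thresholdMs
  );
}

// Apply step: mark each stuck job failed so the retry path can pick it up.
// Returns the ids of the jobs that were healed.
function healStuck(jobs, now = Date.now()) {
  const stuck = findStuckJobs(jobs, now);
  for (const job of stuck) {
    job.status = 'failed';
    job.error = 'no heartbeat for 15 minutes';
  }
  return stuck.map(job => job.id);
}

const now = 1_700_000_000_000;
const jobs = [
  { id: 'a', status: 'running', last_heartbeat_at: now - 16 * 60 * 1000 },
  { id: 'b', status: 'running', last_heartbeat_at: now - 5 * 60 * 1000 },
];
console.log(healStuck(jobs, now)); // [ 'a' ]
```

Passing `now` in as a parameter keeps the threshold logic testable without waiting fifteen minutes.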

Promise groups: parallel coordination

When a research task fans out to five parallel browser sessions, you need a way to wait for all of them before synthesizing. Promise groups handle this cleanly - a set of jobs share a group_id, and the engine waits until all members reach a terminal state before triggering the downstream synthesis job.

-- Schema: promise groups link parallel jobs
CREATE TABLE promise_groups (
  group_id  TEXT PRIMARY KEY,
  job_ids   TEXT,          -- JSON array
  strategy  TEXT,          -- 'all' | 'race' | 'allSettled'
  status    TEXT DEFAULT 'pending',
  result    TEXT
);

The strategy field maps directly to JavaScript's Promise.all / Promise.race / Promise.allSettled semantics. A research swarm uses allSettled - collect whatever completes, synthesize from partial results if some fail. A multi-model validation pass uses all - every reviewer must pass before the artifact advances.
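One way to read the three strategies is as a function from member states to a group state. The sketch below is an illustration of those semantics, not the engine's `checkGroup`: the `done` and `failed` status strings, the `finished_at` field, and the `fulfilled` / `rejected` result names are assumptions.

```javascript
const TERMINAL = new Set(['done', 'failed']);

// Decide whether a group is still 'pending', or has become
// 'fulfilled' or 'rejected', from the state of its member jobs.
function resolveGroup(strategy, members) {
  const finished = members.filter(m => TERMINAL.has(m.status));
  const allFinished = finished.length === members.length;

  switch (strategy) {
    case 'all':
      // Like Promise.all: a single failure rejects the group immediately.
      if (finished.some(m => m.status === 'failed')) return 'rejected';
      return allFinished ? 'fulfilled' : 'pending';
    case 'race': {
      // Like Promise.race: the member that finished earliest decides.
      if (finished.length === 0) return 'pending';
      const first = finished.reduce((a, b) =>
        b.finished_at < a.finished_at ? b : a
      );
      return first.status === 'done' ? 'fulfilled' : 'rejected';
    }
    case 'allSettled':
      // Like Promise.allSettled: wait for everyone, never reject.
      return allFinished ? 'fulfilled' : 'pending';
    default:
      throw new Error(`unknown strategy: ${strategy}`);
  }
}

const swarm = [
  { status: 'done', finished_at: 10 },
  { status: 'failed', finished_at: 12 },
];
console.log(resolveGroup('allSettled', swarm)); // fulfilled
console.log(resolveGroup('all', swarm));        // rejected
```

The same partial-failure input settles differently under each strategy, which is the reason the field lives on the group rather than on the jobs.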

The memory system lives in part two

The other load-bearing subsystem is memory: a three-layer store with composite scoring that decides what the agent recalls before every job. It is the part most agentic systems underinvest in and then wonder why their assistant forgets everything. It gets its own teardown in part two, the memory system and composite scoring, because it deserves more than a paragraph here. The rest of this essay stays on dispatch, routing, and what breaks.

Eval in practice

"We evaluated our system and it performs well" is not an eval framework. It is a sentence. Here is what evaluation actually looks like inside a production orchestration loop, at three levels.

Job-level evals

Every coding-agent job has an implicit eval: does the output compile, pass tests, and match the task spec? The system tracks success_criteria per task and runs a Haiku-tier judge after each completion. The judge checks three things: (1) was the success criterion met, (2) were any new problems introduced, (3) does the output require a follow-up job.

The judge model is cheap on purpose. A fast, cheap classifier running on every job is more valuable than an expensive reviewer running on 10% of jobs. You want full coverage at the cost of some false negatives, not perfect accuracy at the cost of blind spots. This is the discipline Hamel Husain argues for: evals are not a launch-day ritual, they are a continuous production loop.
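A judge pass of this kind can be sketched in a few lines. This is an assumed shape, not the production prompt: the JSON field names, the prompt wording, and the fail-closed parsing are illustrative, and the model call is injected as `cheap` so the sketch runs without a backend.

```javascript
// Build the judge prompt from the three checks described above.
function buildJudgeMessages(task, output) {
  return [
    {
      role: 'system',
      content:
        'You are a strict job reviewer. Reply with JSON only: ' +
        '{"criterion_met": boolean, "new_problems": string[], "needs_follow_up": boolean}',
    },
    {
      role: 'user',
      content: `Success criterion:\n${task.success_criteria}\n\nJob output:\n${output}`,
    },
  ];
}

// Parse the reply defensively: anything unparseable counts as "not verified".
function parseVerdict(text) {
  try {
    const match = text.match(/\{[\s\S]*\}/); // tolerate prose around the JSON
    const v = JSON.parse(match[0]);
    return {
      criterion_met: v.criterion_met === true,
      new_problems: Array.isArray(v.new_problems) ? v.new_problems : [],
      needs_follow_up: v.needs_follow_up === true,
    };
  } catch {
    return {
      criterion_met: false,
      new_problems: ['judge reply unparseable'],
      needs_follow_up: true,
    };
  }
}

// `cheap` is any async (messages, opts) => text function, such as the
// Haiku-tier call in the unified LLM layer.
async function judgeJob(task, output, cheap) {
  const reply = await cheap(buildJudgeMessages(task, output), { max_tokens: 300 });
  return parseVerdict(reply ?? '');
}
```

Failing closed on an unparseable reply means a flaky judge produces extra follow-up work rather than a job silently marked as verified.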

Memory quality evals

The memory system runs periodic audits using the same LLM-judge pattern. Candidates are scored against four criteria: specificity (is it a real fact or vague noise?), accuracy (does it contradict known truths?), staleness (is it still true?), and utility (would it actually help in a future session?).

Memories scoring below threshold get flagged for deletion. This is not automatic - a human reviews the deletion candidates - but the flagging happens continuously. The result is a memory store that stays dense and accurate instead of bloating with false positives over time.
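The flagging step might look like the sketch below. The real scoring formula is in part two; here the four criteria are simply averaged, each score is assumed to be in the range 0 to 1 with higher meaning better, `freshness` stands in for the staleness check, and the 0.5 threshold is a placeholder.

```javascript
const CRITERIA = ['specificity', 'accuracy', 'freshness', 'utility'];

// Unweighted mean of the four judge scores. A missing score counts as 0.
function auditScore(scores) {
  const total = CRITERIA.reduce((sum, key) => sum + (scores[key] ?? 0), 0);
  return total / CRITERIA.length;
}

// Return deletion candidates for human review, worst first.
// Nothing is deleted here; this only builds the review list.
function flagForReview(memories, threshold = 0.5) {
  return memories
    .map(m => ({ id: m.id, score: auditScore(m.scores) }))
    .filter(m => m.score < threshold)
    .sort((a, b) => a.score - b.score);
}

const flagged = flagForReview([
  { id: 'm1', scores: { specificity: 0.9, accuracy: 0.9, freshness: 0.8, utility: 0.9 } },
  { id: 'm2', scores: { specificity: 0.2, accuracy: 0.6, freshness: 0.1, utility: 0.3 } },
]);
console.log(flagged.map(m => m.id)); // [ 'm2' ]
```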

System-level evals: the pulse

Every 30 minutes, a pulse job runs. It checks: are the last N jobs completing at normal rates? Are error rates elevated? Are any job types stuck? Is memory recall taking longer than expected?
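Two of those checks, error rate and stuck job types, reduce to simple aggregation over recent rows. A minimal sketch under assumed field names (`status`, `type`) and an assumed 15% alert threshold; the production pulse reads from the database and also covers completion rates and memory recall latency.

```javascript
// Summarize the most recent jobs into a pulse report.
function pulse(recentJobs, { maxErrorRate = 0.15 } = {}) {
  const finished = recentJobs.filter(
    j => j.status === 'done' || j.status === 'failed'
  );
  const failedCount = finished.filter(j => j.status === 'failed').length;
  const errorRate = finished.length === 0 ? 0 : failedCount / finished.length;

  // A job type with work in flight but nothing finished in the window looks stuck.
  const finishedTypes = new Set(finished.map(j => j.type));
  const stuckTypes = [
    ...new Set(
      recentJobs
        .filter(j => j.status === 'running' && !finishedTypes.has(j.type))
        .map(j => j.type)
    ),
  ];

  const alerts = [];
  if (errorRate > maxErrorRate) {
    alerts.push(`error rate ${(errorRate * 100).toFixed(1)}%`);
  }
  if (stuckTypes.length > 0) {
    alerts.push(`no completions for: ${stuckTypes.join(', ')}`);
  }
  return { errorRate, stuckTypes, alerts, healthy: alerts.length === 0 };
}
```

The report is plain data, so the same function can feed a dashboard, a Telegram alert, or a log line.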

This is not ML model evaluation. It is operations monitoring applied to AI systems. The insight is that your AI system is also a software system, and the same SRE practices that keep servers healthy keep AI orchestrators healthy. Dashboards, alerts, anomaly detection, runbooks for common failure modes.

The best eval framework is one that runs continuously in production, not one you dust off before a launch. Real failure modes only appear at real scale.

What actually breaks

In 5,252 jobs, here are the actual failure categories, ranked by frequency. Together they account for 93% of failures:

1. Context loss between jobs (27% of failures). A job completes but the output lands in a temp file that the next job cannot find. The fix: explicit artifact naming conventions, not implicit paths.

2. Rate limit cascades (22%). One tier gets rate limited, fallback kicks in, fallback also gets rate limited, job fails. The fix: the cross-attempt backend rotation described earlier, plus spend caps per hour.

3. Optimistic completion (19%). A job marks itself done after one decent-looking pass instead of verifying against success criteria. The fix: mandatory success criteria fields in the task schema, judge evaluation before status update.

4. Duplicate work (16%). Two jobs independently perform the same research or write to the same file. The fix: plan deduplication checks at dispatch time, promise groups for parallel coordination.

5. Memory contamination (9%). A false or outdated memory surfaces during a session and corrupts downstream reasoning. The fix is the decay and audit mechanism covered in part two.

Chart: failure categories by frequency.

Notice that none of these are model failures. The model almost never produces wrong output that could not have been caught by better orchestration. The failures are infrastructure: routing, state, coordination, verification.
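Two of the fixes above, explicit artifact naming and dispatch-time deduplication, come down to deriving a deterministic key. The sketch below shows one possible convention. The path layout, the normalization rules, and the function names are assumptions, not the conventions the live system uses.

```javascript
// Deterministic artifact path: a later job can recompute it from the
// producing job's id instead of guessing at a temp file.
function artifactPath(jobId, name, root = 'artifacts') {
  const safe = String(name).toLowerCase().replace(/[^a-z0-9._-]+/g, '-');
  return `${root}/job-${jobId}/${safe}`;
}

// Dedup key: jobs with the same type and the same normalized task text collide.
function dedupKey(job) {
  const task = job.task.trim().toLowerCase().replace(/\s+/g, ' ');
  return `${job.type}:${task}`;
}

// Dispatch-time check against the keys of jobs already pending or running.
// Records the key when the job is allowed through.
function shouldDispatch(job, activeKeys) {
  const key = dedupKey(job);
  if (activeKeys.has(key)) return false;
  activeKeys.add(key);
  return true;
}

console.log(artifactPath(42, 'Research Notes.md')); // artifacts/job-42/research-notes.md

const active = new Set();
console.log(shouldDispatch({ type: 'llm', task: 'Summarize Q3 report' }, active));    // true
console.log(shouldDispatch({ type: 'llm', task: '  summarize  q3 report ' }, active)); // false
```

Text normalization only catches near-identical tasks. Two differently worded requests for the same research would still need the plan-level checks the list describes.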

What to steal for your own system

You do not need 5,252 jobs to start building this way. Here are the decisions that matter most, in order of impact:

1. Unified LLM layer first. Before you write a single business-logic function, build a module that owns all model calls. Tier them by cost and capability. Never let other modules talk to APIs directly. This is the foundation everything else rests on.

2. Durable job queue from day one. SQLite is fine. The point is persistence. If your job state lives only in memory, you will lose work during restarts and spend weekends debugging ghost states. Write jobs to a database, poll for them, mark them done transactionally.

3. Explicit fallback chains, not silent degradation. Every model call should have a documented fallback path. Rate limits are guaranteed at scale. Your system's behavior when rate limited is as important as its behavior when everything works.

4. Composite memory scoring over raw similarity. If you are building any kind of persistent AI system, add recency and importance weights to your retrieval scoring before you ship. Raw cosine similarity retrieval produces a system that feels sharp for the first week and increasingly wrong after that. The full scoring formula and the anti-stale rotation are in part two.

5. Cheap judge on everything, expensive reviewer on nothing. A Haiku-tier classifier checking every job output at $0.001 per call beats an Opus-tier reviewer checking 10% of jobs at $0.08 per call. Per 100 jobs, that is $0.10 for full coverage against $0.80 for a tenth of it. Cover everything cheaply, escalate selectively.

6. Operations first, features second. The pulse check, stuck job healer, and spend cap are not glamorous. They are also the reason this system has a 93.6% job completion rate across 5,252 runs (4,918 completed). Build the control plane before you build the intelligence.
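Of the control-plane pieces in that list, the spend cap is the one not shown elsewhere in this essay. Here is a minimal sketch of an hourly cap as a rolling window. The class name, the dollar-based accounting, and the in-memory storage are assumptions; a durable version would keep its entries in the same database as the jobs.

```javascript
// Rolling one-hour spend cap. `now` is injectable so the logic is testable.
class HourlySpendCap {
  constructor(limitUsd, now = () => Date.now()) {
    this.limitUsd = limitUsd;
    this.now = now;
    this.entries = []; // { at: epoch ms, usd: number }
  }

  // Drop entries older than one hour, then total what remains.
  spentLastHour() {
    const cutoff = this.now() - 60 * 60 * 1000;
    this.entries = this.entries.filter(e => e.at > cutoff);
    return this.entries.reduce((sum, e) => sum + e.usd, 0);
  }

  // Returns true and records the spend if it fits under the cap.
  tryCharge(usd) {
    if (this.spentLastHour() + usd > this.limitUsd) return false;
    this.entries.push({ at: this.now(), usd });
    return true;
  }
}

let clock = 0;
const cap = new HourlySpendCap(1.0, () => clock);
console.log(cap.tryCharge(0.6)); // true
console.log(cap.tryCharge(0.6)); // false, would exceed the cap
clock += 61 * 60 * 1000;         // an hour later the window has cleared
console.log(cap.tryCharge(0.6)); // true
```

Checking the cap before dispatch, rather than after, is what stops a rate limit cascade from turning into a retry storm on a paid fallback.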

The progression that actually held:
Nah: A fleet of agents pretending to coordinate
Yeah: One agent in a loop, routing across many tools

The architecture in one diagram

The architectural insight is that the layers (job engine, memory system, model routing, identity layer) are orthogonal. The job engine does not care about memory. The memory system does not care about model routing. The identity layer does not care about job dispatch. You can build each independently, test each in isolation, and compose them without tight coupling.

That is not a philosophical principle. It is why a system running on a home server can dispatch 5,252 jobs without anyone managing it full-time.

The models were, as I have written in the control plane became the product, the easy part. Building the machine that keeps them coherent is the work that takes months and teaches you things no benchmark will. The memory system that holds that coherence across sessions, the composite scoring that decides what the agent recalls, and the 197KB identity layer are the subject of part two, the memory system and composite scoring.