Architecture · 2026-03-28

Inside a Claude Code setup running 6,442 jobs: the completion gate

A real Claude Code system that dispatches jobs at production volume: a triage-router CLAUDE.md, a forked-worker job queue, injectable skills, and a completion gate that bans single-pass work. Built for daily production use, not a demo.

[Image: a conveyor line of identical small job tickets moving toward a single checkpoint gate that inspects and stamps each one. The gate, marked in rust, is the completion gate every job must clear.]

This piece covers the architecture and the completion gate. The memory side (the composite recall score and the anti-stale penalty) has its own essay, "Composite scoring: fixing stale agent recall."

Over its run, this Claude Code system executed 6,442 jobs. Not one of them was marked done on a single pass. That rule, the completion gate, is the load-bearing part of the architecture, and it is the part most setups skip. The rest of this piece walks the system that holds it up: a triage-router CLAUDE.md, a forked-worker job queue, a worker pool, 50+ injectable skills, and a cron scheduler. The interesting parts live in the system you build around the model, the same case Anthropic engineering makes in Building Effective Agents and the 12-factor agents manifesto makes from the deployment side: an augmented single loop with good tooling beats an orchestration diagram.

6,442 jobs executed · 50+ skills loaded · in production since Oct 2025

The CLAUDE.md is your operating system

Everyone starts with CLAUDE.md. Most people put some project notes in it and move on. That is a mistake. CLAUDE.md is the single most important file in the entire setup because it is the only thing that persists reliably across every session, every worker, every spawned subprocess.

My CLAUDE.md does the job of four files at once: a triage router, a set of hard constraints, a workspace map, and a personality definition. Here is the top-level structure:

CLAUDE.md
├── Message Triage (30-Second Rule)
│   ├── Triage Categories (respond / quick work / big work / clarify / defer)
│   ├── Tool budget (60s timeout, allowed vs forbidden tools)
│   └── Plan Preview Gate (non-negotiable)
├── Workspace Rules
│   ├── Database locations (pm.db, LanceDB, core-memory.md)
│   ├── File zones (system/, engine/, projects/, drafts/, images/)
│   └── File hygiene (temp files, naming, auto-sweep rules)
├── SOUL.md (personality, values, behavioral boundaries)
└── Critical Rules (one-liners referencing incidents)

The triage section matters most. Every incoming message gets classified in under 30 seconds:

Category	Signal	Action
Respond	Chat, opinion, quick question	Answer directly
Quick work	Needs tools, under 5 min	Acknowledge, kick off async worker
Big work	Multi-step build or research	Draft plan, wait for approval, then execute
Clarify	Ambiguous scope	Ask one sharp question with options
Defer	Future date	Set reminder

The golden rule, stated explicitly in the file: the main session is for conversation. Work happens off-thread. This single constraint prevents the number one failure mode of agentic systems: the main context window filling up with tool output from a task that should have been delegated.
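The triage table and the golden rule can be read as a small routing table. The sketch below is illustrative only: the category names come from the table, while the action names, the `offThread` flag, and the fallback to `clarify` for unknown categories are assumptions, not the contents of the real CLAUDE.md.

```javascript
// Category names come from the triage table. The action names and the
// offThread flags are illustrative assumptions.
const TRIAGE_ROUTES = new Map([
  ["respond", { action: "answerDirectly", offThread: false }],
  ["quick_work", { action: "startAsyncWorker", offThread: true }],
  ["big_work", { action: "draftPlanAndWait", offThread: true }],
  ["clarify", { action: "askOneQuestion", offThread: false }],
  // Assumption: setting a reminder is cheap enough to do inline.
  ["defer", { action: "setReminder", offThread: false }],
]);

function routeMessage(category) {
  const route = TRIAGE_ROUTES.get(category);
  // Assumption: an unrecognized category falls back to a clarifying question.
  if (!route) return { category: "clarify", ...TRIAGE_ROUTES.get("clarify") };
  return { category, ...route };
}

// Golden-rule check: anything that needs tools must leave the main session.
function staysInMainSession(category) {
  return !routeMessage(category).offThread;
}
```

Keeping the routes in data rather than in branching logic makes the golden rule checkable: any category that does tool work is flagged as off-thread in one place.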

Critical rules are incident reports

Every critical rule in my CLAUDE.md traces back to a real failure. Not hypothetical best practices, actual incidents with dates:

- NEVER ship after single pass. Multi-model review mandatory. (Leni's World incident, Mar 11-12)
- Research existing datasets/APIs BEFORE building scrapers. (Media Framing Study incident, Mar 14)
- Deliverables must be VIEWABLE. Never just a file path. (Mar 15 standard)

This is the pattern that works: something breaks, you write a one-line rule with the incident name, and it becomes permanent policy. The system learns from its mistakes because you encode them where they cannot be forgotten.

The engine: a job queue with opinions

The engine is a Node.js process that ticks every 10 seconds. Each tick does three things: check capacity, claim jobs from priority lanes, and dispatch work to workers.

Engine Tick (every 10s)
├── Check capacity (concurrent slots, token budget, rate limits)
├── Claim next job from priority lanes
│   ├── Lane 0: P0-P1 (urgent, user-facing)
│   ├── Lane 1: P2 (normal scheduled work)
│   └── Lane 2: P3-P5 (background, maintenance)
└── For each claimed job: find or create a worker
    ├── 1. Explicit assignment? Use it
    ├── 2. Project match? Reuse existing worker
    ├── 3. Idle worker with matching backend? Reuse
    └── 4. None found? Create ephemeral worker
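The claim step and the four-step worker lookup above can be sketched as two small functions. This is a minimal illustration, not the engine's code: the field names (`priority`, `createdAt`, `workerName`, `projectId`, `backend`), oldest-first ordering inside a lane, the restriction to idle workers, and the `backend-jobId` naming for ephemeral workers are all assumptions.

```javascript
// Lane boundaries follow the tick diagram: P0-P1, P2, P3-P5.
function laneFor(priority) {
  if (priority <= 1) return 0;
  if (priority === 2) return 1;
  return 2;
}

// Claim the next queued job: lowest lane first, then oldest first in a lane.
// (Oldest-first inside a lane is an assumption; the article does not say.)
function claimNext(queue) {
  const queued = queue.filter((job) => job.status === "queued");
  if (queued.length === 0) return null;
  queued.sort(
    (a, b) => laneFor(a.priority) - laneFor(b.priority) || a.createdAt - b.createdAt
  );
  const job = queued[0];
  job.status = "claimed";
  return job;
}

// The four-step lookup from the diagram, in order. Assumption: only idle
// workers are eligible for reuse.
function findWorker(job, workers) {
  const idle = workers.filter((w) => w.status === "idle");
  const explicit = job.workerName && idle.find((w) => w.name === job.workerName);
  if (explicit) return { worker: explicit, via: "explicit" };
  const sameProject =
    job.projectId != null && idle.find((w) => w.projectId === job.projectId);
  if (sameProject) return { worker: sameProject, via: "project" };
  const sameBackend = idle.find((w) => w.backend === job.backend);
  if (sameBackend) return { worker: sameBackend, via: "backend" };
  // Nothing reusable: create a throwaway worker (naming scheme is hypothetical).
  const ephemeral = {
    name: `${job.backend}-${job.id}`,
    backend: job.backend,
    status: "idle",
    persistent: false,
  };
  return { worker: ephemeral, via: "ephemeral" };
}
```

The ordering matters more than the details: reuse is tried three ways before a new worker is created, which is what keeps warm sessions busy.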

Jobs enter the system from three paths:

Telegram messages. I send a message, a triage worker classifies it, and if it needs real work, it becomes one or more jobs via pm job add.

Cron schedules. The scheduler fires at configured times and creates jobs from templates.

Plan dispatch. A planning pass decomposes a request into multiple jobs with dependencies.

Each job runs in a forked child process. This is a critical design choice. The engine's main loop never blocks on a long-running model call. Multiple jobs execute in parallel. If a job hangs, the supervisor loop keeps ticking.

Engine (main thread, never blocks)
  └── tick()
      ├── fork(job-worker.mjs) → Job A (Claude Opus, writing task)
      ├── fork(job-worker.mjs) → Job B (Sonnet, email triage)
      └── fork(job-worker.mjs) → Job C (Gemini, research)
          Each child sends heartbeats every 30s
          Parent tracks liveness without polling
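"Liveness without polling" comes down to bookkeeping: the parent records a timestamp whenever a heartbeat message arrives and only compares timestamps on its regular tick. The sketch below shows that bookkeeping alone. The 30-second interval is from the diagram; the tolerance of three missed beats and all function names are assumptions. In a real Node.js engine these methods would presumably be called from the `message` and `exit` handlers of a forked child; that wiring is left out so the sketch runs on its own.

```javascript
const HEARTBEAT_INTERVAL_MS = 30_000; // from the diagram
const MISSED_BEATS_ALLOWED = 3; // assumption: the tolerance is not stated

function createLivenessTracker() {
  const lastSeen = new Map(); // jobId -> time of last heartbeat (ms)
  return {
    // Called when a child reports in; nothing else happens until the next tick.
    onHeartbeat(jobId, now) {
      lastSeen.set(jobId, now);
    },
    // Called when a child exits, so finished jobs are never reported as hung.
    onExit(jobId) {
      lastSeen.delete(jobId);
    },
    // Called from the engine tick: which jobs have gone quiet for too long?
    staleJobs(now) {
      const cutoff = HEARTBEAT_INTERVAL_MS * MISSED_BEATS_ALLOWED;
      return [...lastSeen]
        .filter(([, seen]) => now - seen > cutoff)
        .map(([jobId]) => jobId);
    },
  };
}
```

Because the tick only reads a map, a hung child costs the parent nothing: the supervisor keeps ticking and decides separately what to do with stale jobs.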

Capacity management

The system tracks token usage in a rolling 5-minute window with a configurable budget. When usage exceeds 80% of the budget and the queue is deep, backpressure kicks in and pauses new dispatches until load decreases.

There is also a reserved slot for urgent work. Even when all worker slots are full, a P0 job from a direct user message can still get through. This prevents the "system is busy doing maintenance while the human waits" problem that kills every other agent framework I have tried.
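A minimal sketch of how the rolling window, the 80% backpressure rule, and the reserved urgent slot could fit together. The 5-minute window and the 80% ratio are from the text; the `deepQueue` threshold, reserving exactly one slot, and letting P0 jobs bypass backpressure are assumptions made for the example.

```javascript
const WINDOW_MS = 5 * 60 * 1000; // rolling window from the article
const BACKPRESSURE_RATIO = 0.8; // "exceeds 80% of the budget"

function createCapacity({ tokenBudget, maxSlots, deepQueue }) {
  const usage = []; // { at, tokens }, appended in time order

  function tokensInWindow(now) {
    // Drop samples that have aged out of the window, then sum the rest.
    while (usage.length && now - usage[0].at >= WINDOW_MS) usage.shift();
    return usage.reduce((sum, u) => sum + u.tokens, 0);
  }

  return {
    record(tokens, now) {
      usage.push({ at: now, tokens });
    },
    canDispatch({ priority, busySlots, queueDepth, now }) {
      const isUrgent = priority === 0;
      // One slot is held back for P0 work; other jobs see maxSlots - 1.
      const slotLimit = isUrgent ? maxSlots : maxSlots - 1;
      if (busySlots >= slotLimit) return false;
      const overBudget = tokensInWindow(now) > tokenBudget * BACKPRESSURE_RATIO;
      const backpressure = overBudget && queueDepth >= deepQueue;
      // Assumption: urgent jobs are exempt from backpressure.
      return isUrgent || !backpressure;
    },
  };
}
```

With `maxSlots: 4`, ordinary jobs stop dispatching at three busy slots while a P0 job can still take the fourth.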

Token reservation pattern

Job queued    → estimate tokens (prompt length / 4 + context + expected output)
Job running   → track actual usage (input + output tokens from API)
Job complete  → reconcile (release reservation, record actuals)
Over budget   → activate backpressure, pause new dispatches

This is not perfect. Token estimation is fuzzy. But even rough estimation prevents the system from accepting more work than it can handle, which is the actual failure mode you need to avoid.
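The reserve-then-reconcile cycle can be sketched as a small ledger. The estimate formula is the one from the pattern above, reading "prompt length" as characters; the ledger API is hypothetical, and for brevity it tracks a single cumulative budget rather than a rolling window.

```javascript
// Rough size estimate: prompt characters / 4, plus allowances for injected
// context and expected output (both supplied by the caller).
function estimateTokens(prompt, contextTokens, expectedOutputTokens) {
  return Math.ceil(prompt.length / 4) + contextTokens + expectedOutputTokens;
}

function createLedger(budget) {
  const reserved = new Map(); // jobId -> estimated tokens still held
  let spent = 0; // actual tokens recorded from finished jobs

  return {
    // Job queued: hold the estimate, or refuse if it would overshoot the budget.
    reserve(jobId, estimate) {
      const held = [...reserved.values()].reduce((a, b) => a + b, 0);
      if (spent + held + estimate > budget) return false;
      reserved.set(jobId, estimate);
      return true;
    },
    // Job complete: release the reservation and record what was really used.
    reconcile(jobId, inputTokens, outputTokens) {
      reserved.delete(jobId);
      spent += inputTokens + outputTokens;
    },
    spent() {
      return spent;
    },
  };
}
```

Even when the estimate is off, reconciling on completion means errors do not accumulate: the ledger always converges on actual usage.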

Workers: sessions, not tasks

A worker here is a persistent agent session with state, not a one-shot function call.

Worker
├── backend: claude | codex | gemini | opencode
├── model: opus | sonnet | haiku | gemini-2.5-flash
├── persona: pulse | researcher | critic | engineer
├── skills: comma-separated list of injectable skill modules
├── persistent: true (survives job completion) | false (ephemeral)
├── session_id: UUID of current Claude Code session
├── session_job_count: how many jobs this session has handled
└── project_id: optional scope to a specific project

The key insight is session reuse. When a job completes, a persistent worker does not shut down. It goes back to idle. When the next matching job arrives, the same session resumes, with all its accumulated context. This is cheaper (no cold start), faster (warm context), and produces more coherent output across related tasks.

Ephemeral workers get created on demand and discarded after their job finishes. Most jobs use ephemeral workers. But high-value recurring work (email triage, morning briefings, project-specific research) gets persistent workers that build up context over time.

The worker pool at any given moment might look like:

WORKERS (typical weekday morning)
┌──────────────┬─────────┬─────────┬────────────┬──────────┐
│ Name         │ Backend │ Model   │ Status     │ Jobs     │
├──────────────┼─────────┼─────────┼────────────┼──────────┤
│ email        │ claude  │ sonnet  │ working    │ 47 jobs  │
│ briefing     │ claude  │ opus    │ idle       │ 12 jobs  │
│ claude-5302  │ claude  │ opus    │ working    │ 1 job    │
│ codex-5303   │ codex   │ n/a     │ working    │ 1 job    │
│ claude-5304  │ claude  │ sonnet  │ working    │ 1 job    │
└──────────────┴─────────┴─────────┴────────────┴──────────┘

The email worker has handled 47 jobs in its current session. It knows the inbox patterns, recurring senders, what gets archived versus what gets flagged. That context would be expensive to rebuild from scratch every 30 minutes.

Skills: injectable knowledge modules

Skills are markdown files that get injected into a worker's system prompt when needed. They carry no code, only reference material: structured knowledge that tells the agent how to use specific tools, follow specific patterns, or maintain specific standards.

The skill directory has 50+ modules:

skills/
├── smart-browser/SKILL.md    # Autonomous web browsing via Playwright
├── designer/SKILL.md         # OKLCH color palettes, spacing grids, type scales
├── email-manager/SKILL.md    # Gmail API patterns, triage rules
├── memory-writing/SKILL.md   # Quality rubric, category taxonomy
├── web-publisher/SKILL.md    # Static/dynamic site deployment
├── writing/SKILL.md          # Anti-AI voice guidelines
├── healthcheck/SKILL.md      # Self-diagnostic procedures for 9 components
└── ... 40+ more

Each skill has YAML frontmatter with a trigger pattern and a markdown body with the full reference. A job specifies which skills it needs via --skill email-manager,writing. The engine loads those SKILL.md files and injects them into the system prompt.
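Skill injection amounts to string assembly. The sketch below is a guess at the shape, not the engine's loader: it takes skill contents from an in-memory `Map` instead of reading `skills/<name>/SKILL.md` from disk, handles only flat `key: value` frontmatter rather than full YAML, and the `trigger` field name and the `## Skill:` section heading are assumptions.

```javascript
// Split a SKILL.md string into frontmatter fields and body.
// Handles only flat "key: value" lines, not full YAML.
function parseSkill(markdown) {
  const match = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/.exec(markdown);
  if (!match) return { meta: {}, body: markdown.trim() };
  const meta = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { meta, body: match[2].trim() };
}

// skillFlag is the value passed to --skill, e.g. "email-manager,writing".
// files maps a skill name to its SKILL.md contents.
function buildSystemPrompt(basePrompt, skillFlag, files) {
  const names = skillFlag.split(",").map((s) => s.trim()).filter(Boolean);
  const sections = names.map((name) => {
    // Fail loudly: a silently missing skill would change behavior unnoticed.
    if (!files.has(name)) throw new Error(`Unknown skill: ${name}`);
    const { body } = parseSkill(files.get(name));
    return `## Skill: ${name}\n\n${body}`;
  });
  return [basePrompt, ...sections].join("\n\n");
}
```

Only the markdown body reaches the prompt here; the frontmatter stays available for trigger matching.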

This is the alternative to fine-tuning. Instead of training a model to know your system, you inject the knowledge at runtime.

It is less elegant but far more flexible. When I change how the email triage works, I update one markdown file and every future job gets the new behavior. No retraining, no deployment, no model-versioning headaches.

Memory gets its own essay

The hardest subsystem in this setup is memory: a three-tier store, a composite recall score, and an anti-stale penalty that stops the agent from fixating on the same fact. It earned its own write-up, "Composite scoring: fixing stale agent recall," with the scoring formula and the retrieval pipeline around it.

The cron schedule: time-driven work

A production agent system needs to do things because the clock moved, not just because a human typed. The cron scheduler runs inside the container and creates jobs from templates at configured times:

Schedule (minute hour [day])	Job	Model	What It Does
*/30 8-23	email-danny	sonnet	Check my inbox
:05,:35	email-penny	sonnet	Check the Penny inbox
*/15 8-23	status-snapshot	sonnet	Health + queue depth
0 9-22	project-advance	sonnet	Push stalled tasks
0 8	briefing-morning	opus	Daily plan + digest
0 18	briefing-evening	opus	Day review + updates
0 21	end-of-day	opus	Consolidate memory
0 3	nightly-maintenance	opus	Cleanup, hygiene
0 22 Sun	frontier-scout	opus	Scan for new tools
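Most of the schedule entries above read as the minute and hour fields of a cron expression. A minimal matcher for those two fields might look like the sketch below. It is an illustration rather than the scheduler's code: it ignores day fields such as `Sun`, and it assumes an entry like `:05,:35` would be written as `5,35 *`.

```javascript
// Does one cron field (e.g. "*/30", "8-23", "5,35", "0") match a value?
function fieldMatches(field, value) {
  return field.split(",").some((part) => {
    if (part === "*") return true;
    if (part.startsWith("*/")) return value % Number(part.slice(2)) === 0;
    if (part.includes("-")) {
      const [lo, hi] = part.split("-").map(Number);
      return value >= lo && value <= hi;
    }
    return Number(part) === value;
  });
}

// schedule is "minute hour"; a missing hour field is treated as "*".
// Uses the local time of the given Date.
function isDue(schedule, date) {
  const [minute, hour] = schedule.trim().split(/\s+/);
  return (
    fieldMatches(minute, date.getMinutes()) &&
    fieldMatches(hour ?? "*", date.getHours())
  );
}
```

Read this way, `*/30 8-23` fires on the hour and half hour from 08:00 through 23:30, which matches the 30-minute email cadence mentioned earlier.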

In a typical day, more than half of all jobs are maintenance. Health checks, snapshots, triage, project advancement, review loops. This sounds wasteful until you think about the alternative. No maintenance means silent drift. Tasks close too early. Context goes stale. Workers duplicate each other. A system without maintenance crons only looks more autonomous. It is borrowing reliability from the human operator.

Deduplication

Every cron tick checks whether the same job is already queued or running before creating a new one. This uses a two-tier strategy: Redis for fast O(1) lookups, with SQLite as the fallback source of truth. Stale Redis entries get cleaned when the database disagrees.

This prevents the cascade failure where a slow job causes the next cron tick to create a duplicate, which causes two workers to fight over the same task, which corrupts the output. I learned this the hard way.
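One plausible reading of the two-tier check, sketched with in-memory stand-ins: a `Set` plays the role of Redis and a `Map` of job key to status plays the role of SQLite. The function names, the key format, and the choice to trust a cache miss while verifying every cache hit are assumptions; the real engine may order these steps differently.

```javascript
const ACTIVE = new Set(["queued", "running"]);

// cache: Set of job keys (Redis stand-in), or null if the cache is down.
// db: Map of job key -> status (SQLite stand-in, the source of truth).
function isDuplicate(jobKey, cache, db) {
  const activeInDb = () => ACTIVE.has(db.get(jobKey));
  if (!cache) return activeInDb(); // cache unavailable: fall back to the database
  if (!cache.has(jobKey)) return false; // fast path: nothing registered for this key
  if (activeInDb()) return true; // confirmed duplicate
  cache.delete(jobKey); // stale entry: the database disagrees
  return false;
}

// Register a new job in both tiers so the next cron tick sees it.
function registerJob(jobKey, cache, db) {
  db.set(jobKey, "queued");
  if (cache) cache.add(jobKey);
}
```

Verifying a hit against the database costs one extra read, but it is what stops a stale cache entry from suppressing a job that should run.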

The completion gate: no single-pass work

No task gets marked done without a second-pass review.

This is the rule that changed everything. It exists because of an incident in early March. Four tasks shipped with grammar bugs and duplicated content because no one checked the first-pass output. The model was confident. The output looked plausible. And it was wrong in ways that only a second set of eyes would catch.

Now the system enforces this structurally:

- Code tasks must be tested AND reviewed by a different model.
- Writing tasks must be scored against a rubric by a different model.
- Research tasks must be cross-checked before marking complete.
- Projects cannot be marked done by the system at all: they move to pending_approval and require my explicit sign-off via the dashboard.

COMPLETION FLOW
Job completes (first pass)
  → Result saved to disk
  → Review job created (different model)
  → Reviewer scores against criteria
  → Score >= threshold? Mark done
  → Score < threshold? Reopen with feedback
  → Project completion? Route to Danny for approval

The multi-model review is not mainly about catching hallucinations (though it does that). It is about catching output that is plausible but wrong. A single model will produce output that looks right to itself. A second model, especially one from a different model family, will catch assumptions the first one did not question.
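The decision at the end of the completion flow is small enough to sketch in full. The statuses `done` and `pending_approval` follow the text; the `reopened` status name, the 0.8 threshold, the 0-to-1 score scale, and the field names are assumptions, since the article gives none of them.

```javascript
const PASS_THRESHOLD = 0.8; // assumption: the real threshold is not stated

// job: { model, kind }, where kind is "task" or "project".
// review: { reviewerModel, score, feedback }.
function completionGate(job, review) {
  // Structural rule: a model may not review its own first pass.
  if (review.reviewerModel === job.model) {
    throw new Error("Reviewer must be a different model than the author");
  }
  if (review.score < PASS_THRESHOLD) {
    return { status: "reopened", feedback: review.feedback };
  }
  // Projects are never closed by the system; they wait for human sign-off.
  if (job.kind === "project") return { status: "pending_approval" };
  return { status: "done" };
}
```

Note that there is no code path from a first pass to `done` without a review object, which is the point: the gate is enforced by the shape of the function, not by a reminder in a prompt.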

The PM CLI and web publishing

All project management happens through a single CLI tool called pm. It wraps a SQLite database (with WAL mode for concurrent access) and enforces ownership boundaries:

pm project list                                    # See all projects
pm task add --title "..." --project-id 46          # Create work
pm job add --title "..." --prompt "..." --type claude  # Queue engine work
pm task ask 88 --question "Which approach?" \
  --option "a:Direct:Ship now" \
  --option "b:Staged:Test first"                   # Decision card for Danny

The decision card pattern deserves special mention. When the system needs my input, it does not send a wall of text. It creates a structured question with tappable options that appear in a dashboard. I can answer in one tap from my phone. The system resumes automatically when I respond.
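A decision card is essentially a question plus a few parsed options. The sketch below guesses at the shape from the `--option "a:Direct:Ship now"` example: the `key:label:detail` reading of that string, the card fields, and the function names are all assumptions rather than the actual pm data model.

```javascript
// Assumed format: "key:label:detail", inferred from --option "a:Direct:Ship now".
function parseOption(spec) {
  const [key, label, ...rest] = spec.split(":");
  if (!key || !label) throw new Error(`Malformed option: ${spec}`);
  // Re-join so a detail that itself contains ":" survives intact.
  return { key, label, detail: rest.join(":") };
}

function buildDecisionCard(taskId, question, optionSpecs) {
  const options = optionSpecs.map(parseOption);
  const keys = new Set(options.map((o) => o.key));
  if (keys.size !== options.length) throw new Error("Option keys must be unique");
  return { taskId, question, options, answer: null };
}

// Record the tapped option; the returned card is what would unblock the task.
function answerCard(card, key) {
  if (!card.options.some((o) => o.key === key)) {
    throw new Error(`No option "${key}" on this card`);
  }
  return { ...card, answer: key };
}
```

Restricting answers to predefined keys is what makes the one-tap reply safe to act on automatically.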

Every piece of content the system produces can be published instantly. The site-router is Nginx with auto-generated configs. Cloudflare handles SSL via a wildcard cert. Static sites survive container restarts. The system produces artifacts and ships them.

What I would tell you if you are building this

Start with CLAUDE.md and take it seriously. Treat it as the operating system for every agent session, not as documentation. Spend more time on it than on any other file.

Build the job queue before you need it. The moment you have two things happening at once, you need queues, priorities, and capacity management. Most agent frameworks I have tried skip this. Every production system requires it.

Persistent workers are worth the complexity. Session reuse saves tokens, improves coherence, and lets agents build up context about recurring tasks. Ephemeral-only architectures leave performance on the table.

Encode failures as rules. Every incident becomes a one-line constraint in CLAUDE.md. The system's reliability is the sum of its past mistakes.

Maintenance is not overhead. If fewer than a third of your jobs are maintenance, your system is drifting and you do not know it yet.

The completion gate is non-negotiable. Single-pass AI output is a draft, not a finished product. Treat it like one.

Memory needs three layers, minimum. Always-on context, retrieved context, and raw logs. Skip any layer and you get a system that either forgets everything or drowns in noise. The full breakdown, and the recall scoring that keeps it fresh, is in the composite-scoring essay.

The models were the easy part. The system around them is where the actual engineering lives, which is why Building Effective Agents spends its pages on tooling and loops rather than on the model.

Where this goes next

The architecture is stable but still evolving. Three areas I am actively working on:

Cost attribution. The system tracks token usage per job, but I want per-project and per-task cost rollups. When maintenance consumes 40% of total compute, I want to know which maintenance categories earn their keep and which are running out of habit.

Cross-model handoffs. Right now, the review gate sends work to a different model. The next step is structured handoffs where the reviewer's feedback feeds directly back into the original worker's next attempt as a guided iteration with specific fix instructions, not a blind retry.

Skill versioning. Skills are just markdown files today. As the library grows past 50 modules, I need a way to track which version of a skill produced which output. When a skill update changes behavior, I want to know what it changed and when.

If you are building something similar, I would start with the CLAUDE.md and the job queue. Everything else grows from those two decisions.