
Technical Deep-Dive / March 2026

Anatomy of a Production Claude Code Setup

500+ workers, three-layer memory, cron schedules, and the engine that runs it all.


Most Claude Code walkthroughs show you a .claude/ folder with a CLAUDE.md and maybe some custom commands. That is the tutorial version. This is what happens when you run Claude Code as the backbone of a system that dispatches 150+ jobs per day, maintains persistent memory across sessions, reviews its own work with multiple models, and wakes up every morning to check its email before you do.

I have been running a production orchestration system for the last few months. It started as a Claude Code wrapper. It is now an orchestration engine with a job queue, worker pool, three-layer memory system, 50+ injectable skills, a cron scheduler, and a web publishing pipeline. This article walks through the actual architecture — what the files look like, how the pieces connect, and where the hard problems live.

If you are evaluating whether Claude Code can do more than one-shot coding tasks, this is the answer. It can. But the interesting parts are not in the model. They are in the system you build around it.

5,000+ Jobs Executed · 50+ Skills Loaded · 150+ Daily Jobs

The CLAUDE.md Is Your Operating System

Everyone starts with CLAUDE.md. Most people put some project notes in it and move on. That is a mistake. CLAUDE.md is the single most important file in the entire setup because it is the only thing that persists reliably across every session, every worker, every spawned subprocess.

My CLAUDE.md is not a readme. It is a triage router, a set of hard constraints, a workspace map, and a personality definition — all in one file. Here is the top-level structure:

CLAUDE.md
├── Message Triage (30-Second Rule)
│   ├── Triage Categories (respond / quick work / big work / clarify / defer)
│   ├── Tool budget (60s timeout, allowed vs forbidden tools)
│   └── Plan Preview Gate (non-negotiable)
├── Workspace Rules
│   ├── Database locations (pm.db, LanceDB, core-memory.md)
│   ├── File zones (system/, engine/, projects/, drafts/, images/)
│   └── File hygiene (temp files, naming, auto-sweep rules)
├── SOUL.md (personality, values, behavioral boundaries)
└── Critical Rules (one-liners referencing incidents)

The triage section matters most. Every incoming message gets classified in under 30 seconds:

┌──────────────┬───────────────────────────────┬─────────────────────────────────────────────┐
│ Category     │ Signal                        │ Action                                      │
├──────────────┼───────────────────────────────┼─────────────────────────────────────────────┤
│ Respond      │ Chat, opinion, quick question │ Answer directly                             │
│ Quick work   │ Needs tools, under 5 min      │ Acknowledge, kick off async worker          │
│ Big work     │ Multi-step build or research  │ Draft plan, wait for approval, then execute │
│ Clarify      │ Ambiguous scope               │ Ask one sharp question with options         │
│ Defer        │ Future date                   │ Set reminder                                │
└──────────────┴───────────────────────────────┴─────────────────────────────────────────────┘

The golden rule, stated explicitly in the file: the main session is for conversation. Work happens off-thread. This single constraint prevents the number one failure mode of agentic systems — the main context window filling up with tool output from a task that should have been delegated.

Critical Rules Are Incident Reports

Every critical rule in my CLAUDE.md traces back to a real failure. Not hypothetical best practices — actual incidents with dates:

- NEVER ship after single pass. Multi-model review mandatory. (Leni's World incident, Mar 11-12)
- Research existing datasets/APIs BEFORE building scrapers. (Media Framing Study incident, Mar 14)
- Deliverables must be VIEWABLE. Never just a file path. (Mar 15 standard)

This is the pattern that works: something breaks, you write a one-line rule with the incident name, and it becomes permanent policy. The system learns from its mistakes because you encode them where they cannot be forgotten.

The Engine: A Job Queue With Opinions

The engine is a Node.js process that ticks every 10 seconds. Each tick does three things: check capacity, claim jobs from priority lanes, and dispatch work to workers.

Engine Tick (every 10s)
├── Check capacity (concurrent slots, token budget, rate limits)
├── Claim next job from priority lanes
│   ├── Lane 0: P0-P1 (urgent, user-facing)
│   ├── Lane 1: P2 (normal scheduled work)
│   └── Lane 2: P3-P5 (background, maintenance)
└── For each claimed job: find or create a worker
    ├── 1. Explicit assignment? Use it
    ├── 2. Project match? Reuse existing worker
    ├── 3. Idle worker with matching backend? Reuse
    └── 4. None found? Create ephemeral worker
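The lane-ordered claim step can be sketched in a few lines of Node. This is a simplification with illustrative names (`LANES`, `claimNextJob`): the real engine claims jobs from SQLite, not an in-memory array.

```javascript
// Lane order encodes priority: urgent work is always scanned first.
const LANES = [
  { name: "urgent",     priorities: [0, 1] },    // P0-P1: user-facing
  { name: "normal",     priorities: [2] },       // P2: scheduled work
  { name: "background", priorities: [3, 4, 5] }, // P3-P5: maintenance
];

function claimNextJob(queue, runningCount, maxSlots) {
  if (runningCount >= maxSlots) return null; // no capacity this tick
  for (const lane of LANES) {
    const job = queue.find(
      (j) => lane.priorities.includes(j.priority) && j.status === "queued"
    );
    if (job) {
      job.status = "claimed"; // mark so the next tick skips it
      return job;
    }
  }
  return null; // nothing queued
}
```

Because lanes are scanned in order, a P1 job always beats a P3 job even if the P3 job was queued first.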

Jobs enter the system from three paths:

Telegram messages — I send a message, a triage worker classifies it, and if it needs real work, it becomes one or more jobs via pm job add.

Cron schedules — The scheduler fires at configured times and creates jobs from templates.

Plan dispatch — A planning pass decomposes a request into multiple jobs with dependencies.

Each job runs in a forked child process. This is a critical design choice. The engine's main loop never blocks on a long-running model call. Multiple jobs execute in parallel. If a job hangs, the supervisor loop keeps ticking.

Engine (main thread, never blocks)
  └── tick()
      ├── fork(job-worker.mjs) → Job A (Claude Opus, writing task)
      ├── fork(job-worker.mjs) → Job B (Sonnet, email triage)
      └── fork(job-worker.mjs) → Job C (Gemini, research)
          Each child sends heartbeats every 30s
          Parent tracks liveness without polling
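The parent-side liveness tracking can be sketched as follows. This is a minimal sketch, assuming a 30-second heartbeat interval and a two-beat grace period; `WorkerTracker` is an invented name, and the actual fork of `job-worker.mjs` is omitted.

```javascript
const HEARTBEAT_INTERVAL_MS = 30_000;

class WorkerTracker {
  constructor() {
    this.lastBeat = new Map(); // workerId -> timestamp of last heartbeat
  }
  // Called from the child's IPC "message" handler on each heartbeat.
  beat(workerId, now = Date.now()) {
    this.lastBeat.set(workerId, now);
  }
  // Workers silent for more than two missed heartbeats are flagged.
  stale(now = Date.now(), grace = 2 * HEARTBEAT_INTERVAL_MS) {
    return [...this.lastBeat.entries()]
      .filter(([, t]) => now - t > grace)
      .map(([id]) => id);
  }
}
```

The point of the pattern: the supervisor never polls a child; it only reads timestamps the children pushed, so a hung worker costs nothing until the next tick inspects the map.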

Capacity Management

The system tracks token usage in a rolling 5-minute window with a configurable budget. When usage exceeds 80% of the budget and the queue is deep, backpressure kicks in and pauses new dispatches until load decreases.

There is also a reserved slot for urgent work. Even when all worker slots are full, a P0 job from a direct user message can still get through. This prevents the "system is busy doing maintenance while the human waits" problem that kills every other agent framework I have tried.

Token Reservation Pattern

Job queued    → estimate tokens (prompt length / 4 + context + expected output)
Job running   → track actual usage (input + output tokens from API)
Job complete  → reconcile (release reservation, record actuals)
Over budget   → activate backpressure, pause new dispatches

This is not perfect. Token estimation is inherently fuzzy. But even rough estimation prevents the system from accepting more work than it can handle — and overcommitment, not estimation error, is the failure mode you actually need to avoid.
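The reserve/reconcile cycle above can be sketched directly. The 4-characters-per-token heuristic and the 80% backpressure threshold come from this article; the class and method names are illustrative.

```javascript
class TokenBudget {
  constructor(budget) {
    this.budget = budget; // rolling-window token budget
    this.reserved = 0;    // estimates for queued/running jobs
    this.used = 0;        // actuals reported by completed jobs
  }
  // Rough estimate: ~4 characters per token, plus expected output.
  estimate(prompt, expectedOutput = 1000) {
    return Math.ceil(prompt.length / 4) + expectedOutput;
  }
  reserve(tokens) {
    this.reserved += tokens;
  }
  // On completion: release the reservation, record what really happened.
  reconcile(reservedTokens, actualTokens) {
    this.reserved -= reservedTokens;
    this.used += actualTokens;
  }
  // Backpressure when committed load exceeds 80% of the budget.
  backpressure() {
    return (this.used + this.reserved) / this.budget > 0.8;
  }
}
```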

Workers: Sessions, Not Tasks

A worker in this system is not a function call. It is an agent session with persistent state.

Worker
├── backend: claude | codex | gemini | opencode
├── model: opus | sonnet | haiku | gemini-2.5-flash
├── persona: pulse | researcher | critic | engineer
├── skills: comma-separated list of injectable skill modules
├── persistent: true (survives job completion) | false (ephemeral)
├── session_id: UUID of current Claude Code session
├── session_job_count: how many jobs this session has handled
└── project_id: optional scope to a specific project

The key insight is session reuse. When a job completes, a persistent worker does not shut down. It goes back to idle. When the next matching job arrives, the same session resumes — with all its accumulated context. This is cheaper (no cold start), faster (warm context), and produces more coherent output across related tasks.

Ephemeral workers get created on demand and discarded after their job finishes. Most jobs use ephemeral workers. But high-value recurring work — email triage, morning briefings, project-specific research — gets persistent workers that build up context over time.
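The four-step match from the tick diagram can be sketched as a single function. A sketch with invented field names, not the engine's actual code:

```javascript
function findWorker(job, workers) {
  // 1. Explicit assignment wins.
  if (job.workerId) {
    const assigned = workers.find((w) => w.id === job.workerId);
    if (assigned) return assigned;
  }
  // 2. Reuse a worker already scoped to this project (warm context).
  const byProject = workers.find(
    (w) => w.projectId && w.projectId === job.projectId
  );
  if (byProject) return byProject;
  // 3. Any idle worker on the right backend.
  const idle = workers.find(
    (w) => w.status === "idle" && w.backend === job.backend
  );
  if (idle) return idle;
  // 4. Nothing matched: spin up a fresh ephemeral worker.
  return {
    id: `${job.backend}-${Date.now()}`,
    backend: job.backend,
    status: "idle",
    persistent: false, // discarded after the job finishes
  };
}
```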

The worker pool at any given moment might look like:

WORKERS (typical weekday morning)
┌──────────────┬─────────┬─────────┬────────────┬──────────┐
│ Name         │ Backend │ Model   │ Status     │ Sessions │
├──────────────┼─────────┼─────────┼────────────┼──────────┤
│ email        │ claude  │ sonnet  │ working    │ 47 jobs  │
│ briefing     │ claude  │ opus    │ idle       │ 12 jobs  │
│ claude-5302  │ claude  │ opus    │ working    │ 1 job    │
│ codex-5303   │ codex   │ —       │ working    │ 1 job    │
│ claude-5304  │ claude  │ sonnet  │ working    │ 1 job    │
└──────────────┴─────────┴─────────┴────────────┴──────────┘

The email worker has handled 47 jobs in its current session. It knows the inbox patterns, recurring senders, what gets archived versus what gets flagged. That context would be expensive to rebuild from scratch every 30 minutes.

Skills: Injectable Knowledge Modules

Skills are markdown files that get injected into a worker's system prompt when needed. They are not code. They are reference documents — structured knowledge that tells the agent how to use specific tools, follow specific patterns, or maintain specific standards.

The skill directory has 50+ modules:

skills/
├── smart-browser/SKILL.md    # Autonomous web browsing via Playwright
├── designer/SKILL.md         # OKLCH color palettes, spacing grids, type scales
├── email-manager/SKILL.md    # Gmail API patterns, triage rules
├── memory-writing/SKILL.md   # Quality rubric, category taxonomy
├── web-publisher/SKILL.md    # Static/dynamic site deployment
├── writing/SKILL.md          # Anti-AI voice guidelines
├── healthcheck/SKILL.md      # Self-diagnostic procedures for 9 components
└── ... 40+ more

Each skill has YAML frontmatter with a trigger pattern and a markdown body with the full reference. A job specifies which skills it needs via --skill email-manager,writing. The engine loads those SKILL.md files and injects them into the system prompt.

This is the alternative to fine-tuning. Instead of training a model to know your system, you inject the knowledge at runtime.

It is less elegant but infinitely more flexible. When I change how the email triage works, I update one markdown file and every future job gets the new behavior. No retraining, no deployment, no versioning headaches.

Memory: Three Layers

Memory is the hardest problem in agentic systems. Not because storing text is hard, but because retrieval is hard. You need the right memories at the right time, and "right" changes depending on what the agent is doing.

The system uses three layers:

MEMORY ARCHITECTURE
┌──────────────────────────────────────────────────────┐
│ Layer 3: CORE MEMORY (always loaded)                 │
│ - core-memory.md (~80 lines)                         │
│ - Identity, priorities, critical rules               │
│ - Edited manually, persists forever                  │
├──────────────────────────────────────────────────────┤
│ Layer 2: SEMANTIC MEMORY (retrieved on demand)       │
│ - LanceDB vector store                              │
│ - Categories: preference, decision, lesson, fact,    │
│   person, event, context                             │
│ - Composite scoring: 45% similarity + 25% recency   │
│   + 20% importance + 10% access frequency            │
│ - Anti-stale: 50% penalty if recalled within 1 hour  │
├──────────────────────────────────────────────────────┤
│ Layer 1: EPISODIC MEMORY (append-only daily logs)    │
│ - JSONL files per day                                │
│ - What happened, what was decided, what failed       │
│ - 30-day rolling window                              │
│ - Auto-compacted after 100 entries                   │
└──────────────────────────────────────────────────────┘

Core memory is the boot context. It loads into every session, every worker, every spawned subprocess. It contains who the system is, who I am, what the active projects are, and what the non-negotiable rules are. It has an 80-line budget. Every line has to earn its place.

Semantic memory is the long-term store. When a job starts, the engine runs a recall query against LanceDB using the job's prompt as the search vector. The top-K results get injected into the job's context. The composite scoring formula prevents two failure modes: recency bias (old important facts getting buried) and echo chambers (the same memory being recalled over and over).

The anti-stale mechanism is worth calling out. If a memory was recalled in the last hour, its score gets cut in half. This forces diversity in retrieved context and prevents the system from fixating on one fact at the expense of others.
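The composite score and the anti-stale penalty fit in one function. A sketch assuming all four inputs are pre-normalized to [0, 1]; field names are illustrative.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000;

function recallScore(mem, now = Date.now()) {
  // Composite: 45% similarity + 25% recency + 20% importance
  //            + 10% access frequency.
  let score =
    0.45 * mem.similarity +
    0.25 * mem.recency +
    0.2 * mem.importance +
    0.1 * mem.accessFrequency;
  // Anti-stale: halve anything recalled within the last hour,
  // forcing diversity in the retrieved context.
  if (now - mem.lastRecalledAt < ONE_HOUR_MS) score *= 0.5;
  return score;
}
```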

Episodic memory is the raw log. It does not get retrieved automatically. It exists for two purposes: nightly consolidation (a cron job reviews the day's episodes and promotes important facts to semantic memory) and debugging (when something goes wrong, the episode log tells you exactly what happened and when).

Auto-Capture

After every job completes, an LLM judges whether the session produced memory-worthy facts. If so, it extracts them and stores them in the semantic layer with appropriate categories and importance scores. The system learns without explicit instruction — it watches its own work and remembers what matters.

The Cron Schedule: Time-Driven Work

A production agent system needs to do things because the clock moved, not just because a human typed. The cron scheduler runs inside the container and creates jobs from templates at configured times:

┌───────────┬─────────────────────┬────────┬──────────────────────┐
│ Time      │ Job                 │ Model  │ What It Does         │
├───────────┼─────────────────────┼────────┼──────────────────────┤
│ */30 8-23 │ email-danny         │ sonnet │ Check Danny's inbox  │
│ :05,:35   │ email-penny         │ sonnet │ Check Penny's inbox  │
│ */15 8-23 │ status-snapshot     │ sonnet │ Health + queue depth │
│ 0 9-22    │ project-advance     │ sonnet │ Push stalled tasks   │
│ 0 8       │ briefing-morning    │ opus   │ Daily plan + digest  │
│ 0 18      │ briefing-evening    │ opus   │ Day review + updates │
│ 0 21      │ end-of-day          │ opus   │ Consolidate memory   │
│ 0 3       │ nightly-maintenance │ opus   │ Cleanup, hygiene     │
│ 0 22 Sun  │ frontier-scout      │ opus   │ Scan for new tools   │
└───────────┴─────────────────────┴────────┴──────────────────────┘

In a typical day, more than half of all jobs are maintenance. Health checks, snapshots, triage, project advancement, review loops. This sounds wasteful until you think about the alternative. No maintenance means silent drift. Tasks close too early. Context goes stale. Workers duplicate each other. A system without maintenance crons is not more autonomous — it is just borrowing reliability from the human operator.

Deduplication

Every cron tick checks whether the same job is already queued or running before creating a new one. This uses a two-tier strategy: Redis for fast O(1) lookups, with SQLite as the fallback source of truth. Stale Redis entries get cleaned when the database disagrees.

This prevents the cascade failure where a slow job causes the next cron tick to create a duplicate, which causes two workers to fight over the same task, which corrupts the output. I learned this the hard way.
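The two-tier check can be sketched as follows. The stubs stand in for Redis and SQLite, and the calls are shown synchronously for brevity; the real clients are async.

```javascript
// Fast cache first, database as source of truth. Stale cache entries
// are evicted whenever the two tiers disagree.
function isDuplicate(jobKey, cache, db) {
  if (cache.has(jobKey)) {
    if (db.jobActive(jobKey)) return true; // confirmed: still queued/running
    cache.delete(jobKey); // cache said yes, DB said no: evict stale entry
    return false;
  }
  return db.jobActive(jobKey); // cache miss: fall back to the DB
}
```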

The Completion Gate: No Single-Pass Work

No task gets marked done without a second-pass review.

This is the rule that changed everything. It exists because of an incident in early March. Four tasks shipped with grammar bugs and duplicated content because no one checked the first-pass output. The model was confident. The output looked plausible. And it was wrong in ways that only a second set of eyes would catch.

Now the system enforces this structurally:

- Code tasks must be tested AND reviewed by a different model.
- Writing tasks must be scored against a rubric by a different model.
- Research tasks must be cross-checked before marking complete.
- Projects cannot be marked done by the system at all — they move to pending_approval and require my explicit sign-off via the dashboard.

COMPLETION FLOW
Job completes (first pass)
  → Result saved to disk
  → Review job created (different model)
  → Reviewer scores against criteria
  → Score >= threshold? Mark done
  → Score < threshold? Reopen with feedback
  → Project completion? Route to Danny for approval

The multi-model review is not about catching hallucinations (though it does that). It is about catching plausibility. A single model will produce output that looks right to itself. A second model, especially a different model family, will catch assumptions the first one did not question.
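The gate itself can be sketched as a small function. `reviewFn` is a stand-in for the cross-model review job, and the 0.8 threshold is an assumed default, not the system's actual value.

```javascript
function completionGate(job, reviewFn, threshold = 0.8) {
  const review = reviewFn(job.result); // scored by a different model family
  if (job.type === "project") {
    // Projects are never closed by the system: route to human approval.
    return { status: "pending_approval", review };
  }
  return review.score >= threshold
    ? { status: "done", review }
    : { status: "reopened", feedback: review.feedback };
}
```

A reopened job carries the reviewer's feedback back into the queue, so the second attempt starts from the critique rather than from scratch.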

The PM CLI and Web Publishing

All project management happens through a single CLI tool called pm. It wraps a SQLite database (with WAL mode for concurrent access) and enforces ownership boundaries:

pm project list                                    # See all projects
pm task add --title "..." --project-id 46          # Create work
pm job add --title "..." --prompt "..." --type claude  # Queue engine work
pm task ask 88 --question "Which approach?" \
  --option "a:Direct:Ship now" \
  --option "b:Staged:Test first"                   # Decision card for Danny

The decision card pattern deserves special mention. When the system needs my input, it does not send a wall of text. It creates a structured question with tappable options that appear in a dashboard. I can answer in one tap from my phone. The system resumes automatically when I respond.

Every piece of content the system produces can be published instantly. The site-router is Nginx with auto-generated configs. Cloudflare handles SSL via a wildcard cert. Static sites survive container restarts. The system does not just produce artifacts — it ships them.

What I Would Tell You If You Are Building This

Start with CLAUDE.md and take it seriously. It is not documentation. It is the operating system for every agent session. Spend more time on it than on any other file.

Build the job queue before you need it. The moment you have two things happening at once, you need queues, priorities, and capacity management. Every agent framework skips this. Every production system requires it.

Persistent workers are worth the complexity. Session reuse saves tokens, improves coherence, and lets agents build up context about recurring tasks. Ephemeral-only architectures leave performance on the table.

Encode failures as rules. Every incident becomes a one-line constraint in CLAUDE.md. The system's reliability is literally the sum of its past mistakes.

Maintenance is not overhead. If maintenance accounts for less than a third of your jobs, your system is drifting and you do not know it yet.

The completion gate is non-negotiable. Single-pass AI output is not production-ready. It is a draft. Treat it like one.

Memory needs three layers, minimum. Always-on context (core memory), retrieved context (semantic search), and raw logs (episodic). Skip any layer and you get a system that either forgets everything or drowns in noise.

The models were the easy part. The system around them — that is where the actual engineering lives.

Where This Goes Next

The architecture is stable but still evolving. Three areas I am actively working on:

Cost attribution. The system tracks token usage per job, but I want per-project and per-task cost rollups. When maintenance consumes 40% of total compute, I want to know which maintenance categories earn their keep and which are just running out of habit.

Cross-model handoffs. Right now, the review gate sends work to a different model. The next step is structured handoffs where the reviewer's feedback feeds directly back into the original worker's next attempt — not as a retry, but as a guided iteration with specific fix instructions.

Skill versioning. Skills are just markdown files today. As the library grows past 50 modules, I need a way to track which version of a skill produced which output. When a skill update changes behavior, I want to know what it changed and when.

If you are building something similar, I would start with the CLAUDE.md and the job queue. Everything else grows from those two decisions.

Frequently Asked Questions

What should be in my CLAUDE.md?

At minimum: a triage system for incoming messages (what gets answered vs. what gets delegated), workspace rules (where files live, what is off-limits), and critical rules derived from actual incidents. Most CLAUDE.md files I see are too short and too generic. If you are running Claude Code for real work, your CLAUDE.md should be the most carefully maintained file in your project.

How do you configure Claude Code for production systems?

The key decisions are: persistent vs ephemeral workers (use persistent for recurring tasks like email triage), model routing (not everything needs Opus — Sonnet handles 70% of routine work), and capacity management (token budgets, backpressure, reserved slots for urgent work). The engine handles all of this, but the configuration lives in a combination of CLAUDE.md constraints and job-type templates.

How do Claude Code skills work?

Skills are markdown files injected into the system prompt at runtime. They contain tool-specific patterns, safety constraints, and reference data. The agent does not need training data to know how to use Gmail or publish a website — the skill file gives it the exact API patterns and error handling for that domain. Update the file, and every future job gets the new behavior. No retraining required.

How much does this cost to run?

The primary cost is API tokens. With Anthropic's Max plan and tiered model routing (Opus for complex work, Sonnet for routine, Haiku for classification), the token cost is manageable for a solo operator. The real cost is the maintenance overhead: roughly 40% of all compute goes to health checks, triage, and status snapshots. That is the autonomy tax. I am writing a separate piece on that.

Can I build this without the custom engine?

You can get surprisingly far with just Claude Code, a well-structured CLAUDE.md, and custom slash commands. The engine becomes necessary when you need concurrent workers, priority lanes, cron-driven automation, and capacity management. If you are running fewer than 20 jobs per day, start without the engine and add it when you hit the concurrency wall.
