Production Infrastructure
How I orchestrate 6,300+ autonomous AI jobs
This is the architecture behind everything else on this site. A hub-and-spoke orchestration system that routes work across 39 models from 8 providers, manages persistent agent sessions, and runs 24/7 on a single Docker host. It's been processing production jobs since February 2026. Most of them worked. The ones that didn't taught me more.
6,367
Jobs processed
39
Models routed
93.9%
Success rate
$0.032
Avg cost/job
01
Architecture overview
The system follows a hub-and-spoke pattern. A central orchestration engine accepts work as jobs, routes them to the right model and worker, tracks dependencies between jobs, and handles failure recovery. Everything flows through a single SQLite database that serves as the system's source of truth.
Two layers keep things sane. Layer 1 is the engine: a silent async job queue that executes work without ever messaging me. Layer 2 is the judgment layer: it decides what I actually need to see. 95% of completed work produces zero notifications. The other 5% arrives as tappable decision cards on my phone.
I didn't plan it this way. The system started as a Telegram bot and a single Claude API call. Over 45 days it accumulated enough operational complexity that the architecture had to emerge or collapse. It emerged.
02
Worker system
Workers are long-lived agent sessions with configuration: a backend (Claude, Codex, Gemini, LLM), a model, a persona, loaded skills, and persistence rules. The system manages around 20 concurrent execution slots. Some workers are persistent - they handle email triage, morning briefings, and project-specific context. Others are ephemeral - spun up for a single job and discarded.
The critical insight that took weeks to learn: working directory IS session identity. Claude Code stores sessions under ~/.claude/projects/<cwd-hash>/<session-id>.jsonl. If you get the working directory wrong, your agents lose accumulated context. Every session restart, every container reboot, this is the thing that breaks first.
Session routing tries four paths in order: explicit assignment, project match, idle worker with a matching backend, then auto-create an ephemeral worker. Each session gets safety checks before reuse - file exists, size under limit, reuse count, age. Sessions that accumulate too much context get compacted by an overnight maintenance cron rather than abandoned.
$ pm worker list
┌──────────────────┬──────────┬──────────┬─────────────┬────────┐
│ Worker           │ Backend  │ Model    │ Status      │ Jobs   │
├──────────────────┼──────────┼──────────┼─────────────┼────────┤
│ email-triage     │ claude   │ sonnet   │ idle        │ 312    │
│ briefing         │ claude   │ opus     │ idle        │ 89     │
│ claude-6441      │ claude   │ opus     │ running     │ 1      │
│ research-01      │ gemini   │ pro      │ idle        │ 47     │
│ codex-proj-15    │ codex    │ --       │ idle        │ 156    │
│ llm-fast-01      │ llm      │ kimi     │ idle        │ 203    │
└──────────────────┴──────────┴──────────┴─────────────┴────────┘
6 workers shown (of 20 slots). 1 running, 5 idle.
Worker types by backend
| Backend | Total jobs | Success rate | Used for |
|---|---|---|---|
| LLM (multi-provider) | 3,506 | 96.9% | Writing, analysis, research synthesis |
| Claude Code | 736 | 93.1% | File operations, code generation, complex tool use |
| OpenCode | 507 | 98.4% | Code review, testing, focused dev tasks |
| Claude (direct) | 501 | 96.2% | Judgment calls, orchestration, decision synthesis |
| Codex | 485 | 90.2% | Technical research, independent analysis |
| Shell | 449 | 96.6% | System ops, deployments, data processing |
| Gemini | 89 | 95.5% | Web-grounded research, large document analysis |
03
Memory architecture
Most AI memory systems are "dump everything in a vector DB" or "keep a flat file." Neither works at scale. I built three tiers because each layer does a different job, and the system only works because those layers stay separate.
Tier 1: Core memory
A budgeted plaintext file that gets injected into every conversation. Identity, constraints, active projects, communication rules. It's capped at 80 lines because context window space is expensive and every token of core memory competes with the actual task. This gets edited by the system itself - not appended to, edited. The system rewrites its own identity document as it learns.
Tier 2: Archival memory (LanceDB)
A vector database with hybrid search: dense embeddings (text-embedding-3-small, 1536 dimensions) plus BM25 full-text, fused with reciprocal rank fusion. Stores facts, decisions, learned procedures, entity relationships, and episode summaries. Currently holds hundreds of entries across six categories.
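Reciprocal rank fusion itself is simple: each ranked list contributes 1/(k + rank) per document, and the fused score is the sum. A sketch, assuming the common k=60 default (the constant the system actually uses isn't stated here):

```javascript
// Fuse a dense-embedding ranking and a BM25 ranking into one ordering
// via reciprocal rank fusion. Inputs are arrays of IDs, best first.
function rrfFuse(denseRanked, bm25Ranked, k = 60) {
  const scores = new Map();
  for (const list of [denseRanked, bm25Ranked]) {
    list.forEach((id, rank) => {
      // rank is 0-based here, so rank + 1 gives the usual 1-based RRF term.
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

Documents that appear high in both lists float to the top; a document that only one retriever found still survives with a lower fused score.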
The retrieval scoring formula took weeks to tune:
- Similarity: 0.45 - how relevant is this memory to the current context
- Recency: 0.25 - newer memories get preference
- Importance: 0.20 - some facts matter more than others
- Frequency: 0.10 - how often this memory gets recalled
The most important feature I added was anti-stale rotation. Without it, the system kept retrieving the same 20 memories regardless of context. Recent high-frequency memories got a slight penalty to force diversity. It's a hack, but it works.
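Here's the weighted formula as a sketch, with a hypothetical version of the anti-stale penalty - the decay constant, saturation point, and penalty size below are guesses, not the tuned values:

```javascript
// Weighted retrieval score: 0.45 similarity + 0.25 recency +
// 0.20 importance + 0.10 frequency, as described above.
function scoreMemory(m, now = Date.now()) {
  const dayMs = 86_400_000;
  const recency = Math.exp(-(now - m.updatedAt) / (30 * dayMs)); // decays over ~a month (assumed)
  const frequency = Math.min(m.recallCount / 20, 1);             // saturates at 20 recalls (assumed)
  let score = 0.45 * m.similarity + 0.25 * recency + 0.20 * m.importance + 0.10 * frequency;
  // Anti-stale rotation: penalize memories recalled within the last day
  // that are already high-frequency, so the same entries don't dominate.
  if (now - m.lastRecalledAt < dayMs && frequency > 0.5) score -= 0.05;
  return score;
}
```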
Tier 3: Working memory
Session-scoped context that accumulates during a job: the current task, recent results, checkpoint files. This tier lives in the filesystem, not the database. When a session gets compacted (context window approaching limits), working memory gets summarized and the critical parts get promoted to archival. The rest gets discarded. An overnight maintenance cron reviews the day's sessions and extracts behavioral patterns into procedures.
04
Job engine
The engine isn't a task queue. It's a promise executor. Jobs are promises that return results. Promise.all() fans work across models in parallel. .then() chains outputs forward through template variables. Promise.race() takes the first model to finish. A 10-second event loop claims, dispatches, and resolves hundreds of jobs through dependency DAGs.
A pipeline compiler turns a JavaScript DSL into executable job DAGs with automatic cycle detection. Results flow between jobs via template variables - {{prev_result}}, {{all_results}}, {{index}}. Promise groups track completion combinators: all, race, allSettled.
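The substitution step can be sketched in a few lines - `renderPrompt` and `runChain` are illustrative names, not the real DSL:

```javascript
// Fill {{prev_result}}-style template variables from a vars object.
// Unknown variables are left untouched rather than erased.
function renderPrompt(template, vars) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, name) =>
    name in vars ? String(vars[name]) : `{{${name}}}`);
}

// Run a linear chain of jobs, threading each result into the next prompt.
// `execute(model, prompt)` stands in for a real model call.
async function runChain(jobs, execute) {
  let prev = "";
  const all = [];
  for (const job of jobs) {
    const prompt = renderPrompt(job.prompt, { prev_result: prev, all_results: all.join("\n") });
    prev = await execute(job.model, prompt);
    all.push(prev);
  }
  return all;
}
```

The real compiler handles fan-out and cycle detection on top of this; the linear chain is just the `.then()` case.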
Priority routing uses six levels, P0 through P5. P0-P1 are urgent (user-facing requests, blocking dependencies). P2 is normal work. P3-P5 are background (maintenance, research, improvement). The engine processes higher-priority work first, but never starves the background queue completely.
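One simple way to implement "never starve the background queue" - the every-Nth-tick rule here is an assumption, not necessarily what the engine does:

```javascript
// Mostly pick the highest-priority pending job (lowest P number), but
// every Nth tick serve the oldest background job (P3+) instead.
function pickNextJob(pending, tick, n = 10) {
  if (pending.length === 0) return null;
  const background = pending.filter((j) => j.priority >= 3);
  if (tick % n === 0 && background.length > 0) {
    // Oldest background job by creation time.
    return background.reduce((a, b) => (a.createdAt <= b.createdAt ? a : b));
  }
  return pending.reduce((a, b) => (a.priority <= b.priority ? a : b));
}
```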
$ pm job list --limit 8 --sort priority
┌──────┬────┬────────────────────────────────────┬──────────┬────────┐
│ ID   │ P  │ Title                              │ Type     │ Status │
├──────┼────┼────────────────────────────────────┼──────────┼────────┤
│ 6441 │ 1  │ Rebrand systems page               │ claude   │ run    │
│ 6438 │ 1  │ Write: context engineering article │ claude   │ run    │
│ 6442 │ 2  │ Email triage                       │ shell    │ pend   │
│ 6443 │ 2  │ Substack engagement scan           │ claude   │ pend   │
│ 6444 │ 3  │ BSR daily check-in                 │ llm      │ pend   │
│ 6445 │ 3  │ Trading position review            │ llm      │ pend   │
│ 6446 │ 4  │ Memory maintenance                 │ claude   │ pend   │
│ 6447 │ 5  │ Weekly model audit                 │ shell    │ pend   │
└──────┴────┴────────────────────────────────────┴──────────┴────────┘
6,367 total jobs. 2 running, 6 pending, 5,980 done, 258 failed.
The event loop
Every 10 seconds, the engine ticks. It checks for claimable jobs, matches them to available workers, dispatches work, and resolves completed promise groups. Stuck job detection kicks in at 15 minutes - if a job hasn't reported progress, the engine auto-fails it and triggers the self-heal pipeline.
The engine also runs 23 autonomous rhythm templates on cron schedules. Morning briefings at 8am. Email triage every 30 minutes. Project advancement scans twice daily. Health checks hourly. Memory maintenance at 9pm. A weekly self-improvement cycle that modifies its own configuration. The rhythm system uses a three-phase pattern for each tick: gather context, decide if action is needed, then act. Skip logic cuts unnecessary LLM evaluations by about 70%.
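The three-phase tick pattern reduces to a small harness. Everything here - `skipIf`, the shape of the decision object - is illustrative:

```javascript
// One rhythm tick: gather context cheaply, gate with skip logic before
// spending an LLM call, decide whether action is needed, then act.
async function runRhythm(rhythm) {
  const ctx = await rhythm.gather();            // cheap: DB queries, file reads
  if (rhythm.skipIf && rhythm.skipIf(ctx)) {    // heuristic gate, no LLM call
    return { acted: false, reason: "skipped" };
  }
  const decision = await rhythm.decide(ctx);    // LLM call: is action needed?
  if (!decision.actionNeeded) return { acted: false, reason: "no-action" };
  await rhythm.act(ctx, decision);
  return { acted: true };
}
```

The ~70% reduction in LLM evaluations comes entirely from the `skipIf` gate: most ticks find nothing worth deciding about.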
05
Operational data
The system has been running since February 18, 2026. These numbers come from the production database, not estimates.
| Metric | Value | Notes |
|---|---|---|
| Total jobs processed | 6,367 | Since Feb 18, 2026 |
| Jobs completed successfully | 5,980 | 93.9% success rate |
| Jobs failed | 258 | 4.1% permanent failure rate |
| Active projects managed | 6 | Concurrent at any time |
| Total projects tracked | 37 | Lifetime |
| Models in routing table | 39 | Across 8 providers |
| Skill modules loaded | 28 | Research, publishing, OSINT, etc. |
| Avg cost per job | $0.032 | ~$205/mo total |
| Avg daily throughput | ~130 jobs | Peaks at 370 |
What breaks
258 failures across 6,367 jobs. I've classified every one. The failure taxonomy breaks into three dominant categories that account for about 70% of all failures:
| Failure class | Share | What happens | Recovery |
|---|---|---|---|
| Context loss | ~27% | Session compaction drops critical state. Agent forgets what it was doing mid-task. | Checkpoint protocol. Workers write progress.md before context limits. Fresh session resumes from checkpoint. |
| Rate limit cascades | ~22% | One provider throttles, fallback overloads the next. Chain reaction across the routing table. | Circuit breaker pattern. Cache rate limit reset times. Exponential backoff: 2min, 10min, 30min. Max 3 retries. |
| Optimistic completion | ~19% | Agent reports "done" but output is incomplete or wrong. The most dangerous failure because it looks like success. | Maker-checker pattern. No task marked done without a second-pass review by a different model. Completion gates on all deliverables. |
| Timeout / stuck | ~15% | Job runs past 15-minute threshold with no progress signal. | Auto-detect, auto-fail, re-queue to different worker. trySelfHeal() as last-chance fix. |
| Other | ~17% | Malformed output, dependency failures, external API errors, permission issues. | Varies. Transient errors retry automatically. Structural errors require prompt revision. |
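The rate-limit recovery row translates directly to code. A sketch of the backoff schedule and the circuit-breaker check (field names are assumptions):

```javascript
// Exponential backoff schedule from the table: 2min, 10min, 30min, max 3 retries.
const BACKOFF_MS = [2 * 60_000, 10 * 60_000, 30 * 60_000];

// Delay before retry number `attempt` (0-based); null means give up.
function nextRetryDelay(attempt) {
  return attempt < BACKOFF_MS.length ? BACKOFF_MS[attempt] : null;
}

// Circuit breaker: a provider with a cached 429 reset time in the future
// is skipped entirely, so a throttled provider can't trigger a cascade.
function isCircuitOpen(provider, now = Date.now()) {
  return provider.rateLimitResetAt != null && provider.rateLimitResetAt > now;
}
```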
The optimistic completion problem deserves special attention. Early on, I trusted agent self-reports. An agent says "task complete" and the engine marks it done. Except the output had grammar bugs, duplicated sections, or just didn't do what was asked. Now every deliverable runs through a completion gate - a second model reviews the output before it ships. This single pattern eliminated about 80% of the quality failures.
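A completion gate can be as small as this - `checker` and `review` are stand-ins for the real second-pass model call:

```javascript
// Maker-checker gate: a job's "done" claim is only accepted after a
// different model reviews the deliverable against the original spec.
async function completeWithGate(job, output, checker) {
  if (checker.model === job.model) {
    throw new Error("checker must be a different model than the maker");
  }
  const verdict = await checker.review(job.spec, output);
  return verdict.approved
    ? { status: "done", output }
    : { status: "needs-rework", issues: verdict.issues };
}
```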
06
Model routing
The routing layer sits between the job engine and the model providers. It's a 2,100-line proxy that translates between Anthropic's Messages API and OpenAI's Chat Completions format, preserving tool calls across format boundaries. The routing logic isn't a lookup table. It's a waterfall with cached rate limit state, provider health tracking, and graceful degradation.
| Tier | Models | Cost | Used for |
|---|---|---|---|
| Ultra (judgment) | Claude Opus, Sonnet | $0 (Max plan, $200/mo flat) | Orchestration, decision synthesis, tool-heavy tasks |
| Mid (workhorse) | GPT-4o-mini, DeepSeek V3, Kimi K2.5, MiniMax | $0.005-0.02/M tokens | Writing, analysis, data processing |
| Cheap (volume) | Free OpenRouter models, Haiku | $0-0.001/M tokens | Extraction, formatting, classification |
| Reasoning | DeepSeek R1, Qwen-Think | $0 (free tier) | Chain-of-thought, cross-checking, analysis |
| Embed | text-embedding-3-small | $0.002/M tokens | Memory vectors, semantic search |
The fallback waterfall
When a request comes in, the router tries providers in order: Claude Max Proxy (port 8082) → OpenRouter Claude Code Router (port 3456) → direct Anthropic API → graceful degradation. Each step checks cached rate limit state before attempting the call. If a provider returned a 429 in the last N minutes, the router skips it entirely rather than burning a round-trip to get rejected.
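A sketch of that waterfall, with the 429 cache window (10 minutes here) picked arbitrarily:

```javascript
// Try providers in order, skipping any whose cached 429 reset time
// hasn't passed. Provider shape and call signature are illustrative.
async function routeRequest(providers, request, now = Date.now()) {
  for (const p of providers) {
    if (p.rateLimitedUntil && p.rateLimitedUntil > now) continue; // skip: known-throttled
    try {
      return { provider: p.name, response: await p.call(request) };
    } catch (err) {
      if (err.status === 429) p.rateLimitedUntil = now + 10 * 60_000; // cache the 429
      // Any other error: fall through to the next provider.
    }
  }
  return { provider: "degraded", response: null }; // graceful degradation
}
```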
The Claude Code Router (CCR) is that translation layer: it converts Anthropic's Messages API to OpenAI's Chat Completions format and back, preserving tool calls and structured outputs across the boundary. This lets me route Claude-format requests to GPT, Gemini, DeepSeek, or any OpenAI-compatible provider without changing the calling code.
Council pattern
For high-stakes decisions, the system fans out the same prompt to multiple models with a 60-second timeout, then synthesizes consensus through a mid-tier model. It costs more tokens but catches the cases where one model hallucinates confidently. I use this for anything that's hard to verify after the fact - financial decisions, external communications, architectural choices.
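The council fan-out is plain Promise machinery. `ask` and `synthesize` stand in for real model calls:

```javascript
// Fan the same prompt to several models with a per-model timeout, keep
// whatever settles in time, then hand the surviving answers to a synthesizer.
async function council(models, prompt, ask, synthesize, timeoutMs = 60_000) {
  const withTimeout = (p) =>
    Promise.race([
      p,
      new Promise((_, rej) => setTimeout(() => rej(new Error("timeout")), timeoutMs)),
    ]);
  const settled = await Promise.allSettled(
    models.map((m) => withTimeout(ask(m, prompt)))
  );
  const answers = settled
    .filter((s) => s.status === "fulfilled")
    .map((s) => s.value);
  return synthesize(answers);
}
```

`allSettled` (rather than `all`) is the important choice: one hung or throttled model shouldn't sink the whole council.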
07
Infrastructure
Everything runs on a single Docker host. No cloud. No Kubernetes. The entire system - orchestration engine, all workers, LanceDB, the dashboard, the site router, cron scheduling - fits in one container with a mounted Docker socket for managing sibling containers.
Stack
- Runtime: Node.js 22 in Docker (Kali Linux base for OSINT tooling)
- State: SQLite (pm.db) for all structured data - projects, tasks, jobs, activity, config
- Memory: LanceDB for vector search, plaintext for core memory
- Proxy: Nginx Proxy Manager for reverse proxy and SSL termination
- Auth: Authelia SSO with passkey/biometric support
- Dashboard: SvelteKit PWA with decision card framework
- Domains: Cloudflare DNS with wildcard SSL for *.946nl.online
- Publishing: Static site deployment to any subdomain via custom CLI tooling
- Phone: Pixel 7 via ADB for authenticated browsing, app automation, session teleportation
Security model
When your AI system can send emails, browse the web, execute shell commands, and control smart home devices, every input becomes an attack surface. The security model treats all external data as adversarial by default.
- Command authority: Only one Telegram ID can issue commands. Everything else - email, webhooks, web scraped content, phone notifications - is treated as DATA, never instructions.
- Email security tiers: Delayed-send policy based on risk classification. Low risk: 1-hour delay. Medium: 4 hours. High risk: 1 week plus human review.
- Prompt injection defense: Silently ignored, never flagged. Noisy alerts about injection attempts are themselves an attack vector.
- Credential hygiene: Environment variables and API keys are never exposed in external output. A redact.sh sanitization script scrubs tool output.
- Container isolation: Protected containers (databases, reverse proxy) are off-limits. The system can only manage containers explicitly designated as manageable.
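The delayed-send policy from the email tiers above maps to a small lookup table - the exact structure is my sketch, not the production config:

```javascript
// Risk class -> send delay + review requirement, per the tiers above.
const SEND_POLICY = {
  low:    { delayMs: 1 * 3600_000, humanReview: false },          // 1 hour
  medium: { delayMs: 4 * 3600_000, humanReview: false },          // 4 hours
  high:   { delayMs: 7 * 24 * 3600_000, humanReview: true },      // 1 week + review
};

function scheduleSend(email, now = Date.now()) {
  // Unknown risk class is treated as high (adversarial-by-default assumption).
  const policy = SEND_POLICY[email.risk] ?? SEND_POLICY.high;
  return { sendAt: now + policy.delayMs, humanReview: policy.humanReview };
}
```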
What I'd do differently
SQLite works for a single-writer system but I've hit the ceiling on concurrent reads during high-throughput periods. If I were rebuilding, I'd use PostgreSQL from day one - the migration would be straightforward but I'm not doing it while the system is running production workloads.
The session management layer is the most fragile component. Working directory as session identity is a Claude Code implementation detail that could change in any release. I've built around it but I don't love depending on it. A proper session server with explicit session IDs would be more resilient.
The single Docker host is a deliberate constraint, not a limitation. Cloud deployment would add cost and operational complexity without solving any problem I actually have. 6,300 jobs on one machine is fine. If I needed 10x throughput, I'd shard by project rather than scaling horizontally.
Not a demo
This system is live. You're reading about the infrastructure that built this page.
The research program documents the questions. The writing documents the lessons. This page documents how the work actually gets done.