Production Infrastructure

How I orchestrate 6,300+ autonomous AI jobs

This is the architecture behind everything else on this site. A hub-and-spoke orchestration system that routes work across 39 models from 8 providers, manages persistent agent sessions, and runs 24/7 on a single Docker host. It's been processing production jobs since February 2026. Most of them work. The ones that don't taught me more.

6,367

Jobs processed

39

Models routed

93.9%

Success rate

$0.032

Avg cost/job

01

Architecture overview

The system follows a hub-and-spoke pattern. A central orchestration engine accepts work as jobs, routes them to the right model and worker, tracks dependencies between jobs, and handles failure recovery. Everything flows through a single SQLite database that serves as the system's source of truth.

Two layers keep things sane. Layer 1 is the engine: a silent async job queue that executes work without ever messaging me. Layer 2 is the judgment layer: it decides what I actually need to see. 95% of completed work produces zero notifications. The other 5% arrives as tappable decision cards on my phone.

I didn't plan it this way. The system started as a Telegram bot and a single Claude API call. Over 45 days it accumulated enough operational complexity that the architecture had to emerge or collapse. It emerged.


02

Worker system

Workers are long-lived agent sessions with configuration: a backend (Claude, Codex, Gemini, LLM), a model, a persona, loaded skills, and persistence rules. The system manages around 20 concurrent execution slots. Some workers are persistent - they handle email triage, morning briefings, and project-specific context. Others are ephemeral - spun up for a single job and discarded.

The critical insight that took weeks to learn: working directory IS session identity. Claude Code stores sessions under ~/.claude/projects/<cwd-hash>/<session-id>.jsonl. If you get the working directory wrong, your agents lose accumulated context. Every session restart, every container reboot, this is the thing that breaks first.

Session routing follows four paths, tried in order: explicit assignment, project match, idle worker with a matching backend, then auto-create an ephemeral worker. Each session gets safety checks before reuse: the file exists, size is under the limit, and reuse count and age are within bounds. Sessions that accumulate too much context get compacted by an overnight maintenance cron rather than abandoned.
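A minimal sketch of that routing order and the pre-reuse checks. The threshold values and field names here are illustrative, not the production configuration:

```javascript
// Illustrative thresholds for the pre-reuse safety checks.
const LIMITS = { maxBytes: 5_000_000, maxReuses: 50, maxAgeMs: 7 * 24 * 3600 * 1000 };

function canReuseSession(s, now = Date.now()) {
  if (!s.exists) return false;                             // session file exists
  if (s.sizeBytes > LIMITS.maxBytes) return false;         // size under limit
  if (s.reuseCount >= LIMITS.maxReuses) return false;      // reuse count
  if (now - s.createdAtMs > LIMITS.maxAgeMs) return false; // age
  return true;
}

// Routing order: explicit assignment, project match, idle worker on the
// same backend, then auto-create an ephemeral worker.
function routeJob(job, workers) {
  if (job.workerId) return workers.find(w => w.id === job.workerId) ?? null;
  const byProject = job.project
    ? workers.find(w => w.status === "idle" && w.project === job.project)
    : undefined;
  if (byProject) return byProject;
  const byBackend = workers.find(w => w.status === "idle" && w.backend === job.backend);
  if (byBackend) return byBackend;
  return { id: `ephemeral-${job.id}`, backend: job.backend, ephemeral: true };
}
```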

$ pm worker list
┌──────────────────┬──────────┬──────────┬─────────────┬────────┐
│ Worker           │ Backend  │ Model    │ Status      │ Jobs   │
├──────────────────┼──────────┼──────────┼─────────────┼────────┤
│ email-triage     │ claude   │ sonnet   │ idle        │ 312    │
│ briefing         │ claude   │ opus     │ idle        │ 89     │
│ claude-6441      │ claude   │ opus     │ running     │ 1      │
│ research-01      │ gemini   │ pro      │ idle        │ 47     │
│ codex-proj-15    │ codex    │ --       │ idle        │ 156    │
│ llm-fast-01      │ llm      │ kimi     │ idle        │ 203    │
└──────────────────┴──────────┴──────────┴─────────────┴────────┘
6 workers shown (of 20 slots). 1 running, 5 idle.

Worker types by backend

Backend                Total jobs   Success rate   Used for
LLM (multi-provider)        3,506          96.9%   Writing, analysis, research synthesis
Claude Code                   736          93.1%   File operations, code generation, complex tool use
OpenCode                      507          98.4%   Code review, testing, focused dev tasks
Claude (direct)               501          96.2%   Judgment calls, orchestration, decision synthesis
Codex                         485          90.2%   Technical research, independent analysis
Shell                         449          96.6%   System ops, deployments, data processing
Gemini                         89          95.5%   Web-grounded research, large document analysis

03

Memory architecture

Most AI memory systems are "dump everything in a vector DB" or "keep a flat file." Neither works at scale. I built three tiers because each layer does a different job, and the system only works because those layers stay separate.

Tier 1: Core memory

A budgeted plaintext file that gets injected into every conversation. Identity, constraints, active projects, communication rules. It's capped at 80 lines because context window space is expensive and every token of core memory competes with the actual task. This gets edited by the system itself - not appended to, edited. The system rewrites its own identity document as it learns.

Tier 2: Archival memory (LanceDB)

A vector database with hybrid search: dense embeddings (text-embedding-3-small, 1536 dimensions) plus BM25 full-text, fused with reciprocal rank fusion. Stores facts, decisions, learned procedures, entity relationships, and episode summaries. Currently holds hundreds of entries across six categories.

The retrieval scoring formula took weeks to tune:

  • Similarity: 0.45 - how relevant is this memory to the current context
  • Recency: 0.25 - newer memories get preference
  • Importance: 0.20 - some facts matter more than others
  • Frequency: 0.10 - how often this memory gets recalled

The most important feature I added was anti-stale rotation. Without it, the system kept retrieving the same 20 memories regardless of context. Recent high-frequency memories got a slight penalty to force diversity. It's a hack, but it works.
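The blend, as a sketch: the four weights are the ones above, while the anti-stale trigger (both signals above 0.8) and the 0.9 penalty factor are illustrative stand-ins for the tuned values:

```javascript
const W = { similarity: 0.45, recency: 0.25, importance: 0.20, frequency: 0.10 };

// All four inputs are assumed normalized to 0..1.
function retrievalScore(m) {
  let score =
    W.similarity * m.similarity +  // relevance to the current context
    W.recency * m.recency +        // newer memories get preference
    W.importance * m.importance +  // some facts matter more than others
    W.frequency * m.frequency;     // how often this memory gets recalled
  // Anti-stale rotation (illustrative trigger and penalty): memories that
  // are both recent and frequently recalled get nudged down for diversity.
  if (m.recency > 0.8 && m.frequency > 0.8) score *= 0.9;
  return score;
}
```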

Tier 3: Working memory

Session-scoped context that accumulates during a job: the current task, recent results, checkpoint files. This tier lives in the filesystem, not the database. When a session gets compacted (context window approaching limits), working memory gets summarized and the critical parts get promoted to archival. The rest gets discarded. An overnight maintenance cron reviews the day's sessions and extracts behavioral patterns into procedures.


04

Job engine

The engine isn't a task queue. It's a promise executor. Jobs are promises that return results. Promise.all() fans work across models in parallel. .then() chains outputs forward through template variables. Promise.race() takes the first model to finish. A 10-second event loop claims, dispatches, and resolves hundreds of jobs through dependency DAGs.

A pipeline compiler turns a JavaScript DSL into executable job DAGs with automatic cycle detection. Results flow between jobs via template variables - {{prev_result}}, {{all_results}}, {{index}}. Promise groups track completion combinators: all, race, allSettled.
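A sketch of the template substitution step. The variable names are the ones above; `renderTemplate` itself is a hypothetical helper:

```javascript
// Substitute {{prev_result}}, {{all_results}}, {{index}} into the next
// job's prompt when its dependencies resolve.
function renderTemplate(prompt, ctx) {
  return prompt.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in ctx ? String(ctx[name]) : match // unknown variables pass through
  );
}

const next = renderTemplate("Summarize: {{prev_result}} (item {{index}})", {
  prev_result: "draft A",
  index: 2,
});
// next === "Summarize: draft A (item 2)"
```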

Priority routing uses six levels, P0 through P5. P0-P1 are urgent (user-facing requests, blocking dependencies). P2 is normal work. P3-P5 are background (maintenance, research, improvement). The engine processes higher-priority work first, but never starves the background queue completely.
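One simple way to implement "never starve the background queue", sketched with an illustrative rule: every fifth claim goes to the oldest background job regardless of what is waiting in the urgent lanes:

```javascript
// Claim the next job for a tick. Most ticks take the highest-priority
// pending job; every Nth tick (N = 5 here, illustrative) claims the
// oldest background job (priority >= 3) so background work always moves.
function claimNext(jobs, tick, backgroundEvery = 5) {
  const pending = jobs.filter(j => j.status === "pending");
  if (pending.length === 0) return null;
  if (tick % backgroundEvery === 0) {
    const bg = pending.filter(j => j.priority >= 3).sort((a, b) => a.id - b.id);
    if (bg.length > 0) return bg[0];
  }
  return pending.sort((a, b) => a.priority - b.priority || a.id - b.id)[0];
}
```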

$ pm job list --limit 8 --sort priority
┌──────┬────┬───────────────────────────────────┬──────────┬────────┐
│ ID   │ P  │ Title                             │ Type     │ Status │
├──────┼────┼───────────────────────────────────┼──────────┼────────┤
│ 6441 │ 1  │ Rebrand systems page              │ claude   │ run    │
│ 6438 │ 1  │ Write: context engineering article│ claude   │ run    │
│ 6442 │ 2  │ Email triage                      │ shell    │ pend   │
│ 6443 │ 2  │ Substack engagement scan          │ claude   │ pend   │
│ 6444 │ 3  │ BSR daily check-in                │ llm      │ pend   │
│ 6445 │ 3  │ Trading position review           │ llm      │ pend   │
│ 6446 │ 4  │ Memory maintenance                │ claude   │ pend   │
│ 6447 │ 5  │ Weekly model audit                │ shell    │ pend   │
└──────┴────┴───────────────────────────────────┴──────────┴────────┘
6,367 total jobs. 2 running, 6 pending, 5,980 done, 258 failed.

The event loop

Every 10 seconds, the engine ticks. It checks for claimable jobs, matches them to available workers, dispatches work, and resolves completed promise groups. Stuck job detection kicks in at 15 minutes - if a job hasn't reported progress, the engine auto-fails it and triggers the self-heal pipeline.
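The stuck-job check is the simplest part of the tick. A sketch, with `lastProgressAt` as an assumed per-job field:

```javascript
const STUCK_MS = 15 * 60 * 1000; // 15-minute progress threshold

// Run on every 10-second tick: any running job with no progress signal
// inside the threshold gets auto-failed and handed to the self-heal path.
function findStuckJobs(jobs, now = Date.now()) {
  return jobs.filter(j => j.status === "running" && now - j.lastProgressAt > STUCK_MS);
}
```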

The engine also runs 23 autonomous rhythm templates on cron schedules. Morning briefings at 8am. Email triage every 30 minutes. Project advancement scans twice daily. Health checks hourly. Memory maintenance at 9pm. A weekly self-improvement cycle that modifies its own configuration. The rhythm system uses a three-phase pattern for each tick: gather context, decide if action is needed, then act. Skip logic cuts unnecessary LLM evaluations by about 70%.
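The three-phase shape, as a synchronous sketch: `gather`, `decide`, and `act` are hypothetical callbacks, and the cheap `decide` gate is plain code, which is what lets most ticks skip the LLM entirely:

```javascript
function rhythmTick(rhythm) {
  const ctx = rhythm.gather();                        // phase 1: collect context cheaply
  if (!rhythm.decide(ctx)) return { skipped: true };  // phase 2: skip logic, no LLM call
  return { skipped: false, result: rhythm.act(ctx) }; // phase 3: act (the expensive work)
}

// Example: an email-triage rhythm that only acts when there is unread mail.
const triage = {
  gather: () => ({ unread: 0 }),
  decide: ctx => ctx.unread > 0,
  act: ctx => `triage ${ctx.unread} messages`,
};
```

With zero unread messages, `rhythmTick(triage)` short-circuits at phase 2 and returns a skip, never invoking the act phase.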


05

Operational data

The system has been running since February 18, 2026. These numbers come from the production database, not estimates.

Metric                        Value        Notes
Total jobs processed          6,367        Since Feb 18, 2026
Jobs completed successfully   5,980        93.9% success rate
Jobs failed                   258          4.1% permanent failure rate
Active projects managed       6            Concurrent at any time
Total projects tracked        37           Lifetime
Models in routing table       39           Across 8 providers
Skill modules loaded          28           Research, publishing, OSINT, etc.
Avg cost per job              $0.032       ~$205/mo total
Avg daily throughput          ~130 jobs    Peaks at 370

What breaks

258 failures across 6,367 jobs. I've classified every one. The failure taxonomy breaks into three dominant categories that account for about 70% of all failures:

  • Context loss (~27% of failures). Session compaction drops critical state and the agent forgets what it was doing mid-task. Recovery: checkpoint protocol - workers write progress.md before hitting context limits, and a fresh session resumes from the checkpoint.
  • Rate limit cascades (~22%). One provider throttles, the fallback overloads the next, and the chain reaction spreads across the routing table. Recovery: circuit breaker pattern - cache rate limit reset times, back off exponentially (2min, 10min, 30min), max 3 retries.
  • Optimistic completion (~19%). The agent reports "done" but the output is incomplete or wrong - the most dangerous failure because it looks like success. Recovery: maker-checker pattern - no task is marked done without a second-pass review by a different model, with completion gates on all deliverables.
  • Timeout / stuck (~15%). A job runs past the 15-minute threshold with no progress signal. Recovery: auto-detect, auto-fail, re-queue to a different worker, with trySelfHeal() as a last-chance fix.
  • Other (~17%). Malformed output, dependency failures, external API errors, permission issues. Recovery varies: transient errors retry automatically; structural errors require prompt revision.
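The rate-limit retry schedule above (2, 10, then 30 minutes, max 3 retries) reduces to a tiny lookup; `nextRetry` is a hypothetical helper name:

```javascript
// Backoff schedule for rate-limited jobs: 2min, 10min, 30min.
const BACKOFF_MS = [2 * 60_000, 10 * 60_000, 30 * 60_000];

// attempt is 1-based; past the third attempt the job fails permanently.
function nextRetry(attempt) {
  if (attempt > BACKOFF_MS.length) return null;
  return BACKOFF_MS[attempt - 1];
}
```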

The optimistic completion problem deserves special attention. Early on, I trusted agent self-reports: an agent says "task complete" and the engine marks it done. Except the output had grammatical errors, duplicated sections, or simply didn't do what was asked. Now every deliverable runs through a completion gate - a second model reviews the output before it ships. This single pattern eliminated about 80% of the quality failures.
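A sketch of the gate: `review` stands in for the second-pass model call, which must come from a different model than the one that produced the output:

```javascript
// Maker-checker completion gate: a deliverable is only marked done after
// a second-pass review approves it. `review` is a hypothetical callback
// returning { approved, issues }.
function completionGate(job, review) {
  if (!job.output) return { done: false, reason: "no output" };
  const verdict = review(job.output); // checker model, never the maker
  return verdict.approved
    ? { done: true }
    : { done: false, reason: `checker rejected: ${verdict.issues.join("; ")}` };
}
```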


06

Model routing

The routing layer sits between the job engine and the model providers. It's a 2,100-line proxy, and the routing logic isn't a lookup table: it's a waterfall with cached rate limit state, provider health tracking, and graceful degradation.

  • Ultra (judgment): Claude Opus, Sonnet. $0 (Max plan, $200/mo flat). Orchestration, decision synthesis, tool-heavy tasks.
  • Mid (workhorse): GPT-4o-mini, DeepSeek V3, Kimi K2.5, MiniMax. $0.005-0.02/M tokens. Writing, analysis, data processing.
  • Cheap (volume): Free OpenRouter models, Haiku. $0-0.001/M tokens. Extraction, formatting, classification.
  • Reasoning: DeepSeek R1, Qwen-Think. $0 (free tier). Chain-of-thought, cross-checking, analysis.
  • Embed: text-embedding-3-small. $0.002/M tokens. Memory vectors, semantic search.

The fallback waterfall

When a request comes in, the router tries providers in order: Claude Max Proxy (port 8082) → OpenRouter via the Claude Code Router (port 3456) → direct Anthropic API → graceful degradation. Each step checks cached rate limit state before attempting the call. If a provider returned a 429 in the last N minutes, the router skips it entirely rather than burning a round-trip just to get rejected.
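The skip logic, sketched with an illustrative 5-minute cooldown; `pickProvider` and the in-memory map are hypothetical simplifications of the cached rate limit state:

```javascript
const COOLDOWN_MS = 5 * 60 * 1000;   // illustrative 429 cooldown window
const lastRateLimited = new Map();   // provider name -> timestamp of last 429

// Walk the waterfall, skipping any provider still inside its cooldown
// window, so a known-throttled provider never costs a round-trip.
function pickProvider(waterfall, now = Date.now()) {
  for (const provider of waterfall) {
    const hit = lastRateLimited.get(provider);
    if (hit !== undefined && now - hit < COOLDOWN_MS) continue; // skip hot 429
    return provider;
  }
  return null; // nothing available: graceful degradation
}
```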

The Claude Code Router (CCR) is the translation layer. It converts between Anthropic's Messages API and OpenAI's Chat Completions format, preserving tool calls and structured outputs across the format boundary. This lets me route Claude-format requests to GPT, Gemini, DeepSeek, or any OpenAI-compatible provider without changing the calling code.

Council pattern

For high-stakes decisions, the system fans out the same prompt to multiple models with a 60-second timeout, then synthesizes consensus through a mid-tier model. It costs more tokens but catches the cases where one model hallucinates confidently. I use this for anything that's hard to verify after the fact - financial decisions, external communications, architectural choices.
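A sketch of the fan-out with a shared timeout; `callModel` is a hypothetical provider call, and synthesis of the surviving answers happens downstream:

```javascript
// Fan the same prompt to several models with one shared deadline.
// Answers that settle in time are returned for a downstream synthesis
// step; timeouts and errors are simply dropped from the council.
async function council(prompt, models, callModel, timeoutMs = 60_000) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(() => resolve({ status: "timeout" }), timeoutMs);
  });
  const results = await Promise.all(
    models.map(model =>
      Promise.race([
        callModel(model, prompt)
          .then(answer => ({ status: "ok", model, answer }))
          .catch(() => ({ status: "error", model })),
        timeout,
      ])
    )
  );
  clearTimeout(timer); // all races settled; don't hold the event loop open
  return results.filter(r => r.status === "ok"); // survivors go to synthesis
}
```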


07

Infrastructure

Everything runs on a single Docker host. No cloud. No Kubernetes. The entire system - orchestration engine, all workers, LanceDB, the dashboard, the site router, cron scheduling - fits in one container with a mounted Docker socket for managing sibling containers.

Stack

  • Runtime: Node.js 22 in Docker (Kali Linux base for OSINT tooling)
  • State: SQLite (pm.db) for all structured data - projects, tasks, jobs, activity, config
  • Memory: LanceDB for vector search, plaintext for core memory
  • Proxy: Nginx Proxy Manager for reverse proxy and SSL termination
  • Auth: Authelia SSO with passkey/biometric support
  • Dashboard: SvelteKit PWA with decision card framework
  • Domains: Cloudflare DNS with wildcard SSL for *.946nl.online
  • Publishing: Static site deployment to any subdomain via custom CLI tooling
  • Phone: Pixel 7 via ADB for authenticated browsing, app automation, session teleportation

Security model

When your AI system can send emails, browse the web, execute shell commands, and control smart home devices, every input becomes an attack surface. The security model treats all external data as adversarial by default.

  • Command authority: Only one Telegram ID can issue commands. Everything else - email, webhooks, web scraped content, phone notifications - is treated as DATA, never instructions.
  • Email security tiers: Delayed-send policy based on risk classification. Low risk: 1-hour delay. Medium: 4 hours. High risk: 1 week plus human review.
  • Prompt injection defense: Injection attempts are silently ignored, never flagged. Noisy alerts about injection attempts are themselves an attack vector.
  • Credential hygiene: Environment variables and API keys are never exposed in external output. A redact.sh sanitization script scrubs tool output.
  • Container isolation: Protected containers (databases, reverse proxy) are off-limits. The system can only manage containers explicitly designated as manageable.
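The delayed-send tiers reduce to a small lookup; a sketch, with `releaseAt` as a hypothetical helper (unknown risk classes fall through to the safest delay):

```javascript
// Delayed-send policy: risk class -> hold time before an outbound
// email actually leaves the system.
const SEND_DELAY_MS = {
  low: 1 * 3600 * 1000,        // 1 hour
  medium: 4 * 3600 * 1000,     // 4 hours
  high: 7 * 24 * 3600 * 1000,  // 1 week, plus human review
};

function releaseAt(queuedAtMs, risk) {
  return queuedAtMs + (SEND_DELAY_MS[risk] ?? SEND_DELAY_MS.high); // unknown -> safest
}
```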

What I'd do differently

SQLite works for a single-writer system, but I've hit the ceiling on concurrent reads during high-throughput periods. If I were rebuilding, I'd use PostgreSQL from day one; the migration would be straightforward, but I'm not doing it while the system is running production workloads.

The session management layer is the most fragile component. Working directory as session identity is a Claude Code implementation detail that could change in any release. I've built around it but I don't love depending on it. A proper session server with explicit session IDs would be more resilient.

The single Docker host is a deliberate constraint, not a limitation. Cloud deployment would add cost and operational complexity without solving any problem I actually have. 6,300 jobs on one machine is fine. If I needed 10x throughput, I'd shard by project rather than scaling horizontally.

Not a demo

This system is live. You're reading about the infrastructure that built this page.

The research program documents the questions. The writing documents the lessons. This page documents how the work actually gets done.