Production Infrastructure
How I orchestrate 6,300+ autonomous AI jobs
This is the architecture behind everything else on this site. A hub-and-spoke orchestration system that routes work across 39 models from 8 providers, manages persistent agent sessions, and runs 24/7 on a single Docker host. It's been processing production jobs since February 2026. Most of them worked. The ones that didn't taught me more.
6,367
Jobs processed
39
Models routed
93.9%
Success rate
$0.032
Avg cost/job
01
Architecture overview
The system follows a hub-and-spoke pattern. A central orchestration engine accepts work as jobs, routes them to the right model and worker, tracks dependencies between jobs, and handles failure recovery. Everything flows through a single SQLite database that serves as the system's source of truth.
Two layers keep things sane. Layer 1 is the engine: a silent async job queue that executes work without ever messaging me. Layer 2 is the judgment layer: it decides what I actually need to see. 95% of completed work produces zero notifications. The other 5% arrives as tappable decision cards on my phone.
I didn't plan it this way. The system started as a Telegram bot and a single Claude API call. Over 45 days it accumulated enough operational complexity that the architecture had to emerge or collapse. It emerged.
02
Worker system
Workers are long-lived agent sessions with configuration: a backend (Claude, Codex, Gemini, LLM), a model, a persona, loaded skills, and persistence rules. The system manages around 20 concurrent execution slots. Some workers are persistent - they handle email triage, morning briefings, and project-specific context. Others are ephemeral - spun up for a single job and discarded.
The critical insight that took weeks to learn: working directory IS session identity. Claude Code stores sessions under ~/.claude/projects/<cwd-hash>/<session-id>.jsonl. If you get the working directory wrong, your agents lose accumulated context. Every session restart, every container reboot, this is the thing that breaks first.
Session routing tries four paths in order: explicit assignment, project match, idle worker with a matching backend, then auto-create an ephemeral worker. Each session gets safety checks before reuse - file exists, size under limit, reuse count, age. Sessions that accumulate too much context get compacted by an overnight maintenance cron rather than abandoned.
$ pm worker list
┌──────────────────┬──────────┬──────────┬─────────────┬────────┐
│ Worker           │ Backend  │ Model    │ Status      │ Jobs   │
├──────────────────┼──────────┼──────────┼─────────────┼────────┤
│ email-triage     │ claude   │ sonnet   │ idle        │ 312    │
│ briefing         │ claude   │ opus     │ idle        │ 89     │
│ claude-6441      │ claude   │ opus     │ running     │ 1      │
│ research-01      │ gemini   │ pro      │ idle        │ 47     │
│ codex-proj-15    │ codex    │ --       │ idle        │ 156    │
│ llm-fast-01      │ llm      │ kimi     │ idle        │ 203    │
└──────────────────┴──────────┴──────────┴─────────────┴────────┘
6 workers shown (of 20 slots). 1 running, 5 idle.
Worker types by backend
| Backend | Total jobs | Success rate | Used for |
|---|---|---|---|
| LLM (multi-provider) | 3,506 | 96.9% | Writing, analysis, research synthesis |
| Claude Code | 736 | 93.1% | File operations, code generation, complex tool use |
| OpenCode | 507 | 98.4% | Code review, testing, focused dev tasks |
| Claude (direct) | 501 | 96.2% | Judgment calls, orchestration, decision synthesis |
| Codex | 485 | 90.2% | Technical research, independent analysis |
| Shell | 449 | 96.6% | System ops, deployments, data processing |
| Gemini | 89 | 95.5% | Web-grounded research, large document analysis |
03
Memory architecture
Most AI memory systems are "dump everything in a vector DB" or "keep a flat file." Neither works at scale. I built three tiers because each layer does a different job, and the system only works because those layers stay separate.
Tier 1: Core memory
A budgeted plaintext file that gets injected into every conversation. Identity, constraints, active projects, communication rules. It's capped at 80 lines because context window space is expensive and every token of core memory competes with the actual task. This gets edited by the system itself - not appended to, edited. The system rewrites its own identity document as it learns.
Tier 2: Archival memory (LanceDB)
A vector database with hybrid search: dense embeddings (text-embedding-3-small, 1536 dimensions) plus BM25 full-text, fused with reciprocal rank fusion. Stores facts, decisions, learned procedures, entity relationships, and episode summaries. Currently holds hundreds of entries across six categories.
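Reciprocal rank fusion itself is simple: each ranked list contributes 1/(k + rank) per document, and the fused score is the sum. A sketch, assuming the common k=60 default (the constant the system actually uses isn't stated here):

```javascript
// Fuse a dense-embedding ranking and a BM25 ranking into one ordering
// via reciprocal rank fusion. Inputs are arrays of IDs, best first.
function rrfFuse(denseRanked, bm25Ranked, k = 60) {
  const scores = new Map();
  for (const list of [denseRanked, bm25Ranked]) {
    list.forEach((id, rank) => {
      // rank is 0-based here, so rank + 1 gives the usual 1-based RRF term.
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

Documents that appear high in both lists float to the top; a document that only one retriever found still survives with a lower fused score.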
The retrieval scoring formula took weeks to tune:
- Similarity: 0.45 - how relevant is this memory to the current context
- Recency: 0.25 - newer memories get preference
- Importance: 0.20 - some facts matter more than others
- Frequency: 0.10 - how often this memory gets recalled
The most important feature I added was anti-stale rotation. Without it, the system kept retrieving the same 20 memories regardless of context. Recent high-frequency memories got a slight penalty to force diversity. It's a hack, but it works.
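Here's the weighted formula as a sketch, with a hypothetical version of the anti-stale penalty - the decay constant, saturation point, and penalty size below are guesses, not the tuned values:

```javascript
// Weighted retrieval score: 0.45 similarity + 0.25 recency +
// 0.20 importance + 0.10 frequency, as described above.
function scoreMemory(m, now = Date.now()) {
  const dayMs = 86_400_000;
  const recency = Math.exp(-(now - m.updatedAt) / (30 * dayMs)); // decays over ~a month (assumed)
  const frequency = Math.min(m.recallCount / 20, 1);             // saturates at 20 recalls (assumed)
  let score = 0.45 * m.similarity + 0.25 * recency + 0.20 * m.importance + 0.10 * frequency;
  // Anti-stale rotation: penalize memories recalled within the last day
  // that are already high-frequency, so the same entries don't dominate.
  if (now - m.lastRecalledAt < dayMs && frequency > 0.5) score -= 0.05;
  return score;
}
```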
Tier 3: Working memory
Session-scoped context that accumulates during a job: the current task, recent results, checkpoint files. This tier lives in the filesystem, not the database. When a session gets compacted (context window approaching limits), working memory gets summarized and the critical parts get promoted to archival. The rest gets discarded. An overnight maintenance cron reviews the day's sessions and extracts behavioral patterns into procedures.
04
Job engine
The engine isn't a task queue. It's a promise executor. Jobs are promises that return results. Promise.all() fans work across models in parallel. .then() chains outputs forward through template variables. Promise.race() takes the first model to finish. A 10-second event loop claims, dispatches, and resolves hundreds of jobs through dependency DAGs.
A pipeline compiler turns a JavaScript DSL into executable job DAGs with automatic cycle detection. Results flow between jobs via template variables - {{prev_result}}, {{all_results}}, {{index}}. Promise groups track completion combinators: all, race, allSettled.
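The substitution step can be sketched in a few lines - `renderPrompt` and `runChain` are illustrative names, not the real DSL:

```javascript
// Fill {{prev_result}}-style template variables from a vars object.
// Unknown variables are left untouched rather than erased.
function renderPrompt(template, vars) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, name) =>
    name in vars ? String(vars[name]) : `{{${name}}}`);
}

// Run a linear chain of jobs, threading each result into the next prompt.
// `execute(model, prompt)` stands in for a real model call.
async function runChain(jobs, execute) {
  let prev = "";
  const all = [];
  for (const job of jobs) {
    const prompt = renderPrompt(job.prompt, { prev_result: prev, all_results: all.join("\n") });
    prev = await execute(job.model, prompt);
    all.push(prev);
  }
  return all;
}
```

The real compiler handles fan-out and cycle detection on top of this; the linear chain is just the `.then()` case.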
Priority routing uses six levels, P0 through P5. P0-P1 are urgent (user-facing requests, blocking dependencies). P2 is normal work. P3-P5 are background (maintenance, research, improvement). The engine processes higher-priority work first, but never starves the background queue completely.
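One simple way to implement "never starve the background queue" - the every-Nth-tick rule here is an assumption, not necessarily what the engine does:

```javascript
// Mostly pick the highest-priority pending job (lowest P number), but
// every Nth tick serve the oldest background job (P3+) instead.
function pickNextJob(pending, tick, n = 10) {
  if (pending.length === 0) return null;
  const background = pending.filter((j) => j.priority >= 3);
  if (tick % n === 0 && background.length > 0) {
    // Oldest background job by creation time.
    return background.reduce((a, b) => (a.createdAt <= b.createdAt ? a : b));
  }
  return pending.reduce((a, b) => (a.priority <= b.priority ? a : b));
}
```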
$ pm job list --limit 8 --sort priority
┌──────┬────┬────────────────────────────────────┬──────────┬────────┐
│ ID   │ P  │ Title                              │ Type     │ Status │
├──────┼────┼────────────────────────────────────┼──────────┼────────┤
│ 6441 │ 1  │ Rebrand systems page               │ claude   │ run    │
│ 6438 │ 1  │ Write: context engineering article │ claude   │ run    │
│ 6442 │ 2  │ Email triage                       │ shell    │ pend   │
│ 6443 │ 2  │ Substack engagement scan           │ claude   │ pend   │
│ 6444 │ 3  │ BSR daily check-in                 │ llm      │ pend   │
│ 6445 │ 3  │ Trading position review            │ llm      │ pend   │
│ 6446 │ 4  │ Memory maintenance                 │ claude   │ pend   │
│ 6447 │ 5  │ Weekly model audit                 │ shell    │ pend   │
└──────┴────┴────────────────────────────────────┴──────────┴────────┘
6,367 total jobs. 2 running, 6 pending, 5,980 done, 258 failed.
The event loop
Every 10 seconds, the engine ticks. It checks for claimable jobs, matches them to available workers, dispatches work, and resolves completed promise groups. Stuck job detection kicks in at 15 minutes - if a job hasn't reported progress, the engine auto-fails it and triggers the self-heal pipeline.
The engine also runs 23 autonomous rhythm templates on cron schedules. Morning briefings at 8am. Email triage every 30 minutes. Project advancement scans twice daily. Health checks hourly. Memory maintenance at 9pm. A weekly self-improvement cycle that modifies its own configuration. The rhythm system uses a three-phase pattern for each tick: gather context, decide if action is needed, then act. Skip logic cuts unnecessary LLM evaluations by about 70%.
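The three-phase tick pattern reduces to a small harness. Everything here - `skipIf`, the shape of the decision object - is illustrative:

```javascript
// One rhythm tick: gather context cheaply, gate with skip logic before
// spending an LLM call, decide whether action is needed, then act.
async function runRhythm(rhythm) {
  const ctx = await rhythm.gather();            // cheap: DB queries, file reads
  if (rhythm.skipIf && rhythm.skipIf(ctx)) {    // heuristic gate, no LLM call
    return { acted: false, reason: "skipped" };
  }
  const decision = await rhythm.decide(ctx);    // LLM call: is action needed?
  if (!decision.actionNeeded) return { acted: false, reason: "no-action" };
  await rhythm.act(ctx, decision);
  return { acted: true };
}
```

The ~70% reduction in LLM evaluations comes entirely from the `skipIf` gate: most ticks find nothing worth deciding about.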
05
Operational data
The system has been running since February 18, 2026. These numbers come from the production database, not estimates.
| Metric | Value | Notes |
|---|---|---|
| Total jobs processed | 6,367 | Since Feb 18, 2026 |
| Jobs completed successfully | 5,980 | 93.9% success rate |
| Jobs failed | 258 | 4.1% permanent failure rate |
| Active projects managed | 6 | Concurrent at any time |
| Total projects tracked | 37 | Lifetime |
| Models in routing table | 39 | Across 8 providers |
| Skill modules loaded | 28 | Research, publishing, OSINT, etc. |
| Avg cost per job | $0.032 | ~$205/mo total |
| Avg daily throughput | ~130 jobs | Peaks at 370 |
What breaks
258 failures across 6,367 jobs. I've classified every one. The failure taxonomy breaks into three dominant categories that account for about 70% of all failures:
| Failure class | Share | What happens | Recovery |
|---|---|---|---|
| Context loss | ~27% | Session compaction drops critical state. Agent forgets what it was doing mid-task. | Checkpoint protocol. Workers write progress.md before context limits. Fresh session resumes from checkpoint. |
| Rate limit cascades | ~22% | One provider throttles, fallback overloads the next. Chain reaction across the routing table. | Circuit breaker pattern. Cache rate limit reset times. Exponential backoff: 2min, 10min, 30min. Max 3 retries. |
| Optimistic completion | ~19% | Agent reports "done" but output is incomplete or wrong. The most dangerous failure because it looks like success. | Maker-checker pattern. No task marked done without a second-pass review by a different model. Completion gates on all deliverables. |
| Timeout / stuck | ~15% | Job runs past 15-minute threshold with no progress signal. | Auto-detect, auto-fail, re-queue to different worker. trySelfHeal() as last-chance fix. |
| Other | ~17% | Malformed output, dependency failures, external API errors, permission issues. | Varies. Transient errors retry automatically. Structural errors require prompt revision. |
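The rate-limit recovery row translates directly to code. A sketch of the backoff schedule and the circuit-breaker check (field names are assumptions):

```javascript
// Exponential backoff schedule from the table: 2min, 10min, 30min, max 3 retries.
const BACKOFF_MS = [2 * 60_000, 10 * 60_000, 30 * 60_000];

// Delay before retry number `attempt` (0-based); null means give up.
function nextRetryDelay(attempt) {
  return attempt < BACKOFF_MS.length ? BACKOFF_MS[attempt] : null;
}

// Circuit breaker: a provider with a cached 429 reset time in the future
// is skipped entirely, so a throttled provider can't trigger a cascade.
function isCircuitOpen(provider, now = Date.now()) {
  return provider.rateLimitResetAt != null && provider.rateLimitResetAt > now;
}
```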
The optimistic completion problem deserves special attention. Early on, I trusted agent self-reports. An agent says "task complete" and the engine marks it done. Except the output had grammar bugs, duplicated sections, or just didn't do what was asked. Now every deliverable runs through a completion gate - a second model reviews the output before it ships. This single pattern eliminated about 80% of the quality failures.
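A completion gate can be as small as this - `checker` and `review` are stand-ins for the real second-pass model call:

```javascript
// Maker-checker gate: a job's "done" claim is only accepted after a
// different model reviews the deliverable against the original spec.
async function completeWithGate(job, output, checker) {
  if (checker.model === job.model) {
    throw new Error("checker must be a different model than the maker");
  }
  const verdict = await checker.review(job.spec, output);
  return verdict.approved
    ? { status: "done", output }
    : { status: "needs-rework", issues: verdict.issues };
}
```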
06
Model routing
The routing layer sits between the job engine and the model providers. It's a 2,100-line proxy that translates between Anthropic's Messages API and OpenAI's Chat Completions format, preserving tool calls across format boundaries. The routing logic isn't a lookup table. It's a waterfall with cached rate limit state, provider health tracking, and graceful degradation.
| Tier | Models | Cost | Used for |
|---|---|---|---|
| Ultra (judgment) | Claude Opus, Sonnet | $0 (Max plan, $200/mo flat) | Orchestration, decision synthesis, tool-heavy tasks |
| Mid (workhorse) | GPT-4o-mini, DeepSeek V3, Kimi K2.5, MiniMax | $0.005-0.02/M tokens | Writing, analysis, data processing |
| Cheap (volume) | Free OpenRouter models, Haiku | $0-0.001/M tokens | Extraction, formatting, classification |
| Reasoning | DeepSeek R1, Qwen-Think | $0 (free tier) | Chain-of-thought, cross-checking, analysis |
| Embed | text-embedding-3-small | $0.002/M tokens | Memory vectors, semantic search |
The fallback waterfall
When a request comes in, the router tries providers in order: Claude Max Proxy (port 8082) → OpenRouter Claude Code Router (port 3456) → direct Anthropic API → graceful degradation. Each step checks cached rate limit state before attempting the call. If a provider returned a 429 in the last N minutes, the router skips it entirely rather than burning a round-trip to get rejected.
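A sketch of that waterfall, with the 429 cache window (10 minutes here) picked arbitrarily:

```javascript
// Try providers in order, skipping any whose cached 429 reset time
// hasn't passed. Provider shape and call signature are illustrative.
async function routeRequest(providers, request, now = Date.now()) {
  for (const p of providers) {
    if (p.rateLimitedUntil && p.rateLimitedUntil > now) continue; // skip: known-throttled
    try {
      return { provider: p.name, response: await p.call(request) };
    } catch (err) {
      if (err.status === 429) p.rateLimitedUntil = now + 10 * 60_000; // cache the 429
      // Any other error: fall through to the next provider.
    }
  }
  return { provider: "degraded", response: null }; // graceful degradation
}
```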
The Claude Code Router (CCR) is that translation layer: it converts Anthropic's Messages API to OpenAI's Chat Completions format and back, preserving tool calls and structured outputs across the boundary. This lets me route Claude-format requests to GPT, Gemini, DeepSeek, or any OpenAI-compatible provider without changing the calling code.
Council pattern
For high-stakes decisions, the system fans out the same prompt to multiple models with a 60-second timeout, then synthesizes consensus through a mid-tier model. It costs more tokens but catches the cases where one model hallucinates confidently. I use this for anything that's hard to verify after the fact - financial decisions, external communications, architectural choices.
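The council fan-out is plain Promise machinery. `ask` and `synthesize` stand in for real model calls:

```javascript
// Fan the same prompt to several models with a per-model timeout, keep
// whatever settles in time, then hand the surviving answers to a synthesizer.
async function council(models, prompt, ask, synthesize, timeoutMs = 60_000) {
  const withTimeout = (p) =>
    Promise.race([
      p,
      new Promise((_, rej) => setTimeout(() => rej(new Error("timeout")), timeoutMs)),
    ]);
  const settled = await Promise.allSettled(
    models.map((m) => withTimeout(ask(m, prompt)))
  );
  const answers = settled
    .filter((s) => s.status === "fulfilled")
    .map((s) => s.value);
  return synthesize(answers);
}
```

`allSettled` (rather than `all`) is the important choice: one hung or throttled model shouldn't sink the whole council.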
07
Infrastructure
Everything runs on a single Docker host. No cloud. No Kubernetes. The entire system - orchestration engine, all workers, LanceDB, the dashboard, the site router, cron scheduling - fits in one container with a mounted Docker socket for managing sibling containers.
Stack
- Runtime: Node.js 22 in Docker (Kali Linux base for OSINT tooling)
- State: SQLite (pm.db) for all structured data - projects, tasks, jobs, activity, config
- Memory: LanceDB for vector search, plaintext for core memory
- Proxy: Nginx Proxy Manager for reverse proxy and SSL termination
- Auth: Authelia SSO with passkey/biometric support
- Dashboard: SvelteKit PWA with decision card framework
- Domains: Cloudflare DNS with wildcard SSL for *.946nl.online
- Publishing: Static site deployment to any subdomain via custom CLI tooling
- Phone: Pixel 7 via ADB for authenticated browsing, app automation, session teleportation
Security model
When your AI system can send emails, browse the web, execute shell commands, and control smart home devices, every input becomes an attack surface. The security model treats all external data as adversarial by default.
- Command authority: Only one Telegram ID can issue commands. Everything else - email, webhooks, web scraped content, phone notifications - is treated as DATA, never instructions.
- Email security tiers: Delayed-send policy based on risk classification. Low risk: 1-hour delay. Medium: 4 hours. High risk: 1 week plus human review.
- Prompt injection defense: Silently ignored, never flagged. Noisy alerts about injection attempts are themselves an attack vector.
- Credential hygiene: Environment variables and API keys are never exposed in external output. A redact.sh sanitization script scrubs tool output.
- Container isolation: Protected containers (databases, reverse proxy) are off-limits. The system can only manage containers explicitly designated as manageable.
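The delayed-send policy from the email tiers above maps to a small lookup table - the exact structure is my sketch, not the production config:

```javascript
// Risk class -> send delay + review requirement, per the tiers above.
const SEND_POLICY = {
  low:    { delayMs: 1 * 3600_000, humanReview: false },          // 1 hour
  medium: { delayMs: 4 * 3600_000, humanReview: false },          // 4 hours
  high:   { delayMs: 7 * 24 * 3600_000, humanReview: true },      // 1 week + review
};

function scheduleSend(email, now = Date.now()) {
  // Unknown risk class is treated as high (adversarial-by-default assumption).
  const policy = SEND_POLICY[email.risk] ?? SEND_POLICY.high;
  return { sendAt: now + policy.delayMs, humanReview: policy.humanReview };
}
```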
What I'd do differently
SQLite works for a single-writer system but I've hit the ceiling on concurrent reads during high-throughput periods. If I were rebuilding, I'd use PostgreSQL from day one - the migration would be straightforward but I'm not doing it while the system is running production workloads.
The session management layer is the most fragile component. Working directory as session identity is a Claude Code implementation detail that could change in any release. I've built around it but I don't love depending on it. A proper session server with explicit session IDs would be more resilient.
The single Docker host is a deliberate constraint, not a limitation. Cloud deployment would add cost and operational complexity without solving any problem I actually have. 6,300 jobs on one machine is fine. If I needed 10x throughput, I'd shard by project rather than scaling horizontally.
Not a demo
This system is live. You're reading about the infrastructure that built this page.
The research program documents the questions. The writing documents the lessons. This page documents how the work actually gets done.