production ai systems

Building AI that runs. Not AI that demos.

8,100+ autonomous jobs. Six model families. 35 distinct models. One orchestration system I built and run daily. This site documents the architecture decisions, failure classes, and operational patterns that only surface after the demo phase ends.

Danny Nakhla - AI solutions manager and production AI systems architect. Available for consulting.

If you're building past the demo phase, this is for you.

Research Essays

Engineer D. NAKHLA

System Metrics

jobs.total 8,100+
models.active 35
uptime 24/7
model.families 6

The Architecture

Live System / 24-7

Fig. 1 — Production orchestration architecture. Hub-and-spoke topology with multi-model fallback routing.
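To make the fallback routing in Fig. 1 concrete, here is a minimal sketch of one way a fallback chain could work. It is not the site's actual implementation: the names `route_with_fallback`, `AllModelsFailed`, `flaky_primary`, and `steady_secondary` are hypothetical, and the model calls are stand-in functions rather than real API clients.

```python
from typing import Callable, Sequence


class AllModelsFailed(Exception):
    """Raised when every model in the fallback chain has failed."""


def route_with_fallback(
    prompt: str,
    chain: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllModelsFailed("; ".join(errors))


# Hypothetical stand-ins for real model clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def steady_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = route_with_fallback(
    "ping",
    [("primary", flaky_primary), ("secondary", steady_secondary)],
)
```

Keeping the chain as plain data means the route can differ per job type without touching the routing function.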

Field Notes

2026

Harness Architecture APR 2026

The Harness Is the Moat

293 scored runs across 19 models. Model selection creates 37x more variance than harness choice. The moat isn't better quality - it's the freedom to route to the right model.

Claude Code Deep Dives APR 2026

Claude Code in Production: What Nobody Tells You After 5,000 Jobs

Lessons from running Claude Code agents across 5,000+ real production jobs. The failures, fixes, and patterns nobody documents.

AI Engineering MAR 2026

How I Run 500 AI Agents

Inside the orchestration system running hundreds of autonomous AI workers 24/7. Architecture, failures, and what actually scales.

AI Code Tools MAR 2026

Claude Code vs Codex vs Open Source: A Practitioner's Honest Breakdown

A working practitioner's comparison of Claude Code, Codex CLI, and open-source alternatives. Benchmarks from real production use.

"8,100+ jobs: context loss beats model quality every time. Ground truth from production: 27% of failures are agents forgetting what they're doing. Not capability. Orchestration."

— Danny Nakhla

Failure Taxonomy

From 8,100+ Jobs

27%

Context Loss

The single largest failure class in this deployment. Agents lose track of what they're doing mid-task. No vendor talks about this.

22%

Rate Limit Cascades

One throttled request triggers a retry storm that burns through your budget before you notice.
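One common defense against retry storms combines jittered exponential backoff with a retry budget shared across workers. The sketch below assumes that approach; it is not taken from the system described here. `RetryBudget`, `RateLimited`, `backoff_delay`, and `call_with_budget` are hypothetical names, and the budget is a single-process counter (it is not thread-safe).

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 / throttling error."""


class RetryBudget:
    """Caps total retries so one throttle cannot fan out into a storm."""

    def __init__(self, max_retries: int):
        self.remaining = max_retries

    def try_spend(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_budget(call, budget, max_attempts=5, sleep=time.sleep):
    """Retry on RateLimited until attempts or the shared budget run out."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            # Give up when this job or the shared budget is exhausted.
            if attempt == max_attempts - 1 or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt))


# Demo: a call that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited("429")
    return "ok"


budget = RetryBudget(max_retries=5)
result = call_with_budget(flaky, budget, sleep=lambda s: None)  # no real sleep
```

Because the budget is shared, a burst of throttling drains it once, and later jobs fail fast instead of piling on more retries.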

40%

Overhead Tax

The compute your system spends managing itself: memory, scheduling, error recovery. Every cost model misses this. 11% of failures fall outside these three classes.

The Infrastructure

Live System / 24-7

Core System

Autonomous AI Operations at Personal Scale

A 24/7 orchestration system coordinating research, trading, content production, and operational tasks across 20+ concurrent workers.

▦

Hub-and-Spoke

SQLite job queue with priority routing, model-specific fallback chains, and three-layer persistent memory.
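As an illustration of the queue half of that description, here is a minimal sketch of a SQLite-backed priority queue using Python's standard `sqlite3` module. The schema, column names, and the helpers `open_queue`, `enqueue`, and `claim_next` are assumptions for this example, not the system's actual design. Fallback chains and the memory layers are out of scope.

```python
import sqlite3

# Hypothetical schema: higher priority runs first, ties go to the oldest job.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    task TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'queued',
    worker TEXT
)
"""


def open_queue(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are controlled explicitly with BEGIN/COMMIT.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, task, priority=0):
    cur = conn.execute(
        "INSERT INTO jobs (task, priority) VALUES (?, ?)", (task, priority)
    )
    return cur.lastrowid


def claim_next(conn, worker):
    """Atomically claim the highest-priority queued job, or return None."""
    # BEGIN IMMEDIATE takes the write lock up front, so two workers
    # cannot both select and claim the same row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, task FROM jobs WHERE status = 'queued' "
            "ORDER BY priority DESC, id ASC LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running', worker = ? WHERE id = ?",
                (worker, row[0]),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


conn = open_queue()
enqueue(conn, "summarize report", priority=1)
enqueue(conn, "urgent fix", priority=10)
first = claim_next(conn, worker="w1")
```

Here `first` is the `(id, task)` pair for the priority-10 job, even though it was enqueued second.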

☰

Production Reliability

Failure taxonomy from real operations: context loss (27%), rate limit cascades (22%), optimistic completion (19%).
