production ai systems

Building AI that runs. Not AI that demos.

8,100+ autonomous jobs. Six model families. 35 distinct models. One orchestration system I built and run daily. This site documents the architecture decisions, failure classes, and operational patterns that only surface after the demo phase ends.

Danny Nakhla - AI solutions manager and production AI systems architect. Available for consulting.

If you're building past the demo phase, this is for you.

Engineer D. NAKHLA
  • jobs.total 8,100+
  • models.active 35
  • uptime 24/7
  • model.families 6

The Architecture

[Fig. 1 diagram, flattened to text]
Input layer: Telegram, cron schedule, dashboard, API/webhook, email.
Orchestration engine: SQLite queue with priority routing, session manager, memory system (core/archival/working), skill library (28 hot-loadable modules).
Model router (fallback chain, cost optimizer): Claude Opus (primary/complex), Claude Sonnet (standard/fast), GPT/Gemini (fallback/research), Kimi/open-source (budget/batch), Codex (direct coding agent).
Output: Telegram, dashboard, files, email, web publish, trading, smart home.
Fig. 1 — Production orchestration architecture. Hub-and-spoke topology with multi-model fallback routing.
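The routing layer in Fig. 1 can be sketched as an ordered fallback chain per task class. This is an illustrative sketch under stated assumptions, not the production code: the chains, model names, and `call_model` stub are hypothetical stand-ins.

```python
# Illustrative multi-model fallback router. Chains, model names, and
# call_model are hypothetical stand-ins, not the actual implementation.
FALLBACK_CHAINS = {
    "complex":  ["claude-opus", "claude-sonnet", "gpt"],
    "standard": ["claude-sonnet", "gemini"],
    "batch":    ["kimi", "open-source"],
}

def call_model(model: str, prompt: str) -> str:
    """Stub provider call; a real router would hit each vendor's API."""
    if model == "claude-opus":
        raise RuntimeError("429: rate limited")  # simulate a throttle
    return f"{model}: ok"

def route(task_class: str, prompt: str) -> str:
    """Walk the chain for this task class until one model succeeds."""
    errors = []
    for model in FALLBACK_CHAINS[task_class]:
        try:
            return call_model(model, prompt)
        except RuntimeError as exc:
            errors.append((model, str(exc)))  # record and fall through
    raise RuntimeError(f"all models failed: {errors}")

print(route("complex", "summarize the logs"))  # falls back to claude-sonnet
```

The point of the structure is that swapping a model in or out is a one-line config change, which is what makes routing freedom cheap.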

Field Notes

Harness Architecture

The Harness Is the Moat

293 scored runs across 19 models. Model selection creates 37x more variance than harness choice. The harness's moat isn't output quality - it's the freedom to route each job to the right model.

Read
Claude Code Deep Dives

Claude Code in Production: What Nobody Tells You After 5,000 Jobs

Lessons from running Claude Code agents across 5,000+ real production jobs. The failures, fixes, and patterns nobody documents.

Read
AI Engineering

How I Run 500 AI Agents

Inside the orchestration system running hundreds of autonomous AI workers 24/7. Architecture, failures, and what actually scales.

Read
AI Code Tools

Claude Code vs Codex vs Open Source: A Practitioner's Honest Breakdown

A working practitioner's comparison of Claude Code, Codex CLI, and open-source alternatives. Benchmarks from real production use.

Read
All Essays
"8,100+ jobs: context loss beats model quality every time. Ground truth from production: 27% of failures are agents forgetting what they're doing. Not capability. Orchestration."
— Danny Nakhla

Failure Taxonomy

27%

Context Loss

The single largest failure class in this deployment. Agents lose track of what they're doing mid-task. No vendor talks about this.
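One common mitigation is checkpointing working state between steps, so a restarted or drifting worker resumes from persisted ground truth instead of a decayed context window. A minimal sketch; the `checkpoint`/`resume` helpers are hypothetical illustrations, not this system's memory layer:

```python
import json
import tempfile
from pathlib import Path

def checkpoint(task_id: str, state: dict, root: Path) -> None:
    """Persist working state after every step; cheap insurance
    against a worker losing the thread mid-task."""
    (root / f"{task_id}.json").write_text(json.dumps(state))

def resume(task_id: str, root: Path) -> "dict | None":
    """Reload state on restart; None means start from scratch."""
    path = root / f"{task_id}.json"
    return json.loads(path.read_text()) if path.exists() else None

root = Path(tempfile.mkdtemp())  # throwaway directory for the demo
checkpoint("job-1", {"step": 3, "goal": "summarize logs"}, root)
print(resume("job-1", root))  # {'step': 3, 'goal': 'summarize logs'}
```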

22%

Rate Limit Cascades

One throttled request triggers a retry storm that burns through your budget before you notice.
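The standard defense against retry storms is capped exponential backoff with full jitter, so throttled workers desynchronize instead of retrying in lockstep. A sketch with illustrative defaults, not this system's tuning:

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 1.0,
                   cap: float = 30.0):
    """Yield one wait time per retry: a random delay up to
    min(cap, base * 2**attempt). Full jitter spreads retries out so
    one throttled request can't recruit the whole fleet into a
    synchronized storm."""
    for attempt in range(max_retries):
        yield random.uniform(0.0, min(cap, base * 2 ** attempt))

print([round(d, 2) for d in backoff_delays()])
```

Pair this with a retry budget per job, since backoff alone only slows a storm; a hard cap is what actually stops it.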

40%

Overhead Tax

The compute your system spends managing itself: memory, scheduling, error recovery. Every cost model misses this.

The remaining 11% of failures fall outside these three classes.

The Infrastructure

Core System

Autonomous AI Operations at Personal Scale

A 24/7 orchestration system coordinating research, trading, content production, and operational tasks across 20+ concurrent workers.

Hub-and-Spoke

SQLite job queue with priority routing, model-specific fallback chains, and three-layer persistent memory.
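That queue pattern fits in a few lines of SQLite. Illustrative only: the schema and claim logic are assumptions, not the production schema, and `UPDATE ... RETURNING` needs SQLite 3.35+.

```python
import sqlite3

# Hypothetical sketch of a priority job queue in SQLite.
conn = sqlite3.connect(":memory:")  # a real queue would live on disk
conn.execute("""
    CREATE TABLE jobs (
        id       INTEGER PRIMARY KEY,
        payload  TEXT    NOT NULL,
        priority INTEGER NOT NULL DEFAULT 5,  -- lower = more urgent
        status   TEXT    NOT NULL DEFAULT 'queued'
    )""")

def enqueue(payload: str, priority: int = 5) -> None:
    conn.execute("INSERT INTO jobs (payload, priority) VALUES (?, ?)",
                 (payload, priority))

def dequeue() -> "str | None":
    """Atomically claim the most urgent queued job, oldest first on ties."""
    row = conn.execute(
        "UPDATE jobs SET status = 'running' WHERE id = ("
        "  SELECT id FROM jobs WHERE status = 'queued'"
        "  ORDER BY priority, id LIMIT 1) "
        "RETURNING payload").fetchone()
    return row[0] if row else None

enqueue("nightly research digest", priority=7)
enqueue("trading signal check", priority=1)
print(dequeue())  # claims the priority-1 job first
```

Doing the claim in a single `UPDATE ... RETURNING` statement is what keeps concurrent workers from grabbing the same job.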

Production Reliability

Failure taxonomy from real operations: context loss (27%), rate limit cascades (22%), optimistic completion (19%).