Milestone · 2026-05-22

I changed my mind.
From a fleet of agents to one agent.

A milestone note for anyone who has been reading this site since 2025. The architecture changed. The headlines on the older essays did not. This is the piece that explains the gap.

A tangle of many small identical agent nodes on the left, connected by a messy web of crossing lines, funneling rightward into one larger circular hub that radiates tidy spokes to a small ring of tool icons.

The short version

From October 2025 through April 2026, the production system this site documents ran on a fleet-of-agents architecture. Five hundred workers in a queue. A scheduler. An orchestrator. A coordination layer that looked, on paper, like the thing every conference talk was diagramming.

In late April 2026 I tore most of that down. What runs now is one agent in a loop. The agent is Claude. Its largest tool is n8n. Around n8n sit roughly thirty smaller tools the agent can call. The total line count of orchestration code dropped by an order of magnitude. The set of bugs that fire each week dropped with it.

The older essays on this site, "How I Run 500 AI Agents", "Six Agents, One Instruction", "Anatomy of a Production Claude Code Setup", describe the architecture I used to run. They are kept for the operational data. The headlines have not aged with the position. The newer essays reflect where the system landed.

What I used to think

The first system was built on an instinct that turned out to be wrong: that the way to scale an LLM-driven operation was to run more of them. More agents, more concurrent workers, more specialization. A research agent. A writing agent. A reviewer. A scheduler-of-schedulers. A queue with priorities. Memory shared across them via a vector store.

It worked, in the limited sense that it produced output. It also produced a steady, draining catalogue of failures that all had the same root cause once I looked closely. Agents lost track of what other agents were doing. The reviewer agreed with the writer too often. The orchestrator could not tell the difference between "the worker is thinking" and "the worker is stuck." Cost grew faster than throughput. Every additional agent multiplied the surface area where one of them could quietly produce confident garbage that the next one in the chain treated as ground truth.

The metric I was optimizing was the wrong one. "How many agents are alive right now" is easy to grow. "What can one agent that doesn't lose its mind actually do" is the harder, more useful question.

What changed

The change wasn't one moment. It was a year of evidence arriving from several directions at once and finally lining up:

The research foundations were already there. ReAct (Yao et al., 2022) formalized the reasoning-and-acting loop as the productive primitive. Toolformer (Schick et al., 2023) showed language models could teach themselves which tools to call. Reflexion (Shinn et al., 2023) demonstrated that self-critique inside the loop beats inter-agent committees. Voyager (Wang et al., 2023) argued for a single agent with a growing skill library, not many narrow agents in a hierarchy. The shape was sitting in the literature; the industry took two years to internalize it.
Anthropic published "Building Effective Agents" (Erik Schluntz and Barry Zhang, December 2024) and led with augmented single-LLM loops as the default primitive. Not coordination diagrams. Not handoff protocols. A loop with tools, plus a careful distinction between "workflows" (orchestrated) and "agents" (the LLM picks the next step).
OpenAI's function-calling docs and Anthropic's Model Context Protocol (MCP, late 2024) reframed the question. The infrastructure conversation became "how does one model call any tool" rather than "how do you get N models to talk to each other." MCP especially turned the entire ecosystem in a tool-not-agent direction.
Cognition wrote "Don't Build Multi-Agents" (Walden Yan, 2025) after a year of trying to ship them. Their post-mortem is specific and brutal about why the abstraction leaks: shared state is the bottleneck, not coordination protocol.
Google Research ran the empirical version ("Towards a Science of Scaling Agent Systems", 2025) and arrived at the same default: one agent, fan out only when the work itself is parallel.
The practitioner consensus arrived independently. Simon Willison has been writing the daily version on his blog all year. The 12-factor agents authors (Dexter Horthy et al., HumanLayer) crystallized it as a manifesto. Hamel Husain's writing on evals reframed reliability as a product question, not a model question. Eugene Yan and Chip Huyen were saying the same thing from the applied-ML side. The Latent Space podcast (Shawn Wang and Alessio Fanelli) and the Dwarkesh Podcast hosted the conversations as they happened.
The frameworks adjusted. DSPy (Omar Khattab et al., Stanford) treats the program structure as the load-bearing thing and the prompts as compiled artifacts. LangGraph moved from chains-of-many-things toward explicit single-graph state machines. Even AutoGen and CrewAI (the canonical multi-agent frameworks) added single-agent first-class patterns. The market voted.
My own production bug log. The classes of failure I kept paying for, an agent re-asking a question it had already answered, a worker reporting status='completed' on a job whose stdout was a paraphrase of its own prompt, a "manager" agent silently disagreeing with the worker it was reviewing, were all coordination bugs. None of them required multi-agent to manifest. All of them got worse with more agents in the chain.

When the same conclusion arrives from the 2022-23 research literature, the 2024 model-vendor engineering guides, a 2025 lab post-mortem, a Google Research empirical paper, the entire practitioner blog circuit, the frameworks themselves, and my own production logs, the right move is to update.

What the new shape looks like

One agent. Claude does the reasoning on the loop. Around it sit tools the agent can call when reasoning runs out of room:

The usual primitives. Read, write, bash, web. Memory, three tiers. Gmail, Telegram, home-assistant, paper-trading API.
n8n as the primary tool. Every workflow that has to outlive a single context window, scheduled jobs, branching content pipelines, durable retries, anything that takes longer than a few minutes, lives as an n8n workflow. The agent invokes the workflow when it decides that's the right move. It does not coordinate with the workflow as a peer. It calls it as a tool.

That is the entire architecture. There is no scheduler-of-schedulers. There is no agent fleet. There is no coordination layer. When you remove those, what's left is a loop that picks a tool, calls it, reads the result, and decides what to do next. Everything else is bookkeeping.

What survived the rewrite

Plenty did. Some of it is the most useful work on this site:

The memory system. Three tiers: core, archival, working. The same shape. Single agent or fleet, the agent still needs to know what happened yesterday, and that machinery is unchanged.
The model router. Routing the same agent across different models for different tasks turns out to matter more than routing different tasks to different agents. The harness study (linked below) is the canonical write-up.
The evaluation pipeline. Eval signals, taste signals, structural verifiers on what "completed" actually means, the work that catches an agent paraphrasing its prompt and calling it an answer.
n8n itself. The workflows I wrote for the fleet largely became the workflows the one agent now invokes. The architecture changed; the durable execution surface did not.
The bug discipline. Every "completed" claim with a structural witness on the way out, a stamped column, a SQL trigger, a tool-use floor. That ethos survived intact and is, honestly, the actual moat.

What this site publishes from here on

Field notes from one agent in production. Bug-log entries with structural fixes. Cost figures from the running ledger. Honest comparisons of the models the agent routes between. Notes when something I previously argued for turns out to have been wrong.

The instinct that "scale = more agents" is a dead end, and arguing against it is no longer the interesting work. The interesting work is the next layer of questions: how to make the one agent more reliable, what tool-use floors actually catch, what the agent's "completed" should mean, how memory degrades under hostile context, where the evaluation signal becomes the bottleneck, when a model swap beats a prompt rewrite.

If you came here for the agent-fleet posts, they are still in the archive. They got the data right and the framing wrong. The shift is real and I wanted it on the record.

I changed my mind. From a fleet of agents to one agent.