Essays

Operating notes from my production agents.

Architecture decisions, failure taxonomies, and the bugs I have actually paid for, from running my production agents every day. The system has run daily since October 2025; its current shape, Claude in the loop with n8n as its primary tool, dates from the May 2026 rewrite. The numbers are pulled from the running system. Where I am extrapolating, I say so.

On framing: earlier essays use a fleet-of-agents shape ("500 agents", "300+ workers"). Through 2025 the field's consensus moved toward one agent in a loop with many tools, articulated independently by Anthropic, Cognition, and Simon Willison. The older articles are kept for the operational data; the headlines have not aged with the position.

Latest

Newest first

If a model edits a system you run, make it show you the diff first

I let an AI rewrite the automations that run my house. The day it quietly deleted a step and I found out three days later, I stopped trusting it to just do the work. Now it has to show me a plain-English preview of every change, and nothing goes live until I say yes.
Architecture · JUN 2026 · 1 min

My agent has a terminal. It doesn't need MCP.

The standard way to give an AI agent its tools loads every tool's full description into the model's memory at the start of every session, used or not. Anthropic's own engineers watched that menu hit 134,000 tokens. My agent skips it: it gets a command line and reads the manual only when it needs it, about 200 tokens versus 14,000 for the same tool. It runs that way a few hundred times a day. The one place that genuinely can't open a terminal already has a door I built in thirty lines.
Architecture · JUN 2026 · 4 min

Cookie transplant: how my agent posts as a real account

A website can tell a robot is driving the browser before you type a single word, and it quietly refuses. Instead of disguising the robot, I let it borrow a session a real person already logged into by hand. One trick gets through the door for three platforms. The catch: a post can report success and still not exist, so nothing counts until a second check walks back and finds it on the page.
Failure Postmortem · JUN 2026 · 1 min

A five-dollar AI scored 76. The professionals scored 84.

There is a little robot in my group chat that runs on a five-dollar-a-month AI. I gave it one real task from each of forty-four jobs, graded against the professionals' own work, and it came within eight points of the pros. Then I ran the same model through my own homemade bot and it dropped six points, with the model never changing. This is what an AI evaluation is, and why the same model can be worth three different numbers.
Field notes · JUN 2026 · 4 min

Cognitive surrender happens at the approval gate

When a Wharton study put a wrong AI answer in front of 1,372 people, they agreed 73 percent of the time, and felt more confident doing it. I run a system that asks me to approve real actions all day. This is what I learned about the moment that actually goes wrong: not the agent's decision, but mine.
Position Response · JUN 2026 · 1 min

Your group chat is dying. I built a member that won't let it.

A lot of friendships live in a group chat now, and that is also where a lot of them quietly end. No one is paid to keep yours loud, so when people get busy it goes silent and stays silent. I gave one group chat a member that fights that: it reads everything and says almost nothing, and when the room goes quiet for a week it reaches into the chat's own history, finds a day worth remembering, and hands the room back its own words. Live in one chat with 86,000 messages and almost four years of history.
Field notes · JUN 2026 · 4 min

Start here

The whole argument in three, in order

1. From a fleet of agents to one agent: what changed and why

The position. Why I tore down a fleet of agents and rebuilt around one agent in a loop, and what survived the rewrite.
Milestone · 4 min

2. 6,442 jobs later: model selection beats harness choice 37 to 1

The proof. 6,442 production jobs say model choice moved quality 37x more than the harness. The data behind the position.
Study · 4 min

3. Your agent needs routines, not just skills

The operating model. Once you commit to one agent, routines (not more skills) are what make it reliable.
Architecture · 1 min

By topic

Every essay, once

The harness thesis all 7 →

A five-dollar AI scored 76. The professionals scored 84.

There is a little robot in my group chat that runs on a five-dollar-a-month AI. I gave it one real task from each of forty-four jobs, graded against the professionals' own work, and it came within eight points of the pros. Then I ran the same model through my own homemade bot and it dropped six points, with the model never changing. This is what an AI evaluation is, and why the same model can be worth three different numbers.
Field notes · JUN 2026 · 4 min

How I keep one face consistent across AI-generated portraits

I wanted professional photos of myself in different outfits and settings without a photographer, and every AI tool gave me a face that was a few percent off and instantly read as a stranger. The fix was a rule: never let the model draw the face. Generate the outfit, the room, and the light around an empty space, then paste a real photo of my face into it and finish in code. Field notes on the method and why my eye was right every time.
Method · JUN 2026 · 4 min

Work with the agent until it works. Then make it a workflow.

I run one AI agent that drafts emails, publishes essays, and trades a small account every day. When it solves a task the same way twice, I freeze those steps into a fixed routine and stop letting it improvise that job. The agent is how you learn. The routine is what you keep.
Method · JUN 2026 · 4 min

The model is the orchestrator, the workflow engine is the hands

One automated assistant has run in production since October, doing 6,442 jobs for about $205 a month. It works because exactly one part decides and everything else only acts. When I tested it, the brain mattered 37 times more than the tools around it. So I add hands, never a second brain.
Position Response · JUN 2026 · 4 min

The control plane became the product: 5,079 jobs of operational truth

I kept a 37-day log of one AI system: 5,079 jobs. The hard part was never the model. It was all the unglamorous work that kept the model honest.
Architecture · MAR 2026 · 4 min

Tooling reviews all 2 →

Claude Code in production: what breaks after 5,000 jobs

I ran an AI coding assistant 5,000 times on a home server. It almost never wrote bad code. What broke was everything around it, and the fixes had nothing to do with the model.
Tooling Review · MAR 2026 · 1 min

What 12 months of daily AI coding in production actually cost

I coded through AI tools every working day for a year, on a live system past 5,000 jobs. The combination that does the most work costs me about $35 a month, and the lesson was that cost comes down to routing each kind of work to the tool whose price fits.
Tooling Review · MAR 2026 · 4 min

Failure postmortems all 3 →

The API said success. The work never happened.

For three weeks my automated publishing system reported a hundred percent success rate while posting nothing at all. The worst kind of bug is the one that confirms itself, and the only honest fix is to go check the real world you do not control.
Failure Postmortem · MAY 2026 · 4 min
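
The idea in the entry above, that an API's "success" is not proof the work happened, can be sketched as a read-back check. This is a minimal illustration, not the system's actual code: `publish_and_verify`, its parameters, and the stand-in callables are all hypothetical names chosen for the example.

```python
import time
from typing import Callable


def publish_and_verify(
    publish: Callable[[], bool],
    read_back: Callable[[], str],
    marker: str,
    attempts: int = 3,
    delay_s: float = 0.0,
) -> bool:
    """Count a post as done only if its text is found on the live page."""
    if not publish():
        return False  # the API itself reported failure
    for attempt in range(attempts):
        # Independent check: fetch the public page and look for the post.
        if marker in read_back():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # give the platform time to render the post
    return False  # API said success, but the work is not visible


# Hypothetical stand-ins: an API that always claims success,
# one page that stays empty and one that really shows the post.
always_ok = lambda: True
empty_page = lambda: ""
live_page = lambda: "<p>hello from the agent</p>"

silent_failure = publish_and_verify(always_ok, empty_page, "hello from the agent")
real_success = publish_and_verify(always_ok, live_page, "hello from the agent")
```

With these stand-ins, `silent_failure` is `False` even though the publish call reported success, which is the case a status-code-only check would miss.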

Why Chrome hidden tabs silently corrupt fetch results, and the one-line fix

My system reported a comment posted, and it never existed. The browser tab doing the work was hidden, and Chrome quietly returned empty results while every status check stayed green. The fix was one line. The real lesson was that a call coming back is not proof the work happened.
Failure Postmortem · MAY 2026 · 4 min

Split reads from writes: how I cut my agent Chrome cost 90%

For a few weeks this spring, a program I built was paying full price for a browser session every fifteen minutes just to read comments. The expensive way to post was setting the rules for the cheap, constant work of reading. One commit later, reads went from ten seconds to under one.
Failure Postmortem · MAY 2026 · 4 min

Architecture deep-dives all 8 →

My agent has a terminal. It doesn't need MCP.

The standard way to give an AI agent its tools loads every tool's full description into the model's memory at the start of every session, used or not. Anthropic's own engineers watched that menu hit 134,000 tokens. My agent skips it: it gets a command line and reads the manual only when it needs it, about 200 tokens versus 14,000 for the same tool. It runs that way a few hundred times a day. The one place that genuinely can't open a terminal already has a door I built in thirty lines.
Architecture · JUN 2026 · 4 min

Your group chat is dying. I built a member that won't let it.

A lot of friendships live in a group chat now, and that is also where a lot of them quietly end. No one is paid to keep yours loud, so when people get busy it goes silent and stays silent. I gave one group chat a member that fights that: it reads everything and says almost nothing, and when the room goes quiet for a week it reaches into the chat's own history, finds a day worth remembering, and hands the room back its own words. Live in one chat with 86,000 messages and almost four years of history.
Field notes · JUN 2026 · 4 min

How my agent earns the word done: evals and certification gates

My own agent made my daughter's first-birthday book and put her name on the cover, in gold, spelled wrong. The kind of error a spellchecker passes, because it is a correctly spelled name, just not hers. So I stopped letting the agent decide when its own work was done, and made a second model sign off first. Here is how it works and what it costs.
Architecture · JUN 2026 · 4 min

LetMeCheckThatBot: a group-chat agent that remembers

I run a Telegram bot that lives inside a group chat and answers when you call it. The part that took the most work was not its personality. It was a memory that silently saves every photo, voice note, and link people drop, and hands the right one back weeks later when someone asks. The personality was a day of work. The memory was the whole project.
Architecture · JUN 2026 · 4 min

One offhand message, a research paper by dinnertime

I sent one offhand message and got a finished research paper back in an afternoon, with no research assistants. Three choices made it work: the agent broke the job into pieces itself, ran each piece on the model that fit it, and let a second model reject the result. The gate is the part most people skip, and it is the part that keeps the work honest.
Architecture · JUN 2026 · 4 min

Inside a Claude Code setup running 6,442 jobs: the completion gate

An AI assistant ran 6,442 jobs for me and never marked a single one done on its own. The rule that made it safe to trust: every piece of work has to pass a second model before it counts as finished.
Architecture · MAR 2026 · 4 min

Inside one production AI agent: routing and the failure log

One AI assistant ran 5,252 jobs from a home server in Philadelphia and finished 93.7% of them. The surprise is in the failures: not one was the AI saying something wrong. They were all plumbing. Part one of two.
Architecture · MAR 2026 · 4 min

Context and memory engineering all 3 →

Composite scoring: fixing stale agent recall

At 2 AM my AI assistant wrote a confidently wrong brief because it trusted a four-month-old note about the wrong dataset. The fix was to stop ranking memories on resemblance alone and start weighing how fresh, how important, and how often-used each one is, across a store of 230,000 memories.
Architecture · APR 2026 · 4 min
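
A rough sketch of what ranking on more than resemblance could look like. The four signals (similarity, freshness, importance, usage) come from the entry above; the weights, the 30-day half-life, the usage saturation constant, and every name in the code are assumptions made for illustration, not the values the essay uses.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    similarity: float  # resemblance to the query, 0..1 (e.g. cosine similarity)
    age_days: float    # time since the memory was written or last confirmed
    importance: float  # 0..1, assigned when the memory is stored
    uses: int          # how often the memory has been recalled


def composite_score(m: Memory, half_life_days: float = 30.0) -> float:
    """Weighted blend of four signals; all weights here are assumed."""
    freshness = 0.5 ** (m.age_days / half_life_days)  # halves every half-life
    usage = m.uses / (m.uses + 5)                     # saturates toward 1
    return (0.5 * m.similarity + 0.2 * freshness
            + 0.2 * m.importance + 0.1 * usage)


# Hypothetical data: a four-month-old note that resembles the query more,
# and a two-day-old note that resembles it slightly less.
stale = Memory("old note, wrong dataset", 0.92, age_days=120, importance=0.3, uses=1)
fresh = Memory("current note", 0.85, age_days=2, importance=0.8, uses=12)
memories = [stale, fresh]

by_similarity = max(memories, key=lambda m: m.similarity)
by_composite = max(memories, key=composite_score)
```

On resemblance alone the stale note wins; with the composite score the fresh one does, because the age penalty and the importance weight outweigh the small similarity gap.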

Context engineering in practice: a 3-tier memory system, 230K vectors

An AI assistant I run got a job confidently wrong at 2 AM. The model was fine; it had simply remembered the wrong thing. The fix was a three-tier memory: a tiny always-on core file, a 230,000-fact searchable store, and a layer of rules it teaches itself from my corrections. 6,370 jobs in, it gets the job right 95.9% of the time and the memory costs three-tenths of a cent a month.
Architecture · APR 2026 · 4 min

The memory system and composite scoring

An AI assistant that runs my home server kept surfacing a wrong old memory in the middle of a task. The fix was three kinds of memory, a gate that decides what gets saved, and letting unused memories fade. Part two of two.
Architecture · MAR 2026 · 4 min

Field notes

The $50k that used to pay people now pays for tokens

A team of engineers cost me about $50k a quarter. To test whether a swarm of AI agents could do their work, I spent more than the whole team would have cost, and what came back was buggy and expensive to clean up. The lesson that cost the most was that I stopped thinking. The honest part is who used to get the money the tokens now take.
Field notes · MAY 2026 · 4 min