Essays
Operating notes from my production agents.
Architecture decisions, failure taxonomies, and the bugs I have actually paid for, from running my production agents every day. The system has run daily since October 2025, and in its current shape, Claude in the loop with n8n as its primary tool, since the May 2026 rewrite. The numbers are pulled from the running system. Where I am extrapolating, I say so.
On framing: earlier essays use a fleet-of-agents shape ("500 agents", "300+ workers"). Through 2025 the field's consensus moved toward one agent in a loop with many tools, articulated independently by Anthropic, Cognition, and Simon Willison. The older articles are kept for the operational data; the headlines have not aged with the position.
Latest
Newest first
If a model edits a system you run, make it show you the diff first
I let an AI rewrite the automations that run my house. The day it quietly deleted a step and I found out three days later, I stopped trusting it to just do the work. Now it has to show me a plain-English preview of every change, and nothing goes live until I say yes.
My agent has a terminal. It doesn't need MCP.
The standard way to give an AI agent its tools loads every tool's full description into the model's memory at the start of every session, used or not. Anthropic's own engineers watched that menu hit 134,000 tokens. My agent skips it: it gets a command line and reads the manual only when it needs it, about 200 tokens versus 14,000 for the same tool. It runs that way a few hundred times a day. The one place that genuinely can't open a terminal already has a door I built in thirty lines.
Cookie transplant: how my agent posts as a real account
A website can tell a robot is driving the browser before you type a single word, and it quietly refuses. Instead of disguising the robot, I let it borrow a session a real person already logged into by hand. One trick gets through the door for three platforms. The catch: a post can report success and still not exist, so nothing counts until a second check walks back and finds it on the page.
A five-dollar AI scored 76. The professionals scored 84.
There is a little robot in my group chat that runs on a five-dollar-a-month AI. I gave it one real task from each of forty-four jobs, graded against the professionals' own work, and it came within eight points of the pros. Then I ran the same model through my own homemade bot and it dropped six points, with the model never changing. This is what an AI evaluation is, and why the same model can be worth three different numbers.
Cognitive surrender happens at the approval gate
When a Wharton study put a wrong AI answer in front of 1,372 people, they agreed 73 percent of the time, and felt more confident doing it. I run a system that asks me to approve real actions all day. This is what I learned about the moment that actually goes wrong: not the agent's decision, but mine.
Your group chat is dying. I built a member that won't let it.
A lot of friendships live in a group chat now, and that is also where a lot of them quietly end. No one is paid to keep yours loud, so when people get busy it goes silent and stays silent. I gave one group chat a member that fights that: it reads everything and says almost nothing, and when the room goes quiet for a week it reaches into the chat's own history, finds a day worth remembering, and hands the room back its own words. Live in one chat with 86,000 messages and almost four years of history.
Start here
The whole argument in three essays, in order
1. From a fleet of agents to one agent: what changed and why
The position. Why I tore down a fleet of agents and rebuilt around one agent in a loop, and what survived the rewrite.
2. 6,442 jobs later: model selection beats harness choice 37 to 1
The proof. 6,442 production jobs say model choice moved quality 37x more than the harness. The data behind the position.
3. Your agent needs routines, not just skills
The operating model. Once you commit to one agent, routines (not more skills) are what make it reliable.
By topic
Every essay, once
The harness thesis
all 7 →
A five-dollar AI scored 76. The professionals scored 84.
There is a little robot in my group chat that runs on a five-dollar-a-month AI. I gave it one real task from each of forty-four jobs, graded against the professionals' own work, and it came within eight points of the pros. Then I ran the same model through my own homemade bot and it dropped six points, with the model never changing. This is what an AI evaluation is, and why the same model can be worth three different numbers.
How I keep one face consistent across AI-generated portraits
I wanted professional photos of myself in different outfits and settings without a photographer, and every AI tool gave me a face that was a few percent off and instantly read as a stranger. The fix was a rule: never let the model draw the face. Generate the outfit, the room, and the light around an empty space, then paste a real photo of my face into it and finish in code. Field notes on the method and why my eye was right every time.
Work with the agent until it works. Then make it a workflow.
I run one AI agent that drafts emails, publishes essays, and trades a small account every day. When it solves a task the same way twice, I freeze those steps into a fixed routine and stop letting it improvise that job. The agent is how you learn. The routine is what you keep.
The model is the orchestrator, the workflow engine is the hands
One automated assistant has run in production since October, doing 6,442 jobs for about $205 a month. It works because exactly one part decides and everything else only acts. When I tested it, the brain mattered 37 times more than the tools around it. So I add hands, never a second brain.
The control plane became the product: 5,079 jobs of operational truth
I kept a 37-day log of one AI system: 5,079 jobs. The hard part was never the model. It was all the unglamorous work that kept the model honest.
Tooling reviews
all 2 →
Claude Code in production: what breaks after 5,000 jobs
I ran an AI coding assistant 5,000 times on a home server. It almost never wrote bad code. What broke was everything around it, and the fixes had nothing to do with the model.
What 12 months of daily AI coding in production actually cost
I coded through AI tools every working day for a year, on a live system past 5,000 jobs. The combination that does the most work costs me about $35 a month, and the lesson was that cost comes down to routing each kind of work to the tool whose price fits.
Failure postmortems
all 3 →
The API said success. The work never happened.
For three weeks my automated publishing system reported a hundred percent success rate while posting nothing at all. The worst kind of bug is the one that confirms itself, and the only honest fix is to go check the real world you do not control.
Why Chrome hidden tabs silently corrupt fetch results, and the one-line fix
My system reported a comment posted, and it never existed. The browser tab doing the work was hidden, and Chrome quietly returned empty results while every status check stayed green. The fix was one line. The real lesson was that a call coming back is not proof the work happened.
Split reads from writes: how I cut my agent Chrome cost 90%
For a few weeks this spring, a program I built was paying full price for a browser session every fifteen minutes just to read comments. The expensive way to post was setting the rules for the cheap, constant work of reading. One commit later, reads went from ten seconds to under one.
Architecture deep-dives
all 8 →
My agent has a terminal. It doesn't need MCP.
The standard way to give an AI agent its tools loads every tool's full description into the model's memory at the start of every session, used or not. Anthropic's own engineers watched that menu hit 134,000 tokens. My agent skips it: it gets a command line and reads the manual only when it needs it, about 200 tokens versus 14,000 for the same tool. It runs that way a few hundred times a day. The one place that genuinely can't open a terminal already has a door I built in thirty lines.
Your group chat is dying. I built a member that won't let it.
A lot of friendships live in a group chat now, and that is also where a lot of them quietly end. No one is paid to keep yours loud, so when people get busy it goes silent and stays silent. I gave one group chat a member that fights that: it reads everything and says almost nothing, and when the room goes quiet for a week it reaches into the chat's own history, finds a day worth remembering, and hands the room back its own words. Live in one chat with 86,000 messages and almost four years of history.
How my agent earns the word done: evals and certification gates
My own agent made my daughter's first-birthday book and put her name on the cover, in gold, spelled wrong. The kind of error a spellchecker passes, because it is a correctly spelled name, just not hers. So I stopped letting the agent decide when its own work was done, and made a second model sign off first. Here is how it works and what it costs.
LetMeCheckThatBot: a group-chat agent that remembers
I run a Telegram bot that lives inside a group chat and answers when you call it. The part that took the most work was not its personality. It was a memory that silently saves every photo, voice note, and link people drop, and hands the right one back weeks later when someone asks. The personality was a day of work. The memory was the whole project.
One offhand message, a research paper by dinnertime
I sent one offhand message and got a finished research paper back in an afternoon, with no research assistants. Three choices made it work: the agent broke the job into pieces itself, ran each piece on the model that fit it, and let a second model reject the result. The gate is the part most people skip, and it is the part that keeps the work honest.
Inside a Claude Code setup running 6,442 jobs: the completion gate
An AI assistant ran 6,442 jobs for me and never marked a single one done on its own. The rule that made it safe to trust: every piece of work has to pass a second model before it counts as finished.
Inside one production AI agent: routing and the failure log
One AI assistant ran 5,252 jobs from a home server in Philadelphia and finished 93.7% of them. The surprise is in the failures: not one was the AI saying something wrong. They were all plumbing. Part one of two.
Context and memory engineering
all 3 →
Composite scoring: fixing stale agent recall
At 2 AM my AI assistant wrote a confidently wrong brief because it trusted a four-month-old note about the wrong dataset. The fix was to stop ranking memories on resemblance alone and start weighing how fresh, how important, and how often used each one is, across a store of 230,000 memories.
Context engineering in practice: a 3-tier memory system, 230K vectors
An AI assistant I run got a job confidently wrong at 2 AM. The model was fine; it had simply remembered the wrong thing. The fix was a three-tier memory: a tiny always-on core file, a 230,000-fact searchable store, and a layer of rules it teaches itself from my corrections. 6,370 jobs in, it gets the job right 95.9% of the time and the memory costs three-tenths of a cent a month.
The memory system and composite scoring
An AI assistant that runs my home server kept surfacing a wrong old memory in the middle of a task. The fix was three kinds of memory, a gate that decides what gets saved, and letting unused memories fade. Part two of two.
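The memory essays above describe ranking on more than resemblance: freshness, importance, and how often a memory is used, with unused memories left to fade. A minimal sketch of that idea follows. It is not the system's actual code: the `Memory` fields, the weights, and the 30-day half-life are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    similarity: float  # resemblance to the query, assumed normalized to 0..1
    age_days: float    # days since the memory was written or last confirmed
    importance: float  # assumed 0..1, assigned when the memory is saved
    use_count: int     # how many times the memory has been recalled


def composite_score(m: Memory,
                    half_life_days: float = 30.0,
                    weights: tuple = (0.5, 0.2, 0.2, 0.1)) -> float:
    """Blend resemblance with freshness, importance, and usage.

    The weights and half-life are hypothetical, not measured values.
    """
    w_sim, w_fresh, w_imp, w_use = weights
    # Exponential decay: freshness halves every half_life_days.
    freshness = 0.5 ** (m.age_days / half_life_days)
    # Saturating usage signal: 0 for a never-used memory, approaching 1 slowly.
    usage = 1.0 - 1.0 / (1.0 + math.log1p(m.use_count))
    return (w_sim * m.similarity + w_fresh * freshness
            + w_imp * m.importance + w_use * usage)


def rank(memories: list) -> list:
    """Return memories ordered best first by composite score."""
    return sorted(memories, key=composite_score, reverse=True)


stale = Memory("note about the wrong dataset", similarity=0.9,
               age_days=120, importance=0.3, use_count=0)
fresh = Memory("note about the current dataset", similarity=0.8,
               age_days=2, importance=0.6, use_count=5)
best = rank([stale, fresh])[0]
```

Under these assumed weights, the four-month-old note loses to the fresher one even though it resembles the query more closely, which is the failure the essays describe fixing.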
