Method · 2026-06-02

Work with the agent until it works.
Then make it a workflow.

The most useful thing I have built is not an agent. It is the way an agent's work becomes a workflow. The agent is learning. The workflow is learned. These are field notes on the loop that turns one into the other.

A hand-drawn ink sketch of many tangled, searching exploratory lines on the left gathering and resolving into a single clean repeating groove on the right, one rust accent marking the settled path

I run one agent. It has done something every day since October 2025: drafted emails, published essays, placed phone calls, traded a paper account it is not allowed to lose too much of. For the first six months it was not one agent. It was a fleet of more than three hundred, coordinated by a scheduler I was proud of. I tore that down. What runs now is one model in a loop with many tools, and the largest tool by far is a library of workflows the agent helped write for itself.

This essay is about how those workflows get made, because the process is the most useful thing I have built, more useful than any single agent. The honest description is plain. I work with an agent until I figure out what reliably works, and then I turn that into a workflow. That is the whole method. The agent is learning. The workflow is learned.

It is also, almost exactly, the line Anthropic draws in Building Effective Agents (Erik Schluntz and Barry Zhang, December 2024) between agents and workflows. I did not get there from the paper. I got there from the scars, and then found the paper had mapped the territory first and mapped it well. So read this as one operator's field notes from inside that distinction, not a competing theory of it. If you want a short handle for the loop, call it routine distillation. The handle matters less than the practice.

The two abstractions I had to fail at first

The progression is easy to describe because I lived every wrong step of it.

The first abstraction was the agent. Give a capable model a loop and some tools and it can figure out almost anything once. That is intoxicating, and it is why everyone is building agents. The problem only shows up at volume. Past a thousand runs, a system made of nothing but agency starts to drift. Not crash: drift. The same task comes out slightly different today than yesterday, each version defensible, none identical, and the variance compounds run over run until you are debugging a system that has quietly rewritten its own behavior. Cognition reached the same place from the data side in Don't Build Multi-Agents: every action carries an implicit decision, and independent decisions conflict. Anthropic says it more gently, warning that autonomy brings "higher costs, and the potential for compounding errors." A chain of steps that each work nine times in ten does not work nine times in ten. Reliability is multiplicative, and unconstrained agency multiplies the wrong direction.

The second abstraction was the contract, and it is the one I see other serious people reaching for now, so it is worth being specific about why it failed me. The idea is borrowed from good engineering. Bertrand Meyer's Design by Contract says a component declares what it requires and what it guarantees, and you check those conditions at the boundary. Translated to agents: let the agent work freely, specify what a correct result looks like, and station a reviewer to enforce it. I built the reviewers. Agents writing, other agents grading, a whole bureaucracy of critics meeting a contract.

It does not hold, and the reason is structural, not a tuning problem. The reviewer is the same kind of thing as the worker: a model, sampling, fallible in the same direction on the same inputs. Asking it to catch the worker's mistakes is asking the unreliable thing to verify the unreliable thing. Google DeepMind's Large Language Models Cannot Self-Correct Reasoning Yet found that models asked to review and fix their own reasoning often make it worse once you remove the oracle that tells them when to stop. A contract enforced by a critic is correctness by inspection, and inspection inherits the unreliability you were trying to escape. The contract was not too weak. It was the wrong layer.

The contract was not too weak. It was the wrong layer. You cannot inspect your way to reliability when the inspector is made of the same stuff as the thing it inspects.

What the loop actually is

Here is the move that worked. You let the agent explore the task with full agency, once, or a few times until it has clearly found the path. Then you take that discovered path and distill it into a workflow: a concrete, named, deterministic routine that runs the same way every time, durably, with a log you can read. From then on the agent does not re-derive the task. It calls the routine. The agency was spent learning. The routine is what it learned.

This is the explore-then-exploit pattern from reinforcement learning, applied one level up. Exploration is expensive and you want it, once, to find the policy. After that you exploit the policy and stop paying the exploration cost on every repetition. An agent improvising a known task on every run is an agent that never stops exploring. That is not flexibility. That is a system that refuses to learn.

The agent explores once and the routine carries the result. Exceptions return to the agent, which re-distills. The loop is the method.

The criterion for what gets distilled is one sentence, the same test I use for skills versus routines: if I would be annoyed to discover the agent did this a slightly different way today, it belongs in a routine, not in the loop. Sending a payment is a routine. Posting to the right account with the right footer is a routine. Deciding whether today is a good day to post at all is judgment, and judgment stays with the agent. The agent keeps the decisions. The routine keeps the steps.

And a routine is not dead plumbing. A node inside one can stop and hand the model a single scoped question: classify this reply, rank these three drafts, decide if this is good enough to send. That is real reasoning inside a deterministic shell. The shell is fixed; the judgment in one node is not. This is bounded nondeterminism, and it is the honest version of the claim. The routine is deterministic in its shape, not in every token it ever emits. What stops varying is the structure of the work, even where a bounded decision inside it still flexes.

This is the direction the field is already moving

I did not invent the taxonomy and I want to be precise about what is mine. The workflow-versus-agent distinction is Anthropic's: a workflow orchestrates LLMs and tools through predefined code paths; an agent lets the model direct its own process. Crediting that is not modesty, it is the citation that makes the rest legible. What is worth naming is the loop that turns one into the other, and the research has been circling it from several directions at once:

Voyager (Wang et al., NeurIPS 2023) is the cleanest precedent. Its agent explores and crystallizes what works into an ever-growing skill library of executable code, retrieved and composed for later tasks. Discovered behavior becomes a stored, callable artifact. That is the distillation half, demonstrated years ago.
Agent Workflow Memory (Wang et al., ICML 2025) is almost the thing by name. It induces commonly reused routines, that is, workflows, from an agent's own past trajectories and feeds them back to guide future runs, offline and live. The academic vocabulary landed on the same word I did: routines.
AFlow (ICLR 2025) treats a workflow as code-represented nodes connected by edges and searches that space automatically. Its result belongs on a wall: a good workflow lets a smaller, cheaper model beat a frontier model running alone, at a few percent of the cost. The structure is worth more than the raw horsepower. That is the entire argument for n8n as the biggest tool.
DSPy (Khattab et al., 2023) makes the compiler explicit: write the program, and a compiler optimizes the pipeline rather than leaving you to hand-tune prompts by trial and error. Compile the program, do not improvise it.
The frame underneath all of it is Berkeley's compound AI systems position (Zaharia et al., 2024): state-of-the-art results increasingly come from systems with multiple components, not monolithic models, because improving control and trust is easier with systems. You cannot guarantee a model will avoid a behavior. You can guarantee a system does not give it the chance.

I am not claiming a discovery. Every one of these landed before me and most landed harder. What I can add is the operator's version: what this loop costs and returns when one person runs it every day, on a live system, for real stakes. And the part that is actually mine is not an idea at all. It is the workflows themselves. So I am going to start publishing them, the actual n8n graph and the scar that produced each one, here on this site. Not as a product. As field notes with the work attached.

The objections, and where the thesis actually bends

A method essay that does not argue against itself is marketing. Here are the three objections with teeth.

The Bitter Lesson. Rich Sutton's argument is that across seventy years of AI, hand-built structure loses to general methods that scale with compute. Am I not hand-building structure it says to delete? No, and the distinction is the whole defense. The Bitter Lesson is about capability: what the system can learn to do. This loop is about reliability: whether it does the same thing twice. Different axes. I am not hand-coding the intelligence. I am hand-fixing the parts that should never have varied, and even those are not hand-designed, they are distilled from what the agent itself discovered. The model stays at the frontier of every decision that needs judgment. The routine just stops it from reinventing the boring parts.

Anthropic says agents win at scale. The exact line is that agents are the better option when flexibility and model-driven decision-making are needed at scale, which looks like a head-on collision. It is not, because we mean two different things by scale. Anthropic means scale of decision complexity: open-ended problems where the next step genuinely cannot be drawn in advance. I mean scale of repetition: the thousandth time you do a task whose shape you already know. Both are true. Use an agent when the problem is open. Distill a routine when the problem is settled and only the volume is growing. The skill is telling which situation you are in.

Routines ossify. The world changes underneath a fixed procedure and a brittle routine breaks silently, a failure mode I have eaten more than once. This is why distillation is a loop and not a one-way trip. The agent does not get fired when the routine ships. It stays in the loop as the thing that handles the exceptions the routine was not built for, and when a routine starts failing, the agent is what notices, falls back to exploring, and re-distills the corrected version. The agent earns the routine and keeps the keys to revise it. A distilled system that cannot re-distill is just technical debt with good intentions.

Why the routines have to be yours

The vendors have noticed all of this. The same agent-to-routine pattern is showing up inside every major platform now, as managed skills, hosted routines, cloud-run procedures. The pattern is correct. The packaging is a fence. When your distilled routines live inside someone else's walled garden, the most valuable thing your system produced, the hardened record of everything your agent learned to do, is held on infrastructure you rent and cannot leave with.

The counter-model is old and well tested. HashiCorp built Terraform as an open tool practitioners adopted because it was good, and the company that grew on top of it grew because the value had already accrued to the people running it, not because anything was locked in. My routines run on n8n on a server in my apartment. They are files I can read, move, and version. That is not ideology, it is the same instinct as the rest of this essay: the durable asset is the distilled routine, so own the distilled routine. The agent is the loop. n8n is its primary tool. The library of routines is the part that compounds.

Owning them is also what lets me show them. A routine I rent, I cannot publish. A routine that is a file on my disk, I can hand to you. So the ones that are not tangled up in my own life and credentials, I am starting to put up here, the actual graph and all, with the scar that produced each one. If a thing I distilled saves you the week it cost me, that is the whole point, and it is also the only kind of credibility I have ever trusted in anyone else.

The test

If you are building one of these systems, the operating rule fits on a card. Give the agent agency to learn the task. Watch it do the task the same way twice. The moment it does, take the agency back and keep what it learned as a routine. Reserve the agent for the next thing it has not learned yet, and for the day a routine breaks.

The agent is learning. The workflow is learned. A system that cannot tell the difference will either drift, because it never stops exploring, or ossify, because it never explores again. The method is just refusing to do either.

Some operational details in these essays have been changed for narrative or privacy reasons. The arguments, the numbers, and the lessons are real.

Work with the agent until it works. Then make it a workflow.