Architecture · 2026-06-02

Auto-decomposition, multi-model review, and quality gates

Most of what people call "agentic" is a single model answering a single prompt. The interesting capability is not that. It is what happens when you hand one agent a large, vague task and it turns the task into a graph of smaller jobs, runs the pieces on the models best suited to each, and refuses to call the work done until a second model has tried to break it. I run exactly one agent. The three things below are how it takes on a job too big for one pass without me sketching the steps. None of them is a model trick. All three are architecture.

Illustration: a hand-drawn diagram in which a single ink line enters from the left, branches into parallel paths through different small shapes, and converges at a terracotta gate, with a return arrow looping one rejected path back to the start.

Principle one: the agent decomposes, you do not

The first principle is that the agent owns the breakdown. You describe the outcome in plain language, and the agent decides what jobs exist, what depends on what, and what can run in parallel. You do not enumerate the steps. The moment you find yourself writing the dependency graph by hand, you have built a workflow with a chat box on the front, not an agent.

This is the line Anthropic draws in Building Effective Agents: an agent is a model using tools in a loop, and in their orchestrator-workers pattern the orchestrator is "a central LLM" that "dynamically breaks down tasks" and delegates the pieces. The deciding lives in the model, not in routing code. Decomposition is the clearest expression of that. The agent reads the goal, names the sub-tasks, infers their order from the data each one needs, and only then dispatches. Getting the dependency inference right is the difference between a chatbot and a system that can carry real work.
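One way to picture "infers their order from the data each one needs" is a minimal sketch in which every job declares the artifacts it reads and writes, and the edges fall out of that. In a real agent the model proposes the jobs; the `Job` class, the job names, and the artifact names here are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    needs: frozenset[str] = frozenset()     # artifacts this job reads
    produces: frozenset[str] = frozenset()  # artifacts this job writes


def infer_dependencies(jobs: list[Job]) -> dict[str, set[str]]:
    """Map each job name to the set of jobs whose outputs it consumes."""
    producer_of: dict[str, str] = {}
    for job in jobs:
        for artifact in job.produces:
            producer_of[artifact] = job.name
    deps: dict[str, set[str]] = {}
    for job in jobs:
        # An artifact nobody produces is treated as an external input.
        deps[job.name] = {producer_of[a] for a in job.needs if a in producer_of}
    return deps


# Hypothetical jobs; nobody wrote the edges by hand.
jobs = [
    Job("collect", produces=frozenset({"records"})),
    Job("scan", produces=frozenset({"prior_art"})),
    Job("analyze", needs=frozenset({"records"}), produces=frozenset({"stats"})),
    Job("draft", needs=frozenset({"stats", "prior_art"}), produces=frozenset({"paper"})),
]
deps = infer_dependencies(jobs)
```

The point of the design is that the graph is derived, not authored: change what a job needs and its place in the order changes with it.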

The decomposition has to be honest about two distinct things: ordering and parallelism. Ordering is "B needs A's output, so B waits." Parallelism is "C needs nothing A produces, so C runs now." Conflating them is the common failure. A system that serializes everything is slow and brittle; a system that fans everything out collides on shared state. The agent has to read the actual data dependencies and only fan out where the work is genuinely independent. Google Research makes the empirical version of this argument in Towards a Science of Scaling Agent Systems: default to one path, fan out only when the work is truly parallel.
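The A, B, C example above can be sketched with Python's standard-library `graphlib`. This is an illustration of the ordering and parallelism distinction under the assumption that the dependencies are already known, not a description of any particular agent's scheduler; `parallel_waves` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into waves: each wave holds jobs whose inputs are all ready."""
    sorter = TopologicalSorter(deps)  # maps each job to the jobs it waits on
    sorter.prepare()                  # raises graphlib.CycleError on a cycle
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # mark finished, unblocking dependents
    return waves


# B needs A's output, so B waits. C needs nothing A produces, so C runs now.
deps = {"A": set(), "B": {"A"}, "C": set()}
waves = parallel_waves(deps)
# waves == [["A", "C"], ["B"]]
```

Jobs in the same wave are candidates for fan-out; jobs in later waves are the ordering constraint. Whether a wave is actually safe to run concurrently still depends on shared state, which the graph alone does not show.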

Principle two: route each piece to the model that fits it

The second principle is that one agent does not mean one model. Decomposition produces a set of jobs with different shapes: collecting structured data from an API, embedding text into vectors, drafting long-form prose, judging whether that prose is any good. Those are not the same task, and the same model is rarely the best choice for all of them. Routing per job is how one agent stays one agent while still using the right tool for each piece.

This is model swapping, not agent swapping, and the distinction is the whole point. There is still one loop, one orchestrator, one memory. What changes between jobs is which model sits in the seat. A cheap, fast model collects and scans. A capable model drafts. A different model family reviews. The agent never hands its loop to another agent; it hands a bounded job to a model and waits for the result. The freedom to route models is the actual moat: in a study across 6,442 production jobs, running the same model under a different harness moved results by 0.19 points, while swapping models under the same harness moved them 37 times as much. If the choice of model matters that much, the architecture has to make the choice cheap to vary, job by job.
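Making the choice "cheap to vary, job by job" can be as plain as a lookup table that the single orchestrator consults before each dispatch. The sketch below assumes that shape; the job kinds, model names, and family labels are placeholders, not real model identifiers or any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # placeholder model name, not a real identifier
    family: str  # placeholder label for the model family


# Hypothetical routing table: one orchestrator, a different model per job shape.
ROUTES: dict[str, Route] = {
    "collect": Route("small-fast-model", "family-a"),
    "embed": Route("embedding-model", "family-a"),
    "draft": Route("capable-model", "family-a"),
    "review": Route("reviewer-model", "family-b"),
}


def route(job_kind: str) -> Route:
    """Pick the model for a job; fail loudly rather than fall back silently."""
    try:
        return ROUTES[job_kind]
    except KeyError:
        raise ValueError(f"no route configured for job kind {job_kind!r}") from None


draft_route = route("draft")
review_route = route("review")
```

Swapping the model for one kind of job is a one-line change to the table, and the loop that calls `route` does not change at all.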

The 12-factor agents authors make the deployment-side version of this point: small, stateless, composable units beat one monolith, because you can place each unit where it runs best. Per-job routing is that idea applied inside a single agent's task graph.

Principle three: a different model has to sign off

The third principle is the one most people skip, and it is the one that keeps the output honest. Single-pass model output is a draft. Treating it as final is how you ship embarrassing work. So the last node in any real pipeline is a review pass, and the review has to be run by a different model from the one that produced the work.

This matters because a model reviewing its own output is graded by the same blind spots that produced it. It reads its draft and it looks right, because "looks right to the model that wrote it" is exactly the property that got the draft written. A model from a different family, with different training and different failure modes, questions assumptions the author never thought to question. This is the discipline Hamel Husain keeps pointing at when he argues that evals, not vibes, are what make LLM systems reliable. The review pass is an eval with teeth: it can send the work back. A gate that cannot reject anything is decoration.
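A gate that can send work back is a small loop. The sketch below is one possible shape, with the model calls stubbed out as plain functions so it runs as written; `run_with_gate`, `Verdict`, and the stub author and reviewer are hypothetical names, and a real system would put model calls where the stubs are.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    approved: bool
    issues: list[str]


def run_with_gate(
    task: str,
    author: Callable[[str, list[str]], str],
    reviewer: Callable[[str, str], Verdict],
    author_family: str,
    reviewer_family: str,
    max_rounds: int = 3,
) -> str:
    """Draft, review, and redraft until the reviewer approves or rounds run out."""
    if author_family == reviewer_family:
        raise ValueError("reviewer must come from a different model family")
    feedback: list[str] = []
    for _ in range(max_rounds):
        draft = author(task, feedback)
        verdict = reviewer(task, draft)
        if verdict.approved:
            return draft
        feedback = verdict.issues  # the rejection goes back to the author
    raise RuntimeError(f"gate rejected every draft in {max_rounds} rounds")


# Stubs standing in for model calls (assumptions for illustration only).
def stub_author(task: str, feedback: list[str]) -> str:
    return f"{task} [revised]" if feedback else f"{task} [first pass]"


def stub_reviewer(task: str, draft: str) -> Verdict:
    if "[revised]" in draft:
        return Verdict(approved=True, issues=[])
    return Verdict(approved=False, issues=["claim restates its own premise"])


final = run_with_gate("summarize", stub_author, stub_reviewer, "family-a", "family-b")
```

Two choices carry the argument: the function refuses to run when author and reviewer share a family, and it raises instead of returning an unapproved draft when the rounds run out.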

A model grades its own work against its own blind spots. The gate only works when a different model can say no.

The pattern, drawn once

Fig. 1. One instruction in, a dependency graph out. Each job routed to a fitting model, and a different model gating the result before anything ships.

The pattern under load: 2,102 obituaries

The clearest run of this pattern I have on record started from one casual message: take the New York Times obituaries dataset and do something with it. The agent decomposed that into six jobs in about four minutes. It pulled 2,102 obituaries from the NYT API and embedded them into a 2,102 by 1,536 float32 matrix; it scanned four subreddits in parallel for prior art and surfaced a 2025 PNAS paper (Markowitz et al.) that analyzed paid family notices, which sharpened the question toward editorial selection instead; it ran the statistical analysis and clustering; it drafted an IEEE-format paper on a capable model; and then a separate job on a different model family reviewed that draft. The review caught twelve issues, including a tautology the drafting model had written and read as a finding (that "widow of" appears only in female obituaries) and fifteen em dashes that needed replacing. Ordering held (collection before analysis before drafting), parallelism held (the Reddit scan ran alongside collection, the video branch forked off the analysis without waiting for the paper), and the gate had teeth (the shipped paper was meaningfully better than the first draft, and it surfaced its real limits, including a gender classifier that failed on 43.8% of records). End to end: two hours and twenty minutes, no research assistants, one instruction.

Where it goes wrong

The honest failure modes are all in the seams between the three principles. Decomposition can fan out work that shares state and produce a race; the fix is to make data dependencies explicit and serialize anything that touches the same store. Cognition describes the same hazard in Don't Build Multi-Agents: parallel work makes conflicting decisions that nobody reconciles. Routing can send a job to a model that is cheap but wrong for the shape, and you only notice three jobs downstream. And the gate is only as good as the model behind it: a reviewer from the same family as the author shares the blind spots and waves the work through, so the gate has to cross a real capability or training boundary to mean anything.
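For the first failure mode, "serialize anything that touches the same store" can be sketched as a greedy split of the ready jobs into batches that share no store. This is an assumption about one simple way to do it, deliberately conservative in that reads and writes are treated alike; the function name, job names, and store names are hypothetical.

```python
def split_by_shared_state(
    ready: list[str], touches: dict[str, set[str]]
) -> list[list[str]]:
    """Split runnable jobs into batches so no two jobs in a batch share a store.

    Jobs within a batch can run in parallel; batches run one after another.
    """
    batches: list[tuple[list[str], set[str]]] = []  # (jobs, stores in use)
    for job in ready:
        stores = touches.get(job, set())
        for jobs, in_use in batches:
            if in_use.isdisjoint(stores):
                jobs.append(job)
                in_use |= stores  # updates the batch's set in place
                break
        else:
            # Every existing batch conflicts, so this job waits its turn.
            batches.append(([job], set(stores)))
    return [jobs for jobs, _ in batches]


# Hypothetical wave: two jobs touch the same database, one does not.
ready = ["scan", "collect", "index"]
touches = {
    "scan": {"notes"},
    "collect": {"records_db"},
    "index": {"records_db"},
}
batches = split_by_shared_state(ready, touches)
# batches == [["scan", "collect"], ["index"]]
```

The cost of this caution is some lost parallelism; the benefit is that the race has to be declared away explicitly instead of being discovered downstream.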

If you are building this, start with the gate, not the decomposition. Pick one large task you already run as a single prompt, and add exactly one step: a second model that reads the output and is allowed to reject it. Watch what it catches. That one change will teach you more about where your pipeline is lying to itself than any amount of upstream planning, and it is the cheapest of the three principles to add. Decomposition and routing are how the system gets fast. The gate is how it stays honest.

Some operational details in these essays have been changed for narrative or privacy reasons. The arguments, the numbers, and the lessons are real.