The agent I run, one model in a loop with many tools, has executed over 6,442 jobs in production. At that volume the failure mode is not the dramatic hallucination. It is the quiet pass: a job that completes, saves a result that reads correctly, and marks itself done while being wrong in a way only a second set of eyes catches. An eval layer is the second set of eyes, made structural. A certification gate is the rule that no job earns the status done until that second set of eyes has signed off.
What this subsystem does
It decides what "done" means, per job, and refuses to let the agent self-certify. Every unit of work carries an explicit success_criteria field, and no job transitions to done until a separate evaluation step has checked the output against those criteria and returned a score above threshold. The agent that did the work never grades its own paper.
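A minimal sketch of that rule in Python, under stated assumptions. Only the success_criteria field and the done status come from the description above; the Job and Review classes, the model names, and the 0.8 threshold are hypothetical placeholders, not the production schema.

```python
from dataclasses import dataclass
from typing import Optional

PASS_THRESHOLD = 0.8  # hypothetical value; the real threshold is not stated


@dataclass(frozen=True)
class Review:
    reviewer_model: str  # model that graded the output
    score: float         # 0.0 to 1.0, scored against success_criteria


@dataclass
class Job:
    worker_model: str
    success_criteria: list[str]
    status: str = "running"
    review: Optional[Review] = None

    def __post_init__(self) -> None:
        # Criteria are mandatory at creation time; blank strings do not count.
        if not any(c.strip() for c in self.success_criteria):
            raise ValueError("success_criteria needs at least one criterion")

    def mark_done(self) -> None:
        # Refuse the transition unless an independent review cleared the bar.
        if self.review is None:
            raise PermissionError("no review attached; a job cannot self-certify")
        if self.review.reviewer_model == self.worker_model:
            raise PermissionError("reviewer must differ from the worker model")
        if self.review.score <= PASS_THRESHOLD:
            raise PermissionError("review score is not above threshold")
        self.status = "done"
```

The point of the design is that the only path to done runs through mark_done, and mark_done has no branch that succeeds without a review from someone other than the worker.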
What it looked like before, and the failure that ended it
Before this, the system was tuned for throughput, and throughput was winning over verification. A job finished its first pass, the result looked plausible, and the status flipped to done. The book was the bill for that: my daughter's first-birthday book, with her name spelled wrong in gold. The name was not the only thing the pipeline waved through in that stretch, just the one I could not stop seeing. A children's learning task shipped a page reading "Here is airplane" instead of "Here is the airplane", and another duplicated half its cards. Same root cause every time: a task closed after one decent-looking pass, with nobody, human or model, assigned to check the output against what it was actually supposed to be. The model was sure on all of them. Sure is not the same as correct.
This is the exact gap Hamel Husain writes about when he argues that evals are not a launch-day ritual you perform once and retire. They are a continuous production loop, the discipline that separates a system you can trust from a demo that happened to work the day you showed it. Anthropic makes the same case from the vendor side in Demystifying Evals: the work is not a one-time benchmark, it is the ongoing measurement that tells you whether a change helped or hurt. The book was what that looks like when the loop is missing. The output was confident, plausible, and unmeasured, and unmeasured is where the bugs live.
The architecture: a completion flow with a certification step
The fix was to make completion a flow rather than a flag. A job no longer flips its own status. It hands its result to a separate review job, on a different model, that scores it against the job's criteria before anything is allowed to call itself done.
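The hand-off can be sketched as a small state machine. This is an illustration under assumptions, not the production flow: the in_review and needs_rework status names, the 0.8 threshold, and the toy_reviewer stand-in (which replaces a real call to a second model) are all hypothetical.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    RUNNING = "running"
    IN_REVIEW = "in_review"        # hypothetical name
    DONE = "done"
    NEEDS_REWORK = "needs_rework"  # hypothetical name


# A reviewer takes (result, criteria) and returns a score between 0 and 1.
Reviewer = Callable[[str, list[str]], float]


def complete(result: str, criteria: list[str], worker_model: str,
             reviewer_model: str, reviewer: Reviewer,
             threshold: float = 0.8) -> list[Status]:
    """Return the status trail for one job: the worker submits, a reviewer decides."""
    if not criteria:
        raise ValueError("a job needs written criteria before it can be reviewed")
    if reviewer_model == worker_model:
        raise ValueError("review must run on a different model than the work")
    # The worker's last legal move is handing the result over for review.
    trail = [Status.RUNNING, Status.IN_REVIEW]
    score = reviewer(result, criteria)
    trail.append(Status.DONE if score > threshold else Status.NEEDS_REWORK)
    return trail


def toy_reviewer(result: str, criteria: list[str]) -> float:
    # Stand-in for a real model call: fraction of required phrases present.
    return sum(1 for c in criteria if c in result) / len(criteria)
```

Called with the result "Here is airplane" against the criterion "Here is the airplane", the trail ends in NEEDS_REWORK rather than DONE, because no branch lets the worker write the final status itself.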
The load-bearing decisions
1. Certification is per type, not one rubric for everything. A code task and a writing task fail in different shapes, so they are certified differently. Code tasks must be tested and reviewed by a different model. Writing tasks must be scored against a rubric by a different model. Research tasks must be cross-checked before they can be marked complete. Project-level work cannot be marked done by the system at all: it moves to pending_approval and waits for my explicit sign-off through the dashboard. The gate matches the failure mode. A passing test suite certifies code. It says nothing useful about whether a sentence is true.
2. The certifier is a different model from the one that did the work. This is the part that earns its cost. Multi-model review is not mainly about catching hallucinations, though it does. It is about catching plausibility. A single model produces output that looks right to itself, because the same weights that generated it are the ones judging it. A second model, especially a different model family, questions assumptions the first one never surfaced. This is the operational cousin of the argument Anthropic makes in Building Effective Agents for separating the generator from the evaluator inside an augmented loop: the check has to come from outside the thing being checked.
3. The job-level judge is cheap on purpose, and runs on everything. Every coding-agent job carries an implicit eval underneath the certification gate: did the output compile, pass tests, and match the spec. A Haiku-tier judge runs after each completion and checks three things. Was the success criterion met. Were any new problems introduced. Does the output need a follow-up job. I chose a cheap, fast classifier deliberately. Full coverage at the cost of some false negatives beats perfect accuracy on a 10 percent sample with blind spots everywhere else. An eval you can only afford to run sometimes is an eval that is off most of the time. This is the same trade Eugene Yan describes from the applied-ML side: cheap automated judges, run on everything, beat expensive review run on a slice.
4. Memory is evaluated, not just stored. The same LLM-judge pattern runs as a periodic audit over the agent's memory. Every candidate fact is scored on four criteria: specificity (is it a real fact or vague noise), accuracy (does it contradict something already known to be true), staleness (is it still true), and utility (would it actually help a future session). Memory that fails the audit gets dropped. A persistent system without a memory eval feels sharp for a week and grows quietly wrong after that, because nothing is checking whether what it remembers is still real.
5. The success criteria are part of the task schema, not an afterthought. A certification gate is only as good as the criteria it certifies against. So the criteria are mandatory fields, populated when the job is created, not improvised at review time. The reviewer is not asked "is this good." It is asked "does this meet these specific, written-down criteria," which is a question a different model can answer consistently.
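Decisions 1, 2, and 5 can be read together as a single policy table. The sketch below is one way to encode it, assuming Python. The job types and the pending_approval status come from the text; the check names, the Submission class, and the needs_review status are hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass

# Required checks per job type, following the per-type rules above.
# The check names themselves are illustrative labels.
REQUIRED_CHECKS: dict[str, frozenset[str]] = {
    "code": frozenset({"tests_passed", "cross_model_review"}),
    "writing": frozenset({"cross_model_rubric_score"}),
    "research": frozenset({"cross_check"}),
}


@dataclass(frozen=True)
class Submission:
    job_type: str
    worker_model: str
    reviewer_model: str
    success_criteria: tuple[str, ...]
    passed_checks: frozenset[str]


def certify(sub: Submission) -> str:
    """Return the status a finished job is allowed to move to."""
    if not sub.success_criteria:
        raise ValueError("success_criteria is mandatory")
    if sub.job_type == "project":
        # Project-level work is never closed by the system itself.
        return "pending_approval"
    if sub.job_type not in REQUIRED_CHECKS:
        raise ValueError(f"no certification policy for type {sub.job_type!r}")
    if sub.reviewer_model == sub.worker_model:
        return "needs_review"  # self-grading does not count as certification
    missing = REQUIRED_CHECKS[sub.job_type] - sub.passed_checks
    return "done" if not missing else "needs_review"
```

Keeping the policy in a table rather than in branching code means adding a new job type forces an explicit decision about what certifies it; an unknown type raises instead of silently passing.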
The numbers
The gate exists because I have the failure distribution to justify it. When I categorized why jobs go wrong across the running system, optimistic completion accounted for 19 percent of failures: a job marking itself done after one decent-looking pass instead of verifying against its success criteria. That is nearly one failure in five attributable to a single cause, and that single cause is exactly what the certification step removes. The structural fix was the mandatory success-criteria field plus judge evaluation before any status update. The before-and-after is not a vibe. It is a fifth of the failure surface, named and closed.
The cost side is real too. A certification gate roughly doubles the model calls on every job that passes through it, and adds latency, because the reviewer is a second job in the queue. I pay it because the alternative cost, my daughter's name spelled wrong in gold on the one book she gets for her first birthday, is the kind of cost that never shows up on the API bill. Cheap, full-coverage judges keep the recurring cost low enough to run on everything. The expensive certification is reserved for the types that warrant it.
What this is, and what it is not
This is not ML model evaluation in the benchmark sense. I am not computing accuracy on a held-out set. It is closer to operations applied to a non-deterministic system: continuous, in-production measurement of whether each unit of work met a written standard, with transcript-level review on the cases that fail. The insight is that an AI system is also a software system, and the discipline that keeps software healthy (measure continuously, fail loudly, certify before you ship) keeps an agent healthy too. The certification gate is just that discipline made unskippable.
Where it could still go wrong
The honest failure class I have not fully solved is the correlated reviewer. A different model from the same lab, trained on overlapping data, can share a blind spot with the model it is grading, and then the gate certifies a wrong answer with full confidence on both sides. Different model families reduce this; they do not eliminate it. The structural answer, the one Anthropic gestures at in Demystifying Evals, is a small set of human-labeled gold examples the automated judges are themselves measured against, so the grader's drift becomes visible. Building that gold set is the next piece of work. Until it exists, I treat a unanimous pass from two models as strong evidence, not proof, and the highest-stakes work still routes to pending_approval and waits for a human.
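Measuring a judge against a gold set could look something like the sketch below. The gold set described above does not exist yet, so this is purely illustrative: GoldExample, judge_report, and the two metrics are hypothetical choices, and no real results are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldExample:
    human_pass: bool  # label from a human reviewer
    judge_pass: bool  # the automated judge's verdict on the same output


def judge_report(examples: list[GoldExample]) -> dict[str, float]:
    """Score an automated judge against human labels."""
    if not examples:
        raise ValueError("gold set is empty")
    agree = sum(e.human_pass == e.judge_pass for e in examples)
    human_fails = [e for e in examples if not e.human_pass]
    false_passes = sum(e.judge_pass for e in human_fails)
    return {
        # Share of examples where judge and human reached the same verdict.
        "agreement": agree / len(examples),
        # Share of human-rejected outputs the judge waved through.
        "false_pass_rate": false_passes / len(human_fails) if human_fails else 0.0,
    }
```

The false-pass rate is the number to watch for this gate, because it counts exactly the quiet pass: output a human rejected and the judge certified. Tracking it over time is what would make the grader's drift visible.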
If you are building a single agent that ships work without you in the loop, the move is simple and unglamorous: write down what done means for each kind of task before the agent runs, and have a different model certify against that standard before any job is allowed to call itself finished. The cheap judge runs on everything. The expensive certification runs where the failure would hurt. The agent never grades its own paper. Start with the type that has already burned you once.
Some operational details in these essays have been changed for narrative or privacy reasons. The arguments, the numbers, and the lessons are real.