Failure Postmortem · 2026-05-22

The API said success.
The work never happened.

For three weeks, our automated publishing system reported a hundred percent success rate. Every dispatch returned a green checkmark. Every metric was healthy. Nobody had commented in three weeks.

A large green checkmark glowing in front of an empty room where someone should be working.

The technical mechanic is not the interesting part. One library, one function call, one state the library did not check. A one-line fix. The interesting part is the three weeks, and why every check the system ran on itself stayed green for all of them.

The worst class of production bug is the kind that confirms itself

Loud failures are easy. A crash, a 500, a queue that backs up. Somebody gets paged, a dashboard turns red, the bad commit gets found by lunch. This was the other class. The failure had the same shape as success. The dispatch function returned the right shape, the success counter incremented, the next layer trusted that, and the layer after trusted that. By the time the signal reached me it was wearing a uniform.

Every automation stack I have worked with has a version of this bug waiting somewhere. The teams that find it before their customers do are the ones that built reading-the-world-back into the contract from day one. It is the distinction Charity Majors draws in Honeycomb's observability manifesto: a system you can only ask about its own internal state is not observable, because the questions that catch this bug are the ones you did not think to instrument. The teams that skip it get a call from a customer asking why nothing has shipped since October.

Dispatch is not effect

Every system that does work on your behalf produces two observable events. Dispatch: I sent the instruction. Effect: the world is now different in the way I asked it to be. In a healthy system the two are coupled. The instruction lands, the world changes, the response confirms it.

Most of the time you only watch one of them, usually dispatch, because dispatch is cheap. The function returned, no exception was thrown, the response has the right keys. Effect is expensive: a second request, more code, more failure modes. So engineers instrument the cheap event and assume the expensive one tracks it. This is the same gap Dexter Horthy and the 12-factor agents authors are pointing at when they argue an agent's control flow and its observed state must be owned explicitly, not inferred from a hopeful return value. It is also why Anthropic's Building Effective Agents keeps returning to the same discipline: a tool call is not done until you have read back what it actually did.

For most systems on most days, the assumption holds. The day it does not is the day you discover your reliability dashboard has been measuring how often you asked for something to happen, not how often it actually did.

Two events, one of them watched
	Dispatch	Effect
What it claims	The call returned and the counter incremented.	The artifact exists where the customer would see it.
Cost to observe	Cheap. Observed by default.	Expensive. Usually skipped.
What I had instrumented	Everything.	Nothing.

Fig. 1. Most production systems measure dispatch and assume effect tracks it. The silent gap lives in the right column.

The fix is one line. The lesson took three weeks.

The mechanical fix to the specific incident was one line of code, in front of one call, in one library. Trivial.

The repair to the way I think about reliability cost the three weeks the bug ran undetected.

That is the actual price tag of this class of bug. Not the engineering time to fix it. The time before you knew you needed to.

Once you have the shape in your head, you start seeing it everywhere. A payment integration that logs charge.created and never reconciles against the processor's settled-transactions report. An email pipeline whose send queue drains cleanly while the messages sit in a spam folder nobody is reading. An AI agent that reports a tool call as complete because the function returned, when the tool quietly no-opped on a malformed argument. Each one is the same bug with a different uniform: dispatch instrumented, effect assumed. In every case the dashboard is green and the work did not happen, and in every case the only honest fix is a second observation against the system you do not control.

If your team ships automation, audit one thing this week

For every external write your system performs, ask one question. Is there a corresponding read, on a different code path, that confirms the artifact exists in the world the way you expect it to?

Not "did the call succeed." Not "was the response 200." A second observation, taken later, against the same external system, that proves the change is real.

If the answer is no for any path that matters, you do not have a publishing system or a payment system or an agent. You have a system that reports on itself.

Most teams discover this the same way I did. A customer asks why something did not happen. The team looks at the logs. The logs say it happened. The team looks again. The logs still say it happened. Someone, eventually, walks over to the third-party system and looks. The logs were wrong.

That walk-over is the read-back. The lesson of the incident is to put the walk-over inside the contract, before the customer has to. This is the operational edge of the discipline Hamel Husain writes about under evals: you do not get to trust a system you have not measured against ground truth, and a green dispatch is not ground truth.

Three rules for a system that does not lie to itself

If I were writing the design doc for a system that does work against third parties from scratch tomorrow, these would be the three rules.

Every write has a paired read

Against the third party, on a different code path, as part of the same logical transaction. If the read says the artifact is not there, the write counts as failed, no matter what the dispatch returned.

The read happens against the external world, not the cache

If the system you do not control says the artifact is not there, the artifact is not there. Local state does not get a vote. Simon Willison makes a version of this point repeatedly in his practitioner notes on LLM tools: the only state you can trust is the state you fetched back, not the state your own code believes it left behind.

Silent zero is the loudest signal

Going from one comment a day to zero is not an outage in any traditional sense. No errors, no timeouts, nothing the load balancer would notice. It is, however, the exact shape of this bug. Alert on the floor as loudly as you alert on the ceiling.

None of these are clever. All of them are unglamorous. All of them get descoped from MVPs because they double the request volume against systems that are already metered by the second. That is the objection I hear most, and it is real: a paired read is a second request, and second requests cost money and latency. But the math only looks expensive until you price in the three weeks. A read-back that runs on every write is a known, bounded, line-item cost. A silent failure that runs until a customer notices is an unbounded one, paid in trust, after the fact, with interest.

So pick one external write your system does today. The one whose failure would embarrass you most. Add the paired read this week. Make a silent zero page someone. The systems I trust are the ones that pay the cost anyway, before they have a three-week story to tell.

The API said success. The work never happened.