Failure Postmortem · 2026-05-22

Split reads from writes: how I cut my agent's Chrome cost 90%.

For a few weeks this spring my agent was paying full price for a headless Chrome session every fifteen minutes just to read comments. Not write them. Not reply. Read. When I described out loud what was happening, even I was embarrassed for it.

[Illustration: a single channel forking into two diverging paths. The upper branch passes through a heavy bolted mechanical valve; the lower branch is a thin, light, open line.]

The setup that produced this story was the cookie-transplant pattern, the only thing that publishes a TikTok comment without tripping the platform's bot filter. You launch a headless Chromium. You transplant the host browser's session cookies into a child context driven by Patchright, a patched Playwright build that closes the automation leaks Cloudflare and friends look for. You drive it to the page. The platform fingerprints something that looks like a real human session. The publish goes through.

Beautiful primitive. Worked the first time.

So when the next worker came along and needed to read comment threads on the same videos, I sent it through the same bridge. Same auth. Same selectors. One place to handle the cases where the platform's anti-automation layer leaned on me. One bridge for everything TikTok. Tidy.

Tidy is a feeling. Performance is a number. The number was ten seconds per read.

The cost I couldn't see

A Chromium spin-up. The cookie transplant. The navigation. The DOM scrape. The teardown. Seven to twelve seconds per round trip in steady state. Longer when the platform decided today was a fingerprinting day. The agent was polling forty-comment threads on tracked videos every fifteen minutes, and a browser tab was open for most of the agent's life, idling, just waiting for the next poll to ring its bell.

The whole machine had this idle anxiety to it that I couldn't shake until I named it.

Then I named it. The rare expensive path was setting the rules for the frequent cheap path.

A publish on TikTok happens, in the agent's life, when somebody comments something that crosses the reply-worth threshold. That is a few times a day. A read happens every fifteen minutes per tracked video, on schedule. The cost of those two things is not symmetrical, and the rule for designing them shouldn't be either.

Bot detection is a write-side tax. You pay it when you have to. You don't pay it when nobody is asking.

One commit

The fix was one commit. The function that reads comment threads, tiktokInboxRead in the tool server, stopped routing through the browser bridge. It started hitting /api/comment/list/ directly. No cookies. No session. No headless Chrome. A plain HTTP request against the same JSON endpoint the platform's own web client uses when an anonymous visitor scrolls a public page.
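As a rough sketch of what a read path like this can look like (in Python for brevity, not the tool server's actual code): only the /api/comment/list/ path comes from the text above. The host, the query parameter names, and the function names other than a snake_case echo of tiktokInboxRead are assumptions for illustration, and the real endpoint may expect more than this.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Callable

# Assumption: host and query parameter names are illustrative placeholders.
# Only the /api/comment/list/ path is taken from the post.
BASE_URL = "https://www.tiktok.com"
COMMENT_LIST_PATH = "/api/comment/list/"


def build_comment_list_url(video_id: str, count: int = 40, cursor: int = 0) -> str:
    """Build the read URL. The parameter names here are hypothetical."""
    query = urllib.parse.urlencode(
        {"aweme_id": video_id, "count": count, "cursor": cursor}
    )
    return f"{BASE_URL}{COMMENT_LIST_PATH}?{query}"


def http_get_json(url: str, timeout: float = 10.0) -> Any:
    """Plain GET: no cookie jar, no session, no browser."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def tiktok_inbox_read(video_id: str, fetch: Callable[[str], Any] = http_get_json) -> Any:
    """Read one comment thread over plain HTTP.

    `fetch` is injectable so the function can be exercised without the network.
    """
    return fetch(build_comment_list_url(video_id))
```

Keeping the transport behind a one-argument callable is a small design choice: the read path stays a plain function of a URL, with nothing in it that could accidentally reach for a session.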

The platform's bot detection didn't care. There was no session to fingerprint. There was no cookie to recognize as automated. There was no browser identity to flag. There was a request, there was a response, and the response was a JSON array of comments matching the shape the web client wanted. That shape is the shape the agent wanted too.

After the split, per-read latency dropped from about ten seconds to under one. The browser tab on the host machine closed and stayed closed except for the rare publish event. The dashboard stopped feeling heavy.
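To put rough numbers on that drop, here is a back-of-the-envelope calculation. The fifteen-minute cadence and the ten-second and one-second figures come from the text; the count of five tracked videos is an assumption standing in for "a handful", and one second is used as an upper bound for "under one".

```python
POLL_INTERVAL_MIN = 15   # from the post: one read per tracked video every 15 minutes
BRIDGE_READ_S = 10.0     # from the post: about ten seconds per read through the browser
DIRECT_READ_S = 1.0      # from the post: "under one" second, treated as an upper bound
TRACKED_VIDEOS = 5       # assumption: the post only says a handful of videos


def reads_per_day(videos: int, interval_min: int = POLL_INTERVAL_MIN) -> int:
    """Scheduled reads per day across all tracked videos."""
    return videos * (24 * 60 // interval_min)


def daily_read_seconds(videos: int, seconds_per_read: float) -> float:
    """Total seconds per day spent on reads at a given per-read latency."""
    return reads_per_day(videos) * seconds_per_read


before = daily_read_seconds(TRACKED_VIDEOS, BRIDGE_READ_S)  # 480 reads * 10 s = 4800 s
after = daily_read_seconds(TRACKED_VIDEOS, DIRECT_READ_S)   # 480 reads * 1 s = 480 s
reduction = 1 - after / before                              # 0.9

print(f"reads/day: {reads_per_day(TRACKED_VIDEOS)}")
print(f"before: {before / 60:.0f} min/day, after: {after / 60:.0f} min/day")
print(f"reduction: {reduction:.0%}")
```

Under these assumptions the read path goes from about 80 minutes of browser time a day to at most 8, and the ratio does not depend on how many videos are tracked.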

Tone, and what a primitive is paying for

I want to say more about that feeling, because the architectural lesson lives there.

Software systems develop a tone. A system that spins up a browser to read a comment thread sounds heavy. You can hear it in the logs, in the worker memory graph, in the slight hesitation before every response. A system that fetches a JSON endpoint sounds light. The same job, the same payload, but the tone is different.

Tone tracks what a primitive is paying for. You can hear when you are overpaying.

The shape that generalized

The asymmetry isn't TikTok-specific. Every platform with a meaningful bot-detection surface treats reads and writes differently, because the abuse model is different. Scraping a public read endpoint is allowed-ish, in practice if not by terms. Posting without a recognized session is not. Reddit is the same. Substack is the same. LinkedIn is the same.

And it isn't even browser-specific. Any time a frequent cheap operation is routed through the same expensive primitive a rare operation needs, whether that's a headless browser, a paid model call, a locked transaction, or a rate-limited third-party API, you are paying the rare path's price on the frequent path. The transport changes; the asymmetry doesn't.

None of this is new. Greg Young named the database version of it Command Query Responsibility Segregation, and Martin Fowler wrote it up back in 2011: use one model to write and a different model to read, because the two have different shapes and different costs. Microsoft's pattern writeup puts the cost case plainly, that separating read operations from write operations in high read-to-write workloads lets you tune each path for what it actually does. My traffic was exactly that shape, hundreds of reads against a handful of writes, and I had collapsed both onto the expensive one. I rediscovered the same instinct fifteen years later with a headless browser instead of a database. The lesson outlives the layer it shows up in.

So three rules, after the TikTok fix landed:

Writes go through the host-session bridge. Whatever it takes. The bridge is the only thing that has a real session.

Reads go to the lightest endpoint that returns the data. Public JSON if available. Authenticated REST if necessary. Browser DOM only when neither of those works.

The two paths share no transport, no auth, no code. They're different problems with different costs and they should look different.
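The three rules can be sketched as two deliberately separate objects. This is an illustrative layout under assumptions, not the agent's real code: the class names, the tier names, and the convention that a reader returns None when it cannot serve a request are all hypothetical.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# A reader takes a target (e.g. a video id) and returns data, or None if this
# tier cannot serve the request. That None convention is an assumption.
Reader = Callable[[str], Optional[Any]]


class ReadPath:
    """Rule 2: try the lightest endpoint first, fall back to heavier ones."""

    def __init__(self, tiers: Sequence[Tuple[str, Reader]]):
        # Ordered lightest to heaviest, e.g. public JSON, authenticated REST, browser DOM.
        self.tiers: List[Tuple[str, Reader]] = list(tiers)

    def read(self, target: str) -> Tuple[str, Any]:
        errors: List[str] = []
        for name, reader in self.tiers:
            try:
                result = reader(target)
            except Exception as exc:  # a failed tier falls through to the next one
                errors.append(f"{name}: {exc}")
                continue
            if result is not None:
                return name, result
            errors.append(f"{name}: no data")
        raise RuntimeError(f"no read tier could serve {target!r}: {errors}")


class WritePath:
    """Rule 1: writes only ever go through the host-session bridge."""

    def __init__(self, bridge_publish: Callable[[str, str], Any]):
        self._bridge_publish = bridge_publish

    def publish(self, target: str, text: str) -> Any:
        return self._bridge_publish(target, text)
```

Rule 3 shows up as an absence: ReadPath and WritePath share no base class, no client, and no credentials, so nothing on the read side can quietly start depending on the bridge.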

That's it. The temptation to unify them under one tidy interface, the temptation I gave in to the first time because tidy felt like good engineering, is the exact thing that costs you. Tidy is not free. Tidy charges you every fifteen minutes for the rest of your system's life.

What can break

Two failure surfaces, both manageable.

The first is endpoint drift. A platform's public-looking JSON endpoint is not a contract. TikTok can change the shape of /api/comment/list/ next week and the read path silently breaks. So the read worker validates the response shape before trusting it. If the shape drifts, the worker logs the raw payload and alerts. Behavior changes, contracts don't exist, and the only protection is checking.
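A shape check of that kind might look like the sketch below. The field names (cid, text, create_time) and their types are hypothetical stand-ins, since the post does not list the real fields, and raising an exception stands in for whatever alerting the worker actually uses.

```python
import json
import logging
from typing import Any, Dict, List

logger = logging.getLogger("read_worker")

# Assumption: these field names and types are illustrative, not the real schema.
REQUIRED_FIELDS = {"cid": str, "text": str, "create_time": int}


class ShapeDriftError(ValueError):
    """Raised when the response no longer matches the expected shape."""


def shape_problems(payload: Any) -> List[str]:
    """Return a list of human-readable mismatches; empty means the shape is fine."""
    if not isinstance(payload, list):
        return [f"expected a list of comments, got {type(payload).__name__}"]
    problems: List[str] = []
    for index, item in enumerate(payload):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected an object, got {type(item).__name__}")
            continue
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in item:
                problems.append(f"item {index}: missing field {field!r}")
            elif not isinstance(item[field], expected_type):
                problems.append(f"item {index}: field {field!r} is not {expected_type.__name__}")
    return problems


def validate_comments(payload: Any) -> List[Dict[str, Any]]:
    """Trust the payload only after checking it; log the raw payload on drift."""
    problems = shape_problems(payload)
    if problems:
        logger.error(
            "comment list shape drifted: %s; raw payload: %s",
            problems,
            json.dumps(payload, default=str),
        )
        raise ShapeDriftError("; ".join(problems))
    return payload
```

Failing loudly is the point of the check: a drifted payload that is silently coerced into an empty thread would look like a quiet day rather than a broken read path.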

The second is rate-limit asymmetry. Hitting a public read endpoint hard from one IP eventually looks like scraping, even when nothing is technically authenticated. The bridge had a natural rate limit because it was expensive. The direct path doesn't, so the cadence has to be the rate limit. Fifteen minutes is well inside any sane budget for a handful of tracked videos. A hundred at once would not be.
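One way to make the cadence itself the rate limit is to spread polls across the interval and refuse to start if the schedule exceeds a self-imposed budget. This is a sketch under assumptions: the function names are hypothetical, and the budget of two requests per minute is an arbitrary illustrative number, not a documented platform limit.

```python
from typing import Dict, List

POLL_INTERVAL_S = 900.0  # from the post: fifteen minutes between reads per video


def staggered_offsets(video_ids: List[str], interval_s: float = POLL_INTERVAL_S) -> Dict[str, float]:
    """Spread one poll per video evenly across the interval instead of firing all at once."""
    if not video_ids:
        return {}
    step = interval_s / len(video_ids)
    return {video_id: index * step for index, video_id in enumerate(video_ids)}


def check_budget(
    video_count: int,
    interval_s: float = POLL_INTERVAL_S,
    max_requests_per_min: float = 2.0,  # assumption: a self-imposed budget, not a platform figure
) -> float:
    """Return the average request rate per minute, or raise if it exceeds the budget."""
    rate = video_count / (interval_s / 60.0)
    if rate > max_requests_per_min:
        raise ValueError(
            f"{video_count} videos every {interval_s:.0f}s is {rate:.2f} req/min, "
            f"over the budget of {max_requests_per_min:.2f}"
        )
    return rate
```

With these numbers, five videos average one request every three minutes and pass the check, while a hundred videos come to almost seven requests a minute and are rejected before the first poll goes out.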

Neither one is fatal. Both are cheaper to handle than paying the bridge tax on every poll.

The lesson, and the one afternoon

What I keep returning to, three weeks after the split landed, is the broader instinct. When you have a primitive that costs real money or real time or real risk, audit every caller and ask whether it actually needs what it's getting. The default answer is no. Most callers are passing through your expensive primitive because it was the only thing on the menu when they got there.

The expensive primitive should serve the callers who genuinely need it. The cheap callers should get the cheap path.

The whole thing reads now like it should have been obvious. It wasn't. It cost me three weeks of confused dashboard-watching to even name what was wrong.

If you're running an agent that touches a third-party platform, spend one afternoon. List every external call the agent makes. Mark each as a read or a write. Find the cheapest endpoint that returns the read data. The bridge stays for writes. Everything else moves. Watch the latency curve on your read path drop. Watch the tone of the system change.
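The afternoon's audit can start as a small inventory. In the sketch below the record layout and the transport labels are assumptions, and tiktokCommentPublish is a hypothetical name for the publish call; only tiktokInboxRead comes from the post, shown here as it was routed before the split.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExternalCall:
    name: str
    kind: str       # "read" or "write"
    transport: str  # e.g. "browser_bridge", "public_json", "authenticated_rest"


# Assumption: which transports count as expensive is a per-system judgment call.
EXPENSIVE_TRANSPORTS = {"browser_bridge"}


def reads_to_move(calls: List[ExternalCall]) -> List[ExternalCall]:
    """Reads that still ride an expensive transport and are candidates for a lighter one."""
    return [
        call
        for call in calls
        if call.kind == "read" and call.transport in EXPENSIVE_TRANSPORTS
    ]


inventory = [
    ExternalCall("tiktokInboxRead", "read", "browser_bridge"),        # pre-split routing
    ExternalCall("tiktokCommentPublish", "write", "browser_bridge"),  # hypothetical name
]

for call in reads_to_move(inventory):
    print(f"move {call.name} off {call.transport}")
```

Writes on the bridge are left alone by design; the list that comes back is only the reads that are paying the write-side price.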

That's the win.