Product build

PagePurger
Three dollars a page to read junk

Solo build | AWS Batch | Bedrock | Python | 2026

In medical-legal record review, a person reads every page and bills for it, around three dollars a page. The catch is that a huge share of those pages are duplicates and filler: the same fax sent four times, blank separator sheets, records that have nothing to do with the case. PagePurger reads the stack first and pulls out the pages nobody should be paying to review.

Architecture: client records arrive over SFTP and feed an AWS Batch GPU pipeline. The pipeline classifies each page for relevance with Claude Haiku on Bedrock, dedupes with SHA-256 plus perceptual hashing, and checkpoints as it goes so that a job of more than 8,000 pages can resume after a failure. It emits a review set along with the excluded pages, a CSV of decisions, and an invoice.
How it worked: SFTP intake to a finished review set, with a record of every cut.

The problem

A single case file can run past eight thousand pages. At three dollars each, that is more than $24,000 for one file, and the share spent on duplicates and dead weight is money paid to a human for reading things a glance would reject. But you cannot just delete pages on a hunch in a legal record. Every keep-or-drop call has to be defensible later, which is exactly why people read all eight thousand by hand.

The build

Files arrived over SFTP, and the heavy work ran on AWS Batch with GPU machines. A model, either Claude Haiku on Bedrock or an OpenAI model, classified each page by whether it mattered to the case. The pipeline caught duplicates two ways: exact copies by a SHA-256 fingerprint of the file, and near-duplicates, such as a rescanned page that sits a hair off the original, by perceptual hashing, which compares what the page looks like rather than its bytes.
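The two dedupe passes can be sketched in a few lines. This is an illustration under assumptions, not the production code: it uses a simple average hash as a stand-in for whatever perceptual hash the pipeline actually ran, it expects each page to arrive already downscaled to a small grayscale grid, and the names `average_hash`, `find_duplicates`, and `max_distance` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Exact fingerprint: identical bytes always give identical digests."""
    return hashlib.sha256(data).hexdigest()


def average_hash(gray: list[list[int]]) -> int:
    """Perceptual fingerprint of a small grayscale grid (values 0-255).

    Each pixel contributes one bit: 1 if it is brighter than the grid mean.
    Assumption: the caller has already downscaled the page image.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def find_duplicates(pages, max_distance=2):
    """pages: iterable of (page_id, raw_bytes, gray_grid).

    Returns {page_id: (kind, original_id)} for every page judged a duplicate,
    where kind is "exact" (same SHA-256) or "near" (perceptual hash within
    max_distance bits of an earlier page). max_distance is an illustrative
    threshold, not a tuned value.
    """
    seen_exact = {}   # digest -> first page_id with those bytes
    seen_visual = []  # (page_id, perceptual hash) of pages kept so far
    dupes = {}
    for page_id, raw, gray in pages:
        digest = sha256_hex(raw)
        if digest in seen_exact:
            dupes[page_id] = ("exact", seen_exact[digest])
            continue
        seen_exact[digest] = page_id
        phash = average_hash(gray)
        match = next(
            (pid for pid, h in seen_visual if hamming(h, phash) <= max_distance),
            None,
        )
        if match is not None:
            dupes[page_id] = ("near", match)
        else:
            seen_visual.append((page_id, phash))
    return dupes
```

Recording which earlier page each duplicate matched, and by which method, is what lets the cut be explained later. The linear scan over `seen_visual` is fine for a sketch; a real run over thousands of pages would likely want an index.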

An eight-thousand-page job cannot afford to die at page seven thousand, so it checkpointed as it went and picked back up where it stopped. The output was not one file but a set: the pages to review, the pages pulled out, the watermarked originals, a spreadsheet of every decision, and the invoice. A dashboard showed it happening live.
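A minimal sketch of the checkpoint-and-resume idea, assuming the checkpoint is a small JSON file on local disk (on AWS Batch it would more plausibly live in durable storage such as S3, which is not shown). The function names, the `every` interval, and the file layout are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next page to process (0 if no checkpoint yet)."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)["next_page"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_page: int) -> None:
    """Write to a temp file, then rename over the target.

    os.replace swaps the file in one step, so a crash mid-write leaves the
    previous checkpoint intact rather than a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_page": next_page}, f)
    os.replace(tmp, path)


def run_job(pages, process_page, checkpoint_path, every=100):
    """Process pages in order, saving progress every `every` pages."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(pages)):
        process_page(pages[i])
        if (i + 1) % every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(pages))
```

With this shape, pages handled after the last saved checkpoint are processed again on resume, so `process_page` needs to be safe to repeat. A smaller `every` narrows that window at the cost of more writes.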

The outcome

The reviewer opened a stack with the junk already pulled, along with a log explaining why each removed page was cut, because on a legal record someone can always come back and ask. The models did the routine part. Earning enough trust to drop a page was the part that took real work.
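A log like that can be as plain as one CSV row per page decision. The sketch below is an assumption about shape, not the real schema: the `PageDecision` fields, the reason labels, and the helper names are all hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PageDecision:
    page: int
    action: str    # "keep" or "exclude"
    reason: str    # hypothetical labels, e.g. "exact_duplicate", "blank"
    evidence: str  # what backs the call, e.g. the page it duplicates


def write_decision_log(decisions, out) -> None:
    """Write a header row plus one row per decision to a text stream."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(PageDecision)])
    writer.writeheader()
    for d in decisions:
        writer.writerow(asdict(d))


def render_decision_log(decisions) -> str:
    """Convenience wrapper that returns the CSV as a string."""
    buf = io.StringIO()
    write_decision_log(decisions, buf)
    return buf.getvalue()
```

Keeping the evidence next to the action means that a question about any single page can be answered from the row alone, without rerunning the job.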