Product build

vLLM Video Intelligence
I read 100 videos for seventy cents

Solo build Python GPT-4.1-mini Gemma 27B Whisper 2025

The obvious way to read a video with AI is to send every frame to a vision model. It works, and it costs a fortune. So I built an engine that does not do that. It lays a video's frames out like a film editor reads a roll, a grid of stills on one sheet, and lets the model read eight at a time. It read a hundred short videos for about seventy cents.

The SubwayTakes explorer: a stat row reading 100 total videos, 496 topics covered, 3.2 hours watched, 4 analysis methods, above a browsable table of every Short with its contact sheet and per-model agreement.
The explorer: 100 Shorts, four models, every prediction browsable.

The problem

Reading a video frame by frame runs about eight cents a video. That sounds like nothing until you point it at a real catalog. A few thousand videos and the bill is the reason the tool never ships. The cost comes from the sheer number of pictures you hand the model, one slow, billable call at a time.

The build

The engine samples a frame every couple of seconds instead of grabbing all of them. Then it throws out the near-duplicates, the frames where nothing changed, by comparing them to each other. The survivors get tiled into a contact sheet: eight stills in a grid on a single image. One API call now reads eight frames instead of one, three to five sheets covering the whole video instead of forty separate calls.

On those sheets the vision models, GPT-4.1-mini and Gemma 27B, read what is on screen. Whisper handles the audio, so the read covers what a viewer both sees and hears. The whole move cut the cost of reading a video by about eighty percent, from eight cents to under a penny. At a penny a video you can give the tool away. At eight cents you cannot afford to run it.

The outcome

I ran it across a hundred Shorts from one channel, the SubwayTakes interviews, in about seven minutes for around seventy cents. To check the cheap path was not costing accuracy, I had four model setups read the same videos and compared them: GPT and Gemma agreed on what a video was about roughly eighty-five percent of the time. The explorer that browses all of it is one HTML page, no build step.