Architecture · 2026-06-02

LetMeCheckThatBot: a group-chat agent
that remembers everything you drop in

A group chat does not need a help desk. It needs a member, and a member is mostly memory. So I built one: a Telegram bot that sits in the thread, answers when summoned, and quietly turns everything that goes past, every image, video, file, link, and voice note, into context it can search by meaning weeks later. The interesting work was not the persona. It was the retrieval: the best RAG I know how to give a model, with the whole life of the room as its corpus.

LetMeCheckThatBot is the second production agent I run. It is a separate system from Penny, my autonomous operator, and a deliberately different shape: multi-user, conversational, real-time, and answerable to people who are not me. It lives in a long-running group thread as the extra member. You summon it with one keyword, it reads what is going on, and it does something useful: looks a fact up, reads the link someone dropped, transcribes the voice note, finds the exact clip the moment calls for. Then it goes quiet. This build has run daily since May 2026; the line of group-chat agents behind it goes back to 2022.

A group chat is a harder place to put an agent than a one-on-one assistant. The bot shares context with several people at once, and it has to decide when not to talk, which is most of the time. The keyword trigger solves the when cheaply: silence is the default, and a single word is the whole activation protocol. Everything after that is the same loop I trust everywhere else, wired through the Telegram Bot API.

AGENTS LetMeCheckThatBot the agent · lives in the Telegram thread, wakes on a keyword, decides when to talk, holds the voice, runs the workflows below calls WORKFLOWS ingest + index every drop → text, embedded → stored recall by meaning hybrid keyword + vector search the tool-loop read · act · repeat, up to four passes auto-profiler sketches who is in the room deliver + self-check no dodgingno dodging, one clean reply calls FOUNDATIONAL CAPABILITIES MODELS · three MiMo v2.5 · chat + tools · hosted GLM-4.6V · vision · hosted Qwen3-Embed 0.6B · embeddings · local TRANSLATORS → TEXT Whisper · voice/video → text vision OCR · image → text headless Chrome · page → text yt-dlp · video → text / clip STORE SQLite + vector index hybrid keyword + vector recall plain rows on disk, deletable TOOLS · EIGHTEEN look up search the web read a link screenshot search Reddit fetch a thread find a video recall search history recent messages see image + its text crop an image make + post web photos an image URL reaction GIF custom GIF facts slideshow video files interact ask the group delete its msg deliver reply Built bottom-up; called top-down. Agents touch workflows, which touch the foundational capabilities.

One agent, a loop, about eighteen tools

The core is the primitive Penny uses and the one Anthropic argues for in Building Effective Agents: an augmented model in a loop. The bot reads the recent thread, decides whether to act, and if it acts it calls a tool, reads the result, and decides again, up to four passes before it owes the chat an answer. The tools are wired with OpenAI-style function calling, and there are about eighteen of them sorted into a handful of jobs. It looks things up in the world: search the web, fetch and read a link, screenshot a page that fights back, dig through Reddit. It finds and cuts media: search for a video and trim the thirty seconds that matter, find a meme, build a gif, render a facts-slideshow. It reaches into the chat's own past: a keyword-and-vector search across the history, or just the recent run of messages. It sees: read an image and the text inside it, crop in on the part that matters. And it acts in the room: post photos, gifs, and clipped videos, put a quick either-or to the group, and delete its own message when it misfires.

One detail earned its place the hard way. Not every model honors the function-calling contract. Some emit a tool call as XML, or as a fenced code block, instead of the structured field the spec defines. So the loop carries a recovery path that parses those malformed calls back into real ones instead of giving up. The tool layer assumes the model will occasionally color outside the lines, and handles it rather than crashing. That single piece of defensiveness is what lets me swap the model underneath the bot without touching the loop, which is exactly what I did next.

I deleted the cascade I was proud of

The bot reaches models through OpenRouter, which puts many providers behind one credential. I built this part twice. The first version was a cost-aware cascade: Claude Haiku as the default, Gemini Flash below it, a free tier at the bottom, and a background job that watched the remaining credit and quietly dropped to free models when it ran low. It worked, and I was proud of it.

Then one model made the whole apparatus pointless. MiMo v2.5 tool-calls reliably, holds a million tokens of context, and caches input for almost nothing. By my own accounting it runs roughly twenty-eight times cheaper than Haiku for a full multi-search fact-check turn. So one model now handles chat and every tool call, the cascade collapsed into a single line of config, and the budget guard went in the bin with it.

That left the bot with no fallback at all, which turned out to be the right shape. If a model call fails, it fails, and the bot picks up on the next message. This is the same no-fallback rule I run in Penny, and in the first version I had argued myself into the opposite: that a silent downgrade to a weaker model was a feature. It was just a way to keep paying for worse answers without noticing. So I deleted it.

A forced second look

The model's most annoying failure mode is dodging: handing the question back to me instead of doing the work. Ask it something it could look up and it will sometimes reply with "need more" or "did you mean A or B" rather than just searching. So the loop watches for a reply that called no tools, and the first time that happens in a turn it stops and puts one internal question to the model: are you about to punt something you could answer yourself, look up, or take a fair guess at? If it was dodging, it goes and uses a tool. If the silent reply was the right call, a flat fact, an honest "no clue," a one-line roast, it passes through untouched. The check fires once per turn, so the bot reconsiders without arguing itself into a corner. It is the cheapest reliability fix on the project: one extra turn, fired at most once, that turns a reflexive dodge back into an answer.

Everything you drop in becomes context

This is where most of the work actually went, and it is the part nobody plans for. Most chat bots read text and go blind the moment someone drops anything else. This one treats every artifact as readable. A screenshot is handed to a vision model that transcribes the words inside it verbatim, every line, not a summary. A photo gets described. A video, or a silent GIF, is sampled into a contact sheet of frames and read across them, so the bot knows what happened in it and can lift any text off the screen. A voice note or video note nobody wanted to play goes to a local Whisper service and comes back as a transcript. A shared file has its contents pulled in. A link gets fetched and read, and when it hides behind a wall of script a headless Chrome renders it first; when it is a video, yt-dlp pulls its title, channel, description, and transcript. All of it lands in the thread as text, silently, with no reply at the time.

Seeing is its own model: images go to a dedicated vision model, kept apart from the chat model because it pulls text out of a picture better. None of this ingestion produces a message; it happens in the background, on the way past, so weeks later when someone asks what was in the thing so-and-so posted, the answer is already in the history. The bot runs the same machinery in reverse, too: ask it for the right thirty seconds of a video and it downloads the source, finds the moment, trims it, and drops the clip in. It does not describe a meme; it posts one. The one real gap is binary office files: a PDF gets logged but not parsed.

What I have just described has a name: retrieval-augmented generation. Most RAG systems point a model at a tidy corpus someone curated, a docs site or a wiki. This one points it at the actual lived history of a room, every modality of it, indexed and pulled back by meaning at the moment it matters. It is not what someone decided to write down, but everything that was ever said, shown, dropped, or linked, in whatever form it arrived.

Recall by meaning, not the last fifty lines

A chat bot with no memory re-meets everyone every morning. This one keeps the whole thread in a local SQLite database, and a background worker embeds each message as a vector as it lands, using a small embedding model of its own, Qwen3-Embedding-0.6B running on Ollama, right on the box and separate from the hosted chat and vision models. The index it builds never leaves the machine. When someone asks what that restaurant was that came up last week, it does not replay the last fifty lines and hope. It searches the entire history two ways at once: a plain keyword match for the things vectors miss, an @handle, a URL, a misspelled name, blended with vector similarity for the things keywords miss, the same idea worded differently. Then it pulls the handful of messages that actually answer. The query is a few hundred tokens and runs every turn, cheap enough to redo rather than cache. The store is plain rows on disk I can open, read, and delete by hand, which matters when the data is real people in a private chat rather than a faceless user table.

It remembers who is in the room

Memory is not only what was said. It is who said it. The bot works in any group it gets added to, not just the one I built it for, so it cannot assume it knows the cast. Once someone has said enough, a dozen messages is the bar, it builds a short character sketch of them from their own words: three to five sentences on how they talk, what they keep circling back to, the sense of humor, and the role they play in the room, the instigator, the contrarian, the earnest one, the one who only ever dumps links, the one who is the running joke. Those sketches go into the prompt as a "who is in this chat" note, so it has a feel for the people before it speaks. Dropped into a brand-new room with no history, it reads the recent backlog first and works out the dynamics. It is the same instinct as the message recall, pointed at people instead of facts: do not re-meet the room every morning.

The voice was the easy part

It is tempting to tell you the voice was the hard part, because the voice is the part that shows. It was not. The persona is a long character sketch in the prompt, and the only enforcement around it is small: a guard, a few dozen lines, that runs on the way out. It strips the canned safety disclaimers the model reaches for when it is unsure, and it catches the times the model prints its private planning into the chat instead of thinking it quietly. Either one shatters the illusion that there is a person in the room, so either one gets removed before anyone sees it. I wrote that guard once and have barely touched it since. Holding a voice turned out to be cheap.

One more piece of hardening in the same spirit is duller and more load-bearing: a lockfile with stale-process detection, so two copies of the bot can never both answer the same message. A group chat with a stutter is worse than one with no bot at all. But none of this was the work. The work was everything in the three sections above, the plumbing that turns each message into context and the recall that hands it back.

A member is mostly memory

Here is the principle I took out of building it. The model is a commodity; I swapped the one underneath this bot in an afternoon and the loop never noticed. The persona is a prompt and a small guard. What is left, the thing that actually separates a member of the group from a search box you can @-mention, is everything the bot has quietly absorbed and can hand back at the right moment. A bot that remembers the link you sent in March, the voice note you never replayed, the running joke about so-and-so, and which person in the room tends to start them, is a bot that feels like it was there. That feeling is not in the model. It is in the database, and in the unglamorous plumbing that fills it on every message that goes past.

LetMeCheckThatBot is small. One repository, one box. But it has stayed up, stayed cheap, and stayed useful every day, in front of people who would tell me if it got worse. The part they would miss if it vanished is not the wit and not the model. It is the memory. None of that came from the model. It came from the wrapping.

Some operational details in these essays have been changed for narrative or privacy reasons. The arguments, the design decisions, and the lessons are real.