@LetMeCheckThatBot · same model, three ways

A five-dollar AI scored 76.
The professionals scored 84.

One real task from 44 jobs, and a look at what an evaluation actually measures. The same $5 AI model, MiMo v2.5, open-weight (a model whose weights are public, so anyone can download and run it) from Xiaomi, run through two harnesses: openclaude, a standard agent harness (the setup around the model: its tools, its time, and when it may stop) with room to work, and @LetMeCheckThatBot's own harness. Against the human professional. Same judge (the AI that grades the work against the checklist), same checklist. The model never changes. The harness does, and the score moves with it.

openclaude · @LetMeCheckThatBot harness · human expert

how this test works (30 sec)

The test isn't mine. Tasks come from GDPval, OpenAI's set of real occupational work built on the U.S. Department of Labor's job catalog. Each one ships the human professional's own finished work and a grading checklist written by experts.

The model never changes. Every run uses MiMo v2.5, the same open-weight model on the same $5/month key. What changes is the harness around it. openclaude, a standard agent harness, gives it a shell, Python, a browser, and dozens of steps. @LetMeCheckThatBot's harness is the one from my group chat: a handful of chat tools and a short loop, built to answer messages, not to grind out a deliverable.

The score. Claude Sonnet 4.6, made by Anthropic, a different company than the one behind the agent (a model wired into tools, working in a loop), grades every deliverable (the actual file the work produces: the spreadsheet, the memo, the report) against the same checklist. A second judge, Moonshot's Kimi K2.6, agreed on 91% of items across 32 tasks.
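One way to read "agreed on 91% of items" is as a simple match rate: the share of checklist items where both judges gave the same verdict. The sketch below assumes each judge returns one pass/fail verdict per item. The function name `item_agreement` and the verdict lists are hypothetical, made up for illustration, and are not the actual grading data or code.

```python
def item_agreement(judge_a, judge_b):
    """Fraction of checklist items on which two judges gave the same verdict."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must grade the same items")
    if not judge_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return matches / len(judge_a)


# Hypothetical verdicts for one ten-item checklist (True = item satisfied).
primary = [True, True, False, True, True, False, True, True, True, False]
second = [True, True, False, True, False, False, True, True, True, False]

print(f"{item_agreement(primary, second):.0%}")  # prints 90%
```

A raw match rate like this does not correct for agreement that would happen by chance, so it is a sanity check on the judge, not proof that the judge is right.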

The point. Same model, same questions, same judge. When the two AI columns disagree, that is the harness talking, not the model. With room to work, the model averaged 76% across all 44 jobs. In @LetMeCheckThatBot's chat harness, built to fire off a quick reply, it averaged 70%. On the jobs both setups finished, the scores nearly match. The gap is mostly the chat harness stopping early on a few, before it saved a file. Change the wrapper and the number moves, on purpose. That is what an evaluation buys you.
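The difference between "all jobs" and "jobs both setups finished" is easy to see in a few lines. This is a minimal sketch with made-up numbers, not the real results. It assumes a run that stopped before saving a file is graded 0 on that task, which is my reading of how an early stop drags the average down. The names `graded`, `compare`, `roomy`, and `chat` are hypothetical.

```python
from statistics import mean


def graded(score):
    # Assumption: a run that saved no file is graded 0 on that task.
    return 0.0 if score is None else score


def compare(roomy, chat):
    """Average two harnesses over all tasks and over tasks both finished.

    Each dict maps task id -> score in percent, or None when the run
    stopped before saving a deliverable.
    """
    tasks = sorted(roomy)
    both = [t for t in tasks if roomy[t] is not None and chat[t] is not None]
    return {
        "all_roomy": mean(graded(roomy[t]) for t in tasks),
        "all_chat": mean(graded(chat[t]) for t in tasks),
        "both_roomy": mean(roomy[t] for t in both),
        "both_chat": mean(chat[t] for t in both),
    }


# Hypothetical scores for four tasks; the chat harness stopped early on t3.
roomy = {"t1": 80.0, "t2": 70.0, "t3": 90.0, "t4": 60.0}
chat = {"t1": 77.0, "t2": 71.0, "t3": None, "t4": 59.0}

result = compare(roomy, chat)
print(result)
```

With these invented numbers the two harnesses sit one point apart on the tasks both finished (70 vs 69), while the all-task averages split widely (75 vs 51.75), because a single early stop counts as a zero.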