Penny.
One agent, many verbs. It writes its own workflows.
Penny works through one command-line surface. Every capability it has (reading mail, placing a call, publishing a website) is a verb on the same CLI, and the agent and its workflows call the exact same verbs. When a sequence of them keeps repeating, Penny distills it into an n8n workflow, and once that workflow clears four gates it owns the job for good. Exploration stays the agent’s job; anything it does often enough becomes a machine that runs without it.
A year ago I described a fleet of 300+ autonomous agents coordinating through a shared scheduler. I tore it down. What runs now is Penny: one agent in a loop, calling many tools, with a workflow engine as the biggest tool. I designed it, built it, and operate it on my own home server, every day. The shape lines up with Anthropic’s engineering guide and Cognition’s field report: don’t build multi-agents.
Penny reads my mail, runs my mornings, places my calls, trades my paper account, and ships essays and websites, all from a ThinkPad on a shelf. Two layers are worth keeping straight. On the channels people actually touch (the Substack, the phone, the TikTok), Penny is a persona with a name and a voice, because that is what a reader or a caller answers to. Underneath, it is an it: one agent in a loop, running tools against gates that do not care what you call them. This page is the underneath. Everything below is the engineering that holds the rest up.
Below is how it is built: the log of real runs, the architecture at a glance, the workflow engine that is its biggest tool, where its memory and state live, and a decision record for every choice, with the losing options I built first. The smaller design calls (buffering until go, the four gates, why it runs no context manager) are getting their own writeups.
Penny
Autonomous operator · single-user · home server
penelopelawrence.com · its Substack
Distills its own repeated work into new n8n workflows. Nothing takes over a job without passing all four gates.
Before the architecture, here is the log.
This is Penny’s actual n8n control plane, captured live from the running instance on June 11. Ninety-one workflows lived in there that afternoon. The corner says three hundred and seventy-three production runs and a four-point-six-percent failure rate, which is a polite way of saying two of the rows below are red. I left them in. A Reddit lane that timed out, an approval gate that errored on a slow turn. The honest version of “it runs every day” has red in it.
It is also, as of one in the morning on June 12, a before picture. The rearchitecture’s deletion pass ran that night: sixty-six workflows out in one sitting, every one of them an orphan of a lane that had already been rebuilt smaller. The fleet woke up at twenty-five, all of them active. I kept the screenshot because I was proud of it, which regular readers will recognize as the exact thing I say about systems right before they stop existing.
Most jobs are made of the same dozen verbs.
Strip a desk job to its verbs and the mystique goes fast. Read the inbox. Answer the person. Look it up. Write the thing. Keep the calendar. Make the call. Move the money. Publish the result. Titles are taxonomy; the verbs underneath repeat, across careers that do not think they resemble each other.
This is not my theory; it is the Department of Labor’s filing system. O*NET, the federal occupational database, describes a job by listing its tasks, and the current release needs 18,796 task statements to cover 923 occupations, which works out to about twenty verbs a job. When OpenAI built GDPval to test models on real work, they started from those same task statements: 44 occupations, deliverables a professional would actually hand in, graded against human experts. The best model’s work matched or beat the expert’s just under half the time. That model was a Claude, which is what Penny runs.
Penny is a bet on that list. Not artificial general intelligence. A general assistant: cover the recurring verbs, on real accounts, with a human tap on anything that leaves the building. Here is the list, against what actually runs.
The right column is the part I actually care about. Doing a task once is a demo. Turning a repeated task into a machine that survives the four gates is staff. The overlap of those two circles (can do the verb, can harden the verb) is the whole point of it.
Capabilities, workflows, agent, and a distillation loop.
Foundational capabilities at the base, workflows on top of those, an agent on top of the workflows. At runtime the call flows down: the agent reaches for a workflow, which reaches for a capability. Penny adds one loop the other way: it distills its own work back into new n8n workflows, gated by the safety chain. The sections below walk these layers, then the decisions behind them.
Every capability is a verb on one CLI.
Penny has no bespoke tool per task. Every capability is a verb on a single CLI, called as pen <domain> <verb>. The agent calls it from inside the loop, and the n8n workflows call the same verbs from outside it. One surface to write, one place to audit, one place to rate-limit. When a sequence of verbs keeps repeating, that sequence is exactly what gets distilled into a workflow.
Sixteen of the verbs are external writes, the ones that touch the outside world. Each is marked, and none can fire without a human approval token. That is gate four, and it lives in the exit code.
The agent and its workflows share one verb surface. Distillation is just a verb-sequence the agent ran often enough to deserve its own machine.
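A minimal sketch of how a repeated verb sequence might be spotted in a run log. The log format, the window length, and the threshold are all assumptions made for illustration; this page does not say how Penny decides a sequence has repeated often enough. The verb names are the four quoted elsewhere on this page, arranged into a made-up log.

```python
from collections import Counter


def repeated_sequences(runs, length=3, min_count=3):
    """Count contiguous verb sequences of a given length across runs.

    `runs` is a list of runs; each run is a list of "domain verb" strings.
    Returns the sequences seen at least `min_count` times, most frequent first.
    """
    counts = Counter()
    for run in runs:
        # Slide a fixed-size window over the run and count each window.
        for i in range(len(run) - length + 1):
            counts[tuple(run[i:i + length])] += 1
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]


# Hypothetical log: three runs that share one three-verb sequence.
runs = [
    ["email context", "kalshi order", "telegram send"],
    ["email context", "kalshi order", "telegram send", "sites publish"],
    ["sites publish", "email context", "kalshi order", "telegram send"],
]
candidates = repeated_sequences(runs)
```

Here `candidates` holds a single entry: the shared three-verb sequence, seen three times. A sequence that clears the threshold is only a candidate; per the rest of the page, it still has to pass all four gates before it owns a job.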
Twice a day, it runs a talk show about my life.
Everything in the control plane starts with me typing. This is the half that doesn’t. At eight every morning and six every evening, a workflow compiles the day (weather, calendar, inbox, the trading book, what moved) and writes a script for two hosts: Penny, and a co-host named Max who exists nowhere else in the system. A local Kokoro container gives them both voices, ffmpeg lays chapter cards over the audio, and a finished episode lands in my Telegram. Every number in it comes from the day’s actual data, and both voices are synthesized on the shelf, no cloud TTS anywhere in the chain.
The audio was the easy part, same as the other agent’s page says about its voice. The writing is where the work went, because the first scripts kept collapsing into a quiz show: Max feeding setup lines, Penny reciting answers, two roles instead of two people. The prompt that writes the episode now carries a standing rule against it.
THE ONE FAILURE MODE TO AVOID - the Q&A trap: do NOT make Max
a prompt-feeder ("desk?" ... "kalshi book?" ... "and the news?")
with Penny reciting answers. That is two roles, not two people.
If a host’s only job in a turn is to tee up the other’s answer,
that turn is broken - cut it or give that host something real
to say.
···
Max ADDS (the memory, the dot-connect, a take), he does not
just ask.
Many tools. The workflow engine is the biggest.
Penny calls tools. The biggest tool is a workflow engine (n8n in my case). It owns the durable, scheduled, branching work that has no business living inside a chat turn: inbox digests, a Substack engagement loop, a paper-trading rebalance at market open.
Ok, not really at market open. The romance is approximate. The schedule isn’t. Here is the schedule, per the crons:
Underneath every lane there is now exactly one grammar. Each capability is a verb on a CLI called pen: pen email context, pen kalshi order, pen telegram send, pen sites publish. Twenty-seven domains, ninety-seven verbs, one JSON answer each, and the exit code is the contract. A workflow node runs the exact command I can type in a terminal, so when a lane misbehaves at 9:30 in the morning I reproduce it by typing that same command myself. The fleet used to speak through custom typed n8n nodes I code-generated per account; the rearchitecture deleted all but one of them. The survivor is the approval gate, and its retirement paperwork is filed.
pen <domain> <verb> · 27 domains, 97 verbs
exit codes, not promises
Exit 3 is gate four wearing its work clothes: an external write without a human token refuses to run, every execution, no exceptions.
host-local CLIs, swapped by name
The runners swap by name. There is no capability cascade. I pick the one for the job. The router doesn’t pick for me.
The workflow engine also calls the same tool API Penny calls. Workflows aren’t downstream of the loop; they’re peer callers of the same surface. One set of tools to write, audit, and rate-limit.
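Because the exit code is the contract, a caller can stay small. What follows is a sketch, not Penny's code: it runs one command, parses its single JSON answer, and treats exit 3 as a refusal rather than a crash. Only the meaning of exit 3 comes from this page. Treating 0 as success, the `VerbResult` shape, the `run_verb` name, and the stand-in command are assumptions.

```python
import json
import subprocess
import sys
from dataclasses import dataclass

EXIT_NEEDS_APPROVAL = 3  # from the page: external write without a human token


@dataclass
class VerbResult:
    status: str      # "ok", "needs_approval", or "failed"
    exit_code: int
    answer: object   # parsed JSON answer, or None if stdout was not JSON


def run_verb(argv, timeout=60):
    """Run one CLI command and classify the outcome by its exit code."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    try:
        answer = json.loads(proc.stdout)
    except json.JSONDecodeError:
        answer = None
    if proc.returncode == 0:  # assumption: 0 means success
        status = "ok"
    elif proc.returncode == EXIT_NEEDS_APPROVAL:
        status = "needs_approval"
    else:
        status = "failed"  # fail loud: no retry, no fallback
    return VerbResult(status, proc.returncode, answer)


# Stand-in for a real `pen` call so the sketch runs anywhere:
# a child Python process that prints one JSON object and exits 3.
fake_pen = [
    sys.executable, "-c",
    "import json, sys; print(json.dumps({'error': 'no token'})); sys.exit(3)",
]
result = run_verb(fake_pen)
```

The agent and a workflow node could both sit on a function like this, which is the sense in which they are peer callers: neither gets a private path around the exit code.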
Memory, state, and artifacts
Three stores, three different jobs. The memory vault holds what Penny remembers about me, my projects, my preferences. Postgres holds the workflow engine’s operational state: which workflows exist, which executions are running, which approval requests are pending. Git holds what Penny has shipped to the world.
I deliberately did not fuse them. Make a vault double as a database and you get a footgun. Make a queue carry memory and it loses both jobs.
penny-memory, a git repo of markdown
307 facts. SQLite and LanceDB are the store; a 4am cron mirrors every note to Obsidian-readable markdown, and the export writes that provenance into each note’s footer.
Postgres 16 + Redis, queue mode
Workflow definitions, running executions, pending approvals. Queue mode because a worker crash should lose nothing.
git, plus published surfaces
This site, Penny’s subdomains, expiring shares. Shipped is a different thing from remembered, and it lives in a different place.
The vault’s one rule: Penny does not own its own memory. I do. The vault is a git repo of mine, and its markdown mirror is readable and greppable without any of Penny’s code running.
Operational state lives in Postgres because the workflow engine needs queue-mode execution to survive a worker crash. Artifacts live in git because that’s how the world reads them.
# Penny Workflow Reliability
> [!summary]
> Generated from 7 Penny memory rows at 2026-06-11T16:55:01.590Z.
> SQLite and LanceDB remain the source of truth.
### Episodic
- **episode** · 2026-05-10 · Daniel discovered that aspirational workflow
entries in Penny’s memory don’t match actual n8n deployments
### Preferences
- **preference** · 2026-05-10 · Prefers verbose, story-style workflow
nodes that are explainable and read like a narrative sequence
···
| ID | Category | Created | Importance |
| --- | --- | --- | --- |
| `df0ff149-3b4e-...` | projects | 2026-05-13 | 0.9 |
| `578ed6ad-4125-...` | preference | 2026-05-10 | 0.8 |
Memory I can read without the agent. State in Postgres. Artifacts in git. None of them are Penny, and that’s the point.
What was on the table, and why it lost.
Every claim on this page came out of a decision with real options behind it. Most of the losing options I built first. The record, in question form:
Why one agent, and not a fleet?
ON THE TABLE
- ✗ a 300+ agent fleet with a shared scheduler (built it)
- ✗ more agents to supervise the agents (tried that fix too)
- ✓ one agent in a loop, many tools
REASONING
Every fleet version produced the same mountain: half-finished features, quietly buggy work, pieces that did not fit the pieces beside them. Supervising agents added coordination, not quality. One careful agent on real platforms holds together. Anthropic and Cognition published the same conclusion independently.
status: settled · the May 2026 rewrite
Why is there no fallback chain?
ON THE TABLE
- ✗ a cost-aware cascade that silently downgrades (letmecheckbot had one; it went in the bin)
- ✗ a capability router that picks the model for me
- ✓ one runner, picked by name, that fails loud
REASONING
A chain that quietly retries with a weaker model is how you ship confident answers nobody asked for. If the runner fails, I want to see it fail and decide. Penny launched without one; letmecheckbot’s went in the bin.
status: settled · v4 never had one
Why n8n instead of your own engine?
ON THE TABLE
- ✗ everything inside chat turns, no engine
- ✗ my own job engine (built it; it ran 6,442 jobs and worked)
- ✓ n8n in queue mode
REASONING
Durable, scheduled, branching work has no business living inside a chat turn. And maintaining my own engine was a second job. I retired a working system I was fond of because the boring one was better.
status: settled · v4
Why retrieval instead of a context manager?
ON THE TABLE
- ✗ rolling windows and summarize-then-prune cycles
- ✗ hierarchical summaries with an attention budget
- ✓ rebuild context every turn from retrieval
REASONING
Retrieval deletes the rolling-window logic, the summary passes, and every bug where the summary drifts from what happened. The cost is more retrieval work per turn. In exchange, Penny never imagines a world it isn’t in.
status: settled
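To make "rebuild context every turn" concrete, here is a toy sketch under loud assumptions. The word-overlap score stands in for whatever retrieval the real store does, and the function names, the prompt layout, and the sample facts are illustrative only. The part worth looking at is the shape: the builder takes the incoming message and the store, and nothing from the previous turn.

```python
def score(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def build_context(message, facts, k=2):
    """Rebuild the context for one turn from scratch.

    Nothing is carried over from the previous turn: the only inputs are
    the incoming message and the memory store.
    """
    ranked = sorted(facts, key=lambda f: score(message, f), reverse=True)
    # Keep at most k facts, and drop any that share nothing with the message.
    relevant = [f for f in ranked[:k] if score(message, f) > 0]
    lines = ["Relevant memory:"] + [f"- {f}" for f in relevant]
    lines += ["", f"Message: {message}"]
    return "\n".join(lines)


# Illustrative facts, not real rows from the store.
facts = [
    "Prefers verbose, story-style workflow nodes",
    "Morning episode goes out at eight",
    "Paper trading account rebalances on a schedule",
]
context = build_context("rename the workflow nodes", facts)
```

There is no window to roll and no summary to drift, because there is no state between calls. The price is that retrieval runs on every turn, as the reasoning above says.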
Why does the memory mirror to markdown?
ON THE TABLE
- ✗ an opaque store only the agent’s code can read
- ✗ a database browser as the only window into what it knows
- ✓ SQLite + LanceDB as the store, mirrored nightly to markdown in a repo I own
REASONING
The store is SQLite and LanceDB; every exported note says so in its own footer. The mirror means I can read and grep what my agent remembers with none of its code running, and the vault lives in a git repo that is mine. Penny does not own its own memory. I do.
status: settled
Why are writes gated in the exit code?
ON THE TABLE
- ✗ prompt discipline (“never send without asking”)
- ✗ allowlists per workflow
- ✓ a structural approval token, or the command exits with an error
REASONING
Prompt discipline is not a control. The gate is in the exit code, runs on every execution, and never retires. One human tap, every time something touches the outside world.
status: settled · the forever gate
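A sketch of what a structural gate can look like, as opposed to a sentence in a prompt. Only the exit code 3 and the idea of marked external writes come from this page. Which verbs are marked, how tokens are issued and checked, and the single-use rule are assumptions made so the example runs.

```python
EXIT_OK = 0              # assumption: conventional success code
EXIT_NEEDS_APPROVAL = 3  # from the page: unapproved external write

# Hypothetical marking: which (domain, verb) pairs touch the outside world.
EXTERNAL_WRITES = {("telegram", "send"), ("sites", "publish"), ("kalshi", "order")}


def gate(domain, verb, token, approved_tokens):
    """Return the exit code the wrapper should use for this call.

    Reads and internal verbs pass. An external write passes only if it
    carries a token a human approved; the token is consumed on use, so
    one tap covers exactly one write.
    """
    if (domain, verb) not in EXTERNAL_WRITES:
        return EXIT_OK
    if token is not None and token in approved_tokens:
        approved_tokens.discard(token)
        return EXIT_OK
    return EXIT_NEEDS_APPROVAL


approved = {"tap-001"}  # hypothetical token issued by one human tap
codes = [
    gate("email", "context", None, approved),       # a read: passes
    gate("sites", "publish", None, approved),       # no token: refused
    gate("sites", "publish", "tap-001", approved),  # approved: passes
    gate("sites", "publish", "tap-001", approved),  # reused token: refused
]
```

The list comes out as `[0, 3, 0, 3]`. A real wrapper would hand the returned code to `sys.exit`, which is what lets a workflow runner halt on it without reading any prose.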
Why buffer until “go”?
ON THE TABLE
- ✗ reply to every message as it lands
- ✗ a debounce timer that guesses when I’m done
- ✓ an explicit go from me
REASONING
Because I text in bursts. The buffer lets me send three messages, change my mind, and finish the thought before Penny acts on any of it. A timer guesses when I’m done; go knows. Spending tokens once instead of on every autocorrect is the bonus.
status: settled
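The buffer itself is a small amount of code. A minimal sketch, assuming the trigger is the literal word "go" sent as its own message; the class and method names are hypothetical.

```python
class GoBuffer:
    """Hold incoming messages until an explicit "go" arrives."""

    def __init__(self, trigger="go"):
        self.trigger = trigger
        self.pending = []

    def receive(self, text):
        """Buffer a message, or release the whole burst on the trigger.

        Returns None while buffering, and the list of buffered messages
        (possibly empty) when the trigger arrives.
        """
        if text.strip().lower() == self.trigger:
            burst, self.pending = self.pending, []
            return burst
        self.pending.append(text)
        return None


buf = GoBuffer()
first = buf.receive("book the dentist for Tuesday")
second = buf.receive("actually make it Thursday")
burst = buf.receive("go")
```

The first two calls return `None`, so nothing acts on the Tuesday message before the correction lands. The agent sees both messages together, once, when `burst` comes back.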
Two boards, drawn from their own JSON.
Not diagrams of the system. The system: every node and connection below is rendered from the live workflow file, position for position. If the workflow changes, this drawing is wrong, which is exactly the bar the rest of the page is held to.
Where Penny shows up.
Everything above is the plumbing. This is where the output lands. Penny posts, comments, and ships under its own name, and rather than tell you the pages are real, here they are: full-page grabs, captured through Penny’s own Chrome profile, the same browser its workflows drive when it reads the web. The fourth one is this page. The agent file, fetched by the agent.
The accounts, for following along. Same loop, same tools, different rooms.
What I’m still chewing on.
Same rule as the other agent’s page: none of this is finished, and none of it is a promise. The threads I keep pulling on.
The outbox in the waiting room
Penny’s replies are delivered by a container it is also allowed to restart. Once, mid-run, it restarted that container: the work finished, the restart went clean, and the message saying so died with the messenger, so my phone read “working…” into the night. It now restarts itself last, which is a rule, not a fix. The fix is a durable outbox, and it is already written: invariants documented, quiet hours that hold messages instead of failing them, one decision in front of me at a time. It lives in a folder called _pending-integration, which is the most honest folder name in the repo.
status: written, not wired
The account with nothing on it
Penny has an X account. What survives of the lane after the rearchitecture is a daily trending scan that writes to memory, plus a posting verb that sits behind a tap I have not tapped, and the profile shows zero posts. A thing that never runs never errors, so every audit walks right past it. The X card used to be on this page, one row over from its Substack. I took it down. It comes back with receipts.
status: investigating
Nothing had taught the fleet to forget
This entry shipped on a Thursday saying the fleet only ever grows: ninety-one workflows, Penny distilling new ones out of its own repeated behavior, me deleting by hand and losing. The deletion pass ran that same night. Sixty-six workflows out in one sitting, the orphans of every rebuilt lane, approved as one batch. The fleet is twenty-five now, every one of them active, and what finally taught it to forget was a human staying up past one in the morning. The part that stays open: it still adds by automation, and the loop that grew ninety-one is still on. The number to watch is whether twenty-five holds.
status: 25, all active · the loop that grew 91 is still on
A long queue and the meaning of approve
Gate four holds every external write until I tap. The outbox holds every next decision until I have answered the last one. That is the design, and I still believe it: prompt discipline is not a control, and the gate runs in the exit code. But the system does more every month, and the approver still sleeps, parents, and goes to the movies. The question is not whether the gate holds. It is what a long enough queue does to the meaning of approve.
status: holding, by hand
The gate guards the road I paved
Gate four lives in an exit code. The pen wrapper returns 3 on an unapproved external write and the runner halts, and I have leaned on that as the control. It holds for everything that goes through pen. But pen is one road out and there are others: a workflow that reaches for a raw curl, or for n8n’s own HTTP node, never touches the wrapper and never sees the 3. Penny writes its own automations, which is the point, and nothing down at the network stops one from driving around the gate I built. So the honest version of the claim is narrower than the one above on this page: the gate holds the road I paved, not every way off the property. The real fix puts the check where the packets leave, and I have not built it.
status: holds the pen path, not the network
Two stores, no lock between them
What Penny did and what it produced live in different places: the run history in Postgres, the essay or the site change in files under git. Nothing binds the two into one transaction. So there is a window where a run reports done and the history agrees, while the git push that was meant to carry the artifact quietly fails. The record in one store and the file in the other drift apart, and no audit is watching the seam between them. It has not bitten me that I have caught, which is not the same as not happening. The textbook answer is a two-phase commit; on a one-person home server that is a heavier machine than the problem, so what I actually want is a reconciler that reads both sides and shouts when they disagree.
status: unbound, watching the seam
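A sketch of the reconciler described above, and only a sketch: nothing like it is wired in. It assumes each run record can name the commit it claims to have pushed; the field names and the sample rows are hypothetical, standing in for a Postgres query on one side and a listing of the repo's commits on the other.

```python
def reconcile(runs, commits):
    """Compare run history against artifact commits and report drift.

    `runs` is a list of dicts with "run_id", "status", and
    "artifact_commit" (the commit the run claims to have pushed, or None
    if the run ships nothing). `commits` is the set of commit hashes
    actually present in the artifact repo.
    """
    problems = []
    for run in runs:
        claimed = run["artifact_commit"]
        if run["status"] == "done" and claimed is not None and claimed not in commits:
            problems.append(
                f"run {run['run_id']}: reports done, commit {claimed} is missing"
            )
    return problems


# Hypothetical rows for illustration.
runs = [
    {"run_id": 101, "status": "done", "artifact_commit": "a1b2c3d"},
    {"run_id": 102, "status": "done", "artifact_commit": "e4f5a6b"},
    {"run_id": 103, "status": "done", "artifact_commit": None},
]
commits_in_repo = {"a1b2c3d"}
drift = reconcile(runs, commits_in_repo)
```

This checks one direction only: a run that says done with no artifact behind it. Commits that no run accounts for would need the mirror-image pass.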
Three sister git repos, one home server, daily since October 2025; one agent since the May 2026 rewrite, the latest build in four years of agent work. The writing is what running it taught me, including everything that broke. This page is the substrate underneath it.
The fastest way to understand it is to read it.
Penny has exactly one user, and it is not accepting applications. What it has is a byline. The essays are what living next to this system sounds like from the inside, and they leave the house through the same gates as everything else.
Read its Substack. Or start at penelopelawrence.com, where it keeps the research.