Architecture · 2026-03-28

The memory system and composite scoring

How one production AI agent remembers across 5,252 jobs: a three-layer store, an LLM judge that gates writes, and a 197KB identity layer assembled at boot. Composite recall scoring gets its own essay.

[Illustration: a hand-drawn ink sketch of an open library card-catalog drawer of index cards. Cards near the front stand crisp and upright while older cards toward the back fade and sink, and a single rust-colored accent line threads through to lift one older card forward.]

This is the second half of a teardown of one production AI agent. The first half, routing and the failure log, covered how work gets dispatched across model backends and what breaks across 5,252 jobs. One of the five failure classes there was memory contamination: a false or outdated memory surfacing mid-session and corrupting downstream reasoning. This essay covers the subsystem that fixes it.

Memory is where most agentic systems underinvest, and then their builders wonder why the assistant forgets everything. Memory is not a context window. It is a retrieval problem, a scoring problem, and a garbage collection problem at the same time. Once you commit to one agent in a loop rather than a fleet, the agent has to hold its own coherence across long horizons, which makes memory the load-bearing skill. Anthropic calls the discipline context engineering, and names "context rot" as the failure mode where a long-lived context degrades into confidently wrong output. This system's memory layer exists to keep that rot out of the loop.

Three memory layers

The system uses three layers, each with a distinct job:

Layer 1, episodic. Append-only JSONL daily logs. What happened today. These are never queried directly for recall. They feed into the semantic layer during a nightly consolidation pass.

Layer 2, semantic. A LanceDB vector store. What the system knows: facts, preferences, decisions, entities. Queried at session start via hybrid search (vector plus full-text). This is the layer that grows: 230K vectors and roughly 1.9GB on disk at the time of writing, at an embedding cost of about $0.003 per month.

Layer 3, procedural. The same LanceDB instance, stored under the category "procedure". How the system does things: learned patterns that inform future behavior. A fact is "the staging site is a separate git worktree from production." A procedure is "for SSRN paper submission, use the password-manager credentials, never the stale plaintext from legacy scripts." Different retrieval pattern, different write path.
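Two pieces of the layers above are easy to sketch: writing an episodic log line, and merging the vector and full-text result lists for hybrid search. This is a minimal sketch under stated assumptions, not the system's code. The file layout (`logs/YYYY-MM-DD.jsonl`), the event fields, and the helper names `dailyLogPath`, `toJsonlLine`, and `fuseRanks` are all hypothetical. The essay does not say how the two result lists are merged; reciprocal rank fusion is used here only because it is a common choice.

```javascript
// Hypothetical helpers. Path layout, field names, and the fusion method
// are illustrative assumptions, not the system's actual code.

// One episodic file per UTC day, e.g. logs/2026-03-28.jsonl
function dailyLogPath(date, dir = 'logs') {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return `${dir}/${day}.jsonl`;
}

// JSONL is one JSON object per line. JSON.stringify escapes newlines
// inside strings, so a record can never span two lines.
function toJsonlLine(event, now = new Date()) {
  return JSON.stringify({ ts: now.toISOString(), ...event }) + '\n';
}

// Reciprocal rank fusion: merge two ranked ID lists (best first) into one.
// An ID that ranks well in both lists beats one that tops a single list.
function fuseRanks(vectorIds, textIds, k = 60) {
  const scores = new Map();
  for (const ids of [vectorIds, textIds]) {
    ids.forEach((id, rank) => {
      // rank is 0-based, so the top hit contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

In use, each line would be appended to the day's file and never rewritten, which is what keeps the episodic layer a faithful record for the nightly consolidation pass.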

What gets retrieved

The retrieval pipeline does not simply return the vectors with the highest cosine similarity. Raw semantic similarity is a poor recall metric on its own: it surfaces whatever is most linguistically similar to the query, regardless of whether it is recent or useful, so a February memory about a different dataset can outrank the one from three days ago. Pure similarity has no concept of time, and that is exactly the contamination class from part one. The fix is a composite score that blends similarity with recency, importance, and access frequency, plus a penalty that stops the same few memories from surfacing every turn. That scoring formula, with its weights and its failure-driven derivation, gets its own essay: "Composite scoring: fixing stale agent recall." The rest of this essay assumes that score and looks at what it operates on: the layers, the write gate, and the identity layer that frames every recall.
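To make the shape of that blend concrete, here is a rough sketch. Every number in it is a placeholder: the weights, the 14-day half-life, the saturation point, and the repeat penalty are assumptions chosen for illustration, not the values derived in the companion essay. The field names on `memory` are hypothetical too.

```javascript
// Placeholder weights and constants, NOT the values from the companion essay.
const WEIGHTS = { similarity: 0.5, recency: 0.2, importance: 0.2, frequency: 0.1 };
const HALF_LIFE_DAYS = 14;   // assumed recency half-life
const SATURATION = 50;       // assumed access count at which frequency maxes out
const REPEAT_PENALTY = 0.15; // assumed cost per recent surfacing
const MS_PER_DAY = 86400000;

// memory: { similarity, importance, lastUpdatedMs, accessCount, recentSurfacings }
function compositeScore(memory, now = Date.now()) {
  const ageDays = Math.max(0, (now - memory.lastUpdatedMs) / MS_PER_DAY);
  // 1.0 when fresh, halving every HALF_LIFE_DAYS
  const recency = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
  // Log-scaled so the first few accesses matter more than the hundredth
  const frequency = Math.min(1, Math.log1p(memory.accessCount) / Math.log1p(SATURATION));
  const blended =
    WEIGHTS.similarity * memory.similarity +
    WEIGHTS.recency * recency +
    WEIGHTS.importance * memory.importance +
    WEIGHTS.frequency * frequency;
  // Memories that surfaced in recent turns are pushed down
  return blended - REPEAT_PENALTY * memory.recentSurfacings;
}
```

Under these placeholder values, a 45-day-old memory with similarity 0.90 scores below a 3-day-old memory with similarity 0.85, which is the behavior the paragraph above asks for.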

The LLM judge that gates writes

Retrieval is half the problem. The other half is deciding what enters the store at all. After every session the system runs memory auto-capture: an LLM reviews the conversation and proposes what to remember. Each candidate gets one of four judgments:

ADD - new information worth storing
UPDATE - conflicts with or extends an existing memory
DELETE - existing memory is now wrong
NOOP - not worth storing (transient, already known, low importance)
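The four judgments above map naturally onto store operations. A minimal sketch, assuming an in-memory `Map` in place of the real vector store and a hypothetical judgment shape of `{ action, id, text }`:

```javascript
// Hypothetical: a Map stands in for the real store, and the judgment
// object shape { action, id, text } is an assumption for illustration.
// Returns true when the store changed.
function applyJudgment(store, judgment) {
  const { action, id, text } = judgment;
  switch (action) {
    case 'ADD':
      store.set(id, { text, version: 1 });
      return true;
    case 'UPDATE': {
      const existing = store.get(id);
      // Nothing to update: report it rather than silently adding
      if (!existing) return false;
      store.set(id, { text, version: existing.version + 1 });
      return true;
    }
    case 'DELETE':
      return store.delete(id); // false if the id was not present
    case 'NOOP':
      return false;
    default:
      throw new Error(`Unknown judgment: ${action}`);
  }
}
```

Throwing on an unknown action is deliberate: a judge that drifts from the four-verb vocabulary should fail loudly, not write something unexpected.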

Two cheap gates run before the expensive judge ever fires. The first is near-duplicate detection: if a candidate's embedding has a cosine similarity of 0.92 or higher to an existing memory (a cosine distance of 0.08 or less), it is dropped silently. The second is a count cap: if more than two near-identical records already exist, storage is blocked. These gates handle the common case for the price of an embedding and reserve the LLM judge call for genuinely ambiguous candidates. This is the same cheap-classifier-on-everything discipline the routing essay applies to job evals, pointed at writes instead of outputs.
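The two gates can be sketched in a few lines. This version reads the 0.92 threshold as a cosine similarity, which is an assumption about the intended metric. The looser 0.85 band used to count "near-identical" records is also an assumption, since the essay does not say how the cap defines near-identical. The function names and return labels are hypothetical.

```javascript
const DUP_SIMILARITY = 0.92; // the essay's threshold, read here as cosine similarity
const CAP_SIMILARITY = 0.85; // assumed looser band for "near-identical"
const MAX_SIMILAR = 2;       // from the essay: block when more than two exist

function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors match nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// candidate: embedding vector; existing: array of stored embedding vectors.
// A linear scan keeps the sketch short; a real store would use its index.
function preJudgeGate(candidate, existing) {
  const sims = existing.map((e) => cosineSimilarity(candidate, e));
  if (sims.some((s) => s >= DUP_SIMILARITY)) return 'drop-duplicate';
  const similarCount = sims.filter((s) => s >= CAP_SIMILARITY).length;
  if (similarCount > MAX_SIMILAR) return 'blocked-by-cap';
  return 'send-to-judge';
}
```

Only candidates that come back as `'send-to-judge'` cost an LLM call, which is the point of running the gates first.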

Importance decay closes the loop. A memory that is never accessed fades toward an importance floor of 0.1 over 90 days. That is not accidental forgetting. It is deliberate pressure that forces the store to keep what it actually uses rather than archive everything forever. The dual nature of capture matters too: a conversation where I tell it "always check Reddit before starting research" produces both a fact (archival) and an instinct (a procedure). Same conversation, two memory types, two retrieval patterns. The single-agent-with-a-skill-library shape here is the one Wang et al. argued for in Voyager: one agent that accumulates reusable procedures rather than spawning new agents.
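A sketch of that decay, using the floor of 0.1 and the 90-day window from the text. The linear shape is an assumption: the essay gives the endpoints, not the curve. The function name and its arguments are hypothetical.

```javascript
const IMPORTANCE_FLOOR = 0.1; // from the essay
const DECAY_DAYS = 90;        // from the essay

// Assumed linear fade from the stored importance down to the floor,
// measured in days since the memory was last accessed.
function decayedImportance(importance, daysSinceAccess) {
  // Decay only lowers scores; a memory already at or below the floor stays put
  if (importance <= IMPORTANCE_FLOOR) return importance;
  const t = Math.min(1, Math.max(0, daysSinceAccess / DECAY_DAYS));
  return importance - (importance - IMPORTANCE_FLOOR) * t;
}
```

Resetting `daysSinceAccess` to zero on every successful recall is what makes this pressure rather than plain expiry: memories that keep getting used never fade.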

The identity layer: 197KB of context

The persistent session runs with a system prompt assembled dynamically at boot from workspace files, skill definitions, memory recall, and project context. The assembled prompt is approximately 197KB.
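Assembly at boot can be as simple as concatenating named sections and tracking what each one costs. A minimal sketch: the section names follow the four sources listed above, while the `## name` header format and the `assemblePrompt` helper are assumptions for illustration.

```javascript
// Hypothetical assembler. sections is an object of { name: text }.
// Empty sections are skipped; sizes are reported in UTF-8 bytes so the
// total is comparable to an on-disk figure like "197KB".
function assemblePrompt(sections) {
  const encoder = new TextEncoder();
  const parts = [];
  const sizes = {};
  for (const [name, body] of Object.entries(sections)) {
    const text = body.trim();
    if (text === '') continue;
    parts.push(`## ${name}\n${text}`);
    sizes[name] = encoder.encode(text).length;
  }
  const prompt = parts.join('\n\n');
  return { prompt, sizes, totalBytes: encoder.encode(prompt).length };
}
```

Keeping a per-section size next to the prompt makes the judgment call about what is load-bearing easier to revisit, because you can see which source is actually consuming the budget.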

That number surprises people. The instinct is to minimize system-prompt size. The reality is that a richer, more specific context produces dramatically less hedging, fewer clarification requests, and more consistent behavior across sessions. The cost, slightly higher time-to-first-token, is worth it at this fidelity level. The judgment call is which context is load-bearing enough to always carry, and that is exactly the question context engineering asks.

Skills are the interesting part. The system has 28 skills, each a focused knowledge module: email management, smart browsing, research methodology, code review, and the rest. They are not all loaded at boot. The persistent session gets a compact skill catalog of one-line descriptions and trigger phrases. When a trigger fires, the full skill document is loaded on demand:

// identity.mjs - skills loaded at boot vs. on demand
const VOICE_SKILLS = {
  'writing':       'full',      // 11KB - always loaded
  'email-manager': 'condensed', // 25KB to ~5KB condensed
  'orchestration': 'condensed', // 42KB to ~3KB condensed
};
// All other skills: one-line catalog entry only
// Full skill doc loaded on trigger from skills/{name}/SKILL.md

This lazy-loading pattern mirrors how good engineers structure large codebases. You do not import everything at module load time. You import what you need when you need it. The analogy applies directly to context windows. It is also why one agent can carry the breadth of a fleet without paying the token cost of a fleet on every job. Simon Willison has made a version of this argument repeatedly in his practitioner writing: what matters is what you choose to put in front of the model, not how many models you run.
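The trigger-then-load step might look like the sketch below. The catalog entries, their trigger phrases, and the helper names are invented for illustration. Only the `skills/{name}/SKILL.md` path pattern comes from the snippet above. The file reader is injected so the loader can be exercised without touching disk.

```javascript
// Hypothetical catalog: one-line summaries plus trigger phrases.
const SKILL_CATALOG = [
  { name: 'email-manager', summary: 'Triage and draft email', triggers: ['inbox', 'email'] },
  { name: 'code-review', summary: 'Review diffs for defects', triggers: ['pull request', 'review this'] },
];

// Returns the names of skills whose trigger phrases appear in the message.
// Plain substring matching is an assumption; a real matcher may be stricter.
function matchSkills(message, catalog = SKILL_CATALOG) {
  const text = message.toLowerCase();
  return catalog
    .filter((skill) => skill.triggers.some((t) => text.includes(t)))
    .map((skill) => skill.name);
}

// readDoc(path) returns the document text. Each skill is read at most once
// per session; later triggers are served from the cache.
function createSkillLoader(readDoc) {
  const cache = new Map();
  return function loadSkill(name) {
    if (!cache.has(name)) {
      cache.set(name, readDoc(`skills/${name}/SKILL.md`));
    }
    return cache.get(name);
  };
}
```

The cache matters as much as the laziness: once a skill has been pulled into a session, a second trigger should not pay for the read again.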

If you are building this

Three decisions carry most of the weight, in order. First, blend recency and importance into your retrieval scoring before you ship anything persistent, the way the composite score does. Raw cosine similarity feels sharp for the first week and increasingly wrong after that. Second, gate writes with cheap embedding-similarity checks before you reach for an LLM judge. The dedup gate at a 0.92 similarity threshold does more to keep the store clean than the judge does. Third, let unused memories decay. A store that only grows is a store that rots.

The job engine in part one does not care how memory works. The memory system does not care how jobs are routed. That orthogonality is what lets a single agent on a home server stay coherent across 5,252 jobs without anyone managing it full time. The models were the easy part. Getting the memory right was the work.