Architecture · 2026-04-05

Context engineering in practice: a 3-tier memory system, 230K vectors

The architecture, code patterns, and production numbers behind a 3-tier memory system for one AI agent: a 75-line core file, a 230K-vector LanceDB archival store, and self-teaching procedural memory. 6,370 jobs, $0.003/month embedding cost.

A hand-drawn three-tiered funnel: a chaotic cloud of scattered scribbled marks pours into the top, passes through three stacked sieve layers, and emerges at the bottom as a single clean dot circled in rust, the one high-signal memory that survives the filter.

Last Tuesday at 2:14 AM, my production AI system dispatched a job to draft a research brief. It had run 6,300+ jobs by then. It knew my writing preferences, my active projects, my deadline for the week, and it referenced a conversation from three days earlier about a specific dataset. And it got the brief completely wrong.

The model was fine. The problem was that it retrieved a memory from February about a different dataset with a similar name, and that stale memory overwrote the correct context from the recent conversation. One bad retrieval cascaded into a confidently wrong output.

That's when I understood: context engineering is the load-bearing wall of agent reliability. Every downstream decision inherits the error if you get it wrong. Get it right and the system feels like it actually knows you.

Anthropic named this discipline directly in "Effective context engineering for AI agents": the job is curating the smallest set of high-signal tokens for the task, not stuffing the window. Their framing of context rot, where a model's ability to use information degrades as the context grows longer and noisier (a phenomenon measured directly by Chroma's research team across 18 models), is exactly the failure I hit at 2:14 AM. The stale memory didn't sit there harmlessly; it displaced the correct context. The same instinct runs through Anthropic's earlier "Building Effective Agents" (Schluntz and Zhang, December 2024): keep one augmented loop, give it the right tools and context, and resist the urge to add machinery. The memory system below is the context-curation half of that argument made concrete.

This is the architecture I built after that failure. Every number comes from the production database, every code sample from the running system. The composite-scoring formula that actually fixed the stale-recall bug gets its own deep-dive in a companion piece, "Composite scoring: fixing stale agent recall"; here I lay out the three tiers it lives inside.

Why "just dump everything in context" fails

The naive approach to agent memory is stuffing everything into the system prompt: preferences, history, project state, everything you might need. It works great for demos. It stops working around month two.

My system prompt assembles dynamically from workspace files, skill definitions, memory recall, and project context. The full payload for the persistent session lands around 17,000 tokens, already 13% of a 128K context window before a single user message arrives.

Here's what happens when you try to scale the naive approach:

  • Token costs compound. Every job reads the full context. Across 6,370 jobs, stuffing 50K tokens of "everything" into each prompt would've cost 3-4x more than tiered retrieval.
  • Signal drowns in noise. When the system knows 230,000+ facts, surfacing the right 20 for a task isn't a filtering problem. It's a ranking problem.
  • Stale memories poison decisions. A fact stored in February might be wrong by April. In a flat file that loads every session, there's no mechanism to decay or challenge it.
  • Contradictions accumulate silently. "Danny prefers dark mode" and "Danny switched to light mode last week" can coexist in a flat store forever, and the system flips between them randomly.

The 3-tier architecture is the minimum structure that keeps context accurate at scale.

  • Tier 1: Core Memory. 75 lines, always loaded, identity + rules.
  • Tier 2: Archival Memory. 230K vectors in LanceDB, composite scoring.
  • Tier 3: Procedural Memory. Learned behaviors, tag-boosted retrieval via recall tags.

Tier 1: core memory, the 75-line identity

Core memory is a single 75-line markdown file, loaded in full every session, no retrieval needed.

core-memory.md (75 lines, 6,279 bytes)
├── About Me         - identity, personality, operating style
├── About Danny      - personal details, schedule, preferences
├── Current Priorities - 7 active projects with IDs
├── Critical Rules   - hard constraints (one-liners)
├── Key Preferences  - communication style, work habits
└── Operating Principles - research first, ship minimum viable

The file carries a hard budget comment at the top: 80 lines max, identity and priorities and critical rules only, with procedures, project details, and skill-specific notes pushed out to their own stores.

That budget constraint is doing real work. Without it, core memory bloats: every correction adds a line, and within a month you've got 300 lines of accumulated rules, half contradicting each other, all consuming tokens on every job. The 80-line cap forces ruthless prioritization. If something new goes in, something old comes out. The file gets edited directly. It's the system's working self-concept, not an append-only log.

What stays in core memory: identity, the human's key details (name, family, schedule, timezone), active projects as a compact list of IDs, hard rules (never skip multi-model review, never ship single-pass), and communication preferences. What gets pushed out to archival: project details beyond the one-liner, historical decisions and their rationale, learned procedures, entity details about people and organizations, and anything that changes more than monthly.

The split is roughly: core memory answers "who are you and what matters right now," archival answers "what do you know about X." The core-versus-archival vocabulary isn't mine; it comes from the MemGPT paper (Packer et al., 2023), which framed agent memory as an operating-system problem: a small in-context working set backed by a larger out-of-context store the agent pages in on demand.

Tier 2: archival memory, 230K vectors in LanceDB

Archival memory is where the real complexity lives: a LanceDB vector store, 1.9GB on disk, holding every meaningful fact, preference, decision, and entity the system has ever learned.

  • 230K+ vectors stored
  • 1.9 GB on-disk footprint
  • 7 memory categories

Why LanceDB, not files

I started with flat markdown files, one per topic. It worked for a few weeks, then I hit 200 files and search broke. Grep doesn't understand semantic similarity: "Danny's preferred writing style" and "how Danny likes articles written" are the same query to a human but completely different strings to grep. LanceDB solved this with embedded vector plus full-text search in a single query. No separate vector database, no Pinecone bill. It runs locally and the query interface is SQL-like enough that I didn't have to learn a new paradigm.

The table schema:

memories table (LanceDB)
├── id              UUID
├── text            string     the actual memory content
├── vector          float[1536] text-embedding-3-small
├── importance      float      0.0 - 1.0
├── category        string     fact|decision|preference|entity|procedure|...
├── source          string     auto-capture|manual|danny-correction
├── tags            JSON       tool hints for procedural recall
├── createdAt       bigint
├── lastAccessedAt  bigint
└── accessCount     bigint

Indexes: FTS on text, bitmap on category, B-tree on importance, plus a vector index. Hybrid search uses an RRF reranker to merge vector and full-text results into one ranked list.
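For readers who haven't met reciprocal rank fusion, here is a minimal standalone sketch of the idea. It is not LanceDB's reranker API, just the underlying arithmetic: each result earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The function name `rrfMerge` and the constant k = 60 (the value commonly used with RRF) are assumptions, not values taken from the production system.

```javascript
// Reciprocal rank fusion over any number of ranked id lists.
// k = 60 is the commonly used constant; the production value is not
// stated in the article, so treat it as an assumption.
function rrfMerge(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Hypothetical hit lists from the vector index and the FTS index
const vectorHits = ['m1', 'm2', 'm3'];
const ftsHits = ['m3', 'm1', 'm4'];
const merged = rrfMerge([vectorHits, ftsHits]);
```

A memory that ranks well in both lists (m1, m3) beats one that ranks well in only one (m2, m4), which is the behavior you want when the semantic and keyword signals disagree.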

Composite scoring: where the stale-recall bug got fixed

Raw cosine similarity isn't enough here. It has no concept of time, importance, or how often a memory has already surfaced, which is exactly how the stale February memory outranked the correct one. The fix is a composite score that blends similarity, recency, importance, and frequency, then penalizes anything recalled too recently. The exact weights, the scoring function from the codebase, and the 10-step retrieval pipeline that wraps it get a full walk-through in the companion essay, "Composite scoring: fixing stale agent recall." What matters here is the result: retrieval surfaces 12-15 archival memories and 8-10 procedures per session, roughly 800-1,400 tokens, precisely targeted rather than dumped.
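To make the shape of that blend concrete, here is a rough sketch. It is not the production scoring function: the weights, the cooldown window, the penalty multiplier, and the log-scaled frequency term are all placeholders I made up for illustration. The field names (`importance`, `createdAt`, `lastAccessedAt`, `accessCount`) follow the table schema, with timestamps treated as plain millisecond numbers. The 7-day half-life is mentioned later in the article; applying it to the recency term, measured from `createdAt`, is an assumption.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Placeholder weights for illustration only, NOT the production values.
const WEIGHTS = { similarity: 0.5, recency: 0.2, importance: 0.2, frequency: 0.1 };

function compositeScore(memory, similarity, now, opts = {}) {
  const {
    halfLifeDays = 7,              // assumed to govern the recency term
    cooldownMs = 60 * 60 * 1000,   // hypothetical "recalled too recently" window
    cooldownPenalty = 0.5,         // hypothetical penalty multiplier
  } = opts;

  // Exponential decay: the recency term halves every halfLifeDays
  const ageDays = (now - memory.createdAt) / DAY_MS;
  const recency = Math.pow(0.5, ageDays / halfLifeDays);

  // Log-scaled so heavily recalled memories saturate at 1
  const frequency = Math.min(1, Math.log1p(memory.accessCount) / Math.log1p(100));

  let score =
    WEIGHTS.similarity * similarity +
    WEIGHTS.recency * recency +
    WEIGHTS.importance * memory.importance +
    WEIGHTS.frequency * frequency;

  // Penalize anything that surfaced very recently
  if (now - memory.lastAccessedAt < cooldownMs) score *= cooldownPenalty;
  return score;
}
```

With weights like these, a 60-day-old memory at similarity 0.90 scores below a 3-day-old memory at similarity 0.85, which is the ordering raw cosine similarity got wrong.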

Tier 3: procedural memory, how the system teaches itself

Procedural memories live in the same LanceDB table as archival memories, distinguished by category='procedure', but they behave differently. An archival memory is a fact: "Danny's daughter turns 1 on March 25, 2026." A procedural memory is a behavior: "For SSRN paper submission, use password-manager credentials, never stale plaintext credentials from legacy scripts."

The difference matters for retrieval. Facts match on semantic similarity to the query; procedures match on tags. Every procedure gets tagged with the tools or domains it applies to:

// Example procedural memories with tags

"Articles for dnakhla.com must go through the council pipeline"
  tags: ["dnakhla", "article-publication", "council-pipeline"]

"Danny's methodology for positioning: research -> plan -> validate"
  tags: ["positioning", "strategy", "market-research"]

"Before building scrapers, research existing datasets and APIs"
  tags: ["research", "methodology", "data-collection"]

When a job prompt mentions "dnakhla" or "article," tag matching boosts relevant procedures by 10% per matching tag on top of the composite score. A procedure tagged [dnakhla, article-publication] gets a 20% boost when publishing to dnakhla.com, even if semantic similarity alone wouldn't have surfaced it.
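The boost arithmetic is simple enough to sketch. This assumes the boost is additive per tag (two matches give 1.20x, not 1.1 x 1.1), which is what the 20% figure above implies. How tags are extracted from the job prompt is not described here, so `promptTags` is a hypothetical input, as are the function and constant names.

```javascript
const TAG_BOOST_PER_MATCH = 0.10; // 10% per matching tag, per the text above

// compositeScore: the score before boosting
// procedureTags: tags stored on the procedure
// promptTags: tags detected in the job prompt (extraction step is assumed)
function applyTagBoost(compositeScore, procedureTags, promptTags) {
  const active = new Set(promptTags.map((t) => t.toLowerCase()));
  const matches = procedureTags.filter((t) => active.has(t.toLowerCase())).length;
  return compositeScore * (1 + TAG_BOOST_PER_MATCH * matches);
}

// Two of the three tags match, so the score is multiplied by 1.20
const boosted = applyTagBoost(
  0.5,
  ['dnakhla', 'article-publication', 'council-pipeline'],
  ['dnakhla', 'article-publication']
);
```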

How procedures get created

Most procedures originate from corrections. Danny says "don't do that, do this instead," and auto-capture extracts a fact (what he said) and an instinct (the rule to follow next time). The instinct becomes a procedure with importance 0.85-0.95, high enough to resist the 90-day decay and stay prominent in retrieval for months.

Some procedures come from post-mortems. When a job fails and the analysis reveals a preventable pattern, the system stores a procedure: "Before building scrapers, research existing datasets and APIs first." That one came from a real incident where we built a 3-day scraping pipeline for data already available as a public API.

The self-teaching loop

Here's the emergent property I didn't expect: as procedures accumulate, the system's behavior converges toward my preferences without me repeating myself. Early on, every article draft needed manual correction. After 40+ articles and the resulting procedural memories, first drafts now reflect my voice rules, structural preferences, and quality bar.

This isn't fine-tuning. The base model hasn't changed; the context has gotten more precise.

The system doesn't learn in the ML sense. It remembers corrections and re-applies them. The effect is the same from the user's perspective, but the mechanism is entirely different.

Auto-capture: the memory formation pipeline

Memories don't get stored manually. After every meaningful session, a 4-stage pipeline runs.

Stage 1: extraction

An LLM reads the conversation and proposes candidate memories. Two extractors run in parallel.

  • Semantic extractor - pulls facts, decisions, preferences, entities
  • Instinct extractor - pulls behavioral patterns, corrections, workflow rules

The dual extraction matters. "Always check Reddit before starting research" produces a fact ("Danny wants Reddit checked first") and an instinct ("scan Reddit for recurring debates before diving into papers"): two memory types with different retrieval patterns from one conversation.

Stage 2: validation

A quality gate filters noise. Conversational filler, transient state ("I'm working on this right now"), and vague observations get rejected. The gate checks specificity: is this a real fact or just a restatement of something obvious?

Stage 3: judgment

Each surviving candidate gets compared against existing memories via vector search. An LLM judge decides:

ADD - genuinely new information worth storing
UPDATE - supersedes an existing memory (old deleted, new stored)
DELETE - existing memory is now wrong
NOOP - not worth storing (already known, too vague, transient)
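A minimal sketch of how the four verdicts could be applied, using an in-memory Map as a stand-in for the memories table. The decision object shape (`action`, `candidate`, `targetId`) and the function name are hypothetical; the real pipeline writes to LanceDB.

```javascript
// store: Map<id, { id, text }> standing in for the memories table
// decision: { action, candidate?, targetId? } (hypothetical shape)
function applyDecision(store, decision) {
  switch (decision.action) {
    case 'ADD':
      store.set(decision.candidate.id, decision.candidate);
      return 'added';
    case 'UPDATE':
      // Supersede: delete the old memory, store the new one
      store.delete(decision.targetId);
      store.set(decision.candidate.id, decision.candidate);
      return 'updated';
    case 'DELETE':
      store.delete(decision.targetId);
      return 'deleted';
    case 'NOOP':
      return 'skipped';
    default:
      // An LLM judge can return malformed output, so fail loudly
      throw new Error(`unknown judge action: ${decision.action}`);
  }
}
```

Throwing on an unrecognized action is a deliberate choice in this sketch: silently ignoring a malformed verdict would make dropped memories hard to notice.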

Before the judge even runs, a hard similarity gate catches near-duplicates: a candidate with cosine similarity above 0.92 to an existing memory is dropped silently, and a count cap blocks storage if more than 2 near-identical records (similarity above 0.95) already exist. These two gates handle the 80% case cheaply, saving the LLM judge call for genuinely ambiguous situations.

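// Hard similarity gate - runs before the LLM judge, inside the per-candidate loop.
// `score` is cosine similarity, so higher means closer.
const topMatch = similar[0];
if (topMatch && topMatch.score > 0.92) {
  log.info({ text: candidate.text.slice(0, 60), score: topMatch.score },
    'auto-capture: skipped (near-duplicate)');
  continue; // move on to the next candidate
}

// Count cap - blocks if >2 near-identical already exist.
// This excerpt sits in a different scope (note the `return` and the 'store:' log prefix).
// `_distance` is a distance, so 1 - distance converts it back to similarity;
// a missing `_distance` is treated as 0, i.e. counted as identical.
const highSimCount = nearDupes.filter(r =>
  (1 - (r._distance ?? 0)) > 0.95
).length;
if (highSimCount > 2) {
  log.info('store: blocked (>2 near-identical records exist)');
  return null;
}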

Stage 4: contradiction detection

After new memories are stored, a contradiction detector scans them against the top 50 high-importance existing memories. It looks for incompatible claims:

  • "Danny prefers dark mode" vs "Danny switched to light mode"
  • "Use flux-pro for best quality" vs "Use flux-schnell for best quality"
  • "Broad Street Run is in May" vs "Broad Street Run is in October"

Each contradiction gets a resolution: keep the new memory (delete old) or keep the existing one (log but don't delete new). This is the garbage collection that stops the store from accumulating conflicting facts.

The whole pipeline costs almost nothing: extraction and judgment use the cheapest available model, near-duplicate detection is a vector search rather than an LLM call, and contradiction detection only runs when new memories are actually added.

The embedding cache: 3 layers for $0.003/month

Embedding costs could kill this architecture. Every recall query embeds the prompt and every auto-capture candidate embeds for dedup checking, so 6,370 jobs means a lot of calls. The 3-layer cache makes it negligible:

// embed() - 3-layer cache
async function embed(text) {
  const key = hashStr(text);

  // L1: In-process Map (50 entries, 2-min TTL, O(1) lookup)
  const cached = embedCache.get(key);
  if (cached && Date.now() - cached.ts < EMBED_CACHE_TTL) return cached.vector;

  // L2: Redis (persistent across restarts, 5-10ms)
  const redisCached = await redisGetEmbed(key);
  if (redisCached) {
    embedCache.set(key, { vector: redisCached, ts: Date.now() });
    return redisCached;
  }

  // L3: OpenAI API (text-embedding-3-small, 300-500ms)
  const vector = await llmEmbed(text, EMBEDDING_MODEL);

  // Store in both caches
  embedCache.set(key, { vector, ts: Date.now() });
  redisSetEmbed(key, vector); // fire-and-forget

  return vector;
}

After the system warms up (first 10-15 minutes), L1 and L2 handle 99%+ of embedding requests and L3 only fires for genuinely new text. Monthly embedding cost: roughly $0.003-0.006. Three-tenths of a cent, not a typo.
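The embed() excerpt shows the L1 lookup but not how the Map stays at 50 entries. One way to do it, sketched here as an assumption rather than the production code, is to lean on the fact that a JavaScript Map iterates in insertion order and evict the oldest key on overflow. `EMBED_CACHE_TTL` appears in the excerpt; `EMBED_CACHE_MAX`, `cacheGet`, and `cacheSet` are hypothetical names.

```javascript
const EMBED_CACHE_TTL = 2 * 60 * 1000; // 2-minute TTL, per the L1 comment
const EMBED_CACHE_MAX = 50;            // 50 entries, per the L1 comment (name is hypothetical)

// Returns the cached vector, or undefined if missing or expired
function cacheGet(cache, key, now = Date.now()) {
  const entry = cache.get(key);
  if (!entry) return undefined;
  if (now - entry.ts >= EMBED_CACHE_TTL) {
    cache.delete(key); // drop the expired entry
    return undefined;
  }
  return entry.vector;
}

// Stores a vector and evicts the oldest entries beyond the size cap
function cacheSet(cache, key, vector, now = Date.now()) {
  cache.delete(key); // re-insert so this key becomes the newest
  cache.set(key, { vector, ts: now });
  // Map iterates in insertion order, so the first key is the oldest
  while (cache.size > EMBED_CACHE_MAX) {
    cache.delete(cache.keys().next().value);
  }
}
```

Oldest-first eviction is the simplest policy that keeps memory bounded; with a 2-minute TTL, anything more elaborate (true LRU, for example) probably buys little.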

What broke and what I'd change

The 1.9GB problem

The LanceDB store is 1.9GB after 40 days, roughly 17GB per year at linear growth, and it won't stay linear. Compaction runs every 5 minutes and cleans up old versions, but I don't have aggressive enough garbage collection: memories below the 0.1 importance floor should get pruned, not just deprioritized. I haven't built that yet.

Recency bias

The 7-day half-life works well for operational context (what project am I working on, what did I ask yesterday) but poorly for durable knowledge (my daughter's birthday, my wife's name, my employer), which doesn't change and shouldn't decay. The planned fix is to split importance decay by category; the scoring mechanics live in the companion essay.
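A sketch of what a per-category split might look like, under stated assumptions: the function returns a decay multiplier and leaves open which score term it is applied to. Only the 7-day figure comes from this article. The per-category values and the choice of which categories count as durable are hypothetical.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical half-lives in days. Infinity means "never decays".
const HALF_LIFE_DAYS = {
  entity: Infinity, // people and organizations: treated as durable here
  preference: 90,   // made-up value for illustration
};
const DEFAULT_HALF_LIFE_DAYS = 7; // the current single half-life

// Returns a multiplier in (0, 1]: 1 at age zero, 0.5 after one half-life
function decayFactor(category, ageMs) {
  const halfLife = HALF_LIFE_DAYS[category] ?? DEFAULT_HALF_LIFE_DAYS;
  // With halfLife = Infinity the exponent is 0, so the factor stays 1
  return Math.pow(0.5, ageMs / (halfLife * MS_PER_DAY));
}
```

Under these placeholder values, a year-old `entity` memory keeps a factor of 1, while anything without an override still halves every week.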

The cold start

When the system restarts, the in-process embedding cache is empty, so the first 10-15 minutes hit the OpenAI API for every embedding: 300-500ms each versus 0ms from cache. Not expensive, but slow, and jobs dispatched during cold start have noticeably higher latency.

The fix is straightforward: serialize the L1 cache to disk on shutdown, reload on startup. Ten lines of code I haven't gotten around to, which is its own lesson about how production systems accumulate known-but-unresolved issues.
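For what it's worth, the serialization half of that fix might look like the sketch below. It assumes cache entries are `{ vector, ts }` objects with plain-array vectors (a Float32Array would not round-trip through JSON as an array), and both function names are hypothetical. File I/O is left out to keep the sketch self-contained.

```javascript
// Turn the L1 Map into a JSON string of [key, { vector, ts }] pairs
function serializeCache(cache) {
  return JSON.stringify([...cache.entries()]);
}

// Rebuild the Map, dropping entries that have already outlived the TTL
function restoreCache(json, ttlMs, now = Date.now()) {
  const restored = new Map();
  for (const [key, entry] of JSON.parse(json)) {
    if (now - entry.ts < ttlMs) restored.set(key, entry);
  }
  return restored;
}
```

On shutdown the string can be written with `fs.writeFileSync`, and on startup read back with `fs.readFileSync`. One design wrinkle: with a 2-minute TTL, restored entries only survive a fast restart, so a real version would likely need a longer TTL or refreshed timestamps for reloaded entries.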

Memory audit tooling

I can query the LanceDB store, but I don't have a dashboard for browsing it. When a stale memory surfaces or a contradiction slips through, debugging means manual queries. I need a memory browser that shows what's stored, when it was last accessed, and what its composite score would be for a given query.

The numbers that matter

After 40 days and 6,370 jobs with this architecture:

  • 6,370 jobs dispatched
  • 95.9% success rate
  • $0.003 monthly embedding cost
Metric                   Value
Core memory              75 lines, 6,279 bytes; loaded every session
Archival store           1.9GB, ~230K vectors, 7 categories
Episodic logs            39 daily files, 6,802 entries, 2.1MB total
Retrieval per session    12-15 archival + 8-10 procedures = ~1,200 tokens
Total context overhead   ~17K tokens (13% of 128K window)
Recall latency           2-3s hybrid search, <50ms cached

What I'd build differently

If I were starting this from scratch:

Separate the importance model from capture. Right now importance is assigned at capture time and only decays. It should be re-evaluated from actual usage: a memory recalled 50 times in a month should have its importance boosted, not just its frequency score. And build the memory browser on day one, because you can't debug what you can't see.

Start with stricter dedup thresholds. I began with 0.95 cosine similarity as the dedup gate and lowered it to 0.92 after the store filled with near-duplicates. Starting at 0.90 or even 0.88 would've kept the store cleaner from the beginning.

Context engineering is the infrastructure that makes everything else work. The models improve every quarter. The context architecture is what compounds.