
Context Engineering / April 2026

Context Engineering in Practice: How I Built a 3-Tier Memory System for Production AI Agents

The memory architecture behind a production AI orchestration system that's run 6,370+ jobs. Core memory, archival vectors with composite scoring, and self-teaching procedural memory. Real code, real numbers, real failures.

Last Tuesday at 2:14 AM, my production AI system dispatched a job to draft a research brief. The system had run 6,300+ jobs by that point. It knew my writing preferences, my active projects, my deadline for the week. It referenced a conversation I'd had three days earlier about a specific dataset. And it got the brief completely wrong.

The problem wasn't the model. The model was fine. The problem was that it retrieved a memory from February about a different dataset with a similar name, and that stale memory overwrote the correct context from the recent conversation. One bad retrieval cascaded into a confidently wrong output.

That's the moment I realized context engineering isn't a nice-to-have optimization you add after your agent works. It's the load-bearing wall. Get it wrong and every downstream decision inherits the error. Get it right and the system feels like it actually knows you.

This is the architecture I built after that failure. Every number comes from the production database. Every code sample comes from the running system.

Why "Just Dump Everything in Context" Fails

The naive approach to agent memory is stuffing everything into the system prompt. Preferences, history, project state, everything you might need. It works great for demos.

It stops working around month two.

My system prompt assembles dynamically from workspace files, skill definitions, memory recall, and project context. The full payload for the persistent session lands around 17,000 tokens. That's already 13% of a 128K context window before a single user message arrives.

Here's what happens when you try to scale the naive approach:

  • Token costs compound. Every job reads the full context. At 6,370 jobs dispatched, even small per-job overhead adds up. Stuffing 50K tokens of "everything" into each prompt would've cost 3-4x more than tiered retrieval.
  • Signal drowns in noise. When the system knows 230,000+ facts, surfacing the right 20 for a given task isn't a filtering problem. It's a ranking problem.
  • Stale memories poison decisions. A fact stored in February might be wrong by April. If it sits in a flat file that loads every session, there's no mechanism to decay or challenge it.
  • Contradictions accumulate silently. "Danny prefers dark mode" and "Danny switched to light mode last week" can coexist in a flat store forever. Without contradiction detection, the system flips between them randomly.

The 3-tier architecture isn't clever engineering for its own sake. It's the minimum viable structure that keeps context accurate at scale.

  • Tier 1: Core Memory - 75 lines, always loaded, identity + rules
  • Tier 2: Archival Memory - 230K vectors in LanceDB, composite scoring
  • Tier 3: Procedural Memory - learned behaviors, tag-boosted retrieval

Tier 1: Core Memory - The 75-Line Identity

Core memory is a single markdown file. 75 lines. Always loaded, every session, no retrieval needed.

core-memory.md (75 lines, 6,279 bytes)
├── About Me         - identity, personality, operating style
├── About Danny      - personal details, schedule, preferences
├── Current Priorities - 7 active projects with IDs
├── Critical Rules   - hard constraints (one-liners)
├── Key Preferences  - communication style, work habits
└── Operating Principles - research first, ship minimum viable

The file has a hard budget comment at the top:

<!-- BUDGET: 80 lines max. Identity + priorities + critical rules only.
     Procedures -> OPERATIONS.md or LanceDB procedures (category='procedure').
     Project details -> archival memory. Skill-specific -> skill files.
     Schedule -> ALWAYS run `pm schedule list`. Never answer from memory. -->

That budget constraint is doing real work. Without it, core memory bloats. Every time someone corrects a behavior, the instinct is to add a line. Within a month you've got 300 lines of accumulated rules, half of them contradicting each other, all of them consuming tokens on every single job.

The 80-line cap forces ruthless prioritization. If something new goes in, something old comes out. The file gets edited directly. It's the system's working self-concept, not an append-only log.

What stays in core memory:

  • Identity (who am I, what's my personality)
  • The human's key details (name, family, schedule, timezone)
  • Active projects (compact list with IDs, not descriptions)
  • Hard rules (never skip multi-model review, never ship single-pass)
  • Communication preferences

What gets pushed out to archival:

  • Project details beyond the one-liner
  • Historical decisions and their rationale
  • Learned procedures and workflows
  • Entity details about people and organizations
  • Anything that changes more than monthly

The split is roughly: core memory answers "who are you and what matters right now." Archival memory answers "what do you know about X."

Tier 2: Archival Memory - 230K Vectors in LanceDB

Archival memory is where the real complexity lives. It's a LanceDB vector store. 1.9GB on disk, holding every meaningful fact, preference, decision, and entity the system has ever learned.

  • 230K+ vectors stored
  • 1.9 GB on-disk footprint
  • 7 memory categories

Why LanceDB, Not Files

I started with flat markdown files. One file per topic. It worked for the first few weeks. Then I hit 200 files and search became a problem. Grep doesn't understand semantic similarity. "Danny's preferred writing style" and "how Danny likes articles written" are the same query to a human but completely different strings to grep.

LanceDB solved this with embedded vector search plus full-text search in a single query. No separate vector database. No Pinecone bill. It runs locally, stores everything on disk, and the query interface is SQL-like enough that I didn't have to learn a new paradigm.

The table schema:

memories table (LanceDB)
├── id              UUID
├── text            string     the actual memory content
├── vector          float[1536] text-embedding-3-small
├── importance      float      0.0 - 1.0
├── category        string     fact|decision|preference|entity|procedure|...
├── source          string     auto-capture|manual|danny-correction
├── tags            JSON       tool hints for procedural recall
├── createdAt       bigint
├── lastAccessedAt  bigint
└── accessCount     bigint

Indexes: FTS on text, bitmap on category, B-tree on importance, plus a vector index. Hybrid search uses an RRF reranker to merge vector and full-text results into a single ranked list.
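LanceDB runs the RRF reranking internally, so none of this code lives in the system. But for intuition, here's a standalone sketch of Reciprocal Rank Fusion merging a vector-search result list with a full-text result list. The `k = 60` smoothing constant is the conventional default from the RRF literature, not necessarily what LanceDB uses:

```javascript
// Reciprocal Rank Fusion: merge two ranked ID lists into one.
// Each item scores sum(1 / (k + rank)) across every list it appears in,
// so an item ranked decently in BOTH lists beats one that tops only one.
function rrfMerge(vectorHits, ftsHits, k = 60) {
  const scores = new Map();
  for (const hits of [vectorHits, ftsHits]) {
    hits.forEach((id, rank) => {
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

This is why hybrid search beats either mode alone: a memory that's semantically close *and* shares exact keywords with the query accumulates score from both lists.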

Composite Scoring: The Core Innovation

This is the part that fixed the Tuesday 2 AM failure. Raw cosine similarity isn't enough for memory retrieval. Here's why.

Say I search for "dataset for research brief." Cosine similarity returns the 20 memories most linguistically similar to that phrase. But "dataset" appears in dozens of memories spanning months. The February memory about a different dataset scores just as high as the one from three days ago. Pure similarity has no concept of time.

The composite scoring formula:

// Composite scoring constants
const SCORE_WEIGHTS = {
  similarity: 0.45,
  recency:    0.25,
  importance: 0.20,
  frequency:  0.10,
};

const DECAY_HALF_LIFE_HOURS = 168;  // 7 days
const DECAY_RATE = Math.pow(0.5, 1 / DECAY_HALF_LIFE_HOURS);

// Importance decay: 90-day half-life, floored so memories never zero out
const IMPORTANCE_DECAY_HALF_LIFE_DAYS = 90;
const IMPORTANCE_FLOOR = 0.1;

// Cap on the log-scaled frequency term, keeping it in [0, 1]
const MAX_FREQUENCY_BOOST = 1.0;

// Anti-stale: recently-recalled memories get penalized
const STALE_PENALTY = 0.5;
const STALE_WINDOW_MS = 60 * 60 * 1000;  // 1 hour

// id -> timestamp of last recall, consulted by the anti-stale check
const recentRecalls = new Map();

Similarity (0.45) - Still the largest weight. Semantic relevance matters most. But it's less than half the total score, which means the other signals can override it when they should.

Recency (0.25) - Exponential decay with a 7-day half-life. A memory from yesterday scores roughly 0.9 on recency. A memory from two weeks ago scores 0.25. A memory from a month ago scores roughly 0.05. This is what killed the stale February memory. It got a near-zero recency score that dragged its composite below the three-day-old correct memory.

Importance (0.20) - Assigned at capture time. Danny corrections get 0.95. Auto-captured facts get 0.5-0.7 depending on specificity. Importance itself decays with a 90-day half-life down to a floor of 0.1. Memories that aren't accessed gradually fade rather than cluttering the store forever.

Frequency (0.10) - Log-scaled access count. Memories that get recalled often score slightly higher. Creates a self-reinforcing loop for actually useful memories.

Anti-stale penalty (0.5x) - Any memory recalled in the last hour gets its score cut in half. Without it, the same 5-6 high-scoring memories surface every session. The system never reaches deeper into its archive.
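The recency numbers above fall straight out of the half-life formula. A quick self-contained check:

```javascript
// Recency signal: exponential decay with a 7-day (168-hour) half-life.
const recency = (hoursOld) => Math.pow(0.5, hoursOld / 168);

console.log(recency(24).toFixed(2));       // yesterday       -> "0.91"
console.log(recency(14 * 24).toFixed(2));  // two weeks ago   -> "0.25"
console.log(recency(30 * 24).toFixed(2));  // a month ago     -> "0.05"
```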

Here's the actual scoring function from the production codebase:

function compositeScore(result, now) {
  const similarity = result.score;

  const lastAccess = Number(result.entry.lastAccessedAt || result.entry.createdAt);
  const hoursSinceAccess = Math.max(0, (now - lastAccess) / (1000 * 60 * 60));
  const recency = Math.pow(DECAY_RATE, hoursSinceAccess);

  const rawImportance = result.entry.importance || 0.5;

  // Time decay on importance: 90-day half-life, floor at 0.1
  const lastTouch = Math.max(
    Number(result.entry.lastAccessedAt || 0),
    Number(result.entry.createdAt || 0)
  );
  const daysDormant = Math.max(0, (now - lastTouch) / (1000 * 60 * 60 * 24));
  const importance = Math.max(
    IMPORTANCE_FLOOR,
    rawImportance * Math.pow(0.5, daysDormant / IMPORTANCE_DECAY_HALF_LIFE_DAYS)
  );

  const accessCount = Number(result.entry.accessCount || 0);
  const frequency = Math.min(Math.log2(1 + accessCount) * 0.1, MAX_FREQUENCY_BOOST);

  const lastRecalled = recentRecalls.get(result.entry.id);
  const stalePenalty = lastRecalled && now - lastRecalled < STALE_WINDOW_MS
    ? STALE_PENALTY : 1.0;

  const base =
    SCORE_WEIGHTS.similarity * similarity +
    SCORE_WEIGHTS.recency * recency +
    SCORE_WEIGHTS.importance * importance +
    SCORE_WEIGHTS.frequency * frequency;

  return { ...result, finalScore: base * stalePenalty };
}

The Retrieval Pipeline

When a job starts, recall runs through a 10-step pipeline:

1. Decompose prompt into sub-queries (sentences > 15 chars)
2. Embed each sub-query via text-embedding-3-small
3. Hybrid search: vector + FTS with RRF reranker
4. Overfetch 3x the target count
5. Composite score all results
6. Enforce diversity (category dedup)
7. Take top N
8. Mark recalled IDs as stale for anti-rotation
9. Batch update access counts
10. Format into <archival-memories> and <procedures> blocks

The overfetch-then-score pattern matters. If you ask LanceDB for 20 results and score them, your scoring can only reorder those 20. If you fetch 60 and score them, the composite formula can surface memories that would've been filtered out by raw similarity alone. The cost is negligible. Vector search over 230K records takes 2-3 seconds either way.
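Stripped of the storage layer, the overfetch-then-score shape is only a few lines. This is a sketch, not the production code: `search` and `score` are hypothetical stand-ins for the LanceDB hybrid query and the composite scorer, and the max-2-per-category diversity cap is an illustrative assumption:

```javascript
// Overfetch 3x the target, rescore, enforce category diversity, take top N.
// `search` and `score` are stand-ins for the real query and composite scorer.
function recall(query, n, search, score) {
  const candidates = search(query, n * 3);           // overfetch
  return candidates
    .map((c) => ({ ...c, finalScore: score(c) }))    // composite scoring
    .sort((a, b) => b.finalScore - a.finalScore)
    .filter(keepUpTo(2))                             // diversity: cap per category
    .slice(0, n);                                    // take top N
}

// Stateful filter: admit at most `max` results from each category.
function keepUpTo(max) {
  const seen = new Map();
  return (c) => {
    const count = seen.get(c.category) || 0;
    seen.set(c.category, count + 1);
    return count < max;
  };
}
```

The key property: a memory ranked 25th by raw similarity can still make the final cut if recency and importance pull its composite score up, which is exactly what a fetch-20-then-score pipeline can never do.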

The output gets injected into the system prompt as structured XML blocks:

<archival-memories count="12">
- [decision] Danny flagged rhetorical pattern as an AI tell
- [fact] dnakhla.com is Danny's domain
- [preference] Writing: human, natural, personal
...
</archival-memories>

<procedures count="10">
These are learned behaviors. Follow them.
- Articles must go through the council pipeline (dnakhla, article-publication)
- Research methodology: Reddit scan first, then papers (research, reddit)
...
</procedures>

Typically 12-15 archival memories and 8-10 procedures per session. That's roughly 800-1,400 tokens of retrieved context. A tiny fraction of the window, but precisely targeted.

Tier 3: Procedural Memory - How the System Teaches Itself

Procedural memories live in the same LanceDB table as archival memories, distinguished by category='procedure'. But they behave differently.

An archival memory is a fact: "Danny's daughter turns 1 on March 25, 2026."

A procedural memory is a behavior: "For SSRN paper submission, use password-manager credentials. Never use stale plaintext credentials from legacy scripts."

The difference matters for retrieval. Facts match on semantic similarity to the query. Procedures match on tags. Every procedure gets tagged with the tools or domains it applies to:

// Example procedural memories with tags

"Articles for dnakhla.com must go through the council pipeline"
  tags: ["dnakhla", "article-publication", "council-pipeline"]

"Danny's methodology for positioning: research -> plan -> validate"
  tags: ["positioning", "strategy", "market-research"]

"Before building scrapers, research existing datasets and APIs"
  tags: ["research", "methodology", "data-collection"]

When a job prompt mentions "dnakhla" or "article," tag matching boosts relevant procedures by 10% per matching tag on top of the composite score. A procedure tagged [dnakhla, article-publication] gets a 20% boost when the job involves publishing to dnakhla.com, even if semantic similarity alone wouldn't have surfaced it.
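The boost itself is just multiplication. A sketch of the rule as described (the function name is mine, and naive substring matching here is a simplification; the real matcher is presumably more forgiving about tag forms like `article-publication` vs "article"):

```javascript
// Boost a procedure's composite score by 10% per tag found in the job prompt.
// Substring matching is a simplifying assumption for illustration.
function tagBoost(score, tags, prompt) {
  const text = prompt.toLowerCase();
  const matches = tags.filter((t) => text.includes(t.toLowerCase())).length;
  return score * (1 + 0.1 * matches);
}
```

Two matching tags yield a 1.2x multiplier, enough to lift a behaviorally relevant procedure over facts that are merely similar-sounding.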

How Procedures Get Created

Most procedures originate from corrections. Danny says "don't do that, do this instead." The auto-capture system extracts two things from that interaction:

  1. A fact (what Danny said)
  2. An instinct (the behavioral rule to follow next time)

The instinct becomes a procedure with importance 0.85-0.95. High importance means it resists the 90-day decay and stays prominent in retrieval for months.

Some procedures also come from post-mortems. When a job fails and the failure analysis reveals a preventable pattern, the system stores a procedure: "Before building scrapers, research existing datasets and APIs first." That one came from a real incident where we built a 3-day scraping pipeline for data that was already available as a public API.

The Self-Teaching Loop

Here's the emergent property I didn't expect: as procedures accumulate, the system's behavior converges toward my preferences without me having to repeat myself. Early on, every article draft needed manual correction. After 40+ articles and the resulting procedural memories, the system's first drafts now reflect my voice rules, structural preferences, and quality bar.

This isn't fine-tuning. The base model hasn't changed. The context has gotten more precise. The system doesn't "learn" in the ML sense. It remembers corrections and re-applies them. The effect is the same from the user's perspective, but the mechanism is entirely different.


Auto-Capture: The Memory Formation Pipeline

Memories don't get stored manually. After every meaningful session, a 4-stage pipeline runs.

Stage 1: Extraction

An LLM reads the conversation and proposes candidate memories. Two extractors run in parallel:

  • Semantic extractor - pulls facts, decisions, preferences, entities
  • Instinct extractor - pulls behavioral patterns, corrections, workflow rules

The dual extraction matters. A conversation where Danny says "always check Reddit before starting research" produces a fact ("Danny wants Reddit checked first") and an instinct ("scan Reddit for recurring debates before diving into papers"). The instinct becomes a procedure. The fact becomes archival. Same conversation, two different memory types with different retrieval patterns.

Stage 2: Validation

A quality gate filters noise. Conversational filler, transient state ("I'm working on this right now"), and vague observations get rejected. The gate checks specificity. Is this a real fact or just a restatement of something obvious?

Stage 3: Judgment

Each surviving candidate gets compared against existing memories via vector search. An LLM judge decides:

ADD - genuinely new information worth storing
UPDATE - supersedes an existing memory (old deleted, new stored)
DELETE - existing memory is now wrong
NOOP - not worth storing (already known, too vague, transient)

Before the judge even runs, a hard similarity gate catches near-duplicates: if a candidate scores above 0.92 cosine similarity against an existing memory, it's dropped silently. And a count cap blocks storage if more than 2 near-identical records already exist. These two gates handle the 80% case cheaply, saving the expensive LLM judge call for genuinely ambiguous situations.

// Hard similarity gate - runs before the LLM judge
const topMatch = similar[0];
if (topMatch && topMatch.score > 0.92) {
  log.info({ text: candidate.text.slice(0, 60), score: topMatch.score },
    'auto-capture: skipped (near-duplicate)');
  continue;
}

// Count cap - blocks if >2 near-identical already exist
const highSimCount = nearDupes.filter(r =>
  (1 - (r._distance ?? 0)) > 0.95
).length;
if (highSimCount > 2) {
  log.info('store: blocked (>2 near-identical records exist)');
  return null;
}

Stage 4: Contradiction Detection

After new memories are stored, a contradiction detector scans them against the top 50 high-importance existing memories. It looks for incompatible claims:

  • "Danny prefers dark mode" vs "Danny switched to light mode"
  • "Use flux-pro for best quality" vs "Use flux-schnell for best quality"
  • "Broad Street Run is in May" vs "Broad Street Run is in October"

Each contradiction gets a resolution: keep the new memory (delete old) or keep the existing one (log but don't delete new). This is the garbage collection mechanism that prevents the store from accumulating conflicting facts over time.

The whole pipeline costs almost nothing. Extraction and judgment use the cheapest available model. Near-duplicate detection is a vector search, not an LLM call. Contradiction detection only runs when new memories are actually added, which happens in a subset of sessions.

The Embedding Cache: 3 Layers for $0.003/Month

Embedding costs could kill this architecture. Every recall query embeds the prompt. Every auto-capture candidate embeds for dedup checking. At 6,370 jobs, that's a lot of embedding calls.

The 3-layer cache makes it negligible:

// embed() - 3-layer cache
async function embed(text) {
  const key = hashStr(text);

  // L1: In-process Map (50 entries, 2-min TTL, O(1) lookup)
  const cached = embedCache.get(key);
  if (cached && Date.now() - cached.ts < EMBED_CACHE_TTL) return cached.vector;

  // L2: Redis (persistent across restarts, 5-10ms)
  const redisCached = await redisGetEmbed(key);
  if (redisCached) {
    embedCache.set(key, { vector: redisCached, ts: Date.now() });
    return redisCached;
  }

  // L3: OpenAI API (text-embedding-3-small, 300-500ms)
  const vector = await llmEmbed(text, EMBEDDING_MODEL);

  // Store in both caches
  embedCache.set(key, { vector, ts: Date.now() });
  redisSetEmbed(key, vector); // fire-and-forget

  return vector;
}

After the system warms up (first 10-15 minutes of operation), L1 and L2 handle 99%+ of embedding requests. L3 only fires for genuinely new text. Monthly embedding cost: roughly $0.003-0.006. Three-tenths of a cent. That's not a typo.
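The embed() snippet above leaves out how the 50-entry L1 cap is enforced. A minimal sketch of a bounded cache write, assuming insertion-order eviction (the real eviction policy isn't shown here, so treat this as one plausible implementation):

```javascript
// Bounded L1 cache write: evict the oldest-inserted entry past 50 items.
// Insertion-order eviction is an assumption; JS Maps iterate in insert order.
const EMBED_CACHE_MAX = 50;

function cacheSet(cache, key, value) {
  if (cache.size >= EMBED_CACHE_MAX && !cache.has(key)) {
    const oldest = cache.keys().next().value;  // first-inserted key
    cache.delete(oldest);
  }
  cache.set(key, value);
}
```

With TTL checked at read time (as embed() already does) and size checked at write time, the in-process layer stays O(1) and can never grow past a few hundred kilobytes of vectors.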

What Broke and What I'd Change

The 1.9GB Problem

The LanceDB store is 1.9GB after 40 days of operation. That's roughly 17GB per year if growth stays linear. It won't stay linear. It'll accelerate as the system handles more domains and captures more memories per session.

Compaction runs every 5 minutes and cleans up old versions. But the fundamental problem is I don't have aggressive enough garbage collection. Memories below the importance floor of 0.1 should get pruned, not just deprioritized. I haven't built that yet.

Recency Bias

The 7-day half-life works well for operational context (what project am I working on, what did Danny ask yesterday). It works poorly for durable knowledge (my daughter's birthday, my wife's name, my employer). Those facts don't change and shouldn't decay.

The fix I'm planning: split importance decay by category. Entity and fact memories get a much longer half-life (180+ days). Decision and preference memories keep the 90-day decay. Procedural memories somewhere in between.
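That planned fix reduces to a lookup table. A sketch using the half-lives proposed above, with "somewhere in between" for procedures pinned to 120 days purely as an illustrative guess:

```javascript
// Planned: per-category importance half-lives instead of a uniform 90 days.
// The 120-day procedure value is an illustrative guess, not a decided number.
const HALF_LIFE_DAYS_BY_CATEGORY = {
  entity: 180,
  fact: 180,
  decision: 90,
  preference: 90,
  procedure: 120,
};

function decayedImportance(raw, category, daysDormant, floor = 0.1) {
  const halfLife = HALF_LIFE_DAYS_BY_CATEGORY[category] ?? 90;
  return Math.max(floor, raw * Math.pow(0.5, daysDormant / halfLife));
}
```

Under this scheme an entity fact sitting untouched for 90 days keeps about 71% of its importance, while a decision of the same age keeps 50%. Durable knowledge fades slower than operational context.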

The Cold Start

When the system restarts, the in-process embedding cache is empty. The first 10-15 minutes hit the OpenAI API for every embedding. It's not expensive, but it's slow. 300-500ms per embedding vs 0ms from cache. Jobs dispatched during cold start have noticeably higher latency.

The fix is simple: serialize the L1 cache to disk on shutdown, reload on startup. Ten lines of code. I just haven't gotten around to it, which is its own lesson about how production systems accumulate known-but-unresolved issues.

Memory Audit Tooling

I can query the LanceDB store, but I don't have a good dashboard for browsing it. When something goes wrong - a stale memory surfaces, a contradiction slips through - debugging requires manual queries against the store. I need a memory browser that shows what's in the store, when it was last accessed, and what its composite score would be for a given query.

The Numbers That Matter

After 40 days and 6,370 jobs with this architecture:

  • Jobs dispatched: 6,370
  • Success rate: 95.9%
  • Monthly embedding cost: $0.003
  • Core memory: 75 lines, 6,279 bytes, loaded every session
  • Archival store: 1.9GB, ~230K vectors, 7 categories
  • Episodic logs: 39 daily files, 6,802 entries, 2.1MB total
  • Retrieval per session: 12-15 archival + 8-10 procedures = ~1,200 tokens
  • Total context overhead: ~17K tokens (13% of a 128K window)
  • Recall latency: 2-3s hybrid search, <50ms cached

What I'd Build Differently

If I were starting this from scratch:

Separate the importance model from capture. Right now, importance is assigned at capture time and only decays. It should be re-evaluated periodically based on actual usage patterns. A memory that gets recalled 50 times in a month should have its importance boosted, not just its frequency score.

Build the memory browser first. I built the memory system before I built tooling to inspect it. You can't debug what you can't see. The memory browser should've been day-one infrastructure.

Start with stricter dedup thresholds. I began with 0.95 cosine similarity as the dedup gate and lowered it to 0.92 after the store filled with near-duplicates. Starting at 0.90 or even 0.88 would've kept the store cleaner from the beginning.

Implement category-specific decay from the start. The uniform 90-day importance decay was a quick first pass. It took a month of watching entity facts unnecessarily decay to realize different memory types need different half-lives. That's obvious in hindsight.

Context engineering is the infrastructure that makes everything else work. The models improve every quarter. The context architecture is what compounds.
