My production AI system dispatched a job to draft a research brief. It had run thousands of jobs by that point. It referenced a conversation I'd had three days earlier about a specific dataset. And it got the brief completely wrong.
The model was fine. The problem was that it retrieved a memory from February about a different dataset with a similar name, and that stale memory overwrote the correct context from the recent conversation. One bad retrieval cascaded into a confidently wrong output. That failure made me rebuild the retrieval scorer. The full memory system it lives inside (the 75-line core file, the 230K-vector LanceDB archival store, the self-teaching procedural tier) is laid out in the companion piece, "Context engineering in practice". This essay is about one subsystem of that store: the function that decides which 20 memories out of 230,000 actually reach the model.
Anthropic, in "Effective context engineering for AI agents", names the phenomenon I hit: context rot, where a model's ability to use information degrades as the window fills with longer, noisier input. Chroma's research team measured it directly in "Context Rot: How Increasing Input Tokens Impacts LLM Performance", finding that retrieval accuracy drops as input length grows even when the relevant fact is sitting right there in the window. My February memory wasn't just dead weight. It was actively wrong context that displaced the right context. Anthropic's prescription is to curate the smallest high-signal set for the task. A retrieval scorer is the mechanism that does that curation, one query at a time.
Why raw similarity is not enough
The archival store is a LanceDB vector table holding every meaningful fact, preference, decision, and entity the system has learned. Hybrid search over it uses an RRF reranker, the reciprocal rank fusion method of Cormack, Clarke, and Buettcher, to merge vector and full-text results into a single ranked list. That part works. The problem is what "ranked" means.
Say I search for "dataset for research brief." Cosine similarity returns the 20 memories most linguistically similar to that phrase. But "dataset" appears in dozens of memories spanning months. The February memory about a different dataset scores just as high as the one from three days ago. Pure similarity has no concept of time. It has no concept of how important a memory was when it was stored, and no concept of whether the system has already surfaced it five times this hour.
Those are four different signals. Raw cosine similarity captures exactly one. The fix is to capture all four and weigh them.
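For reference, the rank fusion step mentioned above can be sketched in a few lines. This is not LanceDB's implementation, only an illustration of the idea from the Cormack, Clarke, and Buettcher paper: each ranked list contributes 1 / (k + rank) for every document it contains, with k = 60 as the constant used in that paper. The function name and the memory IDs are made up for the example.

```javascript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.
// k = 60 follows the original paper; a real reranker may expose it as a setting.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical IDs: one list from vector search, one from full-text search.
const vectorHits = ["m7", "m2", "m9"];
const fullTextHits = ["m2", "m4", "m7"];
console.log(reciprocalRankFusion([vectorHits, fullTextHits]));
```

A memory that ranks well in both lists rises to the top, but nothing in the fused score knows how old the memory is or how often it has been used.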
The composite scoring formula
The constants the scorer is tuned to:
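// Composite scoring constants
const SCORE_WEIGHTS = {
  similarity: 0.45,
  recency: 0.25,
  importance: 0.20,
  frequency: 0.10,
};
const DECAY_HALF_LIFE_HOURS = 168; // 7 days
// Per-hour multiplier chosen so the recency score halves every 168 hours
const DECAY_RATE = Math.pow(0.5, 1 / DECAY_HALF_LIFE_HOURS);
// Importance decay, using the values described below: 90-day half-life, floor at 0.1
const IMPORTANCE_DECAY_HALF_LIFE_DAYS = 90;
const IMPORTANCE_FLOOR = 0.1;
// Anti-stale: recently-recalled memories get penalized
const STALE_PENALTY = 0.5;
const STALE_WINDOW_MS = 60 * 60 * 1000; // 1 hour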
Similarity (0.45) - Still the largest weight. Semantic relevance matters most. But it's less than half the total score, which means the other signals can override it when they should.
Recency (0.25) - Exponential decay with a 7-day half-life. A memory from yesterday scores roughly 0.9 on recency. A memory from two weeks ago scores roughly 0.25. A memory from a month ago scores roughly 0.05. This is what killed the stale February memory. It got a near-zero recency score that dragged its composite below the three-day-old correct memory.
Importance (0.20) - Assigned at capture time. My own corrections get a high score. Auto-captured facts get a lower score depending on specificity. Importance itself decays with a 90-day half-life down to a floor of 0.1. Memories that aren't accessed gradually fade rather than cluttering the store forever.
Frequency (0.10) - Log-scaled access count. Memories that get recalled often score slightly higher, which creates a self-reinforcing loop for the memories that are actually useful.
Anti-stale penalty (0.5x) - Any memory recalled in the last hour gets its score cut in half. Without it, the same 5-6 high-scoring memories surface every session. The system never reaches deeper into its archive. This penalty is the difference between a memory store that feels like it's thinking and one that loops on the same handful of facts.
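As a quick check on the recency figures above, the sketch below evaluates the same half-life decay at a few ages. `recencyAt` is a helper name introduced here for illustration; it is not part of the production code.

```javascript
// 7-day half-life, matching the constants above.
const DECAY_HALF_LIFE_HOURS = 168;
const DECAY_RATE = Math.pow(0.5, 1 / DECAY_HALF_LIFE_HOURS);

// Hypothetical helper: recency score for a memory last touched `hours` ago.
// Negative ages (clock skew) are clamped to zero, as in the scorer.
function recencyAt(hours) {
  return Math.pow(DECAY_RATE, Math.max(0, hours));
}

const ages = { "1 day": 24, "3 days": 72, "7 days": 168, "14 days": 336, "30 days": 720 };
for (const [label, hours] of Object.entries(ages)) {
  console.log(label.padEnd(8), recencyAt(hours).toFixed(3));
}
```

The score halves every seven days, so a memory untouched for two months contributes almost nothing through the recency term.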
Here's the actual scoring function from the production codebase:
// MAX_FREQUENCY_BOOST (cap on the frequency term) and recentRecalls
// (Map of memory id -> last-recall timestamp in ms) are not shown in this excerpt.
function compositeScore(result, now) {
  // Signal 1: semantic relevance from the hybrid search
  const similarity = result.score;

  // Signal 2: recency, exponential decay since last access (or creation)
  const lastAccess = Number(result.entry.lastAccessedAt || result.entry.createdAt);
  const hoursSinceAccess = Math.max(0, (now - lastAccess) / (1000 * 60 * 60));
  const recency = Math.pow(DECAY_RATE, hoursSinceAccess);

  // Signal 3: importance, defaulting to 0.5 when none was assigned at capture
  const rawImportance = result.entry.importance || 0.5;
  // Time decay on importance: 90-day half-life, floor at 0.1
  const lastTouch = Math.max(
    Number(result.entry.lastAccessedAt || 0),
    Number(result.entry.createdAt || 0)
  );
  const daysDormant = Math.max(0, (now - lastTouch) / (1000 * 60 * 60 * 24));
  const importance = Math.max(
    IMPORTANCE_FLOOR,
    rawImportance * Math.pow(0.5, daysDormant / IMPORTANCE_DECAY_HALF_LIFE_DAYS)
  );

  // Signal 4: log-scaled access count, capped
  const accessCount = Number(result.entry.accessCount || 0);
  const frequency = Math.min(Math.log2(1 + accessCount) * 0.1, MAX_FREQUENCY_BOOST);

  // Anti-stale: halve the score if this memory was recalled within the window
  const lastRecalled = recentRecalls.get(result.entry.id);
  const stalePenalty = lastRecalled && now - lastRecalled < STALE_WINDOW_MS
    ? STALE_PENALTY : 1.0;

  const base =
    SCORE_WEIGHTS.similarity * similarity +
    SCORE_WEIGHTS.recency * recency +
    SCORE_WEIGHTS.importance * importance +
    SCORE_WEIGHTS.frequency * frequency;
  return { ...result, finalScore: base * stalePenalty };
}
The weights aren't sacred. They're tuned for an operational assistant where most queries are about what's happening this week. A system retrieving durable knowledge (medical records, legal precedent) would weigh recency far lower and importance far higher. The point isn't the exact numbers. It's that retrieval ranking is a multi-signal decision, and pretending similarity alone is the answer is how you get a confidently wrong brief at 2 AM.
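To make the trade-off concrete, here is a simplified worked example of the failure from the opening. The inputs are illustrative, not measured: both memories get the same similarity of 0.8 and the default importance of 0.5, neither has been recalled before (so the frequency term is zero and no stale penalty applies), and the stale memory is assumed to be 60 days old. `sketchScore` is a stripped-down stand-in for the production function.

```javascript
const SCORE_WEIGHTS = { similarity: 0.45, recency: 0.25, importance: 0.20, frequency: 0.10 };
const DECAY_HALF_LIFE_HOURS = 168;
const IMPORTANCE_DECAY_HALF_LIFE_DAYS = 90;
const IMPORTANCE_FLOOR = 0.1;

// Simplified composite: frequency fixed at 0, no anti-stale penalty.
function sketchScore({ similarity, ageHours, rawImportance }) {
  const recency = Math.pow(0.5, ageHours / DECAY_HALF_LIFE_HOURS);
  const importance = Math.max(
    IMPORTANCE_FLOOR,
    rawImportance * Math.pow(0.5, ageHours / 24 / IMPORTANCE_DECAY_HALF_LIFE_DAYS)
  );
  return (
    SCORE_WEIGHTS.similarity * similarity +
    SCORE_WEIGHTS.recency * recency +
    SCORE_WEIGHTS.importance * importance
  );
}

// Illustrative inputs only: tied similarity, default importance, different ages.
const staleMemory = sketchScore({ similarity: 0.8, ageHours: 60 * 24, rawImportance: 0.5 });
const freshMemory = sketchScore({ similarity: 0.8, ageHours: 3 * 24, rawImportance: 0.5 });
console.log({ staleMemory: staleMemory.toFixed(3), freshMemory: freshMemory.toFixed(3) });
```

With similarity tied, the three-day-old memory wins by a clear margin, and the gap comes almost entirely from the recency term.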
The retrieval pipeline around it
The scorer doesn't run in isolation. When a job starts, recall runs through a 10-step pipeline:
1. Decompose prompt into sub-queries (sentences > 15 chars)
2. Embed each sub-query via text-embedding-3-small
3. Hybrid search: vector + FTS with RRF reranker
4. Overfetch 3x the target count
5. Composite score all results
6. Enforce diversity (category dedup)
7. Take top N
8. Mark recalled IDs as stale for anti-rotation
9. Batch update access counts
10. Format into <archival-memories> and <procedures> blocks
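Step 1 of the list above can be sketched as follows. The 15-character threshold comes from the list; the splitting rule (sentence punctuation or newlines) is an assumption, since the actual tokenization is not shown, and `decomposePrompt` is a hypothetical name.

```javascript
const MIN_SUBQUERY_CHARS = 15;

// Assumed splitting rule: break on sentence punctuation or newlines,
// then keep only fragments longer than the threshold.
function decomposePrompt(prompt) {
  return prompt
    .split(/[.!?\n]+/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > MIN_SUBQUERY_CHARS);
}

console.log(
  decomposePrompt("Draft the research brief. Thanks! Use the dataset we discussed on Monday.")
);
```

Short fragments such as "Thanks" are dropped, so they never trigger a search of their own. A splitter this naive would also break on decimals like "0.45", so a real implementation likely needs something sturdier.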
The overfetch-then-score pattern in steps 4 and 5 matters. If you ask LanceDB for 20 results and score them, your scoring can only reorder those 20. If you fetch 60 and score them, the composite formula can surface memories that would've been filtered out by raw similarity alone. Vector search over 230K records takes 2-3 seconds either way, so the extra cost is negligible.
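A minimal sketch of steps 4 through 7, under stated assumptions: `scoreFn` stands in for `compositeScore`, the per-category cap is one guess at what "category dedup" means, and the toy scoring function and candidates exist only to make the example runnable. No LanceDB calls appear here; the candidates are whatever the hybrid search returned.

```javascript
const OVERFETCH_FACTOR = 3;

// How many candidates to request from search for a given target count.
function fetchLimit(targetCount) {
  return targetCount * OVERFETCH_FACTOR;
}

// Score every candidate, sort by finalScore, cap each category, keep the top N.
// `maxPerCategory` is an assumed knob, not a documented setting.
function rerank(candidates, scoreFn, targetCount, maxPerCategory = 3) {
  const ranked = candidates
    .map((candidate) => scoreFn(candidate))
    .sort((a, b) => b.finalScore - a.finalScore);
  const perCategory = new Map();
  const picked = [];
  for (const item of ranked) {
    const category = item.entry.category ?? "uncategorized";
    const used = perCategory.get(category) ?? 0;
    if (used >= maxPerCategory) continue; // diversity: skip over-represented categories
    perCategory.set(category, used + 1);
    picked.push(item);
    if (picked.length === targetCount) break;
  }
  return picked;
}

// Toy data: "old" has the best similarity but loses once recency counts.
const toyScore = (r) => ({ ...r, finalScore: 0.45 * r.score + 0.25 * r.recency });
const candidates = [
  { score: 0.90, recency: 0.01, entry: { id: "old", category: "fact" } },
  { score: 0.85, recency: 0.74, entry: { id: "recent", category: "fact" } },
  { score: 0.80, recency: 0.90, entry: { id: "yesterday", category: "decision" } },
];
console.log(fetchLimit(20), rerank(candidates, toyScore, 2).map((r) => r.entry.id));
```

Had the search been asked for only two results, "yesterday" would never have reached the scorer, which is the whole argument for overfetching.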
Step 8 is where the anti-stale penalty gets its teeth. After a memory is recalled, its ID is written into an in-process map with a timestamp. The next query within the hour sees that timestamp and halves the memory's score. This is what stops the same five facts from dominating every session and forces the retriever deeper into the archive.
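The bookkeeping described above is small enough to sketch in full. The constants match the ones listed earlier; `markRecalled` and `stalePenaltyFor` are hypothetical helper names, and the production code may organize this differently.

```javascript
const STALE_PENALTY = 0.5;
const STALE_WINDOW_MS = 60 * 60 * 1000; // 1 hour

// In-process map of memory ID -> timestamp (ms) of its most recent recall.
const recentRecalls = new Map();

// Step 8: record every ID that was just returned to the model.
function markRecalled(ids, now = Date.now()) {
  for (const id of ids) recentRecalls.set(id, now);
}

// Multiplier the scorer applies: 0.5 inside the window, 1.0 otherwise.
function stalePenaltyFor(id, now = Date.now()) {
  const lastRecalled = recentRecalls.get(id);
  return lastRecalled !== undefined && now - lastRecalled < STALE_WINDOW_MS
    ? STALE_PENALTY
    : 1.0;
}

// Usage: a memory recalled at t0 is penalized 10 minutes later, not 61 minutes later.
const t0 = Date.now();
markRecalled(["m1"], t0);
console.log(stalePenaltyFor("m1", t0 + 10 * 60 * 1000)); // 0.5
console.log(stalePenaltyFor("m1", t0 + 61 * 60 * 1000)); // 1
```

Because the penalty is a multiplier rather than an exclusion, a memory that is overwhelmingly relevant can still surface twice in an hour; it just has to beat the competition at half strength.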
The output gets injected into the system prompt as structured XML blocks:
<archival-memories count="12">
- [decision] Daniel flagged a rhetorical pattern as a tell of AI prose
- [fact] dnakhla.com is Daniel's domain
- [preference] Writing: human, natural, personal
...
</archival-memories>
<procedures count="10">
These are learned behaviors. Follow them.
- Articles must go through the council pipeline (dnakhla, article-publication)
- Research methodology: Reddit scan first, then papers (research, reddit)
...
</procedures>
Typically 12-15 archival memories and 8-10 procedures per session. That's roughly 800-1,400 tokens of retrieved context: a tiny fraction of the window, but precisely targeted. Anthropic's "Building Effective Agents" (Schluntz and Zhang, December 2024) argues for keeping one augmented loop lean and giving it the right tools; the right context is the other half of that, and 1,200 curated tokens beat 50,000 dumped ones.
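Producing the two blocks shown above is plain string assembly. The sketch below assumes entry shapes that the source does not confirm (`category` and `text` for memories, `text` and `tags` for procedures); the function names are hypothetical.

```javascript
// Assumed shapes (not confirmed by the source):
//   memory:    { category: string, text: string }
//   procedure: { text: string, tags: string[] }
function formatArchivalMemories(memories) {
  const lines = memories.map((m) => `- [${m.category}] ${m.text}`);
  return [
    `<archival-memories count="${memories.length}">`,
    ...lines,
    "</archival-memories>",
  ].join("\n");
}

function formatProcedures(procedures) {
  const lines = procedures.map((p) => `- ${p.text} (${p.tags.join(", ")})`);
  return [
    `<procedures count="${procedures.length}">`,
    "These are learned behaviors. Follow them.",
    ...lines,
    "</procedures>",
  ].join("\n");
}

console.log(formatArchivalMemories([{ category: "fact", text: "dnakhla.com is Daniel's domain" }]));
```

This sketch does not escape angle brackets inside memory text; a memory containing a literal closing tag could end the block early, so that is worth handling in anything beyond a demo.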
What I'd watch if you build this
The 7-day recency half-life works well for operational context (what project am I working on, what did I ask for yesterday). It works poorly for durable knowledge (my daughter's birthday, my wife's name, my employer). Those facts don't change and shouldn't decay. The fix I'm planning is to split importance decay by category: entity and fact memories get a much longer half-life (180+ days), decision and preference memories keep the 90-day decay, procedural memories somewhere in between. A single global half-life is the first thing that will bite you.
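One way the planned split might look, as a sketch rather than shipped code. The 90-day value and the 0.1 floor come from the current system and the 180-day value from the plan above; the 135-day figure for procedures is an arbitrary midpoint chosen for illustration, and the names are hypothetical.

```javascript
// Hypothetical per-category half-lives, in days.
const IMPORTANCE_HALF_LIFE_DAYS = new Map([
  ["entity", 180],
  ["fact", 180],
  ["decision", 90],
  ["preference", 90],
  ["procedure", 135], // "somewhere in between": arbitrary midpoint
]);
const DEFAULT_HALF_LIFE_DAYS = 90;
const IMPORTANCE_FLOOR = 0.1;

// Decay raw importance by how long the memory has been dormant,
// using the half-life for its category and never dropping below the floor.
function decayedImportance(rawImportance, category, daysDormant) {
  const halfLife = IMPORTANCE_HALF_LIFE_DAYS.get(category) ?? DEFAULT_HALF_LIFE_DAYS;
  return Math.max(
    IMPORTANCE_FLOOR,
    rawImportance * Math.pow(0.5, Math.max(0, daysDormant) / halfLife)
  );
}

// After 90 dormant days a fact retains more importance than a decision.
console.log(decayedImportance(0.8, "fact", 90).toFixed(3));
console.log(decayedImportance(0.8, "decision", 90).toFixed(3));
```

The same lookup could drive the recency half-life as well, though the plan as written only mentions importance decay.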
The anti-stale penalty has a gap of its own: it needs persistence it doesn't have yet. Halving a score once is good rotation. But the in-process map clears on restart, so right after a cold start the penalty does nothing and you're back to the same five facts until the map refills. If you build this, persist the recent-recall map alongside the rest of the cache.
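A minimal sketch of that persistence, assuming the recall map is a `Map` of memory ID to timestamp in milliseconds. The two functions are hypothetical and deliberately pure: they convert the map to and from a JSON string, and where that string is stored (a file, a cache table) is left to the caller.

```javascript
const STALE_WINDOW_MS = 60 * 60 * 1000; // 1 hour

// Serialize only entries still inside the penalty window.
function serializeRecalls(recentRecalls, now = Date.now()) {
  const live = [...recentRecalls].filter(([, ts]) => now - ts < STALE_WINDOW_MS);
  return JSON.stringify(live);
}

// Rebuild the map on startup, dropping anything that expired while the process was down.
function restoreRecalls(json, now = Date.now()) {
  let pairs;
  try {
    pairs = JSON.parse(json);
  } catch {
    return new Map(); // corrupt or missing snapshot: start cold
  }
  if (!Array.isArray(pairs)) return new Map();
  return new Map(
    pairs.filter(
      (p) => Array.isArray(p) && typeof p[1] === "number" && now - p[1] < STALE_WINDOW_MS
    )
  );
}

// Usage: snapshot before shutdown, restore after restart.
const snapshot = serializeRecalls(new Map([["m1", Date.now()]]));
console.log(restoreRecalls(snapshot).has("m1"));
```

Filtering on both sides keeps the snapshot small and means a long outage simply restores an empty map, which is the same behavior as today's cold start.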
If you're sitting on a memory system that retrieves by cosine similarity alone and occasionally produces confidently wrong output, the four-signal composite above is the smallest change that fixes the class. Start with similarity at 0.45 and a recency half-life matched to how fast your domain's facts go stale, overfetch 3x before you score, and add the anti-stale penalty the day you notice the same memories looping. The scorer is 30 lines. The failure it prevents is the one that erodes trust in the whole system.