Vector search is the default choice for AI agent memory retrieval. It's fast, it scales, and semantic similarity search works well for a large class of queries. But if you're building a production memory layer and relying solely on vector search, you will hit a specific class of failures. These failures are predictable, they're frustrating for users, and they compound over time as your memory store grows.
When you store a memory using vector search, you convert the text into a high-dimensional embedding — a numerical representation of the semantic meaning of that text. When you retrieve memories, you embed the query and find the stored vectors that are most similar by distance (cosine similarity or dot product, typically).
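The similarity computation itself is only a few lines. A minimal sketch of cosine similarity over raw vectors (in production the database or ANN index does this, not an application-level loop):

```typescript
// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Ranges from -1 to 1 for real vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```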
This works well when the query and the memory share conceptual overlap. "What are my preferences for API responses?" will retrieve "user prefers JSON with camelCase keys and always wants error codes in the response body" because those are semantically related.
The problem emerges when the query requires precision over similarity.
A user asks: "What API key prefix did I use for the staging environment?"
The stored memory says: "User's staging API key prefix is sk-stg-0041."
A vector search will return this memory — or will it? If you have dozens of stored facts about APIs, environments, keys, and configurations, the vector similarities are going to be close and the ranking becomes unreliable. The exact string sk-stg-0041 is not semantically encoded in the embedding — it's a random identifier. BM25 would find this immediately because that string appears verbatim in the document and nowhere else.
A user asks: "Did I ever mention anything about Graphiti?"
The stored memory says: "User evaluated Graphiti as a temporal reasoning layer for their agent pipeline."
Vector search on "Graphiti" will embed that term and look for similar vectors. But "Graphiti" is a proper noun — a specific framework. Its embedding is not going to cleanly cluster with memories that use the word unless you've fine-tuned embeddings on this specific vocabulary. BM25 solves this trivially. "Graphiti" appears in the query. "Graphiti" appears in the memory. Retrieved.
As your memory store grows — across many users and many sessions — the vector space becomes dense. More memories means more neighbors for any given query. The signal-to-noise ratio of pure similarity search degrades. What worked fine at 100 memories per user starts returning noisy results at 10,000. This is not a tunable problem. It's a fundamental limitation of dense retrieval at scale.
BM25 (Best Match 25) is a classical information retrieval algorithm from the 1990s. It scores documents by term frequency (how often query terms appear in a document), weighted by inverse document frequency (rare terms count for more), normalized by document length, and adjusted for term saturation (so a term appearing 10 times doesn't score 10x better than one appearing once).
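The core of the per-term score can be sketched in a few lines. Here k1 and b are the standard tuning constants; the IDF weighting of rare terms is exactly what makes identifiers like sk-stg-0041 score so strongly:

```typescript
// BM25 score contribution of a single term in a single document.
// k1 controls term-frequency saturation; b controls length normalization.
function bm25TermScore(
  tf: number,        // term frequency in the document
  docLen: number,    // document length in tokens
  avgDocLen: number, // average document length across the corpus
  idf: number,       // inverse document frequency of the term
  k1 = 1.2,
  b = 0.75
): number {
  const lengthNorm = 1 - b + b * (docLen / avgDocLen);
  return idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}
```

The full document score is the sum of this quantity over the query's terms.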
BM25 is fast, deterministic, and requires no embeddings. It finds what you ask for lexically rather than conceptually.
The common dismissal is that BM25 doesn't understand meaning — it can't match synonyms or paraphrases. That's true. Ask for "communication style" and it won't retrieve a memory about "how the user likes to receive information" unless those exact words appear.
But this is exactly why you need BM25 alongside vector search, not instead of it. The cases where BM25 excels — exact terms, proper nouns, specific values, rare identifiers — are precisely the cases where vector search degrades. They're complementary failure modes.
The right architecture runs both retrieval strategies in parallel and merges the results.
```typescript
async function retrieveMemories(query: string, userId: string) {
  const [vectorResults, bm25Results] = await Promise.all([
    vectorSearch(query, userId),  // Semantic similarity
    keywordSearch(query, userId)  // BM25 keyword match
  ]);
  return reciprocalRankFusion(vectorResults, bm25Results);
}
```

Reciprocal Rank Fusion (RRF) is a simple, effective way to merge ranked lists from multiple retrieval strategies. For each result, you compute a score based on its rank in each list:
RRF score(d) = Σ 1 / (k + rank_i(d))

where k is a smoothing constant (60 by convention) and rank_i(d) is the document's 1-based position in the i-th result list.
Results that appear in both lists get boosted. Results that rank high in one list but don't appear in the other still surface — you don't require a result to appear in both to be included.
```typescript
function reciprocalRankFusion(
  vectorResults: MemoryResult[],
  bm25Results: MemoryResult[],
  k = 60
): MemoryResult[] {
  const scores = new Map<string, number>();

  // forEach ranks are 0-based; add 1 to get 1-based positions.
  vectorResults.forEach((result, rank) => {
    const current = scores.get(result.id) ?? 0;
    scores.set(result.id, current + 1 / (k + rank + 1));
  });
  bm25Results.forEach((result, rank) => {
    const current = scores.get(result.id) ?? 0;
    scores.set(result.id, current + 1 / (k + rank + 1));
  });

  // Deduplicate by id across both lists.
  const byId = new Map<string, MemoryResult>();
  for (const result of [...vectorResults, ...bm25Results]) {
    byId.set(result.id, result);
  }

  return [...byId.values()]
    .sort((a, b) => (scores.get(b.id) ?? 0) - (scores.get(a.id) ?? 0))
    .slice(0, 10);
}
```

You don't need a specialized vector database to do this well. PostgreSQL covers both strategies: pgvector (an extension) for similarity search, and built-in full-text search for the lexical side. Strictly speaking, ts_rank_cd is cover-density ranking rather than true BM25, but it fills the same lexical-precision role; extensions such as ParadeDB's pg_search add genuine BM25 if you need it.
```sql
-- Vector similarity via pgvector (cosine distance)
SELECT
  id,
  content,
  metadata,
  1 - (embedding <=> $1::vector) AS vector_score
FROM memories
WHERE user_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 20;
```
```sql
-- Lexical ranking via full-text search.
-- websearch_to_tsquery is safer than to_tsquery here: it accepts
-- free-form user input without throwing syntax errors.
SELECT
  id,
  content,
  metadata,
  ts_rank_cd(search_vector, query) AS lexical_score
FROM memories,
  websearch_to_tsquery('english', $1) query
WHERE user_id = $2
  AND search_vector @@ query
ORDER BY lexical_score DESC
LIMIT 20;
```

A memory layer that does this well has a few components working together.
When a conversation turn completes, you extract structured facts from the message and store them. Each memory entry includes the raw text, the generated embedding, and a tsvector for full-text search. The tsvector is generated automatically by PostgreSQL via a trigger — you don't need to think about it on every write.
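A toy sketch of that write path, with stand-in implementations (`extractFacts` and `embed` are placeholders; a real system would call an LLM for fact extraction and an embedding model, then insert into the memories table and let the trigger populate the tsvector):

```typescript
interface StoredMemory {
  userId: string;
  content: string;
  embedding: number[];
}

// Stand-in for LLM-based fact extraction: naive sentence splitting.
async function extractFacts(message: string): Promise<string[]> {
  return message
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(s => s.length > 0);
}

// Stand-in for an embedding model call.
async function embed(text: string): Promise<number[]> {
  return [text.length, text.split(/\s+/).length];
}

// Stand-in for the database table.
const memoryStore: StoredMemory[] = [];

// Write path: extract facts from a completed turn and store each one
// with its embedding. The tsvector is handled by the database trigger,
// so the application never touches it.
async function storeMemories(message: string, userId: string): Promise<void> {
  for (const content of await extractFacts(message)) {
    memoryStore.push({ userId, content, embedding: await embed(content) });
  }
}
```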
Before the model generates a response, you run the hybrid retrieval query for the current user and inject the top-k results into the system prompt as retrieved context. The dual-path retrieval adds minimal latency when run concurrently — the bottleneck is whichever query takes longer, not the sum.
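The injection step can be a pure formatting function. A minimal sketch (the header wording and numbering are illustrative choices, not a fixed convention):

```typescript
interface MemoryResult {
  id: string;
  content: string;
}

// Format the top-k retrieved memories as a context block for the
// system prompt. Returns an empty string when nothing was retrieved,
// so the caller can skip injection entirely.
function buildMemoryContext(memories: MemoryResult[], maxItems = 10): string {
  if (memories.length === 0) return "";
  const lines = memories
    .slice(0, maxItems)
    .map((m, i) => `${i + 1}. ${m.content}`);
  return `Relevant memories about this user:\n${lines.join("\n")}`;
}
```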
Running pgvector and BM25 in parallel over a properly indexed PostgreSQL table — with Redis caching for hot users — keeps end-to-end retrieval well under 100ms at p95. The memory retrieval step should be invisible to users.
Recently retrieved memories for active sessions can be cached at the edge. If a user sends three messages in quick succession, you don't need to re-run full retrieval for each turn. A short TTL (60-120 seconds) on session-scoped memory results eliminates redundant retrieval cost.
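A minimal in-process sketch of that session-scoped cache (a real deployment would use the Redis layer mentioned above with an EXPIRE; a Map stands in here):

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// TTL cache keyed by string (e.g. `${userId}:${queryHash}`).
// Expired entries are evicted lazily on read.
class TtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

With a 60-120 second TTL, the second and third messages of a burst hit the cache instead of re-running both retrieval queries.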
```sql
CREATE TRIGGER memories_tsvector_update
  BEFORE INSERT OR UPDATE ON memories
  FOR EACH ROW EXECUTE FUNCTION
    tsvector_update_trigger(
      search_vector,
      'pg_catalog.english',
      content
    );
```

On PostgreSQL 12+, a generated column achieves the same thing without a trigger: declare search_vector as `tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED`.

Hybrid retrieval with RRF gets you 80% of the way there. For higher-stakes applications — customer support agents that will be quoted in tickets, or agents where a wrong memory recall has real consequences — you can add a cross-encoder reranker as a post-processing step.
A cross-encoder takes a (query, document) pair and produces a relevance score — not by comparing embeddings, but by attending to both texts jointly. This is more expensive than embedding similarity (it requires a forward pass per candidate), but you're only running it on the top 20 candidates from hybrid retrieval, so the cost is manageable.
```typescript
async function rerank(
  query: string,
  candidates: MemoryResult[]
): Promise<MemoryResult[]> {
  const scores = await crossEncoderScore(
    query,
    candidates.map(c => c.content)
  );
  return candidates
    .map((c, i) => ({ ...c, rerankScore: scores[i] }))
    .sort((a, b) => b.rerankScore - a.rerankScore);
}
```

If you're building agent memory and using only vector search, you're leaving precision on the table. The failure cases — exact-match queries, rare terms, dense memory stores — are exactly the cases your users will hit and notice. The fix is not to abandon vector search. It's to run BM25 alongside it and merge the results. Memory is only as useful as what you can retrieve from it. Get the retrieval right.
RetainDB uses vector + BM25 hybrid retrieval with RRF by default. You get precision retrieval without having to build any of this yourself.