Vector search is the default choice for AI agent memory retrieval. It's fast, it scales, and semantic similarity search works well for a large class of queries. But if you're building a production memory layer and relying solely on vector search, you will hit a specific class of failures. These failures are predictable, they're frustrating for users, and they compound over time as your memory store grows.
When you store a memory using vector search, you convert the text into a high-dimensional embedding — a numerical representation of the semantic meaning of that text. When you retrieve memories, you embed the query and find the stored vectors that are most similar by distance (cosine similarity or dot product, typically).
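The similarity computation itself is only a few lines. A minimal sketch of cosine similarity over raw vectors (in production the database or ANN index does this, not an application-level loop):

```typescript
// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Ranges from -1 to 1 for real vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```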
This works well when the query and the memory share conceptual overlap. "What are my preferences for API responses?" will retrieve "user prefers JSON with camelCase keys and always wants error codes in the response body" because those are semantically related.
The problem emerges when the query requires precision over similarity.
A user asks: "What API key prefix did I use for the staging environment?"
The stored memory says: "User's staging API key prefix is sk-stg-0041."
A vector search will return this memory — or will it? If you have dozens of stored facts about APIs, environments, keys, and configurations, the vector similarities are going to be close and the ranking becomes unreliable. The exact string sk-stg-0041 is not semantically encoded in the embedding — it's a random identifier. BM25 would find this immediately because that string appears verbatim in the document and nowhere else.
A user asks: "Did I ever mention anything about Graphiti?"
The stored memory says: "User evaluated Graphiti as a temporal reasoning layer for their agent pipeline."
Vector search on "Graphiti" will embed that term and look for similar vectors. But "Graphiti" is a proper noun — a specific framework. Its embedding is not going to cleanly cluster with memories that use the word unless you've fine-tuned embeddings on this specific vocabulary. BM25 solves this trivially. "Graphiti" appears in the query. "Graphiti" appears in the memory. Retrieved.
As your memory store grows — across many users and many sessions — the vector space becomes dense. More memories means more neighbors for any given query. The signal-to-noise ratio of pure similarity search degrades. What worked fine at 100 memories per user starts returning noisy results at 10,000. This is not a tunable problem. It's a fundamental limitation of dense retrieval at scale.
BM25 (Best Match 25) is a classical information retrieval algorithm from the 1990s. It scores documents by term frequency (how often query terms appear in a document), weighted by inverse document frequency (rare terms count for more), normalized by document length, and adjusted for term saturation (so a term appearing 10 times doesn't score 10x better than one appearing once).
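The core of the per-term score can be sketched in a few lines. Here k1 and b are the standard tuning constants; the IDF weighting of rare terms is exactly what makes identifiers like sk-stg-0041 score so strongly:

```typescript
// BM25 score contribution of a single term in a single document.
// k1 controls term-frequency saturation; b controls length normalization.
function bm25TermScore(
  tf: number,        // term frequency in the document
  docLen: number,    // document length in tokens
  avgDocLen: number, // average document length across the corpus
  idf: number,       // inverse document frequency of the term
  k1 = 1.2,
  b = 0.75
): number {
  const lengthNorm = 1 - b + b * (docLen / avgDocLen);
  return idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}
```

The full document score is the sum of this quantity over the query's terms.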
BM25 is fast, deterministic, and requires no embeddings. It finds what you ask for lexically rather than conceptually.
The common dismissal is that BM25 doesn't understand meaning — it can't match synonyms or paraphrases. That's true. Ask for "communication style" and it won't retrieve a memory about "how the user likes to receive information" unless those exact words appear.
But this is exactly why you need BM25 alongside vector search, not instead of it. The cases where BM25 excels — exact terms, proper nouns, specific values, rare identifiers — are precisely the cases where vector search degrades. They're complementary failure modes.
The right architecture runs both retrieval strategies in parallel and merges the results.
```typescript
async function retrieveMemories(query: string, userId: string) {
  const [vectorResults, bm25Results] = await Promise.all([
    vectorSearch(query, userId),  // Semantic similarity
    keywordSearch(query, userId)  // BM25 keyword match
  ]);
  return reciprocalRankFusion(vectorResults, bm25Results);
}
```

Reciprocal Rank Fusion (RRF) is a simple, effective way to merge ranked lists from multiple retrieval strategies. For each result, you compute a score based on its rank in each list:
RRF score(d) = Σ 1 / (k + rank_i(d))

where k is a smoothing constant (60 by convention) and rank_i(d) is the document's 1-based position in the i-th result list.
Results that appear in both lists get boosted. Results that rank high in one list but don't appear in the other still surface — you don't require a result to appear in both to be included.
```typescript
function reciprocalRankFusion(
  vectorResults: MemoryResult[],
  bm25Results: MemoryResult[],
  k = 60
): MemoryResult[] {
  const scores = new Map<string, number>();

  // forEach ranks are 0-based; add 1 to get 1-based positions.
  vectorResults.forEach((result, rank) => {
    const current = scores.get(result.id) ?? 0;
    scores.set(result.id, current + 1 / (k + rank + 1));
  });
  bm25Results.forEach((result, rank) => {
    const current = scores.get(result.id) ?? 0;
    scores.set(result.id, current + 1 / (k + rank + 1));
  });

  // Deduplicate by id across both lists.
  const byId = new Map<string, MemoryResult>();
  for (const result of [...vectorResults, ...bm25Results]) {
    byId.set(result.id, result);
  }

  return [...byId.values()]
    .sort((a, b) => (scores.get(b.id) ?? 0) - (scores.get(a.id) ?? 0))
    .slice(0, 10);
}
```

You don't need a specialized vector database to do this well. PostgreSQL covers both strategies: pgvector (an extension) for similarity search, and built-in full-text search for the lexical side. Strictly speaking, ts_rank_cd is cover-density ranking rather than true BM25, but it fills the same lexical-precision role; extensions such as ParadeDB's pg_search add genuine BM25 if you need it.
```sql
-- Vector similarity via pgvector (cosine distance)
SELECT
  id,
  content,
  metadata,
  1 - (embedding <=> $1::vector) AS vector_score
FROM memories
WHERE user_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 20;
```
```sql
-- Lexical ranking via full-text search.
-- websearch_to_tsquery is safer than to_tsquery here: it accepts
-- free-form user input without throwing syntax errors.
SELECT
  id,
  content,
  metadata,
  ts_rank_cd(search_vector, query) AS lexical_score
FROM memories,
  websearch_to_tsquery('english', $1) query
WHERE user_id = $2
  AND search_vector @@ query
ORDER BY lexical_score DESC
LIMIT 20;
```

A memory layer that does this well has a few components working together.
When a conversation turn completes, you extract structured facts from the message and store them. Each memory entry includes the raw text, the generated embedding, and a tsvector for full-text search. The tsvector is generated automatically by PostgreSQL via a trigger — you don't need to think about it on every write.
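A toy sketch of that write path, with stand-in implementations (`extractFacts` and `embed` are placeholders; a real system would call an LLM for fact extraction and an embedding model, then insert into the memories table and let the trigger populate the tsvector):

```typescript
interface StoredMemory {
  userId: string;
  content: string;
  embedding: number[];
}

// Stand-in for LLM-based fact extraction: naive sentence splitting.
async function extractFacts(message: string): Promise<string[]> {
  return message
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(s => s.length > 0);
}

// Stand-in for an embedding model call.
async function embed(text: string): Promise<number[]> {
  return [text.length, text.split(/\s+/).length];
}

// Stand-in for the database table.
const memoryStore: StoredMemory[] = [];

// Write path: extract facts from a completed turn and store each one
// with its embedding. The tsvector is handled by the database trigger,
// so the application never touches it.
async function storeMemories(message: string, userId: string): Promise<void> {
  for (const content of await extractFacts(message)) {
    memoryStore.push({ userId, content, embedding: await embed(content) });
  }
}
```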
Before the model generates a response, you run the hybrid retrieval query for the current user and inject the top-k results into the system prompt as retrieved context. The dual-path retrieval adds minimal latency when run concurrently — the bottleneck is whichever query takes longer, not the sum.
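The injection step can be a pure formatting function. A minimal sketch (the header wording and numbering are illustrative choices, not a fixed convention):

```typescript
interface MemoryResult {
  id: string;
  content: string;
}

// Format the top-k retrieved memories as a context block for the
// system prompt. Returns an empty string when nothing was retrieved,
// so the caller can skip injection entirely.
function buildMemoryContext(memories: MemoryResult[], maxItems = 10): string {
  if (memories.length === 0) return "";
  const lines = memories
    .slice(0, maxItems)
    .map((m, i) => `${i + 1}. ${m.content}`);
  return `Relevant memories about this user:\n${lines.join("\n")}`;
}
```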
Running pgvector and BM25 in parallel over a properly indexed PostgreSQL table — with Redis caching for hot users — keeps end-to-end retrieval well under 100ms at p95. The memory retrieval step should be invisible to users.
Recently retrieved memories for active sessions can be cached at the edge. If a user sends three messages in quick succession, you don't need to re-run full retrieval for each turn. A short TTL (60-120 seconds) on session-scoped memory results eliminates redundant retrieval cost.
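A minimal in-process sketch of that session-scoped cache (a real deployment would use the Redis layer mentioned above with an EXPIRE; a Map stands in here):

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// TTL cache keyed by string (e.g. `${userId}:${queryHash}`).
// Expired entries are evicted lazily on read.
class TtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

With a 60-120 second TTL, the second and third messages of a burst hit the cache instead of re-running both retrieval queries.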
```sql
CREATE TRIGGER memories_tsvector_update
  BEFORE INSERT OR UPDATE ON memories
  FOR EACH ROW EXECUTE FUNCTION
    tsvector_update_trigger(
      search_vector,
      'pg_catalog.english',
      content
    );
```

On PostgreSQL 12+, a generated column achieves the same thing without a trigger: declare search_vector as `tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED`.

Hybrid retrieval with RRF gets you 80% of the way there. For higher-stakes applications — customer support agents that will be quoted in tickets, or agents where a wrong memory recall has real consequences — you can add a cross-encoder reranker as a post-processing step.
A cross-encoder takes a (query, document) pair and produces a relevance score — not by comparing embeddings, but by attending to both texts jointly. This is more expensive than embedding similarity (it requires a forward pass per candidate), but you're only running it on the top 20 candidates from hybrid retrieval, so the cost is manageable.
```typescript
async function rerank(
  query: string,
  candidates: MemoryResult[]
): Promise<MemoryResult[]> {
  const scores = await crossEncoderScore(
    query,
    candidates.map(c => c.content)
  );
  return candidates
    .map((c, i) => ({ ...c, rerankScore: scores[i] }))
    .sort((a, b) => b.rerankScore - a.rerankScore);
}
```

If you're building agent memory and using only vector search, you're leaving precision on the table. The failure cases — exact-match queries, rare terms, dense memory stores — are exactly the cases your users will hit and notice. The fix is not to abandon vector search. It's to run BM25 alongside it and merge the results. Memory is only as useful as what you can retrieve from it. Get the retrieval right.
RetainDB uses vector + BM25 hybrid retrieval with RRF by default. You get precision retrieval without having to build any of this yourself.