The illusion of intelligence
Large language models are, at their core, stateless functions. You pass in text, you get text back. The model has no awareness of yesterday, no concept of this user, no memory of the last time this conversation happened. What it has is parameters: billions of weights trained on enormous amounts of text, encoding patterns, facts, and reasoning strategies into the structure of the network itself.
This is often described as "what the model knows." But it's more accurate to say it's what the model has internalized through training. That knowledge is frozen at training time. It does not update when a user tells the agent something new. It does not get sharper with each conversation. And it has no mechanism for distinguishing one user from another.
The intelligence you perceive in a well-prompted AI conversation is not the model remembering you. It's the model reasoning over whatever you put in front of it right now. Memory is not a feature of the model. It's a feature of the system around the model.
Think of the model as an extremely well-read consultant who is also severely amnesic. Every time they walk into the room, they bring no recollection of previous meetings. They are brilliant in the moment. But if you don't hand them notes before the meeting starts, they know nothing about your history. The notes are the memory system. The consultant is the model.
Once you accept that memory is infrastructure, not a model capability, you can start designing it properly.
The four types of AI memory
Researchers and practitioners have converged on a useful taxonomy for AI agent memory, loosely borrowed from cognitive science. Each type serves a distinct purpose and requires different implementation choices.
In-context memory
This is the content of the current context window: the conversation so far, system prompt, tool outputs, retrieved documents. Everything here is available to the model right now. Nothing here persists beyond the current session unless something explicitly saves it.
In-context memory is fast and immediately relevant, but bounded by the context window size and inherently ephemeral. Stuffing every past conversation into a single context window is not the same as memory. It's a patch. It doesn't scale and it doesn't get smarter over time.
Episodic memory
Episodic memory stores specific events: "On March 12th, the user mentioned they were switching CRM providers." "In the last session, the user asked about Q1 invoices and seemed frustrated with the delay."
This type captures the narrative of a user's history with your product. It's time-indexed and event-scoped. It's the memory type most directly responsible for making a user feel recognized.
Semantic memory
Semantic memory stores facts about the user, their world, and their context: "This user works in healthcare compliance." "They prefer concise answers without preamble." "Their team has 14 people."
These are not events. They're stable, structured facts that describe the user and their situation. This is what enables genuine personalization. In production, it's often stored as structured key-value pairs, entity graphs, or tagged fact records.
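As an illustration of that structured storage, a semantic fact might be stored as a tagged record like the one below. The `SemanticFact` fields are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SemanticFact:
    """One stable fact about a user, stored as a tagged record."""
    user_id: str
    key: str              # e.g. "industry", "team_size", "answer_style"
    value: str
    confidence: float     # 0.0-1.0, assigned by the extraction pipeline
    source: str           # where the fact came from (session id, rule name)
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

fact = SemanticFact(
    user_id="u_123",
    key="industry",
    value="healthcare compliance",
    confidence=0.9,
    source="session_2024_03_12",
)
```

Keeping `confidence` and `source` on every record pays off later, when the pipeline has to merge duplicates or resolve conflicts.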
Procedural memory
Procedural memory stores how to do things: "When this user asks for a summary, they want bullet points under 80 words." "When escalating a support ticket for this account, always CC the account manager first."
This is closer to learned behavior than factual recall. It makes agents feel trained to a specific user's operating style. In practice, it often lives in dynamic instructions prepended to each session, updated as new patterns emerge.
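One common implementation is to render the stored procedures into the system prompt at session start. A minimal sketch, where the function name and rule strings are hypothetical:

```python
def build_session_instructions(base_prompt: str, procedures: list[str]) -> str:
    """Prepend learned per-user procedures to the base system prompt.

    `procedures` are plain-language rules accumulated by the memory layer.
    """
    if not procedures:
        return base_prompt
    learned = "\n".join(f"- {rule}" for rule in procedures)
    return f"{base_prompt}\n\nLearned preferences for this user:\n{learned}"

prompt = build_session_instructions(
    "You are a helpful assistant.",
    [
        "Summaries: bullet points, under 80 words.",
        "When escalating a ticket, CC the account manager first.",
    ],
)
```

Because the rules are regenerated each session, updating a procedure is just a memory write, with no retraining involved.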
Most production systems implement some combination of episodic and semantic memory. Procedural memory is often underutilized but high-value. In-context memory is present in every agent by default, but it should not be confused with a real memory architecture.
How memory extraction works
Before anything can be remembered, it has to be recognized as worth remembering. Not every sentence in a conversation is a memory candidate. "Thanks" is not a fact worth storing. "I prefer to work in TypeScript and I'm allergic to YAML configs" absolutely is.
Extraction systems typically take one of several approaches, or combine them.
LLM-based extraction
After each turn or session, a secondary model pass reviews the conversation and identifies memory candidates. It extracts structured facts, preferences, events, and behavioral signals. High quality, but adds cost and latency. Running the extraction asynchronously, after the session ends, keeps it off the critical path, though the cost remains.
Rule-based extraction
Pattern matching on sentences containing "I prefer," "I always," "my deadline is," or structured data like names, dates, and numbers. Fast and cheap, but misses subtler signals and requires ongoing maintenance as conversation patterns evolve.
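A minimal version of this approach, using two illustrative trigger patterns (a real system maintains a much larger, evolving set):

```python
import re

# Illustrative trigger patterns; production rule sets are larger and curated.
PATTERNS = [
    re.compile(r"\bI (?:prefer|always|never) (.+?)(?:\.|$)", re.IGNORECASE),
    re.compile(r"\bmy deadline is (.+?)(?:\.|$)", re.IGNORECASE),
]

def extract_candidates(utterance: str) -> list[str]:
    """Return memory candidates matched by any trigger pattern."""
    hits = []
    for pattern in PATTERNS:
        for match in pattern.finditer(utterance):
            hits.append(match.group(0).rstrip("."))
    return hits

extract_candidates("I prefer to work in TypeScript. My deadline is Friday.")
# → ["I prefer to work in TypeScript", "My deadline is Friday"]
```

The cheapness is the point: this runs on every turn without a model call, and anything it misses can be left to the LLM pass.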
Agent self-reporting
The agent is instructed to write to memory as part of its operation, calling a memory write function when it encounters something worth storing. Direct control, but puts the burden of extraction on the agent's reasoning, which can be inconsistent.
Hybrid pipelines
The highest-quality production systems combine all three. Rule-based extraction catches obvious signals cheaply. An LLM pass handles nuanced cases. Agent self-reporting contributes explicit writes. Outputs are deduplicated, merged, and stored with confidence scores and source attribution.
A well-designed extraction pipeline also handles conflict resolution. If a user said last month that they're based in London and mentioned last week that they're moving to Berlin, the system needs to decide whether to update, append, or version the memory. Naive systems overwrite. Sophisticated ones track change history.
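A sketch of the versioning approach, where the latest value wins at retrieval time but the change history is preserved (the `upsert_fact` helper and store shape are illustrative):

```python
def upsert_fact(store: dict, key: str, value: str) -> None:
    """Version a fact instead of overwriting it.

    The latest value is what retrieval sees; prior values move to history.
    """
    entry = store.setdefault(key, {"current": None, "history": []})
    if entry["current"] is not None and entry["current"] != value:
        entry["history"].append(entry["current"])
    entry["current"] = value

facts: dict = {}
upsert_fact(facts, "location", "London")
upsert_fact(facts, "location", "Berlin")
# facts["location"]["current"] == "Berlin"; "London" is kept in history
```

A real pipeline would also timestamp each version, so the agent can reason about when a fact changed, not just that it did.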
How retrieval surfaces the right memory
Storing memories well is only half the problem. The harder half is retrieving the right ones at the right moment. Retrieval happens before the model generates a response. The memory layer takes the current query, searches its store, and returns a ranked set of relevant memories to inject into context.
Semantic (vector) retrieval
The query and stored memories are embedded as vectors. Retrieval returns memories whose embeddings are closest to the query in high-dimensional space. Works well for meaning-based matches even when exact words don't appear in the stored fact.
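A toy sketch of the mechanism, with tiny 3-dimensional vectors standing in for real model embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float], memories: list[tuple], top_k: int = 2) -> list[str]:
    """memories: (text, embedding) pairs; return top_k texts by similarity."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

memories = [
    ("works in healthcare compliance", [1.0, 0.0, 0.0]),
    ("prefers concise answers",        [0.0, 1.0, 0.0]),
    ("team has 14 people",             [0.0, 0.0, 1.0]),
]
retrieve([0.9, 0.1, 0.0], memories, top_k=1)  # → ["works in healthcare compliance"]
```

In production the brute-force sort is replaced by an approximate nearest-neighbor index, but the scoring idea is the same.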
Keyword (BM25) retrieval
Traditional term-matching retrieval. Precise for exact or near-exact matches: specific names, project identifiers, product names. Fails when the vocabulary between query and memory diverges.
Hybrid retrieval
Combines vector and keyword search, typically using Reciprocal Rank Fusion (RRF). This is the current best practice. Semantic search covers the vocabulary gap; keyword search provides precision. The union is more robust than either alone.
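RRF itself is simple: each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. A minimal sketch:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over ranked id lists from each retriever.

    score(doc) = sum over lists of 1 / (k + rank); k=60 is the commonly
    used smoothing constant.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results  = ["m1", "m2", "m3"]
keyword_results = ["m1", "m4", "m2"]
rrf_fuse([vector_results, keyword_results])  # "m1" ranks first: top of both lists
```

Because RRF only needs ranks, not comparable scores, it fuses retrievers with completely different scoring scales without any calibration.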
Recency and relevance weighting
Recent memories are often more relevant than older ones. A strong retrieval system weights both semantic similarity and temporal recency, so the most pertinent and current context rises to the top of every response.
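One simple way to blend the two signals is exponential recency decay. The weights and half-life below are illustrative choices, not a standard formula:

```python
def score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Blend semantic similarity with exponential recency decay.

    A memory loses half its recency weight every `half_life_days`.
    The 0.7/0.3 blend is an arbitrary starting point to tune per product.
    """
    recency = 0.5 ** (age_days / half_life_days)
    return 0.7 * similarity + 0.3 * recency

# An older, slightly better-matching memory can lose to a fresh one:
old = score(similarity=0.80, age_days=180)
new = score(similarity=0.75, age_days=2)
# new > old: the fresh memory outranks the marginally closer match
```

Tuning the half-life per memory type also helps: stable semantic facts should decay far more slowly than episodic events.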
Retrieval quality is the single biggest determinant of how good a memory system feels in practice. Teams that take memory seriously invest heavily in benchmarking retrieval quality, not just storage correctness.
Memory in production
Building memory in a demo is straightforward. Then you move to production, and the edge cases surface.
Latency budget
Memory retrieval happens in the critical path, before the model call. If retrieval adds 300ms, users feel it. Production memory systems target sub-100ms retrieval under load. This requires indexed vector stores, connection pooling, and caching for recently active users.
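A minimal per-user retrieval cache with a time-to-live, as a sketch of the caching idea (a production version would also bound its size and invalidate on memory writes):

```python
import time

class TTLCache:
    """Tiny per-user retrieval cache: recently active users skip the store."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._data: dict = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]  # expired: evict and treat as a miss
            return None
        return value

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())

cache = TTLCache(ttl_seconds=60.0)
cache.put(("u_123", "q1 invoices"), ["asked about Q1 invoices last session"])
```

Keying on (user_id, query) keeps cached results inside the per-user isolation boundary as well.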
Memory conflicts and updates
Users change. A fact that was true six months ago may be stale. Production systems need update and deduplication strategies. Systems that just append without deduplication balloon in size and degrade retrieval quality over time.
Per-user isolation
Every user's memories must be stored and retrieved in strict isolation. A memory that belongs to user A must never appear in user B's retrieval results. Mistakes here are not just bugs. They're data breaches.
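The safest pattern is to apply the user filter before any ranking happens, so a bug in ranking can never leak cross-user data. A sketch with a toy keyword matcher standing in for real retrieval:

```python
def retrieve_for_user(store: list[dict], user_id: str, query: str) -> list[dict]:
    """Scope retrieval to one user before any matching or ranking.

    Filtering first means downstream ranking bugs cannot surface
    another user's memories.
    """
    scoped = [m for m in store if m["user_id"] == user_id]
    return [m for m in scoped if query.lower() in m["text"].lower()]

store = [
    {"user_id": "a", "text": "prefers TypeScript"},
    {"user_id": "b", "text": "prefers TypeScript and Go"},
]
retrieve_for_user(store, "a", "typescript")  # only user a's memory is returned
```

Most vector databases support the same idea natively as a metadata pre-filter on the index query.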
Privacy and deletion
Users have the right to request deletion of their data. A production memory system needs a clean mechanism for purging all memories associated with a user. This is both a compliance requirement and a user trust requirement.
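A sketch of a purge path; the three stores here are plain dicts keyed by user id, while a real system must also reach vector indexes, caches, backups, and analytics copies:

```python
def purge_user(user_id: str, fact_store: dict, vector_index: dict, cache: dict) -> int:
    """Delete every trace of a user across all memory stores.

    Returns the number of stores a record was removed from, which is
    useful for audit logging of deletion requests.
    """
    removed = 0
    for store in (fact_store, vector_index, cache):
        if user_id in store:
            del store[user_id]
            removed += 1
    return removed
```

Treating deletion as a first-class, auditable operation from day one is far cheaper than retrofitting it under a compliance deadline.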
The agent that actually knows you
When all of this works well, something remarkable happens. The agent stops feeling like a tool and starts feeling like a colleague. It references things you mentioned weeks ago. It anticipates your preferences without being told. It gets better the longer you use it, not because anyone retrained it, but because its memory has accumulated a rich, accurate picture of who you are.
The support agent knows this customer has contacted support three times about the same issue and escalates proactively, rather than making them explain it again.
The productivity assistant knows your quarterly planning cycle and surfaces relevant context in the week before each review, without being asked.
The sales copilot knows this prospect mentioned budget constraints in March and avoids leading with price-heavy proposals in the follow-up.
The coding assistant knows you prefer functional patterns, never use ORMs, and always want error handling at the boundary rather than inline.
None of this requires a more powerful model. The intelligence was always there. What was missing was the context. Memory is the infrastructure that delivers that context reliably, at scale, across every session a user has with your product.
Give your agent a memory layer
RetainDB handles extraction, storage, and retrieval out of the box. Add persistent memory to your agent in minutes.