Guide

What is AI agent memory?

A practical, vendor-neutral guide to the category: what agent memory is, why stateless LLMs need it, which retrieval approaches exist, how to evaluate them, and how Engram fits in.

Published December 2, 2025 · By Jacob Davis and Ben Meyerson

The statelessness problem

By default, every call into a large language model is independent of every other one. The model has no record of yesterday's conversation, the project you explained last week, or the style guide you settled on three chats ago. Each request is a function of its input and nothing else. This is true of GPT, Claude, Gemini, the open-weights models you'd run yourself, and every wrapper on top.

From the user's side it looks like amnesia. The project you explained on Monday is gone on Tuesday. The agent you spent half an hour briefing on your codebase needs the same briefing the next morning. You compensate by pasting context back in at the start of every conversation, then by keeping a "context I have to paste" doc somewhere, then by quietly reaching for the agent less often on anything that depends on continuity. We hear this pattern from teams every week. It is the single most common reason an agent goes from "felt magical in the demo" to "I stopped using it after two weeks."

The first thing most people try is to stuff the whole history into the context window. It fails for the same reasons every time. Tokens cost money on every turn, long prompts are slow, and even at a million-token window the right retrieved subset beats the whole history on both cost and quality. Published research on "lost in the middle" effects (Liu et al., 2023) shows recall accuracy dropping when the relevant span sits in the middle of a long prompt rather than near its start or end. We've shipped versions of the everything-in-context approach internally and watched them lose to a smaller retrieved set every time we measured.
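A back-of-the-envelope sketch of the cost half of that argument. Every number below is an assumption chosen for illustration (average turn size, retrieved-set budget), not a measurement from any real deployment.

```python
# Illustrative arithmetic only: both constants are assumed values.
TOKENS_PER_TURN = 500      # assumed average size of one user+assistant exchange
RETRIEVED_TOKENS = 2_000   # assumed fixed budget for retrieved memories per turn


def full_history_input_tokens(turns: int) -> int:
    """Total input tokens when each turn resends the whole conversation so far."""
    # Turn n carries roughly n exchanges' worth of tokens.
    return sum(n * TOKENS_PER_TURN for n in range(1, turns + 1))


def retrieved_input_tokens(turns: int) -> int:
    """Total input tokens when each turn sends a fixed retrieved set plus the new turn."""
    return turns * (RETRIEVED_TOKENS + TOKENS_PER_TURN)


if __name__ == "__main__":
    for turns in (10, 100, 1000):
        print(turns, full_history_input_tokens(turns), retrieved_input_tokens(turns))
```

Under these assumptions the full-history total grows quadratically with the number of turns while the retrieved-set total grows linearly, so the gap widens the longer the relationship with the agent lasts.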

What "memory" actually means for agents

Agent memory is a dedicated retrieval system that persists facts durably and pulls only the relevant ones back at query time. A practical interface looks like: a store_memory(content) call that writes durably, a query_memory(question) call that returns the relevant subset, and an explanation alongside each result that tells you why it came back. Most memory products implement the first two well. The third is where they diverge.
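To make the three-part interface concrete, here is a toy in-memory sketch. The names store_memory and query_memory come from the description above; the Recall type, the id scheme, and the keyword-overlap matching are hypothetical stand-ins, and nothing here reflects how any real product implements retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class Recall:
    memory_id: str
    content: str
    explanation: str  # why this memory came back


@dataclass
class ToyMemory:
    """In-memory stand-in; a real system would write to durable storage."""
    _items: dict[str, str] = field(default_factory=dict)

    def store_memory(self, content: str) -> str:
        memory_id = f"m{len(self._items) + 1}"  # hypothetical id scheme
        self._items[memory_id] = content
        return memory_id

    def query_memory(self, question: str) -> list[Recall]:
        q_tokens = set(question.lower().split())
        results = []
        for memory_id, content in self._items.items():
            shared = q_tokens & set(content.lower().split())
            if shared:
                reason = f"keyword overlap on: {', '.join(sorted(shared))}"
                results.append(Recall(memory_id, content, reason))
        return results
```

The point of the sketch is the return type: every result carries its own reason, so a caller never has to infer why something was retrieved.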

The explanation is the part that decides whether you can run the system in production. When the agent answers wrong (and it will), you need to know whether the right memory was in the retrieved set and the agent misread it, or whether the right memory never came back at all. Those two failure modes have different fixes. A bare similarity score does not separate them. A trace that reads "memory M came back because BM25 matched on the token invoice and the graph linked it to the customer entity in the question" does. Engram returns one of those traces on every query: the /v1/query response carries top-level memories, graph_facts, and entity_matches arrays, and each entry points back at the source memory by id. Without that, every wrong answer is a coin flip between two possible bugs and the team debugging it ends up guessing.
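A small sketch of how a trace separates the two failure modes. The three array names are the ones mentioned above; the "memory_id" key used for the pointer back to the source memory is an assumption, since the exact field name is not given here.

```python
def triage(response: dict, expected_memory_id: str) -> str:
    """Classify a wrong answer by whether the right memory was retrieved at all."""
    retrieved_ids = set()
    for lane in ("memories", "graph_facts", "entity_matches"):
        for entry in response.get(lane, []):
            # "memory_id" is an assumed key name for the source-memory pointer.
            retrieved_ids.add(entry.get("memory_id"))
    if expected_memory_id in retrieved_ids:
        return "retrieved-but-misread"  # look at the prompt or the answer step
    return "never-retrieved"            # look at indexing or retrieval
```

Given a known-good memory id for a failing question, one call tells you which half of the system to open up.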

Why a vector database is not enough

A lot of the early "agent memory" tooling was a vector database with a thin wrapper on top. Vectors do paraphrase matching well. "What email client does Alice use" and "tell me about Alice's mail setup" land near each other in embedding space, which is the case keyword search was always bad at. Where vector-only retrieval falls over is on two specific patterns you can't reach from cosine similarity alone.

The first is exact recall. The user says "my email is a@b.com" in session one and asks "what is my email" in session twelve. The right answer is a span buried in some old message. BM25 finds it in milliseconds because the literal string matches. A vector index returns whichever message scored highest on semantic overlap with the word "email," which is rarely the one containing the actual address. The same trap catches identifiers, file paths, API keys, dates, names: anywhere the user expects verbatim recall and the system returns a thematically related miss instead.
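A minimal BM25 scorer shows why literal tokens win this case. This is a textbook-style sketch, not Engram's implementation: the whitespace tokenizer, the k1 and b defaults, and the sample memories are all assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Whitespace split keeps strings like "a@b.com" as a single token.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    memories = [
        "my email is a@b.com",
        "I moved the email archive to cold storage",
        "Alice prefers a dark editor theme",
    ]
    print(bm25_scores("what is my email", memories))
```

A memory that shares no token with the query scores exactly zero, which is the property that makes lexical search dependable for identifiers, paths, and addresses.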

The second is multi-hop. A question like "who on Alice's team is working on the payments migration" doesn't sit in any single document. Answering it requires traversing relationships: Alice to team, team to members, members to projects. A knowledge graph handles this natively by walking the edges. Embeddings can't, because there's no single span that contains the answer. Negation flips the meaning of a sentence while leaving its embedding nearly unchanged, which is its own quiet failure mode. But the two patterns that publicly sink vector-only systems on long-horizon benchmarks like LongMemEval are exact recall and multi-hop.
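The three hops in that question can be sketched as a walk over subject-relation-object triples. The triples, relation names, and people below are invented for illustration; a real graph would be extracted from stored memories.

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object).
TRIPLES = [
    ("alice", "leads", "team-payments"),
    ("bob", "member_of", "team-payments"),
    ("carol", "member_of", "team-payments"),
    ("bob", "works_on", "payments-migration"),
    ("carol", "works_on", "invoice-redesign"),
]


def build_index(triples):
    """Index triples for forward (subject -> objects) and reverse lookups."""
    by_subject = defaultdict(list)
    by_object = defaultdict(list)
    for s, r, o in triples:
        by_subject[(s, r)].append(o)
        by_object[(r, o)].append(s)
    return by_subject, by_object


def team_members_on_project(lead: str, project: str, triples=TRIPLES) -> list[str]:
    by_subject, by_object = build_index(triples)
    found = []
    for team in by_subject[(lead, "leads")]:                 # hop 1: person -> team
        for member in by_object[("member_of", team)]:        # hop 2: team -> members
            if project in by_subject[(member, "works_on")]:  # hop 3: member -> projects
                found.append(member)
    return found
```

No single triple contains the answer; it only appears once the three lookups are chained, which is the part a nearest-neighbour search has no way to express.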

The architecture that holds up is hybrid. BM25 for exact tokens, vectors for paraphrases, a graph for relations, fused into a single ranking. Engram runs five retrieval sources in parallel at query time, fuses them with reciprocal-rank fusion, and passes the candidate set through a final reranker before returning the top results. On LongMemEval-S, the public long-horizon benchmark, this stack scores 458 of 500 (91.6%) with a canonical user-profile pass on top, and the v44 composer prompt we used is published under MIT in our benchmarks repo so the number is reproducible. Strip away any one of the retrieval lanes and the score drops by several points on the slices that depend on that lane.
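Reciprocal-rank fusion itself is only a few lines. The sketch below uses the standard formula, where each list contributes 1 / (k + rank) to an item's score. The constant k = 60 is the value commonly used in the literature and is an assumption here, as are the lane names and ids in the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    bm25_lane = ["m1", "m2", "m3"]
    vector_lane = ["m2", "m4", "m1"]
    graph_lane = ["m2", "m1"]
    print(reciprocal_rank_fusion([bm25_lane, vector_lane, graph_lane]))
```

Because the formula uses ranks rather than raw scores, lanes with incomparable scoring scales (BM25 scores, cosine similarities, graph hits) can be combined without normalising them first.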

Cascade delete is the part nobody talks about

The usual durability checklist in memory-product posts is three items: durable storage, deletion, scoping. The interesting one is the middle, and it's the one most systems get quietly wrong. When a user asks you to forget something, that memory is not sitting in one place. It's in the BM25 inverted index, in the vector store, and in the knowledge graph as several extracted triples that may or may not be shared with other memories. Delete the row in the primary table and miss any of the derived indexes and the memory comes back as a ghost recall a few weeks later, usually in front of the same user who asked you to forget it.

Engram's deletion path cascades through every index by default, including dereferencing graph nodes that go orphaned when their last referring memory disappears. Every create, update, and delete is also audited. Tenants can view the full mutation log under /history in the dashboard, and any mutation is rollback-recoverable for 90 days, so an over-zealous agent or a botched delete is not a permanent loss. Engram does not automatically decay or expire memories. What gets stored stays stored until the agent or user explicitly deletes it.
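A toy version of a cascading delete, to show what "every index" means in practice. The four dictionaries stand in for the primary table, the BM25 inverted index, the vector store, and the graph; the class and field names are hypothetical and this is not Engram's code. Audit logging and rollback are left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    primary: dict = field(default_factory=dict)    # memory_id -> text
    inverted: dict = field(default_factory=dict)   # token -> set of memory_ids
    vectors: dict = field(default_factory=dict)    # memory_id -> embedding
    node_refs: dict = field(default_factory=dict)  # graph node -> set of memory_ids

    def add(self, memory_id, text, embedding, nodes):
        self.primary[memory_id] = text
        self.vectors[memory_id] = embedding
        for token in text.lower().split():
            self.inverted.setdefault(token, set()).add(memory_id)
        for node in nodes:
            self.node_refs.setdefault(node, set()).add(memory_id)

    def delete(self, memory_id):
        """Remove the memory from the primary store and every derived index."""
        self.primary.pop(memory_id, None)
        self.vectors.pop(memory_id, None)
        for index in (self.inverted, self.node_refs):
            for key in list(index):
                index[key].discard(memory_id)
                if not index[key]:
                    del index[key]  # drop empty postings and orphaned graph nodes
```

A graph node shared with another memory survives the delete; a node whose last referring memory is gone is removed with it. Skipping either loop in delete is exactly how a ghost recall gets created.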

The other items on the checklist (durable storage across clients, per-bucket scoping) matter, but most products handle them correctly. Cascade delete is the one where systems silently fail, and where the failure surfaces on a bad day rather than a quiet one.

Why MCP matters

The Model Context Protocol is an open standard, originally from Anthropic, for connecting agents to tools and data. An MCP server exposes named tools; any compatible client can call those tools without bespoke integration code on either side. As of late 2025 the compatible-client list includes Claude Code, Cursor, Windsurf, OpenCode, OpenClaw, Goose, Cline, Continue, LibreChat, Codex, and ChatGPT (via custom connectors on supported plans), with new clients landing roughly every month.

Before MCP, adding memory to an agent meant writing code against someone's SDK and shipping it. With MCP, it's a config-file entry. That single shift is why Engram ships as a hosted MCP endpoint at mcp.lumetra.io with both OAuth and bearer-token auth, rather than as an SDK with bindings for every framework. The HTTP API still exists for cases where the agent host is not MCP-aware, but the MCP path is the one we expect most users to take.
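For a sense of scale, this is roughly what a config-file entry looks like, generated from Python so it can be printed and pasted. Treat all of it as an assumption: the "mcpServers", "url", and "headers" keys follow a layout some MCP clients use for remote servers, key names differ between clients, and the server label, full endpoint URL, and token placeholder are hypothetical. The client's own documentation and the Engram docs are the authority.

```python
import json

# Assumed layout; the exact keys and file location depend on the MCP client.
config = {
    "mcpServers": {
        "engram": {                                  # hypothetical server label
            "url": "https://mcp.lumetra.io",         # scheme and path are assumptions
            "headers": {"Authorization": "Bearer <YOUR_TOKEN>"},  # bearer-token auth
        }
    }
}

print(json.dumps(config, indent=2))
```

The whole integration is one JSON object; there is no SDK import and nothing to redeploy on the agent side.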

How to evaluate a memory system

Three things matter when you're picking a memory layer to bet on. The first is recall quality on long-horizon benchmarks. LongMemEval and LoCoMo are reasonable public yardsticks, and a benchmark built from your own task distribution is better if you can spare the engineering time to assemble it. The public-benchmark number is what tells you whether the architecture works at all; your own benchmark is what tells you whether it works for the queries your users actually run.
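If you do build your own benchmark, the core metric is small. The sketch below computes recall@k over cases you label yourself; the case format (a ranked list of retrieved ids paired with the set of ids that should have come back) is an assumption about how you might store them.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant memory ids that appear in the top k results."""
    if not relevant:
        raise ValueError("each benchmark case needs at least one relevant id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(cases: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) benchmark cases."""
    if not cases:
        raise ValueError("need at least one benchmark case")
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in cases) / len(cases)
```

A few dozen hand-labelled questions drawn from real user sessions is usually enough to see whether a candidate system is in the right range for your workload.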

The second is latency at p95. Memory adds a network hop to every turn, and once that hop sits above a few hundred milliseconds the agent starts feeling laggy in ways users will blame on the model. Engram's hosted endpoint sits around 200ms p95 on a warm query path; the synthesis step (an LLM call on the BYOK provider) is the part most users see as the dominant cost, and it's measurable independently.
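Measuring this yourself takes a timer and a percentile. The sketch below uses the nearest-rank definition of p95 and a generic wrapper for timing any callable; the function names are hypothetical, and timing the retrieval call and the synthesis call separately is what lets you attribute the latency.

```python
import time


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def timed_ms(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0
```

Collect samples on a warm path and a cold path separately; mixing them hides the number users actually feel.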

The third is explainability. It never shows up on a marketing page and it almost entirely determines whether you can debug a misbehaving agent at 2am with a customer on the line. Ask any vendor for a sample retrieval response. If the answer is a list of memories with similarity scores and nothing else, you will spend your debug time guessing. If the answer includes which lane fired (BM25, vector, graph), what matched, and what the reranker did to it, you have a debug log.

The remaining procurement-checklist items (pricing shape, durability guarantees, data-handling posture) are real, but they're the kind of thing you check once during evaluation rather than the kind that decides whether the product works week to week.

How Engram fits in

Engram is Lumetra's hosted memory service for agents. It's MCP-native, runs the hybrid retrieval stack described above, and returns an explanation trace with every recall. Storage is per-bucket so projects, users, and teams stay scoped; deletion cascades through every index; mutations are audited and rollback-recoverable for 90 days. Inference is BYOK: extraction at write time and the composer call at read time both go through your own provider account, and a tenant without a configured provider key gets HTTP 412 on the first store or query rather than a silent fallback to a Lumetra-owned key. The BYOK enforcement lives in src/hybrid/byok_config.py in the open-source repo if you want to read it.
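On the client side, the 412 is worth handling explicitly so a missing provider key shows up as a setup problem rather than a generic failure. A minimal sketch, assuming you already have the HTTP status code in hand; the exception and function names are hypothetical and not part of any Engram client.

```python
class MissingProviderKey(RuntimeError):
    """Raised when the service reports that no BYOK provider key is configured."""


def check_status(status_code: int) -> None:
    """Raise a specific error for the BYOK case, a generic one for other failures."""
    # 412 is the status described above for a tenant with no provider key.
    if status_code == 412:
        raise MissingProviderKey("configure a provider key before storing or querying")
    if status_code >= 400:
        raise RuntimeError(f"request failed with HTTP {status_code}")
```

Surfacing this as its own error type lets an agent host prompt the user to finish setup instead of retrying a request that cannot succeed.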

The Free tier covers 10K stored memories and 50K retrievals per month, no credit card. Start with Engram, or read the docs for client configs and the HTTP API.