Why bigger context windows haven't killed memory products
Every time a model vendor extends the context window, someone announces that memory products are about to die. Then the next year happens, and they don't. If you're still asking this question every time a vendor adds tokens, you're optimizing the wrong axis. The case for a memory layer got stronger as contexts grew, not weaker, and the reasons are structural rather than version-bound.
The recurring claim
You've seen this take. A model vendor ships a release-notes line reading "now supports 1M-token context," and within a day or so someone you respect posts a version of "why would anyone need a memory product anymore? Just stuff the whole history into the prompt." It's intuitively compelling. A million tokens is several novels' worth of text, which is more than enough room for everything a user has ever said to your agent, plus your system prompt, plus tool definitions, plus a generous margin. If retrieval was a workaround for the 8K-and-32K era, the workaround is no longer needed. Q.E.D.
We've watched some version of this argument run roughly once a year for three years now, with each new context-window release as the supposed final nail. None of them landed. The memory market has grown alongside the window expansion the entire time, which is a strange outcome for two things that are supposedly substitutes.
The substitution argument is wrong in a structural way, not in a "we just need a slightly bigger window" way. Bigger contexts make per-query cost worse, they make latency worse, they make the lost-in-the-middle recall problem worse, and they leave the durability problem exactly where it was. Memory products and context windows aren't competing for the same job. The gap between those jobs widens every time the window grows.
Two of those four reasons matter more than the rest. Cost kills the business case before anything else gets a chance to. Durability is the one that wouldn't go away even if cost and latency both went to zero tomorrow. The rest of this post spends most of its time on those two; latency and recall get covered in passing.
Cost is the one that kills the business case
A 1M-token context is 1M input tokens of cost on every single query — not once at storage, every time the agent thinks. Mid-tier provider pricing of roughly $3 per million input tokens puts you at $3 per query before any output is generated. At frontier pricing closer to $15 per million, $15 per query. Even at the cheapest fast-tier rates we've seen, the floor lands somewhere between $0.25 and $1 per query just to re-read the user's history one more time.
To put real numbers on it: a long-running project agent that's accumulated 200K tokens of conversation history over a few months (meeting notes, code review threads, design decisions, customer feedback) is sitting comfortably inside current long-context limits and represents a mid-range workload for a real product. At $3 per million input tokens, every query that re-ships all 200K tokens costs about $0.60 in input alone before the model has written a single output token. A typical engineer using an agent like that fires 30 to 80 queries during a working day — call it 50 to round the math. That works out to about $30 per user per working day, or roughly $7,500 per user per year of working days, all of it spent on re-reading context. At enterprise scale, that line item buys you a junior engineer per ten users.
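If you want to sanity-check that arithmetic, it fits in a few lines. The figures below are the same illustrative assumptions as above, not measurements:

```python
# Cost of re-shipping the full history as input on every query.
# All figures are the illustrative assumptions from the text above.

PRICE_PER_M_INPUT_TOKENS = 3.00   # mid-tier provider pricing, USD
HISTORY_TOKENS = 200_000          # accumulated project history
QUERIES_PER_DAY = 50              # per engineer, per working day
WORKING_DAYS_PER_YEAR = 250

cost_per_query = HISTORY_TOKENS / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
cost_per_day = cost_per_query * QUERIES_PER_DAY
cost_per_year = cost_per_day * WORKING_DAYS_PER_YEAR

print(f"per query: ${cost_per_query:.2f}")    # $0.60
print(f"per day:   ${cost_per_day:.2f}")      # $30.00
print(f"per year:  ${cost_per_year:,.0f}")    # $7,500
```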
A memory product priced flat at $20 to $50 per user per month — which is roughly the going rate today — answers the same job for $240 to $600 per user per year, with storage, retrieval, indexing, and everything around them included. The gap is an order of magnitude, and it widens (not narrows) the longer the user uses the product, because context-as-memory scales linearly with history length per query while retrieval-as-memory scales sublinearly — the relevant subset doesn't grow anywhere near as fast as the full corpus does.
You can quibble with any of the numbers we picked. You can't fix the shape. Per-query context replay is a linear cost in history length multiplied by a linear cost in query volume. Retrieval removes the first of those linear factors outright and keeps the coefficient on the second small. Your bill stops scaling with the length of the user's relationship with your product, which is the only version of this that holds up over years rather than weeks.
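The same shape, written as two cost functions rather than prose. The 8K-token retrieved slice is an assumption picked purely for illustration:

```python
def context_replay_cost(history_tokens: int, queries: int, price_per_m: float = 3.00) -> float:
    # Full history re-sent on every query: linear in history, times linear in volume.
    return history_tokens / 1_000_000 * price_per_m * queries

def retrieval_cost(slice_tokens: int, queries: int, price_per_m: float = 3.00) -> float:
    # Only a retrieved slice is sent; the slice barely grows as the history grows.
    return slice_tokens / 1_000_000 * price_per_m * queries

# Double the history and the replay bill doubles; the retrieval bill stays flat.
print(f"{context_replay_cost(200_000, 50):.2f} {context_replay_cost(400_000, 50):.2f}")  # 30.00 60.00
print(f"{retrieval_cost(8_000, 50):.2f} {retrieval_cost(8_000, 50):.2f}")                # 1.20 1.20
```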
Latency and recall, briefly
Latency is the second axis and it's straightforward. Time-to-first-token scales with input length on every provider we've measured: 5 to 15 seconds at 200K tokens on the fastest providers, 30 to 90 seconds at 1M and frequently more. A memory query is a Postgres roundtrip with a vector index, a BM25 index, and possibly a graph lookup running in parallel, which lands in 50 to 200 milliseconds in most production setups. For a chatbot you check once an hour, the difference doesn't matter. For an agent in an IDE or an ops tool, it's the difference between something usable and something nobody opens twice.
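For concreteness, the lookup being compared against a long-context prefill is something like the sketch below. The store functions are stand-ins that return canned results; they aren't Engram's API or any particular database client:

```python
import asyncio

async def vector_search(query: str, k: int = 20) -> list[str]:
    # Stand-in for an ANN search over an embedding index (e.g. pgvector).
    return ["semantic match 1", "semantic match 2"]

async def bm25_search(query: str, k: int = 20) -> list[str]:
    # Stand-in for a lexical search over a full-text / BM25 index.
    return ["keyword match 1"]

async def graph_lookup(query: str, depth: int = 1) -> list[str]:
    # Stand-in for an optional entity / relation expansion.
    return ["related entity fact"]

async def retrieve(query: str) -> list[str]:
    # The three lookups run concurrently, so wall-clock latency is the slowest
    # single index roundtrip rather than the sum, which is how the 50-200 ms
    # figure stays realistic even with several indexes in play.
    vector, lexical, graph = await asyncio.gather(
        vector_search(query), bm25_search(query), graph_lookup(query)
    )
    # Deduplicate while preserving order; a real system reranks (RRF, cross-encoder).
    seen, merged = set(), []
    for item in vector + lexical + graph:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged

print(asyncio.run(retrieve("what did we decide about the billing migration?")))
```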
Recall is the third. Long-context models still exhibit the "lost in the middle" effect across families: facts placed at the start or end of a long prompt get recalled reliably, facts in the middle don't, and the gap widens as the window grows. So you've paid the full price to ship a million tokens and the model is still preferentially attending to the first and last few thousand. Retrieval sidesteps the problem by putting the most relevant 2K to 10K tokens into the working set deterministically, a prompt small enough that there's barely a middle to get lost in. A bigger window doesn't make the middle of the window better. It just gives you more middle.
Durability is the one that doesn't close
Context is per-query. Memory is per-tenant. That sentence is the whole argument, but it's worth unpacking, because durability is the reason the substitution argument doesn't get rescued even in the hypothetical world where contexts are free and recall is perfect.
The contents of a context window exist only for the duration of a single model call. When the user comes back tomorrow on a different device, you rebuild the context from scratch by pulling the right history out of your database, reassembling it in the right order, and pasting it back in. The same thing happens when your agent restarts, when you switch model providers, when a different agent in the org needs the same context. Each of those events is another paste. We've written about this elsewhere as the statelessness problem: large language models are stateless functions, every call is independent of every other, and any continuity the user perceives is the application developer's job to manufacture. A bigger window doesn't change that property; it just makes the packet you have to manufacture for each call bigger. The model is still amnesiac across calls. You're just dumping a longer payload into each one.
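A compressed way to see the same point in code: every call starts from nothing, and whatever continuity the user perceives is assembled by the application before the call and written back after it. The `memory_store` and `llm` objects below are placeholders, not a specific vendor's API:

```python
def answer(user_id: str, message: str, memory_store, llm) -> str:
    # 1. Rebuild context from durable storage. The model remembers nothing from
    #    the previous call, the previous session, or the previous device.
    relevant = memory_store.search(user_id=user_id, query=message, limit=20)

    # 2. Assemble the prompt for this one call.
    prompt = "\n".join(
        ["You are the user's project agent. Relevant history:"]
        + [m.text for m in relevant]
        + [f"User: {message}"]
    )

    # 3. The call itself is stateless; nothing about it persists on the model side.
    reply = llm.complete(prompt)

    # 4. Anything worth keeping across calls is written back by the application.
    memory_store.write(user_id=user_id, text=message, source="user")
    memory_store.write(user_id=user_id, text=reply, source="agent")
    return reply
```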
A memory product writes once and reads many times — across sessions, agent restarts, organizational boundaries, model swaps, whatever else the user's life throws at the system between calls. Persistence isn't the only thing it gives you, either. You also get structure on top: bucketing that keeps work context from bleeding into personal, recency-aware ranking so newer facts win when older ones are stale, deletion semantics that cascade through every derived index so a forgotten memory actually stays forgotten, audit trails for what the agent looked at and when. None of that is something a context window can do, because none of it is something a context window is for. The first regulator who shows up asking "why did your agent recommend that" is going to find "we stuffed everything into a 1M-token prompt and hoped" insufficient as an answer.
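To make "structure on top" a little more concrete, here is roughly the shape of what a memory layer tracks beyond raw text. Field names and the half-life figure are illustrative, not Engram's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MemoryRecord:
    id: str
    tenant_id: str          # memory is per-tenant, not per-query
    bucket: str             # e.g. "work" vs "personal", so contexts don't bleed
    text: str
    created_at: datetime
    deleted: bool = False   # deletion must cascade to every derived index
    read_log: list[str] = field(default_factory=list)   # audit trail: who read it, when

def rank(records: list[MemoryRecord], now: datetime, half_life_days: float = 30.0) -> list[MemoryRecord]:
    # Recency-aware ranking: newer facts win when older ones have gone stale.
    def score(r: MemoryRecord) -> float:
        age_days = (now - r.created_at).total_seconds() / 86_400
        return 0.5 ** (age_days / half_life_days)
    return sorted((r for r in records if not r.deleted), key=score, reverse=True)

def forget(record: MemoryRecord, derived_indexes: list) -> None:
    # A forgotten memory stays forgotten: remove it from every index derived from
    # it (vector index, BM25 index, graph edges, caches), not just the primary row.
    record.deleted = True
    for index in derived_indexes:
        index.remove(record.id)
```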
The durability gap is the one that doesn't close at any context length the vendors ship next. Extend the window to a billion tokens and the system still doesn't persist a thing across calls. Persistence is outside the model's job description, full stop, and you either build it yourself or you buy a memory layer that has.
What bigger contexts do change
We don't want to argue that nothing changed when contexts grew from 8K to 1M. The underrated thing it changed is that memory systems got more useful, not less. Bigger context windows mean the memory layer can confidently retrieve 30K tokens of supporting material per query instead of being squeezed into 4K. The retriever can afford to be inclusive: not just the single best chunk but a richer set of supporting facts, related entities, profile information, prior decisions. The bigger the window, the more headroom the memory product has to work with. Context length is a complement to memory, not a substitute. (There are smaller wins too: zero-retrieval pipelines become viable for tiny corpora, aggressive chunking is less urgent for documents. Neither is substitution.)
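One concrete way that headroom gets used: the retriever fills a token budget instead of returning a single best chunk. A sketch, with the 30K budget taken from the paragraph above and the scoring left abstract:

```python
def fill_context_budget(candidates: list[tuple[float, str]], budget_tokens: int = 30_000) -> list[str]:
    # `candidates` is (score, text) pairs from whatever hybrid retrieval produced.
    # With a 4K budget only the top hit or two fits; with 30K the same retriever
    # can also include supporting facts, profile information, and prior decisions.

    def estimate_tokens(text: str) -> int:
        # Rough heuristic (~4 characters per token); a real system would use
        # the target model's tokenizer.
        return max(1, len(text) // 4)

    selected, used = [], 0
    for score, text in sorted(candidates, reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected
```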
The historical pattern
Each of the major context-window expansions has been pitched as the one that ends the category. GPT-4 going from 8K to 32K. Claude moving to 100K, then to 200K. GPT-4 Turbo at 128K. Gemini at 1M. Every release came with at least one widely shared post calling time on RAG, memory, or both. The retrieval-based memory market grew through every one of them. That isn't coincidence, and it isn't stubbornness on the buyer side. It's that cost, latency, and recall all scale unfavorably with input length, so they degrade further as windows grow, and durability doesn't depend on context length at all. The "this release will end memory" prediction has been wrong every time because it's wrong about the shape of the gap.
The pattern isn't going to break with the next release either, because the next release will push input length further along the same axes that already scale unfavorably. The list of things that would actually have to change for "just use the context window" to be the right architecture for agents is its own research agenda: flip the cost curve, fix lost-in-the-middle from first principles, drive long-context time-to-first-token below the retrieval roundtrip, and add cross-call persistence to the model itself. Each of those is a frontier problem, none of them looks close to solved, and doing all of them at once is more than a generation of model releases away.
Where the "context replaces memory" argument is right
There's a real version of the context-replaces-memory argument and we don't want to handwave it. It's right in a specific case: a single-session interaction where the user pastes in everything they need at the start, asks their questions, and walks away. That's a chat. It's a perfectly legitimate product. Plenty of useful tools live in that mode. For that mode, the context window is enough. Memory adds infrastructure you don't need.
The argument breaks the moment "single-session" stops applying. Agents, the actual things we're all building in 2026, have state across sessions, across teams, across days, across model providers, across organizational boundaries. The user comes back tomorrow. A different user on the same team asks a related question. The IDE restarts. The model gets swapped from one vendor to another for a pricing reason. Any of those events, and the context window resets to empty. The memory layer doesn't. The state lives on the application side because the model can't hold it.
So if the question is "do I need memory for my chatbot," and the chatbot really is single-session, the honest answer is often no. If the question is "do I need memory for my agent," the honest answer is essentially always yes, and the bigger the context window your model gives you, the more useful memory becomes.
Two different problems
Context windows answer "how much can I show the model in one shot?" Memory products answer "what should I show the model in one shot?" Those aren't the same question, and they don't fight each other; they compose.
The "what" question gets harder as the "how much" answer gets bigger. The larger the addressable space, the more important it becomes to pick the right subset, because there are more wrong subsets you could pick by accident. Small windows kept everyone honest about chunking and retrieval. Large windows tempt people to skip retrieval and pay for it later in cost and durability. We've talked to teams that made exactly this trade and came back. The "this release kills memory" framing has been wrong five releases running. The quieter version, that each new release makes a good memory layer more valuable, not less, has been right the same number of times. We're betting it stays right through the next one too.
Further reading
Closely related
- What is AI agent memory? A vendor-neutral primer on the category: statelessness, hybrid retrieval, and why a vector DB alone is not enough.
- RAG vs agent memory vs fine-tuning. Three layers that look similar and answer different questions. How to pick the right one and avoid stuffing the wrong one.
- The memory-aware RAG pipeline that knows when not to retrieve. Three categories of agent turn, a gating signal, and a soft-fail composer for the times the gate gets it wrong.
Engram
- Engram on LongMemEval-S: 91.6%. Full benchmark methodology and what didn't work.
- Engram docs. HTTP API, MCP setup for each client, SDK examples.
- Start with Engram. Free tier, BYOK, MCP-native.