Guide
RAG vs agent memory vs fine-tuning: three things that look the same and aren’t
They all promise to give your model context it doesn’t have. They solve different problems, at different costs, with different failure modes. This is a practical guide to telling them apart and picking the right one for the job.
Where the confusion comes from
Spend an afternoon searching for how to make your LLM "remember more" and you'll come back with three terms used roughly interchangeably: retrieval-augmented generation, agent memory, and fine-tuning. Most of the writing about them is loose, and a lot of it switches terms mid-paragraph as if the three were dialects of the same idea.
They aren't. All three do extend what the model can act on beyond what fits in one prompt, but that's where the resemblance ends. Underneath, they live at different layers, they spend your money in different shapes, and when they fail they fail in opposite ways. The cost of picking wrong is usually paid six months later, when you have to rebuild while customers are already using the thing.
Fine-tuning is the most-named and least-often-appropriate of the three, so we'll dispatch it quickly and move on. The interesting confusion is between RAG and memory. From a distance they're indistinguishable: text gets retrieved at query time, text gets put in the prompt, the model gets "smarter" about something it didn't know on its own. That's where most of the expensive mistakes happen. We've watched teams pick RAG for a problem that was really memory, scale it up, and discover at the six-month mark that no amount of better retrieval recovers facts that were never written down anywhere. We've watched the reverse too: a team encoding their internal API docs into memory, one atomic fact at a time, because the memory product was on the homepage that week.
Fine-tuning
Fine-tuning is training the model on your data. You take a base model someone else trained, you assemble a dataset of input/output pairs that demonstrate the behavior you want, you run additional training passes that adjust the weights, and out comes a new model with the patterns from your dataset baked in.
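To make "dataset of input/output pairs" concrete, here is a minimal sketch of what that training file tends to look like. The exact schema varies by provider; chat-style JSONL is one common shape, and every string below is invented. Notice that the pairs demonstrate a tone and a format, not facts.

```python
import json

# Invented examples that demonstrate a tone and an answer format.
# Fine-tuning data teaches behavior; it is a poor place to store facts.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a terse, friendly support agent."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Settings > Security > Reset password. Takes about a minute."},
    ]},
    {"messages": [
        {"role": "system", "content": "You are a terse, friendly support agent."},
        {"role": "user", "content": "Can I export my data?"},
        {"role": "assistant", "content": "Yes: Settings > Account > Export. You'll get a download link by email."},
    ]},
]

# One JSON object per line is the usual input format for a fine-tuning job.
with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```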
It’s good for changing the model’s behavior: tone of voice, output format, domain vocabulary it doesn’t use naturally, specialized tasks that require a consistent style across thousands of calls. It’s bad for storing facts. Updates require retraining. Costs are dominated by data curation, not GPU time. You can’t inspect a weight and ask why it produced an answer, and you can’t surgically remove a single fact if it turns out to be wrong or if a user asks to be forgotten. You retrain from a corrected dataset, or you live with it.
The cleanest framing we've found: fine-tuning teaches the model how to act, not what is true today. If you catch yourself fine-tuning on facts that change (prices, policies, who reports to whom), the tool is wrong for the job. In production agents, fine-tuning mostly makes a cameo as a small efficiency play late in the lifecycle, to shrink prompt size for a high-volume feature. As the primary mechanism for giving the agent context, it almost never shows up, and that's the last we'll say about it for a while.
RAG and memory: the real confusion
Both retrieve text at query time and put it in the prompt; both extend what the model can answer. What differs is the source of the text, and that difference cascades into almost everything else about how the two systems are shaped.
RAG
The corpus on the RAG side is something somebody wrote. Knowledge-base articles, product docs, internal wikis, policy PDFs, a snapshot of a codebase, a research library — whatever the source, an editorial process produced it. Someone decided it should exist, someone wrote it, someone keeps it current. The agent doesn't change that corpus; it reads from it.
The shape of problem RAG handles well is one where most questions a user might ask are answerable from inside that document set, "relevant" is a thing you can describe well enough for a retriever to find, and the corpus moves slowly enough that you can keep the index honest. Customer support over a knowledge base. Internal Q&A across a wiki. Legal research against a corpus of opinions. The quiet assumption underneath all three is that whatever update pipeline you have can keep up with how often the documents change — and that's the part most teams under-build until the agent starts confidently citing last quarter's pricing.
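Mechanically, the RAG side is a small loop: score chunks of the written corpus against the query, keep the top few, and hand only those to the model. The sketch below uses a toy word-overlap scorer where a real retriever would use embedding similarity and/or BM25, and the documents are invented.

```python
import re

# Toy RAG retrieval: score corpus chunks against the query, keep the top few,
# and paste only those into the prompt.
corpus = {
    "refunds.md": "The refund window is 30 days from the date of purchase on annual plans.",
    "sso.md": "SSO is supported on the Enterprise tier via SAML 2.0.",
    "limits.md": "Each workspace is limited to 500 active projects.",
}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, chunk: str) -> float:
    # Stand-in for embedding similarity / BM25: plain word overlap.
    q, c = tokens(query), tokens(chunk)
    return len(q & c) / (len(q) or 1)

def retrieve(query: str, k: int = 2) -> list[str]:
    ranked = sorted(corpus.values(), key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:k]

query = "What is the refund window?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # the model sees this subset, never the whole corpus
```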
What RAG can't do is answer questions whose facts arose from the conversation itself. The user told the agent something three sessions ago and is asking about it now. No amount of better chunking gets you there, because the fact was never in the corpus and was never going to be.
Agent memory
Memory's corpus is one the agent built itself, out of the conversations it had. Things the user said, decisions that got made over time, preferences that surfaced, state from in-progress work. The corpus is alive — it grows when the agent is used, shrinks when memories get deleted, and the individual items inside it can get superseded by newer versions, get demoted by recency-aware ranking, or cascade-delete out when a user asks to be forgotten.
The job memory is built for is continuity. The user mentions in March that they prefer trains to flights for trips under six hours. In May they move to Berlin. In July they ask the agent how to get to Paris, and the agent answers using both facts without the user having to re-paste either of them. That core case generalizes outward into all the things that depend on the agent picking up where it left off: project conventions the team agreed on last sprint, design decisions that shouldn't have to be re-litigated, the state of an ongoing piece of work, the running texture of a relationship between an agent and a user who's been using it for months.
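Here is a sketch of that continuity with a plain Python list standing in for a real memory store. The `remember` and `recall` names, the word-overlap scoring, and the facts are all invented; the point is only that facts extracted in earlier sessions are scoped to the user, stamped with when they were learned, and retrieved months later without being restated.

```python
from datetime import datetime, timezone

memories: list[dict] = []  # stand-in for a real memory store

def remember(user_id: str, fact: str) -> None:
    memories.append({
        "user_id": user_id,
        "fact": fact,
        "created_at": datetime.now(timezone.utc),
    })

def recall(user_id: str, query: str, k: int = 3) -> list[str]:
    # Toy relevance via word overlap; a real store would combine semantic
    # similarity, keyword search, and recency-aware ranking.
    q = set(query.lower().split())
    candidates = [m for m in memories if m["user_id"] == user_id]
    candidates.sort(key=lambda m: len(q & set(m["fact"].lower().split())), reverse=True)
    return [m["fact"] for m in candidates[:k]]

# March: a preference surfaces. May: a move. Neither exists in any document.
remember("u1", "Prefers trains to flights for trips under six hours")
remember("u1", "Moved to Berlin in May")

# July: both facts come back without the user repeating them.
print(recall("u1", "How should I get to Paris next month?"))
```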
What memory is bad at is everything RAG is good at. A 40-page policy document doesn't belong in memory. A style preference does. A whole knowledge base doesn't belong in memory. A list of things this particular customer has historically complained about does. When you find yourself stuffing whole documents into the memory layer, the diagnosis isn't "use bigger memories" — it's that the wrong layer is holding the wrong shape of information.
Why the two get confused
From the outside, the two systems are indistinguishable. Both produce a list of text snippets that get pasted into a prompt. Vendor diagrams look the same, code paths look the same, and there's a vector database lurking in both pictures.
The separator sits upstream of retrieval. A RAG corpus was written, on purpose, by people. A memory corpus was extracted from conversations the agent had with users. That single asymmetry decides almost everything about the operational shape underneath. On the RAG side, you end up needing editorial workflow, document versioning, and a re-indexing pipeline that runs on whatever cadence your documents change at. On the memory side, you need extraction, deduplication across phrasings (the canonical example for us is that "my college roommate's wedding" and "Emily's wedding in the city" turn out to be the same event), supersession when a fact changes, recency-aware ranking so older statements get demoted in favor of newer ones, and per-user or per-project scoping so one user's memories don't bleed into another's. The retrieval surface looks identical from the outside. The substrate it sits on isn't.
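One of those pieces, recency-aware ranking, fits in a few lines. The exponential decay and the 90-day half-life below are illustrative choices rather than a recommendation; the point is that age participates in the score, so a newer statement outranks an older one that matches equally well.

```python
from datetime import datetime, timedelta, timezone

HALF_LIFE_DAYS = 90.0  # illustrative: a memory's weight halves every ~3 months

def recency_weight(created_at: datetime, now: datetime) -> float:
    age_days = (now - created_at).total_seconds() / 86_400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def ranked_score(similarity: float, created_at: datetime, now: datetime) -> float:
    # Similarity comes from whatever retriever you run (vectors, BM25, both);
    # the recency term demotes older statements in favor of newer ones.
    return similarity * recency_weight(created_at, now)

now = datetime.now(timezone.utc)
# Two equally similar memories about where the user lives: the newer one wins.
print(ranked_score(0.8, now - timedelta(days=120), now))  # older fact, demoted
print(ranked_score(0.8, now - timedelta(days=7), now))    # newer fact, ranks first
```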
The one-question test we use to sort the ambiguous cases is this: could the relevant fact have existed before the user ever talked to the agent? If yes, it belongs in RAG. If it could only have come from the conversation, it belongs in memory. Your refund policy being 30 days is RAG. This customer being on their third refund this quarter is memory. Python 3.12 deprecating datetime.utcnow is RAG. Your team agreeing last sprint to standardize on datetime.now(UTC) in new code is memory.
The test holds up on most of the ambiguous-looking ones too. "The agent should know our internal API" — that's RAG, because the API exists whether or not anyone talks to the agent about it. "The agent should know which endpoints this developer has been working on" — that's memory, because the fact only exists because the developer was using the agent. Both questions might land in the same prompt to answer a single question for the user, but they're coming from different layers and they fail in different ways. RAG breaks when the API docs get stale. Memory breaks when the agent forgets last week's decisions. When something is wrong, the first useful question is which of those two it was, and the test above is what lets you answer it cleanly.
The two layers also fail in opposite directions, which matters for how you instrument them. RAG's worst failure is loud: a confident citation of stale or wrong material from a corpus that didn't get re-indexed. Memory's worst failure is the silent kind, where the agent forgets something the user said and answers as if the conversation had never happened. The first is something a freshness check on your documents can catch before users do; the second only surfaces in explainability traces, after the fact, once you go look at what was recalled and what was filtered out. Either failure is fixable, but you can't reuse the same observability for both, which is the part most teams figure out a quarter too late.
If you want the longer version of how memory works under the hood (store/query/explain, why a vector database alone isn’t enough, the role of BM25 and knowledge graphs), we wrote that up separately in What is AI agent memory?
Things that look like one of the three and aren’t
Two patterns get pitched as substitutes often enough that they’re worth calling out by name.
The context window
The most common non-solution to this whole problem is just putting the entire history in the prompt. Million-token context windows make it tempting: stuff every previous conversation, every relevant document, the user's entire history into one call, let the model figure out what to look at. You can. It doesn't scale, but you can.
Long contexts are expensive per token, slow to process, and have measurable quality cliffs once they get full. The "lost in the middle" effect is well-documented at this point — the model attends reliably to material at the start and end of a long prompt and unreliably to whatever's in between. Even when the right answer is technically in there, you're paying full per-token cost on every turn for material the model is mostly ignoring. That isn't memory or RAG or fine-tuning; it's an expensive failure mode at scale.
The move that actually works is to retrieve a small relevant subset and put that in the prompt, which is what RAG and memory do for their respective corpora. The big context window is a tool for handling the retrieved subset comfortably, not a substitute for the act of retrieving in the first place.
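A sketch of that discipline: take snippets in relevance order until a token budget runs out, instead of concatenating everything. The budget and the four-characters-per-token estimate are rough placeholders; a real system would use the model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude estimate (~4 characters per token); use the model's tokenizer in practice.
    return max(1, len(text) // 4)

def select_context(ranked_snippets: list[str], budget_tokens: int = 2_000) -> list[str]:
    """Keep snippets in relevance order until the budget is spent."""
    chosen, used = [], 0
    for snippet in ranked_snippets:
        cost = estimate_tokens(snippet)
        if used + cost > budget_tokens:
            break
        chosen.append(snippet)
        used += cost
    return chosen
```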
The model vendor’s built-in “memory”
Most major LLM vendors now ship some version of built-in memory — usually a thin layer over a retrieval system that stores things you tell the model and pulls them back in when your current prompt looks related. For casual chat use it's genuinely useful. For developers building agents it tends to come up short, and the specific gaps that bite vary by vendor: no bucketing to separate work from personal or project A from project B, no programmatic access for inspecting or editing memories from code, no explainability when something surfaced and you can't tell why, coarse-grained deletion that can clear everything but can't surgically remove one fact across every derived index, no scoping for multiple agents or clients sharing the same state.
None of that makes the built-in memory bad. It makes it consumer-grade. If you're shipping an agent product, the layer you eventually need is the developer-grade version — one you can introspect, scope, audit, and wire into your own infrastructure.
Most production agents use two of them, sometimes all three
"RAG or memory or fine-tuning" is a framing borrowed from marketing pages and not from any system we've actually shipped. Real systems combine them, because the three live at different layers and answer different questions.
A reasonable production stack looks like: the base model handles language and reasoning, a long system prompt (sometimes with a light fine-tune on top) handles tone and output format, RAG handles the static-ish knowledge base the agent should be able to cite, and memory handles whatever the agent has learned from this user, this project, this run of work. Fine-tuning answers "how should this model behave by default." RAG answers "what does our organization know." Memory answers "what have we learned about this user, recently, here." Different question per layer, no overlap, no substitution.
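In code, the combination is unglamorous: each layer contributes its own slice of context to the same prompt. The function name, section headers, and snippets below are invented; the RAG and memory lists stand in for whichever retrieval systems you actually run.

```python
def build_prompt(system_prompt: str, rag_snippets: list[str],
                 memory_snippets: list[str], user_message: str) -> str:
    # Behavior from the system prompt (or a fine-tune), organizational knowledge
    # from RAG, learned context from memory. No layer substitutes for another.
    sections = [
        system_prompt,
        "Reference material:\n" + "\n".join(rag_snippets),
        "What we know about this user:\n" + "\n".join(memory_snippets),
        "User message:\n" + user_message,
    ]
    return "\n\n".join(sections)

print(build_prompt(
    system_prompt="You are a concise travel-planning assistant.",
    rag_snippets=["Rail bookings open 90 days before departure."],        # document corpus
    memory_snippets=["Prefers trains to flights for trips under six hours.",
                     "Moved to Berlin in May."],                          # past conversations
    user_message="How should I get to Paris next month?",
))
```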
The failure pattern we see most often is over-investing in one of the three at the expense of the others. The fine-tuning maximalists end up with a model that sounds right and is two months out of date on facts. The RAG maximalists nail documentation questions and give blank answers to anything personal. The memory maximalists can hold a conversation but can't tell you what's in their own product docs. None of those is fixed by doing more of the thing that broke the system; the fix is letting each layer cover the question it's actually built for.
The cost shape matters too, since it changes what you optimize. Fine-tuning is a one-time spend amortized across every later query: expensive once, near-free per call, pays back only at volume. RAG and memory are per-query: small retrieval plus some extra prompt tokens on every turn, individually cheap, collectively meaningful at scale. For an internal tool doing a thousand queries a month, fine-tuning rarely pays back and the per-query overhead of retrieval is rounding error. For a consumer agent doing millions of turns a day, the math inverts; a well-targeted fine-tune that shrinks the prompt becomes one of the higher-leverage moves on the table, less as a replacement for retrieval than as a way of cutting how much work each retrieved chunk has to do.
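The back-of-envelope version of that inversion, with every number below an illustrative placeholder rather than real pricing: a one-time spend amortized over call volume against a small cost that repeats on every call.

```python
FINE_TUNE_ONE_TIME = 5_000.00   # placeholder: curation + training, paid once
RETRIEVAL_PER_CALL = 0.002      # placeholder: extra prompt tokens + retrieval, every call

def per_call(calls_per_year: int) -> tuple[float, float]:
    """Amortized fine-tune cost per call vs. per-call retrieval overhead."""
    return FINE_TUNE_ONE_TIME / calls_per_year, RETRIEVAL_PER_CALL

# Internal tool, ~1,000 queries a month: the fine-tune works out to ~$0.42 per
# call, and the retrieval overhead is rounding error in absolute terms.
print(per_call(12_000))
# Consumer agent, millions of turns a day: the same one-time spend rounds to
# nothing per call, while the per-call overhead is now the number that matters.
print(per_call(1_000_000_000))
```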
They age differently, too
Cost is what it takes to run a system today. Maintenance is what it takes to keep running it tomorrow without the quality degrading on you. The three approaches age in noticeably different ways, and the differences are easy to miss until you're six months past launch and one of them is the thing you spend Friday afternoons on.
Fine-tuning has to be redone. The base model updates on a cadence you don't control, and your fine-tune is locked to whatever version it was trained on. The choice is to freeze on the older base (and miss the quality improvements everyone else is now getting for free) or to re-fine-tune against the new base, which means paying the data-curation cost again. The same story applies when your own data shifts: last year's style guide doesn't produce this year's fine-tune. None of this work runs continuously. Each round is a project with a kickoff and a ship date, not a pipeline that takes care of itself.
RAG corpora have to be re-indexed. A document changes, the chunks it produced have to be re-embedded and re-inserted, and the stale versions need to be cleanly removed (if you're fastidious about it) so they don't hang around as ghost recalls weeks later. The pipeline that does all of that is one of the underbuilt parts of most early RAG stacks, and the bad week usually arrives the first time someone realizes the agent has been quoting last quarter's pricing in a customer-facing context. Continuous work, operational in shape; it runs as long as the corpus does.
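A sketch of that pipeline for one changed document, against a hypothetical vector-store interface (`delete_by_doc_id`, `upsert`) and embedding call; the names are invented, and the part worth copying is the order of operations: remove the old chunks before inserting the new ones, so stale versions can't linger.

```python
# Hypothetical interfaces: `index` is your vector store, `embed` your embedding
# model, `split` your chunker. All three names are invented for illustration.
def reindex_document(index, embed, split, doc_id: str, new_text: str) -> None:
    # 1. Drop every chunk the previous version produced, so stale text
    #    can't come back as a ghost recall weeks later.
    index.delete_by_doc_id(doc_id)

    # 2. Re-chunk and re-embed the new version.
    chunks = split(new_text)
    vectors = [embed(chunk) for chunk in chunks]

    # 3. Insert the fresh chunks, tagged with their source document.
    index.upsert([
        {"doc_id": doc_id, "chunk_index": i, "text": chunk, "vector": vector}
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ])
```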
Memory grows continuously and self-curates, which is a slightly misleading way to say "someone has to have built the parts that make growth and curation automatic." Timestamps on every memory, recency-aware ranking that demotes stale facts in favor of newer ones, supersession that marks older versions of a now-changed fact as historical, cascade deletion that ensures a forgotten memory actually disappears from every derived index. Engram doesn't auto-expire memories — what gets stored stays until the agent or user explicitly deletes it. Once the curation pieces exist, the layer keeps itself in good shape on its own, and the work shifts from re-index schedules to event-driven cleanup that happens as facts change. The work doesn't go to zero; it goes to a place where the layer can do most of it without you noticing.
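Two of those curation pieces in miniature, with invented in-memory stand-ins for the store and its derived indexes (this is not Engram's API, just an illustration of the mechanics): supersession keeps the older version of a changed fact as historical rather than deleting it, and cascade deletion removes a forgotten memory from every index at once.

```python
from datetime import datetime, timezone

# Invented stand-ins for a memory store and its derived indexes.
memories: dict[str, dict] = {}      # id -> memory record
vector_index: dict[str, list] = {}  # id -> embedding
keyword_index: dict[str, set] = {}  # id -> token set

def supersede(old_id: str, new_fact: str, user_id: str) -> str:
    """Store the new version and mark the old one as historical, not gone."""
    new_id = f"m{len(memories) + 1}"
    memories[new_id] = {"user_id": user_id, "fact": new_fact,
                        "created_at": datetime.now(timezone.utc), "superseded_by": None}
    memories[old_id]["superseded_by"] = new_id  # demoted in ranking, still auditable
    return new_id

def forget_user(user_id: str) -> None:
    """Cascade deletion: the record and every derived index entry go together."""
    for mid in [mid for mid, m in memories.items() if m["user_id"] == user_id]:
        memories.pop(mid, None)
        vector_index.pop(mid, None)
        keyword_index.pop(mid, None)

memories["m1"] = {"user_id": "u1", "fact": "Lives in Boston",
                  "created_at": datetime.now(timezone.utc), "superseded_by": None}
supersede("m1", "Lives in Berlin", "u1")
forget_user("u1")   # nothing about u1 survives in any index
```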
None of the three is "set and forget." But the shape of the not-forgetting differs across the three, and that shape is what you're picking when you bet on one of them as the primary mechanism.
The reframe that answers the question
"RAG or memory or fine-tuning" is the wrong question to ask, because it treats three different layers as if they were alternatives to one another. The version of the question that's actually useful, and that mostly answers itself the moment you write it down, is: where in my agent does context live, and how often does it change?
Context that's about how the model should behave and that changes rarely (style, format, domain voice) lives in the weights, which is the fine-tuning layer. Context that's about what the organization knows and changes on the cadence documents change (occasionally, in batch, with someone reviewing it) lives in a corpus, which is RAG. Context that's about what the agent has learned from this specific user, project, or run of work and changes every time the agent runs lives in memory.
Most real agents have all three flavors of context, and most real agents end up running all three layers in parallel. The mistake isn't picking the wrong one. The mistake is not realizing there are three layers in the first place, and trying to bend one of them to do the work of the others. Fine-tuning on conversational facts produces a model that's incorrect about everything that's changed since training; jamming an entire knowledge base into something labeled "memory" produces a memory layer that's slow, expensive, and bad at the personalization it was built for; pointing RAG at last week's session log instead of a real memory store gets you stale ghosts of conversations the user has already moved on from. If the layer feels uncomfortable for what you're asking it to do, that's usually the signal: wrong question against wrong layer.
The payoff for getting the layering right is mostly that the system becomes easier to reason about under pressure. When a user tells the agent something new, there's one place to add it. When the agent gets something wrong, there's one place to look. When a user asks to be forgotten, there's one place that has to remember to forget them. And the three-way comparison the rest of the industry keeps having stops being a thing you have to have, because the three things were never alternatives in the first place.
Further reading
Closely related
- What is AI agent memory? Vendor-neutral primer on the category: statelessness, hybrid retrieval, why a vector DB alone is not enough.
- Why bigger context windows haven't killed memory products. Cost, latency, recall, durability — four axes that get worse, not better, as context windows grow.
- The memory-aware RAG pipeline that knows when not to retrieve. Three categories of agent turn, a gating signal, and a soft-fail composer for the times the gate gets it wrong.
Engram
- Engram on LongMemEval-S: 91.6%. Full benchmark methodology and what didn't work.
- Engram docs. HTTP API, MCP setup for each client, SDK examples.
- Start with Engram. Free tier, BYOK, MCP-native.