
The memory-aware RAG pipeline that knows when not to retrieve

Most RAG pipelines retrieve on every turn. That's wrong. A practical framework for when memory should fire, when it shouldn't, and how to tell, without paying for a heavyweight classifier on every call.

Published February 24, 2026 · By Jacob Davis and Ben Meyerson

A user asks the agent what's 2 + 2. The pipeline retrieves three memories. The top hit is a fact from last month: the user's tax filing status is "single, no dependents." That gets injected into the prompt. The model, to its credit, ignores it and answers 4. Everybody moves on.

Nothing visibly broke. But three things happened that shouldn't have. Tokens were spent on a retrieval the model did not need. Latency was added to a turn that should have been instant. And on a slightly harder question, one where the model is less certain or where the retrieved memory is less obviously off-topic, the injected noise would have actively corrupted the answer.

This is the shape of the problem in every RAG system that retrieves unconditionally. It mostly doesn't fail loudly. It silently degrades, in a way that's hard to attribute and easy to underprice. This post is about the gate: deciding, per turn, whether memory should fire at all.

Why most pipelines retrieve on every turn

"Always retrieve" is the common default for reasons that, taken together, produce a defensible-looking architecture out of three individually weak choices.

For one thing, it's a single-line change. Add a memory client, call query before composing the prompt, splice the result into the system message, ship. Gating requires two code paths and a decision about which one to invoke; nobody is excited to write that on a Tuesday afternoon, and the unguarded version goes out the door.

The negative case is also hard to test. A successful retrieval shows up in the answer. A successful skip is the absence of a retrieval that would have been irrelevant — which produces exactly zero visible difference. Wins for skipping are non-events, non-events don't appear in dashboards, and the part of the system that benefits most from a gate is the part that's hardest to make a case for during the work.

And the cost feels invisible per call. A single retrieval is a couple hundred milliseconds and a few hundred tokens; each one rounds to nothing on its own. The arithmetic only stops rounding when you stack many turns per session and many sessions per day, and at that point you're looking at a real slice of your inference bill that's mostly being spent on turns that didn't need memory at all. We'll work that math later; the short version is that unconditional retrieval ends up costing a chunk of total token spend that surprises everyone who measures it for the first time.

Three categories of agent turn

Before you can build a useful gate, you need a working taxonomy of which turns deserve to fire retrieval and which don't. We bucket turns into three: ones that clearly should retrieve, ones that clearly shouldn't, and ones in the murky middle. The middle bucket gets most of the attention from people designing retrieval systems. The second bucket — turns where memory should sit still — is where most of the cost hides, and it's the bucket most RAG architectures behave as if they didn't believe in.

Should retrieve

The easy case, the one nobody argues about. The user's turn references something specific to this user, project, or conversation history — "what did we decide about caching last sprint," "what's my deployment date," "what was the bug Maria flagged yesterday." A general-purpose LLM has no way to answer those correctly on its own, because the relevant facts live entirely in memory and nowhere in the model's training data. Retrieval here isn't optional, and the question of whether to fire is moot.

Should not retrieve

The interesting case, and the one most agent designs leave unhandled. More real traffic ends up here than people building these systems usually expect, and the four sub-shapes we see most reliably are pure reasoning ("what's the time complexity of merge sort?"), pure knowledge ("what's the boiling point of nitrogen?"), pure code generation ("write a function that sorts a list of dicts by a given key"), and function-call follow-ups where the agent just got a response back from fetch_invoice and is composing the next step in a multi-step plan. The pattern across all four: the model already has the answer it needs, no user-specific context could sharpen it, and any retrieval that fires is at best a no-op and at worst noise that pulls the answer off-axis. The "2 + 2" case from the intro is the canonical bad version — the model has the answer cold, retrieval surfaces something, and now something is sitting in the context window taking up attention budget.

The reason this bucket gets ignored: the harm is invisible. When the agent answers a pure-knowledge question correctly despite a noisy retrieval, you don't notice. When it answers slightly wrong because the noise pulled it off, you blame the model. The retrieval that shouldn't have fired is never the suspect, because retrieval is supposed to be the thing that helps. So the cost piles up in the column nobody is looking at.

A related failure mode is the "model anchored on the retrieval" pattern. The user asks a generic question, the retrieval surfaces something tangentially related, and the model decides that's the frame it's supposed to use. The output is technically responsive but oddly slanted toward the user's prior context. You see this most in coding agents that retrieve too aggressively and start every answer with "based on your previous discussion of..." even when the question was generic. The fix isn't a better composer prompt. The fix is not retrieving in the first place.

Ambiguous

The messy middle. Anything that mixes user-specific context with general knowledge.

  • "How should I structure my Postgres migration?" Depends on whether the user has discussed their schema, but might be a fully generic question.
  • "Is using websockets a good idea here?" "Here" is doing a lot of work and might or might not refer to a known project.
  • "Should I refactor the auth layer?" There's an "auth layer" that's user-specific, but the advice is mostly general.

Our default on ambiguous: retrieve, but cheaply, and let the composer soft-fail (we'll get to that). Missing a relevant memory is usually worse than spending a small latency budget on a retrieval the composer ends up ignoring.
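
For concreteness, the taxonomy can be carried through a pipeline as an explicit three-way label rather than a boolean. The names below are ours for illustration, not part of any API; the only substantive choice encoded is the default just described, where ambiguous collapses to retrieve.

```python
from enum import Enum

class GateDecision(Enum):
    RETRIEVE = "retrieve"    # turn references user-specific context
    SKIP = "skip"            # model can answer from its own knowledge
    AMBIGUOUS = "ambiguous"  # mixes user-specific and general knowledge

def should_fire(decision: GateDecision) -> bool:
    """Collapse the three-way label into a fire/don't-fire choice.

    Ambiguous turns default to retrieving (cheaply, with the composer
    allowed to soft-fail), because a missed memory usually costs more
    than a wasted retrieval."""
    return decision is not GateDecision.SKIP
```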

The gating signal

You have two reasonable choices for the gate itself, with different cost and accuracy profiles. We use the second one in production and recommend it as the default; the first is what you reach for when you find a category where the rule isn't good enough.

A small-LLM classifier

Take the user's current turn, plus the last one or two assistant turns for context, and pass them to a small model (think 1B–8B parameters, locally hosted or via a cheap API). One-shot prompt: "Does answering this require user-specific memory? Reply yes or no." Cache the result keyed by turn hash.

A small classifier is more accurate than a heuristic, especially on the ambiguous middle. It's also more expensive, even with the cheapest available model, and it adds a roundtrip you're trying to save in the first place. The tax is a small amount of latency and a tiny amount of money per turn. Worth it when the heuristic misses, overkill when it doesn't.
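
A minimal sketch of that gate, under a few assumptions not spelled out above: `call_small_model` stands in for whatever client wraps your small model and returns its text completion, and the cache is an in-process dict rather than anything durable.

```python
import hashlib

GATE_PROMPT = (
    "Does answering the user's latest message require user-specific memory "
    "(their projects, past decisions, people or dates they have mentioned)? "
    "Reply with exactly 'yes' or 'no'.\n\n{conversation}"
)

# Cache keyed by a hash of the turns, so retries and repeated turns
# don't pay for a second classifier call.
_gate_cache: dict[str, bool] = {}

def classifier_gate(turns: list[str], call_small_model) -> bool:
    """Return True if retrieval should fire, as judged by a small model."""
    conversation = "\n".join(turns[-3:])  # current turn plus last two for context
    key = hashlib.sha256(conversation.encode()).hexdigest()
    if key not in _gate_cache:
        reply = call_small_model(GATE_PROMPT.format(conversation=conversation))
        _gate_cache[key] = reply.strip().lower().startswith("yes")
    return _gate_cache[key]
```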

A prompt-time heuristic

What we ship by default. The rule, in plain English: if the user's turn references the user themselves, a previous decision, a person by name, a project, a date, or a pronoun whose antecedent isn't in the current turn, retrieve. Otherwise, don't.

The implementation is a regex-and-keyword pass that runs in microseconds. It looks for first-person pronouns (my, our, we, I), demonstrative pronouns without a local antecedent (this, that, those), past-tense verbs with no explicit object ("what did we decide"), proper nouns that don't match a list of public-domain entities, and relative time references (yesterday, last week, this sprint).
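
A trimmed sketch of that pass follows. The pattern lists are illustrative rather than the production rule set, and the demonstrative-without-antecedent check is omitted because antecedent detection doesn't reduce to a one-line regex.

```python
import re

# Signals described above, as illustrative patterns.
_FIRST_PERSON = re.compile(r"\b(my|our|we|i|me|us)\b", re.IGNORECASE)
_RELATIVE_TIME = re.compile(r"\b(yesterday|last\s+(week|month|sprint)|this\s+sprint)\b", re.IGNORECASE)
_PAST_DECISION = re.compile(r"\bwhat\s+(did|was|were)\b", re.IGNORECASE)

# Stand-in for a real allowlist of public-domain entities.
_PUBLIC_ENTITIES = {"postgres", "python", "linux", "aws", "bm25"}

def heuristic_gate(turn: str) -> bool:
    """Return True when the turn looks like it references user-specific context.

    Deliberately over-eager: a false positive costs a wasted retrieval,
    a false negative costs the right answer, so ties go to retrieving."""
    if _FIRST_PERSON.search(turn) or _RELATIVE_TIME.search(turn) or _PAST_DECISION.search(turn):
        return True
    # Capitalized words past the first one that aren't known public entities
    # are treated as user-specific proper nouns (people, projects, services).
    for word in turn.split()[1:]:
        if word[:1].isupper() and word.strip(".,!?").lower() not in _PUBLIC_ENTITIES:
            return True
    return False
```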

On our internal labeled set, the heuristic agrees with the gold labels most of the time. The errors are asymmetric: it false-positives (retrieves when it didn't need to) more often than it false-negatives (skips when it should have retrieved). That's the asymmetry you want, because a false-positive costs you a retrieval and a false-negative costs you the right answer.

A small classifier hits noticeably higher agreement on the same set. The gap is mostly in the ambiguous category: abstract questions about systems the user has been working on. If your domain has a lot of that, classifier; otherwise, heuristic.

The "retrieve but soft-fail" pattern

Even when you do retrieve, the answer to "did the retrieval help" is not always yes. Sometimes the top-k results are stale, off-topic, or just not the memory you needed. The composer prompt has to handle that gracefully.

The pattern is one paragraph in the system prompt, but it matters a lot:

Use the retrieved context if and only if it is directly relevant to the question. If the retrieved memories do not address what the user is asking, ignore them and answer from your own knowledge. Do not force a connection.

That instruction is the difference between a model that helpfully says "the boiling point of nitrogen is -195.8°C" and a model that says "based on your notes about tax filing, the boiling point of nitrogen is..." when retrieval misfired and surfaced something unrelated. The composer needs explicit permission to ignore its inputs. Without it, models tend to over-anchor on whatever is in the prompt, because the prompt's presence implies relevance.

Soft-fail is also your safety net for false-positive gates. When the gate fires retrieval on a turn that didn't really need it, the composer should be able to look at the returned memories, recognize they don't apply, and answer cleanly anyway. With the soft-fail instruction in place, the cost of a false-positive gate drops to "wasted tokens and latency" instead of "wrong answer." That's exactly the cost shape you want: recoverable, not catastrophic.
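
In code, the pattern amounts to keeping the clause in the composer's system prompt next to whatever memories survived retrieval. A sketch, with the surrounding prompt text and message shape invented for illustration:

```python
SOFT_FAIL_CLAUSE = (
    "Use the retrieved context if and only if it is directly relevant to the "
    "question. If the retrieved memories do not address what the user is "
    "asking, ignore them and answer from your own knowledge. Do not force a "
    "connection."
)

def build_composer_messages(question: str, memories: list[str]) -> list[dict]:
    """Assemble the composer call. With the soft-fail clause in the system
    prompt, a misfired retrieval degrades to wasted tokens, not a wrong answer."""
    system = "You are the user's assistant.\n\n" + SOFT_FAIL_CLAUSE
    if memories:
        system += "\n\nRetrieved memories:\n" + "\n".join(f"- {m}" for m in memories)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```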

The skip flag at the API level

Once you've built a gate above the retrieval call, the next question is what to do at the layer below. Engram's /v1/query endpoint supports skip_synthesis=true for cases where you want raw retrieval results without the server's built-in composer running on top.

This matters when you have your own composer downstream. The default behavior, where the server returns retrieved memories and a synthesized natural-language answer, is convenient for one-shot consumers but wasteful for agent pipelines that already have a composition step. If you're going to feed retrieved memories into your own GPT-5 call along with the rest of your prompt, you don't need our composer running first; that's two LLM calls where one suffices.

The pattern we recommend for production agent pipelines:

  1. Run the gate. If skip, don't call /v1/query at all.
  2. If retrieve, call /v1/query with skip_synthesis=true and pass the raw memory list to your own composer.
  3. The composer's system prompt includes the soft-fail clause from the previous section.

For human-facing chat interfaces (a support agent, a customer assistant), the default skip_synthesis=false is usually what you want, because the server's composer is already tuned for that use case and you don't need a second one. Pick per use case, don't pick globally.
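
Wired together, reusing the sketches above, the whole path looks roughly like this. The request and response shapes are assumptions for illustration, not the documented query schema, the endpoint URL is a placeholder, and `call_llm` stands in for whatever composer call your pipeline already makes.

```python
import requests

ENGRAM_QUERY_URL = "https://api.example.com/v1/query"  # placeholder host

def answer_turn(turn: str, call_llm, api_key: str) -> str:
    # 1. Run the gate. If it says skip, never touch the memory backend.
    if not heuristic_gate(turn):
        return call_llm(build_composer_messages(turn, memories=[]))

    # 2. Retrieve raw memories only; our own composer runs downstream,
    #    so the server-side synthesis step is skipped.
    resp = requests.post(
        ENGRAM_QUERY_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": turn, "skip_synthesis": True},
        timeout=5,
    )
    resp.raise_for_status()
    memories = [m["text"] for m in resp.json().get("memories", [])]  # assumed response shape

    # 3. Compose with the soft-fail clause in place, so irrelevant results
    #    degrade to wasted tokens rather than wrong answers.
    return call_llm(build_composer_messages(turn, memories))
```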

The token-cost math

Time to make the invisible cost visible. The retrieval payload (top-k memories plus their explanations) is the dominant variable in a retrieving turn's token bill; how dominant depends on your k and how verbose your memory entries are. Everything else (system prompt, last few turns of history, current user message, model output) is comparatively fixed. Drop retrieval on a turn that didn't need it and you skip roughly a payload's worth of input tokens, plus a small additional saving from a leaner composer prompt that no longer has to include the soft-fail clause.

Now apply that to traffic. The exact fraction of skippable turns depends heavily on what your agent does. Coding agents, where almost every question references a project, skip a small slice. General assistants, customer-facing chat, and orchestrators with lots of intermediate tool-call steps skip a much bigger slice. Whatever your skip rate is, the per-session savings compounds: multiply tokens-saved-per-skipped-turn by skip rate by turns per session, and you arrive at a per-session figure that's small. Multiply that by sessions per month and it stops being small.
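
To make the multiplication concrete, here is the arithmetic with made-up inputs; the numbers are placeholders to show the shape, and the only honest version uses your own measurements.

```python
# All inputs are illustrative, not measurements from any real deployment.
tokens_per_retrieval_payload = 600     # top-k memories plus their explanations
skip_rate = 0.35                       # fraction of turns the gate correctly skips
turns_per_session = 20
sessions_per_month = 50_000
usd_per_million_input_tokens = 2.50

tokens_saved_per_session = tokens_per_retrieval_payload * skip_rate * turns_per_session
monthly_tokens_saved = tokens_saved_per_session * sessions_per_month
monthly_usd_saved = monthly_tokens_saved / 1_000_000 * usd_per_million_input_tokens
# 4,200 tokens per session -> 210M tokens per month -> ~$525/month,
# before counting any quality effect from the leaner context.
```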

The dollar number is the boring part. The interesting number is the attention budget. The tokens you skip aren't just cheaper, they're tokens that were going into the model's context window and competing with the tokens that actually mattered. Quality lift from a tighter context is harder to measure than the cost lift, but it's the half of the win you should care more about.

Two ways to frame the savings, both real: against a baseline of "always retrieve," gating recovers a healthy fraction of retrieval-related token spend. As a fraction of total inference spend across your stack, the percentage is smaller, because the composer call itself isn't getting any cheaper. Be clear which one you're quoting.

The latency math

A retrieval call against a warm memory backend is a small but non-trivial chunk of wall time per turn: vector lookup, BM25, graph, fuse, rank, return. Per retrieval, per turn.

For a single-call agent (user asks, agent answers, done), shaving one retrieval is nice but not transformative. For a multi-agent orchestrator, where a planner spawns an executor, the executor spawns a critic, and the critic re-plans, every LLM call inside that loop is potentially a retrieval call, and the latencies compound. We've seen agent stacks where the planner alone fires retrieval several times during a single user-facing turn. Skipping the ones that didn't need it adds up to a chunk of wall time large enough to feel, which on a chat interface is the difference between "responsive" and "the spinner is doing something."

The general shape: if your average user-facing turn fires N internal LLM calls and your gate correctly skips on some fraction of them, the savings per user-facing turn scales with N. For orchestrator-shaped agents (high N), this is where the latency win lives, and it tends to move perceived responsiveness more than any LLM upgrade we've ever shipped.
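
Same exercise for latency, again with invented inputs; the point is only that the saving scales with N.

```python
# Illustrative inputs, not measurements.
retrieval_latency_ms = 150      # warm backend: lookup, fuse, rank, return
internal_calls_per_turn = 6     # N, for an orchestrator-shaped agent
skip_rate = 0.5                 # fraction of those calls that didn't need memory

latency_saved_ms = internal_calls_per_turn * skip_rate * retrieval_latency_ms
# 6 * 0.5 * 150 ms = 450 ms shaved off a single user-facing turn.
```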

The honest tradeoff

Gating is not free. A gate that wrongly skips costs you a missed memory, which means the agent either hallucinates or punts on a question it should have answered; a gate that wrongly fires costs the tokens, latency, and noise described above. Which error matters more depends on your application.

If you're a coding agent, the asymmetry favors retrieve-by-default. Code questions are full of references to previously-discussed projects, files, decisions, and people, and the cost of an unnecessary retrieval is much smaller than the cost of forgetting that the user previously decided to standardize on Bun. Tune the gate toward retrieve; let the soft-fail clause clean up the false positives.

If you're an autonomous orchestrator burning tokens on long task plans, the asymmetry flips. Every retrieval becomes many retrievals once the orchestration loop unwinds, and most of the intermediate steps don't need any memory at all. They're mechanical compositions of tool outputs. Tune the gate toward skip; the rare false-negative is the price of not paying for several times the retrieval budget you actually need.

If you're a customer-facing chat assistant, the right answer is in between. Retrieve on anything that even mildly looks like a personalized question, skip on anything that's clearly a general inquiry. Customer-facing apps get punished hard for the obviously-wrong answers that come from forgetting context, so we lean toward retrieve, but we still skip on a noticeable chunk of greetings, small talk, and pure-knowledge queries that show up at the start of every conversation.

Whichever direction you lean, the calibration is empirical. Sample a few hundred turns from your production traffic, label them, and measure your gate's precision and recall. The right gate for your application is the one whose error distribution matches your failure-mode tolerances. There is no universal answer here, and anyone telling you otherwise is selling you the wrong abstraction.

"Query-first" isn't always the right default

Our own MCP setup instructions tell the agent to query memory before answering. That advice is correct for a chat-style agent that's just been connected to memory and needs to build the habit of checking. It is overkill for production pipelines where you've already invested in a proper gate.

The progression we see customers move through, in order:

  1. No memory. The agent forgets everything between sessions. This is the baseline.
  2. Memory enabled, no gate. Drop in the MCP server, follow the recommended system prompt, agent queries before every answer. Big quality win over baseline. Token cost goes up meaningfully; nobody notices yet.
  3. Gate added. Heuristic or small classifier in front of the retrieval call. Recovers most of the token spend, keeps the quality win. This is where most production deployments end up.
  4. Gate plus soft-fail composer plus skip_synthesis. The full pattern from this post. Recovers a bit more on quality (fewer noise-corrupted answers), saves another sliver on cost, and gives you a composition layer you can actually tune for your domain.

If you're at step two and your bill is climbing, the answer is step three, not "use less memory." Memory should be omnipresent; retrieval should be selective.

What to take away from this

Retrieval is a real cost, not a free win. You pay for it in tokens, in latency, and in the attention-budget contamination that doesn't show up on any dashboard but quietly corrupts the easy questions. "Always retrieve" treats memory like a switch you flip on and walk away from, and the bill arrives later in shapes that are hard to attribute.

A bad pipeline fires on every turn and slowly degrades the questions that should have been trivially right. A good one fires when context is actually needed and stays silent otherwise. The whole apparatus that gets you from the first to the second is a prompt-time rule, a soft-fail clause in the composer, and the willingness to go measure what your current pipeline is actually spending on retrievals nobody asked for. Build the gate.
