Opinion

Patterns from agent papers that didn’t work for us

Four patterns that look like obvious wins on paper. We tried them on our LongMemEval pipeline, measured the result against baseline, and either dropped or shelved each one. The point isn’t that the patterns are bad. The point is that “agent papers from the last year” is not a substitute for measuring on your own distribution.

Published April 24, 2026 · By Jacob Davis and Ben Meyerson

Papers report wins on the configurations where the pattern won

This is a banal observation about academic publishing, but it has expensive consequences. A paper that improves a benchmark by N points is, almost by definition, a paper about a configuration where the pattern improved that benchmark by N points. Configurations where the pattern was neutral or harmful do not get written up. So the literature systematically over-represents win-conditions.

That is fine when you’re a researcher trying to figure out whether a technique works under any conditions. It is less fine when you’re an engineer trying to figure out whether it will work under your conditions. Real systems have to work across configurations, and any individual pattern’s contribution depends on whether the failure mode the paper assumed is the failure mode you actually have.

We spent the spring running Engram against LongMemEval-S. Along the way we tried four high-confidence patterns lifted directly from recent agent-paper literature. Three got dropped. The fourth got shelved pending a better gating signal. This post walks through what we tried, what we measured, why each one failed in our setup, and what we’d do differently before trying the next one.

For context on the benchmark itself, our retrieval stack, and where we ended up (91.6%), the main LongMemEval write-up is the prerequisite. This post is the long version of one section of that one.

Pattern 1: Critic-and-retry

The pattern. Pair your composer (the LLM call that drafts the final answer from retrieved memories) with an adversarial critic. The critic reads the draft alongside the memories that were retrieved, looks for unsupported claims or under-supported answers, and either approves the draft or sends a second-pass query back into retrieval to fill the gap. The composer then re-drafts with the augmented context. You stop when the critic approves, or when a retry budget runs out.

Why it looked good. This pattern shows up everywhere in 2024 and 2025 agent-paper literature. The intuition is appealing: LLMs are uneven, a second LLM looking at the first one’s output catches mistakes the first one didn’t notice, and the cost of a critic call is small relative to the cost of being wrong. Multiple published systems report meaningful gains from this exact shape. It is the textbook “easy point” pattern.

What we built. v41 of our pipeline added a critic between the composer and the final response. The critic prompt was “here’s the question, here’s the retrieved memories, here’s the draft answer; identify anything in the draft that is not supported by the memories, or any obvious gap where a follow-up retrieval would help.” If the critic flagged a gap, the pipeline issued a follow-up query and re-ran the composer once. Otherwise the draft shipped.
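
As a rough sketch of that control flow (the compose, critique, and retrieve callables below are hypothetical stand-ins for our composer call, critic call, and retrieval stack, not our actual interfaces):

```python
# Rough sketch of the v41 critic-and-retry loop. The compose/critique/retrieve
# callables are hypothetical stand-ins, not our production interfaces.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    approved: bool
    followup_query: Optional[str] = None

def compose_with_critic(
    question: str,
    memories: list[str],
    compose: Callable[[str, list[str]], str],            # drafts an answer from memories
    critique: Callable[[str, list[str], str], Verdict],  # one-shot judge of the draft
    retrieve: Callable[[str], list[str]],                 # second-pass retrieval
    max_retries: int = 1,
) -> str:
    draft = compose(question, memories)
    for _ in range(max_retries):
        verdict = critique(question, memories, draft)
        if verdict.approved or not verdict.followup_query:
            break
        # Critic flagged a gap: augment the context and re-draft once.
        memories = memories + retrieve(verdict.followup_query)
        draft = compose(question, memories)
    return draft
```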

What we measured. v41 caught 2 real recall gaps on the eval set: tasks where the original draft missed something the augmented retrieval found. It also false-positived on 4 previously-correct drafts. On those four, the critic asked for follow-up retrieval, the follow-up retrieval pulled in adjacent-but-irrelevant memories, and the re-drafted answer was worse than the original. Net: −1 point versus baseline.

The calibration attempt. We tried to make the critic more conservative. v42 required that any flag cite specific contradicting evidence from the retrieved memories. The critic had to point at a memory and explain why the draft was inconsistent with it. The default disposition was “approve.” This eliminated the false positives entirely. It also eliminated the true positives. v42’s flag rate dropped to roughly zero across the 120-task development set. The critic collapsed into a no-op.

Why it failed. The two failure modes the critic has to discriminate look nearly identical from a single LLM call’s vantage point. “The draft is wrong because a relevant memory contradicts it” and “the draft is fine, but a slightly more thorough version would also mention X” share most of their surface features. A one-shot judge call cannot reliably tell them apart, because the distinction depends on whether the user’s question actually demands the missing detail, which is itself a judgment call the critic is being asked to make recursively.

We tried a few intermediate calibrations between v41 and v42 (different thresholds, different prompts, different orderings of the memories in the critic’s context). The shape was always the same: more aggressive calibration recovered some true positives but recovered roughly proportional false positives along with them. The Pareto frontier was bad.

Where it might still work. We believe this pattern is probably solvable, just not with prompt engineering alone. A small fine-tuned classifier trained on actual labeled failures from the eval set, rather than a general-purpose LLM with a critic prompt, could plausibly learn the distinction. Or a non-LLM gating signal: an aggregate-versus-enumeration count mismatch, or a confidence-from-retrieval-fusion-score threshold, would give the critic something concrete to anchor on. We didn’t build either of those. The cost to do it right was higher than the cost to move on to other patterns, and we moved on.

Total cost: about a week of engineering, plus a non-trivial amount of inference spend re-running the eval set on each calibration attempt.

Pattern 2: A bigger, better extractor

If your memory system is missing facts on recall, the obvious lever is the extractor: the LLM call that runs at ingest time to read each incoming message and emit structured memories (facts, triples, entities) for later retrieval. We had a working theory that most of our multi-session failures traced back to the ingest stage. Multi-session questions like “how many weddings did I attend this year” require the system to have laid down clean, distinct, well-attributed memories. A garbled triple, a missed entity, a fact misattributed to the wrong speaker. Any of those propagates forward through retrieval and into the composer.

We were already running a Groq-hosted llama-3.3-70b extractor, which was the cheapest competent option but obviously not the strongest. So we isolated an 11-task stubborn failure set and re-ran the entire pipeline from scratch with the extractor swapped to GPT-5.5, roughly 10x the per-call cost. Same retrieval, same composer, same prompt. Only the extractor changed.

2 of 11 tasks recovered. The same 2 of 11 recovered when we re-ran with the cheaper llama extractor and a different random seed on the composer. The expensive extractor was not where the points were hiding.

The mental model behind the experiment was wrong. We had assumed the bottleneck was upstream, in how memories got laid down. The actual bottleneck was downstream, in the composer step: in how the composer interpreted the question and selected from the memories it was given. The memories were already there in retrieval. The composer just wasn’t using them correctly. This is obvious in retrospect. We had a strong prior that extraction mattered (because it’s the most upstream lever and feels load-bearing), and we acted on the prior instead of testing it. The 10x cost extractor was a $400 experiment that told us the prior was wrong.

Before throwing more compute at any specific stage, localize the failure to a stage. For each failing task: would the right answer be derivable from the retrieved memories, if the composer were operating perfectly? If yes, the composer is the bottleneck and no amount of extractor improvement will help. If no, the upstream stages are worth investigating, but you want to know that first, not after running the expensive experiment. About a week and a half lost, mostly waiting for re-ingest runs.
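
A sketch of the localization pass we should have run first; is_derivable is a hypothetical judge (an LLM call or a human label) answering one narrow question per task:

```python
# Sketch of the stage-localization pass. `is_derivable` is a hypothetical judge
# (an LLM call or a human label) that answers: given only the retrieved
# memories, could a perfect composer produce the gold answer?
from collections import Counter
from typing import Callable

def localize_failures(
    failing_tasks: list[dict],                # each: {"id", "question", "gold_answer"}
    retrieved: dict[str, list[str]],          # task id -> memories handed to the composer
    is_derivable: Callable[[str, str, list[str]], bool],
) -> Counter:
    buckets = Counter()
    for task in failing_tasks:
        memories = retrieved[task["id"]]
        if is_derivable(task["question"], task["gold_answer"], memories):
            buckets["composer"] += 1   # evidence was present; the composer misused it
        else:
            buckets["upstream"] += 1   # extraction/retrieval never surfaced the evidence
    return buckets
```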

Pattern 3: A date pre-pass

Temporal-reasoning questions are a known weak spot. “What did I buy in the past two weeks?” asks the model to parse “past two weeks” into a date range relative to some anchor, check each retrieved memory’s timestamp against that range, and reason about the subset that falls inside. Each of those steps is something LLMs do unreliably, especially in combination. The standard move is to take the arithmetic out of the model’s hands: parse the temporal phrase up front into a concrete range, stamp each retrieved memory with an annotation like in_window=yes or days_ago=N, and pass the annotated memories to the composer. There’s a substantial literature on tool-augmented reasoning and pre-computed annotations, and the intuition is sound on first principles: LLMs are bad at multi-step arithmetic on numbers that come from prose, and good at following labels placed in front of them.

So we built it. A regex-based annotator handled the dozen most common temporal phrases in LongMemEval (“past two weeks,” “since I started running,” “a month ago,” etc.), resolved them against the conversation’s anchor timestamp, and stamped each retrieved memory with both the absolute date and a derived window flag. The composer prompt was updated to reference the new labels explicitly.
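
A sketch of the shape of that annotator (the phrase table, window choices, and field names are illustrative, not our production schema):

```python
# Sketch of the date pre-pass: resolve a temporal phrase against the
# conversation's anchor date, then stamp each retrieved memory with days_ago
# and an in_window flag. Phrase patterns and field names are illustrative.
import re
from datetime import date, timedelta

PHRASES = [
    (re.compile(r"past two weeks", re.I), lambda a: (a - timedelta(days=14), a)),
    (re.compile(r"past month", re.I),     lambda a: (a - timedelta(days=30), a)),
]

def resolve_window(question: str, anchor: date):
    for pattern, to_range in PHRASES:
        if pattern.search(question):
            return to_range(anchor)
    return None  # no temporal phrase matched; skip annotation

def annotate(memories: list[dict], window: tuple[date, date], anchor: date) -> list[dict]:
    start, end = window
    return [
        {
            **m,
            "days_ago": (anchor - m["timestamp"]).days,  # timestamp assumed a datetime.date
            "in_window": "yes" if start <= m["timestamp"] <= end else "no",
        }
        for m in memories
    ]
```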

Zero point change on the temporal-reasoning category. Not within noise. Zero. The composer faithfully used the annotations and arrived at the same wrong answers it had been arriving at without them.

The failures we’d been seeing weren’t about arithmetic. They were about interpretation. When the question said “a month ago,” the system had been correctly computing what “a month ago” meant. The problem was that the question was anchored ambiguously, or that the memory the question was actually about contained two dates (start of trip vs. end of trip) and the composer was picking the wrong one to compare against the window. The pre-computed labels were correct. The composer’s use of them was correct. The thing that wasn’t correct was the interpretation step that lives upstream of either. Telling the composer “this memory is in the window” doesn’t help when the bug is that the composer is checking the wrong memory’s wrong date against the window.

Pre-computation patterns work when the LLM is the thing doing the computation, badly. If the LLM is doing something else (reading the wrong input to compute over, or computing the right thing but answering the wrong question), pre-computing the right computation will be a no-op. Diagnose the actual error before moving the math. About two weeks total; the annotator itself was a couple of days, the rest was eval re-runs.

Pattern 4: Iterating the composer prompt past v44

Take your composer prompt (the long structured instructions the final LLM call sees) and keep adding rules. Each rule is targeted at a specific failure mode you observed in the previous eval run. Run again, observe the new failures, add new rules. Iterate until convergence.

This is what had worked for us up to that point. We had iterated the composer prompt across roughly 28 versions over a few months. Each version added points. By v44 we were at 91.6% on the full 500-task eval, with the prompt itself doing meaningful work on counting rules, conflict resolution, date arithmetic, refusal-to-hallucinate, and several other categories. The prior was “each new version has added points; the next one probably will too.” That’s the sort of prior that compounds quietly into a methodology problem if you don’t watch it.

v45 through v47 each targeted a specific set of failures we saw in the v44 run. v45 added rules for personal-best questions where the optimum-over-time should win rather than the most-recent. v46 tightened the guidance on what counts as a “currently owned” item versus a replaced one. v47 attempted a generalized counting refinement. v45 saved 3 of 4 targeted failures and introduced 3 new ones. Net zero. v46 saved 2 of 3 and introduced 2 new ones. Net −1. v47 introduced more regressions than it fixed and got abandoned mid-eval.

Diminishing returns on monolithic prompts hit hard once the prompt is doing real work. Every new rule adds a constraint that interacts with all the previous rules in ways the author can’t fully model. A rule that helps with counting questions makes the composer more pedantic about scope on preference questions. A rule that tightens date interpretation makes the composer slower to refuse on hallucination-prone questions. The constraint graph at v44 was already dense, and each new edge changed the equilibrium in non-local ways. We didn’t fully characterize which rule changes broke which previously-passing tasks. By the time we needed that characterization, every change was effectively a coin flip on the unrelated subset of the eval, and the iteration loop had stopped being useful as a tuning signal.

We shelved this in favor of per-question-type routing. Different prompts for count questions, temporal questions, preference questions, knowledge-update questions. Each prompt is shorter, each only has to be coherent on a narrower distribution, and rule changes inside one prompt don’t perturb the others. A lightweight classifier (LLM or learned) picks the route. This is consistent with the constraint-graph diagnosis: the monolithic prompt was failing because it was trying to be coherent across question shapes whose constraints conflicted. Splitting the prompt splits the constraint graph. Routing is partially done as of this writing: prompts for two of the six categories and a basic classifier. It’s premature to report numbers; we’ll write that up separately when it’s done. About two and a half weeks before we called it.
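
A sketch of the routing shape we’re moving toward (category names and placeholder prompt text are illustrative; the per-category prompts are where the real work lives):

```python
# Sketch of per-question-type routing. Categories and prompt text are
# illustrative placeholders; the point is that each prompt only has to stay
# coherent on one question shape, so rule changes stop perturbing the others.
from typing import Callable

PROMPTS = {
    "count":            "Answer by enumerating and counting distinct events...",
    "temporal":         "Resolve the time window first, then answer from memories inside it...",
    "preference":       "Prefer the user's most recently stated preference...",
    "knowledge_update": "When facts conflict, the later-dated memory wins...",
    "default":          "Answer the question using only the retrieved memories...",
}

def compose_routed(
    question: str,
    memories: list[str],
    classify: Callable[[str], str],                   # lightweight LLM or learned classifier
    call_llm: Callable[[str, str, list[str]], str],   # (prompt, question, memories) -> answer
) -> str:
    category = classify(question)
    prompt = PROMPTS.get(category, PROMPTS["default"])  # misroutes fall back to the general prompt
    return call_llm(prompt, question, memories)
```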

The shape of all four failures

Each pattern failed for a different proximate reason. The critic failed because two failure modes were indistinguishable to a one-shot judge. The extractor swap failed because the bottleneck wasn’t at the stage we were upgrading. The date pre-pass failed because the model was doing something other than arithmetic. The prompt iteration failed because the constraint graph saturated. But the meta-shape is the same in all four. A pattern published in the literature had been shown to work under conditions where its win-condition held. Our system had a different failure mode. The pattern’s methodology was sound; the assumed failure mode wasn’t ours. The pattern correctly solved a problem we didn’t have. Aggregated, the four were a month or more of net-negative work, not counting the opportunity cost of everything we weren’t doing while running them.

What we learned to do before trying the next pattern

We don’t think the lesson is “ignore agent papers.” The papers are good. The lesson is to do a small amount of structured pre-work before committing to build.

First, categorize the failures you want to fix as a distribution, not a list. How many are recall gaps versus composition errors versus interpretation errors versus arithmetic? If the failures all have the same shape, a pattern targeted at that shape can plausibly fix all of them. If they’re mixed, a pattern targeted at one shape will fix that fraction and leave the rest; if you don’t know the breakdown, you don’t know what fraction you’re buying.

Second, estimate the cost to measure, not the cost to ship. The relevant number is whatever you have to spend to know whether the pattern would help, which is almost always smaller than the cost to ship it. Build the cheapest version that produces a real number on a real eval. If measuring takes a week, the pattern needs to be a credible 3+ point improvement to be worth even measuring.

Third, look for inverse correlations: if a paper’s reported wins concentrate in categories where your own system is already strong, the pattern is unlikely to lift you further, because the headroom isn’t there.

And before changing any stage of the pipeline, prove that stage is the bottleneck. For each failing task, ask whether the stage you want to change, operating perfectly on the input it actually received, could have produced the output the downstream stages needed. If yes, that stage is the problem. If no, its input was already wrong; look further upstream.

None of this is exotic. It’s standard engineering discipline. The reason it’s worth writing down is that the surface appeal of agent-paper patterns (the “easy point” quality) tends to bypass the discipline. The pattern looks like such an obvious win that you skip the pre-work and start building. That’s the failure mode we kept hitting.

Patterns we haven’t tried but might still

A few patterns are still on our list. We’re flagging them here partly because they look promising and partly so we have something to be embarrassed about if we end up writing “these also didn’t work” in a few months.

  • Per-question-type routing. Already partially done, as discussed under the prompt-iteration section. Promising in our setup because it’s a direct response to the saturation we observed on the monolithic prompt. The risk is that the classifier itself becomes a new failure mode and we end up with worse-on-misrouted-questions outweighing better-on-correctly-routed-questions.
  • Tool-augmented arithmetic. Let the composer call out to a calculator (or generate Python code that gets executed) instead of doing multi-step arithmetic in-prompt. We expect this to help on the date-math and sum-of-prices failure categories where models are reliably bad. The risk is that the tool-call overhead introduces new failures at the boundary (parsing the inputs to the tool wrong, formatting the outputs into the response wrong). A minimal sketch of the shape follows after this list.
  • Iterative profile consolidation. Re-summarize the canonical user profile over time, so stale entries get pruned and newly-stable entries get promoted. We expect this to help on the recall-gap residual we’re still seeing: cases where the right fact existed in some message but didn’t make it into the profile. The risk is that consolidation drops information that turns out to matter later. The profile pass itself is the lever that got us from 87% to 91.6%, so we’re cautious about modifying it.
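
For the tool-augmented arithmetic item above, a minimal sketch of the shape we have in mind (the tool names and call format are assumptions, not a chosen implementation):

```python
# Minimal sketch of tool-augmented arithmetic: the composer emits a small
# declarative tool call instead of doing date math or sums in-prompt, and the
# host executes it. Tool names and call format are assumptions.
from datetime import date

def days_between(start: str, end: str) -> int:
    """Exact day count between two ISO dates, e.g. days_between('2026-03-01', '2026-03-20')."""
    return abs((date.fromisoformat(end) - date.fromisoformat(start)).days)

def sum_prices(prices: list[float]) -> float:
    return round(sum(prices), 2)

TOOLS = {"days_between": days_between, "sum_prices": sum_prices}

def run_tool_call(name: str, args: dict):
    # The boundary risk named above lives here: the arithmetic is exact, but if
    # the model fills `args` with the wrong dates or misparsed prices, the
    # answer is still wrong.
    return TOOLS[name](**args)
```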

Each of these is something we’d like to try after applying the pre-work above. None of them are scheduled urgently.

Agent-paper patterns are hypotheses, not solutions

The framing we’re trying to land on is this. A paper that reports a pattern working under specified conditions has produced a hypothesis: “this pattern would also work on similar systems under similar conditions.” The hypothesis is testable. It is not the same as a result on your system.

You wouldn’t ship a code change based on someone else’s benchmark of a different codebase. You’d run it on your own. The same standard applies to architectural patterns lifted from papers. Treat them like any other hypothesis: measure on your distribution, against your baseline, before scaling.

We’re writing this up because the patterns that didn’t work for us could easily have looked, in a different write-up, like patterns that worked. We could have shipped the v41 critic and reported “critic-and-retry: yes, this works.” The eval was right there saying otherwise. We’d rather report the −1 honestly, because the −1 is the thing that’s actually useful to other people trying to figure out what to try next.

If you’re working in the same space and want to compare notes, we’re reachable. The methodology and the raw eval data behind the patterns above are in the same benchmarks/results/ tree as the main LongMemEval write-up. We’d rather other people make different mistakes than the same ones.
