Engram on LongMemEval-S: 91.6%

We ran Engram against LongMemEval-S, the public benchmark for long-term memory in conversational AI. The full 500-task run, end to end, with our v44 composer prompt and a canonical user-profile pass: 458/500 = 91.6%.

Published April 7, 2026 · Last updated May 14, 2026 · By Jacob Davis and Ben Meyerson

This post covers how we got there, what we tried that didn’t work, and where we still have headroom. It’s long. Most reported benchmark numbers don’t reproduce from the paper alone, so we erred toward publishing more methodology than needed.

What LongMemEval actually tests

LongMemEval-S has 500 conversations, each with around 480–600 messages spread across multiple sessions. Every conversation comes with one question and a hand-crafted gold answer; GPT-4o grades the system’s response against that gold answer.

The questions break into six categories. The easy ones are direct fact lookups: find something the user said in one session, find something the assistant said, or pull a stable fact like a job title. Most memory systems handle these. The interesting failures are in the other three categories:

  • Multi-session. “How many weddings did I attend this year?” The user mentions a wedding in March, another in August, references the August one again in October from a different angle, and casually drops a third in November. Your memory system has to recognize these are three distinct events and not double-count the August one.
  • Temporal-reasoning. “Which device did I set up first, the smart thermostat or the mesh router?” The user mentioned them in different sessions, and the dates are implicit in session timestamps, not stated.
  • Knowledge-update. “How many followers do I have on Instagram now?” The user gave three different numbers across the conversation, each with a date. Latest one wins, but the system has to actually track that.

Multi-session is the hardest category. Published baselines from the LongMemEval team sit around 50% on this category. Anything north of 80% is a real result.

What we built

The benchmark stack has three pieces, in roughly this order of impact: server-side ingest and retrieval, a canonical user-profile pass that runs once per conversation, and a long composer prompt.

Server-side ingest and retrieval. Every message gets POSTed to Engram. The server does three things in parallel: it generates an embedding for vector search, it extracts subject-predicate-object triples that get written into a knowledge graph (with aggregate nodes counting distinct entities by type, useful for “how many weddings” style questions), and it stores the raw text for BM25 lookup. All three live in Postgres (pgvector for embeddings, plain tables for triples and text). Queries hit a hybrid pipeline that reads from all three: vector similarity for semantic match, BM25 for keyword precision, and graph lookups for structured facts and entity counts, fused via reciprocal rank fusion. This is the part of the stack our customers get out of the box.
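As a rough illustration of the fusion step, here is a minimal sketch of reciprocal rank fusion over three ranked lists. The memory IDs, the function name, and the constant k=60 (a commonly used default) are assumptions for the example, not Engram's server code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory IDs into one ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical result lists from the three retrievers.
vector_hits = ["m3", "m1", "m7"]
bm25_hits = ["m1", "m9", "m3"]
graph_hits = ["m1", "m4"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits, graph_hits])
```

An item that shows up in all three lists (m1 here) rises to the top even if no single retriever ranked it first, which is the property that makes rank fusion a reasonable fit for mixing semantic, keyword, and graph signals.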

Canonical user profile pass. This is the new piece. After ingest, a single LLM call reads the entire conversation history and emits a structured profile of the user: people they know, events they attended, items they own, places they’ve been, recurring activities, stable facts about their life. The key word is canonical: when “my college roommate’s wedding” appears in one session and “Emily’s wedding in the city” appears in another and they’re clearly the same event, the profile merges them into a single entry with both phrasings recorded as aliases. The composer sees this merged view, not 47 mentions across different sessions.
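To make the merge concrete, here is a small sketch of folding co-referent mentions into one entry with aliases. The ProfileEntry shape and the same_as mapping are hypothetical stand-ins invented for this example (the real profile schema is closed-source), and the coreference decisions are assumed to come from the LLM pass.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileEntry:
    canonical: str                     # merged label for the entity or event
    kind: str                          # e.g. "event", "person", "item"
    aliases: set = field(default_factory=set)


def merge_mentions(mentions, same_as):
    """Collapse mentions into canonical entries.

    mentions: list of (phrase, kind) tuples as they appeared in sessions.
    same_as:  phrase -> canonical label; assumed output of the LLM pass.
    """
    entries = {}
    for phrase, kind in mentions:
        canonical = same_as.get(phrase, phrase)
        entry = entries.setdefault(canonical, ProfileEntry(canonical, kind))
        if phrase != canonical:
            entry.aliases.add(phrase)  # keep the original phrasing as an alias
    return list(entries.values())


mentions = [
    ("my college roommate's wedding", "event"),
    ("Emily's wedding in the city", "event"),
    ("my cousin's wedding", "event"),  # illustrative third mention
]
same_as = {
    "my college roommate's wedding": "Emily's wedding",
    "Emily's wedding in the city": "Emily's wedding",
}
profile_events = merge_mentions(mentions, same_as)
```

Counting weddings from profile_events gives two distinct events, where counting raw mentions would give three.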

The profile pass is the biggest single thing we changed. We measured the lift at +4 points against a retrieval-only baseline running the same composer prompt.

Composer prompt. A long structured prompt with rules for counting (always emit a candidate table before you commit to a number), for handling conflicting facts (latest dated mention wins, except for personal-best questions where the optimal value wins), for date arithmetic, for distinguishing things the user actually did versus things the assistant suggested, and for refusing to invent facts that aren’t in retrieval. We’ve iterated on this prompt across roughly 28 published versions over the last few months. Each version was scored on a 120-task subset before we tried it on the full 500.
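The conflicting-facts rule is expressed in the prompt as natural language, but its logic can be sketched in a few lines. The Mention type, the function name, and the follower counts below are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Mention:
    value: float
    said_on: date


def resolve(mentions, personal_best=False, higher_is_better=True):
    """Pick one value from conflicting dated mentions.

    Default: the latest dated mention wins.
    personal_best=True: the optimal value wins, regardless of date.
    """
    if not mentions:
        return None  # nothing in retrieval: refuse rather than invent
    if personal_best:
        pick = max if higher_is_better else min
        return pick(mentions, key=lambda m: m.value).value
    return max(mentions, key=lambda m: m.said_on).value


# Hypothetical follower counts mentioned across three sessions.
followers = [
    Mention(1200, date(2023, 3, 1)),
    Mention(1500, date(2023, 6, 10)),
    Mention(1350, date(2023, 9, 2)),
]
current = resolve(followers)                   # latest mention
best = resolve(followers, personal_best=True)  # highest value ever
```

The two calls disagree on purpose: "how many followers do I have now" and "what was my peak" read the same mentions but need different winners.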

The full v44 composer prompt is published alongside this post at composer_prompt_v44.md, MIT-licensed, copy-paste-ready against /v1/query responses. The profile schema (the JSON shape the server emits) is part of the closed-source layer; it’s tightly coupled to how the server stores and serves profiles, and changing it would change a lot of downstream code in our stack.

The score, by category

Category                     Score      %
single-session-assistant     56/56      100.0%
single-session-user          69/70      98.6%
knowledge-update             74/78      94.9%
temporal-reasoning           124/133    93.2%
multi-session                111/133    83.5%
single-session-preference    24/30      80.0%

Multi-session is our weakest category at 83.5%. That’s still above any LongMemEval baseline we’ve seen reported. Temporal-reasoning at 93.2% surprised us; we expected date arithmetic to drag harder. Best guess is that the profile pre-merges co-referent events with their dates, so the composer mostly has to read off a sorted timeline rather than reconstruct one, but we haven’t gone task by task to confirm. The 80% on preferences is partly the small N (only 30 tasks in that category) and partly that the judge is deciding whether a response “personalizes enough,” which is harder to score consistently than yes/no fact questions.

Running the full 500 wasn’t clean

If you’re planning to do something similar, here’s what running this actually looked like. Total wall-clock time was around six days across restarts, most of it Phase-1 ingest. We hit OpenAI quota walls twice mid-run; the first time, the script’s exponential-backoff retries burned six hours of wall time before we noticed (each task retries up to six times with up to 5-minute waits, which accumulates fast when 50 tasks fail in parallel). Chunk 6 was the worst single burst: 31 of 50 ingest tasks 502’d or 504’d under sustained load at 8 ingest workers per task; we recovered them in a second pass at 4 workers and never hit it again. Mid-run we deleted 350 stale benchmark buckets from previous experiments because pgvector got noticeably slower past around 500 buckets in the same tenant. One profile-generation call (task 8464fc84) timed out three times in a row, after which we regenerated it by hand.

None of that affected the final score, but it cost us wall-clock time and a non-trivial amount of inference spend on backoff retries that produced nothing. If you’re benchmarking another memory system on this dataset, plan for restartable pipelines and a generous quota budget.
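For a sense of how the retry time adds up, here is a small sketch of a capped exponential-backoff schedule plus a resume helper. The six retries and the 5-minute cap come from the run described above; the 10-second base delay, the doubling factor, and both function names are assumptions for the example.

```python
def backoff_schedule(retries=6, base_s=10.0, cap_s=300.0):
    """Waits of base, 2*base, 4*base, ... seconds, each capped at cap_s."""
    return [min(cap_s, base_s * 2 ** attempt) for attempt in range(retries)]


def remaining_tasks(all_task_ids, completed_ids):
    """Tasks still to run after a restart, in their original order."""
    done = set(completed_ids)
    return [task_id for task_id in all_task_ids if task_id not in done]


waits = backoff_schedule()
worst_case_minutes = sum(waits) / 60  # time one task can spend only waiting

# Hypothetical task IDs: resume a chunk without redoing finished work.
todo = remaining_tasks(["t1", "t2", "t3", "t4"], completed_ids=["t2", "t4"])
```

Under these assumed settings a single task can spend about ten minutes waiting before it gives up, and a quota error will not clear in that window, so failing fast on quota responses and resuming from a checkpoint is likely cheaper than retrying through them.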

Can we recover the failures?

We had 42 task failures in the full 500 run. We wanted to know whether sampling the composer multiple times and taking a majority vote would meaningfully recover any of them.

We re-ran the composer on the 42 failures, three samples each, and judged each sample independently. Seven of those tasks pass on at least 2 of 3 fresh samples, so majority vote would recover them. Another seven pass on exactly 1 of 3; capturing those would need a critic-as-selector rather than majority vote.

Before getting excited about that, we had to check the other side: does running the composer three times and taking majority also flip our wins to losses? We sampled 191 wins (a stratified random sample; we wanted 200 but the single-session-preference category only had 8 untested wins available) and re-ran the composer N=3 on each. 188 stay pass, 3 regress. That’s a 1.6% regression rate with a Wilson 95% upper bound of 4.5%.

So the math: 458 wins, +7 from failure-side flips, and roughly −7 expected on the win side (the 1.6% regression rate applied to all 458 wins). Net zero, basically. N=3 majority isn’t the clean +9-point boost we initially hoped for once you count both directions. At 3× the per-query cost, it’s not obviously worth shipping as a default unless the use case is very accuracy-sensitive.
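The regression bound and the net estimate can be rechecked with a few lines. This sketch uses the standard Wilson score interval with z = 1.96 and extrapolates the sampled regression rate to all 458 wins; the function name is ours.

```python
import math


def wilson_upper(successes, n, z=1.96):
    """Upper end of the Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + margin) / denom


regression_rate = 3 / 191              # wins that flipped to losses under N=3
upper = wilson_upper(3, 191)           # 95% upper bound on that rate
expected_losses = regression_rate * 458
net_tasks = 7 - expected_losses        # failure-side gains minus win-side losses
```

The point estimate nets out to roughly zero tasks. At the upper bound the win side would lose about 20 tasks instead of 7, so the uncertainty leans toward majority voting being a net loss.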

Things that didn’t work

Critic plus retry

We spent a week on what looked like the obvious next step: pair the composer with an adversarial critic that audits the draft against the memories, and lets the system call back into retrieval if the critic spots a gap. The pattern shows up all over the place in agent papers from the last year or so.

Our v41 implementation made things worse. The critic correctly flagged 2 real recall gaps. But it also false-positived on 4 previously-correct drafts and “fixed” them into wrong answers. The net was −1 point versus baseline.

We tried calibrating the critic to be more conservative: require it to cite specific contradicting evidence from the memories before flagging anything, default to “ok” otherwise. The calibrated v42 had zero false positives. It also had zero true positives. It collapsed into a no-op.

We couldn’t find a single-prompt formulation that caught the real errors without misfiring on edge cases. The two failure modes the critic has to discriminate (“draft is wrong” vs “draft is fine but could be more thorough”) look too similar to a one-shot judge call. Solving it properly probably needs either a small fine-tuned classifier or a different gating signal (an aggregate-vs-enumeration count mismatch is one we considered). We didn’t get there.
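The aggregate-vs-enumeration signal is simple to state in code. The sketch below is purely illustrative, since we did not build this gate: the function name and the idea of comparing the graph's aggregate count with the rows of the composer's candidate table are assumptions about how it might look.

```python
def needs_second_look(aggregate_count, candidate_rows):
    """True when the graph's aggregate count disagrees with the draft's enumeration.

    aggregate_count: distinct-entity count from the knowledge graph.
    candidate_rows:  items the composer listed in its candidate table.
    """
    distinct_rows = {row.strip().lower() for row in candidate_rows if row.strip()}
    return aggregate_count != len(distinct_rows)


# Hypothetical draft that enumerated two weddings while the graph counted three.
flag = needs_second_look(3, ["March wedding", "August wedding"])
```

Unlike a one-shot critic, this check fires only on a concrete disagreement between two independent counts, which is why it looked like a more promising trigger for a retrieval retry.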

Better extraction

We assumed the Groq llama-3.3-70b extractor we use server-side might be the bottleneck on multi-session questions. So we swapped it for GPT-5.5 (about 10× the per-call cost) on a stubborn 11-task failure set.

Result: 2 of 11 recovered, same as the cheaper extractor. Extraction wasn’t where we were losing points. Whatever was hurting us on those 11 tasks happened in the composer step, not in how the memories got laid down at ingest time.

Date pre-pass

For temporal-reasoning questions, we wrote a regex-based annotator that parses phrases like “in the past two weeks” into actual date ranges and stamps each retrieved memory with in_window=yes/no or days_ago=N. The idea was to take date arithmetic out of the composer’s hands.

It made zero difference. The composer faithfully used the annotations and arrived at the same wrong answers it had been giving without them. The failures weren’t from arithmetic; they were from interpretation. Which “ago” anchor does the question use? Is this date the start or the end of a trip? Telling the model “this date is in your window” didn’t change the fact that it was misreading which date the question was anchored to in the first place.
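For reference, the shape of such an annotator is roughly as follows. This is a simplified sketch, not the annotator we ran: the regex covers only one phrase pattern, months are approximated as 30 days, and all names are illustrative. The in_window and days_ago stamps match the ones described above.

```python
import re
from datetime import date, timedelta

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
UNIT_DAYS = {"day": 1, "week": 7, "month": 30}  # months approximated
WINDOW_RE = re.compile(
    r"\b(?:in|over|within)\s+the\s+(?:past|last)\s+(\w+)\s+(day|week|month)s?\b",
    re.IGNORECASE,
)


def parse_window(question, today):
    """Turn 'in the past two weeks' into a (start, end) date range, or None."""
    match = WINDOW_RE.search(question)
    if not match:
        return None
    token = match.group(1).lower()
    count = int(token) if token.isdigit() else NUMBER_WORDS.get(token)
    if count is None:
        return None
    span = timedelta(days=count * UNIT_DAYS[match.group(2).lower()])
    return today - span, today


def annotate(memory_date, window, today):
    """Stamp one retrieved memory with in_window and days_ago."""
    start, end = window
    return {
        "in_window": "yes" if start <= memory_date <= end else "no",
        "days_ago": (today - memory_date).days,
    }


today = date(2023, 5, 30)  # hypothetical question date
window = parse_window("What did I buy in the past two weeks?", today)
stamp = annotate(date(2023, 5, 20), window, today)
```

The sketch also shows the limitation: it always anchors the window at the question date, and nothing in it can tell whether the question meant a different anchor.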

Iterating the composer prompt past v44

We hit diminishing returns hard around v44. Each new rule we added to fix a specific failure introduced a new failure somewhere else. v44 saved 3 of 4 targeted regressions and introduced 3 new ones. Net zero.

At that point we stopped iterating on the prompt. We didn’t fully characterize which kinds of new rules tended to break which previously-passing tasks; we hit the point where every change was a coin flip and decided routing (different prompts for different question shapes) was probably a more promising direction than prompt-tuning a single monolithic prompt further. That investigation is still open.

Where the failures live

We classified each of the 42 failures by failure mode:

  • Off-by-one count errors. The composer makes a defensible-but-wrong scope judgment. “Does the old 5-gallon tank count as a tank you currently have, or did you replace it?” Both readings are reasonable; gold says one, composer says the other.
  • Sums and arithmetic. Adding $200 + $20 + (an unstated price) and confidently reporting the partial sum.
  • Date math edge cases. “How many weeks since I recovered from the flu when I went on my 10th jog?” requires correctly identifying which jog was the 10th, then computing weeks. Models often get this off by a few weeks.
  • Hallucinated specifics. When the user mentions taking a bus from the airport but never said how much it cost, our composer will sometimes confidently estimate ¥3,200 and compute savings off that. It should refuse.
  • Genuine recall gaps. A few tasks where the right item exists in some message but didn’t make it into the profile or retrieval. These are the ones an iterative consolidation pass (multi-step refinement of the profile over time) might catch.
  • Preference subjectivity. “Should I attend my high school reunion?” The gold wants the response to reference the user’s positive high school memories like debate team and AP classes that were mentioned earlier. Our response was a thoughtful but generic recommendation that happened to miss those specific anchors.

Routing might help on the off-by-one and preference cases (different prompts for different question shapes), which we haven’t tried yet. For arithmetic and date math, we’d want to let the composer call a calculator instead of computing in-prompt. Models still aren’t reliable at multi-step arithmetic when the inputs come from prose.
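A calculator tool for this could be as small as the sketch below, which evaluates plain arithmetic through Python's ast module and refuses to total a list with a missing price. It is an illustration of the idea, not something we have built; both function names are hypothetical.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression):
    """Evaluate +, -, *, / over numeric literals; reject anything else."""
    def ev(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval").body)


def total_or_refuse(prices):
    """Sum prices, or return None if any price was never stated."""
    if any(price is None for price in prices):
        return None  # refuse instead of reporting a partial sum
    return calculate(" + ".join(str(price) for price in prices))


full_total = total_or_refuse([200, 20, 35])
partial = total_or_refuse([200, 20, None])
```

Returning None for the incomplete list targets the partial-sum failure directly: the composer would have to say a price is missing rather than report $220 as the total.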

What’s actually shipping

The profile pass shipped server-side on May 5th, and as of this month it runs inside the memory-agents framework. The Bucket Profiler agent owns generation, runs on a schedule (default @every:1h) or on-demand via the Run-now button on the /agents page, and writes its output to a meta-bucket (_bucket_profiles) that synthesis reads from at query time. When you POST a query against a bucket whose profile hasn’t been generated yet, you can either install the agent and trigger an immediate first tick (POST /v1/buckets/{id}/profile/ensure-agent?run_now=true) or hit POST /v1/buckets/{id}/profile/regenerate to queue a regeneration (202 returned, agent picks it up). GET /v1/buckets/{id}/profile reads the latest profile. The profile is also returned in the bucket_profiles field of the query response — a dict keyed by bucket name, populated for every queried bucket that has one.

One behavioral note worth flagging on the query side: /v1/query still takes a buckets list. Omitting it now defaults to ["default"]; passing an explicit empty list returns 400. Multi-bucket queries get every queried bucket’s profile injected into synthesis.

Two practical caveats on the productionized version:

  • First-query latency. Profile generation is one LLM call with the entire conversation history as input. For a fresh bucket with 500 messages of history, the first agent tick takes around 60 seconds. Because the Bucket Profiler runs on a schedule, by the time a real query lands the profile is usually already cached; the ensure-agent endpoint exists for cases where you want to kick the first tick the moment a bucket is created.
  • The composer prompt is yours to plug in. Our reference v44 prompt is published in our benchmarks repo at composer_prompt_v44.md (MIT). Copy it, point its {profile} slot at bucket_profiles[bucket_name] from a /v1/query response, fill in the question and retrieved memories, and you’re running the same composer that produced the 91.6%. Tune for your domain; the rules matter less than the patterns. The profile schema (the JSON shape the agent emits) stays in the closed-source layer.
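Wiring the prompt to a query response might look like the sketch below. Only the {profile} slot and the bucket_profiles field come from the description above; the {question} and {memories} slot names, the shape of the response, and the helper name are assumptions, so check them against the published prompt and your actual responses.

```python
import json
import re


def fill_composer_prompt(template, response, bucket_name, question, memories):
    """Fill the composer template from a /v1/query response dict.

    Slots are replaced in a single pass so text inside the profile or
    the memories is never re-scanned for slot names.
    """
    profile = response.get("bucket_profiles", {}).get(bucket_name)
    if profile is None:
        raise LookupError(f"no profile yet for bucket {bucket_name!r}")
    slots = {
        "profile": profile if isinstance(profile, str) else json.dumps(profile, indent=2),
        "question": question,
        "memories": "\n".join(f"- {memory}" for memory in memories),
    }
    return re.sub(r"\{(profile|question|memories)\}",
                  lambda match: slots[match.group(1)], template)


# Hypothetical template and response, for illustration only.
template = "Profile:\n{profile}\n\nQuestion: {question}\n\nMemories:\n{memories}"
response = {"bucket_profiles": {"default": {"events": ["Emily's wedding"]}}}
prompt = fill_composer_prompt(
    template, response, "default",
    "How many weddings did I attend this year?",
    ["Attended Emily's wedding in August."],
)
```

Raising when the profile is missing makes the first-query case explicit, so the caller can trigger generation instead of silently composing without a profile.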

A customer wiring up Engram today gets the profile pass automatically and can drop in the published composer prompt. The 91.6% number is reproducible end to end: live server, published prompt, GPT-5 composer.

What’s still in flight:

  1. More memory agents. The Bucket Profiler is one of several operators in the framework. Watchdog, Logger, Janitor, and Consolidator are also installable from /agents, and we’re adding more. The Consolidator in particular targets the recall-gap residual we describe above.
  2. Best-of-N with a critic-as-selector to capture the seven “lucky flip” cases we found. Projected upside is roughly another point and a half.

After that, the larger architectural bets get more interesting: per-question-type routing, tool-augmented arithmetic, and an iterative consolidation pass for the recall-gap residuals. Each looks like 1–3 weeks of work for 2–5 points.

How to read this number

Two things to know if you’re comparing LongMemEval-S scores across systems.

The composer model matters a lot. We used GPT-5. A stronger reasoner gets free points on count and ordering questions; a weaker one drops you several. Anyone publishing a LongMemEval-S score should publish the composer model and prompt alongside it, because the architecture and the LLM are jointly responsible for the result and you can’t tell what’s contributing what otherwise.

The GPT-4o judge also has real variance. We re-graded our full 500 three more times to bound it: the additional runs came in at 448, 449, and 450 against our original 458. So the central estimate is closer to 90.25% with a standard deviation around ±0.8 points. We’re reporting 91.6% because that’s the single-grade methodology everyone else uses, but anyone reproducing should expect to land within ±2 points depending on which judge roll they get.
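The judge-variance figures can be recomputed from the four grades. This sketch uses Python's statistics module; the ±0.8 quoted above corresponds to the population standard deviation, while the sample standard deviation over four runs comes out slightly larger.

```python
import statistics

grades = [458, 448, 449, 450]           # original grade plus three re-grades
scores = [100 * passed / 500 for passed in grades]

mean_pct = statistics.fmean(scores)     # central estimate in percent
pop_sd_pct = statistics.pstdev(scores)  # population SD, about 0.8 points
sample_sd_pct = statistics.stdev(scores)  # sample SD (n - 1), about 0.9 points
```

With only four grades, either figure is a rough bound rather than a tight estimate.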

Reproducibility

Everything that produced this post is in benchmarks/results/20260420/: hypotheses, grader output, run metadata, per-chunk metrics, the regression-check raw data, the v44 composer prompt. The benchmark itself is public from the LongMemEval team.

If you want to run a controlled comparison against another memory system on these same questions, the data’s there. We’re happy to coordinate one if it’s useful.
