Benchmark
Reproducing the 91.6%: a step-by-step from the LongMemEval-S run
A direct follow-up to our 91.6% on LongMemEval-S. This is the explicit "if you want to verify the number, do this": the exact stack, the v44 composer prompt, the profile schema it expects, the judge config, the retrieval knobs, and where the variance comes from once you've matched every choice.
One thing up front: if you only read the section on knobs, read the profile-pass one. It's the difference between reproducing our number and reproducing a different experiment.
Why this post exists
"X scored Y on Z benchmark" is almost never reproducible from the announcement alone. The composer model, the composer prompt, the judge model and version, the retrieval config, the profile or summary layer that may or may not be running before the question hits the model, and a long tail of small choices each move the score by a meaningful amount. Two systems can publish the same headline number against the same dataset and be doing entirely different things underneath.
We published a 91.6% on LongMemEval-S in April. The original post described the architecture and what we tried; this one is the runbook. If you want to verify our number, audit it, or use it as a real baseline to compare another system against, this is what we'd hand you.
The short version: you need an Engram-equivalent server with hybrid retrieval and a canonical user-profile pass, the v44 composer prompt published in our repo, GPT-5 as the composer with your own OpenAI key, and the LongMemEval team's official GPT-4o grader script. If any one of those is different, you are not running our pipeline and the numbers will not match. They might be better, they might be worse, but they aren't comparable.
A note from the first time we ran the whole thing on a clean machine: we expected ingest to be the wall-clock hog, and it was, but the surprise was that the profile pass took most of an afternoon by itself. One LLM call per conversation sounds fast until you've made 500 of them with a full message history in each prompt. Budget accordingly.
The stack you need running
The reproduction has four moving pieces — three of them software you stand up, one of them a grading script you run at the end against a published gold file.
An Engram-equivalent server, or Engram itself. The ingest and retrieval layer needs to do three things in parallel on every message: extract subject-predicate-object triples into a knowledge graph (with aggregate count nodes for entity types), generate an embedding for vector search, and store the raw text for BM25. At query time, those three signals get fused via reciprocal rank fusion with a cross-encoder reranker sitting on top. If you want the hosted version, point your runner at Engram's /v1 endpoints and you can skip ahead; if you're rebuilding the layer locally to compare against something else, you need feature parity on all three engines or you're not testing the same architecture.
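For concreteness, here's a minimal sketch of the fusion step. The k=60 constant is the common default from the RRF literature, not a value pulled from Engram's internals, and the list names are placeholders:

```python
# Minimal reciprocal rank fusion (RRF) sketch over three ranked candidate lists.
# k=60 is the standard constant from the RRF literature, not Engram's internal value.
def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: memory ids ranked best-first by each engine (graph, vector, BM25).
graph_hits  = ["m12", "m07", "m33"]
vector_hits = ["m07", "m12", "m91"]
bm25_hits   = ["m91", "m07", "m12"]
fused = rrf_fuse([graph_hits, vector_hits, bm25_hits])
# A cross-encoder reranker then rescores the head of the fused list.
```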
BYOK for OpenAI. Both the composer call (GPT-5) and the default profile-generation call run against your own API key. The benchmark run cost us a few hundred dollars in OpenAI inference total; the breakdown comes later in the post. Engram exposes a per-tenant LLMConfig so the composer and the profile pass can be routed to your own credentials independently of the rest of the server.
The canonical profile pass. This is the load-bearing piece of the whole pipeline. One LLM call per conversation, after ingest finishes, reading the full message history and emitting a structured profile. We measured the lift at +4 points over a retrieval-only baseline running the identical composer prompt — which is to say, skip this step and you're running a different experiment, even if every other knob matches ours.
The official GPT-4o judge from the LongMemEval team. Not your own grader, not a re-prompted version of theirs. The published grader script is the thing that produces a comparable number, and judge variance on it alone is worth ±0.8 points (covered in detail below). It's also the only piece you re-run every time you want a fresh number you can compare to ours; the other three you stand up once and reuse.
The pipeline, step by step
1. Ingest every message
For each of the 500 conversations in LongMemEval-S, walk the message list in order and POST each message to /v1/buckets/{bucket}/memories. One bucket per conversation. The server fans out: extract triples (Groq-hosted Llama-3.3-70B in our production config, but the extractor model is a tunable; see the knobs section), embed locally with all-MiniLM-L6-v2 via sentence-transformers, and write the raw text for BM25. All three writes happen in parallel and the call returns when all three are durable.
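A minimal ingest loop for one conversation might look like the following. The endpoint path matches the description above; the payload field names, auth header, and base URL are assumptions you'd adapt to your server:

```python
import requests

BASE = "https://your-engram-host"                    # or the hosted /v1 endpoints
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # auth scheme is an assumption

def ingest_conversation(bucket, messages):
    """POST each message of one conversation, in order, to its own bucket."""
    for msg in messages:
        # Payload field names (role/content/timestamp) are illustrative;
        # match them to the server's actual memory-create schema.
        resp = requests.post(
            f"{BASE}/v1/buckets/{bucket}/memories",
            json={"role": msg["role"], "content": msg["content"],
                  "timestamp": msg.get("date")},
            headers=HEADERS,
            timeout=60,
        )
        resp.raise_for_status()   # returns once all three writes are durable
```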
Practical notes. We ran 8 ingest workers per task in parallel during the first attempt and got hammered by 502s and 504s on the sixth chunk of 50 buckets. Dropping to 4 workers per task cleared it. The server itself is happy with more concurrency than that; the bottleneck is upstream LLM rate limits. Plan for restartable ingest. The script we used checkpoints after every successful POST, so if a chunk dies you can pick up where it left off without re-ingesting messages that already landed.
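A sketch of the checkpointing we'd suggest, assuming a simple JSONL checkpoint file and the ingest POST from the sketch above; the file format is ours, not anything the server dictates:

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor

CHECKPOINT = "ingest_checkpoint.jsonl"   # one record per successful POST (assumed format)
_ckpt_lock = threading.Lock()

def load_done():
    try:
        with open(CHECKPOINT) as f:
            return {(r["bucket"], r["idx"]) for r in map(json.loads, f)}
    except FileNotFoundError:
        return set()

def ingest_with_checkpoint(bucket, messages, done):
    for idx, msg in enumerate(messages):
        if (bucket, idx) in done:
            continue                    # already landed on a previous attempt
        post_message(bucket, msg)       # the /v1/buckets/{bucket}/memories POST from above
        with _ckpt_lock, open(CHECKPOINT, "a") as f:
            f.write(json.dumps({"bucket": bucket, "idx": idx}) + "\n")

# conversations: {bucket_name: [message, ...]} parsed from LongMemEval-S (not shown).
done = load_done()
with ThreadPoolExecutor(max_workers=4) as pool:   # 4 workers per task cleared the 502/504s
    for bucket, messages in conversations.items():
        pool.submit(ingest_with_checkpoint, bucket, messages, done)
```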
2. Generate the canonical profile
This is the step. After all of a conversation's messages have been ingested, hit POST /v1/buckets/{id}/profile/regenerate. One LLM call, the entire conversation history as input, structured JSON profile as output. The output is cached on the bucket; subsequent queries read it out of explanation.profile in the query response, and GET /v1/buckets/{id}/profile lets you inspect it without running a query. The schema is part of the server layer; the shape the composer prompt expects is documented in the next section.
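In code, the profile pass is two calls against the endpoints named above (BASE and HEADERS as in the ingest sketch):

```python
import requests

def regenerate_profile(bucket):
    # One long LLM call over the full conversation history; the result is cached on the bucket.
    resp = requests.post(f"{BASE}/v1/buckets/{bucket}/profile/regenerate",
                         headers=HEADERS, timeout=600)
    resp.raise_for_status()

def get_profile(bucket):
    # Inspect the cached structured profile without running a query.
    resp = requests.get(f"{BASE}/v1/buckets/{bucket}/profile",
                        headers=HEADERS, timeout=60)
    resp.raise_for_status()
    return resp.json()
```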
What this step actually does, in one sentence: it resolves co-reference across the whole conversation. "My college roommate's wedding" in March and "Emily's wedding in the city" in October become a single profile entry with both phrasings recorded as aliases. The composer sees the merged view, not 47 fragmented mentions across sessions. That merging is what turns multi-session questions from a 50%-ish category into the 83.5% we landed at. Every other step in this pipeline is recognizable from any retrieval system you've built before. This one is what makes the number what it is, and it's the step we'd want you to look at hardest if you're comparing against your own pipeline.
3. Query for retrieval and the profile
POST to /v1/query with the question text in the query field, the bucket name in buckets, and the conversation's reference date in the context field (the server uses it as the "today" anchor for the profile pass). You'll use three fields from the response: answer (Engram's own answer, sanity check only), explanation.retrieved_memories (top retrieved memories with content and scores), and explanation.profile (the cached canonical profile). The published number uses a separate composer pass on top of this retrieval, not Engram's built-in answer, so the prompt stays inspectable as an artifact people can rerun.
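A query helper might look like this; the request field names follow the description above, and anything about the response shape beyond the three named fields is an assumption:

```python
import requests

def query_bucket(bucket, question, reference_date):
    resp = requests.post(
        f"{BASE}/v1/query",
        json={"query": question,
              "buckets": [bucket],
              "context": reference_date},   # the "today" anchor for the profile pass
        headers=HEADERS,
        timeout=120,
    )
    resp.raise_for_status()
    body = resp.json()
    return {
        "answer": body["answer"],                               # sanity check only
        "memories": body["explanation"]["retrieved_memories"],  # content + scores
        "profile": body["explanation"]["profile"],              # cached canonical profile
    }
```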
4. Composer pass
Fill the v44 prompt's four slots. Single GPT-5 chat completion at temperature 0.1. Whitespace-clean the response. That's the hypothesis the judge grades.
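A sketch of the composer call, assuming the OpenAI Python SDK and a placeholder model identifier; pin and report the exact identifier you run:

```python
import json
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment (BYOK)
PROMPT_V44 = open("composer_prompt_v44.md").read()   # the published prompt from the repo

def compose(profile, rendered_memories, question, date_str):
    # Fill the four slots, then make a single chat completion at temperature 0.1.
    prompt = (PROMPT_V44
              .replace("{profile}", json.dumps(profile))
              .replace("{context}", rendered_memories)
              .replace("{question}", question)
              .replace("{date}", date_str))
    resp = client.chat.completions.create(
        model="gpt-5",            # placeholder identifier; report the exact one you use
        temperature=0.1,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()   # whitespace-clean the hypothesis
```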
Ingest, profile, query, compose. Five hundred conversations, five hundred questions, five hundred hypotheses, then the judge.
The v44 composer prompt
The prompt is published in the repo at benchmarks/results/20260420/composer_prompt_v44.md, MIT-licensed, ready to copy and run. It's the prompt that produced the 458/500.
It has four template slots:
- {profile}: the canonical profile JSON. Drop in the explanation.profile field from /v1/query verbatim.
- {context}: the retrieved memories. We use the explanation.retrieved_memories field from the query response, rendered as a numbered list with each memory's text and any associated date metadata (a rendering sketch follows this list).
- {question}: the task question.
- {date}: the reference date, formatted as YYYY-MM-DD. This is the "today" the model anchors its date arithmetic against.
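Here's one way the {context} and {date} slots might be rendered. The per-memory field names (content, date) are assumptions about the retrieved-memory objects, not a documented shape:

```python
from datetime import date

def render_context(memories):
    # Render retrieved memories as the numbered list the {context} slot expects.
    # Field names ("content", "date") are assumptions; adapt them to what
    # explanation.retrieved_memories actually contains.
    lines = []
    for i, mem in enumerate(memories, start=1):
        when = f" [{mem['date']}]" if mem.get("date") else ""
        lines.append(f"{i}. {mem['content']}{when}")
    return "\n".join(lines)

def render_date(reference_date):
    # {date} is the "today" anchor, formatted YYYY-MM-DD.
    if isinstance(reference_date, date):
        return reference_date.strftime("%Y-%m-%d")
    return reference_date
```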
Read the prompt before you run it. It is long and rule-heavy. The rules that matter most for the final score are the count-question rules (always emit a candidate enumeration before committing to a number, explicitly mark whether to count or exclude edge cases like "the one you replaced"), the temporal-reasoning rules (sort dated events into a timeline before answering ordering questions), the conflict-resolution rules (latest-dated mention wins, except for personal-best questions where the optimal value wins), the attribution rules (separate things the user did from things the assistant suggested), and a hard refusal clause when retrieval doesn't actually contain the answer.
Do not edit the prompt and call the result reproduction. Every rule in v44 was added or moved in response to a specific failure mode on the 120-task development subset, then validated against the full 500 before the version got cut. We have versions v1 through v44 in the repo; the score moves at every step. v43 lost a point on knowledge-update relative to v44. v45 (which is also in the repo as a branch, not a release) traded a point on multi-session for a point on temporal-reasoning. The prompt is jointly tuned with the profile schema, and changes in either invalidate the comparison.
The profile schema, in shape
We don't publish the profile schema as a standalone artifact. It's tightly coupled to how the server stores entities and to how the prompt indexes into it, and decoupling it would be a meaningful refactor. But the shape it takes, and why each field is there, is straightforward to describe and is what you need to know to reproduce.
A profile is a JSON object with named sections, each a list. The sections we use:
- people: named individuals the user has mentioned. Each entry has a canonical name, a list of aliases (other phrasings used in the conversation), a relationship label ("college roommate", "coworker", "neighbor"), and notable facts. Matters because "my friend Sarah," "Sarah from work," and "Sarah Chen" routinely refer to the same person.
- events: specific events the user attended or referenced. Each entry has a canonical description, aliases, a date or date range, a location, and the people involved. This is where "weddings" get counted, where "trips" get counted, and where the multi-session category lives or dies.
- possessions: items the user owns or has owned. Canonical item, aliases, acquisition date if known, current status ("owns," "replaced," "sold"). This is the section that makes "how many tanks do you currently have" answerable.
- places: locations the user has been or lived. Canonical name, aliases, dates of association, role ("lived," "visited," "worked at").
- activities: recurring things the user does. Activity name, frequency, notable instances. The "10th jog" question lives here.
- preferences: stable preferences and constraints. The category, the value, the date the user stated it, and whether it's been superseded.
- facts: durable facts about the user that don't fit the above. Job, family structure, dietary restrictions. Each fact dated and versioned, latest one wins.
Every entry in every section carries date metadata where the conversation supports it, and every entry is canonicalized: one entity per real-world thing, with aliases recorded explicitly. The composer prompt reads these sections by name; the v44 rules reference events by section name when handling count questions, and reference facts by name when handling knowledge-update questions. If you build a profile pass that emits a differently-shaped object, you'll need to adapt the prompt accordingly, and the score won't be directly comparable.
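To make the shape concrete, here's a miniature, entirely illustrative profile built around the wedding example from earlier. The field names track the prose above; they are not the server's exact schema, and every value is invented for illustration:

```python
# Illustrative profile shape only. Section names follow the prose; entry field
# names and all values are assumptions, not the server's actual schema or data.
example_profile = {
    "people": [
        {"name": "Emily Chen",
         "aliases": ["my college roommate", "Emily"],
         "relationship": "college roommate",
         "facts": ["got married in October"]},
    ],
    "events": [
        {"description": "Emily's wedding",
         "aliases": ["my college roommate's wedding", "Emily's wedding in the city"],
         "date": "2023-10-14",
         "location": "the city",
         "people": ["Emily Chen"]},
    ],
    "possessions": [
        {"item": "20-gallon fish tank",
         "aliases": ["the new tank"],
         "acquired": "2023-06",
         "status": "owns"},
    ],
}
```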
The judge
The grader is GPT-4o, running the LongMemEval team's official grader script against your hypothesis file and their gold file. We did not modify the grader. We did not re-prompt it. The whole point of using their published grader is that it's the comparable surface across systems.
Run the grader three times and average the per-task verdicts. We did this for our published number: the original grade landed at 458, and three additional grade rolls came in at 448, 449, and 450. The standard deviation across rolls is roughly 0.8 points on the 100-point scale, about four tasks out of 500. That's not noise you can hand-wave away. It's the dominant single source of variance in a properly-matched reproduction.
Why does GPT-4o disagree with itself? On the easy fact-lookup categories it doesn't; the per-task verdicts are stable. The variance comes from the borderline cases in multi-session and preference categories, where the hypothesis is "close enough" or "almost right but misses one anchor" and the judge could reasonably mark either way. Different rolls flip different borderline tasks. Averaging three rolls cuts the standard deviation roughly in half, to ±0.4 points.
If you only run the grader once and land 460, or 451, or 456, you have not necessarily produced a different number from ours. You may have produced the exact same number on a different judge roll. Report the average of at least three rolls if you want the comparison to mean anything.
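A sketch of the three-roll averaging, assuming each roll is a JSONL file of per-task verdicts; the official grader's actual output format may differ:

```python
import json

def load_roll(path):
    # Assumed format: one {"question_id": ..., "correct": true/false} record per line.
    with open(path) as f:
        return {r["question_id"]: bool(r["correct"]) for r in map(json.loads, f)}

def averaged_score(roll_paths):
    rolls = [load_roll(p) for p in roll_paths]
    qids = rolls[0].keys()
    # Per-task average of verdicts, summed; equivalent to the mean of the roll totals.
    per_task = {q: sum(r[q] for r in rolls) / len(rolls) for q in qids}
    return sum(per_task.values()), len(qids)

score, n = averaged_score(["roll1.jsonl", "roll2.jsonl", "roll3.jsonl"])
print(f"{score:.1f}/{n} = {100 * score / n:.1f}%")
```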
The knobs that move the score
One of these matters far more than the others. We'll treat it that way.
Profile pass on vs off: +4 points. Against the same composer prompt, against the same retrieval, with every other knob held constant. This is the single biggest "did you do the same thing we did" question, and it isn't close. The next-biggest individual lever we measured (composer model substitution) is a different story in a different direction; this is the only knob whose presence or absence reliably moves the number by several points in a controlled comparison against our own pipeline. If your system doesn't run a canonical-profile pre-pass that resolves co-reference across the full conversation, you are running a meaningfully different experiment, regardless of what's in the composer prompt or how good your retrieval is. The four points you'd be missing are almost entirely concentrated in the multi-session category, which is the category most "memory" systems claim to be good at.
For completeness, the other knobs:
- Composer model. We used GPT-5. A mid-tier substitution we tested cost 6 points before we stopped the experiment. A newer or stronger reasoner could plausibly gain a few. Pin and report it.
- Composer prompt. v44 as-is. Each rule was earned against a specific failure mode.
- Retrieval k. k=100 semantic candidates, entity_k=8, max_graph_facts=50. Halving k hurts multi-session recall; doubling doesn't help. (All the knob values are collected in a single config sketch after this list.)
- Reranker. cross-encoder/ms-marco-MiniLM-L-6-v2. Removing it loses about two points. Larger cross-encoders were inconsistent and slower.
- Extractor model. Groq-hosted Llama-3.3-70B. The score is insensitive within a band of comparable models; swapping in something significantly weaker will make it drift.
- Temperature. Composer 0.1, profile 0.0.
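For reference, the knob values above collected in one place. The key names are descriptive, not an Engram config schema, and how you pass them (server config vs per-request parameters) depends on your deployment:

```python
# Descriptive summary of the run's knob settings; key names are ours, not the server's.
RUN_CONFIG = {
    "composer_model": "gpt-5",                  # pin and report the exact identifier
    "composer_prompt": "composer_prompt_v44.md",
    "composer_temperature": 0.1,
    "profile_pass": True,                       # the +4-point knob
    "profile_temperature": 0.0,
    "retrieval": {
        "k": 100,                               # semantic candidates
        "entity_k": 8,
        "max_graph_facts": 50,
        "reranker": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    },
    "extractor_model": "llama-3.3-70b (Groq-hosted)",
    "judge": {"model": "gpt-4o", "rolls": 3},
}
```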
What it costs to reproduce
End-to-end, the 500-task LongMemEval-S run cost us in the ballpark of $300–$450 in OpenAI spend, depending on how clean the run was. The breakdown:
- Ingest embeddings. Around 280,000 messages embedded across the 500 conversations, all running locally on all-MiniLM-L6-v2 via sentence-transformers. No OpenAI line item; the cost is CPU/GPU time on the machine running the server.
- Triple extraction. One extractor call per message. We use a Groq-hosted Llama-3.3-70B for this, so this line item lives on a Groq invoice rather than OpenAI's: call it $30–$50 in Groq spend across the run. If you route extraction to GPT-4-class instead, expect this to be the second-largest line item by a long way.
- Profile generation. 500 calls, each with a full conversation as input. This is the largest single line item. Call it $120–$180 in GPT-5 input tokens.
- Composer pass. 500 calls, each with a profile, retrieved memories, and the question. Around $100–$150.
- Judge. GPT-4o grader, 500 verdicts per roll, three rolls. Around $20–$30 per roll.
- Wasted retries. On our first run we burned around $40 on exponential-backoff retries against rate-limited endpoints that produced nothing useful. Plan for this; the actual line item is real.
Wall clock is multi-day. Our cleanest run was around six days end to end including restarts, most of it in Phase-1 ingest. The pipeline is restartable at every stage; keep your hypothesis JSONL and your bucket-id mapping checkpointed and you can resume from any failure point without redoing the expensive work.
Where the variance comes from once you're matched
Suppose you've matched every knob. Same composer model, same prompt, profile pass on, same retrieval config, same reranker, same judge. You run the pipeline. Where can your number diverge from ours?
The judge roll dominates. Everything else is noise on top of it. One roll of the official GPT-4o grader carries ±0.8 points of standard deviation against itself; three-roll averaging cuts it to ±0.4. Our re-grades came in at 448, 449, and 450 against an original 458, a 10-task swing (two points on the 100-point scale) from a single source of variance, with the architecture, prompt, retrieval, and composer all held constant. When we first saw the 450 we assumed something had broken in our run script and spent two hours diffing hypothesis files before remembering we'd done this experiment specifically to bound judge variance. If you land within four points of our published number on a single roll, you've effectively reproduced. If you want a tighter bound, run the grader three times and report the average; this is the single highest-leverage thing you can do for the credibility of your number.
The other contributors are small and listed for completeness:
- Composer sampling variance. Small at temperature 0.1. N=3 re-runs on 191 wins gave 188 pass, 3 regress. Roughly 1–2 points of round-trip noise if you draw unlucky.
- Profile generation sampling variance. Temperature 0, but GPT-5 still has sampling noise at zero. Edge entries (a borderline event, an alias spelling) shift across re-generations. Best estimate is well under a point of total drift.
- Order effects in ingest. Negligible. Retrieval is order-independent, the profile pass sees the whole conversation at once.
Net: expect ±2 points on a single grade roll, ±1 point on three-grade averages. If you reproduce at 89.5%, that's our number. If you reproduce at 93.5%, also our number. Outside that band, check the knobs before concluding the architecture is different.
What to publish if you do reproduce
If you reproduce, or if you build something on top of this pipeline and publish your own LongMemEval-S number, publish the same things we did. Without these, "we got X on LongMemEval-S" is unfalsifiable:
- Composer model and version. Provider, model name, exact identifier. "GPT-5" is not enough; the model identifier matters.
- Composer prompt version. A link to the artifact, or paste the prompt inline. "Our internal v3 prompt" is not auditable.
- Profile pass: on/off. If on, describe the schema shape and whether it's generated once per conversation or incrementally. If you do something different from a one-shot canonical pass, say so.
- Retrieval config. k, entity expansion, max graph facts, reranker model. The hybrid pipeline's choices.
- Judge model. The official LongMemEval grader uses GPT-4o; if you used something else, the number is on a different scale.
- Number of judge rolls. One roll plus a standard deviation, or an N-roll average. Both are fine. A single number with no variance estimate is the thing that doesn't survive scrutiny.
- Per-category breakdown. The total is a single weighted average over six categories with different difficulties and different N. Two systems can hit the same total via very different strengths. The per-category numbers are where the architectural story actually lives.
We published all of these for our 91.6%. We are happy to coordinate a head-to-head against another memory system on this dataset; the runner is in our repo and the gold answers are public. The cost of running a clean comparison is a few hundred dollars and a week of wall clock, which is small compared to the cost of taking unverifiable numbers at face value.
The point of all of this
We don't think every reader of this post is going to go run the pipeline. Most won't, which is fine, because the point isn't to manufacture reproductions on demand. Benchmark numbers function as a coordination surface for the field, and they're only useful when the pipeline that produced them is legible enough that someone else could check the work, even if no one ends up doing it.
Any score a memory system reports is jointly the work of an architecture, a model, a prompt that does the final answer, a judge that grades it, and some retrieval knobs sitting between those. Move any of those without saying so and the headline number stops being a comparable artifact. "We got 91.6% on LongMemEval-S" with no surrounding context is, by itself, unverifiable — and that's true whether we're the ones saying it or someone else is.
The disclosure standard we'd want from anyone publishing on this benchmark is the one we tried to meet for our own run. If you're evaluating memory systems on long-horizon recall, the level of detail in this post is what's worth asking for from whichever systems you're comparing against. A number without that detail behind it is a number you can't act on, no matter how well it ranks on a leaderboard.
Further reading
Closely related
- Engram on LongMemEval-S: 91.6%. 458/500 on the public benchmark. Hybrid retrieval, canonical profile pass, v44 composer prompt (MIT-licensed).
- Best-of-N on agent-memory queries: the regression check most people skip. Why +7 on the failure side cancels with -7 on the win side. The methodology that turns honest evals from half-results into results.
- Patterns from agent papers that didn't work for us. Four high-confidence patterns lifted from recent agent literature. Three got dropped; one got shelved.
Engram
- Engram on LongMemEval-S: 91.6%. Full benchmark methodology and what didn't work.
- Engram docs. HTTP API, MCP setup for each client, SDK examples.
- Start with Engram. Free tier, BYOK, MCP-native.