Opinion
Why we open-sourced the composer prompt
Alongside our 91.6% LongMemEval-S writeup, we published the full v44 composer prompt at composer_prompt_v44.md, MIT-licensed and copy-paste-ready against any retrieval system’s output. Here’s the reasoning, both why we made it public and why specific other pieces of the pipeline stayed in the closed-source layer.
The composer prompt took 28 iterations to land. Most of those iterations were a single rule added or rewritten: how to count, how to break ties between conflicting facts, how to decide whether a date in the conversation is an anchor or a reference, how to refuse to invent specifics. The published v44 is the result of running each version against a 120-task subset, watching where it broke, and rewriting the rule that broke. Every line in the prompt is there because some specific failure made us add it.
We sat on the decision to publish for about a week. The instinct that pushed us toward open-sourcing was simple: the prompt is the part of our stack the community can actually iterate on, and an unreproducible benchmark number is barely a number at all. The instinct that pushed us toward keeping it closed was the standard one (we’d done the work, why give it away). We’re writing this post partly to explain the decision externally and partly because the reasoning is the kind of thing we wish more vendors in adjacent categories would publish about their own choices.
Why open-source the prompt
The main reason is that the prompt is the part of our stack the community can actually improve, and we don’t think we’re going to be the best at it forever.
The composer prompt is mostly rules: how to count when the user mentions overlapping events, how to break a tie between two retrieved memories that give different numbers with different dates, how to handle "last week" when there's no anchor date in the retrieved window, how to tell a stated fact apart from an assistant suggestion the user never confirmed. Each of those rules is local. You can read it, argue with it, and replace it without having to understand the rest of the document. The prompt as a whole is more than the sum of its rules, but not by much, and that's exactly the shape of artifact that benefits from outside readers.
Every team running this prompt against a different dataset is going to discover edge cases we didn’t. A tester ran the prompt against a customer-support corpus and immediately found two failure modes our rules didn’t cover: one around order-status questions where the "latest mention wins" heuristic does the wrong thing, one around questions that should resolve to "the agent doesn’t know" but the prompt was too eager to commit. We’ve fixed a lot. We haven’t fixed everything. If the prompt is a closed artifact, every team that hits a new edge case maintains a private fork and the lessons don’t flow back. If it’s a public artifact with a canonical version, fixes can be proposed, evaluated, and merged. Someone running a memory product against a vertical-specific corpus is going to find better rules for that corpus than we’re going to find sitting on LongMemEval-S, and we’d rather be the team that maintains the canonical version of a prompt other teams contribute to than the team that hoards a slightly-better-than-public prompt for as long as we can.
The honest framing is that the prompt isn’t where our defensible work lives anyway. A prompt is a text file. Anyone who reads it carefully can write a similar one. What makes ours useful is the surrounding system: the retrieval that produces the context, the profile that gives the composer a coherent view of the user, the ingest pipeline that decided what was worth remembering. Strip those away and the prompt is a list of rules with nothing to apply them to. So when we asked ourselves "what do we lose if we publish this?" the honest answer was: very little.
There’s a secondary reason, which is that "we got 91.6% on LongMemEval-S" is suspect when the composer prompt is a trade secret. The benchmark grades final answers, and a long structured prompt over the right retrieval set is doing a lot of the work that produces those answers. A vendor reporting a score without publishing the prompt is reporting a number nobody else can reproduce, which means the number is unfalsifiable. The whole point of publishing on a public benchmark, versus an internal eval only we have access to, is that other teams can check the work. Publishing under MIT (rather than source-available-but-not-modifiable) is the minimum required for the check to be useful, because the check usually involves running the prompt against a different retrieval system to see whether the result is robust to the swap.
The last reason is quick: it signals who we’re building for. Engram’s customers are developers who read prompts, tune prompts, and don’t trust black boxes more than they have to. A developer evaluating Engram against alternatives is more likely to trust a vendor that publishes its work than one that doesn’t. The MIT license is doing work as a positioning signal as well as a usage signal.
What we didn’t open-source, and why
Three pieces of the stack stayed in the closed-source layer. The most consequential one, and the one worth explaining at length, is the canonical profile schema. The other two we’ll cover briefly.
The canonical profile schema
The Bucket Profiler memory agent generates profiles asynchronously, emitting a structured view of the bucket: people the user knows, events they attended, items they own, recurring activities, stable facts. The composer reads from this profile as part of its input. The prompt’s {profile} slot expects something with a specific shape: certain top-level keys, certain nested structures, certain conventions for how aliases get recorded.
We didn’t publish that schema. The reason isn’t secrecy. It’s that the schema is tightly coupled to how we store profiles in Postgres, how we version them across releases, how the server merges new information into an existing profile without rewriting everything, and how downstream code in our stack consumes the result. If we published it as a public artifact, one of two things would happen. Either teams would fork it and produce incompatible variants, at which point the published v44 prompt would no longer reliably consume the field it’s named after, defeating the point of publishing it. Or we’d freeze ourselves out of evolving the schema as the product matures, which would slow down everything else we’re trying to do.
The right framing is "closed for coherence, not for secrecy." If you’re running Engram, you get the profile in your /v1/query response and you pass it to the prompt; the schema is implementation detail you don’t need to think about. If you’re running another memory system and want to use the v44 prompt, you hand-roll an equivalent input. The prompt assumes structure, not the specific shape we emit. We tried to write it in a way that survives substituting your own profile representation, and the surviving-substitution property is the thing we’re actively protecting by keeping the schema in-house. A worse profile produces a worse answer no matter how good the rules around it are, but that’s a different failure mode than an undefined-contract failure, and the second one is the one a published schema would create.
This is the piece where the closed/open decision actually carries weight. The prompt without a good profile is a list of rules with nothing coherent to apply them to, and a published schema that drifts into incompatible forks would gradually break the published prompt’s usefulness as a reference artifact. Keeping the schema under one roof is what keeps the prompt portable, paradoxically: a stable shape on our side is what lets reproducers on the other side build their own structured input and trust that the prompt will read it the way it was designed to. The version of "open" where every implementation detail leaks out tends to produce twenty subtly-different versions of the contract, and none of them are the contract anymore.
Retrieval orchestration and the ingest pipeline
Retrieval orchestration covers the reciprocal-rank-fusion implementation, the per-bucket FAISS plumbing, the BM25 plus graph plus vector merge logic, the scoring weights, and the per-engine cutoff heuristics. We considered publishing the RRF step on its own (the technique dates to 2009 and the implementation is a few dozen lines of code), but isolated from the rest of the pipeline it would have created a specific failure mode: someone plugs it into their own BM25 plus vector setup, gets a result that underperforms ours, and concludes our retrieval doesn’t work as advertised. The actual issue would be that the score distributions feeding RRF in their setup don’t match ours, and the fusion parameters we tuned don’t transfer. The published code would look like ours; the behavior wouldn’t match; the bug reports would land on us. So we kept retrieval orchestration as one unit. The boundary is the API, not individual functions inside it.
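For readers who haven’t met reciprocal rank fusion, here is a minimal sketch of the textbook form, not Engram’s implementation. It assumes each engine returns a best-first list of memory IDs, and it uses k=60, the constant from the original RRF paper. The function name, the sample IDs, and the tie-break rule are hypothetical. Textbook RRF looks only at rank positions; any weights, cutoffs, or score handling layered on top in a tuned pipeline are not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of IDs into a single ranking.

    Each ID scores the sum of 1 / (k + rank) over the lists it appears in.
    k=60 is the constant from the original RRF paper (an assumption here,
    not a value taken from Engram).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from three engines for one question.
bm25_hits = ["m3", "m1", "m7"]
vector_hits = ["m1", "m9", "m3"]
graph_hits = ["m1", "m3"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits, graph_hits])
print(fused)  # ['m1', 'm3', 'm9', 'm7']
```

The fusion step itself is small. What does not transfer between stacks is everything that decides which lists go in and how long they are, which is why a fusion function published alone says little about end-to-end behavior.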
The ingest pipeline (triple extraction, classification, conflict resolution, the atomic-memory counter logic that keeps "how many distinct events of type X" answerable without re-counting) is the same story, plus one more thing. Ingest is where a lot of the interesting decisions live: what counts as a fact, what gets a confidence score, how a contradiction between two sessions gets resolved. Public review would probably improve them, and this is the piece where readers will most reasonably push back. The honest answer is that we’d like to open more of it eventually, but the current implementation is moving too fast for a stable public API to make sense. We’re not going to publish a thing labeled "Engram ingest" if the shape of it is going to change again in the next quarter.
The pattern
The heuristic we ended up applying, and the one we’d offer to anyone making a similar decision, is roughly: open-source the artifacts the community can iterate on, keep coherent ownership of the architecture that has to evolve as one system.
A prompt is an artifact. Each rule is local, each rule can be argued with on its own merits, and a community of users hitting the prompt against different corpora is going to find rules we didn’t. The marginal value of community iteration is high, and the cost of keeping the prompt private is high too (every team would maintain its own fork, and no canonical version would survive the divergence). Match: open-source it.
A retrieval pipeline is architecture. The pieces depend on each other. A change to BM25 scoring shifts what RRF sees; a change to triple extraction shifts what graph lookup returns; a change to the embedding model shifts vector recall. The fix that looks right in isolation often breaks downstream in non-obvious ways, so the marginal value of community iteration on any single piece is low. The cost of keeping it coherent is the cost we already pay as the team that owns it. Match: keep it owned.
This isn’t a new idea. It’s the same heuristic OpenSSL and curl follow, in different vocabulary. The interface and the data formats are public; the implementation evolves under one roof. We didn’t invent the pattern, we just applied it to a memory stack.
What an independent reproduction would tell us
The number we’d most like to see verified is the 91.6% itself, against the LongMemEval-S 500-task set, with whatever retrieval and composer model the reproducer wants to use. We re-graded our own run three times and reported the variance, but a number is only really pinned down once someone outside the building runs it.
The reproductions don’t need to be flattering. If your retrieval setup plus our prompt scores 84%, that tells the field that roughly 7.6 points of our 91.6% depend on the rest of our stack rather than on the prompt. If it scores 92%, that says the prompt generalizes past our specific stack. Either is informative; the only outcome that isn’t is the one where the prompt stays locked up and nobody runs it.
Practical pointers for users
The thing worth saying up front is that the prompt has four slots, named {profile}, {context}, {question}, and {date}, and the profile slot is the one that matters. If you’re using Engram, it’s the explanation.profile field of a /v1/query response. If you’re using another retrieval system, hand-roll an equivalent: a structured summary of the user with people, events, items, places, and stable facts. The prompt assumes structure, not the specific shape we emit, but the closer your structure is to ours, the less rule-rewriting you’ll need. The context slot takes whatever your retrieval system returns for the question, as a list; the prompt has rules for handling conflicts, gaps, and irrelevance, so don’t over-filter before passing it in.
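As a rough illustration of wiring the four slots, the sketch below fills a stand-in template in Python. Everything in it apart from the slot names is an assumption: the template text is a placeholder for the real composer_prompt_v44.md, the profile keys are a hand-rolled guess at "people, events, items, places, and stable facts" and not Engram’s schema, and the ISO date string is just one plausible format.

```python
import json
import re

# Placeholder template. The real text lives in composer_prompt_v44.md.
TEMPLATE = (
    "Today is {date}.\n"
    "Profile:\n{profile}\n"
    "Retrieved context:\n{context}\n"
    "Question: {question}\n"
)

SLOT_PATTERN = re.compile(r"\{(profile|context|question|date)\}")


def fill_prompt(template, profile, context, question, date):
    """Fill the four named slots in a single pass.

    A regex is used instead of str.format so that any other braces in the
    template are left alone, and so that slot-like text inside a retrieved
    memory is never substituted a second time.
    """
    values = {
        "profile": json.dumps(profile, indent=2, sort_keys=True),
        "context": "\n".join(f"- {item}" for item in context),
        "question": question,
        "date": date,
    }
    return SLOT_PATTERN.sub(lambda m: values[m.group(1)], template)


# Hypothetical hand-rolled profile; the key names are illustrative only.
profile = {
    "people": [{"name": "Dana", "aliases": ["D."], "relation": "sister"}],
    "events": [],
    "items": [],
    "places": ["Lisbon"],
    "stable_facts": ["Dana lives in Lisbon"],
}
context = [
    "2024-03-02: user said Dana moved to Lisbon",
    "2023-11-10: user said Dana lives in Porto",
]

prompt = fill_prompt(TEMPLATE, profile, context, "Where does Dana live?", "2024-04-01")
```

The context list is passed through unfiltered, conflicting entries included, since resolving those conflicts is the prompt’s job.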
A few smaller notes. The prompt is designed for GPT-5, which we used throughout. Weaker models drop a few points on count and ordering questions, and stronger ones probably gain a few, but we didn’t systematically ablate the composer model. If you do, we’d love to see the numbers. The license is MIT: use it, modify it, ship it. If you ship a paper, citation is appreciated but not required. If you ship a product, we’d be curious to hear how it went, but you don’t owe us anything.
Where the leverage actually lives
Whether to open-source a given piece of the stack comes down to where the leverage on improvements is. A prompt benefits enormously from community iteration: many readers hitting many edge cases against many corpora, with each rule local enough to argue about on its own merits and patch in isolation. Architecture is the opposite: coupled pieces, integration tests that only make sense viewed as a whole, and a maintenance surface that fragments badly when individual components diverge across forks. Match the license to whichever of those two shapes the piece actually has, and the decision usually makes itself.
The published v44 prompt lives at composer_prompt_v44.md in our benchmarks repo, alongside the rest of the published artifacts, MIT-licensed. If you run it and reproduce the number, tell us what you got. If you find a rule we missed, tell us what it was. Either one is more useful to us than another internal pass at the prompt would have been.
Further reading
Closely related
- Engram on LongMemEval-S: 91.6%. 458/500 on the public benchmark. Hybrid retrieval, canonical profile pass, v44 composer prompt (MIT-licensed).
- Iterating the extraction prompt: 28 versions and what each one fixed. A partial history of the EXTRACTION_PROMPT, version by version. Each one fixed a failure the previous one caused.
- Reproducing the 91.6%: a step-by-step from the LongMemEval-S run. The runbook to verify our number end-to-end. Exact stack, v44 prompt, judge config, and where the variance comes from.