
The cost-per-query frontier: synthesis model, latency, and accuracy

"Cost per query" gets quoted as a single number. It isn't one. The synthesis model you point at retrieved memories sets most of your bill; retrieval k and accept-latency budget are the secondary knobs. Where you set the three determines a point on a frontier, not a fixed price.

Published April 28, 2026 · By Jacob Davis and Ben Meyerson

We get asked "what does Engram cost per query" roughly once a week. The honest answer is "it depends, and the dependencies aren't on us." The dominant cost is the synthesis LLM call you make against retrieved memories. We don't run that model; you do, against your own provider key. So our cost-per-query is small and bounded. Your total cost-per-query depends on which model you point at the retrieved context, how much context you ask for, and how patient your users are.

This post lays out the frontier with numbers from a real workload: the LongMemEval-S 500-task run we published earlier this month, re-graded across six synthesis models from four different providers, with retrieval k swept and latency measured at p50 and p95. Where the curve bends, what the kinks cost in dollars and milliseconds, and how to pick a point on it.

The slider that matters, and the two that adjust it

A single agent-memory query runs through a small pipeline, but the part of that pipeline that moves your bill is the synthesis model. The LLM reading retrieved memories and writing the answer spans roughly two orders of magnitude in per-call cost, from sub-cent small models at one end to several cents per query for the frontier reasoners. Most of the other tuning decisions you'll make are downstream of that choice — either compensating for a weaker model or freeing a stronger one to do more work.

The other two knobs worth naming are retrieval k (how many memories the synthesis step gets to look at) and your accept-latency budget (the p95 you're willing to ship to users). A larger k usually buys better recall at the cost of a longer, slower, more expensive synthesis call; tighter latency budgets push you off the strongest models or into streaming territory; looser budgets let you sample harder. Both are real levers. Both move cost in the tens of percent, not in the 10× jumps the synthesis model does.

You can't optimize all three. In practice two of the constraints bind and the third gives, and the work in picking a memory stack is mostly about figuring out which two are binding in your situation, then choosing model and k that sit on the frontier between them. The rest of this post is what that frontier looks like.

What a single query actually costs

The headline number is synthesis: one LLM call with the retrieved context and the user's question. The range across reasonable model choices is the difference between a small open-weights model and a frontier reasoner, roughly two orders of magnitude. That single call runs something like 10× to 100× everything else in the bill combined. If you want to move cost-per-query in any meaningful way, this is the line item you change. Everything below is in the noise relative to it.

The other four components: a canonical profile pass that's a single LLM call against the full conversation history but amortizes across every subsequent query in that conversation (we use a mid-tier model and the per-query slice is small); a count-canonicalization step that runs a small-model pass on the subset of queries that need it (a minority of queries we see); the retrieval pipeline itself, which is database CPU and embedded GPU time and is effectively free per call at any reasonable volume; and a question embedding call that's a rounding error. Sum the four together and they're still small next to the synthesis call. They become interesting only once you've already pushed the synthesis model down to its floor.
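
As a back-of-the-envelope model, that decomposition looks like the sketch below. Every figure in it is an illustrative placeholder chosen to be consistent with the ranges in this post, not a measured number.

```python
def cost_per_query(synthesis, profile_total, queries_per_conversation,
                   canon_rate, canon_cost, retrieval=0.00001, embedding=0.000002):
    """Illustrative per-query cost model; every argument is a placeholder dollar figure."""
    return (synthesis                                    # dominant line item: the synthesis LLM call
            + profile_total / queries_per_conversation   # one profile pass, amortized over the conversation
            + canon_rate * canon_cost                    # small-model pass on the minority of queries that need it
            + retrieval + embedding)                     # effectively noise at any reasonable volume

# Placeholder inputs: mid-tier synthesis, a 20-query conversation, canonicalization on 20% of queries.
total = cost_per_query(synthesis=0.006, profile_total=0.02,
                       queries_per_conversation=20, canon_rate=0.2, canon_cost=0.001)
print(f"${total:.4f}")   # ~$0.0072 here, of which ~$0.006 is the synthesis call
```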

Six synthesis models, one workload

We re-ran the same 500-task LongMemEval-S benchmark across six synthesis models, holding everything else constant: identical retrieved memories, identical v44 composer prompt, k=50, same GPT-4o judge. The retrieval stack is Engram's hybrid pipeline, the same one any customer hits. The only variable is which model reads the retrieved context and writes the answer.

Costs are blended in/out token costs for a representative query (roughly 3.2K input tokens of retrieved context + profile + prompt, ~250 output tokens). Latencies are end-to-end accept latency, measured server-side, including retrieval and synthesis. Accuracy is on the same 500-task slice, comparable to the 91.6% number we published for the full Engram stack with GPT-5 synthesis.
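
For concreteness, each $/query cell is just token counts times per-token prices, blended across input and output. Here is a minimal sketch of that arithmetic; the prices are placeholders picked to roughly reproduce the cheapest and most expensive rows, not any provider's actual April 2026 price sheet.

```python
def blended_cost(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Dollar cost of one synthesis call, given $/1M-token prices."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000

# Representative query from this run: ~3.2K input tokens of context + profile + prompt, ~250 output tokens.
# Prices below are illustrative placeholders, not published provider pricing.
low  = blended_cost(3_200, 250, in_price_per_mtok=0.30, out_price_per_mtok=0.80)
high = blended_cost(3_200, 250, in_price_per_mtok=7.00, out_price_per_mtok=26.00)
print(f"${low:.4f} to ${high:.4f} per query")   # ~$0.0012 to ~$0.0289, the span of the table below
```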

Synthesis model               $/query   p50 latency   p95 latency   Accuracy
Groq llama-3.3-70b            $0.0012   820ms         1,400ms       74.8%
DeepSeek V3.2                 $0.0021   1,350ms       2,650ms       83.2%
Together Qwen3-235B           $0.0058   1,600ms       2,900ms       87.6%
Anthropic Claude Sonnet 4.7   $0.0094   1,950ms       3,200ms       89.4%
OpenAI GPT-5-mini             $0.0061   1,750ms       3,050ms       88.0%
OpenAI GPT-5                  $0.0290   2,850ms       4,400ms       91.6%

Two things jump out. First, the cost range across "reasonable" choices is 24× (from $0.0012 to $0.0290) but the accuracy range is only about 17 points, so the dollars-per-accuracy-point slope gets steeper fast. Second, the latency range is narrower than you'd expect: the cheapest model is roughly 3× faster than the most expensive, not 24×. Cost scales with per-token price and how many reasoning tokens get spent; latency scales with model size and provider infrastructure. They aren't the same axis.
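
One way to see how fast the dollars-per-point slope steepens: walk the rows in cost order and compute the marginal cost of each additional accuracy point. The numbers below are copied straight from the table.

```python
# (model, $/query, accuracy %) from the table above, sorted by cost
rows = [
    ("llama-3.3-70b", 0.0012, 74.8),
    ("DeepSeek V3.2", 0.0021, 83.2),
    ("Qwen3-235B",    0.0058, 87.6),
    ("GPT-5-mini",    0.0061, 88.0),
    ("Sonnet 4.7",    0.0094, 89.4),
    ("GPT-5",         0.0290, 91.6),
]

for (prev, c0, a0), (cur, c1, a1) in zip(rows, rows[1:]):
    dollars_per_point = (c1 - c0) / (a1 - a0)
    print(f"{prev} -> {cur}: ${dollars_per_point:.4f} per accuracy point")
```

The last step, Sonnet 4.7 to GPT-5, costs roughly 80× more per accuracy point than the first one at the cheap end.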

Where the curve bends

The frontier isn't smooth. There are three rough regimes, and the kinks between them are where the interesting decisions live.

Under $0.002/query: small-model territory

You're running on small instruction-tuned models — the cheapest tier on Groq, Together, Fireworks, or similar. Accuracy on single-session lookups holds up better than we expected (low 90s on the easy LongMemEval categories), but multi-session and temporal-reasoning collapse hard: we measured a 12-point drop on multi-session and 9 points on temporal against the next tier up. If your workload looks like "what's my email" or "what did I tell you about my preferences last session," this regime is fine. If it's anywhere near "how many distinct customers have churned this quarter," it isn't, and the drop won't be subtle.

$0.005 to $0.015/query: the sweet spot

Mid-tier models: GPT-5-mini, Claude Sonnet, Qwen3-235B via Together, or the equivalent on whichever provider you're already wired up to. p95 latency stays under three seconds, accuracy lands within 2–4 points of frontier-model results on multi-hop questions, and most of the production coding-agent and customer-support workloads we see end up living here. Moving from $0.006 to $0.009 buys you roughly two accuracy points; moving from $0.015 to $0.029 buys roughly the same two points for more than four times the increment. The curve is bending hard by the time you cross out of this band.

Above $0.02/query: frontier reasoners

GPT-5, Opus 4.7, the top-tier reasoning models. The case for spending here is workloads where accuracy is the actual product: legal research, medical chart Q&A, financial-statement summarization, anything where a single wrong answer costs more than ten times what the per-query premium adds up to over a year. The case against is a 1,000-query-per-day coding agent where the cost of a wrong answer is "I'll try again" and the cost of a right answer is "thanks"; there the per-query premium buys almost nothing the user notices.

The shape, in one line: each doubling of cost inside the sweet spot buys you about two accuracy points; each doubling above it buys you under one. Past about $0.03 we couldn't reliably measure a lift on this benchmark at all — judge variance alone (±0.8 points at three-roll standard deviation) eats whatever signal might be there.

Retrieval k: the second slider

k is the number of memories handed to synthesis. More memories means more context, more tokens, more dollars, more latency. Fewer means a faster, cheaper call that occasionally misses the one memory the question actually needed.

We swept k from 20 to 300 at fixed synthesis model (GPT-5-mini) on the same 500-task slice:

k     Avg input tokens   $/query   p95 latency   Accuracy
20    1,650              $0.0033   2,200ms       83.4%
50    3,180              $0.0061   3,050ms       88.0%
100   5,800              $0.0102   3,800ms       89.1%
200   10,500             $0.0188   5,200ms       89.3%
300   15,200             $0.0271   6,600ms       89.0%

k=50 is our default for a specific reason. The jump from k=20 to k=50 buys you 4.6 accuracy points at less than a 2× cost bump, which is the best deal anywhere on this curve. k=50 to k=100 buys another 1.1 points at ~1.7× cost, which is fine. k=100 to k=200 buys 0.2 points while nearly doubling your cost and adding well over a second of p95 latency, which isn't. Past k=200 the accuracy stops climbing entirely and in some configs starts dropping: the reranker holds up, but the synthesis model genuinely gets worse at picking the relevant memory out of a longer haystack, especially when the reasoning budget isn't there to compensate.

So: k=20 if you're hard on cost, k=50 as the default, k=100 if accuracy matters more than the latency cost. Above 100, the reranker's job gets harder and the synthesis model's gets worse, both at the same time.
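
If you'd rather make that choice mechanical than eyeball it, the sweep table is small enough to filter directly. A sketch, with the rows copied from the table above and the budget arguments standing in for your own constraints:

```python
# (k, $/query, p95 ms, accuracy %) from the k-sweep table above
sweep = [
    (20,  0.0033, 2_200, 83.4),
    (50,  0.0061, 3_050, 88.0),
    (100, 0.0102, 3_800, 89.1),
    (200, 0.0188, 5_200, 89.3),
    (300, 0.0271, 6_600, 89.0),
]

def pick_k(max_p95_ms, max_cost):
    """Highest-accuracy k that fits both budgets, or None if nothing does."""
    feasible = [r for r in sweep if r[2] <= max_p95_ms and r[1] <= max_cost]
    return max(feasible, key=lambda r: r[3]) if feasible else None

print(pick_k(max_p95_ms=3_500, max_cost=0.010))   # (50, 0.0061, 3050, 88.0)
print(pick_k(max_p95_ms=4_000, max_cost=0.015))   # (100, 0.0102, 3800, 89.1)
```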

The shape of latency budgets

Latency isn't just "how fast"; it's a shape. Three regimes cover most production deployments.

Under 500ms p95

Hard mode. Only achievable with synchronous retrieval and either a small synthesis model or a pre-fetched/cached answer. We can hit it with k=20 against Groq llama-3.3-70b (410ms p95 in our internal measurements) but accuracy drops to the mid-70s. If your product is an autocomplete or a one-token classifier and the memory layer is decorative, fine. Anything that needs reasoning over multiple memories doesn't fit here.

1–3 seconds p95

Where most production agents live. Mid-tier synthesis, full hybrid retrieval, reranker, k=50. Users perceive this as "thinking briefly." Cursor, Claude Code, and most coding assistants run their tool calls in this range. The slot lets you use Sonnet 4.7 or GPT-5-mini comfortably; it doesn't quite fit full GPT-5 unless you stream tokens and let users start reading before generation finishes.

3–8 seconds p95

Frontier-model territory. Acceptable for customer-support reply generation, document summarization, single-shot research questions, anything where the user has clicked a button and expects to wait. Not acceptable for tab-by-tab agentic loops where the model is making 50 tool calls and each one waits on memory.

One thing the table at the top doesn't show: streaming changes everything. If you stream tokens to the user, the relevant latency is time-to-first-token, not end-to-end. Frontier models with reasoning typically have higher TTFT than mid-tier models (the reasoning happens before the first emitted token), which is why a streamed Sonnet response often feels faster than a streamed GPT-5 response even when the total wall-clock is similar.
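
If you're weighing a streamed mid-tier model against a streamed frontier one, measure TTFT directly rather than inferring it from end-to-end numbers. Here's a minimal sketch using the OpenAI Python SDK's streaming interface; the model name and prompt are placeholders, and the same pattern works against any OpenAI-compatible endpoint.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure(model, messages):
    """Return (time_to_first_token, total_seconds) for one streamed chat call."""
    start = time.monotonic()
    ttft = None
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.monotonic() - start   # first visible token
    return ttft, time.monotonic() - start

# Placeholder model and prompt; substitute your synthesis model and retrieved context.
ttft, total = measure("gpt-5-mini", [{"role": "user", "content": "..."}])
print(f"TTFT {ttft:.2f}s, total {total:.2f}s")
```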

The composer model is part of the product

Something we've said elsewhere and want to repeat here: the synthesis model you point at retrieved memories is part of the system, not a free variable. A stronger reasoner gets "free" points on count and ordering questions because it correctly enumerates from the retrieved candidates without dropping or double-counting. A weaker one loses several points on the same questions with the same retrieved memories. We measured that gap directly with the same retrieval and prompt, swapping only the synthesis model: GPT-5 to Sonnet 4.7 costs 2.2 points; GPT-5 to Qwen3-235B costs 4.0 points; GPT-5 to llama-3.3-70b costs 16.8 points.

The implication for shopping comparisons: a cost-per-query number without a synthesis model attached is uninterpretable. "Our system costs $0.003 per query" tells you nothing if the underlying composer is a small model that wouldn't have hit the accuracy floor your workload needs. Conversely, "we hit 92% accuracy" tells you nothing if it cost $0.04 per query against the most expensive reasoner on the market. The two numbers are jointly produced. Anyone benchmarking memory systems honestly has to publish both, plus the model and prompt that produced them.

This is also why Engram is BYOM by default. We don't bundle a synthesis model because the right choice depends on your workload, your latency budget, and your spend. Pointing the same retrieval layer at three different synthesis models gives you three different product experiences at three different price points. That's a feature, not a missing piece.
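
To make that concrete, here's a hypothetical wiring, not Engram's actual client API: the retrieved memories stay fixed and the synthesis call is whatever client and model you hand it. Several of the providers in the table expose OpenAI-compatible endpoints, so swapping models is often just a base URL and a model name; the endpoints, keys, and model identifiers below are placeholders.

```python
from openai import OpenAI

# Hypothetical wiring, not Engram's API: one OpenAI-compatible client per provider.
# Endpoints, keys, and model ids are placeholders; check your provider's docs.
SYNTHESIS = {
    "gpt-5-mini":    (OpenAI(), "gpt-5-mini"),
    "qwen3-235b":    (OpenAI(base_url="https://api.together.xyz/v1", api_key="..."), "Qwen/Qwen3-235B"),
    "llama-3.3-70b": (OpenAI(base_url="https://api.groq.com/openai/v1", api_key="..."), "llama-3.3-70b-versatile"),
}

def synthesize(choice: str, memories: str, question: str) -> str:
    """Same retrieved memories, different synthesis model: the cost/accuracy slider."""
    client, model = SYNTHESIS[choice]
    prompt = f"Retrieved memories:\n{memories}\n\nQuestion: {question}"
    resp = client.chat.completions.create(model=model,
                                          messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```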

Picking a point on the frontier

We'll skip the generic per-vertical advice and describe one concrete customer shape we see often: a coding agent serving developers, continuous traffic, latency-sensitive because the developer is sitting there waiting. For that shape we'd start with a mid-tier synthesis model (GPT-5-mini, Sonnet 4.7, or Qwen3-235B depending on which provider they're already wired up to), k=50, streamed tokens, and a p95 target in the low single-digit seconds. The product feel that lands is "the agent thinks for a beat and answers"; anything slower and the developer alt-tabs away, anything cheaper and the agent gets count questions wrong often enough that trust erodes. From that starting point, the two knobs we'd actually expect them to move are k (down if their workload is more single-session-lookup than multi-hop, up if the opposite) and synthesis model (up to a frontier reasoner if their evals show count or temporal questions failing too often to ignore).

The pattern: pick the two constraints that bind hardest for your workload, typically two of accuracy, latency, and spend, and let the third float. Trying to optimize all three simultaneously usually means you're paying too much, going too slow, or accepting accuracy you'd reject if you measured it.

Three things people get wrong

We've run this analysis with a handful of customers picking a point on the frontier. The same misconceptions show up.

Treating cost-per-query as fixed. The most common mistake. People hear "$0.01 per query" once, anchor on it, and don't realize the number depends entirely on a choice they get to make. Different traffic shapes, different question difficulty distributions, and different synthesis-model choices move the number by 10× either direction. Anchoring is dangerous because it makes you under-budget for accuracy-critical paths and over-budget for cheap-and-cheerful paths.

Ignoring the amortization on profile generation. The profile pass is a single LLM call against the whole conversation history. It looks expensive when you stare at the line item: $0.04–$0.12 with a frontier model on a long history. People sometimes turn it off for cost reasons. But it's amortized over every subsequent query in the conversation, and the accuracy lift it provides (+4 points in our LongMemEval run) more than pays for itself over a 20-query session because the alternative is using a more expensive synthesis model to compensate for missing canonicalization. The per-query economics favor keeping it on except for very-short-conversation workloads.
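
The amortization is one line of arithmetic; the figures below come from the ranges quoted in this post and the table above.

```python
profile_pass = 0.08   # one-off profile pass on a long history (mid-range of the $0.04-$0.12 quoted above)
queries = 20          # queries in the session

per_query_slice = profile_pass / queries
print(f"${per_query_slice:.4f} per query")   # $0.0040, well under the ~$0.02/query premium of
                                             # stepping from a mid-tier to a frontier synthesis model
```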

Optimizing latency in isolation. "We need under 1 second p95" is a constraint people pick before knowing what their workload actually demands. Sometimes it's a real constraint (voice interfaces, real-time autocomplete) and sometimes it's a wish. Real-time constraints push you toward small synthesis models or pre-computed answers, both of which cost accuracy. Wishful constraints just leave accuracy on the table for no benefit. Measure what your users actually tolerate before committing to a latency budget.

How we measured

Methodology, for the table: 500 LongMemEval-S tasks, the same set we published earlier this month. Retrieval pipeline frozen at our v44 production config (hybrid BM25 + vector + graph, reciprocal rank fusion, cross-encoder reranker, k=50 unless varied). Composer prompt v44, the published MIT-licensed prompt. Profile pass on, generated once per conversation using GPT-5-mini. Only the synthesis model varies row to row. Costs use published per-provider token prices as of April 2026, blended on the actual input/output token ratios from each run. Latencies are server-measured end-to-end accept latency, not just generation time. Accuracy is a single grade against the GPT-4o judge, with ±0.8 points of expected judge variance.

We're not publishing the per-task cost and latency dumps for every cell of the table (it's a lot of data), but if you're doing your own comparison and want the raw numbers, get in touch. The benchmark itself is public; the retrieval stack is documented; the v44 prompt is in benchmarks/results/20260420/composer_prompt_v44.md.

Frontier, not a number

Cost-per-query for an agent-memory system describes a frontier rather than a single point. You're choosing a position on a surface defined by synthesis model, retrieval k, and latency budget, and that position trades off against accuracy in a way that's specific to your application. There's no globally-correct answer because there's no globally-correct cost on a wrong response or a slow one — the price of either depends on whether you're shipping a coding agent, a support assistant, or something a regulator will eventually read transcripts of.

Two of the three constraints almost always bind, and the third is your degree of freedom. Pick which two, ride the frontier between them, accept that the third is the variable you live with. And before you compare a cost-per-query number from one vendor against another, check which synthesis model produced it; the LLM is doing most of the spending, and quoting the number without naming the model is the same kind of incompleteness as quoting an accuracy number without the judge.

Engram is BYOM-default for this reason. Whichever synthesis model lands at your point on the frontier is the one we'll route to; we handle retrieval, canonicalization, and the profile pass. The shape of your spend stays in your hands, including the part where you change it.
