Framework

How to think about token costs across an agent-memory pipeline

The most common question we get from customers wiring up BYOK is "what will my key cost me per month?" There isn't a single number we can give you. Engram's pipeline has six LLM-using components with six different cost shapes, and the right model for one component is rarely the right model for another. This post is a framework for thinking about that, not a benchmark of measured dollar figures.

Published May 1, 2026 · By Jacob Davis and Ben Meyerson

When a prospect asks "what does Engram cost to run," they're usually rolling two things into one number: the platform fee (which is ours to set) and the inference bill (which goes directly to their model provider under BYOK). The platform fee is easy. The inference bill is what people anchor on when comparing memory systems, and it's the number people consistently model wrong, because they assume one model handles the whole pipeline.

It doesn't. Engram's request path touches the LLM in six distinct places, and the workload at each place looks nothing like the others. The extractor runs on every store with short inputs and short outputs. The synthesis call runs on every query with a large retrieved-context input and a shortish answer. The profile generator runs once per bucket per regeneration with the entire conversation history as input. Wiring a frontier model into all six is the easy default and also a bad one: you'll likely overpay several times over without measurably better recall on most of the pipeline.

What you actually want is to pick a model per component, paying frontier prices only where reasoning quality dominates and small-model prices everywhere else. Engram's BYOK config is built around exactly this split: each tenant can configure model, provider, and prompt per component, and the request path routes to the right client at runtime.

We are not going to publish a table of "Engram costs $X per 1,000 queries on Provider Y." We haven't measured those numbers on our own production traffic in a clean enough way to put them in writing, and your workload almost certainly differs from ours anyway. What we can do is walk the six components, describe each one's cost shape qualitatively, suggest the model class that fits, and hand you the formula to run on your own input/output sizes and the prices you see on your provider's page today.

The six components

A request through Engram fans out into up to six LLM-touching subsystems depending on what the request is. Not every store hits all six (most stores skip the conflict resolver entirely), and most queries hit a single one (synthesis) plus optionally one specialist (count canonicalizer for "how many X" questions). The components, in roughly the order you'd care about them:

1. Extractor

Runs on every store. Reads the raw memory content and emits subject-predicate-object triples that get written into the knowledge graph alongside the vector and BM25 indices. Inputs are short (the memory text, usually one or two sentences) and outputs are short too (a small JSON object). The work is mostly pattern recognition and structured output, not deep reasoning, but precision matters because every false triple becomes a phantom edge in the graph.
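
To make the shape concrete, here's a sketch of what a single extractor call consumes and produces. The triple field names are illustrative assumptions, not Engram's actual wire format; the point is how small both sides of the call are.

# Illustrative only: field names are assumptions, not Engram's real schema.
memory_text = "Booked flights to Lisbon for Emily's wedding in June."

extracted_triples = [
    {"subject": "user", "predicate": "attends", "object": "Emily's wedding"},
    {"subject": "Emily's wedding", "predicate": "located_in", "object": "Lisbon"},
    {"subject": "Emily's wedding", "predicate": "occurs_in", "object": "June"},
]

# A sentence or two in, a small JSON object out: per-call cost stays low
# even at six-figure monthly ingest volumes.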

Cost shape: per-store. If you're ingesting 100,000 memories a month, this runs 100,000 times. Latency matters because it's in the ingest critical path. Quality matters less than you'd think. We've benchmarked a frontier extractor against Llama-3.3-70B on the same extraction set and seen only a 2-of-11-recovery difference on a stubborn failure subset (see our LongMemEval writeup). Not nothing, but not the place to spend frontier-model money.

Right model class: small/fast, roughly 70B open-weight or a mid-tier closed model. Today's default is gpt-5.4-mini; the next-cheapest equivalent on Groq is llama-3.3-70b-versatile.

2. Synthesis

Runs on every query. Reads the retrieved memories (BM25 + vector + graph fused via reciprocal rank fusion) plus the user's question and the canonical profile, then composes the answer. This is the composer prompt we shipped v44 of in April. Inputs are large (several thousand tokens of retrieved context is typical, plus the profile, plus the question). Outputs are usually a few hundred tokens.

Cost shape: per-query, and this is where most of your inference bill lives. Quality dominates: a stronger reasoner gets free points on count, ordering, and date-arithmetic questions, and a weaker one drops you several. We use GPT-5 on the benchmark stack for exactly this reason. If you're going to splurge anywhere in the pipeline, this is the one.

Right model class: mid-to-frontier reasoner. Anthropic's Claude Sonnet 4.5 or OpenAI's GPT-5 family for accuracy-sensitive workloads; llama-3.3-70b-versatile on Groq is the budget option that's surprisingly close.

3. Classifier

Runs on every store. Reads the memory and labels it by topic and scope (work, personal, project, preference, fact, event, and so on). Token counts are trivial in both directions. The work is closed-set classification, and anything that can return reliable JSON with one of N labels is fine.

Cost shape: per-store, but per-call cost is so low it usually rounds to nothing relative to the extractor. Use the smallest viable model. Today's default is llama-3.1-8b-instant.

Right model class: 8B class or the cheapest API tier your provider offers. Don't overthink it.

4. Conflict resolver

Runs only when ingest detects that a new memory contradicts an existing one ("user lives in Boston" vs the just-stored "user lives in Seattle"). Decides keep, replace, merge, or mark-superseded. Most stores skip this entirely; in our production traffic, well under 10% of stores trigger the conflict path.

Cost shape: per-conflict, which makes per-store amortized cost low. But the decisions are subtle (is this a real contradiction or a context shift? does the new fact supersede or coexist?), so the per-call quality bar is closer to the synthesis model's than the extractor's.

Right model class: mid-tier. Today's default is gpt-5.4-mini; we'd be comfortable running llama-3.3-70b-versatile here for cost-sensitive deployments.

5. Profile generator

Runs once per bucket per regeneration. Reads the entire conversation history for that bucket (could be 500 messages, could be 20,000) and emits a structured canonical profile: people, events, items, places, recurring activities, stable facts. This is the call that does the heavy semantic merging: "my college roommate's wedding" and "Emily's wedding in the city" get unified into one entry with both aliases.

Cost shape: per-bucket-per-regenerate. Input can be very large for an active bucket; output is structured JSON. The frequency is what saves you: typically once on first query for a fresh bucket, then again when you trigger /v1/buckets/{id}/profile/regenerate after a meaningful update. Amortized over the hundreds of queries that then hit the cached profile, the per-query share of profile cost is small.

But the absolute cost of each call is the largest single LLM hit in the pipeline by far, and quality dominates everything else (this is the +4 points we measured on LongMemEval-S). Use the strongest reasoner you can justify.

Right model class: frontier reasoner. GPT-5, Claude Sonnet 4.5, or Opus 4.5. We've tried running profile on smaller models; the recall lift drops measurably.

6. Count canonicalizer

Runs only on count queries: "how many weddings did I attend this year," "how many books have I bought." Takes the retrieved memory set and a candidate enumeration, and dedupes co-referent mentions before committing to a number. Without it, "Emily's wedding" and "my college roommate's wedding" double-count.

Cost shape: per-count-query. Most production traffic doesn't trigger this. In our LongMemEval runs, count-shape questions are roughly 20% of the multi-session category, well under 10% of total queries. The work is precision-bound (one off-by-one wrecks the answer) but latency-tolerant (count questions are slow questions already; users accept a couple extra seconds).

Right model class: mid-tier. Today we run it on gpt-5.4-mini.

Provider tiers, roughly

We are deliberately not publishing a table of dollar figures here. Provider pricing pages move; the numbers in any post we write would be stale by the time you read it. Go look at the current rate on the provider's own page before you do the math.

The rough shape of the market in mid-2026 is three tiers, and the spread between them is large enough to drive the architectural decision even without exact prices:

  • Small / cheap. Roughly 8-15B parameter open-weight models hosted on Groq, Together, Fireworks, DeepInfra, and others; or each closed-model provider's cheapest tier (OpenAI's mini-class, Anthropic's Haiku-class). The cheapest serious-use options here are very inexpensive, on the order of pennies per million input tokens.
  • Mid. Roughly 70B-class open-weight (Llama-3.3-70B on Groq, Together, Fireworks; DeepSeek's chat tier) or the mid-tier closed models. As of writing this, mid-tier is a few times the small-tier rate, give or take.
  • Frontier. The flagship reasoning models: GPT-5, Claude Sonnet 4.5 and Opus 4.5, the strongest DeepSeek reasoning model. Anthropic's Opus tier is the most expensive serious option you can wire in. Frontier closed-model rates are commonly tens of times the mid-tier rate, and roughly 100x the cheapest small-tier rate.

The big picture: within a tier, prices vary several-fold across providers; across tiers, the spread is one to two orders of magnitude. Output tokens generally cost a few times more than input tokens at most providers, which matters more for the count canonicalizer (sometimes emits a long enumeration) than for the extractor (small output).

For BYOK planning, check each provider's pricing page directly (OpenAI, Anthropic, Groq, Together, Fireworks, DeepSeek) and plug today's numbers into the formula below.

The formula, and why we won't run it for you

The math itself is trivial. For any component:

per_call_cost = (input_tokens / 1,000,000) × $in_per_M + (output_tokens / 1,000,000) × $out_per_M

Then multiply by call frequency. A store fires the extractor and the classifier, plus the conflict resolver on the small fraction of stores that contradict an existing memory. A query fires the synthesis call, plus the count canonicalizer on the fraction of queries that are count-shaped. The profile generator fires once per bucket on first query, then on explicit regeneration.
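
As a sketch, here's the same arithmetic in code. Every number you'd feed it (token sizes, prices, conflict rate, count-query rate) is yours to measure; nothing below is a figure we're claiming.

def per_call_cost(input_tokens, output_tokens, usd_in_per_million, usd_out_per_million):
    """Cost of one LLM call at today's per-million-token rates."""
    return (
        (input_tokens / 1_000_000) * usd_in_per_million
        + (output_tokens / 1_000_000) * usd_out_per_million
    )

def monthly_cost(stores, queries, profile_regenerations, conflict_rate, count_query_rate, costs):
    """Rough monthly inference bill.

    `costs` maps component -> per-call cost (from per_call_cost and your own
    measured token sizes). Frequencies mirror the paragraph above: every store
    fires extractor + classifier, a small fraction fires the conflict resolver,
    every query fires synthesis, a fraction fires the count canonicalizer, and
    profile fires once per bucket per regeneration.
    """
    return (
        stores * (costs["extractor"] + costs["classifier"])
        + stores * conflict_rate * costs["conflict_resolver"]
        + queries * costs["synthesis"]
        + queries * count_query_rate * costs["count_canonicalizer"]
        + profile_regenerations * costs["profile"]
    )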

We're not going to publish a table of "Engram costs $X.YZ per 1,000 stores on Provider A" because we'd be making up the input/output token sizes for your traffic. Memory length, retrieval context size, profile size, and conflict rate all vary substantially by workload. The honest version of this post hands you the formula and tells you to run it on your own measurements once you've stored a few thousand memories and watched the token counts come back from your provider's billing page.

What we will say qualitatively: synthesis tends to dominate, because retrieved-context inputs are large and synthesis runs on every query. Extractor and classifier costs are usually small in absolute terms even at high ingest volumes, because both calls are short. Profile generation is the largest single call when it fires, but it fires rarely enough that the amortized per-query share is negligible as long as you don't regenerate profiles aggressively. If your inference bill comes in higher than expected, look at synthesis first; that's almost always where the money went.

The mix-and-match strategy

Most of the customers we work with end up in roughly the same configuration after a few weeks of running BYOK:

  • Extractor: Groq llama-3.3-70b-versatile. Cheap, fast, accurate enough.
  • Classifier: Groq llama-3.1-8b-instant. Trivial work, smallest model.
  • Conflict resolver: OpenAI gpt-5.4-mini or Anthropic Haiku 4.5. Subtle decisions, mid-tier is enough.
  • Synthesis: OpenAI GPT-5 or Anthropic Sonnet 4.5 for accuracy-bound tasks; Groq 70B for cost-bound tasks. This is the routing choice that matters.
  • Profile: OpenAI GPT-5 or Anthropic Opus 4.5. Frontier reasoning, infrequent calls, amortized well.
  • Count canonicalizer: OpenAI gpt-5.4-mini. Precision-bound, low volume.
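
Expressed as data, that mix is just a per-component routing table. The dictionary below is an illustrative sketch; the keys and structure are assumptions rather than Engram's actual BYOK config schema, and the model identifiers are the ones named in the list above.

# Illustrative only: not Engram's real config format.
MIX_AND_MATCH = {
    "extractor":           {"provider": "groq",      "model": "llama-3.3-70b-versatile"},
    "classifier":          {"provider": "groq",      "model": "llama-3.1-8b-instant"},
    "conflict_resolver":   {"provider": "openai",    "model": "gpt-5.4-mini"},
    "synthesis":           {"provider": "openai",    "model": "gpt-5"},
    "profile":             {"provider": "anthropic", "model": "opus-4.5"},  # model string assumed
    "count_canonicalizer": {"provider": "openai",    "model": "gpt-5.4-mini"},
}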

We can't honestly tell you what that mix costs end-to-end without inventing your traffic shape. The qualitative answer is what matters: against an all-frontier stack running GPT-5 or Opus on every component, this mix puts frontier money only on synthesis and profile (where it earns its keep) and small-model money on extractor, classifier, conflict resolver, and count canonicalizer (where the recall gap, in our measurements, is small). For the workloads we've helped customers configure, that's typically been a meaningful multiple of cost savings versus the all-frontier baseline, with no measurable accuracy regression on the parts that got cheaper.

Push further toward the cheap end (Groq 8B and 70B everywhere, DeepSeek chat for synthesis) and the savings grow again. Whether the recall and synthesis quality hold up at that point is exactly the question you have to measure against your own task distribution before committing. The savings are large enough that running the comparison is worth the engineering time.

BYOK config sets in Engram are built around this. You can define a "cheap" set, a "premium" set, and a "default" set, and route per request via the X-Engram-Config-Set header. Internal eval traffic on cheap, customer-facing traffic on premium, batch backfills on whatever's cheapest right now. The plumbing is in the request path; you're configuring routes, not rewriting code.
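
A sketch of the request side, assuming a query endpoint for illustration: the X-Engram-Config-Set header is the real routing mechanism, while the URL, payload, and field names below are assumptions.

import requests

# Hypothetical endpoint and payload, shown only to illustrate per-request routing.
resp = requests.post(
    "https://your-engram-host.example/v1/query",   # assumed path
    headers={
        "Authorization": "Bearer <your-engram-key>",
        "X-Engram-Config-Set": "cheap",            # or "premium" / "default"
    },
    json={"bucket_id": "bucket_123", "query": "How many weddings did I attend this year?"},
)
print(resp.json())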

Practical recommendations

We get versions of this question often enough that it's worth giving the short form, mapped to the optimization target you actually care about.

If you're cost-optimizing aggressively: small models (Groq 8B) for extractor and classifier, mid-tier (Groq 70B or DeepSeek chat) for everything else including synthesis. Accept that synthesis quality will be measurably below frontier on hard reasoning questions. Expect 70-90% savings versus an all-frontier stack.

If you're accuracy-optimizing: frontier model (GPT-5 or Sonnet 4.5/Opus 4.5) for synthesis and profile, mid-tier for the rest. Don't waste frontier-model money on the extractor. We benchmarked the swap and it bought us 2-of-11 recoveries on a stubborn failure set, which is to say not nothing but also not the lever to pull first. Expect to pay several times the cheap stack, with one to three points of measurable benchmark lift on long-horizon recall.

If you're latency-optimizing: Groq for any small or mid-tier model. Their p95 latency on the same model class is consistently 2-4x faster than the closed-model APIs. Fireworks is the next-fastest for mid-tier. For the synthesis call, GPT-5 and Claude Sonnet 4.5 are roughly comparable on p95; both are slower than the open-weight options on Groq, but the quality bump is usually worth it if you can absorb the latency.

If you don't know yet: start with mix-and-match. Groq for extractor/classifier, OpenAI mid-tier (gpt-5.4) for synthesis, OpenAI gpt-5.4-mini for conflict resolver and count canonicalizer, and the strongest model you can afford for profile. Measure, then move the synthesis dial up or down based on what your traffic actually shows.

Pitfalls to watch for

A few things that have bitten customers we've helped configure BYOK in the last few months.

Tokenizer quirks. Different providers count tokens slightly differently. Groq's tokenizer for Llama-3.3 counts code blocks and structured JSON 5-15% higher than tiktoken does. If you're modeling cost from a token count generated against OpenAI's tokenizer and then routing to Llama-3.3, expect your actual usage to come in a bit over plan. Always validate against a real bill before committing to volume.
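
One way to keep a plan honest is to count with tiktoken and pad by a buffer before routing to a provider whose tokenizer counts higher. A minimal sketch, where the 10% buffer is an assumption (roughly the midpoint of the 5-15% gap above), not a measured constant:

import tiktoken

def planned_tokens(text, cross_tokenizer_buffer=1.10):
    """Token estimate for cost planning: count with an OpenAI encoding,
    then pad for providers that count structured text higher. Validate
    the buffer against a real bill before trusting it."""
    enc = tiktoken.get_encoding("cl100k_base")
    return int(len(enc.encode(text)) * cross_tokenizer_buffer)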

Character-based pricing. A small number of providers (mostly regional cloud providers and a handful of specialty inference shops) charge per character rather than per token. Their pricing pages will quote a per-million-token figure that's actually a per-million-character figure converted at an assumed ratio. If your inputs are non-English or contain a lot of code, the conversion can be off by 20-40%. Read the fine print on the pricing page.

Output-token weighting. Most provider pricing splits input and output tokens, with output a few times more expensive. The count canonicalizer is the one component where output-token cost can swamp input cost, because it sometimes emits long enumerations. Plan accordingly.

Rate limits and tier escalation. Frontier-model rate limits on OpenAI and Anthropic scale with org tier. A customer hitting GPT-5 at $5K/month spend has a much higher TPM limit than a customer at $100/month. If you're modeling cost on $5K of GPT-5 traffic at $5K-tier rate limits, you'll be fine, but the first month while you ramp up to that tier may not have the headroom you planned for. We've seen production rollouts stall on this exact issue.

Function-calling reliability. The extractor and classifier both depend on the model returning well-formed structured output. JSON mode and function calling are reliable on OpenAI, Anthropic, and Groq today, less reliable on some Together / Fireworks model variants depending on which weights they're hosting. Test extraction yield on a few hundred real memories before you bulk-deploy.
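
A minimal yield check, as a sketch: run a few hundred real memories through whatever extractor call you've wired up and count how many responses parse into the triple shape you expect. The call_extractor argument and the field names are placeholders for your own client code, not an Engram API.

import json

def extraction_yield(memories, call_extractor):
    """Fraction of memories for which the extractor returns parseable,
    correctly shaped JSON. `call_extractor` takes one memory string and
    returns the model's raw response string."""
    ok = 0
    for memory in memories:
        raw = call_extractor(memory)
        try:
            triples = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(triples, list) and all(
            isinstance(t, dict) and {"subject", "predicate", "object"} <= t.keys()
            for t in triples
        ):
            ok += 1
    return ok / len(memories) if memories else 0.0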

The takeaway

The thing most worth carrying away from this post is that under BYOK, you don't have to pick a single model for the whole pipeline — and once you look at the per-component cost shapes, you shouldn't. Engram's six LLM-touching components each have their own quality-vs-cost curve, and matching the model class to the component is the single largest lever you have on the inference bill. Done well, it's been a meaningful multiple of savings for the customers we've helped configure it, with no measurable accuracy hit on the parts where the model class didn't matter.

The per-component configuration is in the product today; the model choices stay yours. The formula is above, the provider pricing pages are open in a tab somewhere, and the qualitative shape of the answer is what this post has tried to give you. If you want to walk through your specific traffic shape with us before committing to a deployment, drop a note and we'll work through the math together.
