Opinion
Hosted inference vs BYOK: the unit economics of agent memory
A memory product is not one LLM call per user action. It is three to five model calls per ingest and two or three per query, fanned out across extraction, classification, conflict resolution, profile generation, and synthesis. Whoever pays for those tokens defines what your pricing page can look like. We pay for none of them. Here is the arithmetic for why.
Most agent-infrastructure pitches wave a hand at inference cost. The pricing slide reads "$29/month, unlimited usage" and the founder mentions, almost as an aside, that they cover the model calls. It sounds generous right up until you sit down and count the calls.
A memory product is not a CRUD service with an LLM bolted onto it. Every meaningful operation fans out into several model calls that the caller never sees — a single store_memory request is three to five round-trips, a single query_memory is two or three, and none of them are optional if you want the quality numbers the category gets pitched on.
We built Engram on the assumption that someone has to pay for those tokens, and that pretending otherwise quietly corrupts the rest of the pricing model. That single assumption is the reason Engram is BYOK (bring your own key) rather than hosted inference. The rest of this post is the arithmetic behind that decision, and what the math means if you're sitting on the buyer side of an evaluation between us and a hosted-inference alternative.
What actually runs when you call store_memory
The MCP tool surface is small: store a memory, query for memories, list buckets, clear a bucket. From the caller's side it looks like a key-value store with semantic lookup. The server side is not that.
Here is the fan-out for a single store_memory call, in the order the Engram server runs it (other serious memory products run something similar; the components and their names vary, the count does not):
- Embedding generation. A local sentence-transformer pass (all-MiniLM-L6-v2) vectorizes the new content for retrieval. Runs on the server; no inference provider involved, so it doesn't show up on anyone's LLM bill.
- Triple extraction. One LLM call that reads the raw content and emits subject-predicate-object triples for the knowledge-graph layer. This is the call that catches "Alice manages the payments migration" so a later question about Alice's team finds it. We route this to a small instruction-tuned chat model; Engram defaults to gpt-5.4-mini.
- Classification. One LLM call that decides whether this content is a durable fact, a transient state, a preference, or something to ignore. Without this step every offhand comment becomes a permanent memory and recall quality collapses inside a week. Engram routes this to a smaller, faster model, llama-3.1-8b-instant by default.
- Conflict resolution (conditional). If the new content contradicts something already in the bucket ("I moved to Berlin" arriving after a stored "lives in Lisbon"), one more LLM call decides whether to supersede, merge, or keep both with a date stamp. Fires on a meaningful fraction of stores, not every one. Call it 25-40% of the time for a chatty user.
- Profile generation (amortized). Once per conversation, a single LLM call reads the full session and emits a canonical user profile: merged entities, deduped events, resolved aliases. We covered why this matters in our LongMemEval write-up; it is worth roughly four points of accuracy on the public benchmark. The cost amortizes across every store in the conversation, but it has to be paid by someone, and the input is the entire conversation history, which is the longest input in the pipeline.
So a single store_memory resolves to two guaranteed LLM calls on top of the local embedding pass, plus one conditional call and one amortized one: roughly 3–5 round-trips per memory ingested. None of those calls is optional, none is huge in isolation, and together they are the whole cost story.
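To make the call pattern concrete, here is a minimal sketch of the store-side fan-out: an OpenAI-compatible client for the provider calls, the same local sentence-transformer for the embedding. The prompts are placeholders and a single client stands in for the per-component routing described later in the post; this illustrates the shape of the pipeline, not our implementation.

```python
# Minimal sketch of the store-side fan-out. Prompts are placeholders and one
# client stands in for per-component routing; this shows the call pattern,
# not Engram's implementation.
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()  # BYOK: api_key / base_url come from the tenant's configuration
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def store_memory(content: str) -> dict:
    # 1. Embedding: local pass, never touches the inference provider.
    vector = embedder.encode(content)

    # 2. Triple extraction: one provider call on a small chat model.
    triples = client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=[{"role": "user",
                   "content": f"Extract (subject, predicate, object) triples:\n{content}"}],
    ).choices[0].message.content

    # 3. Classification: one provider call on a smaller, faster model.
    kind = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user",
                   "content": f"Classify as durable fact / transient state / preference / ignore:\n{content}"}],
    ).choices[0].message.content

    # 4 and 5, not shown: the conditional conflict-resolution call (~25-40% of
    # stores) and the once-per-conversation profile call follow the same shape.
    return {"vector": vector, "triples": triples, "kind": kind}
```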
And on query_memory
Retrieval itself is not an LLM call. Hybrid retrieval (BM25 over a Postgres full-text index, vector similarity over pgvector, graph traversal over a small relational schema) is plain database work. It is fast and free in the sense that we already paid for the writes.
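For the curious, the database half looks roughly like the sketch below, assuming a memories table with a tsvector column and a pgvector embedding column. The schema, the result limits, and the rank-fusion step are illustrative rather than our actual retrieval pipeline.

```python
# Sketch of the retrieval half of query_memory, assuming a table shaped like
# memories(id, content, fts tsvector, embedding vector). Column names, limits,
# and the fusion step are illustrative.
import psycopg

LEXICAL = """
  SELECT id, ts_rank(fts, plainto_tsquery('english', %(q)s)) AS score
  FROM memories
  WHERE fts @@ plainto_tsquery('english', %(q)s)
  ORDER BY score DESC LIMIT 20;
"""

SEMANTIC = """
  SELECT id, 1 - (embedding <=> %(v)s::vector) AS score
  FROM memories
  ORDER BY embedding <=> %(v)s::vector LIMIT 20;
"""

def retrieve(conn: psycopg.Connection, question: str, query_vector: list[float]) -> list[int]:
    with conn.cursor() as cur:
        lexical = cur.execute(LEXICAL, {"q": question}).fetchall()
        semantic = cur.execute(SEMANTIC, {"v": str(query_vector)}).fetchall()
    # Merge the two ranked lists with reciprocal-rank fusion; the graph
    # traversal pass (not shown) contributes candidates the same way.
    fused: dict[int, float] = {}
    for ranked in (lexical, semantic):
        for rank, (mem_id, _score) in enumerate(ranked):
            fused[mem_id] = fused.get(mem_id, 0.0) + 1.0 / (60 + rank)
    return sorted(fused, key=fused.get, reverse=True)
```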
The LLM calls in the query path are around the retrieval, not in it:
- Query embedding. Local sentence-transformer pass to embed the question for vector lookup. Same model as on the write path; same "free per call once you've paid for the box" story.
- Count canonicalization (conditional). "How many weddings have I been to?" needs the system to enumerate distinct events and decide whether the user's college roommate's wedding (mentioned in March) and "Emily's wedding in the city" (mentioned in August) are the same event or two. One LLM call when the question looks like a count question; zero otherwise. Fires on roughly 10-20% of queries in our LongMemEval-style traffic mix.
- Synthesis / composer. One LLM call that reads the retrieved memories and the profile and produces the answer string. This is the call the customer actually sees in their agent's response. Largest input of any single call (retrieved memories plus a long structured prompt), and the most expensive per query.
- Profile pass (amortized again). Same profile as above. If the bucket is fresh, the first query generates it synchronously; subsequent queries hit cache. Cost shows up exactly once per conversation.
So a query is roughly 2 calls guaranteed plus 1 conditional plus a one-time profile cost. Per-query LLM cost ranges from about $0.001 on a short query against a small bucket to $0.05 on a long synthesis against a 500-message conversation with a count-canonicalization pre-pass. That spread is real and it is mostly driven by the synthesis call's input length.
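The shape of the synthesis call is unremarkable; what makes it the expensive one is that the retrieved memories and the profile all travel in as input. A minimal sketch, with placeholder prompt wording and a placeholder model id:

```python
# Sketch of the synthesis / composer call: the retrieved memories and the
# profile travel in as input, which is why this is the priciest call per
# query. Prompt wording and the model id are placeholders.
from openai import OpenAI

def compose_answer(client: OpenAI, question: str, memories: list[str], profile: str) -> str:
    context = "\n".join(f"- {m}" for m in memories)
    response = client.chat.completions.create(
        model="frontier-model-of-choice",  # placeholder: synthesis defaults to a frontier model
        messages=[
            {"role": "system", "content": f"User profile:\n{profile}"},
            {"role": "user", "content": f"Memories:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```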
The per-action breakdown, in one table
Putting it all in one place. "Per call" is order-of-magnitude using current public pricing on small instruction models for the cheap components and a frontier model for synthesis. Your numbers will vary; the ratios will not.
| Action | Component | Fires | Per call |
|---|---|---|---|
| store_memory | embedding (local) | always | — |
| | extractor | always | $0.0008 |
| | classifier | always | $0.0001 |
| | conflict resolver | ~30% | $0.0008 |
| | profile (amortized) | once / conv | $0.01 - $0.04 |
| query_memory | query embedding (local) | always | — |
| | count canonicalization | ~15% | $0.001 |
| | synthesis / composer | always | $0.005 - $0.05 |
| | profile (cached) | once / conv | amortized |
Mean per-store, including the amortized profile share: roughly $0.003 - $0.012 depending on model choice. Mean per-query: roughly $0.008 - $0.04. Those are the numbers we will multiply through next.
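If you want to reproduce those means, the arithmetic is short. The per-call prices and fire rates below come from the table; the five-stores-per-conversation figure used to spread the amortized profile cost is an assumption of this sketch, not a number from the table.

```python
# Reproducing the per-action means. Per-call prices and fire rates come from
# the table; 5 stores per conversation (to spread the amortized profile cost)
# is an assumption of this sketch.
STORES_PER_CONVERSATION = 5

def mean_store_cost(extractor=0.0008, classifier=0.0001,
                    conflict=0.0008, p_conflict=0.30, profile=0.02):
    """Expected LLM cost of one store_memory, amortized profile share included."""
    return (extractor + classifier + p_conflict * conflict
            + profile / STORES_PER_CONVERSATION)

def mean_query_cost(synthesis=0.02, canon=0.001, p_canon=0.15):
    """Expected LLM cost of one query_memory (profile assumed already cached)."""
    return synthesis + p_canon * canon

print(f"per store: ${mean_store_cost():.4f}")  # ~$0.005; cheaper models and longer
                                               # conversations pull this toward $0.003,
                                               # pricier component models push it up
print(f"per query: ${mean_query_cost():.4f}")  # ~$0.02 with a mid-priced synthesis call
```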
Worked example: a five-person coding team
Pick a plausible workload. A small engineering team running a coding agent, which is the kind of customer this category exists to serve. Five engineers, each running an agent against a shared memory bucket. The agent stores roughly 10 memories per engineer per working day (decisions, conventions, gotchas, project state) and queries roughly 100 times per engineer per working day (most queries are inside multi-turn agent loops, not direct user questions). Working days, not calendar days.
Per day, that is 50 stores and 500 queries against the memory layer. Per working year (call it 250 days), that is 12,500 stores and 125,000 queries.
Multiplying through near the middle of the per-action range, at $0.008/store and $0.02/query (which is roughly what a typical mix with a frontier-model synthesizer costs today):
- Stores: 12,500 × $0.008 = $100/year
- Queries: 125,000 × $0.02 = $2,500/year
- Combined: ~$2,600/year in raw inference, for one small team.
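The same multiplication as a small parameterized function, with the worked example's numbers as defaults. The second call is the busier 30-store / 300-query pattern described next.

```python
# The worked example as a reusable function; defaults are the numbers above.
def annual_inference(engineers=5, stores_per_day=10, queries_per_day=100,
                     working_days=250, cost_per_store=0.008, cost_per_query=0.02):
    stores = engineers * stores_per_day * working_days
    queries = engineers * queries_per_day * working_days
    return stores * cost_per_store + queries * cost_per_query

print(annual_inference())                                        # 2600.0, the baseline team
print(annual_inference(stores_per_day=30, queries_per_day=300))  # 7800.0, the busier loop
```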
We've seen the real spread come out wider than that. Teams running an agent inside tighter multi-step loops, with closer to 30 stores and 300 queries per engineer-day, land in the low four figures per seat-year; quieter teams, whose engineers use the agent intermittently the way a solo developer might, land closer to a couple thousand a year for the whole team. The point isn't the precise number, which depends on how chatty the agent is and which model the synthesis call routes to. The point is that you're not in the tens-of-dollars range for a small team using the product seriously, and a flat $29/month subscription doesn't cover it.
Those are the inference numbers, exclusive of storage, retrieval infrastructure, observability, and the MCP server itself. Just the LLM calls fanning out of the memory layer.
The hosted-inference vendor's choice
Now imagine a hosted-inference memory vendor charging that same five-person team a flat $29/month, call it $350/year, for unlimited usage. The math is unsubtle:
- Revenue: $350/year.
- Inference cost: $2,600/year on the typical workload, more on the active one.
- Gross margin: deeply negative per customer, in the multiple-thousands range.
The realistic move, and the one most hosted-inference vendors in this space actually make, is to cap usage at the subscription level. "Up to 5,000 memories and 50,000 retrievals per month." It is at least honest about the underlying cost. It just creates a different problem. The cap has to be set where the median customer roughly breaks even for the vendor, which means a meaningful slice of customers will hit it on any busy month, get throttled or paywalled, and experience the product as broken at exactly the moments it most needed to feel solid. The customers who exceed the cap are also the ones who got the most value out of the product, and they are the ones most upset about hitting it. We have watched this play out from the customer side more than once.
The two alternatives are worse but worth naming. A vendor can price for the inference and meter it at a markup over what the model provider charges; this works arithmetically but loses you the customer the first week they run hot and notice the markup against public token prices. Or a vendor can simply eat the cost and bet on scale and model-cost compression to fix it later, which is a fine pitch except per-token prices have been falling for two years and per-task token counts have been rising at roughly the same rate. The vendor underwater today is underwater tomorrow on slightly cheaper tokens. We do not see anyone shipping that bet seriously; it shows up mostly in seed-stage decks.
So the realistic outcome is the capped-subscription one, and the realistic experience of that outcome is intermittent product brokenness for the customers who got the most out of it. The fan-out per user action is too high and the workload variance across customers is too wide for any flat hosted-inference price to clear without one of those failure modes.
The BYOK alternative
Bring your own key (BYOK) is the simple alternative. The customer brings credentials for their preferred model provider. The memory product makes calls against those credentials. The provider bills the customer directly for the tokens consumed; the memory product never touches that money flow.
The vendor's pricing collapses to one variable, the cost of running the memory product itself: storage, the retrieval pipeline, the MCP server, durability, observability, and the engineering work that keeps recall above some quality bar. That is the part that genuinely is fixed-shape: roughly proportional to memories stored and retrievals served, dominated by infrastructure costs that scale predictably with workload, and unaffected by which model the customer chose this quarter.
A flat-subscription tier (or a small meter on memories-stored and retrievals-served) is enough to cover the platform cost with a working gross margin. The customer pays the model provider directly, at whatever volume discount they have negotiated, and sees their inference bill in the place where they already track inference bills.
Engram's components route per-component to whatever the tenant has configured. The five components that fire on store and query (extractor, synthesis, classifier, conflict_resolver, profile) each have their own model selection. By default, they route to a sensible mix: a small fast model for classification, a mid-size model for extraction and conflict resolution, a frontier model for synthesis and profile. A tenant can override any or all of them, point any component at any OpenAI-compatible endpoint, and the unit economics for that component shift to that tenant's cost basis.
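Concretely, the routing reduces to a small per-tenant table that resolves a model, an endpoint, and a credential for each component at request time. The sketch below is a schematic of that idea, not a copy of our configuration reference; the field names are illustrative, the endpoint URLs are the providers' published OpenAI-compatible bases, and any other OpenAI-compatible base_url slots in the same way.

```python
# Schematic of per-component routing for one tenant. Field names and defaults
# are illustrative, not a configuration reference; any OpenAI-compatible
# base_url slots in the same way.
import os
from openai import OpenAI

DEFAULT_ROUTING = {
    "classifier":        {"model": "llama-3.1-8b-instant",     "key_env": "GROQ_API_KEY",
                          "base_url": "https://api.groq.com/openai/v1"},
    "extractor":         {"model": "gpt-5.4-mini",             "key_env": "OPENAI_API_KEY",
                          "base_url": "https://api.openai.com/v1"},
    "conflict_resolver": {"model": "gpt-5.4-mini",             "key_env": "OPENAI_API_KEY",
                          "base_url": "https://api.openai.com/v1"},
    "synthesis":         {"model": "frontier-model-of-choice", "key_env": "OPENAI_API_KEY",
                          "base_url": "https://api.openai.com/v1"},
    "profile":           {"model": "frontier-model-of-choice", "key_env": "OPENAI_API_KEY",
                          "base_url": "https://api.openai.com/v1"},
}

def client_for(component: str, overrides: dict | None = None) -> tuple[OpenAI, str]:
    """Resolve the client and model for one component at request time."""
    cfg = {**DEFAULT_ROUTING[component], **(overrides or {}).get(component, {})}
    client = OpenAI(api_key=os.environ[cfg["key_env"]], base_url=cfg["base_url"])
    return client, cfg["model"]
```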
The result is a pricing model with no hidden gradient. The platform fee is what it says. The inference bill is what the customer's model provider charges. No one is upset on either side.
Secondary benefits of BYOK
The economics are the headline argument, but there are a couple of operational properties that matter more than we expected once we started watching customers actually run the thing.
The biggest one is that the vendor (us) never sits in the middle of the inference data path. Memory content is, almost by definition, the most context-rich data a customer ships through any AI infrastructure: verbatim conversation logs, decisions, project state, personal preferences. When that data leaves the customer's environment to be summarized or classified, it goes directly from the memory server to the customer's chosen model provider, over the customer's credentials, under the customer's data-processing agreement. We see the data at rest because we have to (that is the product), but we are not standing in front of every model call rebroadcasting it. For customers in regulated industries that distinction is often the difference between "yes" and "no" on the procurement form, and we have closed deals where it was the deciding factor and lost deals to hosted-inference competitors who couldn't answer the question. It is the single property of this architecture that customers bring up unprompted most often.
The other one is that BYOK puts the memory layer's inference into the same observability the customer already uses for the rest of their AI spend. Their provider dashboards and FinOps tooling already cover their LLM bill, and with BYOK the memory layer's calls just show up there alongside everything else. Under hosted inference the same activity shows up as an opaque line item on the memory vendor's invoice, and "you used the product more" is not an actionable answer to "why did our March bill double." Volume discounts and provider portability come along with this for free, since the customer's existing provider contracts and existing connection-string-swap habits both apply to the memory layer too.
The honest objection: BYOK is harder to set up
The fair criticism of BYOK is that it adds friction at signup. Hosted-inference vendors skip the key step entirely; you sign up, get a memory endpoint back, and start storing. BYOK gates the first call on pasting a credential into a form.
Yes, it's an extra step. Roughly thirty seconds for a developer who already has a model-provider account, which is approximately every developer who shows up to evaluate an agent-memory product. The bet we're making with that thirty seconds is that a customer who uses the product for a year doesn't remember setup at all and very much remembers the bill, so we'd rather take the friction at the front than build a pricing model that breaks once they start using us in earnest.
The setup ergonomics we ended up shipping take most of the sting out anyway. A single key covers all components by default, with per-component overrides for customers who want to mix providers. The key gets validated on submission and any error from the provider gets surfaced verbatim, so a bad paste fails loudly inside a second or two instead of producing a confused 412 on the first real call. Any OpenAI-compatible endpoint works, which by 2026 is most of the market. The shape that lands is close to paste-click-test-done, and that's about as far as we think the setup can be reasonably compressed without losing the architectural property the rest of the post is about.
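The validation step itself is conceptually one cheap, synchronous call against the tenant's endpoint with the provider's error passed through verbatim. A schematic of that idea, not our onboarding code:

```python
# Schematic of the key check: one cheap call against the tenant's endpoint,
# with the provider's error surfaced verbatim. Not actual onboarding code.
from openai import OpenAI, APIError

def validate_key(api_key: str, base_url: str) -> tuple[bool, str]:
    client = OpenAI(api_key=api_key, base_url=base_url, timeout=5.0)
    try:
        client.models.list()  # cheapest call most OpenAI-compatible endpoints expose
        return True, "ok"
    except APIError as exc:
        # A bad paste fails loudly here, at setup, instead of on the first
        # real store or query.
        return False, str(exc)
```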
We should also flag the obvious thing: we're partly arguing our book here. Engram is the BYOK product in this category and BYOK is the model we want customers to prefer. The arithmetic above isn't ours, though — it's what any vendor running a comparable fan-out has to face — and we ran it as a hosted-inference vendor during early prototyping before walking away from that model. The conclusion lands the same way whether Engram is part of the picture or not.
Where hosted inference could still make sense
To be honest about the trade: there are workloads where hosted inference could plausibly work for a memory product.
One is when the per-action fan-out is dramatically smaller than what we have described. A memory product that is just a wrapper over a vector database (embed on write, similarity search on read, no extraction, no classification, no profile, no synthesis) has a fan-out of approximately one LLM call per operation. The economics there look more like a search index than what we are describing. Some of the memory products advertised in the category are actually shaped like this. They can host inference because they barely use any. They also do not perform well on long-horizon benchmarks, but that is a separate post (we wrote one of those too: Engram on LongMemEval-S).
The other is at the very smallest end of the workload distribution: a hobbyist running a personal agent against their own notes, with single-digit operations per day. The total inference cost is small enough in absolute terms that a vendor can absorb it as a marketing cost on a free tier. This works, but it does not generalize. The customers who matter to the vendor's business are not the ones using the product five times a day.
Outside those two edges, the arithmetic above holds. The fan-out is real, the per-action cost is real, and the only stable pricing model is the one where the customer pays the model provider.
What this means for Engram
Engram is flat-subscription on the memory product itself. The platform fee covers the storage, the hybrid retrieval pipeline, the MCP server, durability, observability, profile generation orchestration, and the engineering that keeps recall above its current bar. Memories-stored and retrievals-served are metered with generous limits. The meters exist so we can size infrastructure honestly, not as a usage-cliff mechanism.
Inference is between you and your model provider. Bring an OpenAI key, an Anthropic key, a Groq key, a key for any OpenAI-compatible endpoint you operate yourself, or any mix of those across components. The five components that fire on store and query each pick up their model and credentials at request time. There is no markup on tokens and no path through us for the model bill.
The practical consequence for a customer evaluating us: your monthly bill from Lumetra is predictable and bounded. Your monthly bill from your model provider scales with how hard you use your agent, at exactly the rate it scales for every other thing you do with that provider. Nothing is hidden in either invoice.
We did not arrive at this model because BYOK is fashionable. We arrived at it because we ran the numbers on hosting inference for our own workload during the early prototype phase and concluded that any pricing we could write down was either going to lose us money or lose us customers. The third option (the customer pays for what they use, at the rate their provider charges them) is the only one we could keep writing down and still believe in two years.
Further reading
Closely related
- What BYOK actually means (it's three different things). Credential scope, billing scope, inference path: three orthogonal properties that get bundled under one acronym.
- Pricing for memory: why we rejected MAU, per-project meters, and a Scale tier. Every pricing model we considered, why each looked attractive, and the specific failure mode that made us drop it.
- How to think about token costs across an agent-memory pipeline. Six LLM-touching components, six different cost shapes. The mix-and-match guide for BYOK deployments.
Engram
- Engram on LongMemEval-S: 91.6%. Full benchmark methodology and what didn't work.
- Engram docs. HTTP API, MCP setup for each client, SDK examples.
- Start with Engram. Free tier, BYOK, MCP-native.