What BYOK actually means (it's three different things)

"BYOK" is on a lot of pricing pages right now. Different vendors mean wildly different things by it. The version a vendor implements determines whether you actually own your inference, your data, or just the receipt.

Published March 17, 2026 · By Jacob Davis and Ben Meyerson

A developer wires up an agent platform on a Tuesday afternoon. The pricing page says "BYOK supported." Their mental model is straightforward: paste an OpenAI key into a settings field, the vendor calls the model with that key, the inference dollars land on the OpenAI invoice instead of on the vendor's bill with a markup. Three claims rolled into one label. For most vendors who put the acronym on the page, at least one of those claims is wrong, and the customer usually doesn't find out which one until the second or third invoice.

"Bring your own key" is doing too much work as a phrase. It collapses three independent properties into a single label, and a vendor can be genuinely BYOK on one of them while staying fully hosted on the other two. The three are credential scope (whose key sits in whose database, encrypted how), billing scope (which company the LLM provider invoices), and inference path (whose servers actually see the prompt body in flight). Each one solves a different problem. Each one fails in a different way. A vendor's BYOK claim is worth roughly the weakest of the three axes they're willing to commit to.

The middle one — billing scope — is where the unit economics actually fail or hold up, so most of this post lives there. The other two get a paragraph each.

Credential scope: where your key lives

The question on this axis is what happens if the vendor's database leaks tomorrow morning. The weak answer — common enough that you should assume it's where you are until told otherwise — is that your sk-... key is sitting on a row in the application database. Sometimes plaintext. More often base64-encoded or wrapped in a reversible "encryption" whose key lives one environment variable away from the data. A dump of the database is, in that case, a dump of every customer's key.

Real credential-scope BYOK means the key is encrypted at rest with a master key held outside the application database, decrypted only at the moment of an outbound request, and versioned so it can be rotated without downtime. Our implementation uses AES-256-GCM with a 32-byte master key the application database has no read access to, a fresh nonce per ciphertext, ciphertext-only backups, per-row key-version tags, and a blob format of nonce(12) || ciphertext || authentication-tag(16) base64-wrapped. The scope of this axis is narrow on purpose: it solves the database-dump problem and nothing else. It tells you nothing about billing, and it tells you nothing about what the vendor's servers are doing with your prompt body while the request is in flight.
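
For concreteness, here's a minimal sketch of that at-rest scheme in Python, using the `cryptography` package's AESGCM primitive. The variable names, the environment-variable plumbing, and the key-version handling are illustrative assumptions, not our actual code; the point is the shape of the blob and the fact that the master key never lives next to the data it protects.

```python
# Minimal sketch of the at-rest scheme described above. Assumes Python and the
# `cryptography` package; variable names and key-version handling are
# illustrative, not the actual implementation.
import os
import base64

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# 32-byte master key, injected from outside the application database
# (e.g. a secrets manager); the env var name here is a placeholder.
MASTER_KEY = base64.b64decode(os.environ["CREDENTIAL_MASTER_KEY_V2"])
KEY_VERSION = 2  # stored per row so the master key can rotate without downtime


def encrypt_provider_key(provider_key: str) -> tuple[str, int]:
    """Seal a tenant's sk-... key for storage; returns (blob, key_version)."""
    nonce = os.urandom(12)  # fresh nonce per ciphertext
    sealed = AESGCM(MASTER_KEY).encrypt(nonce, provider_key.encode(), None)
    # AESGCM.encrypt returns ciphertext || 16-byte auth tag, so the stored
    # blob is nonce(12) || ciphertext || tag(16), base64-wrapped.
    return base64.b64encode(nonce + sealed).decode(), KEY_VERSION


def decrypt_provider_key(blob: str) -> str:
    """Unseal only at the moment an outbound request needs the key."""
    raw = base64.b64decode(blob)
    nonce, sealed = raw[:12], raw[12:]
    return AESGCM(MASTER_KEY).decrypt(nonce, sealed, None).decode()
```

Rotation is the reason the version tag travels with the row: re-encrypt under the new master key at your own pace, decrypting each blob with whichever version it was sealed under.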

Billing scope: who the LLM provider invoices

This is the axis where most "BYOK" claims quietly fall apart, and it's the one developers usually mean when they assume BYOK on a pricing page. The question is simple: when the model call settles, whose name is on the invoice?

The weakest version is what we'd call BYOK-in-name-only. The vendor takes your key, calls the model with it, then bills you through their own account with a markup attached. You get one invoice from the vendor that bundles platform fees and inference together, and the OpenAI dashboard you logged into expecting to see the spend shows nothing — because no spend ever landed on your OpenAI account. The cost shape here is indistinguishable from straight hosted inference. The "BYOK" label is doing the work of telling you that some credential of yours is involved somewhere in the architecture, but the economic relationship hasn't moved: you pay the vendor, the vendor pays the LLM provider, the vendor keeps a margin you didn't get to inspect.

The softer version is the "BYOK at parity" pitch. The vendor uses your key, claims they pass inference through at exactly the provider's posted rate, no markup. This is genuinely better than markup. It still leaves the vendor sitting in the middle of every inference dollar you spend. Cost visibility ends up on the vendor's invoice rather than on your provider's dashboard, which means spend anomalies — a runaway agent loop, a workload spike from a new feature flag, a misconfigured retry policy — surface on the wrong dashboard, often days late, after the vendor's billing batch closes. You're also trusting that "at parity" stays at parity through every contract renegotiation, every pricing-page revision, every "we've simplified our billing" email that lands in the next two years. A pass-through rate is a promise the vendor made. A direct provider invoice in your account is a fact the system enforces.

Real BYOK on this axis means the LLM provider bills your account directly. Your OpenAI dashboard, your Anthropic console, your Groq usage page: that's where the inference dollars land, in the same place they'd land if you weren't using the vendor at all. The vendor sees zero of those dollars. The vendor charges you separately for what they actually do, which is platform: storage, retrieval, orchestration, the product itself. Two invoices, two cost lines, two relationships that are decoupled and stay decoupled.
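
Mechanically, the difference is one line in the request path: whose credential goes in the Authorization header. A hedged sketch follows, reusing the decrypt_provider_key sketch from the credential-scope section plus a hypothetical load_key_blob lookup; the endpoint and header are the standard OpenAI API, everything else is illustrative.

```python
# Sketch of the billing-scope difference in a single request. The endpoint and
# Authorization header are the standard OpenAI API; load_key_blob and the
# model name are hypothetical, and decrypt_provider_key is the sketch above.
import requests


def call_model_for_tenant(tenant_id: str, messages: list[dict]) -> dict:
    # The tenant's key is decrypted only for the lifetime of this request.
    tenant_key = decrypt_provider_key(load_key_blob(tenant_id))

    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {tenant_key}"},  # tenant's account pays
        json={"model": "gpt-4o-mini", "messages": messages},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()  # usage meters against the tenant's account, not the vendor's
```

Because the only credential in the request is the tenant's, there is no vendor account for the spend to accumulate on: the provider meters usage against the key's owner, and the dashboard, the rate limits, and any negotiated discounts belong to the tenant by construction.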

The decoupling is the whole point. It means you can audit each cost independently, because they're produced by independent systems. It means you can optimize each independently: cut your prompt size, the OpenAI bill drops and the vendor bill stays the same; switch buckets to a cheaper retrieval tier, the vendor bill drops and the OpenAI bill stays the same. It means you can renegotiate each independently. Your enterprise commit with OpenAI, the committed-use credit you negotiated last quarter, the volume discount your finance team got from Anthropic, the provider-of-the-month routing your platform team ships on Friday: all of it flows through, unmodified, because the vendor isn't in the path. A vendor on managed inference can't pass any of that through cleanly because their margin is built on top of the posted rate; passing through your enterprise discount eats their margin, so they don't.

The incentive structure flips, too, and this is the part that matters past the first invoice. A vendor running managed inference has every incentive to keep your model spend high, because your model spend is their revenue. Better retrieval that reduces prompt size hurts them. Shorter system prompts hurt them. A "use a cheaper model for this class of query" feature hurts them. None of those features get prioritized; the ones that do tend to be the ones that grow per-call token counts. A vendor on real billing-scope BYOK has the opposite incentive: every token they save you is a feature they can sell, because you keep the savings. The two business models look identical on a pricing page and diverge sharply once you've been a customer for six months.

A concrete scenario for why this matters in production. You ship a memory-backed agent on Monday. By Wednesday it's running 4 million tokens a day across your user base. On Friday someone on your team notices the OpenAI dashboard is flat at near-zero, which is confusing because traffic is clearly hitting the agent. The reason: every call routed through the vendor's account. The vendor's invoice arrives at the end of the month. It's a single line item, "Platform usage," for $11,400. There's no per-call detail, no model breakdown, no temporal pattern, no way to attribute spend to features or users without the vendor exposing that view. Now imagine the same situation under real billing-scope BYOK. The traffic shows up on your OpenAI dashboard in real time, broken down by API key tag, with model and token granularity per call. You see the spend the same day it happens. You can attribute it the same hour it happens. The vendor's invoice is a flat platform charge that doesn't move with token volume, and the inference bill is yours to inspect the way you inspect any other infrastructure cost.

The same scenario explains why the "BYOK at parity" pass-through isn't the same thing. Even if the rate is honest to the cent today, the data path runs through the vendor, which means the dashboard you trust for cost decisions is theirs. When a senior engineer asks "what's our LLM spend pattern look like this quarter?" the answer has to come from a vendor invoice that wasn't designed for cost analytics, and any tooling you build on top is reading a derived number rather than the primary source. Real BYOK puts the primary source in your account, where the rest of your observability already lives.

At Engram this is the default and the only mode. We don't run managed inference at all. Every model call the platform issues runs against the tenant's own credential, against the tenant's own provider account, and lands on the tenant's own invoice. We charge for the platform (memories stored, retrievals issued) on a separate bill that has no relationship to your token spend, because there's no way for it to have one. The architecture forces the decoupling; we couldn't insert ourselves in the middle of the inference dollars if we wanted to, because we don't hold a vendor-side LLM credential anywhere in the stack.

Inference path: whose servers see the prompt

The third axis is the one most vendors won't label honestly, because the honest label is uncomfortable. Whose servers process the body of your prompt, and the body of the model's response, while the request is in flight? For any vendor doing orchestration on the customer's behalf (extraction, retrieval, memory synthesis, agentic loops, basically anything past a thin proxy) the answer is: theirs. The vendor's servers receive your request, assemble a prompt that mixes your input with retrieved context and system instructions, POST it to the LLM provider, and pipe the response back. The payload sits in process memory, briefly in TLS-terminated form at the load balancer, and potentially in error logs and APM traces if anything goes wrong. "We don't log prompts" is the right policy, but it's a contractual commitment, not a physical one.

The strong version of this axis (the customer's own infrastructure makes the LLM call, the vendor's servers never see the prompt) is feasible for narrow workloads like client-side embeddings, but impractical for orchestrated agent-memory pipelines where the vendor is sequencing calls with intermediate state. The honest answer for anyone running orchestration is: yes, our servers see the prompt in flight; here's the architecture that keeps it off disk and out of our other systems; here's the contract that binds us. At Engram, that means a single routing module is the only code path that touches a prompt body, prompt and response bodies are never logged, observability captures structural metadata only, and the contract commits us to never train on customer data. Anyone advertising "your data never touches our servers" while running a multi-step pipeline is hand-waving one of those clauses.
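
To make "structural metadata only" concrete, here's an illustrative shape of such a routing module, again in Python and again hypothetical rather than a copy of our code. It reuses the call_model_for_tenant sketch from the billing section; the only thing it hands to the logger is counts, identifiers, and timings, never a prompt or completion body.

```python
# Illustrative shape of a "single routing module": the only code path that
# touches a prompt body, and the only thing it emits to observability is
# structural metadata. Hypothetical; reuses the call_model_for_tenant sketch.
import time
import logging

log = logging.getLogger("inference.router")


def route_inference(tenant_id: str, payload: dict) -> dict:
    started = time.monotonic()
    response = call_model_for_tenant(tenant_id, payload["messages"])

    usage = response.get("usage", {})
    # Structural metadata only: nothing below carries a prompt or completion body.
    log.info(
        "inference routed",
        extra={
            "tenant": tenant_id,
            "model": response.get("model"),
            "prompt_tokens": usage.get("prompt_tokens"),
            "completion_tokens": usage.get("completion_tokens"),
            "latency_ms": int((time.monotonic() - started) * 1000),
        },
    )
    return response
```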

The question worth asking

There's a long checklist version of this post that ends with six questions to ask a vendor's pricing page. We wrote it. It read like a checklist. The single question that does most of the work is one you can ask in an email: show me a sample invoice for the inference my account would have generated last month. Whose name is on it? If the answer is the vendor's, you're on managed inference no matter what the pricing page says. If the answer is yours, ask the follow-up: was that the posted rate, or did the vendor apply a markup, a pass-through, or a discount before it reached the provider? A direct provider invoice in your name, at the provider's posted rate, with no vendor entity in the dollar flow, is the only configuration that survives the next pricing-page revision. Everything else is a promise.

So what does this mean for you?

When you see "BYOK" on a vendor's pricing page, you're being told one of three different things and probably can't tell which. The credential-scope version is mostly a database-dump story. The billing-scope version is where the unit economics get decided. The inference-path version is where the contract carries most of the weight, because the architecture can't fully eliminate the data flow. Picking a vendor is about figuring out which of those three they actually offer and which they're being quiet about, and the gap between the strongest axis and the weakest one is usually the part the marketing avoided talking about.
