Security

"We never train on customer data": what that actually requires

Open ten AI product marketing pages. Nine will say some version of "we don't train on your data." The sentence is doing all of the work, and the customer reading it has no way to verify any of it. Here's what's actually behind that sentence, for us, and as a buyer's checklist for anyone else.

Published March 3, 2026 · By Jacob Davis and Ben Meyerson

One quick disclosure before we start: we're an AI company writing about not using customer data, so feel free to read this with a skeptical eye. The whole point of the post is that the sentence is cheap on its own, so we'll try to earn it.

The phrasing varies from page to page. "Your data is never used to train our models." "We don't use customer inputs for model improvement." "Your prompts stay yours." It's on the homepage, the security page, the trust center, the SOC 2 one-pager, the DPA summary, and sometimes the favicon. And the customer reading it cannot see the training pipeline, the contracts with the upstream LLM provider, what gets logged where, or who else in the supply chain (the embedding provider, the analytics vendor, the ticketing system, the error tracker) has a copy of what went through. They're reading one sentence and trusting that it summarizes a dozen architectural and contractual facts they will never inspect.

The way to tell whether a vendor means it is to ask what's behind the sentence. What does the data path look like, hop by hop? Who has agreed not to train on it, in writing? What is logged? Who is on the subprocessor list? What is the retention period? You do not need to know the answers in detail to use them. You just need to ask, and watch which vendors answer in specifics versus which ones restate the headline.

What "training on customer data" actually means

The phrase has gotten loose, so let's be narrow about it. In the sense buyers care about, training on customer data means a vendor (or a provider the vendor uses) takes prompts, responses, stored memories, or other customer content and uses them as training data, whether that's supervised fine-tuning, RLHF preferences, or pre-training corpus additions, for a model the vendor or provider eventually deploys to other customers.

The harm is concrete. Private data (a product roadmap, an internal incident report, a memory about a teammate's salary, the contents of an executive's calendar) gets embedded into model weights. Those weights serve other customers' queries. A skilled prompter, or an unlucky one, can extract verbatim or near-verbatim fragments later. We've seen it with public web data, and there's no reason to assume it can't happen with private data; we simply have less evidence because the data starts private.

That specific harm is what the "we don't train on customer data" sentence is supposed to prevent. It's a narrower claim than "no part of your data is ever processed by a third party," which would be a different and basically impossible promise. Third parties process the data, obviously. The LLM provider runs inference. The cloud provider hosts the database. The embedding service generates vectors. What the sentence is meant to settle is whether any of those processors retains content past the request, whether any of them uses it as training data, and whether the contracts and the architecture line up well enough to make both of those answers unambiguous.

The architectural prerequisites

There's no one-liner architecture that gives you "we don't train on customer data" for free. The most important piece, by a wide margin, is where the inference call is billed. The rest matters too, but it's variations on a theme once you've answered that question.

BYOK on the inference path

The strongest version of "we don't train on your data" is the one where the vendor never had the data in a way that would let them train on it. That's BYOK on the inference path: bring your own key, where the customer's API key for the underlying LLM provider does the inference call, billed to the customer's account.

This matters because of who the upstream provider thinks the customer is. When inference happens against the customer's own provider account, the upstream's training-data policy applies to the customer, not to the vendor. The customer's OpenAI no-training assurance, the customer's Anthropic zero-retention tier, the customer's enterprise contract with whichever provider: those govern the prompt and the response. The vendor sits in the middle of the pipe shaping requests and parsing responses, but the content goes from the customer's product to the customer's LLM provider and back, on the customer's bill. The vendor cannot use it for training because the vendor does not own the relationship with the LLM provider for that data. It is, structurally, not theirs to train on.

This is the cleanest version, and it's also the version most vendors avoid. The reason is straightforward. The inference markup is what funds most of the AI product industry. If a vendor is billing you per token, per query, per "memory write," at a rate that includes inference cost, they're almost certainly the inference principal, meaning their relationship with the upstream LLM provider is what governs your prompts, not yours. That isn't automatically bad. It just means everything in the rest of this post matters more, because you've lost the structural guarantee and you're back to relying on contracts and operational discipline.

The reason we keep coming back to this is that the rest of the architectural posture is recoverable from bad to good through engineering work. Sloppy logging can be fixed. A missing subprocessor list can be published next week. An upstream agreement can be re-signed at the next renewal. The inference principal is harder to change because changing it usually requires rebuilding the business model. So when a buyer asks one question, this is the one to ask first.

Zero-retention with upstream providers

If the vendor is the inference principal, the upstream provider's policies are what determine whether your data could be used for training. Every major LLM provider offers some form of no-training or zero-retention tier (OpenAI's API default, Anthropic's commercial terms, the cloud-hosted variants on Azure, Bedrock, and Vertex), but the stronger protections are not on by default for every account, and no-training is not the same thing as zero-retention, which is not the same thing as "we delete logs on request." A vendor has to ask, sign, and sit on the right contract tier. The answer you want from a vendor here is specific: which provider, which tier, what it covers, what the retention window is for transient abuse-monitoring logs. "We trust our providers" is not an answer.

No payload logging

Even if the upstream provider isn't training on your content, the vendor's own infrastructure is a logging surface, and once a payload is in a log aggregator it inherits that aggregator's retention, access controls, and breach posture. The right answer is to log the metadata you need to operate (tenant ID, timestamp, status, latency, error class, request size) and not the bodies. Most vendors log more than they need to, usually because someone set up logging for a fast dev feedback loop and never tightened it. Asking what's logged and for how long is one of the highest-leverage questions a buyer can ask.
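One way to make "metadata, not bodies" hard to violate by accident is to give the log record a fixed shape with no slot for content. The following is a minimal sketch under that assumption, not any vendor's actual logging code; the names `RequestMeta`, `build_log_record`, and the `request_audit` logger are hypothetical.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger("request_audit")  # hypothetical logger name


@dataclass(frozen=True)
class RequestMeta:
    """The only fields allowed into the log. There is no field for a body."""
    tenant_id: str
    timestamp: float
    status: int
    latency_ms: float
    error_class: Optional[str]
    request_size: int


def build_log_record(tenant_id: str, body: bytes, status: int,
                     started_at: float,
                     error: Optional[BaseException] = None) -> dict:
    """Reduce a request to operational metadata.

    The body is measured (its length) but never copied into the record,
    and only the exception's class name is kept, not its message.
    """
    now = time.time()
    meta = RequestMeta(
        tenant_id=tenant_id,
        timestamp=now,
        status=status,
        latency_ms=round((now - started_at) * 1000, 3),
        error_class=type(error).__name__ if error is not None else None,
        request_size=len(body),
    )
    return asdict(meta)


def log_request(tenant_id: str, body: bytes, status: int,
                started_at: float,
                error: Optional[BaseException] = None) -> None:
    """Emit the metadata-only record as one JSON line."""
    logger.info(json.dumps(build_log_record(tenant_id, body, status, started_at, error)))
```

The design choice worth noting is that the exception message is dropped along with the body, since error strings are a common place for payload fragments to leak into logs.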

Published subprocessor list

Every modern SaaS vendor uses subprocessors: cloud hosts, observability platforms, ticketing, analytics. A published subprocessor list doesn't prove anything by itself, but it gives the buyer enough information to ask the next question for each one (do they train, what does the DPA say, are they on a no-training tier). No list is a yellow flag. Refusing to share it under NDA is a red flag.

The contractual prerequisites

Architecture without contracts is a promise that lasts until the next pivot. Contracts without architecture are paper that doesn't bind the systems. You need both. The contractual side of "we don't train on customer data" lives in the DPA, the MSA, and the subprocessor agreements downstream, and four pieces of language do most of the work: an explicit prohibition on training that also covers derivative works (so a vendor can't agree not to train on raw inputs while quietly reserving the right to train on embeddings or aggregations); audit rights beyond SOC 2 Type II, since SOC 2 attests that controls exist but does not specifically attest the no-training claim; subprocessor notification with a defined objection window and a defined remedy, rather than "we may add subprocessors from time to time"; and a breach notification timeline tight enough to be useful. Almost every enterprise DPA negotiation we've been in has come down to specifics on this list, and a buyer who walks in asking for the specific language gets a much better contract than one who reads the trust page and clicks accept.

What we do at Engram, specifically

Engram is BYOK, default and only. Customers bring their own API keys for the LLM providers we use in the retrieval pipeline (extraction at write time, the composer call at read time, optional re-ranking), and inference calls go from our server, authenticated with the customer's key, to the customer's provider account, billed on the customer's bill. We don't have a hosted-inference mode where Lumetra is the inference principal, and we don't bill per token or per query against our own provider relationship. That decision means we make less money per customer than we could, and it means the upstream provider's no-training assurance applies to the customer's account rather than ours. We had this argument internally during pricing design; the post on our pricing model is explicit about why we rejected the alternatives.
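The "no vendor-owned inference path" property can be illustrated with a key lookup that has exactly one source of credentials and no default. This is a hedged sketch, not Engram's code: `KeyStore`, `ProviderKeyMissing`, and `credentials_for_call` are hypothetical names, the in-memory dictionary stands in for whatever credential storage is really used, and the 412 status follows the behaviour this post describes for tenants without a configured key.

```python
from dataclasses import dataclass, field
from typing import Dict


class ProviderKeyMissing(Exception):
    """Raised when a tenant has not configured a key for the requested provider."""
    http_status = 412  # Precondition Failed


@dataclass
class KeyStore:
    # tenant_id -> provider name -> customer's API key (hypothetical stand-in)
    keys: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def resolve(self, tenant_id: str, provider: str) -> str:
        """Return the customer's own key, or fail. There is no vendor fallback."""
        try:
            return self.keys[tenant_id][provider]
        except KeyError:
            raise ProviderKeyMissing(
                f"tenant {tenant_id!r} has no key configured for {provider!r}"
            ) from None


def credentials_for_call(store: KeyStore, tenant_id: str, provider: str) -> dict:
    """Build the credentials for one inference call from the tenant's key only."""
    return {"provider": provider, "api_key": store.resolve(tenant_id, provider)}
```

The point of the shape is what is absent: no module-level default key and no `except` branch that substitutes one, so an unconfigured tenant produces an error instead of a call on the vendor's account.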

BYOK does require us to hold customer credentials so the server can make the call on the customer's behalf, and plaintext storage of API keys is unacceptable. We use AES-256-GCM under a 32-byte master key (BYOK_MASTER_KEY) held in environment configuration, with the ciphertext stored as base64(nonce(12) || ct || tag(16)) and a per-row version tag so we can rotate the master key without re-encrypting every row in a single transaction. The master key lives outside the database, so a full table-dump of the credentials column doesn't yield usable API keys without also obtaining the key from server config. The relevant code is in src/hybrid/byok_config.py; BYOK is live and required end-to-end today. A tenant that hasn't configured a provider key gets a 412 from the inference endpoints rather than a silent fallback to a vendor-owned key.
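For readers who want to picture the envelope, here is a minimal sketch of the layout described above, using the third-party `cryptography` package. It is an illustration under stated assumptions, not the contents of src/hybrid/byok_config.py: the function names, the idea of a `master_keys` mapping from version number to key, and the row-at-a-time `rotate_row` helper are all hypothetical, and how BYOK_MASTER_KEY is encoded in the environment is not specified here.

```python
import base64
import os
from typing import Dict, Tuple

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE_LEN = 12  # bytes; prepended to the ciphertext
TAG_LEN = 16    # bytes; AESGCM.encrypt appends the tag to the ciphertext


def encrypt_api_key(plaintext: str, master_key: bytes) -> str:
    """Return base64(nonce || ct || tag) for a single credential."""
    if len(master_key) != 32:
        raise ValueError("master key must be 32 bytes for AES-256")
    nonce = os.urandom(NONCE_LEN)  # fresh random nonce for every encryption
    ct_and_tag = AESGCM(master_key).encrypt(nonce, plaintext.encode("utf-8"), None)
    return base64.b64encode(nonce + ct_and_tag).decode("ascii")


def decrypt_api_key(blob: str, key_version: int,
                    master_keys: Dict[int, bytes]) -> str:
    """Decrypt a stored blob with the master key named by the row's version tag."""
    raw = base64.b64decode(blob)
    nonce, ct_and_tag = raw[:NONCE_LEN], raw[NONCE_LEN:]
    # Raises cryptography.exceptions.InvalidTag if the key or data is wrong.
    return AESGCM(master_keys[key_version]).decrypt(nonce, ct_and_tag, None).decode("utf-8")


def rotate_row(blob: str, old_version: int, new_version: int,
               master_keys: Dict[int, bytes]) -> Tuple[str, int]:
    """Re-encrypt one row under a newer master key and return (blob, version)."""
    plaintext = decrypt_api_key(blob, old_version, master_keys)
    return encrypt_api_key(plaintext, master_keys[new_version]), new_version
```

Because each row carries its own version, old and new master keys can coexist while rows are migrated one at a time, which is what makes rotation possible without a single large transaction.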

On logging, we capture request metadata (tenant ID, timestamp, request path, status, latency, error class, request size) and nothing else. No payload bodies, not for queries, not for memory writes, not for retrieval responses. Error traces that include payload fragments are redacted server-side before they leave the application boundary, so what surfaces is "request 4f2b9c1d failed at the extraction step with error type X" rather than the prompt that triggered the failure. Operational logs retain for 30 days and are then deleted; billing summaries, which are aggregated counts with no content, are kept indefinitely because they're the basis of our invoices. When a support ticket requires us to look at a specific record, we pull it from the live database with the customer's consent on the ticket and discard the working copy when the ticket closes. We don't pre-emptively capture payloads "in case we need to debug later."
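The redaction and retention rules above can be sketched in a few lines. This is an assumption-laden illustration, not the production code: `redact_failure` and `is_expired` are hypothetical names, and the sketch assumes redaction means dropping the exception message entirely and keeping only its type.

```python
from datetime import datetime, timedelta, timezone

OPERATIONAL_LOG_RETENTION = timedelta(days=30)


def redact_failure(request_id: str, step: str, exc: BaseException) -> str:
    """Describe a failure without using str(exc), which may embed payload text."""
    return (f"request {request_id} failed at the {step} step "
            f"with error type {type(exc).__name__}")


def is_expired(logged_at: datetime, now: datetime) -> bool:
    """True once an operational log entry is older than the retention window."""
    return now - logged_at > OPERATIONAL_LOG_RETENTION
```

For example, `redact_failure("4f2b9c1d", "extraction", ValueError("could not parse: ..."))` yields a line that names the request and the error type but carries none of the text that triggered the failure.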

And the customer's actual data, the memories stored in their buckets, is not used as training data for any model. It's not sent to a training pipeline, not aggregated into a fine-tuning corpus, not embedded into a model we deploy. Memories flow into the retrieval pipeline only on the customer's own queries, the composer call runs against the customer's BYOK provider, and the retrieval pipeline itself is deterministic: it ranks memories, fuses signals, returns the top results. The LLM-augmented steps all run on per-call inference against the customer's provider account. None of them populate a training corpus.

The hard case

The honest version of this post has a section on the cases that don't fit neatly into "we never train on your data." The one we want to dwell on is support.

When a customer opens a ticket to help us debug a retrieval issue, they often paste a memory, a query, or a snippet of their conversation history into the description. That content lands in our ticketing system, which is a subprocessor, with its own retention policy and its own access controls. We do not use ticket content for product improvement or for training, and the contracts with the ticketing vendor reflect that. But the content is there. It sat in an email inbox on the way in. It will sit in a search index on the way out. A support engineer pulled it up on a monitor in an office.

We don't think we can engineer this risk surface away. The whole point of a support ticket is that a human has to look at the thing, and the thing is usually whatever was failing for the customer at the time. What we can do is be deliberate about it. Retention follows the ticketing tool's default unless we delete the ticket explicitly, and we will delete on close if a customer asks. We ask customers to redact what they can before pasting, even though we know that asking the person whose system is broken to also do data hygiene at the moment they're frustrated is a losing battle some of the time. We have an internal rule that an engineer who pulls a record from production for a ticket discards their working copy when the ticket closes, and we audit for that, though "we audit for that" is a phrase that does a lot of work and we'd rather be honest that it's a process control with a human in the middle than pretend it's an air-gapped capability.

The reason we're spending paragraphs on this rather than burying it is that it's the place where the gap between the marketing sentence and the truth lives, for us. The training pipeline is structurally clean. The logging is clean. The credentials are encrypted. But customer data does land in our ticketing system when customers send it to us, and if you're evaluating us on whether the no-training claim is real, this is the seam to push on.

Two smaller cases worth mentioning briefly. The embedding model: we generate embeddings server-side using a small open-source model that runs on our infrastructure, fixed, not fine-tuned on customer data, and we don't collect customer text to improve it. If we ever changed that, we'd announce it in advance and offer a BYO-embedding option. And telemetry: we collect counts, latencies, and error-class distributions to operate the service, none of which contains memory or prompt content. The seam there is high-cardinality string fields where a developer can accidentally pipe a payload fragment in. We audit for it; we refuse free-text fields that could carry content. It's a real concern but a tractable one.
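The "refuse free-text fields" rule for telemetry can be approximated with a validator that accepts numbers and a closed set of labels and rejects everything else. The sketch below is illustrative only; `validate_telemetry` and the contents of `ALLOWED_ERROR_CLASSES` are hypothetical, and a real service would define its own label set.

```python
from typing import Mapping

# Hypothetical closed vocabulary; anything outside it is treated as free text.
ALLOWED_ERROR_CLASSES = frozenset({"timeout", "upstream_4xx", "upstream_5xx", "validation"})


def validate_telemetry(event: Mapping[str, object]) -> dict:
    """Accept counts, latencies, and known error-class labels; reject the rest.

    The rejection message names the field but never echoes its value,
    so a stray payload fragment does not end up in an error log instead.
    """
    clean = {}
    for name, value in event.items():
        if isinstance(value, (int, float)):
            clean[name] = value
        elif (name == "error_class" and isinstance(value, str)
              and value in ALLOWED_ERROR_CLASSES):
            clean[name] = value
        else:
            raise ValueError(f"telemetry field {name!r} rejected: not numeric or an allowed label")
    return clean
```

A validator like this does not remove the need for review, since a developer can still encode content in a field name, but it narrows the seam to something a process control can realistically cover.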

Questions a buyer should ask any vendor

If you're evaluating any AI vendor (memory, agent platform, copilot, RAG-as-a-service), two questions will reveal more than the security page ever will.

The first is who the inference principal is. Is the LLM call billed to your account or to mine? If it's yours, which provider, which tier, and what does your contract with them say about training and retention? If it's mine, confirm that the only credentials with the LLM provider are mine and that you do not maintain a fallback inference path under your account. The answer routes the rest of the conversation: if the vendor is the principal, the next questions are about their upstream contracts; if you are, the next questions are about logging and what the vendor's middle-of-the-pipe servers are doing with the payload.

The second is what the vendor logs, and for how long. Specifically: do prompts, responses, or retrieved memories land in any log as payload bodies, where do those logs go (application logs, an observability vendor, a SIEM, an error tracker), and what is the retention period on each? Vendors who can answer in specifics have thought about the problem. Vendors who can't have either not thought about it, or have built fine systems and just never written it down; you can usually tell which is which by how they react when you ask.

Close

The sentence is cheap to type onto a webpage. The architecture, the contracts, and the audit discipline that make it actually true are not, and each of those carries a real cost: BYOK on the inference path gives up the margin most agent-infra companies make their money on, zero-retention with upstream providers narrows your operational surface for debugging, not logging payloads means flying with less visibility than you'd want, and a published subprocessor list is a thing you have to update every time you onboard a vendor.

We've made those trades at Engram because the version of the product where the sentence is honest is the only one we'd want to build. That said, we haven't solved every problem in this space. Support tickets are still a way customer data leaves their environment, telemetry-field discipline lives in process and not in architecture, and the embedding model is something we run rather than something the customer brings. We watch those gaps, and we'd rather acknowledge them than write a security page that pretends they aren't there.

If you're reading this as a buyer, the most useful thing to do with the framing is to take it to the next vendor pitch and ask for specifics. The answers will tell you most of what you need to know about which parts of the claim are real and which aren't. And the question itself, asked often enough, eventually moves the industry to a place where the marketing sentence costs the vendor something to type.