Guidance
Latency budgets for agent memory: what to measure and what to expect
If you are picking a memory layer for an agent or building one yourself, you need a latency budget. The usual one-liner is "memory under 200ms p95." It conflates four very different operations and gives you the wrong target on three of them. This is the version of that conversation we wish we had been handed when we started.
A read-only auth check and a query that triggers a synthesis call are different physical events. They share a URL prefix and a bearer token and almost nothing else. One is a hash compare and a row fetch. The other is a hash compare, a row fetch, three index queries, a fusion pass over the results, an LLM roundtrip to a provider you don't control, and a JSON serialization. The first finishes in single-digit milliseconds. The second finishes when the model finishes thinking.
Quoting one latency number for "memory" averages those together and tells you nothing useful about either. This post is the framing we use internally for setting per-operation budgets, the ranges we think are plausible on each class for a memory layer that looks roughly like ours, and the few specific optimizations we have done that moved the numbers in ways worth writing down.
We are not publishing a measured benchmark here. We have not run one at the scale needed to publish a credible table, and we did not want to ship one that looked credible without actually being credible. Where we cite specific numbers below, they are from changes we made that produced before-and-after deltas large enough to be unambiguous. Everywhere else, the post is guidance: what to measure, what to expect, and where the leverage actually is.
The budget, framed honestly
Agent UX research tends to converge on the same numbers for human-perceived snappiness. Under 100ms feels instant, under 300ms feels responsive, under 1s feels acknowledged, under 3s feels slow, over 10s feels broken. When people say "memory under 200ms p95," they are picking the second tier and applying it to the whole product. That works for the read side. It does not work for the write or the synthesis side without redefining what "memory" means.
A memory system is at least four different kinds of operation, each with its own physics. Treat them as separate budgets:
- Read-only auth-gated lookups. Things like GET /router/buckets, GET /account/plan_usage, GET /v1/buckets/{id}/profile. One Postgres query, sometimes a small join, no external dependency. Plausible target: under 50ms p95 once you have done the obvious work (the bcrypt note below is the first piece of that).
- MCP tool dispatch with no LLM call. Things like list_memories, list_buckets, delete_memory. Same underlying read, plus the MCP SSE handshake and tool-dispatch overhead. Plausible target: under 200ms p95 with warm transport state, higher on cold sessions.
- Memory writes with model-assisted extraction. A store_memory call that runs an LLM extractor to pull subject-predicate-object triples out of the input. One LLM call dominates the budget; database writes are noise. Plausible range: 800ms–2s p95, set entirely by the extractor provider.
- Queries with model-assisted synthesis. A query_memory call that composes an answer from retrieved memories. One LLM call against a longer prompt, often a longer output. Plausible range: 1.5s–5s p95, again set by the provider.
Putting these four classes on a single budget number erases the only useful information. The read side is bounded by your code. The LLM side is bounded by someone else's model. You optimize them differently and you report them separately. If a vendor quotes one number, ask which of the four they are quoting and what the others look like.
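If it helps to make the per-operation framing concrete, here is a minimal sketch of those budgets as configuration rather than prose. The operation labels are hypothetical; the thresholds are the plausible targets from the list above, not measured SLOs.

```python
# Illustrative per-operation p95 budgets in milliseconds, not a measured SLO.
# Operation labels are hypothetical; map them to your own routes and tool names.
P95_BUDGET_MS = {
    "read_auth_lookup": 50,           # GET /router/buckets and friends
    "mcp_dispatch_no_llm": 200,       # list_memories, list_buckets, delete_memory
    "memory_write_extraction": 2000,  # store_memory with an LLM extractor
    "query_with_synthesis": 5000,     # query_memory with a synthesis LLM call
}

def over_budget(operation: str, observed_p95_ms: float) -> bool:
    """True when the observed p95 for an operation class exceeds its budget."""
    return observed_p95_ms > P95_BUDGET_MS[operation]
```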
What to actually measure
The minimum useful instrumentation is one record per request with a request ID, an operation label, an outcome flag, and a wall-clock latency at the server boundary. Add structured per-hop timings under the same request ID so you can attribute totals: auth-decode, db-read, retrieval-fuse, llm-call, serialize. Most of the value comes from being able to ask "where did this 1.8s actually go" after the fact, not from any specific percentile in a dashboard.
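A minimal sketch of that record, assuming a Python service and standard-library logging. The hop names match the ones above; the class name and JSON-per-line output are illustrative choices, not our actual instrumentation.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

log = logging.getLogger("latency")

class RequestTimings:
    """One record per request: request ID, operation label, outcome, per-hop timings."""

    def __init__(self, operation: str):
        self.request_id = str(uuid.uuid4())
        self.operation = operation
        self.hops_ms: dict[str, float] = {}
        self._start = time.perf_counter()

    @contextmanager
    def hop(self, name: str):
        """Time one hop: auth-decode, db-read, retrieval-fuse, llm-call, serialize."""
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.hops_ms[name] = (time.perf_counter() - t0) * 1000.0

    def finish(self, ok: bool) -> None:
        """Emit the whole record under one request ID so totals can be attributed later."""
        record = {
            "request_id": self.request_id,
            "operation": self.operation,
            "ok": ok,
            "total_ms": (time.perf_counter() - self._start) * 1000.0,
            "hops_ms": self.hops_ms,
        }
        log.info(json.dumps(record))
```

Wrapping each hop in `timings.hop("db-read")` and so on is what lets you answer "where did this 1.8s actually go" after the fact.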
A few practical notes from our own setup. Capture latency at the server boundary, not at the client; client-side roundtrip adds five to forty milliseconds of network depending on region and is not yours to optimize. Drop samples above some cutoff (we use 30 seconds) before computing percentiles, and report the count and root cause of the cutoff cases separately. Most of those will trace back to an LLM provider that hung. If you fold them into the tail, your p99 will be a story about your vendor's bad day instead of about your system.
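The cutoff handling is small enough to show. A sketch, assuming latencies already collected in milliseconds; the 30-second cutoff is the one we use, and the percentile math is a plain nearest-rank approximation.

```python
CUTOFF_MS = 30_000  # hard cutoff; samples above this are counted, not folded into percentiles

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile, q in [0, 1]. Good enough for budgeting."""
    if not samples:
        raise ValueError("no samples below the cutoff")
    ordered = sorted(samples)
    index = min(len(ordered) - 1, max(0, round(q * (len(ordered) - 1))))
    return ordered[index]

def summarize(latencies_ms: list[float]) -> dict:
    kept = [x for x in latencies_ms if x <= CUTOFF_MS]
    return {
        "p50_ms": percentile(kept, 0.50),
        "p95_ms": percentile(kept, 0.95),
        "p99_ms": percentile(kept, 0.99),
        # Investigate these separately; most trace back to a hung LLM provider.
        "over_cutoff_count": len(latencies_ms) - len(kept),
    }
```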
Grade outcomes by response-shape match rather than by latency. A slow correct response is still a pass. You want latency distributions over the requests that did what they were supposed to, not a distribution polluted by error paths that returned quickly because they bailed out early.
The smoke harness pattern is worth copying. Run a fixed set of checks (one operation against one fixture: a known bucket, a known query, a known expected shape) on a per-deploy cadence, write the timings to a small append-only store, and chart them. You do not need a million synthetic queries to know whether last week's deploy made the median read slower.
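A sketch of the shape, assuming a hypothetical client object and SQLite as the append-only store; the fixture names and checks are placeholders, not our actual harness.

```python
import sqlite3
import time

# Hypothetical fixed checks: one operation against one fixture with a known expected shape.
CHECKS = [
    ("list_buckets", lambda client: client.list_buckets()),
    ("profile_read", lambda client: client.get_profile("fixture-bucket")),
    ("query_known", lambda client: client.query("what does the fixture say?")),
]

def run_smoke(client, db_path: str = "smoke_timings.db") -> None:
    """Run the fixed checks once per deploy and append timings to a small store."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS timings (ts REAL, check_name TEXT, ok INTEGER, ms REAL)"
    )
    for name, check in CHECKS:
        t0 = time.perf_counter()
        try:
            check(client)
            ok = 1
        except Exception:
            ok = 0
        ms = (time.perf_counter() - t0) * 1000.0
        conn.execute("INSERT INTO timings VALUES (?, ?, ?, ?)", (time.time(), name, ok, ms))
    conn.commit()
    conn.close()
```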
Where the milliseconds go on each class
The per-hop breakdown matters more than the total if you are trying to figure out what to optimize. Here is the shape we see on each of the four classes. Numbers are illustrative ranges, not measurements; the point is the ratios.
Read-auth (router/buckets)
Auth decode against a keyed HMAC is sub-millisecond. A single indexed select on a small table is a few milliseconds. Serialization and framework overhead are another millisecond or two. Total floor: under 10ms p95 once auth is not bcrypt (see below). There is no slack worth chasing past that; the operation is already at the floor of what a roundtrip through your web framework and Postgres costs.
Read-derived (plan usage, profile metadata)
The hop that grows is the database side, because you are now aggregating across tables instead of fetching one row. The p99 tail on this class often comes from rare lock contention on rollup-write paths that briefly block reads. Batching the writes is usually enough to bring p99 down by a factor of two or more.
MCP non-LLM dispatch
Now you are paying for the MCP transport. SSE handshake or streamable-HTTP session lookup, tool dispatch (resolving the named tool and validating args), the underlying DB read, tool-response framing, SSE send. Most of the latency in this class is the MCP layer, not the work. Caching session state and short-circuiting handshake on warm connections is where the wins live.
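In the simplest case, "caching session state" is just a TTL map keyed by session ID, consulted before redoing the handshake work. A generic sketch, deliberately not tied to any particular MCP SDK; the session object and its construction cost are assumptions.

```python
import time

class SessionCache:
    """TTL cache for per-session transport state so warm connections skip the handshake."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, session_id: str):
        entry = self._entries.get(session_id)
        if entry is None:
            return None
        created, state = entry
        if time.monotonic() - created > self._ttl:
            del self._entries[session_id]
            return None
        return state

    def put(self, session_id: str, state: object) -> None:
        self._entries[session_id] = (time.monotonic(), state)
```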
Memory writes with extraction
The extractor LLM call is the budget. The DB write of the memory plus extracted triples is noise next to it. The tail (the slowest one percent) is dominated by provider rate-limit responses that trigger internal retries. Embedding generation locally (a sentence-transformer running on the same box) costs tens of milliseconds and does not move the budget. If you are watching extraction p99 climb, look at your extractor provider's status page before you look at your code.
Queries with synthesis
MCP transport overhead, hybrid retrieval (BM25 plus vector plus graph, fused via reciprocal rank fusion), synthesis LLM call, response. Retrieval on a well-indexed corpus is fast and consistent. Synthesis is slow and variable. The synthesis hop holds the budget; everything else is single-digit percent of total latency. The tail is cleaner than the store-path tail because synthesis is reliably the dominant hop. If it is slow, the request is slow; there is not a fat infrastructure-side tail competing for attention.
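Reciprocal rank fusion itself is small enough to show inline. A sketch over ranked lists of document IDs, using the conventional k=60 constant; the fused ordering is what feeds the synthesis prompt.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs (BM25, vector, graph) into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the conventional constant from the original RRF formulation.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```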
The headline from this breakdown is simple. On read paths, the bottleneck is Postgres, and most of the work is done. On MCP non-LLM paths, the bottleneck is the MCP transport, and there is still meaningful room. On LLM-bound paths, the bottleneck is somebody else's GPU, and your contribution to total latency is small enough that further optimization on your side is the wrong place to spend the week.
p50 vs p95 vs p99: which one to budget against
We report p95 in any external material when we report a single number. p95 is the right summary statistic for budgeting agent UX. It is the latency the typical user sees on a typical bad day, which is the boundary at which "responsive" starts to feel like "broken." A product whose p95 is held to budget feels reliably fast. A product whose p50 is held to budget but whose p95 wanders feels jittery. The user notices the slow ones, not the fast ones.
We report p99 to oncall. p99 is an incident-response number. It tells you about lock contention, provider outages, and resource-pool exhaustion. Customers do not experience p99 directly often enough to budget around it; engineers experience p99 as the shape of their pager. Watch its ratio to p95 rather than the raw value. A p99 that is more than 5x p95 means you have a tail issue (provider transient, lock contention, GC pause). A p99 that sits at 2x to 3x p95 is a healthy distribution even if the absolute number sounds slow.
We almost never report p50. p50 understates real perceived latency for the same reason "average page load time" understates real perceived page load time. Half your requests are slower than the median by definition, and the users who land in the slow half are the ones who write about you on Twitter. p50 is useful for measuring the floor (how fast is fast?), and we use it internally when we want to know how cheap an operation has gotten. We do not use it externally.
One rule we use internally: hold p95 to budget; let p99 be at most 1.5x p95 unless the tail is structurally heavy, and accept higher when it is. The structurally heavy cases (auth-gated reads where infrastructure transients dominate, LLM-bound paths where provider transients dominate) get capped at the 30s cutoff and surfaced cleanly to customers, not optimized below the natural floor.
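Stated as code, the rule is a ratio check, not an absolute threshold. A sketch; the classification labels are ours to illustrate the thresholds above, not output from any real monitoring system.

```python
def tail_health(p95_ms: float, p99_ms: float, structurally_heavy: bool = False) -> str:
    """Classify the tail by its ratio to p95, per the rules described above."""
    if structurally_heavy:
        return "accepted"           # provider-bound or infra-transient paths: cap and surface
    ratio = p99_ms / p95_ms
    if ratio <= 1.5:
        return "within budget"      # internal rule: p99 at most 1.5x p95
    if ratio <= 3.0:
        return "healthy but watch"  # 2x-3x looks like a normal distribution
    if ratio > 5.0:
        return "tail issue"         # lock contention, provider transient, GC pause
    return "investigate"
```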
The bcrypt to HMAC dividend
One specific finding from our own work, because the lesson generalizes. Until last month, API-key verification on every authenticated request was a bcrypt compare. Bcrypt is the right primitive for password hashing, where you want intentional slowness to deter offline attacks. It is the wrong primitive for API-key verification, where the key has 256 bits of entropy and offline brute force is already infeasible. The slowness was costing us roughly 200ms per request, on every authenticated path, all the time.
We swapped to keyed HMAC-SHA256 against a per-deployment secret, with a fallback path for legacy bcrypt-hashed keys still in the database. The migration is in src/hybrid/shared_utils.py (the compute_api_key_hmac helper). Read-path p50 dropped from a couple hundred milliseconds to single digits across every authenticated operation. The cheap reads went from looking expensive to looking cheap.
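The shape of the swap, as a sketch rather than the actual helper: keyed HMAC for new keys, a fallback for rows still holding bcrypt hashes. The function name matches the one mentioned above, but the body here is illustrative.

```python
import hashlib
import hmac

import bcrypt  # only needed for the legacy fallback path

def compute_api_key_hmac(api_key: str, deployment_secret: bytes) -> str:
    """Keyed HMAC-SHA256 of the API key: microseconds instead of ~200ms of bcrypt."""
    return hmac.new(deployment_secret, api_key.encode(), hashlib.sha256).hexdigest()

def verify_api_key(api_key: str, stored_hash: str, deployment_secret: bytes) -> bool:
    # Legacy keys were stored as bcrypt hashes; take the slow path for those rows only.
    if stored_hash.startswith(("$2a$", "$2b$", "$2y$")):
        return bcrypt.checkpw(api_key.encode(), stored_hash.encode())
    candidate = compute_api_key_hmac(api_key, deployment_secret)
    return hmac.compare_digest(candidate, stored_hash)
```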
The actual lesson is not "use HMAC instead of bcrypt." The lesson is that a constant cost evenly distributed across every request will not show up as a single slow query in profiling. It will not flag in any dashboard. You find it by asking "what is the floor for the simplest possible authenticated request, and does our actual number sit near that floor?" If the answer is no, there is a wrong primitive somewhere in the hot path. That is the hardest kind of latency work and almost always the highest leverage.
The pre-warm dividend
The other specific finding worth recording. The first MCP tool call after a process restart used to be around seven seconds. The cause was lazy loading: the sentence-transformer embedding model was not loaded until the first call that needed it, which meant the first user request paid the full cold-start cost (model download check, weight load, allocation, first-call JIT warmup).
The fix is in src/hybrid/mcp_sse_server.py: a pre-warm hook that runs during application startup, before uvicorn binds the listening socket. It loads the embedder, runs one inference against a fixed sentinel string, and returns. The startup cost is paid by the deploy pipeline, not by a user. After the change, first-call latency drops into the same distribution as steady-state. There is no observable cold-call penalty.
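A sketch of the pattern, assuming a FastAPI app served by uvicorn and the sentence-transformers library; the model name and sentinel string are placeholders, not our actual configuration.

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

EMBEDDER = None

def prewarm_embedder(model_name: str = "all-MiniLM-L6-v2") -> None:
    """Load the embedding model and run one inference so no user request pays cold start."""
    global EMBEDDER
    EMBEDDER = SentenceTransformer(model_name)  # weight load and allocation happen here
    EMBEDDER.encode("prewarm sentinel")          # first-call warmup happens here

@asynccontextmanager
async def lifespan(app: FastAPI):
    prewarm_embedder()  # runs during startup, before the server accepts connections
    yield

app = FastAPI(lifespan=lifespan)
```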
This is the kind of fix that does not move any of the percentile numbers in a smoke harness, because the smoke harness does not sample cold-start calls representatively. It does move the user-perceived experience for the customer whose agent is the first to wake the worker. We noticed it in support tickets going away rather than in dashboards changing shape. Worth doing; not worth quoting a percentile against.
Failure modes and the cutoff
Pick a hard cutoff (we use 30 seconds), drop samples above it before computing percentiles, and report the count and root cause of the cutoff cases separately. On the read side, hitting a 30s cutoff is vanishingly rare and almost always indicates a Postgres incident. On the LLM-bound side, the vast majority of timeouts trace back to the upstream provider: a model server that hung, a rate-limit retry that overflowed the timeout, a connection reset that was not recovered. Most of these are upstream incidents you do not control.
The thing to do about this matters more than the number. Surface timeouts to the customer as a typed error with the upstream-provider hint, rather than absorbing them as a generic 500. A customer who sees "your provider returned a 504 after 30s" knows where to look. A customer who sees "memory failed" looks at you. Both classes of customer want a stack trace; only one of them can act on it.
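A sketch of the typed-error shape, with hypothetical field names; the point is that the payload names the provider and the wait, so the customer knows where to look.

```python
class UpstreamTimeout(Exception):
    """Typed error for LLM-provider timeouts, surfaced instead of a generic 500."""

    def __init__(self, provider: str, status: int | None, waited_seconds: float):
        self.provider = provider
        self.status = status
        self.waited_seconds = waited_seconds
        super().__init__(
            f"{provider} returned {status or 'no response'} after {waited_seconds:.0f}s"
        )

def to_error_payload(err: UpstreamTimeout) -> dict:
    """What the customer sees: where to look, not just 'memory failed'."""
    return {
        "error": "upstream_timeout",
        "provider": err.provider,
        "upstream_status": err.status,
        "waited_seconds": err.waited_seconds,
        "hint": "Check your provider's status page or rate limits.",
    }
```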
We do not try to chase the 100th-percentile number below the cutoff. Below that line, the work is on classification and surfacing. Above it, the work is on retry-and-fail-cleanly. Both are higher-leverage than shaving milliseconds off a path that is already bounded by a provider you do not run.
The honest tradeoff on chasing p99
You can always push the p99 down further. The question is what it costs.
On read paths, pushing a p99 that already sits under 50ms down below 30ms typically requires pinning Postgres connections per worker, dedicating a read replica, and tuning autovacuum more aggressively. The engineering cost is roughly a week per percentile point, the operational cost is permanent, and the user impact is bounded. A 20ms improvement at p99 changes nothing visible. We have decided not to.
On LLM-bound paths, halving the p99 generally requires parallel-calling redundant providers and racing them, which doubles BYOK cost for the customer and adds operational complexity we do not want to own. We have decided not to.
That decision is a tradeoff against fractional-customer impact. A small number of customers will hit the p99 in any given hour. We would rather spend the engineering effort on features they will all use than on percentile points only a few of them will ever observe. If you are running a different product (a tight inner-loop agent, a financial use case where every tail matters) you will draw the line in a different place. The framing is per-operation; the choice of where to draw the line is yours.
The framing, stated once
Latency budgets are per-operation, not per-product. Quoting "memory latency under X ms" without naming which operation is the same kind of trick as quoting "average web request time" — it averages a CDN-cached static asset against a database-bound dashboard query and tells you nothing useful about either of them.
The four operations have four different physics. A read-only auth check is bounded by Postgres and your own framework code, and once auth itself isn't a 200ms bcrypt compare, it should sit under 50ms p95 reliably. An MCP non-LLM dispatch picks up the transport overhead on top of the DB and lands at roughly 200ms p95 once you've warmed connections. A memory write with extraction is bounded by whichever model the customer's BYOK config routes the extractor to; current-generation extraction models give you something in the 1.5–2s p95 range, and that's about as far as your own code can push the number. A query with synthesis is the same story with a longer prompt and a longer completion attached, which puts the p95 in the 3–5s range.
The honest move when reporting any of this isn't to publish one number and let the reader figure out which operation it applies to. It's to publish four numbers, one per class, with a sentence on each describing what bounds the upper end. Different operations, different physics underneath, different budgets the reader can actually act on.
Further reading
Closely related
- The 200ms auth floor: replacing bcrypt with HMAC for API keys. 50,000× speedup on hash verification by matching the primitive to the entropy of the secret it protects.
- The cost-per-query frontier: synthesis model, latency, and accuracy. Six synthesis models, retrieval-k swept, latency at p50/p95. Where the curve bends and how to pick a point on it.
- Building a 22-second deploy smoke that catches real bugs. ~100 checks across 13 groups in 22 seconds. Caught six real bugs during construction. Design notes.
Engram
- Engram on LongMemEval-S: 91.6%. Full benchmark methodology and what didn't work.
- Engram docs. HTTP API, MCP setup for each client, SDK examples.
- Start with Engram. Free tier, BYOK, MCP-native.