Engineering

When pgvector slowed down past 500 buckets per tenant

Mid-benchmark, our query p95 drifted from ~150ms up past 450ms. The instinct was “pgvector at scale.” The instinct was wrong. Here’s what we actually measured, what we ruled out, and the small fix that brought p95 back to baseline.

Published January 13, 2026 · By Jacob Davis and Ben Meyerson

The observation

We were partway through a LongMemEval-S 500-task run when our chunk-by-chunk latency report started looking off. Chunks 1–3 ran at our usual baseline. Chunk 4 was a touch slower. By chunk 6 we were seeing p95 retrieval latencies 2–3x what we’d measured the week before on the same machine, against the same dataset, against the same Postgres. Nothing about the workload had changed. Nothing about our build had changed. The numbers were just worse.

The reflex reaction in the room was “pgvector at scale.” We’re running pgvector with HNSW on a moderate-sized Postgres, and pgvector has known performance characteristics that get talked about a lot. So that was the first hypothesis. It was also wrong.

The actual signal, once we stopped looking at pgvector and started looking at the data, was the bucket count per tenant. A tenant with ~50 buckets ran at ~150ms p95. The same query shape, against a tenant with ~550 buckets, ran at ~450ms p95. Same data volume in memories. Same SQL. Same index. Same machine. Just more buckets sitting next to the live ones.

The data point that pointed us the right way

We were carrying a lot of historical state. Engram tenants in our dev environment had been collecting bucket churn for months: prior benchmark runs, prior prompt iterations, one-off experiments where each variant got its own bucket so we could A/B retrieval. Most of those buckets had zero memories in them; we’d cleared the contents but never deleted the buckets themselves. The carry from those experiments stacked up.

The two numbers that mattered, measured on a single tenant:

Tenant state                      | Active memories | p95 query latency
~50 buckets total                 | ~12,000         | ~150ms
~550 buckets total (~350 empty)   | ~12,000         | ~450ms

Same memory count. Roughly 3x slower at the bucket-count-heavy end. That ratio was the tell. If this had been a pgvector HNSW scaling problem driven by vector volume, the two rows should have been roughly equal. They had the same number of vectors indexed.

What we suspected first, and why each wasn’t it

Hypothesis 1: pgvector HNSW degradation

This is where we spent most of the first day. We’ve all read the threads. HNSW build cost grows superlinearly with vector count, recall/latency depends on ef_search, and there’s a known set of failure modes when an index gets large. Two of us had separately landed on “the index is finally biting us” before lunch, and we started drafting a tuning plan: bump ef_search, look at a rebuild with higher m, maybe partition the embeddings table by tenant if we had to. The shape of the curve felt right. We’ve been carrying pgvector for over a year and this is the kind of slowdown we’d been half-expecting eventually.

The reason it wasn’t it: our vectors all live in one table under a single HNSW index, with the tenant filter applied at query time. The total vector count on this tenant didn’t change between the two rows above. 12,000 vectors is 12,000 vectors whether they’re sliced into 50 buckets or 550. pgvector doesn’t know or care how we logically partition them. We checked anyway: EXPLAIN ANALYZE on a single vector query against this tenant came back with the same plan, the same node count traversed, and the same Postgres-side execution time (~7ms for the actual ANN scan) in both states.
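For reference, the check was roughly this shape. This is a sketch, not our exact schema: the table and column names are assumptions, and the vector is passed as a pgvector text literal.

```python
# Illustrative only: run the same ANN query under EXPLAIN ANALYZE in both
# tenant states and diff the plans. Table/column names are assumed, not ours.
import psycopg

EXPLAIN_SQL = """
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, embedding <=> %(qvec)s::vector AS distance
FROM memories
WHERE tenant_id = %(tenant_id)s
ORDER BY embedding <=> %(qvec)s::vector
LIMIT 20
"""

def explain_ann(dsn: str, tenant_id: str, qvec: list[float]) -> str:
    # pgvector accepts a '[x, y, ...]' text literal cast to ::vector;
    # <=> is cosine distance, so use the operator matching your index opclass.
    vec_literal = "[" + ",".join(str(x) for x in qvec) + "]"
    with psycopg.connect(dsn) as conn:
        rows = conn.execute(
            EXPLAIN_SQL, {"tenant_id": tenant_id, "qvec": vec_literal}
        ).fetchall()
    return "\n".join(r[0] for r in rows)
```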

Postgres was fine. The slowdown was happening above Postgres. We’d burned most of a day on the wrong tree.

Hypothesis 2: per-bucket FAISS lock contention

For some retrieval paths we maintain per-bucket in-process FAISS indexes under a per-bucket asyncio lock. 550 lock objects acquired per query during fan-out sounded plausible. We instrumented the path: median 1.2ms, p95 4.8ms, completely uncontended. Not it.
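The instrumentation was nothing fancy; roughly this shape, with hypothetical names and simplified from the real path:

```python
import asyncio
import time

# Roughly how we timed lock acquisition on the per-bucket FAISS path.
# `bucket_locks` is a hypothetical dict of per-bucket asyncio.Lock objects.
async def timed_bucket_section(bucket_id: str, bucket_locks: dict[str, asyncio.Lock]) -> float:
    start = time.perf_counter()
    async with bucket_locks[bucket_id]:
        wait_ms = (time.perf_counter() - start) * 1000.0
        # ... per-bucket FAISS sub-query runs here ...
    return wait_ms  # lock wait time, fed into a histogram; ours came back uncontended
```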

Hypothesis 3: per-query bucket fan-out

Our retrieval orchestration, when called with no explicit bucket filter, scans all enabled buckets for the tenant and merges results. A Python loop over the bucket list, dispatching a sub-query per bucket and reducing the candidate sets into the final reranked list.

On 50 buckets, that loop is invisible. On 550 buckets, even if 350 of them are empty, the loop still spins through them: enumerate the bucket, check enabled state, set up the sub-query coroutine, acquire the per-bucket lock, hit Postgres with a filter that returns zero rows, release the lock, append the empty result, move on. Each empty bucket cost us roughly 600–900µs of Python overhead. 350 × 750µs = ~260ms of wall time spent doing nothing useful.
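Stripped down, the loop looks something like this. Names are illustrative; the real orchestrator layers reranking and RRF fusion on top.

```python
# Simplified shape of the no-filter fan-out path. Names are hypothetical.
async def fan_out(tenant_id, query_vec, buckets, bucket_locks, run_subquery):
    per_bucket_results = []
    for bucket in buckets:                       # 550 iterations on the heavy tenant
        if not bucket.enabled:
            continue
        async with bucket_locks[bucket.id]:      # per-bucket asyncio.Lock
            rows = await run_subquery(tenant_id, bucket.id, query_vec)
        per_bucket_results.append(rows)          # [] for each of the ~350 empty buckets
    # merge + rerank over per_bucket_results happens here in the real path
    return [r for rows in per_bucket_results for r in rows]
```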

That was the gap.

The diagnosis

We profiled a single end-to-end query against the heavy tenant with py-spy record, then walked the flamegraph. Postgres execution accounted for ~14% of wall time. The rest was orchestration: bucket enumeration, coroutine scheduling, lock acquire/release pairs against per-bucket lock objects, candidate merge, RRF fusion across a per-bucket-keyed candidate dict. 80%+ of the latency was Python orchestration over an N that had quietly grown 10x.
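If you haven’t used it, the invocation is a one-liner; something like the following, where the PID, duration, and output path are illustrative:

```
py-spy record --pid 12345 --duration 30 -o heavy_tenant_query.svg
```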

This is the kind of bug that doesn’t register as a bug for a long time. None of the per-bucket costs are individually expensive. The loop isn’t obviously wrong. It’s the same loop you’d write today. There’s no log line that screams. The slowdown is smooth and proportional to a number nobody’s thinking about, because the number is bucket count and bucket count is a logical organizing concept, not a load-bearing axis. You have to look at the right metric on the right tenant on the right day to see it.

For us, the “right day” was when we’d been running benchmarks for long enough to accumulate hundreds of empty buckets and the cost finally surfaced as a chunk-6 p95 regression on the benchmark dashboard.

The actual fix

The first fix was data hygiene. We wrote a one-off script that listed every bucket in our dev tenant, counted memories per bucket, and deleted any bucket with zero memories that hadn’t been touched in the last 14 days. That dropped us from 550 buckets to ~200 in a single pass. Re-ran the same query: p95 back down to ~180ms.
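The script was roughly the sketch below, under assumed table and column names (not our exact schema); we ran the selection read-only first and eyeballed the list before deleting anything.

```python
import psycopg
from datetime import datetime, timedelta, timezone

# One-off cleanup sketch: delete buckets with zero memories and no recent writes.
# `buckets.updated_at`, `memories.bucket_id`, etc. are assumed names.
def delete_stale_empty_buckets(dsn: str, tenant_id: str, days: int = 14) -> int:
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    with psycopg.connect(dsn) as conn:
        cur = conn.execute(
            """
            DELETE FROM buckets b
            WHERE b.tenant_id = %(tenant_id)s
              AND b.updated_at < %(cutoff)s
              AND NOT EXISTS (
                  SELECT 1 FROM memories m
                  WHERE m.tenant_id = b.tenant_id AND m.bucket_id = b.id
              )
            """,
            {"tenant_id": tenant_id, "cutoff": cutoff},
        )
        deleted = cur.rowcount
        conn.commit()
    return deleted
```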

~180ms vs. ~150ms baseline is still a small regression (we now have 200 active buckets instead of 50), but it’s a realistic, manageable slope. We weren’t chasing a constant; we were chasing a 3x cliff that wasn’t supposed to exist.

The deeper fix shipped a week later. Two pieces:

  1. Skip-empty-buckets in retrieval orchestration. Before fan-out, we now run a single SELECT bucket_id, COUNT(*) FROM memories WHERE tenant_id = $1 GROUP BY bucket_id; any enabled bucket that doesn’t appear in the result has zero memories and is excluded from the loop entirely. No coroutine, no lock acquire, no sub-query. One ~2ms aggregate query replaces N × ~750µs of orchestration. For the 550-bucket / 350-empty case, that single change saved ~260ms. (A minimal sketch follows this list.)
  2. Quarterly orphan-bucket GC. A scheduled job that flags buckets with zero memories and no writes in the last 90 days. Flagged buckets get a notification email; deletion is opt-in. We didn’t want a job that silently destroyed customer buckets, so we kept it advisory by default and aggressive only for our internal dev tenants.
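Here’s the sketch of piece 1, under assumed names and asyncpg-style placeholders to match the query above; the real orchestrator differs in detail.

```python
import asyncpg

# Piece 1, sketched: one aggregate query up front, then fan out only over
# buckets that actually contain memories. Names here are illustrative.
async def live_bucket_ids(conn: asyncpg.Connection, tenant_id: str) -> set[str]:
    rows = await conn.fetch(
        "SELECT bucket_id, COUNT(*) AS n FROM memories WHERE tenant_id = $1 GROUP BY bucket_id",
        tenant_id,
    )
    return {r["bucket_id"] for r in rows}

async def buckets_to_query(conn: asyncpg.Connection, tenant_id: str, enabled_buckets):
    live = await live_bucket_ids(conn, tenant_id)          # one ~2ms aggregate
    return [b for b in enabled_buckets if b.id in live]    # empty buckets never fan out
```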

Combined, the heavy tenant went from ~450ms p95 to ~155ms p95, within noise of the original baseline. The skip-empty path made the system robust to the next batch of stale buckets we inevitably accumulate, and the GC job prevents the accumulation from getting absurd in the first place.

What we wouldn’t have caught without measuring

If we’d trusted the first instinct, we’d have spent the week on pgvector tuning. We had a list ready: bump ef_search, rebuild HNSW with a higher m, partition the embeddings table by tenant, evaluate switching back to IVFFlat for some workloads, look at maintenance_work_mem during index rebuilds. All of that is real and useful work for the right problem. None of it would have moved this number a millisecond, because Postgres was fine.

The thing that saved us was running py-spy against a live query before we touched a config. The flamegraph told us within 90 seconds that pgvector wasn’t the bottleneck. Everything after that (finding the empty-bucket count, identifying the per-bucket loop cost, designing the skip-empty path) was straightforward.

The thing we’d tell ourselves a year ago

Profile before you tune. That’s the whole lesson. Index tuning is legible: there are dials, there are graphs, there are blog posts about which dial to turn. That’s exactly why it’s a tempting wrong answer when latency drifts. A flamegraph would have told us in 90 seconds what the tuning plan would have taken a week to disprove, and we’d like that day back. The reason this one stung is that empty buckets felt free; they had no memories in Postgres, so we never thought of them as carrying cost. They carry cost in the Python layer above Postgres, which doesn’t care whether you call it with content or not. It still does the work.

An honest caveat about pgvector

To be fair: pgvector absolutely does have performance characteristics worth knowing about. HNSW index build cost on bulk inserts grows superlinearly and can stall ingest if you’re not batching carefully. Recall vs. ef_search is a real tradeoff and we’ve tuned it more than once. The community’s move from IVFFlat to HNSW changed the right defaults for most workloads. Concurrent index builds matter for production. There’s also the question of when pgvector stops being the right tool, which depends on your vector volume, query rate, and tolerance for recall variance.

This post isn’t about any of that. It’s about a case where the obvious-looking pgvector story was the wrong story. If you want to read the real pgvector performance literature, the official pgvector README on GitHub is the best starting point, and the team behind it publishes credible benchmarks. There’s also been good third-party work on HNSW parameter tuning specifically for ANN recall at production scale. Those are the right places to go if pgvector itself is your bottleneck.

Our experience here is just: don’t assume it is.

How this shows up in our product

The skip-empty-buckets behavior is on by default in Engram now. There’s no flag to turn off, no migration to apply. If you’re a tenant with a lot of buckets (for example, you’re using one bucket per project and you have a long history of completed projects), your retrieval calls will skip the zero-memory buckets transparently. The bucket still exists in the API, still appears in list_buckets(), still accepts new writes; it just doesn’t cost you query latency when it has nothing in it.

The advisory GC job runs against internal dev tenants only. For customer tenants we surface orphan-bucket counts in the admin UI but never delete on your behalf. The general philosophy is that buckets are user-facing artifacts and disappearing them silently is a worse failure mode than carrying a few empty rows.

If you’re building on Postgres and pgvector yourself: the skip-empty pattern is a five-line change. Add an aggregate count up front, filter the fan-out list against it, move on. We’d have shipped it on day one if we’d known.

One more thing

Most “scaling problems” at moderate scale aren’t the index. They’re orchestration costs upstream of the index, growing with some axis you weren’t thinking about. The index is famous, so it gets blamed. The orchestration layer is your own code, so it’s easy to overlook. We knew this in the abstract and still spent a day on the index.

For us, this time, it was the Python.
