Engineering
Zero-downtime backfill migrations: the HMAC rollout in detail
A companion to our bcrypt-to-HMAC writeup. The earlier post argued why the hash had to change. This one is the migration playbook: column shape, the partial index, the two-query verify path, and the rollout day. No revoked keys, no scheduled downtime, no "please regenerate" email.
If you've already read the bcrypt post, skip the constraint recap; the meat starts at "the verify path." The short version of the design: opportunistic backfill driven by the hot path itself, leaning on a Postgres partial index. We'll spend most of our time on the verify function, because that's where the work happens and where we almost shipped a bug that would have leaked old hashes through the new column. The expand and contract phases on either side of it are short by comparison, and we'll treat them that way.
The constraint we couldn't violate
API keys are credentials customers paste into agent configs. A Cursor user pastes a key into ~/.cursor/mcp.json, a Claude Code user pastes it into claude mcp add-json, a server-side agent pastes it into a deployed config. The key lives there for months. Customers do not, in the general case, return to our dashboard between issuance and the heat death of the universe; they paste once and forget.
That created two hard constraints on any change to how we hash those keys:
- No revocation. If we invalidate a key, the customer's agent starts returning 401s the next time it tries to authenticate. The agent has no way to know it should rotate. The customer notices when their workflow breaks mid-task, which is the worst possible time to find out about an infrastructure change.
- No downtime. Auth is in the hot path of every single request. A migration that requires writes to pause, or that adds a multi-second pre-check, is a billing-impacting incident. Several customers have agents that fire every few seconds against our API; a 30-second window of 503s is a real outage by their SLAs.
So the migration goal was: every existing key keeps working, untouched. New keys use the new format. Old keys quietly transition to the new format the next time their owner uses them. Customer experience: nothing changed. Our operational experience: bcrypt verification time falls from the floor we'd been complaining about to sub-millisecond.
This shouldn't be exotic, but the first three migration plans we sketched on whiteboards all involved either a maintenance window or a "please regenerate your keys" email. It took a couple of iterations to convince ourselves we didn't need either.
The plan in one paragraph
Expand-migrate-contract, with the migrate step doing all the interesting work. Phase A (expand) adds the key_hmac column and its partial unique index, and starts dual-writing both hash formats on every create_api_key. Reads still go through bcrypt. Phase B (migrate) flips the verify path: try the indexed hmac lookup first, fall through to the bcrypt prefix scan if nothing hits, and on a fallback hit write the hmac for that row before returning. Each actively-used legacy key migrates on its next request. Phase C (contract) kicks in once trailing-30-day coverage crosses 99%: stop writing the old column, drop it whenever the calendar allows. Two deploys, one SQL migration on the first deploy, no maintenance window. Phase B is where the bug surface is, and where the rest of this post lives.
The schema change, and why the partial index matters
The migration SQL is two statements:
ALTER TABLE api_keys ADD COLUMN key_hmac TEXT;
CREATE UNIQUE INDEX uq_api_keys_hmac ON api_keys(key_hmac) WHERE key_hmac IS NOT NULL;
The ALTER is uneventful. Adding a nullable column without a default is a metadata-only operation in modern Postgres: no table rewrite, no locking, milliseconds even on hot tables. We've done it a hundred times. The partial index is where the trick is.
A naive plan would have built a full unique index on key_hmac right at the start. That has two problems. First, the build itself: NULLs never conflict with each other in a Postgres unique index, so uniqueness isn't the obstacle, but the build still has to scan and sort every row of the table. Second, and more importantly, a regular unique index over a mostly-NULL column has to carry an entry for every row, including rows that will stay NULL for weeks while their owners haven't used them yet. That's wasted bytes, and worse, it means every backfill UPDATE has to delete a NULL entry and insert a new one in a needlessly large index, rather than slot a single row into a small one.
A partial index over WHERE key_hmac IS NOT NULL dodges both problems. Rows with NULL key_hmac don't exist as far as that index is concerned. Building the index is cheap because there are zero qualifying rows at creation time. As keys backfill, they get inserted into the index one at a time, in the natural cadence of customer traffic, and the index size grows proportionally to the migrated population, not the full table.
The uniqueness guarantee is exactly what we need. Any non-null key_hmac must be unique across the table, because two distinct raw keys hashing to the same hmac would mean either an HMAC collision (vanishingly unlikely with SHA-256) or a bug. Either way we want the database to reject the insert rather than silently allow the second one and let us authenticate the wrong tenant.
A footnote we noticed during testing: the planner will only use the partial index when the query's WHERE clause provably implies the index's WHERE key_hmac IS NOT NULL predicate. We caught this in staging when an explain plan kept showing a sequential scan. The SELECT in our fast path queries by key_hmac = $1, which the planner correctly infers implies IS NOT NULL, so we were fine. But wrap the column in an expression, say WHERE COALESCE(key_hmac, '') = $1, and the implication no longer holds: you lose the index. Worth checking explain plans.
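The predicate-matching point can be demonstrated end to end. This is a sketch with SQLite standing in for Postgres: both planners only consider a partial index when the query's WHERE clause provably implies the index predicate, and wrapping the column in an expression defeats the proof.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE api_keys (id INTEGER PRIMARY KEY, key_hmac TEXT)")
db.execute(
    "CREATE UNIQUE INDEX uq_api_keys_hmac ON api_keys(key_hmac) "
    "WHERE key_hmac IS NOT NULL"
)

def plan(sql: str) -> str:
    # Concatenate the planner's detail strings for a query.
    return " ".join(r[-1] for r in db.execute("EXPLAIN QUERY PLAN " + sql))

# key_hmac = <value> implies key_hmac IS NOT NULL: the partial index qualifies.
fast = plan("SELECT id FROM api_keys WHERE key_hmac = 'abc'")
# Wrapping the column hides the implication from the planner: full table scan.
slow = plan("SELECT id FROM api_keys WHERE COALESCE(key_hmac, '') = 'abc'")
```

Printing `fast` and `slow` shows the first plan using uq_api_keys_hmac and the second falling back to a scan, which is exactly the check worth running against staging before relying on the fast path.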
The verify path: two queries instead of two hundred milliseconds
The Phase B verify path is short. Compute the HMAC of the supplied raw key. Try the fast path first:
SELECT id, tenant_id, scopes FROM api_keys WHERE key_hmac = $1 AND status = 'active'
If that returns a row, you're done. Update last_used_at, return (tenant_id, scopes). Total work: two queries, both indexed, both sub-millisecond. The HMAC compute itself is a single SHA-256 round, in the microseconds.
If it returns no row, fall through to the legacy path. Pull the prefix off the supplied raw key (we store a short non-secret prefix on every row for exactly this lookup), then:
SELECT id, key_hash FROM api_keys WHERE key_prefix = $1 AND status = 'active' AND key_hmac IS NULL
The key_hmac IS NULL predicate is doing real work in that query. Without it, the fallback would keep returning rows whose hmac was already populated. Those can't possibly match (we already tried the hmac and missed), but we'd still pay bcrypt cycles ruling them out. The predicate also means the legacy path only ever looks at rows that haven't migrated yet, which is by construction a shrinking set.
Loop over the matching rows, bcrypt.checkpw(raw_key, row.key_hash) each one until one passes. (Prefix scans usually return one row; sometimes two if two keys happen to share a prefix, which is allowed.) On success, run the migration UPDATE:
UPDATE api_keys SET last_used_at = now(), key_hmac = $1 WHERE id = $2
That single UPDATE does two things: bumps the heartbeat, and writes the hmac. The next verification of the same key will hit the fast path. We measured this on a known-cold key right after deploy. First verify: 205 ms (the bcrypt fallback). Second verify against the same key: 0.1 ms (the indexed lookup). Once-per-key cost, exactly the cost we were paying before, plus one row update. Forever after, free.
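Putting the pieces together, here is a runnable sketch of the Phase B verify function. Everything in it is illustrative: SQLite stands in for Postgres, a toy hash function stands in for bcrypt.checkpw, the 8-character prefix width is an assumption, and the schema is pared down to the columns the post mentions.

```python
import hashlib
import hmac
import sqlite3

HMAC_KEY = b"server-side-secret"  # illustrative; the real key never touches the database

def compute_api_key_hmac(raw_key: str) -> str:
    # One SHA-256 round: microseconds, versus ~200 ms of bcrypt.
    return hmac.new(HMAC_KEY, raw_key.encode(), hashlib.sha256).hexdigest()

def toy_bcrypt_check(raw_key: str, stored_hash: str) -> bool:
    # Stand-in for bcrypt.checkpw(raw_key, row.key_hash).
    return stored_hash == "toy$" + hashlib.sha256(raw_key.encode()).hexdigest()

def verify_api_key(db: sqlite3.Connection, raw_key: str):
    # Fast path: indexed hmac lookup.
    key_hmac = compute_api_key_hmac(raw_key)
    row = db.execute(
        "SELECT id, tenant_id FROM api_keys "
        "WHERE key_hmac = ? AND status = 'active'",
        (key_hmac,),
    ).fetchone()
    if row:
        db.execute(
            "UPDATE api_keys SET last_used_at = datetime('now') WHERE id = ?",
            (row[0],),
        )
        return row[1]
    # Legacy fallback: prefix scan restricted to not-yet-migrated rows.
    rows = db.execute(
        "SELECT id, tenant_id, key_hash FROM api_keys "
        "WHERE key_prefix = ? AND status = 'active' AND key_hmac IS NULL",
        (raw_key[:8],),
    ).fetchall()
    for key_id, tenant_id, key_hash in rows:
        if toy_bcrypt_check(raw_key, key_hash):
            # Recompute from the raw key we just verified -- never reuse a
            # variable that might hold an unrelated key's hmac (the bug
            # code review caught).
            db.execute(
                "UPDATE api_keys SET last_used_at = datetime('now'), key_hmac = ? "
                "WHERE id = ?",
                (compute_api_key_hmac(raw_key), key_id),
            )
            return tenant_id
    return None
```

The first call on a legacy row pays the bcrypt fallback and writes key_hmac; every later call on the same key never leaves the fast path.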
One nasty thing we caught in code review and want to flag, because it's the kind of mistake that would have happened to one of us if the other hadn't been reading carefully. The original draft of the fallback UPDATE wrote the hmac from a variable that, on a subset of code paths, was still bound to the hmac of an unrelated key from earlier in the request. The unit test happened to pass because the test reused the same key. We caught it in a second review by reading every assignment to key_hmac_hex and asking "what is this value at the moment of the UPDATE." The fix was to recompute the hmac immediately before the UPDATE, against the same raw key we'd just verified. The lesson: when a migration writes to a column that's about to become the source of truth for authentication, treat the assignment with the paranoia it deserves. A bug here doesn't surface as a 500. It surfaces as the wrong tenant authenticating, six weeks later, when someone notices.
A confession on the rollout sequence, while we're being honest: we did not actually measure Phase A coverage before deploying Phase B. We confirmed by hand that a few new keys had non-null key_hmac values and called it good. The right move would have been a one-line dashboard panel showing the share of newly-issued keys with hmac populated, watched for the full 24-hour soak. Phase A could have been silently dropping hmac on some code path we hadn't exercised, and we wouldn't have caught it until Phase B's fallback started carrying more traffic than it should. It didn't happen. We got lucky that the dual-write was just three call sites and they were all on the hot path.
Why opportunistic beats a scripted backfill
The obvious alternative to opportunistic backfill is a script: iterate every row of api_keys, compute the hmac, UPDATE. A few of us preferred that initially because it's familiar and finishes in a known wall-clock time.
It can't work. Bcrypt is a one-way function. We do not have the raw API keys stored anywhere; that's the entire point of hashing them. So a backfill script has nothing to feed compute_api_key_hmac for an existing row. The script can either (a) skip every existing row, in which case nothing migrates and we still need an opportunistic path on top, or (b) email every customer asking them to re-submit their raw key through a backfill endpoint, in which case we've broken the "no customer communication" constraint and probably created a phishing vector that won't end well.
The opportunistic backfill works for one reason: verify_api_key is the only context in our entire system where we have the raw key. It's the customer's Authorization header, in flight, on its way to becoming a hash that gets compared to one we've stored. For the few hundred microseconds between the request arriving and us returning a response, we hold the only ingredient the migration needs. If we don't capture it at that exact moment, we never see it again.
So the verify path is the only place a migration can happen at all. Once you accept that, "make verify do the migration" stops looking like a clever hack and starts looking like the only design. Every other architecture is some variation of "delay the inevitable until a quieter time of day," and the quieter time of day never comes because there's always another change you wanted to ship instead.
The same shape works any time the verify path is the unique vantage point onto the raw secret: hash-scheme rotation, key-format rotation, session-token re-signing, encrypted-at-rest column migrations where the decryption key is the one you're changing. Read both formats, write the new one on hit, drop the old format when coverage is high enough. The hard part is having the discipline to use an existing hot path as your migration mechanism instead of writing a script.
Measuring "are we done yet"
The clean signal for "Phase B is complete and we can move to Phase C" is the share of actively used keys that have a non-null key_hmac. We don't care about keys that haven't been used in months. By definition, they're not authenticating anything right now, and if they ever wake up they'll migrate on their next call. So the metric we track is coverage within the trailing-30-day, 60-day, and 90-day active cohorts:
SELECT count(*) FILTER (WHERE key_hmac IS NOT NULL) * 1.0 / count(*) FROM api_keys WHERE last_used_at > now() - interval '30 days';
Same query with 60 days; same again with 90.
The 30-day cohort is the one that matters for Phase C readiness. When it crosses 99%, we know virtually every key that's going to authenticate today is already on the new format. The 60- and 90-day cohorts are useful sanity checks. If the 30-day is at 99% but the 90-day is at 80%, that tells us about long-tail keys that wake up occasionally, which is fine; they'll migrate on wake.
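As a sketch, the cohort metric is one parameterized query. SQLite stands in for Postgres here (it supports the same FILTER clause), and the helper name is ours, not from the production dashboard.

```python
import sqlite3

def coverage_by_cohort(db: sqlite3.Connection, days: int) -> float:
    # Share of keys active in the trailing window that carry a non-null hmac.
    row = db.execute(
        "SELECT count(*) FILTER (WHERE key_hmac IS NOT NULL) * 1.0 / count(*) "
        "FROM api_keys WHERE last_used_at > datetime('now', ?)",
        (f"-{days} days",),
    ).fetchone()
    # An empty cohort divides by zero and yields NULL; report it as 0.
    return row[0] if row[0] is not None else 0.0
```

Running it for 30, 60, and 90 days gives the three panels described above; Phase C readiness is the 30-day number crossing 0.99.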
We also track raw count of rows with NULL hmac, broken down by last_used_at age. After a day, that distribution stops shrinking on the recent end (everything active has migrated) and slowly bleeds on the older end as occasional users come back. The shape of that decay told us we had no surprises hiding: no cohort of keys was systematically refusing to migrate, no integration was caching auth in a way that bypassed our verify path.
The rollout, what actually happened
Phase A deployed on a Tuesday afternoon. The SQL ran in 14 ms. The code change was minimal: create_api_key got six new lines (compute the hmac, include it in the INSERT). We sat on Phase A for 24 hours, mostly to confirm nothing weird happened with the new column: that the index built clean, that no INSERT path had been missed, that every new key issued during that window had a non-null hmac. The ones we spot-checked did.
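For reference, the Phase A dual-write is roughly this shape. All names are hypothetical reconstructions: a toy hash stands in for the bcrypt call, SQLite for Postgres, and the key format is invented for the sketch.

```python
import hashlib
import hmac
import secrets
import sqlite3

HMAC_KEY = b"server-side-secret"  # illustrative

def create_api_key(db: sqlite3.Connection, tenant_id: str) -> str:
    raw_key = "sk_live_" + secrets.token_hex(16)
    # Legacy format, still written throughout Phases A and B.
    key_hash = "toy$" + hashlib.sha256(raw_key.encode()).hexdigest()
    # New format: the handful of added lines boil down to this compute
    # plus one extra column in the INSERT.
    key_hmac = hmac.new(HMAC_KEY, raw_key.encode(), hashlib.sha256).hexdigest()
    db.execute(
        "INSERT INTO api_keys (tenant_id, key_prefix, key_hash, key_hmac, status) "
        "VALUES (?, ?, ?, ?, 'active')",
        (tenant_id, raw_key[:8], key_hash, key_hmac),
    )
    return raw_key
```

Every key issued after this deploy lands with both columns populated, which is exactly what the 24-hour soak is meant to confirm.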
Phase B deployed the next afternoon. This is where the interesting telemetry showed up. The smoke test runs an end-to-end auth check against our own internal key, which had been created weeks earlier and was still on the bcrypt-only format. First verify after deploy: 205 ms. The fallback path triggered, bcrypt ran, the UPDATE wrote the hmac. Second verify, a few seconds later from the next smoke-test iteration: 0.1 ms. Indexed lookup, no bcrypt. The migration happened in the time it took the smoke test to complete one loop.
Within an hour the trailing-30-day cohort was at 71% coverage. Within four hours, 94%. Within 24 hours, 98.4%. We have a handful of customers running constant-cadence agents, and those keys migrated within their first request. The long tail is the keys belonging to humans who use the product a few times a week; those took the full week to cross over.
Things that surprised us in a good way: the median verify latency dropped from around 95 ms to under 2 ms within the first hour. Half of our auth traffic was already new-format keys (the ones created since Phase A) or actively-used legacy keys (the ones that migrated on their first request after Phase B). The improvement was visible in our customer-facing latency dashboards within minutes. Several customers messaged us asking what we'd done; we hadn't told them anything was changing.
Things that surprised us in a less good way: the prefix-scan fallback occasionally hit two rows with the same prefix, and the bcrypt loop over both rows brought the legacy path closer to 400 ms than 200 ms. That's still a once-per-key cost (after the first verify the key is on the fast path), but it briefly worried us that a customer with a brand-new key colliding-by-prefix with an unmigrated key would see a slow first request. The collision rate is low (the prefix is wide enough) and the bcrypt cost is a one-time tax. We didn't change anything.
One other thing that didn't go cleanly. We'd staged the dashboard panels for coverage tracking the day before, and one of the SQL queries had a typo (referenced api_keys.key_hash_hmac, a column that doesn't exist). The dashboard was showing all-zeros for the first hour. We thought the migration wasn't working. We ran the query manually, saw coverage already past 60%, fixed the dashboard. Lesson: validate observability before the thing it's observing actually starts moving, not after.
What we'd do again
The opportunistic-backfill-on-verify pattern is the design choice that made the whole thing possible. We'll reach for it any time we need to migrate the storage shape of a credential, a token, or anything else where the only place we see the original input is in the act of using it. It feels uncomfortable the first time because there's no single moment of completion, no script you can point at and watch finish. The discomfort is the cost of admitting the script you wanted to write can't actually exist: bcrypt is one-way, and the raw key only ever lives in flight on the verify path.
The partial unique index. It made the index build trivially cheap, kept the index size proportional to migrated data, and gave us the uniqueness guarantee for free without locking the table. Postgres partial indexes are an underused feature and this is a case where they're exactly the right tool.
Coverage-by-cohort as the readiness metric. The naive "what percent of all rows have an hmac" number lags forever because of dormant keys that may never come back. Splitting it by last_used_at bucket let us make the call to drop the bcrypt write path with confidence, well ahead of when 100% of the raw row count would have flipped.
The 24-hour soak between Phase A and Phase B. We were tempted to ship both together since the diffs don't conflict. The reason to separate them: if Phase A's dual-write was broken, Phase B's fallback would silently paper over it for newly-issued keys, and we wouldn't notice until much later. The soak caught nothing this time. The next time we run this pattern, it might.
The generalizable lesson
The principle worth stealing: when the data needed for a migration only exists in the hot path, make the hot path do the migration. That's true here because bcrypt is one-way and the raw key lives in the request, but the same logic shows up far outside credential storage. Cookie-based session stores that want to rotate signing keys without invalidating logged-in users do it on the next request that arrives with the old-format cookie: verify against the old key, re-sign with the new one, set the new cookie on the response. No scripted re-sign job, no forced re-login, just the natural traffic gradually pulling the population over. Anywhere you find yourself wanting to write a backfill script and noticing the input isn't there, look at what is there at the moment of use, and let usage do the work.
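The cookie rotation described above can be sketched in a few lines. This is a toy illustration of the shape, not a production session store: the key names and payload format are invented, and a real implementation would also carry an expiry inside the signed payload.

```python
import hashlib
import hmac

OLD_KEY = b"old-signing-key"  # illustrative secrets
NEW_KEY = b"new-signing-key"

def sign(key: bytes, payload: str) -> str:
    return payload + "." + hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()

def verify_and_rotate(cookie: str):
    # Returns (payload, fresh_cookie) or (None, None) on failure.
    payload, _, _sig = cookie.rpartition(".")
    for key in (NEW_KEY, OLD_KEY):  # read both formats, new key first
        if hmac.compare_digest(sign(key, payload), cookie):
            # Valid under either key; the response always carries a
            # new-format cookie, so traffic pulls the population over.
            return payload, sign(NEW_KEY, payload)
    return None, None
```

Same three beats as the API-key migration: try the new format, fall back to the old, and make the successful fallback emit the new format on its way out.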
Further reading
Closely related
- The 200ms auth floor: replacing bcrypt with HMAC for API keys. 50,000× speedup on hash verification by matching the primitive to the entropy of the secret it protects.
- Building a 22-second deploy smoke that catches real bugs. ~100 checks across 13 groups in 22 seconds. Caught six real bugs during construction. Design notes.
- Designing the memories table for a system you can't easily migrate. Two years of column-by-column iteration on the spine of the system, with the painful migrations called out.
Engram
- Engram on LongMemEval-S: 91.6%. Full benchmark methodology and what didn't work.
- Engram docs. HTTP API, MCP setup for each client, SDK examples.
- Start with Engram. Free tier, BYOK, MCP-native.