Engineering

Building a 22-second deploy smoke that catches real bugs

The smoke that runs against every Engram deploy is around 100 checks across 13 groups: REST, OAuth, MCP, admin, webhooks, and the error paths in between. It provisions a fresh tenant, exercises every customer-facing surface, and cleans up, in about 22 seconds. While we were building it, it caught 6 real bugs that had landed on main without anyone noticing.

Published February 10, 2026 · By Jacob Davis and Ben Meyerson

This post is about how that smoke is shaped, the questions each check actually asks, and the bugs that fell out of writing it. It’s long, because the value of the thing is in the specifics. If your smoke is "GET /health, assert 200, ship it," you have a status page, not a smoke. We wanted something that would fail when something real broke.

A smoke that always passes isn’t a smoke

The first principle is almost embarrassing to state. A smoke test that always passes is not telling you the deploy is healthy; it’s telling you the smoke isn’t looking at anything. There is a strong selection pressure toward easy checks: they’re fast to write, they don’t flake, they’re green on the dashboard. The output of that pressure is a script that exercises 5% of the surface area and reassures you about the wrong things.

The corollary is the useful one: a smoke is worth what it costs to build only if it catches bugs you haven’t found yet. Not bugs you’re trying to regression-test against. Bugs the developer doesn’t know are there. If the act of writing the smoke flushes out latent breakage, the smoke is paying for itself before it’s even running on a real deploy.

We tried to design every check with that in mind. Each one had to ask a specific question, not "did the endpoint respond" but "did the endpoint respond with the thing customers depend on it returning." That single shift in framing was responsible for every real bug the smoke caught during construction.

Six bugs the smoke caught while we were building it

Before we get to the design, here are the receipts. These are the bugs the smoke surfaced before its first scheduled run on a real deploy. Every one of them existed on main when we started. Every one of them was found because a check asked a more specific question than "did it 200."

  1. SQLite/Postgres method drift on get_recent_queries. The admin stats endpoint called a method that existed on the Postgres adapter but not on the SQLite adapter. Production was fine. Local dev (SQLite) crashed the moment the smoke hit admin stats. The smoke runs both flavors, so it caught it.
  2. Same drift on list_bucket_memories. A router endpoint for "list memories in bucket N" worked against Postgres but the SQLite adapter had a stale signature from a refactor six weeks earlier. No customer ever hit it locally because nobody was running the router against SQLite. The smoke does.
  3. Same drift on the query_log table. The Postgres migration had landed in March; the SQLite equivalent had been silently skipped because of an IF NOT EXISTS branch that didn’t notice the table actually didn’t exist on the SQLite path. Query logging looked fine until the smoke read from the table.
  4. byok_validated_at serialization crash on SQLite. The field came back as a datetime from Postgres and as a string from SQLite, and the response serializer called .isoformat() unconditionally. Production never hit it. The smoke’s "validate BYOK config" check did, on its first SQLite run, with a stack trace.
  5. /admin/customers/<id> not returning billing_plan. A field the admin UI displayed had been dropped from the response shape during a model refactor. The UI silently rendered an empty string and nobody noticed. The smoke explicitly asserted the field’s presence; the deploy failed immediately.
  6. store_memory("") silently accepted. A memory with empty content was being accepted by the store endpoint, indexed against nothing, and then matching zero queries forever after. The contract said empty content should 400. The smoke explicitly asserted the 400; production was returning 200.

The pattern across all six: each was caught by a check that asked a specific question instead of accepting a generic 2xx. None of them would have been caught by "ping /health."

Anatomy of a useful check

Every check has the same shape: a name, a group, and a small function that returns (passed, status, detail). The runner streams each result to stdout as the run progresses so you can watch it land, not stare at a 30-second hang waiting for a summary block. A single line looks like this:

[ok ] mcp.query.marker_present   200   marker found in answer (1 hit, 47 chars)

The group prefix (mcp) is one of thirteen, and lets us scope a run with --only mcp when something is on fire. The dotted name reads top-down (surface, action, assertion), so you can tell what the check is asking before you read the detail. Outcome is ok / err / skip; skip earns its keep, because some checks need creds that aren’t available in every environment and we’d rather see "skipped: no admin password" than a false red. Status is the HTTP code, or - for non-HTTP checks. The detail field is where we’re strictest: it’s easy to write a check whose detail is the word "ok," and twice as useful to write one whose detail is "marker found in 2 of 4 retrieved memories, top score 0.87." The second one is debuggable from the log line alone. The first one isn’t.
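
For concreteness, here's a minimal sketch of that shape in Python. The check and runner below are illustrative stand-ins, not our real harness; only the (passed, status, detail) contract and the output format are the real thing.

    import requests

    def check_admin_health(base_url):
        # health.admin.responds: assert the body, not just the 200.
        r = requests.get(f"{base_url}/health", timeout=5)
        body = r.json() if "json" in r.headers.get("content-type", "") else {}
        ok = r.status_code == 200 and body.get("status") == "ok"
        return ok, r.status_code, f"status={body.get('status')!r}"

    def run_check(group, name, fn, base_url):
        passed, status, detail = fn(base_url)
        outcome = "ok " if passed else "err"
        # One line per check, streamed as the run progresses.
        print(f"[{outcome}] {group}.{name:<26} {status}   {detail}")
        return passed

    run_check("health", "admin.responds", check_admin_health, "http://localhost:8000")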

The thirteen groups

We’ll spare you the full grid of checks. They fall into thirteen functional groups; for each, here’s the check that did real work, not the easiest one to describe.

health

Both surfaces (/health on the admin app and on the MCP app) get hit independently. They run as different processes, so checking one doesn’t tell you anything about the other. We learned this the hard way once when only one was up for forty-five minutes.

public

The unauthenticated marketing-adjacent endpoints (/pricing, /auth/config) shouldn’t be hard to keep up, but they’re the ones a customer hits first. /auth/config in particular is read by the sign-in flow; if its JSON shape regresses, sign-in breaks before the customer ever sees a useful error.

discovery

OAuth discovery is structured and standardized: RFC 8414 Authorization Server metadata, RFC 9728 protected-resource metadata, and the WWW-Authenticate header on a 401 response from a protected resource. We assert the JSON keys (issuer, authorization_endpoint, token_endpoint, registration_endpoint, scopes_supported, code_challenge_methods_supported) are present and well-formed. A misconfigured deploy here means an MCP client can connect to nothing.
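
In sketch form, with the required-keys list mirroring the fields above. The well-known path is the one RFC 8414 defines; the helper name is ours for illustration.

    import requests

    REQUIRED_KEYS = [
        "issuer", "authorization_endpoint", "token_endpoint",
        "registration_endpoint", "scopes_supported",
        "code_challenge_methods_supported",
    ]

    def check_as_metadata(base_url):
        r = requests.get(f"{base_url}/.well-known/oauth-authorization-server", timeout=5)
        meta = r.json()
        missing = [k for k in REQUIRED_KEYS if k not in meta]
        passed = r.status_code == 200 and not missing
        detail = f"missing: {', '.join(missing)}" if missing else "all keys present"
        return passed, r.status_code, detail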

auth

Sign-in works. Wrong password returns 401, not 500. Duplicate signup returns 409. Logout invalidates the session, and we test that last one by calling a session-protected endpoint after logout and asserting 401. The negative check is more important than the positive one. Anyone can write a sign-in test that passes; the question is whether sign-out actually signs you out.
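
Sketched below, assuming a /auth/login path alongside the /auth/signup one the provisioning step uses (an assumption; the real paths may differ). The second function is the one that matters.

    import requests

    def check_wrong_password(base_url, email):
        r = requests.post(f"{base_url}/auth/login",
                          json={"email": email, "password": "definitely-wrong"},
                          timeout=5)
        # A 500 here means the error path itself is broken.
        return r.status_code == 401, r.status_code, "bad password rejected with 401"

    def check_logout_invalidates(base_url, email, password):
        s = requests.Session()
        s.post(f"{base_url}/auth/login", json={"email": email, "password": password}, timeout=5)
        s.post(f"{base_url}/auth/logout", timeout=5)
        # The assertion that matters: a session-protected endpoint after logout.
        r = s.get(f"{base_url}/account/profile", timeout=5)
        return r.status_code == 401, r.status_code, "session rejected after logout"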

apikeys

Create an API key. Use the new key on a protected endpoint and assert 200. Revoke the key. Use it again and assert 401. The last step is the one that earns its keep. Revocation that doesn’t actually revoke is the kind of bug that ships quietly and gets discovered during an incident.
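
In sketch form, with an assumed /account/apikeys path and response shape, and the assumption that the router accepts API-key bearer auth. The last two lines are the ones that earn their keep.

    import requests

    def check_apikey_lifecycle(base_url, session):
        created = session.post(f"{base_url}/account/apikeys", timeout=5).json()
        auth = {"Authorization": f"Bearer {created['key']}"}
        live = requests.get(f"{base_url}/router/buckets", headers=auth, timeout=5)
        if live.status_code != 200:
            return False, live.status_code, "fresh key rejected"
        session.delete(f"{base_url}/account/apikeys/{created['id']}", timeout=5)
        dead = requests.get(f"{base_url}/router/buckets", headers=auth, timeout=5)
        return dead.status_code == 401, dead.status_code, "revoked key rejected"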

byok

Bring-your-own-key config sets: CRUD on the config set itself, a component-override round-trip (set an override, read it back, confirm the value matches), a live validate ping that actually calls the configured provider with a one-token request, and an SSRF protection check that submits a config pointing at an internal IP and asserts we reject it. The SSRF check has a tendency to silently regress whenever someone touches the URL parser. We caught it twice during construction.
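
The SSRF check in sketch form, with an assumed /account/byok path and payload shape; the probe address is the classic cloud metadata endpoint, and anything other than a 4xx rejection is a failure.

    import requests

    def check_byok_ssrf_rejected(base_url, session):
        r = session.post(f"{base_url}/account/byok",
                         json={"provider": "openai-compatible",
                               "base_url": "http://169.254.169.254/latest/meta-data/"},
                         timeout=5)
        return 400 <= r.status_code < 500, r.status_code, "internal-address endpoint rejected"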

portal

Every /account/* read endpoint the customer-facing portal calls: profile, usage, billing, settings, sessions, API keys, BYOK configs. Plus a PUT /account/settings round-trip: set a setting, read it back, confirm the value is the one we set. Reading endpoints alone is a smoke that lies; the write path is where the bugs live.
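
The round-trip, sketched with an illustrative setting name:

    import uuid
    import requests

    def check_settings_roundtrip(base_url, session):
        nonce = uuid.uuid4().hex[:8]
        session.put(f"{base_url}/account/settings",
                    json={"display_name": f"smoke-{nonce}"}, timeout=5)
        r = session.get(f"{base_url}/account/settings", timeout=5)
        got = r.json().get("display_name")
        passed = r.status_code == 200 and got == f"smoke-{nonce}"
        return passed, r.status_code, f"read back {got!r}"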

router

The router app (separate from MCP and admin) exposes /router/buckets and /router/buckets/<n>/memories. These are what the customer dashboard calls to render bucket lists and memory feeds. Same surface, different code path from the MCP tools, and historically a place where shape drift between the dashboard and the MCP server has caused inconsistency.

oauth

The largest single group. Dynamic Client Registration (RFC 7591), unauthenticated /authorize bounces to the sign-in flow with the original request preserved, authenticated /authorize issues a code, code exchange yields an access token and a refresh token, the access token actually authenticates an MCP connection (more on this in a moment), refresh token rotation works, a reused refresh token is rejected (RFC 6749 best practice), /revoke invalidates a token, and the negative paths (missing PKCE, mismatched redirect_uri, unknown client_id) each return the right error code. OAuth is the surface where being wrong quietly is most expensive, so we asserted the negative paths as carefully as the happy ones.
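
One of those negative paths in sketch form: refresh-token reuse. The 400 is the invalid_grant response RFC 6749 prescribes; the token endpoint and client_id are whatever the earlier steps of the flow produced.

    import requests

    def check_refresh_reuse_rejected(token_url, client_id, refresh_token):
        form = {"grant_type": "refresh_token",
                "refresh_token": refresh_token,
                "client_id": client_id}
        rotated = requests.post(token_url, data=form, timeout=5)
        if rotated.status_code != 200:
            return False, rotated.status_code, "rotation itself failed"
        replayed = requests.post(token_url, data=form, timeout=5)  # same old token
        return replayed.status_code == 400, replayed.status_code, "reused refresh token rejected"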

mcp

Connect and initialize the MCP session, list tools, then call each tool with realistic arguments. The most important check here is query correctness, which gets its own section below. We also test that storing the same content twice doesn’t create a duplicate (dedup), that storing without specifying a bucket auto-bucketizes correctly, that empty content is rejected at the MCP boundary, that oversized content is rejected, and that the ?config_set= query parameter routes to the right BYOK profile.

admin

Admin stats endpoint, customers list, customer detail (with the billing_plan assertion that caught bug #5), presets endpoints. Plus a single 403 boundary check: take a JWT for a regular customer, hit an admin endpoint, assert 403. The negative check matters more than any of the positive ones. If a customer JWT can read the customers list, that’s the kind of incident that ends careers.

webhooks

Stripe webhook endpoint, POSTed with no signature, then with a bad signature. Both should return non-2xx. If the smoke ever sees a 2xx here, we have a serious problem: an attacker could fake subscription events.
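
In sketch form (the path is assumed; the Stripe-Signature header format is real, the value deliberately bogus):

    import requests

    def check_webhook_rejects_forgeries(base_url):
        unsigned = requests.post(f"{base_url}/webhooks/stripe", data=b"{}", timeout=5)
        forged = requests.post(f"{base_url}/webhooks/stripe", data=b"{}",
                               headers={"Stripe-Signature": "t=0,v1=deadbeef"},
                               timeout=5)
        passed = unsigned.status_code >= 400 and forged.status_code >= 400
        return passed, forged.status_code, "unsigned and forged payloads both rejected"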

errors

A handful of explicit negative-path checks: an unauthenticated request to a protected endpoint returns 401, an invalid token returns 401, a request for a non-existent resource returns 404, a wrong-method request (GET on a POST endpoint) returns 405. These are the boring contracts customers depend on without knowing they do.

Hard parts: query correctness

Asserting that a query returned success: true is the bug, not the test. A query response can succeed structurally (200, valid JSON, well-formed envelope) and contain wrong content. The synthesis call ran. It returned something. Nothing is wrong from a transport perspective. Everything is wrong from a customer perspective.

The check we settled on works like this. Before issuing the query, we store a memory whose content contains a unique marker string, something like SMOKE-MARKER-7f3a2b9c, a UUID-flavored token guaranteed not to appear anywhere else in the tenant. Then we issue a query that should retrieve that memory. Then we grep the answer text and the retrieved context for the marker. If neither contains the marker, the check fails.

That sounds simple, and it is, but it forces a real question: did the system actually retrieve the thing we stored? Not "did it return a plausible-sounding response." Not "did the embeddings produce a high similarity score." Did the specific content we put in come back out? Markers are how we asserted that without writing a brittle string-match against the entire response body.

There’s a subtlety in where the marker is allowed to appear. We accept it in either the answer or the context. The composer is allowed to summarize, and a strict requirement that the marker appear verbatim in the answer would penalize a system that correctly retrieved and correctly summarized away the marker. Accepting it in the retrieved context preserves the actual signal: did retrieval find this thing? Yes or no.
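
Put together, the check looks roughly like this, with mcp standing in for an MCP client session whose call_tool argument and result shapes we're assuming for illustration:

    import uuid

    def check_query_marker(mcp):
        marker = f"SMOKE-MARKER-{uuid.uuid4().hex[:8]}"
        mcp.call_tool("store_memory",
                      {"content": f"The deploy canary phrase is {marker}."})
        result = mcp.call_tool("query",
                               {"question": "What is the deploy canary phrase?"})
        answer = result.get("answer", "")
        context = " ".join(m.get("content", "") for m in result.get("memories", []))
        # Accept the marker in either place: the composer may summarize it away.
        if marker in answer:
            return True, "-", "marker found in answer"
        if marker in context:
            return True, "-", "marker found in retrieved context"
        return False, "-", "marker absent from answer and context"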

Hard parts: dedup

Storing the same content twice should not create two memories. Customers pipe noisy event streams in and expect duplicates to collapse rather than silently growing a bucket to a million identical entries. The smoke checks two things at once: the second store_memory call should return the same memory_id as the first (the MemoryHashConflict path), and list_memories against the bucket should report exactly one row. Belt and suspenders, because the store endpoint could theoretically return a stub ID without actually consulting the store, and the list could theoretically lie in the opposite direction. Either check alone can be fooled; together they can't.
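
With the same illustrative client shape as the marker sketch, the dedup check reads:

    def check_dedup(mcp, bucket_id):
        content = "smoke dedup probe: byte-identical both times"
        first = mcp.call_tool("store_memory", {"content": content, "bucket": bucket_id})
        second = mcp.call_tool("store_memory", {"content": content, "bucket": bucket_id})
        same_id = first["memory_id"] == second["memory_id"]   # MemoryHashConflict path
        rows = mcp.call_tool("list_memories", {"bucket": bucket_id})["memories"]
        passed = same_id and len(rows) == 1
        return passed, "-", f"same_id={same_id}, rows={len(rows)}"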

Hard parts: OAuth-token-on-MCP

The hardest single check in the suite. The OAuth flow produces an access_token. The MCP server is configured to accept OAuth bearer tokens as an authentication method. These are two systems that exist independently and have to agree on what a valid token looks like.

The check opens an MCP SSE GET request with Authorization: Bearer <access_token> and asserts a 200 response. SSE is a streaming protocol and the request doesn’t naturally close, so we set a short timeout (a few seconds) and treat "timed out mid-stream after a 200 status" as success. We got past the authentication gate; the stream is doing what streams do. If the auth gate had rejected the token, we’d have gotten a 401 immediately, well before the timeout.

The trick is reading the result correctly. A naive implementation will see the timeout exception and mark the check as failed. The right implementation distinguishes "timed out while connecting / before any response headers arrived" from "received 200 headers and then timed out reading the stream body." Only the second is success. We’ve had to fix this check twice when the underlying HTTP client library changed how it surfaces timeouts. It’s worth it because this is the single most fragile integration in the system.
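
A sketch of that discrimination using httpx, which enters the stream context only after response headers arrive; that property is what lets us tell the two timeouts apart (URL and header details assumed):

    import httpx

    def check_oauth_token_on_mcp(mcp_url, access_token):
        headers = {"Authorization": f"Bearer {access_token}",
                   "Accept": "text/event-stream"}
        got_headers = False
        try:
            with httpx.stream("GET", mcp_url, headers=headers, timeout=3.0) as r:
                got_headers = True          # headers arrived; status is known
                if r.status_code != 200:
                    return False, r.status_code, "auth gate rejected the token"
                for _ in r.iter_bytes():    # drain until the short timeout fires
                    pass
                return True, 200, "stream opened and closed cleanly"
        except httpx.TimeoutException:
            if got_headers:
                # 200 arrived, then the stream kept streaming. That's success.
                return True, 200, "200, then timed out mid-stream (expected)"
            return False, "-", "timed out before response headers arrived"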

Provisioning and cleanup

The smoke runs against a real environment, so it needs to be a polite tenant. Each run signs up a fresh test account with a deterministic email pattern (smoke-<timestamp>@engram-smoke.test) through the same /auth/signup path a real customer uses. That gives us a clean tenant with default settings and no historical state. From there, the run sets up a BYOK config, mints an API key, executes the full check suite against that tenant, then revokes the API key and clears BYOK at the end.

The tenant row itself stays around. We don’t expose a self-delete on the signup path (intentionally; that’s a security hazard in its own right), so the smoke leaves an inert row behind every run. Garbage collection is trivial: anything with the @engram-smoke.test domain suffix and no API keys, no BYOK config, and no memories is a finished smoke tenant and can be reaped on a schedule. We let those accumulate for the first month before adding the GC job, just to see how the smoke behaved against the same tenant pattern over time. (Answer: fine. Tenant rows are cheap.)

Cleanup-on-failure is its own subtle thing. If the smoke fails partway through, we still want to revoke the API key it created, because a hanging API key with BYOK credentials attached is a small but real security smell. Each provisioning step registers a cleanup callback before it runs the dependent checks, and the runner walks the registered callbacks in reverse on exit regardless of whether the run passed or failed. That part is unglamorous; it also took two iterations to get right.
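
The shape of it, sketched (names and paths illustrative):

    cleanups = []

    def provision_api_key(session, base_url):
        created = session.post(f"{base_url}/account/apikeys", timeout=5).json()
        # Register the undo before any dependent check runs.
        cleanups.append(lambda: session.delete(
            f"{base_url}/account/apikeys/{created['id']}", timeout=5))
        return created["key"]

    def run_suite(run_all_checks):
        try:
            run_all_checks()
        finally:
            for undo in reversed(cleanups):   # reverse order, pass or fail
                try:
                    undo()
                except Exception:
                    pass  # best-effort: a failed cleanup shouldn't mask the run's result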

Two wrappers, two environments

The smoke ships with two thin shell wrappers. scripts/smoke_local.sh sources .env, pings the admin app and the MCP app to verify both are reachable, and if either isn’t, bails out with the exact start command a developer should run instead. There’s nothing worse than a smoke that fails for the wrong reason and sends you on a thirty-minute hunt; if the local server isn’t up, the script tells you so and exits before running any checks.

scripts/smoke_prod.sh is the production wrapper. It reads SMOKE_BYOK_KEY, SMOKE_ADMIN_EMAIL, and SMOKE_ADMIN_PASSWORD from the environment. If SMOKE_BYOK_KEY is missing, the BYOK group skips itself with a clear "skipped: no BYOK key configured" message rather than failing red. If admin creds aren’t supplied, the admin group skips. Skipping is a real outcome, not a failure, and the run is still considered green as long as no check returned err. The point is that the smoke is useful in environments where not every credential is available, without devolving into a configuration scavenger hunt.
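
The gating logic is small; sketched here against the environment variables named above:

    import os

    def gate_on_env(var_name, group_fn):
        # A missing credential downgrades the group to skip, not err.
        if not os.environ.get(var_name):
            return [("skip", "-", f"skipped: no {var_name} configured")]
        return group_fn()

    def run_is_green(results):
        # Green means "no errs", not "all ok".
        return all(outcome != "err" for outcome, _status, _detail in results)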

Why 22 seconds matters

A 22-second smoke runs in a deploy pipeline without anyone resenting it. A 5-minute smoke gets skipped under deadline pressure. A 30-second smoke that hangs and times out gets disabled by the first engineer who has a bad afternoon with it. We aimed for under 30 seconds wall-clock and landed at 22 in steady state, with a hard ceiling of 60 seconds before the runner aborts the whole suite.

Most of the budget goes to network round-trips. We run checks within a group sequentially (because some of them depend on state set up by earlier checks in the group) but run the groups in parallel where their dependencies allow it. The two groups that can’t parallelize are oauth (a state machine of code → token → refresh → revoke) and mcp (which needs the OAuth flow to have produced a token first). Everything else fans out.
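
Sketched with a thread pool (the real runner differs in detail, but the dependency shape is the same):

    from concurrent.futures import ThreadPoolExecutor

    def run_group(checks):
        for check in checks:        # sequential: later checks may depend on earlier state
            check()

    def run_groups(groups):
        serial = ["oauth", "mcp"]   # mcp needs the token oauth produces
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(run_group, checks)
                       for name, checks in groups.items() if name not in serial]
            for name in serial:     # the one ordered chain, on this thread
                run_group(groups[name])
            for f in futures:
                f.result()          # surface any exceptions from the fan-out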

The streaming output is part of why 22 seconds feels reasonable rather than tedious. A run that prints a check per 200ms gives the developer something to watch; the same 22 seconds spent in a single blocking call feels like the script is hung. The cost of streaming is one extra print per check and the discipline of writing each check to be standalone. Worth it.

What we’d do again

A smoke earns its keep in the disagreements, not the agreements. Every green run is a few seconds of relief and zero new information; every red one is a question about whether it’s telling you something true you didn’t already know. All six bugs the smoke caught during construction were in that second category, things sitting on main that nobody had noticed, surfaced by checks that asked specific questions instead of generic ones.

Starting over, the order of operations would be the same. Pick the most-paranoid version of each contract you can think of, write the check that would fail if that contract regressed, run it locally. If it goes green on the first run against a system you haven’t intentionally broken, look harder; there’s a decent chance the check isn’t actually asserting what you think it is. Markers, round-trips, negative paths, and status-code-plus-shape assertions produced real signal. "Did it 200" never did.

The smoke runs against every deploy now. Twenty-two seconds, six bugs we wouldn’t have caught otherwise. We expect it to keep catching them.
