Engineering
Why text_hash beats embedding-based dedup for agent memory
The same fact shows up twice. "User prefers dark mode" gets stored on Monday, and again on Thursday after a different turn extracts it from a different paraphrase. The second store should no-op and return the first memory's id. Here is how we make that happen, why we lead with a hash instead of an embedding, and the failure modes we shipped through to get there.
Dedup is one of the under-appreciated jobs of a memory product. It sounds boring (collapse equal facts into a single row), but the second-order effects are not. Without dedup, a bucket fills with near-duplicates that bloat retrieval (the same fact crowds out unrelated context in the top-k), double-count facts in graph queries (one user, three "dark mode" preference nodes, a count aggregate that lies), and quietly drain the storage and retrieval quota the customer is paying for.
So every memory system has to answer one question: how do you decide two incoming chunks of text are "the same"? Broadly there are two camps. The embedding-similarity camp computes a vector for the new memory at ingest, queries the existing vector index, and treats anything above some configured cosine-similarity threshold as a duplicate. The text-hash camp canonicalizes the content, hashes the result, and looks it up in a unique-indexed column; a hit means duplicate, a miss means new.
We do both — but in a specific order. The hash runs first, the embedding check runs after, and the order does most of the load-bearing work. The rest of this post is the engineering-level case for why that order matters more than the choice between approaches.
What we actually do
The embedding-similarity approach is the one most people reach for first. Embed the new memory, query the vector index, and if anything sits above a configured cosine-similarity threshold (commonly 0.92–0.97), treat it as a duplicate. It's a knob, and you tune it per embedding model and per domain. We use a variant of this as a secondary check at 0.95.
The primary lane is text hashing. The whole thing is small:
import hashlib

def text_hash(content: str) -> str:
    canonical = canonicalize(content)  # lowercase, collapse ws, strip terminal punct, optional lemmatize
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# At ingest:
h = text_hash(content)
try:
    db.insert_memory(tenant_id, bucket_id, content, text_hash=h, ...)
except IntegrityError:  # unique-index collision
    existing_id = db.lookup_by_hash(tenant_id, bucket_id, h)
    raise MemoryHashConflict(text_hash=h, existing_id=existing_id)
One canonicalization pass, one hash, one INSERT. Either it succeeds, or the database tells you, atomically, that the row already exists. In our production pipeline this fires first and short-circuits roughly 70% of duplicate-content stores. The embedding check picks up paraphrases the hash can't see, like "user likes dark mode" vs "dark mode is preferred by the user." We'll come back to that path. First, the case for putting the hash in front.
Why the hash goes first: determinism
The fundamental property a dedup primitive should have is determinism. The same content, fed in twice, should produce the same dedup decision every time. Forever. Across model versions, across infrastructure changes, across whatever cosmic-ray bit-flip your data center is having today.
A canonical hash has this property by construction. MD5 of "user prefers dark mode" is the same string today, tomorrow, and on a fresh database in a different region. There is no model to upgrade, no threshold to tune, no embedding drift to worry about.
Embedding similarity has no such guarantee, and there are three specific ways it falls short.
- Model versioning. You'll upgrade your embedding model every six to twelve months because the new one is either cheaper or better, and every embedding stored before the upgrade is now on a different basis from every embedding stored after. A vector produced by text-embedding-3-small and one produced by a successor model aren't comparable to each other. The choice is either re-embedding your entire corpus on every model change (which is expensive) or letting "dedup" silently mean different things in different generational layers of your data, which is worse.
- Threshold drift. 0.95 is a guess. Sometimes it's the right guess; sometimes a perfectly clean duplicate lands at 0.943 because of an adverb the rephrasing slipped in. You tune the threshold per domain, then a customer onboards with a new corpus shape and the threshold is wrong again. Hash-based dedup carries no threshold at all, which is most of what makes it boring and most of what makes it correct.
- Embedding noise. The same text embedded twice doesn't come out bit-identical in practice. Providers run batched inference non-deterministically, and 32-bit float accumulation order isn't stable across runs. Two embeddings of "user prefers dark mode" produced by the same model in the same week can sit at 0.9998 cosine, not 1.0. That's fine for retrieval, where the question is "what's most similar." Leaning on it for an identity decision is a category error, because identity demands an answer that doesn't drift with the underlying numerics.
None of this means embeddings are bad. Embeddings are a fine tool for finding things that look similar. They are a bad tool for deciding two things are the same. That distinction is the whole post in one sentence.
The schema and the partial unique index
The shape on Postgres is small. We add a text_hash column to the memories table:
CREATE TABLE memories (
id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
bucket_id UUID NOT NULL,
content TEXT NOT NULL,
text_hash VARCHAR(32), -- MD5 hex of canonicalized content
-- ... embedding, metadata, timestamps, etc.
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX uq_memories_tenant_bucket_hash
ON memories(tenant_id, bucket_id, text_hash)
WHERE text_hash IS NOT NULL;
Two things deserve attention here. The first is that the index is scoped to (tenant_id, bucket_id, text_hash), not just text_hash. Two different customers (or two different buckets within the same customer) that happen to store the same atomic fact ("user prefers dark mode") are not duplicates of each other. They are separate facts about separate users. The uniqueness is local to a bucket.
The second is the WHERE text_hash IS NOT NULL clause. The hash column is nullable, and the index is partial. Rows without a hash (legacy rows from before we added this column, or rows ingested through a path that deliberately opts out) are not subject to the constraint and don't collide with each other. That detail mattered when we shipped this on an existing table with millions of rows; we'll come back to the migration at the end.
The application code that drives this is straightforward. The INSERT is unguarded, with no "look first, then insert" check, and the catch translates the database-level integrity error into a domain exception:
try:
    cur.execute(
        "INSERT INTO memories (id, tenant_id, bucket_id, content, text_hash, ...) "
        "VALUES (%s, %s, %s, %s, %s, ...)",
        (mem_id, tenant_id, bucket_id, content, text_hash, ...),
    )
except psycopg2.errors.UniqueViolation:
    existing_id = check_memory_hash_exists(tenant_id, bucket_id, text_hash)
    raise MemoryHashConflict(
        text_hash=text_hash,
        existing_id=existing_id,
    )
On SQLite (our embedded development and test backend) the same shape applies. CREATE UNIQUE INDEX IF NOT EXISTS uq_memories_tenant_bucket_hash ON memories(tenant_id, bucket_id, text_hash) WHERE text_hash IS NOT NULL, catch sqlite3.IntegrityError, look for "UNIQUE constraint" in the message, raise MemoryHashConflict. The exception type is shared across backends so callers don't have to care which database is underneath.
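A rough sketch of that SQLite path, assuming a thin helper around the connection (the function and argument names are illustrative, not our exact module layout):

import sqlite3

# MemoryHashConflict is the shared domain exception defined later in the post.
def create_memory_sqlite(conn: sqlite3.Connection, mem_id, tenant_id, bucket_id,
                         content, text_hash):
    try:
        conn.execute(
            "INSERT INTO memories (id, tenant_id, bucket_id, content, text_hash) "
            "VALUES (?, ?, ?, ?, ?)",
            (mem_id, tenant_id, bucket_id, content, text_hash),
        )
        conn.commit()
        return mem_id
    except sqlite3.IntegrityError as exc:
        # Only translate the unique-index collision; anything else propagates.
        if "UNIQUE constraint" not in str(exc):
            raise
        row = conn.execute(
            "SELECT id FROM memories WHERE tenant_id=? AND bucket_id=? AND text_hash=?",
            (tenant_id, bucket_id, text_hash),
        ).fetchone()
        raise MemoryHashConflict(text_hash=text_hash, existing_id=row[0])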
Why the unique index, not SELECT-then-INSERT
The naive way to write this is to check first:
# DON'T DO THIS
existing = db.query(
    "SELECT id FROM memories WHERE tenant_id=%s AND bucket_id=%s AND text_hash=%s",
    (tenant_id, bucket_id, h),
)
if existing:
    return existing.id
db.execute("INSERT INTO memories ... text_hash=%s ...", (..., h, ...))
This is broken under concurrency. Two workers handling two ingest requests for the same fact, arriving within a few milliseconds of each other, both run the SELECT, both see no row, both run the INSERT, and you now have two memories with the same hash. The unique index catches this; SELECT-then-INSERT doesn't.
The race window isn't theoretical. We see it in the wild any time an agent fans out multiple tool calls in parallel and several of them extract the same fact ("user is using Python 3.12") on slightly different prompts. Without the unique index, those races create duplicates exactly when the system is being used most aggressively.
Letting the database enforce uniqueness moves the decision into a place where it can be atomic. The INSERT either commits the row or fails with an integrity error; there is no "in between" state where two writers can both succeed. That's the only correctness story that survives concurrent ingest.
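For completeness: the same atomicity can also be expressed without the exception round-trip, using Postgres's ON CONFLICT clause against the partial index. This is a sketch of that alternative shape, not what our ingest path does; we keep the exception channel for reasons covered further down.

-- Alternative shape (not our production path): let the INSERT no-op on collision.
INSERT INTO memories (id, tenant_id, bucket_id, content, text_hash)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (tenant_id, bucket_id, text_hash) WHERE text_hash IS NOT NULL
DO NOTHING
RETURNING id;
-- An empty result means the row already existed; a follow-up SELECT on
-- (tenant_id, bucket_id, text_hash) recovers the surviving memory's id.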
Canonicalization
The hash is only as good as the canonicalization step in front of it. Equal facts that produce different canonical forms hash differently and the dedup misses. Unequal facts that collapse to the same form hash the same and the dedup over-merges. Both are bugs; the second is worse because it loses information.
The obvious stuff is obvious. We lowercase, trim, collapse internal whitespace runs to a single space, and strip one trailing ., !, or ?. The choices that took thought are the ones we made about which transformations not to apply.
We lemmatize content words for extracted facts ("user likes dogs" and "user liked dogs" collapse) but not for direct user quotes, because the surface form is part of what the customer is storing. We don't normalize numerals: "3 cats" and "three cats" hash separately, on the theory that they behave differently in downstream count-aggregate queries and that if they really are duplicates the embedding pass will catch them and run conflict resolution. We leave currency symbols alone, because "$5" and "5" are not the same fact. And we don't touch negation words. Stripping "not" would collapse "user does not like dogs" and "user does like dogs" into the same hash, which is the worst possible failure for a dedup primitive.
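Put together, a minimal sketch of a canonicalizer with those rules looks something like this (the lemmatizer hook is illustrative; the production version carries more cases):

import re
import unicodedata

def canonicalize(content: str, lemmatize_fn=None) -> str:
    text = unicodedata.normalize("NFKC", content)  # see the unicode failure mode below
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)               # collapse whitespace runs to one space
    text = re.sub(r"[.!?]$", "", text)             # strip one trailing ., !, or ?
    if lemmatize_fn is not None:                   # extracted facts only, never direct quotes
        text = " ".join(lemmatize_fn(token) for token in text.split())
    # Deliberately untouched: numerals, currency symbols, negation words.
    return text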
We treat any change to the canonicalizer as a migration-level event. The rules are pinned by a unit-test file of around 80 input/output pairs, and changing them retroactively changes every hash already in the database. We've touched the canonicalizer exactly once since shipping it, behind a feature flag with a backfill.
Embedding similarity as the secondary lane
Text-hash dedup catches what it catches. It does not catch paraphrases. "User prefers dark mode" and "the user has set their preference to dark mode" canonicalize to different strings and hash differently, even though they're obviously the same fact.
That's where embedding similarity earns its keep. After the hash check passes with no collision, we compute the embedding anyway (we need it for retrieval) and check it against existing memories in the same bucket. The threshold is 0.95 cosine. Anything above that is a candidate duplicate.
A candidate, not a confirmed one. Two memories with cosine 0.96 might be paraphrases of the same fact, or they might be contradictions that happen to share most of their tokens ("user likes Python" vs "user dislikes Python" sit closer than you'd expect). Above the threshold, we run a conflict-resolution step: a short LLM call that looks at the two pieces of content and decides whether they are a paraphrase (merge, keep the older memory's id), a contradiction (resolve via a domain policy, newest-wins by default with overrides per bucket), or a genuinely-distinct fact that happens to embed close (keep both).
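In rough pseudocode, the secondary lane looks like this (the helper names, the top-k of 5, and the resolve_conflict call are illustrative; the real pipeline is more involved):

SIMILARITY_THRESHOLD = 0.95  # cosine, tuned per embedding model

def dedup_secondary(bucket, content, embedding):
    """Return the id of an existing memory the content collapses into, or None."""
    # Nearest neighbours within the same bucket only; results sorted by similarity.
    for cand in bucket.vector_search(embedding, top_k=5):
        if cand.cosine < SIMILARITY_THRESHOLD:
            break  # nothing further down the list will clear the bar
        verdict = resolve_conflict(existing=cand.content, incoming=content)  # short LLM call
        if verdict == "paraphrase":
            return cand.id  # merge: keep the older memory's id
        if verdict == "contradiction":
            return apply_conflict_policy(cand, content)  # newest-wins by default, per-bucket overrides
        # "distinct": embeds close but is a different fact; keep both, keep looking
    return None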
The embedding lane is where the smarts live. But it's the secondary lane on purpose. By the time we get there, the hash has already taken the ~70% of cases that don't need any LLM reasoning at all. We don't pay extraction cost, embedding cost, or LLM-conflict-resolution cost on those.
The MemoryHashConflict exception, and why it's the API
When the unique index fires, the database raises a backend-specific integrity error. We catch it at the data-access boundary and translate it into a single domain exception, MemoryHashConflict, that carries two fields: the text_hash that collided and the existing_id of the row that was already there.
from uuid import UUID

class MemoryHashConflict(Exception):
    """Raised by create_memory() when a memory with the same text_hash
    already exists for this (tenant_id, bucket_id)."""

    def __init__(self, text_hash: str, existing_id: UUID):
        self.text_hash = text_hash
        self.existing_id = existing_id
        super().__init__(
            f"Memory with text_hash={text_hash!r} already exists "
            f"(id={existing_id})"
        )
The application code at the next layer up catches this exception and short-circuits the rest of the ingest pipeline:
try:
    new_id = store.create_memory(tenant_id, bucket_id, content, ...)
except MemoryHashConflict as conflict:
    # Existing memory found. No new extraction, no new embedding,
    # no new graph writes, no new aggregate updates. Just return the id.
    return {"memory_id": conflict.existing_id, "deduped": True}
The cost of a duplicate-content store, end to end, is two DB roundtrips: the INSERT that fails, plus the SELECT that pulls the existing id out of the index. No extraction call, no embedding call, no graph traversal, no aggregate recompute. The dedup hit is the cheapest path through the system, which is the right shape: it should be cheaper to be told "you've seen this already" than to ingest a new fact.
The reason we lift the database error into a domain exception, rather than just returning None or a tuple from create_memory, is that the dedup hit is not the common path. Ninety-something percent of the time, ingest succeeds. Reserving the exception channel for the unusual case keeps the happy path readable and forces every caller to deal with the conflict explicitly. You can't silently miss a dedup hit and accidentally proceed as if you wrote a new row.
Failure modes we've actually hit
The biggest one is metadata divergence. A customer stores "user prefers dark mode" from a settings screen with one set of provenance fields, and two days later the same content arrives from chat extraction with different fields and a fresher timestamp. Today our dedup is on text alone, so the first store wins and the richer metadata on the second store is dropped. That's defensible (the fact is the same), but a fraction of customers want the metadata merged. We have an open design for (text_hash, metadata_signature) dedup, where the signature is a hash of a configurable subset of metadata fields. The bar for shipping it is hearing from enough customers that the current behavior is wrong for them, and we aren't there yet.
Adjacent to this is whitespace-significant content: the canonicalizer collapses whitespace runs, which is right for prose and wrong for code snippets stored as memories. We've sidestepped that by telling customers to put code in dedicated buckets with whitespace collapse disabled. The proper fix is a per-bucket canonicalizer profile, on the roadmap.
The most embarrassing one was unicode. A customer had a Japanese corpus where the same fact appeared sometimes with half-width katakana and sometimes with full-width. Two visually identical strings, two different hashes, dedup rate near zero on that bucket until we noticed. The fix was four lines: apply NFKC normalization before hashing. The lesson was bigger. Canonicalization silently encodes assumptions about "the text," and those assumptions fail loudly the moment the corpus isn't in the language family you wrote them for. We now run NFKC by default and have a runbook entry that starts with "if dedup isn't firing on a multilingual bucket, check unicode forms first."
Hash collisions in the cryptographic sense round out the list, and we mostly don't care about them. MD5 collisions exist; the threat model isn't adversarial; the probability of two semantically distinct, naturally occurring sentences colliding inside the same (tenant_id, bucket_id) is small enough that it hasn't appeared in any production data we've looked at. SHA-256 would be safer at higher CPU cost; we picked MD5 and made that trade-off deliberately. It's a one-line change if we ever decide otherwise.
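The unicode failure, at least, is easy to reproduce in miniature (the strings here are illustrative, not the customer's corpus):

import hashlib
import unicodedata

def h(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

half = "ﾀﾞｰｸﾓｰﾄﾞ"    # half-width katakana "dark mode"
full = "ダークモード"  # full-width katakana, visually the same word

h(half) == h(full)  # False: two hashes, dedup never fires
h(unicodedata.normalize("NFKC", half)) == h(unicodedata.normalize("NFKC", full))  # True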
Layering, not either-or
We're not embedding-similarity-vs-text-hash; we're text-hash-then-embedding-similarity. The cheap deterministic check handles the bulk of cases and the expensive probabilistic check handles the long tail. That's the same pattern that shows up in BM25-then-vector retrieval and in extraction-then-LLM-verification, and it's the order that matters: if you reach for a similarity threshold first to answer an identity question, you've made the decision non-reproducible at the front of the pipeline, and nothing downstream can recover that.
The migration that got us here
We shipped text_hash on an existing memories table with several million rows across hundreds of buckets. The migration had to run without locking out writes for more than a few seconds at a time. The shape that worked:
1. Add the column, nullable. ALTER TABLE memories ADD COLUMN text_hash VARCHAR(32); is a metadata-only change in Postgres on a nullable column with no default, fast even at scale.
2. Deploy the writer change. All new INSERTs start computing and storing a hash. The unique index doesn't exist yet, so collisions don't fire; we collect new hashes silently.
3. Backfill in batches. A worker chunks through existing rows in batches of a few thousand at a time, computes text_hash for each, and writes it back. Throttled to keep replica lag bounded. Crucially, the backfill is idempotent and resumable: if it dies mid-batch, restarting picks up from the last committed row.
4. Detect pre-existing duplicates. Once the backfill is complete, run a one-time query that groups by (tenant_id, bucket_id, text_hash) and finds groups of size > 1 (a sketch of that query follows this list). These are real duplicates that existed before dedup did. We resolve them per-bucket: keep the oldest row, point any references to the dropped rows at the kept one, and delete.
5. Create the unique index. CREATE UNIQUE INDEX CONCURRENTLY uq_memories_tenant_bucket_hash ON memories(tenant_id, bucket_id, text_hash) WHERE text_hash IS NOT NULL; CONCURRENTLY avoids the long write lock, and the partial-index WHERE clause means any rows that didn't get a hash for some reason (race during the backfill window, future opt-out path) don't block index creation.
6. Enable the catch. Deploy the application code that translates UniqueViolation into MemoryHashConflict. Until this point, a collision would surface as a 500 to the API caller; after this, it surfaces as a clean dedup hit.
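The duplicate-detection query in step 4 is nothing fancier than a GROUP BY with a HAVING clause; roughly:

-- One-time sweep after the backfill: hashes that appear more than once inside
-- a (tenant_id, bucket_id), i.e. duplicates that predate the unique index.
SELECT tenant_id, bucket_id, text_hash,
       array_agg(id ORDER BY created_at) AS memory_ids,  -- oldest first: the one we keep
       count(*) AS copies
FROM memories
WHERE text_hash IS NOT NULL
GROUP BY tenant_id, bucket_id, text_hash
HAVING count(*) > 1;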
The whole migration ran over about a week of calendar time. Step 3 was the long pole. Steps 4 and 5 were the scary ones (pre-existing duplicates and concurrent index builds), and both went without incident thanks to staging dry-runs on a restored snapshot. If you're doing something similar on a production memory table, the order above is the order to do it in; in particular, do not create the unique index before you've verified there are no existing duplicates, because the build will fail at the last second after hours of work.
Further reading
Closely related
- Designing the memories table for a system you can't easily migrate. Two years of column-by-column iteration on the spine of the system, with the painful migrations called out.
- Iterating the extraction prompt: 28 versions and what each one fixed. A partial history of the EXTRACTION_PROMPT, version by version. Each one fixed a failure the previous one caused.
- Zero-downtime backfill migrations: the HMAC rollout in detail. Opportunistic backfill driven by the verify path, plus the partial unique index that made the rollout possible.
Engram
- Engram on LongMemEval-S: 91.6%. Full benchmark methodology and what didn't work.
- Engram docs. HTTP API, MCP setup for each client, SDK examples.
- Start with Engram. Free tier, BYOK, MCP-native.