Iterating the extraction prompt: 28 versions and what each one fixed

A prompt is a contract between you and a model. Every word matters, but you only find out which word matters when something breaks. Our extraction prompt has been through 28+ versions over the last year. This is the abbreviated history.

Published April 17, 2026 · By Jacob Davis and Ben Meyerson

Prompt iteration is one of the most underdocumented activities in shipping a memory product. Public writeups usually present a single finished prompt as if it sprang fully formed from a quiet afternoon. The reality on our end is closer to a junk drawer of dated text files, each one fixing a problem the previous one created.

We wrote a separate post earlier this month about the composer prompt (the one that takes retrieved memories and produces an answer) and walked through how it got to v44 on the LongMemEval benchmark. This post is the equivalent for the extraction prompt, the one that runs at ingest time and turns raw text into structured knowledge. Different prompt, different model, different failure modes. Same arc.

What follows is a partial history of EXTRACTION_PROMPT, the string that lives in src/hybrid/groq_extractor.py and runs against every message that arrives at POST /v1/buckets/{bucket}/memories. We picked seven versions to walk through. Each one fixed a specific failure we'd seen in production or in an eval. Each one introduced a new one.

What the extractor actually does

When someone calls store_memory(content, bucket) against Engram, three things happen in parallel server-side: we embed the text for vector search, we store the raw text for BM25, and we run an extractor that turns the text into a list of subject-predicate-object triples for the knowledge graph. The triples are what make "how many weddings did I attend this year" answerable. Graph aggregation over a count of distinct event entities is a different shape of query than vector similarity.

The extractor itself is small. It's one LLM call, JSON output, no tool use, no retries beyond a single parse-failure fallback. The model behind it has run on three configurations over the prompt's life (llama-3.1-70b → llama-3.3-70b → currently a mix depending on tenant config), but the prompt has done most of the load-bearing work.
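For concreteness, here's roughly the shape of that call. This is a minimal sketch, not the production code; the client setup and model id are assumptions, and EXTRACTION_PROMPT is the constant discussed below.

import json
from groq import Groq  # assumes the Groq Python SDK

client = Groq()  # reads GROQ_API_KEY from the environment

def extract_triples(text: str, observation_date: str) -> list[dict]:
    # One LLM call, JSON out, one parse-failure fallback. No tools, no retries.
    prompt = EXTRACTION_PROMPT.format(
        text=text, observation_date=observation_date
    )
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # default tenant; varies by config
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    raw = resp.choices[0].message.content
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the single fallback: a failed parse writes no triples
    return parsed if isinstance(parsed, list) else []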

The contract the prompt has to enforce:

  • JSON array output, no surrounding prose, no markdown fences.
  • Predicates lowercased and snake_cased: works_at not worksAt or "works at".
  • Subjects and objects in a normalized form: short noun phrases, no trailing punctuation, the same entity referred to the same way across triples.
  • Relative date phrases resolved against an observation_date variable that we pass in for every call.
  • No invented facts. Only things the source text actually says or directly implies.

Every one of those bullets is the scar from a failure. The first version of the prompt enforced exactly one of them.
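A contract like that is also cheap to check mechanically after the call. A sketch of what such a check could look like (illustrative helper names, not our production code):

import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def satisfies_contract(t: dict) -> bool:
    # Reject rather than repair: anything off-contract is dropped.
    if not all(k in t for k in ("subject", "predicate", "object")):
        return False
    if not SNAKE_CASE.match(str(t["predicate"])):
        return False  # works_at, not worksAt or "works at"
    for field in ("subject", "object"):
        value = str(t[field]).strip()
        if not value or value[-1] in ".,;:":
            return False  # short noun phrase, no trailing punctuation
    return True

def filter_output(parsed) -> list[dict]:
    if not isinstance(parsed, list):
        return []  # array output only, no bare objects
    return [t for t in parsed if isinstance(t, dict) and satisfies_contract(t)]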

Version snapshots

v1: "extract triples from this text, return JSON"

The first version was nine words. We had a working pipeline that needed a placeholder string and we wrote one. It looked roughly like this:

Extract triples from this text. Return JSON.

Text: {text}

This worked maybe 60% of the time, where "worked" means "produced parseable JSON that contained at least one usable triple." Modes of failure included: returning a prose summary instead of triples, returning a single triple as a bare object instead of an array, inventing entities not present in the text ("the user works at Acme Corp" when the text mentioned a meeting at Acme without saying the user worked there), and using a different schema every few calls (sometimes {subject, predicate, object}, sometimes {s, p, o}, sometimes {from, relation, to}).
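One way to absorb that kind of schema drift downstream is a small coercion shim. A sketch under that assumption (the alias table comes straight from the variants above; this is illustrative, not our v1-era code):

# Map whichever schema variant the model chose onto the canonical keys.
KEY_ALIASES = {
    "subject": ("subject", "s", "from"),
    "predicate": ("predicate", "p", "relation"),
    "object": ("object", "o", "to"),
}

def coerce_triple(raw: dict) -> dict | None:
    triple = {}
    for canonical, aliases in KEY_ALIASES.items():
        value = next((raw[a] for a in aliases if a in raw), None)
        if value is None:
            return None  # not recoverable under any known variant
        triple[canonical] = value
    return triple

def coerce_output(parsed) -> list[dict]:
    # v1 sometimes returned a single bare object instead of an array.
    items = parsed if isinstance(parsed, list) else [parsed]
    return [t for t in (coerce_triple(i) for i in items if isinstance(i, dict)) if t]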

We accepted v1 as a baseline because it made downstream work possible. The graph existed. Quality was the next problem.

v6: a worked example

One example at the bottom of the prompt. A canonical triple in the JSON shape we wanted, labeled "Example." JSON-validity jumped from around 60% to around 92% on a small eval set we'd assembled by hand from production logs.

What survived was more interesting. The model now produced valid JSON in the right shape, but roughly 15% of the triples were hallucinated. "I had lunch with Sarah" would produce (Sarah, works_at, Google) on no evidence whatsoever. Our one example was clearly an "interesting" triple, and the model took the hint that we wanted interesting things, whether or not the text contained them.

v12: "do not invent"

v12 added an explicit instruction at the top of the rules section: extract only entities and facts literally present in the input text. Hallucination dropped from ~15% to ~3%. We celebrated for about a day, then noticed a new category of failure on the eval set. The text said "I went to my brother's wedding in March." The model was refusing to emit (user, has_brother, true) because "brother" wasn't a standalone claim; it was a possessive embedded in a larger sentence. Useful inferences like "user has a brother" or "the user attended a wedding in March" were getting filtered out as "not literally stated." We'd over-corrected. The rule was too strict in one direction, where previously it had been too permissive in the other.

v18: the light inference whitelist

v18 introduced a short, explicit list of inferences the model was allowed to make. The list was deliberately narrow: family relations implied by possessives ("my brother" implies the user has a brother), roles implied by activity ("I'm presenting at the standup" implies the user attends standups), basic temporal anchoring (an event mentioned with a date is an event that happened on that date).

The whitelist recovered most of the inferences v12 had killed without bringing back the broader hallucination v6 had let through. The rule looked something like this:

Extract ONLY facts stated or directly implied by the text.
Direct implication is allowed in these cases ONLY:
  - Possessives: "my X" implies the user has an X.
  - Role/activity: "I'm presenting" implies the user gives presentations.
  - Stated dates: an event mentioned with a date occurred on that date.
Otherwise, do not invent entities or relationships.

The new failure was subtle. The whitelist tempted the model to apply each rule maximally. "I'm at my parents' house" would emit (user, has_father, true) AND (user, has_mother, true) AND (user, has_parents, true), three triples for the price of one. The graph started bloating. We'd buy that back later in normalization, but at v18 it was a small regression on triple-count metrics in exchange for a large win on recall.
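The shape of that later normalization buy-back is simple: when a more specific triple is present in the same batch, the generic one adds nothing. A sketch of the idea (the subsumption table is a toy illustration, not the production mapping):

# Generic predicates made redundant by more specific ones in the same batch.
SUBSUMED_BY = {
    "has_parents": {"has_father", "has_mother"},
}

def collapse_redundant(triples: list[dict]) -> list[dict]:
    present = {t["predicate"] for t in triples}
    kept = []
    for t in triples:
        if SUBSUMED_BY.get(t["predicate"], set()) & present:
            continue  # e.g. drop (user, has_parents, true) next to has_father/has_mother
        kept.append(t)
    return kept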

v22: date grounding

Up through v21, the model would happily extract (user, attended, "wedding yesterday") as a triple. The object string was the user's exact words. Which meant the graph stored "yesterday" as if it were a stable timestamp, and three months later a query for "what did I do last week" would surface a wedding that had actually happened in March.

v22 added the observation_date variable (the timestamp of the message itself, passed in at format-time) and a block of rules telling the model to resolve relative date phrases against it. This was the single biggest temporal-reasoning improvement we ever shipped. The rules section grew, but it was specific:

Observation Date: {observation_date}

TEMPORAL GROUNDING: resolve EXPLICIT relative date phrases inside
the text to absolute dates based on Observation Date, and bake the
absolute date INTO the object string itself.
Ground these (concrete event-time references):
  - "yesterday"             -> day before Observation Date
  - "last week" / "a week ago" -> the week before Observation Date
  - "5 days ago"            -> Observation Date minus 5 days
  - "this morning" / "today" / "tonight" -> Observation Date
  - "next Tuesday"          -> next Tuesday after Observation Date
  - "last month" / "in March" -> compute from Observation Date

The new failure: the model started grounding things it shouldn't. "I water the herb garden every morning" became (user, waters_herb_garden_on, "2024-03-15") as if "every morning" referred to one specific morning. Habitual phrases were getting collapsed to event times.

We patched it inside the same version with a second list of phrases the model was specifically told not to ground: "every morning," "always," "sometimes," "for a while." The same rule block now carries both halves:

Do NOT ground habitual / recurring / vague phrases — they have no
single event date. Leave them as-is in the object string:
  - "every morning", "every Sunday"
  - "always", "sometimes", "usually", "often"
  - "for a while", "for a few days" (duration, not a date)
  - "recently" without further specification

The pair of rules (ground the concrete, don't ground the habitual) is the part of the prompt we've touched least since. It's been stable from v22 onward, which is unusual for any block in this prompt.
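One reason the pair has stayed stable: both halves are mechanically checkable in an eval, because grounded phrases resolve deterministically from observation_date. A simplified sketch of a reference resolver for such checks (the phrase tables here are abbreviated):

from datetime import date, timedelta

HABITUAL = {"every morning", "every sunday", "always", "sometimes",
            "usually", "often", "for a while", "recently"}

def resolve(phrase: str, observation_date: date) -> str:
    # Grounded phrases become absolute dates; habitual phrases pass through.
    p = phrase.lower()
    if p in HABITUAL:
        return phrase  # must NOT be grounded
    if p == "yesterday":
        return str(observation_date - timedelta(days=1))
    if p in ("today", "this morning", "tonight"):
        return str(observation_date)
    if p.endswith("days ago"):
        return str(observation_date - timedelta(days=int(p.split()[0])))
    raise ValueError(f"no deterministic grounding for {phrase!r}")

# resolve("5 days ago", date(2023, 5, 28))    -> "2023-05-23"
# resolve("every morning", date(2023, 5, 28)) -> "every morning", unchanged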

v25: update patterns

Users restate things. "Remember when I got pre-approved for $400,000?" is a casual reference to a fact the user had told the assistant about two weeks earlier, when the number was $350,000. The earlier number was wrong or had been revised, and the user was now telling us the current value while pretending to remind us. We'd been quietly losing points on this for months without realizing it.

Extractors are good at extracting facts the first time. They are bad at noticing that a casually mentioned fact is actually a knowledge update that should supersede an earlier value. The pre-v25 model's instinct was to skip the new triple as redundant. It had already seen pre-approval mentioned; why mention it again?

v25 added an explicit rule. Restatements are not redundancy; they're supersession. Emit them as fresh triples and let the downstream layer pick the latest timestamp:

UPDATE PATTERNS: when the text restates a previously-known value
with a NEW value, capture it as a fresh triple — even if it's
mentioned casually or as a reminder. These are knowledge updates
that should supersede any earlier value via timestamp ordering.
Watch for:
  - "remember when I got pre-approved for $400,000"
    -> emit (user, got_pre_approved_for, $400,000) — even if an
       earlier message said $350,000
  - "I just ran 25:50, beat my old 27:12!"
    -> emit (user, has_5k_personal_best, 25:50) — supersedes 27:12
  - "actually, my budget is $3,000 now"
    -> emit (user, has_budget, $3,000) as a fresh fact

This single addition was responsible for a measurable jump on the LongMemEval knowledge-update category. The new failure: the model started over-extracting on every casual reference, treating any mention of a number as a potential update. We narrowed the rule by anchoring it to explicit phrasings ("we agreed on X," "we settled on X," "now it's X"), which helped.
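The extractor's only job here is not to drop the restatement; the actual supersession is a latest-timestamp-wins merge downstream. Roughly (a sketch, not the graph writer itself):

def latest_values(triples: list[dict]) -> dict[tuple[str, str], dict]:
    # For single-valued facts, the newest observation of (subject, predicate) wins.
    latest: dict[tuple[str, str], dict] = {}
    for t in sorted(triples, key=lambda t: t.get("timestamp", "")):
        latest[(t["subject"], t["predicate"])] = t  # later timestamps overwrite
    return latest

# (user, has_budget, $2,500) @ 2024-01-03 then (user, has_budget, $3,000) @ 2024-02-11
# -> a budget query sees only the February value.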

v28: predicate normalization

v28 fixed the longest-standing problem in the prompt's history. The model had always been free to invent predicate names. The same relation, "X lives in Y," would come out as lives_in, lives in, residence, located_at, resides_in, sometimes lives_at if the object was an address. Graph queries that joined on predicate name were fragmenting. A user could "live in" a city three different ways and a count query would return three different counts depending on which predicate it asked for.

v28 added a short preferred-predicate list and a normalization rule:

Use simple, normalized predicates. Prefer the following forms when
applicable:
  works_at, lives_in, has_role, attended, owns, uses, knows,
  has_preference, has_budget, has_personal_best_in
Predicates MUST be snake_case, lowercased, no spaces.

We measured the impact on downstream graph fragmentation by sampling 200 buckets before and after v28 rolled out and counting distinct predicates per relation cluster (cluster defined by string-similarity on the predicate plus shared subject/object types). Distinct-predicate count dropped by roughly 40%. Query results got a corresponding bump on relational questions.
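The measurement itself was unglamorous: greedy string-similarity clustering over predicate names. A simplified sketch (the production version also keyed clusters on subject/object types; the threshold here is arbitrary):

from difflib import SequenceMatcher

def cluster_predicates(predicates: list[str], threshold: float = 0.6) -> list[set[str]]:
    # Spelling variants of one relation (lives_in, "lives in", lives_at)
    # should land in one cluster; the metric is distinct predicates per cluster.
    clusters: list[set[str]] = []
    for p in predicates:
        for cluster in clusters:
            representative = next(iter(cluster))
            if SequenceMatcher(None, p, representative).ratio() >= threshold:
                cluster.add(p)
                break
        else:
            clusters.append({p})
    return clusters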

The new failure introduced by v28 was the most predictable one of the entire history: the model started forcing every relation into the preferred list even when the list didn't fit. A medical-history reference would get extracted with has_role as the predicate because has_condition wasn't on the list. Since then we've been iterating on the preferred list rather than on the rule; the rule itself has been stable.

The pattern across versions

If you read through those seven snapshots back to back, the shape of what changed is more interesting than any individual change. Almost none of the fixes were of the form "make the model smarter." They were all of the form "remove the ambiguity in the request."

The model was already capable of extracting triples in v1. It just didn't know which triples we wanted, in which shape, with which constraints. Every constraint we added came from a specific failure we'd seen in the wild. We didn't sit down and brainstorm a list of rules upfront and ship them all at once. We shipped, watched things break, named the failure mode, wrote a rule that addressed it, watched a new thing break. The interval between "ship a fix" and "discover what that fix broke" was usually a week, sometimes less. v22's habitual-phrase regression showed up the same afternoon we deployed the date-grounding rule.

This is the part most prompt-engineering writeups gloss over. The prompt at any point in time is a record of which failures have been seen and addressed, not a record of what a thoughtful person believes the model should be told. Those are different artifacts. The first kind is short, ugly, and earns its keep. The second kind tends to be long, elegant, and full of rules that don't address anything.

Things we tried and rolled back

Multi-step extraction

The pattern is obvious in theory: extract triples in a first pass, then run a second pass with a verifier prompt that audits the first pass and removes any triple not supported by the source text. We spent a week on this around v14. It made things worse.
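For reference, the two-pass shape looked roughly like this. This is a reconstruction of what we rolled back, not live code; VERIFIER_PROMPT and llm_call are stand-ins, and extract_triples is the single-pass call sketched earlier:

import json

VERIFIER_PROMPT = (  # stand-in wording, not the actual verifier prompt
    "Does the source text support this triple? "
    "Answer 'supported' or 'unsupported'.\n"
    "Text: {text}\nTriple: {triple}"
)

def llm_call(prompt: str) -> str:
    # Stand-in for a single chat-completion call (see the extractor sketch above).
    raise NotImplementedError

def extract_then_verify(text: str, observation_date: str) -> list[dict]:
    candidates = extract_triples(text, observation_date)  # pass 1: extract
    kept = []
    for t in candidates:
        verdict = llm_call(VERIFIER_PROMPT.format(
            text=text, triple=json.dumps(t)))              # pass 2: audit
        if verdict.strip().lower().startswith("supported"):
            kept.append(t)
    return kept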

The two passes disagreed with each other on edge cases. The extractor would emit (user, has_brother, true) from "my brother's wedding"; the verifier would refuse it because "the source doesn't directly state the user has a brother, it states an event involving a brother." Both readings were defensible. We tried a tiebreaker (a third call to arbitrate). The tiebreaker had its own failure modes: when the extractor and verifier disagreed on the strict reading, the arbiter would sometimes side with whichever of the two had written more text, regardless of which one was right.

We also tried tightening the verifier prompt to bias toward keeping the extraction. That worked until the extractor started getting more aggressive on its own (because the verifier was no longer pushing back), and the system drifted back toward v6-style hallucination within about a hundred eval examples.

Net effect after a week of variants: extraction quality was roughly the same on average, with higher variance, at 2x the inference cost per message. The real lesson was that "valid inference" is not a property two prompts can agree on by negotiating; it's a definition that needs to live in one place. We rolled back to single-pass and pushed the definition into the v18 whitelist, which was the right place for it. The whitelist did in one pass what we'd been trying to get two passes to agree on.

Few-shot with 20 examples

Around v16 we beefed up the worked-example section to 20 examples covering possessives, dates, restatements, negations, and lists. Quality went up modestly on the eval set. Then production started returning truncation errors on long payloads (journal apps, meeting notes) where the prompt plus 20 examples plus the input was hitting the model's context limit on the largest entries. We cut back to three examples covering the most common failure modes and kept the rules section explicit instead. The three did most of the work the 20 had been doing, and we've never gone back above three.

Asking for confidence scores

v19 had each triple emit an optional confidence field. The plan was to threshold writes on it. The confidences weren't calibrated: possessive-derived triples came back at 0.6, hallucinations at 0.95. The model was most confident on the wrong ones, and threshold-filtering was actively harmful. We pulled the field in v20. Self-reported confidence on an open-ended generation task, where the labels themselves are being generated, isn't a signal.

The current prompt

Here's the version of the prompt that's live at the time of writing. It lives in src/hybrid/groq_extractor.py as the EXTRACTION_PROMPT constant. MIT-licensed, copy-paste-friendly. It carries every lesson from the seven versions above plus a few we didn't cover.

Extract factual relationships from the following text as
subject-predicate-object triples.

Observation Date: {observation_date}

Rules:
1. Extract ONLY explicit facts stated in the text.
2. Use simple, normalized predicates (e.g., "works_at",
   "lives_in", "has_role").
3. Keep subjects and objects concise but complete.
4. TEMPORAL GROUNDING: resolve EXPLICIT relative date phrases
   inside the text to absolute dates based on Observation Date,
   and bake the absolute date INTO the object string itself.
   Ground these (concrete event-time references):
     - "yesterday"             -> day before Observation Date
     - "last week" / "a week ago" -> the week before Obs. Date
     - "5 days ago"            -> Observation Date minus 5 days
     - "this morning" / "today" / "tonight" -> Observation Date
     - "next Tuesday"          -> next Tuesday after Obs. Date
     - "last month" / "in March" -> compute from Obs. Date
   Do NOT ground habitual / recurring / vague phrases — they
   have no single event date. Leave them as-is in the object
   string:
     - "every morning", "every Sunday"
     - "always", "sometimes", "usually", "often"
     - "for a while", "for a few days" (duration, not a date)
     - "recently" without further specification
   Bad:  object = "made beef stew last Tuesday"
   Good: object = "made beef stew on 2023-05-23"
   Bad:  object = "waters herb garden on 2023-04-15"
                  (was "every morning")
   Good: object = "waters herb garden every morning"
                  (habitual, no event date)
5. UPDATE PATTERNS: when the text restates a previously-known
   value with a NEW value, capture it as a fresh triple — even
   if it's mentioned casually or as a reminder. These are
   knowledge updates that should supersede any earlier value
   via timestamp ordering. Watch for:
     - "remember when I got pre-approved for $400,000"
       -> emit (user, got_pre_approved_for, $400,000)
     - "I just ran 25:50, beat my old 27:12!"
       -> emit (user, has_5k_personal_best, 25:50)
     - "actually, my budget is $3,000 now"
       -> emit (user, has_budget, $3,000)
   Do NOT skip these as redundant. Always include the
   "timestamp" field with the message's Observation Date so
   supersession works correctly.
6. Return triples in JSON format.

Text: {text}

Return a JSON array of triples. Each triple should have:
subject, predicate, object, and optionally timestamp.
Example: [{"subject": "John", "predicate": "works_at",
          "object": "Acme Corp", "timestamp": "2023-01-15"}]

Triples:

Two placeholders, both required: {text} (the raw input) and {observation_date} (the timestamp of the message, in YYYY-MM-DD form). Tenants on the BYOK plan can override this with their own prompt template as long as both placeholders survive. If a custom template is missing a placeholder, ingest falls back to the default rather than failing. That's a deliberate safety net we added after the first BYOK tenant accidentally shipped a template missing the date variable.
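The check behind that safety net is cheap. A sketch (illustrative names; EXTRACTION_PROMPT is the default constant above):

REQUIRED_PLACEHOLDERS = ("{text}", "{observation_date}")

def choose_prompt(tenant_template: str | None) -> str:
    # Fall back to the default template rather than failing the ingest call.
    if tenant_template and all(
        p in tenant_template for p in REQUIRED_PLACEHOLDERS
    ):
        return tenant_template
    return EXTRACTION_PROMPT  # default; both placeholders guaranteed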

What didn't change

The JSON output shape has been essentially frozen from v1 to today: {subject, predicate, object, timestamp?}. The optional timestamp field was added around v22 alongside date grounding; the rest of the schema has never changed. Downstream code (the graph writer, the BM25 indexer, the aggregation tables) depends on it. Changing the shape would be a migration, not a prompt change.
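In type terms the frozen contract is tiny. A sketch (typing.NotRequired needs Python 3.11+):

from typing import TypedDict, NotRequired

class Triple(TypedDict):
    subject: str                 # unchanged since v1
    predicate: str               # unchanged since v1
    object: str                  # unchanged since v1
    timestamp: NotRequired[str]  # YYYY-MM-DD; added around v22 with date grounding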

The model behind the prompt has changed twice. The first version ran on llama-3.1-70b on Groq. v6 through v20 ran on llama-3.3-70b. v22 onward has run on a mix: llama-3.3-70b for the default tenant, GPT-5.4-mini for tenants who've explicitly opted into the OpenAI extractor, and a few customers running fully BYOK against their own provider. The prompt has stayed essentially the same across both swaps and all three configurations, which is itself a small piece of evidence that most of the work is being done by the constraints rather than by anything model-specific.

That said: each model swap caused a small dip in extraction quality on the same prompt before it caused an improvement. Prompts that worked perfectly on llama-3.1-70b would occasionally re-introduce hallucination on llama-3.3-70b until we re-ran the eval and tightened a rule. The newer model wasn't strictly better on every axis. It was better on most axes and slightly worse on a few. Re-test on every model upgrade.

A prompt is a living artifact

We're at v28 of an extraction prompt that started as nine words. We expect to be at v35 in another year. We still see new failure modes every couple of weeks. Each one is a candidate for a new rule, a tightening of an existing rule, or (occasionally) the deletion of a rule that's stopped earning its keep.

The thing we keep coming back to: almost none of the fixes were of the form "make the model smarter." They were of the form "remove the ambiguity in the request." The model was already capable of extracting triples in v1. It just didn't know which triples we wanted, in which shape, with which constraints. Each version of the prompt is a record of which failures have been seen and addressed, not a record of what a thoughtful person believes the model should be told. Those are different artifacts. The first kind is short, ugly, and earns its keep. If you're shipping a memory product or anything else that depends on an LLM doing the same thing the same way over a long horizon, expect the prompt to be doing a lot more of the work than the model.
