Tutorial
Engram + Vercel AI SDK: memory-aware chat in a Next.js app
A working integration guide. We wire Engram's HTTP API into a Next.js App Router project using the Vercel AI SDK, define three tools the model can call, and write the system prompt that makes the agent actually use them. By the end, you can close the tab, come back, and the agent still remembers what you told it.
The Vercel AI SDK gives you a tidy way to stream model responses and call tools from a Next.js route handler. What it doesn't give you is durable memory across sessions. Once the request ends, your agent forgets everything. This post adds that layer.
The integration is small: install @lumetra/engram for the transport, define three tools, write one route handler, and one system prompt. The bulk of the work is the prompt. Getting the model to call query_memory before it answers and store_memory after it learns something durable is what makes or breaks the feature, and we'll spend the most time there.
Everything in this post targets Next.js 14+ with the App Router and the ai package version 4. We use the OpenAI provider for concreteness; swap in any other AI SDK provider if you'd rather route Anthropic or Groq.
The full working example from this post lives at engram-js/examples/vercel-ai-sdk — clone, paste in two API keys, and run.
What you need before starting
Concrete prerequisites:
- A Next.js project on the App Router. create-next-app output works as-is.
- npm install ai @ai-sdk/openai zod @lumetra/engram. The ai package gives you streamText and tool; the OpenAI provider plugs in as the model; @lumetra/engram is the official TypeScript client for the memory API.
- An Engram account with BYOK configured. Engram is bring-your-own-key: extraction and query LLM calls go through your provider, not ours. Configure that at the /models page in the Engram dashboard before your first request, or every store/query call will return 412.
- An Engram API key from the dashboard. It looks like eng_live_....
- An OpenAI (or other provider) API key for the chat model itself. This is separate from the key Engram uses internally.
The two model keys do different jobs. OPENAI_API_KEY is what the Vercel AI SDK uses to run the chat completion in your route handler. The key you paste into the Engram dashboard is what Engram uses to run its own internal extraction and embedding calls when you POST a memory. Two separate billing relationships, both BYOK.
Environment variables
Three variables in .env.local:
OPENAI_API_KEY=sk-...
ENGRAM_API_KEY=eng_live_...
ENGRAM_BASE_URL=https://api.lumetra.io
Notice none of these are prefixed with NEXT_PUBLIC_. We never want any of these in client code. The Engram API key in particular grants full access to every memory in your tenant. Leaking it to the browser is the integration failure mode you most want to avoid. All Engram calls happen server-side, inside the route handler.
If you're tempted to skip the server-side hop and call api.lumetra.io directly from a React component, don't. The cost of a serverless function invocation is negligible next to the model call you're already making, and the route handler is where you'll later want to add per-user authentication and rate limiting anyway.
The Engram client
The official client lives on npm at @lumetra/engram (source: lumetra-io/engram-js). Zero runtime dependencies, ESM + CommonJS, full typings, works on Node 18+, Bun, Deno, and Vercel's edge runtime. Drop this at lib/engram.ts:
// lib/engram.ts
import { EngramClient } from '@lumetra/engram';
export const engram = new EngramClient({
apiKey: process.env.ENGRAM_API_KEY!,
baseUrl: process.env.ENGRAM_BASE_URL, // optional, defaults to https://api.lumetra.io
});
The methods we'll use in the tools: storeMemory(content, bucket?), query(question, { buckets, topK, returnExplanation }), and listMemories(bucket?, { limit }). The bucket lives in the URL for store and list, but query takes a buckets array so you can fuse across multiple buckets in one call. The query response is { answer, explanation: { retrieved_memories, profile, graph_facts }, usage }. For a chat app you'll mostly read answer and explanation.retrieved_memories. If you don't want the server-side synthesis (because you're composing the answer yourself with your own model), pass skipSynthesis: true and you'll get retrieval results without the LLM call on our side.
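If you go the retrieval-only route, here is a minimal sketch, assuming the option and field names behave as described above:
// Retrieval-only query (sketch): skip Engram's answer synthesis and compose the
// answer yourself with your own model.
const result = await engram.query('What stack does the user prefer?', {
  buckets: ['default'],
  topK: 8,
  returnExplanation: true,
  skipSynthesis: true, // no synthesis LLM call on Engram's side
});
const facts = result.explanation?.retrieved_memories?.map((m) => m.content) ?? [];
// Feed `facts` into your own prompt instead of reading result.answer.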
The client throws EngramError on HTTP failures, with the status code on err.status — useful for the 412 / BYOK handling we cover below.
Tool definitions
The AI SDK exposes a tool() helper that pairs a Zod schema with an execute function. The model sees the description and parameter schema; when it decides to call a tool, the SDK validates the arguments against Zod and invokes execute. Whatever you return becomes part of the conversation the model sees for its next step.
Three tools is the minimum set: one to write, one to read, and one to list. We name and describe them carefully. The model picks tools based mostly on the description, so vague descriptions get vague behavior.
// lib/tools.ts
import { tool } from 'ai';
import { z } from 'zod';
import { engram } from './engram';
export function makeMemoryTools(bucket: string) {
return {
store_memory: tool({
description:
'Save a stable fact about the user, their preferences, or the project. ' +
'Use this after the user shares something durable that will matter in a future conversation. ' +
'Keep each stored fact short and atomic (one concept per call).',
parameters: z.object({
content: z.string().describe('A short, declarative fact. One sentence is ideal.'),
}),
execute: async ({ content }) => {
const result = await engram.storeMemory(content, bucket);
return { stored: true, id: result.id };
},
}),
query_memory: tool({
description:
'Search the user\'s memory for relevant facts before answering. ' +
'Call this whenever the answer might depend on prior context, preferences, ' +
'or anything the user told you in a previous session.',
parameters: z.object({
question: z.string().describe('A natural-language query. Phrase it as a question.'),
}),
execute: async ({ question }) => {
const result = await engram.query(question, {
buckets: [bucket],
topK: 8,
returnExplanation: true,
});
const retrieved = result.explanation?.retrieved_memories ?? [];
return {
answer: result.answer,
memories: retrieved.map((m) => m.content),
};
},
}),
list_memories: tool({
description:
'List recently stored memories. Use this rarely, only when the user explicitly asks ' +
'to see or audit what you have remembered about them.',
parameters: z.object({
limit: z.number().int().min(1).max(50).default(20),
}),
execute: async ({ limit }) => {
const result = await engram.listMemories(bucket, { limit });
return { memories: result.memories.map((m) => ({ id: m.id, content: m.content })) };
},
}),
};
}
A note on the return shapes. We deliberately flatten the responses before handing them back to the model. The raw API returns scores, explanations, and other metadata that the model doesn't need to see; including them just dilutes the signal in the next turn's context. Return the smallest useful payload.
The factory function (makeMemoryTools(bucket)) is what lets us scope memories per user; we'll bind a per-user bucket in the route handler. If you only have one user, hardcode 'default' and skip the factory.
The route handler
The route at app/api/chat/route.ts is where the AI SDK runs. It accepts the chat messages from the client, calls streamText with the tools and the system prompt, and streams the response back. The SDK handles the tool-call loop automatically: if the model emits a tool call, the SDK runs execute, feeds the result back, and continues generating until the model emits a final text response.
// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
import { makeMemoryTools } from '@/lib/tools';
import { SYSTEM_PROMPT } from '@/lib/prompt';
export const runtime = 'nodejs';
export const maxDuration = 60;
export async function POST(req: Request) {
const { messages, userId } = await req.json();
// Per-user bucket. Replace with your real auth/session lookup.
const bucket = userId ? `user-${userId}` : 'default';
const result = await streamText({
model: openai('gpt-5'),
system: SYSTEM_PROMPT,
messages,
tools: makeMemoryTools(bucket),
maxSteps: 5,
});
return result.toDataStreamResponse();
}
maxSteps: 5 is the cap on tool-call rounds within a single user turn. The model can call query_memory, see the result, call store_memory, see that succeeded, and then write a final response. That's three steps. Five gives headroom for chained queries. Setting this too high risks the model calling tools in a loop on a degenerate input; five is a reasonable ceiling for memory tools specifically.
The userId field in the request body is a placeholder. In a real app you'd get the user from your session cookie or auth provider on the server side. Never trust a userId sent from the client unsigned; anyone can spoof it and read another user's memories. Whatever your auth story is, resolve it server-side before constructing the bucket name.
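A sketch of that server-side resolution, with getSession standing in for whatever your auth provider exposes (it is a placeholder, not part of Engram or the AI SDK):
// app/api/chat/route.ts (sketch) — resolve the user on the server, never from the body.
// getSession is a placeholder for your auth provider's server-side session lookup.
export async function POST(req: Request) {
  const session = await getSession(req);
  if (!session) return new Response('Unauthorized', { status: 401 });
  const { messages } = await req.json();
  const bucket = `user-${session.userId}`; // stable internal ID, not an email
  // ...streamText call exactly as above, with makeMemoryTools(bucket)
}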
The system prompt
This is the part that makes the integration go from "tools are wired up" to "the agent actually uses them." Models are not, by default, inclined to call a tool when they could just answer from their parametric knowledge. You have to tell them, explicitly, when to query memory and when to store. The prompt below is adapted from the one we recommend for our MCP integrations, with a few additions specific to the AI SDK setting.
// lib/prompt.ts
export const SYSTEM_PROMPT = `You are a helpful assistant with persistent memory.
You have Engram memory tools. Use them proactively.
Tools:
- store_memory(content) — Save a stable fact about the user or project.
- query_memory(question) — Search memory for prior context.
- list_memories(limit) — Audit what is currently remembered.
Policy:
- Query-first. Before answering any question that may depend on prior context,
preferences, or anything the user told you previously, call query_memory.
Ground your answer in the results.
- Proactive storing. When the user shares a stable fact — a preference, a profile
detail, a project decision, a deadline, an outcome — call store_memory. Do this
on the same turn, before your final response.
- One concept per memory. Each store_memory call should be one short, declarative
sentence. If the user shares three facts, make three calls.
- Don't store ephemera. Skip small talk, jokes, and one-off context that won't
matter next session.
- Don't pre-announce tool use. Just call the tool and answer. The user does not
need to see "let me check my memory…" — they will see the result.
- Trust retrieved memories. If query_memory returns a fact, use it. Do not
second-guess or hedge unless you have a specific reason to.
Style for stored content: short, declarative, atomic.
Examples:
- "User prefers dark mode."
- "User's timezone is US/Eastern."
- "Project Alpha deadline is 2026-10-15."
- "User decided to use Postgres over MySQL for the new service."
If a memory tool returns an error, briefly tell the user "I couldn't reach my
memory right now, but here's what I can answer from this conversation" and
continue without it.`;
A few clauses earn their keep. "Query-first" is the single most important rule. Without it, the model will happily answer from the context window and never recall anything across sessions. "Don't pre-announce tool use" stops the model from emitting "Sure, let me check my memory…" on every turn, which starts to feel mechanical fast. "One concept per memory" matches how Engram's retrieval works best; atomic memories rank cleaner than paragraph-length ones.
The "don't store ephemera" rule is doing more work than it looks like it should. Without it, models tend to store every nicety the user says ("user said hi", "user is in a good mood today"), which dilutes the bucket and hurts retrieval quality over time. Be explicit that not everything is worth remembering.
Bucket strategy for a multi-user app
For anything beyond a personal demo, you need to scope memories per user. Engram's bucket model is the right tool: every memory belongs to a bucket, queries only see their bucket, and there's no cross-bucket leakage. The pattern is simple: bucket name = stable user identifier.
We use user-{userId} in the route handler. The userId should be a UUID or similarly stable internal ID from your auth provider, not an email, which can change, and not a session ID, which is per-login. Once you've named a bucket, you want to be able to write into it again and again across years.
If your app has both per-user and per-project memory (an AI dev tool, say), use compound names: user-{userId}-project-{projectId} for project-scoped facts, user-{userId} for the user's preferences across projects. Pass the right bucket to makeMemoryTools based on which view the user is in. The model doesn't need to know about the bucket at all. It's a server-side concern.
One more pattern: a shared team bucket. If multiple users on the same team should see the same project memory, give them all a team-{teamId} bucket and route writes there for shared context, while keeping personal preferences in user-{userId}. Tools can be wired to write to one bucket and read from a fused list, but for a first pass, stick with one bucket per user.
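When you do outgrow the single bucket, one way to wire the split is to extend the factory to take a write bucket and a list of read buckets; this is a sketch under that assumption (the query method already accepts a buckets array):
// lib/tools.ts (sketch) — shared team context: write personal facts to the user
// bucket, read across the user bucket plus the team bucket in one fused query.
// Imports (tool, z, engram) as in lib/tools.ts above.
export function makeTeamMemoryTools(writeBucket: string, readBuckets: string[]) {
  return {
    store_memory: tool({
      description: 'Save a stable fact about the user or project.',
      parameters: z.object({ content: z.string() }),
      execute: async ({ content }) => {
        const result = await engram.storeMemory(content, writeBucket);
        return { stored: true, id: result.id };
      },
    }),
    query_memory: tool({
      description: 'Search memory for prior context before answering.',
      parameters: z.object({ question: z.string() }),
      execute: async ({ question }) => {
        const result = await engram.query(question, {
          buckets: readBuckets, // e.g. ['user-42', 'team-7']
          topK: 8,
          returnExplanation: true,
        });
        return {
          answer: result.answer,
          memories: result.explanation?.retrieved_memories?.map((m) => m.content) ?? [],
        };
      },
    }),
  };
}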
Production caveats
412 means BYOK isn't configured
This is the one you'll actually hit. Engram returns 412 Precondition Failed on store and query calls when the tenant has no LLM provider key configured. It's by design: Engram is BYOK and won't make a call against our provider account if the customer hasn't set up theirs. But the model never sees the 412. It only sees whatever your tool's execute function returns, which on an unhandled throw is a generic tool-error string. From the agent's perspective, memory just stopped working for no clear reason, and it will hedge or apologize in ways that look like an Engram bug.
Catch it and surface the real cause:
import { EngramError } from '@lumetra/engram';
execute: async ({ content }) => {
try {
const result = await engram.storeMemory(content, bucket);
return { stored: true, id: result.id };
} catch (err) {
if (err instanceof EngramError && err.status === 412) {
return {
stored: false,
error: 'Memory provider is not configured. Visit Engram settings to add an LLM key.',
};
}
return { stored: false, error: 'Could not reach memory service.' };
}
},
Now the model can pass the actionable string to the user ("looks like the memory provider isn't set up yet") instead of silently failing or, worse, hallucinating a confirmation. If you only do one thing from this section, do this.
Other things to keep in mind
Never expose the API key client-side. ENGRAM_API_KEY belongs in .env.local; everything goes through the route handler.
Batched stores exist. POST /v1/buckets/{id}/memories accepts a memories array. Worth wiring up if you see contention, but for most apps one-store-per-call is fine and the model handles it more reliably.
Retry on 5xx, not 4xx. One retry with 250ms backoff covers the network blips; anything more elaborate rarely earns its complexity at this volume.
Tool errors are recoverable. If execute throws, the AI SDK feeds the error back to the model, which usually apologizes and continues without memory. That's generally what you want; for memory specifically, "couldn't recall, here's my best guess" beats failing the stream.
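The 5xx retry above can be a thin wrapper around the client calls. The helper below is not part of @lumetra/engram, just a sketch:
import { EngramError } from '@lumetra/engram';
// Retry once on 5xx / network failures; never retry 4xx — those won't heal on their own.
async function withRetry<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    const status = err instanceof EngramError ? err.status : undefined;
    if (status !== undefined && status < 500) throw err; // 4xx (including 412): fail fast
    await new Promise((resolve) => setTimeout(resolve, 250)); // single 250ms backoff
    return fn();
  }
}
// Usage inside a tool's execute:
// const result = await withRetry(() => engram.storeMemory(content, bucket));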
Streaming while tools execute
The AI SDK streams the model's text response token-by-token. While a tool call is in flight, the stream pauses on the assistant text but emits tool-call and tool-result events that your client UI can render. The Vercel useChat hook handles these events for you. By default, you get a brief "calling tool…" state while execute runs, then text resumes.
For Engram specifically, query_memory typically returns in 200–800ms and store_memory in 400–2000ms (the longer tail comes from the synchronous extraction step). At those latencies the pause is noticeable but not jarring. If you want to mask it entirely, render a subtle indicator ("recalling context…") tied to the tool-call event; the SDK emits these as part of the stream so you can react to them in real time.
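A client-side sketch of that indicator, assuming AI SDK 4's useChat shape where messages carry a toolInvocations array — check your installed version's types, since this surface has shifted between majors:
// app/chat.tsx (sketch) — surface a "recalling context…" hint while a memory tool runs.
'use client';
import { useChat } from 'ai/react';
export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({ api: '/api/chat' });
  const last = messages[messages.length - 1];
  // Assumption: in-flight tool calls show up as toolInvocations without a result yet.
  const recalling = last?.toolInvocations?.some((t) => t.state !== 'result') ?? false;
  return (
    <div>
      {messages.map((m) => (
        <p key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </p>
      ))}
      {recalling && <p>recalling context…</p>}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </div>
  );
}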
One thing not to do: don't await store_memory before responding to the user if the response doesn't depend on the store result. Either let the model call it inline (which the AI SDK will stream around) or fire-and-forget it in a separate non-blocking handler. The user shouldn't wait two seconds for "got it" just because the agent decided to write a memory.
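A hedged sketch of the fire-and-forget variant: on serverless, a plain un-awaited promise can be frozen when the response finishes, so hand it to the platform's background scheduler — waitUntil from @vercel/functions on Vercel, or your platform's equivalent elsewhere.
// Fire-and-forget store (sketch). Inside makeMemoryTools, the store tool's execute becomes:
import { waitUntil } from '@vercel/functions';
execute: async ({ content }) => {
  waitUntil(engram.storeMemory(content, bucket).catch(() => {})); // write completes in the background
  return { stored: true }; // respond immediately; no id, since we didn't wait for one
},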
The failure mode you'll hit first
You'll wire everything up, run the dev server, type "my favorite color is octarine," and get a fluent, friendly response that does not mention storing anything. Open a new tab, ask what your favorite color is, and the agent will say it doesn't know. You'll assume Engram is broken.
It isn't. You forgot the system prompt, or the system prompt isn't being passed to streamText, or there's a typo and only half of it is loading. The model has the tools attached, sees they exist, and decides it doesn't need them because nothing in the conversation told it to. From outside it looks identical to a memory backend failure: no stored facts, no recall, agent acting amnesiac.
The tell is in the response stream itself. If the model never emits a tool-call event, the problem is upstream of Engram. Log the assembled request right before streamText runs and confirm system is the full prompt string, not undefined or an empty default. Once the system prompt is actually attached, the loop closes on the first turn and the integration starts behaving like a memory-aware agent.
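The check itself is a couple of lines dropped just above the streamText call — something like:
// Sanity check (sketch): the system prompt should be a long string, not undefined or ''.
console.log('system prompt chars:', SYSTEM_PROMPT?.length ?? 'MISSING'); // expect several hundred, not 0 or MISSING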
After it's running
Once the loop closes and the tool-call events are firing, open the Engram dashboard and let your agent run a few real conversations while you watch what it actually stores. That's where the rest of the work lives — tightening your system prompt against the noisy or near-duplicate memories you'll see in the first day or two, and at some point deciding whether to layer the profile endpoint (GET /v1/buckets/{id}/profile) on top for users who stick around long enough to accumulate one. The plumbing is done by now. Everything past it is product-specific.
Further reading
Closely related
- Add Engram memory to ChatGPT as a custom connector. Custom MCP connector, OAuth handshake, and the system prompt that turns the tools into something the model actually uses.
- Add Engram memory to Windsurf in three minutes. One mcp_config.json edit, a restart, and a system prompt. End-to-end with the gotchas everyone hits first.
- The memory-aware RAG pipeline that knows when not to retrieve. Three categories of agent turn, a gating signal, and a soft-fail composer for the times the gate gets it wrong.
Engram
- Engram on LongMemEval-S: 91.6%. Full benchmark methodology and what didn't work.
- Engram docs. HTTP API, MCP setup for each client, SDK examples.
- Start with Engram. Free tier, BYOK, MCP-native.