Free
$0forever
- 10K memories
- 50K retrievals / mo
Durable, explainable memory for AI agents over MCP.
Bring your own model. Works with Claude Code, Cursor, ChatGPT, and more.
1const result = await engram.query_memory({
2 question: "who is leading the payments migration?",
3 bucket: "acme-eng",
4});
5
6// → synthesized answer + the graph facts that backed it
7{
8 answer: "Alice owns the payments migration; Sam and Priya pair on reconciliation.",
9 memories_found: 2,
10 graph_facts: [
11 { subject: "Alice", predicate: "owns", object: "payments migration" },
12 { subject: "Sam", predicate: "pairs_with", object: "Priya" },
13 ],
14} Agents remember conversations, facts, and decisions — and can show why results were chosen.
Connect via MCP or HTTP in minutes. Works with popular IDEs and agent runtimes.
Manage retention and cleanup with simple tools — no extra plumbing required.
Connect Engram to your existing AI agents over MCP
Three steps to memory in your agent
Authorization: Bearer <api-key>with the API key from your portal.Two ways to install; pick one.
Option A: Plugin marketplace (recommended)
One-line install. Adds the Engram MCP server plus /engram-remember and /engram-recall slash commands. No JSON file editing.
/plugin marketplace add lumetra-io/engram-claude-plugin/plugin install engram@lumetraOption B: Manual CLI install
Skips the slash commands but wires the MCP server directly. Includes the required Authorization header.
claude mcp add-json engram '{"type":"sse","url":"https://mcp.lumetra.io/mcp/sse","headers":{"Authorization":"Bearer <api-key>"}}'A vector database stores embeddings and returns the chunks closest in cosine space. Engram is the memory product on top of that.
It runs three retrieval engines in parallel (BM25 for exact lexical recall, vectors for paraphrase, a knowledge graph for relational queries), fuses them with reciprocal rank fusion, reranks with a cross-encoder, and returns an explanation trace alongside the answer.
A vector DB is one of the engines; Engram is the system you point your agent at.
Every recall comes back with a trace: which engines fired, what they matched on, the graph facts that contributed, the score each retrieved memory got, and the canonical profile the composer saw.
When the agent answers wrong, the trace tells you whether retrieval missed the right memory or the composer misread what it was given. "cosine: 0.87" is not a debug log entry. "BM25 matched on 'invoice'; graph linked it to the customer entity in the question" is.
Three places worth checking.
On pricing, a managed-inference vendor bills you for tokens your LLM provider already charged for; Engram is bring-your-own-model end-to-end and never sits in your inference invoice.
On surface area, Engram is MCP-native (config-file install on any compliant client) and exposes a REST API for everything else, so adding it to a new client is a few lines of config rather than an integration project.
On transparency, the composer prompt that produced our 91.6% on LongMemEval-S is published under MIT, the benchmark methodology is documented end to end, and you can reproduce the score on your own retrieval stack.
Compare like for like (the architecture, the prompt, the judge) and the differences become specific instead of vibes.
Three independent properties, and most "BYOK" claims only deliver one or two.
Credentials: your model key is encrypted at rest with AES-256-GCM under a master key the application database has no read access to.
Billing: your LLM provider invoices your account directly. We never sit in the dollar path or take a margin.
Inference path: our servers do orchestrate the calls, but we never log prompt bodies and never use customer content for training.
The full breakdown is in the BYOK post; the short version is that your inference bill stays on your provider's dashboard, where you already track it.
Per operation, not per product.
A read-only authenticated call (e.g., listing buckets) lands under 50ms p95 once auth is HMAC-not-bcrypt. An MCP tool call with no LLM (list_memories, list_buckets) is under 200ms p95. A memory write with extraction is 1.5–2s p95, bounded by your extractor model, not by anything we control. A query with synthesis is 3–5s p95, bounded by your composer.
The retrieval itself (hybrid fusion across three engines, rerank) is single-digit percent of any query's wall time.
Dedup runs as a database invariant at write time: we hash normalized content and lean on a partial unique index, so duplicate INSERTs fail atomically rather than racing.
Paraphrases get caught by a secondary embedding check at 0.95 cosine, and an LLM conflict-resolver decides supersede, merge, or keep-both when content actually contradicts ("moved to Berlin" arriving after "lives in Lisbon").
A profile pass after ingest merges co-referent entities at the conversation level. "My college roommate's wedding" and "Emily's wedding in the city" collapse into one entry with both aliases recorded.
Every memory carries four scoping axes: tenant_id, bucket_id, and optional agent_id and run_id.
Buckets are the user-facing namespace; make as many as you need, because they are never a pricing dimension. Multi-agent tenants can share a bucket without polluting each other's retrieval, because the agent and run filters live behind partial indexes that don't cost anything when nullable.
Cross-tenant isolation is enforced in every retrieval query as the leading filter, with foreign-key cascades on tenant deletion.
A single delete cascades through every derived index in one transaction:
The memory does not come back as a ghost recall a week later.
This is the durability promise most memory systems quietly fail (delete the row, miss one derived index, watch the memory resurface). We treat full cascade as table stakes, not a feature.
Both.
The MCP layer at mcp.lumetra.io is the recommended path for clients that already speak MCP (Claude Code, Cursor, Windsurf, ChatGPT Connectors, OpenCode, OpenClaw), and it's the path with the fewest moving parts: a config-file install.
For anything else (your own Next.js or Python app, a CRON job, a custom orchestrator), there's a REST API at api.lumetra.io with the usual endpoints: /v1/buckets, /v1/buckets/{id}/memories, /v1/query, /v1/buckets/{id}/profile/regenerate. The MCP server is a thin wrapper around that REST API, not a separate product.
Hosted is the default and what most customers run.
Enterprise customers can request VPC or on-prem deployment: same server, same encrypted-at-rest credential storage, same audit-log surface that runs in our cloud.
A self-serve self-host tier is on the roadmap. If you need on-prem before then, get in touch.
Primary references behind the benchmarks, protocol, and competitor claims you'll see across this site. Verified May 13, 2026.