# Narrative memory (Summarization)
Your support agent is on turn 47 of a single conversation. The user is debugging an integration issue and the back-and-forth has been going for 90 minutes. The context window is filling up with detail that’s no longer relevant; the LLM is one tool call away from losing the thread. The fix isn’t a bigger context window. It’s compressing the older turns into a summary so the LLM keeps the beat of the conversation without the noise.
## What summarization memory is
`defineMemory({ type: EPISODIC, strategy: { kind: SUMMARIZE, recent, llm } })` keeps the last N turns raw, runs older turns through a cheap LLM summarizer, and persists the summary as a “beat” for future reads.
The `recent` parameter sets the boundary. Turns inside the window are stored as raw messages; turns past it are compressed into one summary per write batch. On reads, the messages slot gets `[summary of older turns] + [last N raw turns]`.
Cheaper than infinite raw retention. More semantic than truncation.
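For intuition, here is a sketch of what a read might assemble, assuming `recent: 2` on a longer conversation. The message shapes and contents are illustrative, not the library’s actual types:

```ts
// Illustrative only: one compressed "beat" followed by the last two raw turns.
const messages = [
  {
    role: 'system',
    content: 'Summary of turns 1-45: user is debugging a webhook 401; rotated keys; narrowed to clock skew.',
  },
  { role: 'user', content: 'Retried after syncing the server clock, still 401.' },
  { role: 'assistant', content: 'Then the audience claim is the next suspect; print the decoded token.' },
];
```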
## Define a summarizing memory
```ts
const memory = defineMemory({
  id: 'long-chat',
  type: MEMORY_TYPES.EPISODIC,
  strategy: {
    kind: MEMORY_STRATEGIES.SUMMARIZE,
    recent: 6,       // keep last 6 turns raw, summarize older
    llm: summarizer, // dedicated cheap model for compression
  },
  store,
});
```

The `llm` parameter is the dedicated summarizer model. Use a cheap one: Haiku, GPT-4o-mini, or any local model. The summarizer doesn’t need to be smart; it needs to be fast and cheap because it runs on every memory write.
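What `summarizer` looks like depends on your model client; this page doesn’t pin down the interface the `llm` slot expects. As a hedged sketch, assuming a single prompt-in, text-out function (the endpoint, model name, and response shape below are placeholders):

```ts
// Sketch only: assumes the llm slot accepts an async (prompt) => text function.
const summarizer = async (prompt: string): Promise<string> => {
  const res = await fetch('https://api.example.com/v1/complete', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ model: 'cheap-small-model', prompt, max_tokens: 256 }),
  });
  const { text } = await res.json();
  return text;
};
```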
## How summarization fires
The write subflow runs after every successful turn. Each invocation (sketched in code after this list):

- Loads the current entries from the store
- If count > `recent`, takes the oldest `count - recent` turns
- Sends them to the summarizer LLM with a fixed compression prompt
- Replaces the older turns with one summary entry tagged `kind: 'summary'`
- Keeps the `recent` turns unchanged
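A minimal sketch of that loop, assuming simple store and entry shapes (the real types belong to the library; only the batching logic mirrors the steps above):

```ts
// Sketch: Entry and Store shapes are assumptions, not the library's types.
type Entry = { kind: 'turn' | 'summary'; content: string };
type Store = { load(): Promise<Entry[]>; save(entries: Entry[]): Promise<void> };

async function writeSubflow(
  store: Store,
  summarize: (text: string) => Promise<string>,
  recent: number,
  newTurn: Entry,
): Promise<void> {
  const entries = [...(await store.load()), newTurn];
  if (entries.length <= recent) {
    await store.save(entries); // under the boundary: everything stays raw
    return;
  }
  // Oldest (count - recent) entries, including any prior summary, get compressed.
  const older = entries.slice(0, entries.length - recent);
  const kept = entries.slice(entries.length - recent);
  const summary = await summarize(older.map((e) => e.content).join('\n'));
  await store.save([{ kind: 'summary', content: summary }, ...kept]);
}
```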
The summary is itself an entry in the store, so future writes summarize the previous summary plus new older turns into a fresh summary. The conversation beats get progressively coarser the further back you look, exactly like human episodic memory.
## When to use this vs window vs extract
| Situation | Strategy |
|---|---|
| Short chats (under window size) | `WINDOW` (no compression needed) |
| Long chats where the beat matters more than the detail | `SUMMARIZE` (this guide) |
| Long chats where you want structured facts, not narrative | `EXTRACT` |
| Cross-run “what did we settle on last time?” | `NARRATIVE` type (different shape) or `CAUSAL` type (decision evidence) |
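As a rough sketch of how these choices surface in config, assuming each strategy is selected by the same `strategy.kind` discriminator as the `SUMMARIZE` example above (their strategy-specific options aren’t defined on this page, so they’re omitted):

```ts
// Assumption: strategy.kind selects the strategy; extra options not shown
// because this page doesn't define them. `store` is the same as earlier.
const windowed = defineMemory({
  id: 'short-chat',
  type: MEMORY_TYPES.EPISODIC,
  strategy: { kind: MEMORY_STRATEGIES.WINDOW },
  store,
});

const facts = defineMemory({
  id: 'user-facts',
  type: MEMORY_TYPES.SEMANTIC,
  strategy: { kind: MEMORY_STRATEGIES.EXTRACT },
  store,
});
```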
## Multi-tenant identity scoping
Every read and write scopes by `MemoryIdentity`. A summary belongs to one `(tenant, principal, conversationId)` tuple, so there is no cross-tenant leakage even if the same store backs many tenants.
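To make the scoping concrete, here is an illustrative key scheme; the actual `MemoryIdentity` shape and storage layout are the library’s and may differ:

```ts
// Illustrative only: one storage key per (tenant, principal, conversationId).
// Reads and writes built on this key can never touch another tenant's rows.
interface MemoryIdentity {
  tenant: string;
  principal: string;
  conversationId: string;
}

function scopeKey(id: MemoryIdentity): string {
  return `${id.tenant}:${id.principal}:${id.conversationId}`;
}
```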
## Anti-patterns
- Don’t use the production model as the summarizer. Cost will dominate. Use a cheap model dedicated to compression.
- Don’t summarize every turn. The library already batches; summarization fires only when the over-`recent` count justifies it.
- Don’t summarize facts. Use `SEMANTIC` × `EXTRACT` for facts. `SUMMARIZE` compresses narrative flow; `EXTRACT` distills data.
## Next steps
- Memory guide — the full type × strategy matrix
- Fact extraction — the structured-data alternative to summarization
- Auto memory (hybrid) — stack `SUMMARIZE` alongside other layers