# Narrative memory (Summarization)
Your support agent is on turn 47 of a single conversation. The user is debugging an integration issue and the back-and-forth has been going for 90 minutes. The context window is filling up with detail that’s no longer relevant; the LLM is one tool call away from losing the thread. The fix isn’t a bigger context window. It’s compressing the older turns into a summary so the LLM keeps the beat of the conversation without the noise.
## What summarization memory is
`defineMemory({ type: EPISODIC, strategy: { kind: SUMMARIZE, recent, llm } })` keeps the last N turns raw, runs older turns through a cheap LLM summarizer, and persists the summary as a “beat” for future reads.
The `recent` parameter sets the boundary. Turns inside the window are stored as raw messages; turns past it are compressed into one summary per write batch. On reads, the messages slot gets `[summary of older turns] + [last N raw turns]`.
Cheaper than infinite raw retention. More semantic than truncation.
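For intuition, here is a sketch of what a read might assemble, assuming `recent: 2` on a longer conversation. The message shapes and contents are illustrative, not the library’s actual types:

```ts
// Illustrative only: one compressed "beat" followed by the last two raw turns.
const messages = [
  {
    role: 'system',
    content: 'Summary of turns 1-45: user is debugging a webhook 401; rotated keys; narrowed to clock skew.',
  },
  { role: 'user', content: 'Retried after syncing the server clock, still 401.' },
  { role: 'assistant', content: 'Then the audience claim is the next suspect; print the decoded token.' },
];
```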
## Define a summarizing memory
```ts
const memory = defineMemory({
  id: 'long-chat',
  type: MEMORY_TYPES.EPISODIC,
  strategy: {
    kind: MEMORY_STRATEGIES.SUMMARIZE,
    recent: 6,       // keep last 6 turns raw, summarize older
    llm: summarizer, // dedicated cheap model for compression
  },
  store,
});
```

The `llm` parameter is the dedicated summarizer model. Use a cheap one: Haiku, GPT-4o-mini, or any local model. The summarizer doesn’t need to be smart; it needs to be fast and cheap because it runs on every memory write.
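What `summarizer` looks like depends on your model client; this page doesn’t pin down the interface the `llm` slot expects. As a hedged sketch, assuming a single prompt-in, text-out function (the endpoint, model name, and response shape below are placeholders):

```ts
// Sketch only: assumes the llm slot accepts an async (prompt) => text function.
const summarizer = async (prompt: string): Promise<string> => {
  const res = await fetch('https://api.example.com/v1/complete', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ model: 'cheap-small-model', prompt, max_tokens: 256 }),
  });
  const { text } = await res.json();
  return text;
};
```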
## How summarization fires
The write subflow runs after every successful turn. Each invocation (sketched in code after this list):

- Loads the current entries from the store
- If count > `recent`, takes the oldest `count - recent` turns
- Sends them to the summarizer LLM with a fixed compression prompt
- Replaces the older turns with one summary entry tagged `kind: 'summary'`
- Keeps the `recent` turns unchanged
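A minimal sketch of that loop, assuming simple store and entry shapes (the real types belong to the library; only the batching logic mirrors the steps above):

```ts
// Sketch: Entry and Store shapes are assumptions, not the library's types.
type Entry = { kind: 'turn' | 'summary'; content: string };
type Store = { load(): Promise<Entry[]>; save(entries: Entry[]): Promise<void> };

async function writeSubflow(
  store: Store,
  summarize: (text: string) => Promise<string>,
  recent: number,
  newTurn: Entry,
): Promise<void> {
  const entries = [...(await store.load()), newTurn];
  if (entries.length <= recent) {
    await store.save(entries); // under the boundary: everything stays raw
    return;
  }
  // Oldest (count - recent) entries, including any prior summary, get compressed.
  const older = entries.slice(0, entries.length - recent);
  const kept = entries.slice(entries.length - recent);
  const summary = await summarize(older.map((e) => e.content).join('\n'));
  await store.save([{ kind: 'summary', content: summary }, ...kept]);
}
```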
The summary is itself an entry in the store, so future writes summarize the previous summary plus new older turns into a fresh summary. The conversation beats get progressively coarser the further back you look, exactly like human episodic memory.
## When to use this vs window vs extract
| Situation | Strategy |
|---|---|
| Short chats (under window size) | `WINDOW` (no compression needed) |
| Long chats where the beat matters more than the detail | `SUMMARIZE` (this guide) |
| Long chats where you want structured facts, not narrative | `EXTRACT` |
| Cross-run “what did we settle on last time?” | `NARRATIVE` type (different shape) or `CAUSAL` type (decision evidence) |
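As a rough sketch of how these choices surface in config, assuming each strategy is selected by the same `strategy.kind` discriminator as the `SUMMARIZE` example above (their strategy-specific options aren’t defined on this page, so they’re omitted):

```ts
// Assumption: strategy.kind selects the strategy; extra options not shown
// because this page doesn't define them. `store` is the same as earlier.
const windowed = defineMemory({
  id: 'short-chat',
  type: MEMORY_TYPES.EPISODIC,
  strategy: { kind: MEMORY_STRATEGIES.WINDOW },
  store,
});

const facts = defineMemory({
  id: 'user-facts',
  type: MEMORY_TYPES.SEMANTIC,
  strategy: { kind: MEMORY_STRATEGIES.EXTRACT },
  store,
});
```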
## Multi-tenant identity scoping
Every read and write scopes by `MemoryIdentity`. A summary belongs to one `(tenant, principal, conversationId)` tuple, so there is no cross-tenant leakage even if the same store backs many tenants.
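To make the scoping concrete, here is an illustrative key scheme; the actual `MemoryIdentity` shape and storage layout are the library’s and may differ:

```ts
// Illustrative only: one storage key per (tenant, principal, conversationId).
// Reads and writes built on this key can never touch another tenant's rows.
interface MemoryIdentity {
  tenant: string;
  principal: string;
  conversationId: string;
}

function scopeKey(id: MemoryIdentity): string {
  return `${id.tenant}:${id.principal}:${id.conversationId}`;
}
```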
## Anti-patterns
- Don’t use the production model as the summarizer. Cost will dominate. Use a cheap model dedicated to compression.
- Don’t summarize every turn. The library already batches; summarization fires only when the over-`recent` count justifies it.
- Don’t summarize facts. Use `SEMANTIC` × `EXTRACT` for facts. `SUMMARIZE` compresses narrative flow; `EXTRACT` distills data.
## Next steps
- Memory guide — the full type × strategy matrix
- Fact extraction — the structured-data alternative to summarization
- Auto memory (hybrid) — stack `SUMMARIZE` alongside other layers