
Narrative memory (Summarization)

Compress long conversations into beats. Older turns are LLM-summarized into a shorter narrative; recent turns stay raw. Trades one cheap LLM call on each read that crosses the trigger threshold for a smaller prompt on the main model.

Your support agent is on turn 47 of a single conversation. The user is debugging an integration issue and the back-and-forth has been going for 90 minutes. The context window is filling up with detail that's no longer relevant — the LLM is one tool call away from losing the thread. The fix isn't a bigger context window — it's compressing the older turns into a summary so the LLM keeps the BEAT of the conversation without the noise.

What summarization memory is

defineMemory({ type: EPISODIC, strategy: { kind: SUMMARIZE, recent, llm } }) — the long-conversation strategy. The recent parameter sets the boundary: the goal is to keep the last N turns raw and compress older turns into one cheap-LLM summary, so the messages slot gets [summary of older turns] + [last N raw turns].

Cheaper than infinite raw retention. More semantic than truncation.

Current shipped behavior

In the published release, the SUMMARIZE strategy compiles to the same read pipeline as WINDOW with loadCount: recent. It loads only the last N turns (N being the recent value), and defineMemory does not yet invoke the llm summarizer automatically. The wiring that compresses older turns is planned. To run real summarization today, compose the summarize read-side stage yourself via mountMemoryRead (see Compose it manually).
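To make the shipped behavior concrete, here is a minimal sketch of what a window-style load amounts to. The loadWindow helper is hypothetical and written only for illustration; it is not the library's loadRecent stage.

```typescript
// Hypothetical helper: keep only the last `loadCount` turns, oldest-first.
// This mirrors the described WINDOW behavior; older turns are dropped, not summarized.
function loadWindow<T>(turns: T[], loadCount: number): T[] {
  // slice(-0) would return the whole array, so guard the zero case explicitly.
  return loadCount > 0 ? turns.slice(-loadCount) : [];
}

// Ten turns with recent = 6: only "turn 5" through "turn 10" reach the prompt.
const turns = Array.from({ length: 10 }, (_, i) => `turn ${i + 1}`);
const loaded = loadWindow(turns, 6);
console.log(loaded);
```

The point of the sketch is that, until the planned wiring lands, turns older than the window contribute nothing to the prompt, not even a summary.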

Define a summarizing memory

    const memory = defineMemory({
      id: 'long-chat',
      type: MEMORY_TYPES.EPISODIC,
      strategy: {
        kind: MEMORY_STRATEGIES.SUMMARIZE,
        recent: 6,        // keep last 6 turns raw, summarize older
        llm: summarizer,  // dedicated cheap model for compression
      },
      store,
    });

The llm parameter is the dedicated summarizer model. Use a cheap one: Haiku, GPT-4o-mini, or any local model. The summarizer doesn't need to be smart; it needs to be FAST and CHEAP, because it runs on every read that crosses the trigger threshold.

How summarization fires

The summarize stage runs on the READ side, inside the load pipeline (loadRecent → summarize → pickByBudget → formatDefault). When the read subflow runs, the stage:

  1. Reads the loaded entries (oldest-first) from scope.loaded
  2. No-ops if loaded.length < triggerMinEntries (default 20) or loaded.length <= preserveRecent (default 5) — nothing worth compressing
  3. Otherwise takes the OLDEST (count - preserveRecent) entries and sends them to the summarizer llm with a fixed compression prompt
  4. Replaces that older range with ONE synthetic summary entry — id: summary-{earliestTurn}-to-{latestTurn}, a { role: 'system', content } message, tier: 'cold'
  5. Keeps the preserveRecent most-recent turns verbatim

The synthetic summary is in-memory for that read only — it shapes what gets injected into the messages slot, not what is persisted. The conversation BEATS get progressively coarser the further back you look — exactly like human episodic memory.
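The five steps above can be sketched as a single function. This is an illustrative approximation, not the library's implementation: the Entry and Message shapes, the summarizeLoaded name, and the placeholder prompt text are all assumptions. Only the defaults (20 and 5), the summary id format, the system role, and the cold tier come from the description above.

```typescript
// Assumed shapes for illustration; the library's real entry type may differ.
interface Message { role: 'system' | 'user' | 'assistant'; content: string }
interface Entry { id: string; turn: number; message: Message; tier?: 'hot' | 'cold' }
type SummarizerLlm = (messages: Message[]) => Promise<string>;

interface SummarizeSketchConfig {
  llm: SummarizerLlm;
  triggerMinEntries?: number; // default 20
  preserveRecent?: number;    // default 5
  systemPrompt?: string;      // placeholder wording below, not the library's prompt
}

async function summarizeLoaded(loaded: Entry[], config: SummarizeSketchConfig): Promise<Entry[]> {
  const triggerMinEntries = config.triggerMinEntries ?? 20;
  const preserveRecent = config.preserveRecent ?? 5;
  const systemPrompt = config.systemPrompt ?? 'Summarize the conversation so far.';

  // Step 2: no-op when there is nothing worth compressing.
  if (loaded.length < triggerMinEntries || loaded.length <= preserveRecent) {
    return loaded;
  }

  // Step 3: the oldest (count - preserveRecent) entries go to the summarizer.
  const cut = loaded.length - preserveRecent;
  const older = loaded.slice(0, cut);
  const recent = loaded.slice(cut);
  const content = await config.llm([
    { role: 'system', content: systemPrompt },
    ...older.map((entry) => entry.message),
  ]);

  // Step 4: one synthetic summary entry replaces the older range.
  const first = older[0]!;
  const last = older[older.length - 1]!;
  const summary: Entry = {
    id: `summary-${first.turn}-to-${last.turn}`,
    turn: last.turn,
    message: { role: 'system', content },
    tier: 'cold',
  };

  // Step 5: the most recent turns stay verbatim, after the summary.
  return [summary, ...recent];
}
```

Because the function returns a new array and never writes back to a store, it matches the read-only behavior described above: the persisted turns are untouched and the summary exists only for that read.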

Compose it manually (today)

Until defineMemory wires the summarizer automatically, compose the summarize stage yourself with the read-side mount helper:

import { mountMemoryRead, loadRecent, summarize, formatDefault } from 'agentfootprint/memory';

// summarize() takes a SummarizeConfig: { llm, triggerMinEntries?, preserveRecent?, systemPrompt? }
// where llm is an (messages) => Promise<string> callback.

summarize, its SummarizeConfig type, and the mountMemoryRead/mountMemoryPipeline helpers are all exported from agentfootprint/memory.
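The llm callback shape noted above, (messages) => Promise<string>, can be satisfied by a thin adapter around whatever client you already have. The sketch below is self-contained and uses a stub in place of a real model. The ChatMessage shape, makeSummarizer, and fakeComplete are hypothetical names introduced for illustration and are not part of agentfootprint.

```typescript
// Assumed message shape; check the library's own types before relying on it.
interface ChatMessage { role: 'system' | 'user' | 'assistant'; content: string }
type SummarizeCallback = (messages: ChatMessage[]) => Promise<string>;

// Hypothetical adapter: wraps any "prompt in, text out" client as a summarizer callback.
function makeSummarizer(complete: (prompt: string) => Promise<string>): SummarizeCallback {
  return async (messages) => {
    // Flatten the turns into a plain transcript, one line per message.
    const transcript = messages.map((m) => `${m.role}: ${m.content}`).join('\n');
    return complete(transcript);
  };
}

// Stub client so the sketch runs without network access; swap in a cheap model call.
const fakeComplete = async (prompt: string): Promise<string> =>
  `summary of ${prompt.split('\n').length} lines`;

const summarizer = makeSummarizer(fakeComplete);
```

A callback built this way is what you would pass as the llm field of the SummarizeConfig described above, alongside any triggerMinEntries or preserveRecent overrides.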

When to use this vs window vs extract

Situation | Strategy
--- | ---
Short chats (under window size) | WINDOW (no compression needed)
Long chats where the BEAT matters more than the detail | SUMMARIZE (this guide)
Long chats where you want STRUCTURED facts, not narrative | EXTRACT
Cross-run "what did we settle on last time?" | NARRATIVE type (different shape) or CAUSAL type (decision evidence)

Multi-tenant identity scoping

Every read and write scopes by MemoryIdentity. A summary belongs to one (tenant, principal, conversationId) tuple — no cross-tenant leakage even if the same store backs many tenants.
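One way to picture identity scoping is a store that derives its bucket key from the full identity tuple. The sketch below is an assumption-laden illustration: MemoryIdentitySketch, scopeKey, and ScopedStoreSketch are hypothetical names, and the field names are inferred from the (tenant, principal, conversationId) tuple rather than taken from the library's MemoryIdentity type.

```typescript
// Assumed fields, inferred from the tuple described above.
interface MemoryIdentitySketch {
  tenant: string;
  principal: string;
  conversationId: string;
}

// JSON-encode the tuple so a delimiter inside one field cannot collide with another identity.
function scopeKey(id: MemoryIdentitySketch): string {
  return JSON.stringify([id.tenant, id.principal, id.conversationId]);
}

// Hypothetical in-memory store: every read and write goes through scopeKey.
class ScopedStoreSketch<T> {
  private readonly buckets = new Map<string, T[]>();

  append(id: MemoryIdentitySketch, entry: T): void {
    const key = scopeKey(id);
    const bucket = this.buckets.get(key) ?? [];
    bucket.push(entry);
    this.buckets.set(key, bucket);
  }

  load(id: MemoryIdentitySketch): T[] {
    // Return a copy so callers cannot mutate another read's view.
    return [...(this.buckets.get(scopeKey(id)) ?? [])];
  }
}
```

With this shape, two tenants that happen to reuse the same conversationId still land in different buckets, which is the isolation property the section describes.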

Anti-patterns

  • Don't use the production model as the summarizer. Cost will dominate. Use a cheap model dedicated to compression.
  • Don't summarize every turn. The stage gates itself: it no-ops until loaded.length reaches triggerMinEntries, so short conversations stay verbatim and pay nothing.
  • Don't summarize facts. Use SEMANTIC × EXTRACT for facts. SUMMARIZE compresses NARRATIVE flow; EXTRACT distills DATA.
