Narrative memory (Summarization)
Compress long conversations into beats. Older turns are LLM-summarized into a shorter narrative; recent turns stay raw. Trades one cheap LLM call, on reads that cross the trigger threshold, for a shorter prompt to the main model.
Your support agent is on turn 47 of a single conversation. The user is debugging an integration issue and the back-and-forth has been going for 90 minutes. The context window is filling up with detail that's no longer relevant — the LLM is one tool call away from losing the thread. The fix isn't a bigger context window — it's compressing the older turns into a summary so the LLM keeps the BEAT of the conversation without the noise.
What summarization memory is
defineMemory({ type: EPISODIC, strategy: { kind: SUMMARIZE, recent, llm } }) — the long-conversation strategy. The recent parameter sets the boundary: the goal is to keep the last N turns raw and compress older turns into one cheap-LLM summary, so the messages slot gets [summary of older turns] + [last N raw turns].
Cheaper than infinite raw retention. More semantic than truncation.
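To make the comparison concrete, here is a minimal sketch of the three prompt shapes built from the same conversation. The `Turn` type, the helper names, and the placeholder summary string are assumptions made up for illustration; none of them come from the library.

```typescript
// Hypothetical turn shape for illustration; not the library's entry type.
type Turn = { role: 'user' | 'assistant' | 'system'; content: string };

// Raw retention: every turn goes into the prompt.
function rawRetention(turns: Turn[]): Turn[] {
  return turns;
}

// Truncation: only the last `recent` turns survive; older context is dropped.
function truncate(turns: Turn[], recent: number): Turn[] {
  return recent > 0 ? turns.slice(-recent) : [];
}

// Summarization: one summary message stands in for the older turns,
// followed by the last `recent` turns verbatim.
function summarized(turns: Turn[], recent: number, summary: string): Turn[] {
  if (turns.length <= recent) return turns; // nothing older to compress
  return [{ role: 'system', content: summary }, ...truncate(turns, recent)];
}

// A 47-turn conversation, like the support scenario above.
const turns: Turn[] = Array.from({ length: 47 }, (_, i): Turn => ({
  role: i % 2 === 0 ? 'user' : 'assistant',
  content: `turn ${i + 1}`,
}));

// Placeholder text; a real summary would come from the summarizer model.
const placeholderSummary = 'Summary of turns 1-41.';

console.log(rawRetention(turns).length); // 47
console.log(truncate(turns, 6).length); // 6
console.log(summarized(turns, 6, placeholderSummary).length); // 7
```

Truncation and summarization inject almost the same number of messages, but only the summarized shape still carries something about turns 1 to 41.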
Current shipped behavior
In the published release, the SUMMARIZE strategy compiles to the same read pipeline as WINDOW with `loadCount: recent`. It loads only the `recent` most-recent turns, and defineMemory does not yet invoke the `llm` summarizer automatically. The wiring that compresses older turns is planned. To run real summarization today, compose the summarize read-side stage yourself via mountMemoryRead (see Compose it manually).
Define a summarizing memory
```typescript
// `summarizer` and `store` are assumed to be defined elsewhere in your setup.
const memory = defineMemory({
  id: 'long-chat',
  type: MEMORY_TYPES.EPISODIC,
  strategy: {
    kind: MEMORY_STRATEGIES.SUMMARIZE,
    recent: 6, // keep last 6 turns raw, summarize older
    llm: summarizer, // dedicated cheap model for compression
  },
  store,
});
```

The llm parameter is the dedicated summarizer model. Use a cheap one: Haiku, GPT-4o-mini, or any local model. The summarizer doesn't need to be smart; it needs to be FAST and CHEAP because it runs on every read that crosses the trigger threshold.
How summarization fires
The summarize stage runs on the READ side, inside the load pipeline (loadRecent → summarize → pickByBudget → formatDefault). When the read subflow runs, the stage:
- Reads the loaded entries (oldest-first) from `scope.loaded`
- No-ops if `loaded.length < triggerMinEntries` (default 20) or `loaded.length <= preserveRecent` (default 5): nothing worth compressing
- Otherwise takes the OLDEST `(count - preserveRecent)` entries and sends them to the summarizer `llm` with a fixed compression prompt
- Replaces that older range with ONE synthetic summary entry: `id: summary-{earliestTurn}-to-{latestTurn}`, a `{ role: 'system', content }` message, `tier: 'cold'`
- Keeps the `preserveRecent` most-recent turns verbatim
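The steps above can be sketched as a single async function. This is an illustration under stated assumptions, not the library's implementation: the `Entry`, `Message`, and `SketchConfig` shapes and the name `summarizeSketch` are hypothetical, and only the defaults (20 and 5), the summary id format, the `system` role, and the `cold` tier are taken from the list.

```typescript
// Hypothetical shapes for illustration; the library's real types may differ.
type Message = { role: 'system' | 'user' | 'assistant'; content: string };
type Entry = { id: string; turn: number; message: Message; tier?: 'hot' | 'cold' };
type Summarizer = (messages: Message[]) => Promise<string>;

interface SketchConfig {
  llm: Summarizer;
  triggerMinEntries?: number; // default 20, per the list above
  preserveRecent?: number; // default 5, per the list above
}

// `loaded` is assumed to be ordered oldest-first.
async function summarizeSketch(loaded: Entry[], config: SketchConfig): Promise<Entry[]> {
  const triggerMinEntries = config.triggerMinEntries ?? 20;
  const preserveRecent = config.preserveRecent ?? 5;

  // Gate: too few entries, or nothing older than the preserved tail.
  if (loaded.length < triggerMinEntries || loaded.length <= preserveRecent) {
    return loaded;
  }

  const splitAt = loaded.length - preserveRecent;
  const older = loaded.slice(0, splitAt);
  const recent = loaded.slice(splitAt);

  // One summarizer call covers the whole older range.
  const content = await config.llm(older.map((e) => e.message));

  const summary: Entry = {
    id: `summary-${older[0].turn}-to-${older[older.length - 1].turn}`,
    turn: older[0].turn,
    message: { role: 'system', content },
    tier: 'cold',
  };

  // The result goes back to the caller only; nothing is written to a store.
  return [summary, ...recent];
}

// Usage with a stand-in summarizer (a real one would call a cheap model).
const fakeLlm: Summarizer = async (messages) => `(${messages.length} older turns compressed)`;

const loaded: Entry[] = Array.from({ length: 24 }, (_, i): Entry => ({
  id: `turn-${i + 1}`,
  turn: i + 1,
  message: { role: i % 2 === 0 ? 'user' : 'assistant', content: `message ${i + 1}` },
}));

summarizeSketch(loaded, { llm: fakeLlm }).then((result) => {
  console.log(result.length); // 6: one summary plus 5 preserved turns
  console.log(result[0].id); // summary-1-to-19
});
```

With 24 loaded entries and the default settings, turns 1 to 19 collapse into one entry and turns 20 to 24 pass through unchanged. A 10-entry conversation would be returned as-is because it is below the trigger.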
The synthetic summary is in-memory for that read only — it shapes what gets injected into the messages slot, not what is persisted. The conversation BEATS get progressively coarser the further back you look — exactly like human episodic memory.
Compose it manually (today)
Until defineMemory wires the summarizer automatically, compose the summarize stage yourself with the read-side mount helper:
```typescript
import { mountMemoryRead, loadRecent, summarize, formatDefault } from 'agentfootprint/memory';

// summarize() takes a SummarizeConfig: { llm, triggerMinEntries?, preserveRecent?, systemPrompt? }
// where llm is an (messages) => Promise<string> callback.
```

summarize, its SummarizeConfig type, and the mountMemoryRead/mountMemoryPipeline helpers are all exported from agentfootprint/memory.
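As a rough sketch of what the `llm` callback could look like, the snippet below builds an `(messages) => Promise<string>` function around an injected transport. The `Message` shape, the `CallModel` type, `makeSummarizer`, and the local `SummarizeConfigSketch` interface are all assumptions for illustration; the interface only mirrors the field names listed in the comment above, and no provider SDK is called.

```typescript
// Assumed message shape; check the library's exported types for the real one.
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

// Local mirror of the fields named in the comment above, for illustration only.
interface SummarizeConfigSketch {
  llm: (messages: Message[]) => Promise<string>;
  triggerMinEntries?: number;
  preserveRecent?: number;
  systemPrompt?: string;
}

// Hypothetical transport: any function that sends a prompt to a cheap model
// and resolves with its text reply. Swap in your provider's client here.
type CallModel = (prompt: string) => Promise<string>;

function makeSummarizer(callModel: CallModel): SummarizeConfigSketch['llm'] {
  return async (messages) => {
    // Flatten the older turns into a plain transcript for the model.
    const transcript = messages.map((m) => `${m.role}: ${m.content}`).join('\n');
    const reply = await callModel(transcript);
    return reply.trim();
  };
}

// Stand-in transport so the sketch runs without network access.
const echoModel: CallModel = async (prompt) =>
  `  ${prompt.split('\n').length} turns summarized  `;

const config: SummarizeConfigSketch = {
  llm: makeSummarizer(echoModel),
  triggerMinEntries: 20,
  preserveRecent: 5,
};
```

An object shaped like `config` is what the comment describes passing to summarize(). The mounting call is left out because the arguments of mountMemoryRead are not covered in this guide.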
When to use this vs window vs extract
| Situation | Strategy |
|---|---|
| Short chats (under window size) | WINDOW (no compression needed) |
| Long chats where the BEAT matters more than the detail | SUMMARIZE (this guide) |
| Long chats where you want STRUCTURED facts, not narrative | EXTRACT |
| Cross-run "what did we settle on last time?" | NARRATIVE type (different shape) or CAUSAL type (decision evidence) |
Multi-tenant identity scoping
Every read and write scopes by MemoryIdentity. A summary belongs to one (tenant, principal, conversationId) tuple — no cross-tenant leakage even if the same store backs many tenants.
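A toy store shows why scoping by the full tuple prevents leakage. The field names on `IdentitySketch` follow the tuple described above, but the library's MemoryIdentity type may name or nest them differently; `scopeKey` and `ScopedStoreSketch` are hypothetical helpers, not library APIs.

```typescript
// Field names follow the tuple described above; the real MemoryIdentity
// type may differ.
type IdentitySketch = { tenant: string; principal: string; conversationId: string };

// JSON-encoding the tuple avoids the collisions a naive string join allows
// (for example "a:b" + "c" versus "a" + "b:c").
function scopeKey(id: IdentitySketch): string {
  return JSON.stringify([id.tenant, id.principal, id.conversationId]);
}

// Toy in-memory store: every read and write goes through the scope key.
class ScopedStoreSketch {
  private readonly buckets = new Map<string, string[]>();

  append(id: IdentitySketch, entry: string): void {
    const key = scopeKey(id);
    const bucket = this.buckets.get(key) ?? [];
    bucket.push(entry);
    this.buckets.set(key, bucket);
  }

  read(id: IdentitySketch): string[] {
    // Return a copy so callers cannot mutate another reader's view.
    return [...(this.buckets.get(scopeKey(id)) ?? [])];
  }
}

const store = new ScopedStoreSketch();
const acme: IdentitySketch = { tenant: 'acme', principal: 'user-1', conversationId: 'c-1' };
const globex: IdentitySketch = { tenant: 'globex', principal: 'user-1', conversationId: 'c-1' };

store.append(acme, 'integration returns 401 on webhook');

console.log(store.read(acme).length); // 1
console.log(store.read(globex).length); // 0: same principal and conversation id, different tenant
```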
Anti-patterns
- Don't use the production model as the summarizer. Cost will dominate. Use a cheap model dedicated to compression.
- Don't summarize every turn. The stage gates itself: it no-ops until `loaded.length` reaches `triggerMinEntries`, so short conversations stay verbatim and pay nothing.
- Don't summarize facts. Use SEMANTIC × EXTRACT for facts. SUMMARIZE compresses NARRATIVE flow; EXTRACT distills DATA.
Next steps
- Memory guide — the full type × strategy matrix
- Fact extraction — the structured-data alternative to summarization
- Auto memory (hybrid) — stack SUMMARIZE alongside other layers
