Causal memory deep-dive
How the trace becomes the agent's working memory: the exact snapshot shape persisted, the four projection modes, and a worked replay end-to-end. This is the fourth workflow, one that no other framework has.
A loan officer agent decides on Monday to reject application #42 — credit score 580, threshold 600. On Friday a different user asks "why was application #42 rejected?" You want the agent to answer from the EXACT decision evidence, not from "memory of memory" reconstruction. That requires the framework to record the agent's reasoning at the moment it happened, not summarize after the fact. This page shows exactly what that recording looks like.
The fourth workflow — agent reads its own trace
If contextual errors are the new class of bug (Why agentfootprint?), then the trace is the recording that makes them debuggable. agentfootprint's observability hands the trace back as three workflows — Live, Offline, Detailed.
Causal memory adds the fourth: the agent itself reads the trace. Six months later, "why did you reject loan #42?" answers from the recorded evidence (creditScore=580, threshold=600), not a rerun. The trace becomes the agent's working memory.
This is the differentiator no other framework on the market has today. Other frameworks' memory remembers what was said — ours remembers what was decided.
What gets persisted
When you configure defineMemory({ type: MEMORY_TYPES.CAUSAL, ... }), every agent.run() records a SnapshotEntry to the configured store. The snapshot captures the agent's run frozen as JSON — the (query, finalContent) pair, the decisions collected via decide()/select(), the toolCalls trajectory, an optional rendered narrative, plus timing and token usage.
The snapshot is the THING. Subsequent retrievals load it back into context; the LLM reads it on a follow-up turn; the answer is grounded in WHAT THE AGENT ACTUALLY DID, not in what an LLM thinks it probably did.
Each snapshot answers four typed questions
The snapshot maps directly onto the four backtrack questions every agentfootprint LLM call answers (see Why § the four backtrack questions):
| Question | Where in the snapshot |
|---|---|
| What happened? | finalContent — the answer the agent produced for the query |
| Who/why decided it? | decisions[] — each DecisionRecord: the stageId, the chosen branch, the matched rule, and the evidence values that satisfied it |
| What did it call? | toolCalls[] — each ToolCallRecord: tool name, args, truncated resultPreview, errored flag |
| How long / how much? | iterations, durationMs, tokenUsage |
Causal memory persists all of these. Reading the snapshot back IS asking the four backtrack questions — without re-running the agent.
The shape (annotated)
A SnapshotEntry looks like this when serialized:
{
  "query": "Should we approve loan #42? Credit: 580. Income: 95k. Amount: 50k.",
  "finalContent": "Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review.",
  "iterations": 1,
  "decisions": [
    {
      "stageId": "classify-risk",
      "chosen": "rejected",
      "rule": "Marginal credit",
      "evidence": { "creditScore": 580, "threshold": 600 }
    },
    {
      "stageId": "pick-reason",
      "chosen": "credit-too-low",
      "rule": "Credit below floor",
      "evidence": { "creditScore": 580, "incomeAboveFloor": true }
    }
  ],
  "toolCalls": [
    {
      "name": "lookupCreditBureau",
      "args": { "applicationId": "42" },
      "resultPreview": "{ \"score\": 580, \"bureau\": \"experian\" }",
      "errored": false
    }
  ],
  "narrative": "[Seed] application #42 received\n[ClassifyRisk] credit 580 < threshold 600 → rejected\n[PickReason] chose credit-too-low (income above floor)",
  "durationMs": 1840,
  "tokenUsage": { "input": 1200, "output": 180 }
}
The store wraps this value in a MemoryEntry<SnapshotEntry> that carries retrieval metadata (id, embedding, embeddingModel, createdAt, ttl, …). The query embedding lives on the wrapper's embedding field, not on the snapshot itself.
Three fields are load-bearing:
- decisions: every decide()/select() choice, with the stageId, the chosen branch, the matched rule, and the evidence values that fed the predicate
- finalContent: the agent's answer, paired with query for retrieval and for SFT/DPO export
- toolCalls: the tool-use trajectory, each with name / args / truncated resultPreview
Together they ARE the agent's reasoning, byte-for-byte.
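As a rough illustration of how a stored snapshot answers the four backtrack questions without a re-run, here is a minimal sketch. The interfaces are local, hypothetical types inferred from the JSON above (the library's exported types may differ), and readBacktrackAnswers is an illustrative helper, not part of agentfootprint.

```typescript
// Hypothetical local types mirroring the serialized JSON shown above.
// The library's real exported types may differ.
interface DecisionRecord {
  stageId: string;
  chosen: string;
  rule: string;
  evidence: Record<string, unknown>;
}

interface ToolCallRecord {
  name: string;
  args: Record<string, unknown>;
  resultPreview: string;
  errored: boolean;
}

interface SnapshotEntry {
  query: string;
  finalContent: string;
  iterations: number;
  decisions: DecisionRecord[];
  toolCalls: ToolCallRecord[];
  narrative?: string;
  durationMs: number;
  tokenUsage: { input: number; output: number };
}

// Illustrative helper: read the four answers straight from the stored fields.
function readBacktrackAnswers(s: SnapshotEntry) {
  return {
    whatHappened: s.finalContent,
    whyDecided: s.decisions.map(
      (d) => `${d.stageId} -> ${d.chosen} (${d.rule}): ${JSON.stringify(d.evidence)}`,
    ),
    whatCalled: s.toolCalls.map((t) => `${t.name}${t.errored ? ' [errored]' : ''}`),
    howMuch: {
      iterations: s.iterations,
      durationMs: s.durationMs,
      totalTokens: s.tokenUsage.input + s.tokenUsage.output,
    },
  };
}

// Abbreviated copy of the loan #42 snapshot.
const snapshot: SnapshotEntry = {
  query: 'Should we approve loan #42? Credit: 580. Income: 95k. Amount: 50k.',
  finalContent: 'Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review.',
  iterations: 1,
  decisions: [
    {
      stageId: 'classify-risk',
      chosen: 'rejected',
      rule: 'Marginal credit',
      evidence: { creditScore: 580, threshold: 600 },
    },
  ],
  toolCalls: [
    {
      name: 'lookupCreditBureau',
      args: { applicationId: '42' },
      resultPreview: '{ "score": 580, "bureau": "experian" }',
      errored: false,
    },
  ],
  durationMs: 1840,
  tokenUsage: { input: 1200, output: 180 },
};

const answers = readBacktrackAnswers(snapshot);
console.log(answers.whyDecided[0]);
// → classify-risk -> rejected (Marginal credit): {"creditScore":580,"threshold":600}
```

Every answer is a plain field read, which is why a follow-up turn needs no access to the original run.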
The four projection modes
When a snapshot is RETRIEVED for a follow-up turn, you don't always want the whole thing — context budget matters. projection controls what slice gets injected into the next prompt:
| Projection | What lands in the prompt | Use when |
|---|---|---|
| SNAPSHOT_PROJECTIONS.DECISIONS (default) | the decisions[] evidence: each stageId → chosen with rule + evidence | "Why did the agent decide X?" follow-ups (cheapest) |
| SNAPSHOT_PROJECTIONS.COMMITS | a per-stage stageId: chose "X" summary derived from decisions[] | "What state did the agent reach?" |
| SNAPSHOT_PROJECTIONS.NARRATIVE | the rendered narrative string | Human-readable replay |
| SNAPSHOT_PROJECTIONS.FULL | the whole serialized snapshot (largest, most expensive) | Forensic-grade audit |
Most apps use DECISIONS — it's the default, the smallest projection that answers the "why" question, and what cheap-model triage works best with.
Note: a full footprintjs commitLog isn't yet captured in SnapshotEntry, so COMMITS currently projects a compact summary of the decisions[] array as a stand-in. The field set is intentionally minimal today; richer commit/narrative capture is a follow-up FlowRecorder integration.
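To make the four modes concrete, the sketch below shows one way a projection step could turn a snapshot into prompt text. It is an assumption-laden illustration: the Projection string union, the trimmed Snapshot type, the project function, and the exact output formatting are all hypothetical and are not the library's renderer.

```typescript
// Hypothetical stand-ins for SNAPSHOT_PROJECTIONS and SnapshotEntry.
type Projection = 'decisions' | 'commits' | 'narrative' | 'full';

interface Decision {
  stageId: string;
  chosen: string;
  rule: string;
  evidence: Record<string, unknown>;
}

interface Snapshot {
  query: string;
  finalContent: string;
  decisions: Decision[];
  narrative?: string;
}

// Illustrative projection: pick the slice of the snapshot to inject.
function project(s: Snapshot, mode: Projection = 'decisions'): string {
  switch (mode) {
    case 'decisions':
      // Smallest slice that still answers "why": rule plus evidence per stage.
      return s.decisions
        .map((d) => `${d.stageId} -> ${d.chosen} [rule: ${d.rule}] evidence: ${JSON.stringify(d.evidence)}`)
        .join('\n');
    case 'commits':
      // Compact per-stage summary derived from decisions[].
      return s.decisions.map((d) => `${d.stageId}: chose "${d.chosen}"`).join('\n');
    case 'narrative':
      return s.narrative ?? '';
    case 'full':
      return JSON.stringify(s);
  }
}

const snap: Snapshot = {
  query: 'Should we approve loan #42? Credit: 580. Income: 95k. Amount: 50k.',
  finalContent: 'Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review.',
  decisions: [
    { stageId: 'classify-risk', chosen: 'rejected', rule: 'Marginal credit', evidence: { creditScore: 580, threshold: 600 } },
    { stageId: 'pick-reason', chosen: 'credit-too-low', rule: 'Credit below floor', evidence: { creditScore: 580, incomeAboveFloor: true } },
  ],
  narrative: '[ClassifyRisk] credit 580 < threshold 600 -> rejected',
};

console.log(project(snap, 'commits'));
```

In this sketch the FULL output is always the longest string, which mirrors the context-budget trade-off in the table.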
Define the memory
const causal = defineMemory({
  id: 'causal',
  description: 'Store snapshots of past runs; replay decisions on follow-up.',
  type: MEMORY_TYPES.CAUSAL,
  strategy: {
    kind: MEMORY_STRATEGIES.TOP_K,
    topK: 1, // single best-matching past run
    threshold: 0.5, // strict: drop weak matches (no fallback)
    embedder,
  },
  store,
  projection: SNAPSHOT_PROJECTIONS.DECISIONS, // inject decision evidence
});
topK: 1 retrieves the single best-matching past run. threshold: 0.5 is strict: if no past run cosine-matches above 0.5, NO injection happens and the LLM sees no past context. Garbage low-confidence matches make hallucination MORE likely; the strict threshold is intentional.
embedder is the SAME instance used at write time. Mixing embedders silently corrupts retrieval; the framework tags entries with embedderId and filters on read.
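The retrieval contract described here (strict threshold, no fallback, embedder tagging) can be sketched in a few lines. This is not the library's implementation: StoredEntry, retrieveStrict, and the cosine helper are hypothetical names, and the strict "greater than" comparison is an assumption based on the phrase "above 0.5".

```typescript
// Hypothetical stored-entry wrapper; the real MemoryEntry has more fields.
interface StoredEntry<T> {
  id: string;
  embedding: number[];
  embedderId: string; // tag naming the embedder used at write time
  value: T;
}

// Cosine similarity; assumes both vectors have the same dimension.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Strict top-K: filter by embedder tag, drop anything at or below the
// threshold, and return an empty list rather than a weak best-effort match.
function retrieveStrict<T>(
  query: number[],
  queryEmbedderId: string,
  entries: StoredEntry<T>[],
  topK: number,
  threshold: number,
): StoredEntry<T>[] {
  return entries
    .filter((e) => e.embedderId === queryEmbedderId)
    .map((e) => ({ entry: e, score: cosine(query, e.embedding) }))
    .filter((x) => x.score > threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((x) => x.entry);
}

const entries: StoredEntry<string>[] = [
  { id: 'run-a', embedding: [1, 0], embedderId: 'e1', value: 'loan #42 decision' },
  { id: 'run-b', embedding: [0.6, 0.8], embedderId: 'e1', value: 'loan #43 decision' },
  { id: 'run-c', embedding: [1, 0], embedderId: 'e2', value: 'written with another embedder' },
];

const hit = retrieveStrict([1, 0], 'e1', entries, 1, 0.5);
const miss = retrieveStrict([-1, 0], 'e1', entries, 1, 0.5);
console.log(hit.map((e) => e.id), miss.length);
```

The empty result in the second call is the point of the design: when nothing clears the threshold, nothing is injected.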
A worked replay
import { Agent, defineMemory, MEMORY_TYPES, MEMORY_STRATEGIES, SNAPSHOT_PROJECTIONS } from 'agentfootprint';
import { anthropic } from 'agentfootprint/llm-providers'; // vendor-SDK providers live on this subpath
// Monday — production decision
const monday = Agent.create({ provider: anthropic({...}), model: 'claude-sonnet-4-5-20250929' })
.system('You decide loan applications based on credit + income + amount.')
.memory(causal) // CAUSAL × TOP_K × DECISIONS
.build();
await monday.run({
message: 'Should we approve application #42? Credit: 580. Income: 95k. Amount: 50k.',
identity: { tenant: 'lending', conversationId: 'app-42' },
});
// → "Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review."
// (snapshot persisted automatically, including the decisions[] collected via decide()/select())
// ─── Friday — different user, different conversation ──────────────────
const friday = Agent.create({ provider: anthropic({...}), model: 'claude-haiku-4-5-20251001' })
// ↑ NOTE: cheap follow-up model. Causal memory makes this safe — the
// trace IS the reasoning, so a smaller model can read it.
.memory(causal) // same definition, different agent instance
.build();
await friday.run({
message: 'Why was application #42 rejected?',
identity: { tenant: 'lending', conversationId: 'app-42-followup' },
});
// → "Application #42 was rejected because credit score 580 was below
// the threshold of 600. The decision was made on Monday at 2:14 PM."
The Friday agent's prompt includes the projected decisions[] entries from Monday's run. The LLM sees the classify-risk → "rejected" choice with evidence: { creditScore: 580, threshold: 600 }, grounding the follow-up in stored past facts rather than reconstruction.
Why this works — the cheap-model triage economic argument
A trace recorded from your expensive production model (Sonnet-4) is a perfectly good input for a small, fast, cheap model (Haiku, GPT-4o-mini) answering follow-up questions about that run. Reading recorded decision evidence is structurally simpler than re-deriving the answer from first principles — so a smaller model is enough.
Across a production system that handles audit / explain / "why did the agent do X?" traffic, the cost difference compounds:
| Approach | Cost per turn |
|---|---|
| Sonnet-4 re-deriving the answer from raw history | $0.045 |
| Haiku-4-5 reading the projected decisions[] from snapshot | ~$0.005 |
~10× cost reduction on follow-up traffic. Same correctness — better, in fact, because the cheap model isn't hallucinating from compressed memory; it's reading recorded facts.
This is the second of the three downstream consumers fanning out of one recording (see the hero diagram above): cheap-model triage.
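The arithmetic behind the cost claim is simple enough to check directly. The sketch below uses the two per-turn figures quoted above; the monthly volume of 100,000 follow-up turns is a hypothetical number chosen only for illustration, and followUpSavings is an illustrative helper.

```typescript
// Per-turn figures quoted in the cost comparison (USD).
const SONNET_COST_PER_TURN = 0.045;
const HAIKU_COST_PER_TURN = 0.005; // approximate

// Illustrative helper: cost ratio and monthly savings for a given volume.
function followUpSavings(turnsPerMonth: number) {
  const ratio = SONNET_COST_PER_TURN / HAIKU_COST_PER_TURN;
  const savings = (SONNET_COST_PER_TURN - HAIKU_COST_PER_TURN) * turnsPerMonth;
  return {
    // Round to tame floating-point noise in the division and multiplication.
    ratio: Math.round(ratio * 10) / 10,
    monthlySavingsUsd: Math.round(savings * 100) / 100,
  };
}

// Hypothetical volume: 100,000 follow-up turns per month.
const estimate = followUpSavings(100_000);
console.log(estimate); // → { ratio: 9, monthlySavingsUsd: 4000 }
```

With these inputs the ratio works out to 9×, consistent with the approximate ~10× figure given that the Haiku cost is itself an estimate.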
The third downstream consumer — training data export
The same JSON snapshot shape feeds SFT / DPO / process-RL training pipelines. Production traffic becomes labeled trajectories with zero extra instrumentation:
- Every successful customer interaction → positive trajectory
- Every escalation / override → counter-example
- Every decide() evidence → the supervision signal for process-RL
- Every select() rationale → the verbal explanation that DPO/RLAIF needs
The export API (causalMemory.exportForTraining({ format: 'sft' | 'dpo' | 'process-rl' })) is roadmap work — tracked in GitHub issues. The snapshot shape that makes it possible is shipping today; the export wrapper is the missing 200 lines.
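Until that export API ships, the mapping from snapshot to training record can be pictured roughly as follows. Everything here is a hypothetical sketch: SftRecord, ProcessStep, toSftRecord, toProcessSteps, and toJsonl are illustrative names, and the record layouts are assumptions, not the planned exportForTraining output.

```typescript
// Trimmed, hypothetical view of a snapshot: only the fields used for export.
interface ExportableSnapshot {
  query: string;
  finalContent: string;
  decisions: { stageId: string; chosen: string; rule: string; evidence: Record<string, unknown> }[];
}

// Assumed SFT layout: the (query, finalContent) pair as prompt/completion.
interface SftRecord {
  prompt: string;
  completion: string;
}

// Assumed process-supervision layout: one labeled step per decision.
interface ProcessStep {
  stageId: string;
  chosen: string;
  evidence: Record<string, unknown>;
}

function toSftRecord(s: ExportableSnapshot): SftRecord {
  return { prompt: s.query, completion: s.finalContent };
}

function toProcessSteps(s: ExportableSnapshot): ProcessStep[] {
  return s.decisions.map((d) => ({ stageId: d.stageId, chosen: d.chosen, evidence: d.evidence }));
}

// JSON Lines: one serialized record per line.
function toJsonl(records: unknown[]): string {
  return records.map((r) => JSON.stringify(r)).join('\n');
}

const run: ExportableSnapshot = {
  query: 'Should we approve loan #42? Credit: 580. Income: 95k. Amount: 50k.',
  finalContent: 'Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review.',
  decisions: [
    { stageId: 'classify-risk', chosen: 'rejected', rule: 'Marginal credit', evidence: { creditScore: 580, threshold: 600 } },
  ],
};

const sftLine = toJsonl([toSftRecord(run)]);
const steps = toProcessSteps(run);
console.log(sftLine);
console.log(steps.length);
```

The sketch only reshapes fields already present in the snapshot, which is the sense in which no extra instrumentation is needed.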
One recording, three economics. No other framework on the market has this shape.
Anti-patterns
- ❌ Don't fall back to top-K-anyway when the threshold misses. The library injects nothing in that case by design; garbage past context is worse than no context.
- ❌ Don't change embedders between writes and reads. The framework tags + filters, but only if you respect the contract.
- ❌ Don't use FULL projection by default. It's the largest payload; reserve for forensic-grade scenarios. DECISIONS covers 90% of real follow-up queries.
- ❌ Don't share the causal store across tenants. The MemoryIdentity tuple namespaces, but only if you pass per-tenant identity at every agent.run() call.
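The tenant anti-pattern is easier to see with a toy store. The sketch below is hypothetical: the Identity type and TenantScopedStore class are illustrative, and scoping on tenant alone is an assumption inferred from the worked replay, where Friday's run retrieves Monday's snapshot under a different conversationId.

```typescript
// Hypothetical identity tuple, shaped like the one passed to agent.run().
interface Identity {
  tenant: string;
  conversationId: string;
}

// Toy in-memory store that namespaces entries by tenant only (assumption).
class TenantScopedStore<T> {
  private readonly byTenant = new Map<string, T[]>();

  put(identity: Identity, value: T): void {
    const bucket = this.byTenant.get(identity.tenant) ?? [];
    bucket.push(value);
    this.byTenant.set(identity.tenant, bucket);
  }

  // Returns a copy so callers cannot mutate another run's entries.
  list(identity: Identity): T[] {
    return [...(this.byTenant.get(identity.tenant) ?? [])];
  }
}

const scoped = new TenantScopedStore<string>();
scoped.put({ tenant: 'lending', conversationId: 'app-42' }, 'loan #42 snapshot');

// Same tenant, different conversation: the snapshot is visible.
const sameTenant = scoped.list({ tenant: 'lending', conversationId: 'app-42-followup' });
// Different tenant: nothing leaks across the namespace.
const otherTenant = scoped.list({ tenant: 'insurance', conversationId: 'claim-7' });
console.log(sameTenant.length, otherTenant.length); // → 1 0
```

The isolation holds only because every call supplies its own tenant; a shared or hard-coded identity would put every tenant in the same bucket.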
Next steps
- Memory guide — the full type × strategy matrix that includes Causal
- Auto memory (Hybrid) — stack causal alongside recent + facts
- Observability guide — the three workflows the trace also enables
- Citations & papers — the research foundation
Ask your agent "why?"
.selfExplain() lets the agent answer follow-up why-questions about its OWN previous turn from the recorded trace — one builder call, the trace tools gated until asked, evidence bound to the completed run (never the in-flight one), and a cheaper-model delegate mode.
Testing
mock({ replies }) + mockMcpClient + InMemoryStore — script every agent decision deterministically. No API keys, no flake, no hidden $40k OpenAI bill from a runaway test loop.
