
Causal memory deep-dive

How the trace becomes the agent's working memory: the exact snapshot shape persisted, the four projection modes, and a worked replay end-to-end. The fourth workflow that no other framework has.

A loan officer agent decides on Monday to reject application #42 — credit score 580, threshold 600. On Friday a different user asks "why was application #42 rejected?" You want the agent to answer from the EXACT decision evidence, not from "memory of memory" reconstruction. That requires the framework to record the agent's reasoning at the moment it happened, not summarize after the fact. This page shows exactly what that recording looks like.

The fourth workflow — agent reads its own trace

If contextual errors are the new class of bug (Why agentfootprint?), then the trace is the recording that makes them debuggable. agentfootprint's observability hands the trace back as three workflows — Live, Offline, Detailed.

Causal memory adds the fourth: the agent itself reads the trace. Six months later, "why did you reject loan #42?" answers from the recorded evidence (creditScore=580, threshold=600), not a rerun. The trace becomes the agent's working memory.

One agent run produces a JSON-portable causal trace. Three downstream consumers fan out (audit replay, cheap-model triage, training data export). The fourth, novel: the agent reads its own trace via causal memory.

This is the differentiator no other framework on the market has today. Other frameworks' memory remembers what was said — ours remembers what was decided.

What gets persisted

When you configure defineMemory({ type: MEMORY_TYPES.CAUSAL, ... }), every agent.run() records a SnapshotEntry to the configured store. The snapshot captures the agent's run frozen as JSON — the (query, finalContent) pair, the decisions collected via decide()/select(), the toolCalls trajectory, an optional rendered narrative, plus timing and token usage.

The snapshot is the THING. Subsequent retrievals load it back into context; the LLM reads it on a follow-up turn; the answer is grounded in WHAT THE AGENT ACTUALLY DID, not in what an LLM thinks it probably did.

Each snapshot answers four typed questions

The snapshot maps directly onto the four backtrack questions every agentfootprint LLM call answers (see Why § the four backtrack questions):

| Question | Where in the snapshot |
| --- | --- |
| What happened? | finalContent: the answer the agent produced for the query |
| Who/why decided it? | decisions[]: each DecisionRecord holds the stageId, the chosen branch, the matched rule, and the evidence values that satisfied it |
| What did it call? | toolCalls[]: each ToolCallRecord holds the tool name, args, truncated resultPreview, and errored flag |
| How long / how much? | iterations, durationMs, tokenUsage |

Causal memory persists all of these. Reading the snapshot back IS asking the four backtrack questions — without re-running the agent.

The shape (annotated)

A SnapshotEntry looks like this when serialized:

{
  "query": "Should we approve loan #42? Credit: 580. Income: 95k. Amount: 50k.",
  "finalContent": "Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review.",
  "iterations": 1,

  "decisions": [
    {
      "stageId": "classify-risk",
      "chosen": "rejected",
      "rule": "Marginal credit",
      "evidence": { "creditScore": 580, "threshold": 600 }
    },
    {
      "stageId": "pick-reason",
      "chosen": "credit-too-low",
      "rule": "Credit below floor",
      "evidence": { "creditScore": 580, "incomeAboveFloor": true }
    }
  ],

  "toolCalls": [
    {
      "name": "lookupCreditBureau",
      "args": { "applicationId": "42" },
      "resultPreview": "{ \"score\": 580, \"bureau\": \"experian\" }",
      "errored": false
    }
  ],

  "narrative": "[Seed] application #42 received\n[ClassifyRisk] credit 580 < threshold 600 → rejected\n[PickReason] chose credit-too-low (income above floor)",

  "durationMs": 1840,
  "tokenUsage": { "input": 1200, "output": 180 }
}

The store wraps this value in a MemoryEntry<SnapshotEntry> that carries retrieval metadata (id, embedding, embeddingModel, createdAt, ttl, …) — the query embedding lives on the wrapper's embedding field, not on the snapshot itself.
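The shape above can be written down as types. The following is a minimal sketch inferred from the serialized example and the wrapper fields named in the previous paragraph; it is not copied from agentfootprint's own declarations. The wrapper's `value` field name, the optional markers, the `createdAt`/`ttl` types, and the `isSnapshotEntry` guard are all assumptions made for illustration.

```typescript
// Sketch only: shapes inferred from the JSON example on this page.
interface DecisionRecord {
  stageId: string;
  chosen: string;
  rule: string;
  evidence: Record<string, unknown>;
}

interface ToolCallRecord {
  name: string;
  args: Record<string, unknown>;
  resultPreview: string; // truncated preview, not the full tool result
  errored: boolean;
}

interface SnapshotEntry {
  query: string;
  finalContent: string;
  iterations: number;
  decisions: DecisionRecord[];
  toolCalls: ToolCallRecord[];
  narrative?: string; // described as optional in the text
  durationMs: number;
  tokenUsage: { input: number; output: number };
}

// Retrieval metadata lives on the wrapper, not on the snapshot.
// `value` is a hypothetical field name; the field types are assumed.
interface MemoryEntry<T> {
  id: string;
  value: T;
  embedding?: number[];
  embeddingModel?: string;
  createdAt?: number;
  ttl?: number;
}

// Hypothetical runtime guard for a snapshot loaded from JSON.
function isSnapshotEntry(x: unknown): x is SnapshotEntry {
  if (typeof x !== 'object' || x === null) return false;
  const s = x as Record<string, unknown>;
  return (
    typeof s.query === 'string' &&
    typeof s.finalContent === 'string' &&
    Array.isArray(s.decisions) &&
    Array.isArray(s.toolCalls)
  );
}

const example: SnapshotEntry = {
  query: 'Should we approve loan #42? Credit: 580. Income: 95k. Amount: 50k.',
  finalContent: 'Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review.',
  iterations: 1,
  decisions: [
    { stageId: 'classify-risk', chosen: 'rejected', rule: 'Marginal credit', evidence: { creditScore: 580, threshold: 600 } },
  ],
  toolCalls: [
    { name: 'lookupCreditBureau', args: { applicationId: '42' }, resultPreview: '{ "score": 580, "bureau": "experian" }', errored: false },
  ],
  durationMs: 1840,
  tokenUsage: { input: 1200, output: 180 },
};

// Round-trip through JSON to mimic a store read, then wrap it.
const parsed: unknown = JSON.parse(JSON.stringify(example));
const wrapped: MemoryEntry<SnapshotEntry> = { id: 'snap-1', value: example, embeddingModel: 'example-embedder' };
```

Keeping the embedding on the wrapper means the snapshot itself stays a plain, portable JSON value.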

Three fields are load-bearing:

  1. decisions — every decide()/select() choice: the stageId, the chosen branch, the matched rule, and the evidence values that fed the predicate
  2. finalContent — the agent's answer, paired with query for retrieval and for SFT/DPO export
  3. toolCalls — the tool-use trajectory, each with name / args / truncated resultPreview

Together they ARE the agent's reasoning as recorded at run time, not a later summary. Only tool results are shortened, to the truncated resultPreview.

The four projection modes

When a snapshot is RETRIEVED for a follow-up turn, you don't always want the whole thing — context budget matters. projection controls what slice gets injected into the next prompt:

| Projection | What lands in the prompt | Use when |
| --- | --- | --- |
| SNAPSHOT_PROJECTIONS.DECISIONS (default) | the decisions[] evidence: each stageId → chosen with rule + evidence | "Why did the agent decide X?" follow-ups (cheapest) |
| SNAPSHOT_PROJECTIONS.COMMITS | a per-stage stageId: chose "X" summary derived from decisions[] | "What state did the agent reach?" |
| SNAPSHOT_PROJECTIONS.NARRATIVE | the rendered narrative string | Human-readable replay |
| SNAPSHOT_PROJECTIONS.FULL | the whole serialized snapshot (largest, most expensive) | Forensic-grade audit |

Most apps use DECISIONS — it's the default, the smallest projection that answers the "why" question, and what cheap-model triage works best with.

Note: a full footprintjs commitLog isn't yet captured in SnapshotEntry, so COMMITS currently projects a compact summary of the decisions[] array as a stand-in. The field set is intentionally minimal today; richer commit/narrative capture is a follow-up FlowRecorder integration.
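To make the four modes concrete, here is a minimal sketch of what a projection step could look like. It is an illustration of the table above, not the library's implementation: the `project` function, the string literals standing in for the `SNAPSHOT_PROJECTIONS.*` constants, and the exact rendered text format are all assumptions.

```typescript
// Stand-ins for SNAPSHOT_PROJECTIONS.*; the real constant values are not shown on this page.
type Projection = 'decisions' | 'commits' | 'narrative' | 'full';

interface Decision {
  stageId: string;
  chosen: string;
  rule: string;
  evidence: Record<string, unknown>;
}

interface Snapshot {
  query: string;
  finalContent: string;
  decisions: Decision[];
  narrative?: string;
}

// Hypothetical projection: returns the text slice that would be injected into the next prompt.
function project(s: Snapshot, mode: Projection): string {
  switch (mode) {
    case 'decisions': // stageId → chosen, with rule + evidence
      return s.decisions
        .map((d) => `${d.stageId} → ${d.chosen} (rule: ${d.rule}; evidence: ${JSON.stringify(d.evidence)})`)
        .join('\n');
    case 'commits': // compact per-stage summary derived from decisions[]
      return s.decisions.map((d) => `${d.stageId}: chose "${d.chosen}"`).join('\n');
    case 'narrative': // the rendered narrative, if one was captured
      return s.narrative ?? '';
    case 'full': // the whole serialized snapshot
      return JSON.stringify(s);
  }
}

const snapshot: Snapshot = {
  query: 'Should we approve loan #42? Credit: 580. Income: 95k. Amount: 50k.',
  finalContent: 'Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review.',
  decisions: [
    { stageId: 'classify-risk', chosen: 'rejected', rule: 'Marginal credit', evidence: { creditScore: 580, threshold: 600 } },
    { stageId: 'pick-reason', chosen: 'credit-too-low', rule: 'Credit below floor', evidence: { creditScore: 580, incomeAboveFloor: true } },
  ],
  narrative: '[ClassifyRisk] credit 580 < threshold 600 → rejected',
};

const commits = project(snapshot, 'commits');
const decisions = project(snapshot, 'decisions');
```

In this sketch COMMITS is derived purely from decisions[], which mirrors the stand-in behavior described in the note above.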

Define the memory

const causal = defineMemory({
  id: 'causal',
  description: 'Store snapshots of past runs; replay decisions on follow-up.',
  type: MEMORY_TYPES.CAUSAL,
  strategy: {
    kind: MEMORY_STRATEGIES.TOP_K,
    topK: 1,           // single best-matching past run
    threshold: 0.5,    // strict: drop weak matches (no fallback)
    embedder,
  },
  store,
  projection: SNAPSHOT_PROJECTIONS.DECISIONS,  // inject decision evidence
});

topK: 1 retrieves the single best-matching past run. threshold: 0.5 is strict — if no past run cosine-matches above 0.5, NO injection happens. The LLM sees no past context. Garbage low-confidence matches make hallucination MORE likely; the strict threshold is intentional.

embedder must be the SAME instance used at write time. Vectors produced by different embedders are not comparable, so mixing them silently corrupts retrieval; the framework tags entries with embedderId and filters on read.
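The retrieval behavior described here (filter by embedder, score by cosine similarity, drop anything under the threshold, keep the top K) can be sketched in a few lines. This is an assumption-laden illustration rather than the framework's code: the `Stored` shape, the `retrieve` and `cosine` helpers, and the inclusive `>=` comparison are hypothetical.

```typescript
// Hypothetical stored entry: a value plus its embedding and the embedder that produced it.
interface Stored<T> {
  value: T;
  embedding: number[];
  embedderId: string;
}

// Cosine similarity; returns 0 for zero-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Strict top-K: entries from other embedders are ignored, and an empty
// result means "inject nothing" rather than "fall back to the best weak match".
function retrieve<T>(
  entries: Stored<T>[],
  query: number[],
  embedderId: string,
  topK: number,
  threshold: number,
): Array<{ entry: Stored<T>; score: number }> {
  return entries
    .filter((e) => e.embedderId === embedderId)
    .map((e) => ({ entry: e, score: cosine(query, e.embedding) }))
    .filter((hit) => hit.score >= threshold) // inclusive bound is an assumption
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const entries: Stored<string>[] = [
  { value: 'app-42', embedding: [1, 0], embedderId: 'e1' },
  { value: 'app-7', embedding: [0.6, 0.8], embedderId: 'e1' },
  { value: 'other-embedder', embedding: [1, 0], embedderId: 'e2' },
];

const strong = retrieve(entries, [1, 0], 'e1', 1, 0.5); // best match clears the threshold
const weak = retrieve(entries, [0, 1], 'e1', 1, 0.9);   // nothing clears 0.9, so nothing is injected
```

Returning an empty list on a miss keeps the caller's prompt free of weak matches, which is the behavior the strict threshold is meant to guarantee.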

A worked replay

import { Agent, defineMemory, MEMORY_TYPES, MEMORY_STRATEGIES, SNAPSHOT_PROJECTIONS } from 'agentfootprint';
import { anthropic } from 'agentfootprint/llm-providers'; // vendor-SDK providers live on this subpath

// Monday — production decision
const monday = Agent.create({ provider: anthropic({...}), model: 'claude-sonnet-4-5-20250929' })
  .system('You decide loan applications based on credit + income + amount.')
  .memory(causal)  // CAUSAL × TOP_K × DECISIONS
  .build();

await monday.run({
  message: 'Should we approve application #42? Credit: 580. Income: 95k. Amount: 50k.',
  identity: { tenant: 'lending', conversationId: 'app-42' },
});
// → "Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review."

// (snapshot persisted automatically with decideLog/selectLog from the run)

// ─── Friday — different user, different conversation ──────────────────

const friday = Agent.create({ provider: anthropic({...}), model: 'claude-haiku-4-5-20251001' })
  // ↑ NOTE: cheap follow-up model. Causal memory makes this safe — the
  //   trace IS the reasoning, so a smaller model can read it.
  .memory(causal)  // same definition, different agent instance
  .build();

await friday.run({
  message: 'Why was application #42 rejected?',
  identity: { tenant: 'lending', conversationId: 'app-42-followup' },
});
// → "Application #42 was rejected because credit score 580 was below
//    the threshold of 600."

The Friday agent's prompt includes the projected decisions[] entries from Monday's run — the LLM sees the classify-risk → "rejected" choice with evidence: { creditScore: 580, threshold: 600 }, grounding the follow-up in stored past facts rather than reconstruction.

Why this works — the cheap-model triage economic argument

A trace recorded from your expensive production model (Sonnet-4) is a perfectly good input for a small, fast, cheap model (Haiku, GPT-4o-mini) answering follow-up questions about that run. Reading recorded decision evidence is structurally simpler than re-deriving the answer from first principles — so a smaller model is enough.

Across a production system that handles audit / explain / "why did the agent do X?" traffic, the cost difference compounds:

| Approach | Cost per turn |
| --- | --- |
| Sonnet-4 re-deriving the answer from raw history | $0.045 |
| Haiku-4-5 reading the projected decideLog from snapshot | ~$0.005 |

~10× cost reduction on follow-up traffic. Same correctness — better, in fact, because the cheap model isn't hallucinating from compressed memory; it's reading recorded facts.
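A quick back-of-the-envelope check of those figures, as a sketch: the two per-turn costs come from the table above, while the monthly volume of 100,000 follow-up turns is a made-up number used only to show how the difference compounds. The table values work out to a 9× ratio, which is in line with the rounded ~10× headline.

```typescript
// Per-turn costs from the table, in micro-dollars so the arithmetic stays in integers.
const SONNET_PER_TURN_MICROS = 45000; // $0.045
const HAIKU_PER_TURN_MICROS = 5000;   // ~$0.005

// Hypothetical helper: cost ratio and dollars saved for a given number of follow-up turns.
function followUpSavings(turns: number): { ratio: number; savedDollars: number } {
  const ratio = SONNET_PER_TURN_MICROS / HAIKU_PER_TURN_MICROS;
  const savedMicros = (SONNET_PER_TURN_MICROS - HAIKU_PER_TURN_MICROS) * turns;
  return { ratio, savedDollars: savedMicros / 1000000 };
}

// 100000 follow-up turns per month is an illustrative assumption, not a measured figure.
const monthly = followUpSavings(100000);
console.log(`ratio: ${monthly.ratio}x, saved: $${monthly.savedDollars}`);
```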

This is the second of the three downstream consumers fanning out of one recording (see the hero diagram above): cheap-model triage.

The third downstream consumer — training data export

The same JSON snapshot shape feeds SFT / DPO / process-RL training pipelines. Production traffic becomes labeled trajectories with zero extra instrumentation:

  • Every successful customer interaction → positive trajectory
  • Every escalation / override → counter-example
  • Every decide() evidence → the supervision signal for process-RL
  • Every select() rationale → the verbal explanation that DPO/RLAIF needs

The export API (causalMemory.exportForTraining({ format: 'sft' | 'dpo' | 'process-rl' })) is roadmap work — tracked in GitHub issues. The snapshot shape that makes it possible is shipping today; the export wrapper is the missing 200 lines.
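Since the export wrapper is not shipped yet, the following is only a sketch of how a snapshot might map onto training records. Every name here (`toSftExample`, `toProcessSteps`, `SftExample`, `ProcessStep`) is hypothetical, and the record formats are assumptions; only the snapshot fields come from this page.

```typescript
interface Decision {
  stageId: string;
  chosen: string;
  rule: string;
  evidence: Record<string, unknown>;
}

interface Snapshot {
  query: string;
  finalContent: string;
  decisions: Decision[];
}

// Hypothetical SFT record: the (query, finalContent) pair.
interface SftExample {
  prompt: string;
  completion: string;
}

// Hypothetical process-supervision step: one per recorded decision.
interface ProcessStep {
  stageId: string;
  chosen: string;
  evidence: Record<string, unknown>;
}

function toSftExample(s: Snapshot): SftExample {
  return { prompt: s.query, completion: s.finalContent };
}

function toProcessSteps(s: Snapshot): ProcessStep[] {
  return s.decisions.map((d) => ({ stageId: d.stageId, chosen: d.chosen, evidence: d.evidence }));
}

const snap: Snapshot = {
  query: 'Should we approve loan #42? Credit: 580. Income: 95k. Amount: 50k.',
  finalContent: 'Application #42: REJECTED. Credit (580) below floor (600). Suggest manual review.',
  decisions: [
    { stageId: 'classify-risk', chosen: 'rejected', rule: 'Marginal credit', evidence: { creditScore: 580, threshold: 600 } },
    { stageId: 'pick-reason', chosen: 'credit-too-low', rule: 'Credit below floor', evidence: { creditScore: 580, incomeAboveFloor: true } },
  ],
};

const sft = toSftExample(snap);
const steps = toProcessSteps(snap);
```

Both mappings are pure reads over the stored snapshot, which is why no extra instrumentation is needed at run time.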

One recording, three economics. No other framework on the market has this shape.

Anti-patterns

  • Don't fall back to top-K-anyway when threshold misses. The library injects nothing by design; garbage past context is worse than no context.
  • Don't change embedders between writes and reads. The framework tags + filters, but only if you respect the contract.
  • Don't use FULL projection by default. It's the largest payload; reserve for forensic-grade scenarios. DECISIONS covers 90% of real follow-up queries.
  • Don't share the causal store across tenants. The MemoryIdentity tuple namespaces, but only if you pass per-tenant identity at every agent.run() call.
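The tenant-isolation point can be illustrated with a toy store. This is a sketch under stated assumptions, not the library's store: `TenantScopedStore` is hypothetical, and it treats `tenant` as the isolation boundary because the worked replay retrieves Monday's snapshot from a different conversationId within the same tenant.

```typescript
// Identity fields as used in the worked replay on this page.
interface MemoryIdentity {
  tenant: string;
  conversationId: string;
}

// Hypothetical in-memory store that partitions entries by tenant.
class TenantScopedStore<T> {
  private readonly byTenant = new Map<string, T[]>();

  put(identity: MemoryIdentity, value: T): void {
    const bucket = this.byTenant.get(identity.tenant) ?? [];
    bucket.push(value);
    this.byTenant.set(identity.tenant, bucket);
  }

  // Reads only ever see the caller's own tenant bucket.
  list(identity: MemoryIdentity): T[] {
    return [...(this.byTenant.get(identity.tenant) ?? [])];
  }
}

const scoped = new TenantScopedStore<string>();
scoped.put({ tenant: 'lending', conversationId: 'app-42' }, 'snapshot for app-42');

// Same tenant, different conversation: visible.
const sameTenant = scoped.list({ tenant: 'lending', conversationId: 'app-42-followup' });
// Different tenant: nothing leaks across.
const otherTenant = scoped.list({ tenant: 'insurance', conversationId: 'claim-1' });
```

The partitioning only helps if every call passes the right identity, which is the contract the bullet above describes.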
