AI agents fail in a way classic debuggers can't see — the code is right, but the answer is wrong because of the context the model saw. agentfootprint records that context so you can trace any wrong answer back to its cause.

Your support agent told a customer their refund was processed. Six weeks later they come back and ask "why did you tell me my refund was processed when it wasn't?" You go to look. The agent is gone. Logs are scattered across three services. The decision evidence is not there. This is the gap agentfootprint exists to close.

The new class of bug

For fifty years, software bugs have been logic errors. A wrong condition, a missed edge case, an off-by-one. You step through the code with a debugger until you find the bad branch.

LLM-powered apps add a second class: contextual errors. The code is correct. The model is correct. The answer is wrong because the LLM's decision rests on context that was ambiguous, missing, or invalidated at the moment of inference.

Tracking which content the model actually saw, and why, is the entire debugging job. Without it, the failure mode is invisible:

What got injected wrong	What the model did
Wrong instruction landed in the `system` slot	Followed the wrong rule
Predicate fired one iteration too early	Reasoned with stale assumptions
Skill body missing when LLM called `read_skill`	Invented its own
Cache prefix invalidated mid-iteration	Saw a silently rewritten stale version
Tool returned but `on-tool-return` injection didn't fire	Couldn't interpret the result

The model doesn't tell you which of these went wrong. It just gives you the wrong answer. You can't step through it with a debugger. By the time you read the response, the context that produced it is gone, unless something recorded it.

A framework that owns the control flow can debug logic errors. A framework that owns the injection can debug contextual errors — because every injection is a typed event with a where, when, why, and how-it-cached.

Every LLM call has 3 fixed slots (system, messages, tools); every flavor lands in one slot under one of 4 fixed triggers. The grid is the entire context-engineering surface.

The four backtrack questions

Every agentfootprint LLM call backtracks to four typed answers, and they're the answers the wrong-answer customer is asking for, six weeks late:

Question	What the trace tells you
What was injected?	Every flavor of content the LLM saw on this iteration (Skill bodies, RAG passages, Steering rules, tool schemas).
Who triggered it?	Which rule fired (`always` / `rule` predicate / `on-tool-return` / `llm-activated` `read_skill`).
When did it fire?	Which iteration of the ReAct loop, after which event (which tool returned, which skill activated).
How did it land?	Which slot (`system` / `messages` / `tools`), what position, what cache strategy, whether it was actually applied or skipped.

These aren't logged alongside the call. They're the structural output of the call — produced by the framework owning the runtime loop.
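To make the four questions concrete, here is a minimal sketch of what one recorded injection could look like and how a single iteration might be explained from it. The `InjectionRecord` shape, its field names, and `explainIteration` are hypothetical illustrations, not agentfootprint's actual types; only the slot and trigger names come from the tables above.

```typescript
// Hypothetical shapes for illustration; not agentfootprint's real types.
type InjectionSlot = 'system' | 'messages' | 'tools';
type InjectionTrigger = 'always' | 'rule' | 'on-tool-return' | 'llm-activated';

interface InjectionRecord {
  flavor: string;            // what was injected, e.g. 'skill' or 'steering'
  trigger: InjectionTrigger; // who triggered it
  iteration: number;         // when it fired (ReAct iteration)
  afterEvent?: string;       // the event it followed, if any
  slot: InjectionSlot;       // how it landed
  applied: boolean;          // false when the injection was skipped
}

// Summarize every injection recorded for one iteration of a run.
function explainIteration(records: InjectionRecord[], iteration: number): string[] {
  return records
    .filter((r) => r.iteration === iteration)
    .map(
      (r) =>
        `${r.flavor} -> ${r.slot} via ${r.trigger}` +
        (r.afterEvent ? ` after ${r.afterEvent}` : '') +
        (r.applied ? '' : ' (skipped)'),
    );
}

const sampleRecords: InjectionRecord[] = [
  { flavor: 'steering', trigger: 'always', iteration: 1, slot: 'system', applied: true },
  {
    flavor: 'skill',
    trigger: 'llm-activated',
    iteration: 2,
    afterEvent: 'read_skill',
    slot: 'system',
    applied: false,
  },
];

const iterationTwo = explainIteration(sampleRecords, 2);
```

In this sample, the second iteration shows a skill body that was requested but never applied, which corresponds to the "Skill body missing" row in the failure table.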

Connected evidence — built declaratively

agentfootprint gives you connected evidence — grounded, auditable, LLM-readable. Every iteration of the ReAct loop, every tool call with its args + result, every context injection, every decision branch is captured as a typed event in a single pass over the run — no instrumentation, no post-processing.

The agent + tool registration is built declaratively. The framework owns the loop, so it can record everything that happens inside it:

const agent = Agent.create({
  provider: provider ?? exampleProvider('feature', { respond: weatherRespond }),
  model: 'mock',
  maxIterations: 5,
  // reactMode: 'dynamic-grouped' wraps the LLM turn in an sf-llm-call subflow,
  // so Lens renders the agent's reasoning as an LLM group with its context
  // slots (system-prompt / messages / tools) nested inside (the SAME shape
  // the LLMCall primitive shows) instead of a bare "Final · RUNNER" card.
  reactMode: 'dynamic-grouped',
})
  .system('You answer weather questions using the `weather` tool.')
  .tool({
    schema: {
      name: 'weather',
      description: 'Get current weather for a city.',
      inputSchema: {
        type: 'object',
        properties: { city: { type: 'string' } },
        required: ['city'],
      },
    },
    execute: async (args) => `${(args as { city: string }).city}: sunny, 72°F`,
  })
  .build();

Observability is just attaching listeners — no SDK, no agent.observe wrapper, no separate tracing pipeline:

agent.on('agentfootprint.stream.tool_start', (e) =>
  console.log(`→ tool ${e.payload.toolName}(${JSON.stringify(e.payload.args)})`),
);
agent.on('agentfootprint.stream.tool_end', (e) =>
  console.log(`← tool result: ${e.payload.result}`),
);

Every event is typed. Every payload is what you'd want — toolName + args on tool_start, result on tool_end, iterationCount + token totals on turn_end. No string-parsing logs. No "I think this means the tool was called."
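As a rough illustration of why typed payloads remove the need for log parsing, the sketch below builds a tiny typed emitter from scratch. `SketchEventMap`, `SketchEmitter`, the short event names, and the `totalTokens` field are assumptions made for this example; only `toolName`, `args`, `result`, and `iterationCount` are named in the text above.

```typescript
// Hypothetical event map; not agentfootprint's real event catalogue.
interface SketchEventMap {
  tool_start: { toolName: string; args: Record<string, unknown> };
  tool_end: { result: string };
  turn_end: { iterationCount: number; totalTokens: number }; // totalTokens is assumed
}

type SketchHandler<K extends keyof SketchEventMap> = (payload: SketchEventMap[K]) => void;

class SketchEmitter {
  // Stored loosely inside; `on` and `emit` keep every caller fully typed.
  private handlers = new Map<keyof SketchEventMap, Array<(payload: unknown) => void>>();

  on<K extends keyof SketchEventMap>(type: K, handler: SketchHandler<K>): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler as (payload: unknown) => void);
    this.handlers.set(type, list);
  }

  emit<K extends keyof SketchEventMap>(type: K, payload: SketchEventMap[K]): void {
    for (const handler of this.handlers.get(type) ?? []) handler(payload);
  }
}

const toolLog: string[] = [];
const sketchAgent = new SketchEmitter();

// `p` is inferred per event name, so a typo such as `p.toolname` fails to compile.
sketchAgent.on('tool_start', (p) => {
  toolLog.push(`-> ${p.toolName}(${JSON.stringify(p.args)})`);
});
sketchAgent.on('tool_end', (p) => {
  toolLog.push(`<- ${p.result}`);
});

sketchAgent.emit('tool_start', { toolName: 'weather', args: { city: 'Paris' } });
sketchAgent.emit('tool_end', { result: 'Paris: sunny, 72°F' });
```

The design choice worth noting is that the event name selects the payload type, so a listener cannot read a field that its event does not carry.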

Three workflows from one trace

The same trace serves three workflows:

Mode	What you do	What the trace gives you
Live	Debug as you build	Exactly which injection produced which token; which predicate fired this iteration; which prefix actually got cached
Offline	Monitor what shipped	Replay any past run from its trace. Alert on drift. Attribute cost per injection.
Detailed	Improve via export	Every successful trajectory is labeled training data for fine-tuning or reinforcement learning — no separate data-collection phase

And a fourth, novel: the agent can read its own trace. Six months after the agent rejected loan #42, "why did you reject it?" answers from the recorded evidence (creditScore=580, threshold=600), not a rerun. Causal memory turns the trace into the agent's working memory.
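The loan example can be sketched as a lookup over recorded evidence rather than a rerun. The `DecisionRecord` shape and `explainDecision` helper below are hypothetical and only show the idea; the real trace format is not described in this section.

```typescript
// Hypothetical trace entry; field names are assumptions for illustration.
interface DecisionRecord {
  subject: string;                  // e.g. 'loan #42'
  outcome: 'approved' | 'rejected';
  evidence: Record<string, number>; // values captured at decision time
}

// Answer "why?" from what was recorded, without calling the model again.
function explainDecision(trace: DecisionRecord[], subject: string): string {
  const record = trace.find((d) => d.subject === subject);
  if (!record) return `No recorded decision for ${subject}.`;
  const facts = Object.entries(record.evidence)
    .map(([key, value]) => `${key}=${value}`)
    .join(', ');
  return `${subject} was ${record.outcome} based on ${facts}.`;
}

const loanTrace: DecisionRecord[] = [
  { subject: 'loan #42', outcome: 'rejected', evidence: { creditScore: 580, threshold: 600 } },
];

// A JSON round trip stands in for storing the trace and loading it months later.
const restoredTrace = JSON.parse(JSON.stringify(loanTrace)) as DecisionRecord[];
const loanAnswer = explainDecision(restoredTrace, 'loan #42');
```

Because the answer is derived from stored values, it stays the same even if the model, prompt, or threshold has changed since the original run.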

One agent run produces a JSON-portable causal trace. Three downstream consumers fan out: audit replay, cheap-model triage, training data export.

Dynamic ReAct — context recomposes every iteration

The corollary that makes "context engineering" worth the name. Other frameworks assemble the prompt once per turn; agentfootprint re-runs every Injection trigger every iteration. Tool schemas, system prompts, skill bodies, memory recall — all recompute against the freshest state.

Classic ReAct loops back to CallLLM (slots frozen). Dynamic ReAct loops back to SystemPrompt (slots recompose every iteration).

Iteration	Classic ReAct	Dynamic ReAct (agentfootprint)
1	12 tools shown	1 tool (`read_skill`)
2	12 tools shown	5 tools (skill activated)
3	12 tools shown	5 tools

Per-iteration recomposition is also the structural prerequisite for the cache layer — cache markers can't track active injections in lockstep without it.
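The tool counts in the table can be reproduced with a small sketch of per-iteration recomposition. Everything here is illustrative: `LoopState`, `ToolRule`, `composeTools`, the `billing` skill, and the four billing tool names are hypothetical, and `activeWhen` borrows only its name from the prose, not its real signature.

```typescript
// Hypothetical loop state and rule shapes; not agentfootprint's real API.
interface LoopState {
  iteration: number;
  activeSkills: string[];
}

interface ToolRule {
  toolName: string;
  activeWhen: (state: LoopState) => boolean;
}

// Made-up tool names that belong to a made-up 'billing' skill.
const billingTools = ['lookup_invoice', 'issue_refund', 'cancel_plan', 'update_card'];

const toolRules: ToolRule[] = [
  { toolName: 'read_skill', activeWhen: () => true },
  ...billingTools.map((toolName) => ({
    toolName,
    activeWhen: (state: LoopState) => state.activeSkills.includes('billing'),
  })),
];

// Re-run at the top of every iteration, so the tools slot follows the state.
function composeTools(state: LoopState): string[] {
  return toolRules.filter((rule) => rule.activeWhen(state)).map((rule) => rule.toolName);
}

const toolsAtIteration1 = composeTools({ iteration: 1, activeSkills: [] });
const toolsAtIteration2 = composeTools({ iteration: 2, activeSkills: ['billing'] });
```

With no skill active, only `read_skill` is exposed; once the skill is active, the same rules yield five tools, with no explicit state machine deciding the transition.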

The deeper point — your harness is your application. Because context recomposes every iteration, you declare rules (an instruction's activeWhen, a skill, a steering doc), not a state machine. Which tool is exposed, which reminder fires, which skill is active — the behavior emerges from the runtime re-evaluating those rules against the freshest state each loop. The harness's context-loop is your application logic — no hand-rolled coordination — and the same loop emits the trace you later interrogate. That's the whole stack in one line: context engineering (what the model sees) · the harness (the runtime that recomposes it every iteration) · explainability (the trace it hands back).

📖 See the Dynamic ReAct guide for the full taxonomy of what this unlocks (tool-by-tool steering, adaptive tool exposure, cost guardrails, iterative format refinement, failure adaptation).

$0 test runs — scripted ReAct

Tool-using agents are notoriously hard to test because the LLM's behavior is non-deterministic. mock({ replies }) solves that — script the LLM's decisions turn-by-turn, run the agent against the script, get a deterministic, free, instant test:

mock({
  replies: [
    {
      toolCalls: [
        {
          id: 'call-1',
          name: 'lookup',
          args: { topic: 'refunds' } as Record<string, unknown>,
        },
      ],
    },
    { content: 'Refunds take 3 business days.' },
  ],
});

Iteration 1 calls the tool with the scripted args; iteration 2 returns the final answer. The tool actually executes; the agent loop is real. Only the LLM is mocked. Swap mock(...) for anthropic(...) to ship — the rest of the agent is identical.
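The mechanics described above (scripted decisions, real tool execution) can be sketched in a few lines. This is not how `mock` is implemented; `runScripted`, `ScriptedReply`, and `ScriptedTool` are hypothetical names, and the tools are synchronous here only to keep the example short.

```typescript
// Hypothetical types for illustration; not agentfootprint's `mock` internals.
interface ScriptedToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}
interface ScriptedReply {
  toolCalls?: ScriptedToolCall[];
  content?: string;
}
type ScriptedTool = (args: Record<string, unknown>) => string;

// Walk the script one iteration at a time; tools execute for real.
function runScripted(
  replies: ScriptedReply[],
  tools: Record<string, ScriptedTool>,
  maxIterations = 5,
): { answer: string; toolResults: string[] } {
  const toolResults: string[] = [];
  for (let i = 0; i < Math.min(replies.length, maxIterations); i++) {
    const reply = replies[i];
    if (!reply) break;
    for (const call of reply.toolCalls ?? []) {
      const tool = tools[call.name];
      if (!tool) throw new Error(`Unknown tool: ${call.name}`);
      toolResults.push(tool(call.args));
    }
    // A reply with content is treated as the final answer.
    if (reply.content !== undefined) return { answer: reply.content, toolResults };
  }
  throw new Error('Script ended without a final answer');
}

const scriptedRun = runScripted(
  [
    { toolCalls: [{ id: 'call-1', name: 'lookup', args: { topic: 'refunds' } }] },
    { content: 'Refunds take 3 business days.' },
  ],
  { lookup: (args) => `policy for ${String(args['topic'])}` },
);
```

Because the script fixes every model decision, the run is deterministic, so a test can assert on both the final answer and the tool results.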

No graph DSL required

Composition is just chained .create() builders, not a separate graph language. Two agents in series? Sequence. Multiple critics with merge? Parallel. Iterate until quality bar? Loop. The compositions are how multi-agent systems are built — there's no MultiAgentSystem class, no Orchestrator to learn:

const runner = reflection({
  provider: provider ?? exampleProvider('pattern', { respond: () => replies[i++ % replies.length]! }),
  model: 'mock',
  proposerPrompt: 'Write or revise a short poem about night.',
  criticPrompt:
    'Critique the poem. When it is good enough include the marker DONE.',
  maxIterations: 5,
});

That's Reflexion (Shinn et al., 2023), expressed as Sequence(Agent, critique-LLM, Agent). Tree-of-Thoughts is Parallel(Agent × N) + LLM-rank. Debate is Loop(Agent × 2 + judge). Every named pattern in the agent literature is a composition of the same 2 primitives + 4 compositions. You learn the substrate; the field's growth lands as recipes on top.
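To show why no graph language is needed, the sketch below models compositions as ordinary higher-order functions. `TextStep`, `sequence`, and `loopUntil` are hypothetical stand-ins written for this example, not agentfootprint's Sequence or Loop, and the proposer and critic are plain functions in place of LLM-backed agents.

```typescript
// Hypothetical stand-in: a step is any function from text to text.
type TextStep = (input: string) => string;

// Run steps in order, feeding each output into the next step.
const sequence = (...steps: TextStep[]): TextStep => (input) =>
  steps.reduce((text, step) => step(text), input);

// Repeat a step until `done` accepts its output or the budget runs out.
const loopUntil = (
  body: TextStep,
  done: (output: string) => boolean,
  maxIterations: number,
): TextStep => (input) => {
  let text = input;
  for (let i = 0; i < maxIterations; i++) {
    text = body(text);
    if (done(text)) break;
  }
  return text;
};

// Scripted proposer and critic: the critic accepts the second draft.
let draftCount = 0;
const proposer: TextStep = () => `draft ${++draftCount}`;
const critic: TextStep = (poem) =>
  poem === 'draft 2' ? `${poem} DONE` : `${poem} needs work`;

const reflect = loopUntil(sequence(proposer, critic), (out) => out.includes('DONE'), 5);
const reflected = reflect('Write a short poem about night.');
```

The reflection pattern here is a loop over a two-step sequence; a different pattern would reuse the same two helpers with different steps.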

Built on footprintjs

agentfootprint is the agent layer. footprintjs is the substrate — the flowchart-pattern execution engine that makes our typed-event stream + replayable traces automatic. footprintjs gives us composition primitives, state-machine semantics, durable workflow checkpoints, and 57+ typed events out of the box; we used the budget those abstractions would have cost us to invest deeply in the injection loop — the layer every other framework leaves to the developer.

You don't need to learn footprintjs to use agentfootprint — but if you want to build your own primitives at this depth, start there.

Next steps

Quick Start — build your first agent
Key Concepts — the 5-layer taxonomy
Reliability gate — rules-based retry / fallback / fail-fast around every LLM call
Causal memory deep dive — how the trace becomes the agent's working memory
