Why agentfootprint?
AI agents fail in a way classic debuggers can't see — the code is right, but the answer is wrong because of the context the model saw. agentfootprint records that context so you can trace any wrong answer back to its cause.
Your support agent told a customer their refund was processed. Six weeks later they come back and ask "why did you tell me my refund was processed when it wasn't?" You go to look. The agent is gone. Logs are scattered across three services. The decision evidence is not there. This is the gap agentfootprint exists to close.
The new class of bug
For fifty years, software bugs have been logic errors. A wrong condition, a missed edge case, an off-by-one. You step through the code with a debugger until you find the bad branch.
LLM-powered apps add a second class: contextual errors. The code is correct. The model is correct. The answer is wrong because the LLM's decision rests on context that was ambiguous, missing, or invalidated at the moment of inference.
Tracking which content the model actually saw, and why, is the entire debugging job. Without it, the failure mode is invisible:
| What got injected wrong | What the model did |
|---|---|
| Wrong instruction landed in the system slot | Followed the wrong rule |
| Predicate fired one iteration too early | Reasoned with stale assumptions |
| Skill body missing when LLM called read_skill | Invented its own |
| Cache prefix invalidated mid-iteration | Saw a silently rewritten stale version |
| Tool returned but on-tool-return injection didn't fire | Couldn't interpret the result |
The model doesn't tell you which of these went wrong. It just gives you the wrong answer. You can't step through it with a debugger. By the time you read the response, the context that produced it is gone, unless something recorded it.
A framework that owns the control flow can debug logic errors. A framework that owns the injection can debug contextual errors — because every injection is a typed event with a where, when, why, and how-it-cached.
The four backtrack questions
Every agentfootprint LLM call backtracks to four typed answers, and they're the answers the customer with the wrong refund status is asking for, six weeks late:
| Question | What the trace tells you |
|---|---|
| What was injected? | Every flavor of content the LLM saw on this iteration (Skill bodies, RAG passages, Steering rules, tool schemas). |
| Who triggered it? | Which rule fired (always / rule predicate / on-tool-return / llm-activated read_skill). |
| When did it fire? | Which iteration of the ReAct loop, after which event (which tool returned, which skill activated). |
| How did it land? | Which slot (system / messages / tools), what position, what cache strategy, whether it was actually applied or skipped. |
These aren't logged alongside the call. They're the structural output of the call — produced by the framework owning the runtime loop.
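To make the four questions concrete, here is a minimal sketch of what one recorded injection could look like and how a trace of them answers the questions for a given iteration. The `InjectionRecord` type, its field names, the cache values, and the `backtrack` helper are all hypothetical illustrations, not agentfootprint's actual event schema.

```typescript
// Hypothetical record shape; field names are illustrative assumptions.
type Trigger = 'always' | 'rule' | 'on-tool-return' | 'llm-activated';
type Slot = 'system' | 'messages' | 'tools';

interface InjectionRecord {
  iteration: number;        // when: which ReAct iteration
  afterEvent: string;       // when: the event that preceded the injection
  flavor: string;           // what: e.g. 'skill', 'rag', 'steering', 'tool-schema'
  contentId: string;        // what: identifier of the injected content
  trigger: Trigger;         // who: which kind of rule fired
  slot: Slot;               // how: where the content landed
  position: number;         // how: order within the slot
  cache: 'none' | 'prefix'; // how: cache strategy (assumed values)
  applied: boolean;         // how: applied or skipped
}

interface BacktrackAnswer {
  what: string[];
  who: Trigger[];
  when: string[];
  how: string[];
}

// Answer the four questions for one iteration by filtering the trace.
function backtrack(records: InjectionRecord[], iteration: number): BacktrackAnswer {
  const hits = records.filter((r) => r.iteration === iteration);
  return {
    what: hits.map((r) => `${r.flavor}:${r.contentId}`),
    who: hits.map((r) => r.trigger),
    when: hits.map((r) => `iteration ${r.iteration}, after ${r.afterEvent}`),
    how: hits.map(
      (r) => `${r.slot}[${r.position}] cache=${r.cache} ${r.applied ? 'applied' : 'skipped'}`,
    ),
  };
}

const sampleTrace: InjectionRecord[] = [
  { iteration: 1, afterEvent: 'turn_start', flavor: 'steering', contentId: 'tone-rules',
    trigger: 'always', slot: 'system', position: 0, cache: 'prefix', applied: true },
  { iteration: 2, afterEvent: 'read_skill returned', flavor: 'skill', contentId: 'refund-policy',
    trigger: 'llm-activated', slot: 'system', position: 1, cache: 'none', applied: true },
  { iteration: 2, afterEvent: 'read_skill returned', flavor: 'tool-schema', contentId: 'issue_refund',
    trigger: 'rule', slot: 'tools', position: 0, cache: 'none', applied: false },
];

console.log(backtrack(sampleTrace, 2));
```

In this sketch the skipped `issue_refund` schema shows up in the `how` answer, which is the kind of detail a plain request log would not carry.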
Connected evidence — built declaratively
agentfootprint gives you connected evidence — grounded, auditable, LLM-readable. Every iteration of the ReAct loop, every tool call with its args + result, every context injection, every decision branch is captured as a typed event in a single pass over the run — no instrumentation, no post-processing.
The agent + tool registration is built declaratively. The framework owns the loop, so it can record everything that happens inside it:

    const agent = Agent.create({
      provider: provider ?? exampleProvider('feature', { respond: weatherRespond }),
      model: 'mock',
      maxIterations: 5,
      // reactMode: 'dynamic-grouped' wraps the LLM turn in an sf-llm-call subflow,
      // so Lens renders the agent's reasoning as an LLM group with its context
      // slots (system-prompt / messages / tools) nested inside (the SAME shape
      // the LLMCall primitive shows) instead of a bare "Final · RUNNER" card.
      reactMode: 'dynamic-grouped',
    })
      .system('You answer weather questions using the `weather` tool.')
      .tool({
        schema: {
          name: 'weather',
          description: 'Get current weather for a city.',
          inputSchema: {
            type: 'object',
            properties: { city: { type: 'string' } },
            required: ['city'],
          },
        },
        execute: async (args) => `${(args as { city: string }).city}: sunny, 72°F`,
      })
      .build();

Observability is just attaching listeners. There is no SDK, no agent.observe wrapper, and no separate tracing pipeline:

    agent.on('agentfootprint.stream.tool_start', (e) =>
      console.log(`→ tool ${e.payload.toolName}(${JSON.stringify(e.payload.args)})`),
    );
    agent.on('agentfootprint.stream.tool_end', (e) =>
      console.log(`← tool result: ${e.payload.result}`),
    );

Every event is typed. Every payload is what you'd want: toolName + args on tool_start, result on tool_end, iterationCount + token totals on turn_end. No string-parsing logs. No "I think this means the tool was called."
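Because payloads are typed, downstream code can work with fields rather than log strings. The sketch below is self-contained and does not import agentfootprint: the event names and the `toolName`, `args`, and `result` payload fields come from the listeners above, while the `type` discriminator, the `StreamEvent` union, and the `collectToolCalls` helper are assumptions made for illustration. It also assumes each tool_end follows its own tool_start.

```typescript
// Assumed event envelope: a `type` discriminator plus the payload fields named above.
type ToolStart = {
  type: 'agentfootprint.stream.tool_start';
  payload: { toolName: string; args: unknown };
};
type ToolEnd = {
  type: 'agentfootprint.stream.tool_end';
  payload: { result: string };
};
type StreamEvent = ToolStart | ToolEnd;

interface ToolCallRecord {
  toolName: string;
  args: unknown;
  result?: string; // stays undefined if the run ended before tool_end arrived
}

// Pair each tool_start with the tool_end that follows it (sequential calls assumed).
function collectToolCalls(events: StreamEvent[]): ToolCallRecord[] {
  const calls: ToolCallRecord[] = [];
  let pending: ToolCallRecord | undefined;
  for (const e of events) {
    if (e.type === 'agentfootprint.stream.tool_start') {
      pending = { toolName: e.payload.toolName, args: e.payload.args };
      calls.push(pending);
    } else if (pending) {
      pending.result = e.payload.result;
      pending = undefined;
    }
  }
  return calls;
}

const captured: StreamEvent[] = [
  { type: 'agentfootprint.stream.tool_start', payload: { toolName: 'weather', args: { city: 'Paris' } } },
  { type: 'agentfootprint.stream.tool_end', payload: { result: 'Paris: sunny, 72°F' } },
];

console.log(collectToolCalls(captured));
```

In a real agent the `captured` array would be filled from inside the `agent.on(...)` callbacks instead of being written by hand.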
Three workflows from one trace
The same trace serves three workflows:
| Mode | What you do | What the trace gives you |
|---|---|---|
| Live | Debug as you build | Exactly which injection produced which token; which predicate fired this iteration; which prefix actually got cached |
| Offline | Monitor what shipped | Replay any past run from its trace. Alert on drift. Attribute cost per injection. |
| Detailed | Improve via export | Every successful trajectory is labeled training data for fine-tuning or reinforcement learning — no separate data-collection phase |
And a fourth, novel one: the agent can read its own trace. Six months after the agent rejected loan #42, the question "why did you reject it?" is answered from the recorded evidence (creditScore=580, threshold=600), not from a rerun. Causal memory turns the trace into the agent's working memory.
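As a rough illustration of answering from recorded evidence instead of rerunning the model, the sketch below looks up a stored decision and formats its inputs. The `DecisionRecord` shape and the `explainDecision` helper are hypothetical; only the loan example and its two values come from the text above.

```typescript
// Hypothetical record of one decision and the evidence it rested on.
interface DecisionRecord {
  subject: string;
  outcome: 'approved' | 'rejected';
  evidence: Record<string, number>;
}

// Answer "why?" from what was recorded at decision time; nothing is re-inferred.
function explainDecision(records: DecisionRecord[], subject: string): string {
  const record = records.find((r) => r.subject === subject);
  if (!record) return `No recorded decision for ${subject}.`;
  const facts = Object.entries(record.evidence)
    .map(([key, value]) => `${key}=${value}`)
    .join(', ');
  return `${subject} was ${record.outcome} based on recorded evidence: ${facts}`;
}

const recorded: DecisionRecord[] = [
  { subject: 'loan #42', outcome: 'rejected', evidence: { creditScore: 580, threshold: 600 } },
];

console.log(explainDecision(recorded, 'loan #42'));
```

The point of the design is that the explanation stays stable over time: a rerun could see different data or a different model, while the record cannot drift.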
Dynamic ReAct — context recomposes every iteration
This is the corollary that makes "context engineering" worth the name. Other frameworks assemble the prompt once per turn; agentfootprint re-runs every Injection trigger on every iteration. Tool schemas, system prompts, skill bodies, and memory recall are all recomputed against the freshest state.
| Iteration | Classic ReAct | Dynamic ReAct (agentfootprint) |
|---|---|---|
| 1 | 12 tools shown | 1 tool (read_skill) |
| 2 | 12 tools shown | 5 tools (skill activated) |
| 3 | 12 tools shown | 5 tools |
Per-iteration recomposition is also the structural prerequisite for the cache layer — cache markers can't track active injections in lockstep without it.
The deeper point — your harness is your application. Because context recomposes every iteration, you declare rules (an instruction's activeWhen, a skill, a steering doc), not a state machine. Which tool is exposed, which reminder fires, which skill is active — the behavior emerges from the runtime re-evaluating those rules against the freshest state each loop. The harness's context-loop is your application logic — no hand-rolled coordination — and the same loop emits the trace you later interrogate. That's the whole stack in one line: context engineering (what the model sees) · the harness (the runtime that recomposes it every iteration) · explainability (the trace it hands back).
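The rules-not-state-machine idea can be sketched in a few lines: each tool carries a predicate, and the exposed tool list is recomputed from the current state on every iteration. This is a simplified, self-contained illustration. The `AgentState` and `ToolRule` types, the `composeTools` helper, and every tool name other than `read_skill` are assumptions, and the `activeWhen` field only borrows its name from the text above.

```typescript
interface AgentState {
  iteration: number;
  activeSkills: string[];
}

// A declarative rule: the tool is exposed whenever its predicate holds.
interface ToolRule {
  toolName: string;
  activeWhen: (state: AgentState) => boolean;
}

const hasRefundSkill = (state: AgentState): boolean => state.activeSkills.includes('refunds');

const toolRules: ToolRule[] = [
  { toolName: 'read_skill', activeWhen: () => true },
  // Hypothetical skill-gated tools, hidden until the skill is activated.
  { toolName: 'lookup_order', activeWhen: hasRefundSkill },
  { toolName: 'check_policy', activeWhen: hasRefundSkill },
  { toolName: 'issue_refund', activeWhen: hasRefundSkill },
  { toolName: 'escalate', activeWhen: hasRefundSkill },
];

// Re-evaluated on every iteration; no transition table is written by hand.
function composeTools(rules: ToolRule[], state: AgentState): string[] {
  return rules.filter((rule) => rule.activeWhen(state)).map((rule) => rule.toolName);
}

// Simulated run: the skill is activated during iteration 1, so it applies from iteration 2.
const states: AgentState[] = [
  { iteration: 1, activeSkills: [] },
  { iteration: 2, activeSkills: ['refunds'] },
  { iteration: 3, activeSkills: ['refunds'] },
];

const exposedPerIteration = states.map((state) => composeTools(toolRules, state));
console.log(exposedPerIteration.map((tools) => tools.length));
```

The simulated run reproduces the shape of the table above: one tool on the first iteration, five once the skill is active.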
📖 See the Dynamic ReAct guide for the full taxonomy of what this unlocks (tool-by-tool steering, adaptive tool exposure, cost guardrails, iterative format refinement, failure adaptation).
$0 test runs — scripted ReAct
Tool-using agents are notoriously hard to test because the LLM's behavior is non-deterministic. mock({ replies }) solves that — script the LLM's decisions turn-by-turn, run the agent against the script, get a deterministic, free, instant test:

    mock({
      replies: [
        {
          toolCalls: [
            {
              id: 'call-1',
              name: 'lookup',
              args: { topic: 'refunds' } as Record<string, unknown>,
            },
          ],
        },
        { content: 'Refunds take 3 business days.' },
      ],
    });

Iteration 1 calls the tool with the scripted args; iteration 2 returns the final answer. The tool actually executes; the agent loop is real. Only the LLM is mocked. Swap mock(...) for anthropic(...) to ship; the rest of the agent is identical.
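To show why a scripted provider makes the run deterministic, here is a minimal self-contained sketch of the idea. It does not use agentfootprint's `mock`; the `scripted` and `runScriptedLoop` helpers, the `Reply` type, and the synchronous tool signature are simplifying assumptions chosen to keep the example short.

```typescript
type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type Reply = { toolCalls: ToolCall[] } | { content: string };
type Tool = (args: Record<string, unknown>) => string;

// Stand-in for a scripted provider: each call returns the next reply in the script.
function scripted(replies: Reply[]): () => Reply {
  let turn = 0;
  return () => {
    if (turn >= replies.length) throw new Error('script exhausted');
    return replies[turn++];
  };
}

// A bare ReAct-style loop: run requested tools, stop at the first final answer.
function runScriptedLoop(
  next: () => Reply,
  tools: Map<string, Tool>,
  maxIterations = 5,
): { answer: string; toolResults: string[] } {
  const toolResults: string[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const reply = next();
    if ('content' in reply) return { answer: reply.content, toolResults };
    for (const call of reply.toolCalls) {
      const tool = tools.get(call.name);
      if (tool === undefined) throw new Error(`unknown tool: ${call.name}`);
      toolResults.push(tool(call.args)); // the tool really executes
    }
  }
  throw new Error('maxIterations reached without a final answer');
}

const tools = new Map<string, Tool>([
  ['lookup', (args) => `policy text for ${String(args['topic'])}`],
]);

const outcome = runScriptedLoop(
  scripted([
    { toolCalls: [{ id: 'call-1', name: 'lookup', args: { topic: 'refunds' } }] },
    { content: 'Refunds take 3 business days.' },
  ]),
  tools,
);

console.log(outcome);
```

Every run of this script takes the same path, so assertions on `outcome` never flake and cost nothing to execute.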
No graph DSL required
Composition is just chained .create() builders, not a separate graph language. Two agents in series? Sequence. Multiple critics with merge? Parallel. Iterate until quality bar? Loop. The compositions are how multi-agent systems are built — there's no MultiAgentSystem class, no Orchestrator to learn:

    const runner = reflection({
      provider: provider ?? exampleProvider('pattern', { respond: () => replies[i++ % replies.length]! }),
      model: 'mock',
      proposerPrompt: 'Write or revise a short poem about night.',
      criticPrompt: 'Critique the poem. When it is good enough include the marker DONE.',
      maxIterations: 5,
    });

That's Reflexion (Shinn 2023), expressed as Sequence(Agent, critique-LLM, Agent). Tree-of-Thoughts is Parallel(Agent × N) + LLM-rank. Debate is Loop(Agent × 2 + judge). Every named pattern in the agent literature is a composition of the same 2 primitives + 4 compositions. You learn the substrate; the field's growth lands as recipes on top.
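The claim that named patterns reduce to a few compositions can be illustrated with plain functions. The sketch below is not agentfootprint's implementation: `Step`, `sequence`, `parallel`, and `loop` are hypothetical stand-ins, steps are synchronous string transforms rather than real agents, and the proposer and critic are stubs.

```typescript
// A step is any unit that turns an input into an output (an agent, an LLM call, a tool).
type Step = (input: string) => string;

// Run steps one after another, feeding each output into the next.
const sequence = (...steps: Step[]): Step =>
  (input) => steps.reduce((acc, step) => step(acc), input);

// Run every step on the same input, then merge the outputs.
const parallel = (steps: Step[], merge: (outputs: string[]) => string): Step =>
  (input) => merge(steps.map((step) => step(input)));

// Repeat the body until `done` accepts the output or the iteration budget runs out.
const loop = (body: Step, done: (output: string) => boolean, maxIterations: number): Step =>
  (input) => {
    let current = input;
    for (let i = 0; i < maxIterations; i++) {
      current = body(current);
      if (done(current)) break;
    }
    return current;
  };

// Stub proposer: appends a revision marker. Stub critic: approves after two revisions.
const proposer: Step = (draft) => `${draft}+rev`;
const critic: Step = (draft) =>
  draft.split('+rev').length - 1 >= 2 ? `${draft} DONE` : draft;

// A reflection-style recipe: propose, critique, repeat until the critic emits DONE.
const reflect = loop(sequence(proposer, critic), (out) => out.includes('DONE'), 5);
const finalDraft = reflect('poem');

// A fan-out recipe: two variants of the same input, merged into one string.
const fanOut = parallel([(s) => `${s}a`, (s) => `${s}b`], (outputs) => outputs.join('|'));
const merged = fanOut('x');

console.log(finalDraft, merged);
```

Because each composition returns another `Step`, recipes nest freely, which is what lets new patterns be written as combinations instead of new classes.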
Built on footprintjs
agentfootprint is the agent layer. footprintjs is the substrate — the flowchart-pattern execution engine that makes our typed-event stream + replayable traces automatic. footprintjs gives us composition primitives, state-machine semantics, durable workflow checkpoints, and 57+ typed events out of the box; we used the budget those abstractions would have cost us to invest deeply in the injection loop — the layer every other framework leaves to the developer.
You don't need to learn footprintjs to use agentfootprint — but if you want to build your own primitives at this depth, start there.
Next steps
- Quick Start — build your first agent
- Key Concepts — the 5-layer taxonomy
- Reliability gate — rules-based retry / fallback / fail-fast around every LLM call
- Causal memory deep dive — how the trace becomes the agent's working memory
