Why agentfootprint?

Your support agent told a customer their refund was processed. Six weeks later they come back and ask “why did you tell me my refund was processed when it wasn’t?” You go to look. The agent is gone. Logs are scattered across three services. The decision evidence is not there. This is the gap agentfootprint exists to close.

The new class of bug

For fifty years, software bugs have been logic errors. A wrong condition, a missed edge case, an off-by-one. You step through the code with a debugger until you find the bad branch.

LLM-powered apps add a second class: contextual errors. The code is correct. The model is correct. The answer is wrong because the LLM’s decision rests on context that was ambiguous, missing, or invalidated at the moment of inference.

Tracking which content the model actually saw, and why, is the entire debugging job. Without it, the failure mode is invisible:

What got injected wrong	What the model did
Wrong instruction landed in the `system` slot	Followed the wrong rule
Predicate fired one iteration too early	Reasoned with stale assumptions
Skill body missing when LLM called `read_skill`	Invented its own
Cache prefix invalidated mid-iteration	Saw a silently rewritten stale version
Tool returned but `on-tool-return` injection didn’t fire	Couldn’t interpret the result

A framework that owns the control flow can debug logic errors. A framework that owns the injection can debug contextual errors — because every injection is a typed event with a where, when, why, and how-it-cached.

Every LLM call has 3 fixed slots (`system`, `messages`, `tools`), and every flavor of injected content (Skill bodies, RAG passages, Steering rules, tool schemas) lands in one slot under one of 4 fixed triggers. That 3 × 4 grid is the entire context-engineering surface.
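As a rough illustration of that grid, the sketch below enumerates every slot and trigger pairing. The slot and trigger names are taken from this page; the type names `Slot`, `Trigger`, and `GridCell` and the helper `injectionGrid` are hypothetical and are not agentfootprint exports.

```typescript
// Hypothetical model of the slot x trigger grid. The string values come from
// this page; the type and function names are illustrative assumptions only.
type Slot = 'system' | 'messages' | 'tools';
type Trigger = 'always' | 'rule' | 'on-tool-return' | 'llm-activated';

interface GridCell {
  slot: Slot;
  trigger: Trigger;
}

const SLOTS: readonly Slot[] = ['system', 'messages', 'tools'];
const TRIGGERS: readonly Trigger[] = ['always', 'rule', 'on-tool-return', 'llm-activated'];

// Enumerate every (slot, trigger) pair an injection could occupy.
function injectionGrid(): GridCell[] {
  const cells: GridCell[] = [];
  for (const slot of SLOTS) {
    for (const trigger of TRIGGERS) {
      cells.push({ slot, trigger });
    }
  }
  return cells;
}

const grid = injectionGrid();
console.log(grid.length); // 3 slots x 4 triggers = 12 cells
```

Because both axes are closed unions, a debugging tool can exhaustively check which cell a given injection occupied.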

The four backtrack questions

Every agentfootprint LLM call backtracks to four typed answers, and they are the answers the wrong-answer customer is asking for, six weeks late:

Question	What the trace tells you
What was injected?	Every flavor of content the LLM saw on this iteration (Skill bodies, RAG passages, Steering rules, tool schemas).
Who triggered it?	Which rule fired (`always` / `rule` predicate / `on-tool-return` / `llm-activated` `read_skill`).
When did it fire?	Which iteration of the ReAct loop, after which event (which tool returned, which skill activated).
How did it land?	Which slot (`system` / `messages` / `tools`), what position, what cache strategy, and whether it was actually applied or skipped.

These aren’t logged alongside the call. They’re the structural output of the call — produced by the framework owning the runtime loop.
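One way to picture the four answers is as a single typed record per injection. The sketch below is a hypothetical shape: the field names (`what`, `who`, `when`, `how`), the sample values, and the `backtrack` helper are assumptions chosen to mirror the four questions, not agentfootprint's actual event schema.

```typescript
// Hypothetical record shape mirroring the four backtrack questions.
// Field names and sample values are assumptions, not the library's schema.
interface InjectionRecord {
  what: { flavor: string; contentId: string };
  who: { trigger: 'always' | 'rule' | 'on-tool-return' | 'llm-activated'; ruleId: string };
  when: { iteration: number; afterEvent: string };
  how: { slot: 'system' | 'messages' | 'tools'; position: number; cache: string; applied: boolean };
}

// Turn one record into the four human-readable answers.
function backtrack(r: InjectionRecord): string[] {
  return [
    `What: ${r.what.flavor} "${r.what.contentId}"`,
    `Who: trigger ${r.who.trigger} (${r.who.ruleId})`,
    `When: iteration ${r.when.iteration}, after ${r.when.afterEvent}`,
    `How: slot ${r.how.slot}, position ${r.how.position}, cache ${r.how.cache}, ` +
      (r.how.applied ? 'applied' : 'skipped'),
  ];
}

// Illustrative record for a steering rule injected after a tool returned.
const sampleRecord: InjectionRecord = {
  what: { flavor: 'Steering rule', contentId: 'refund-policy' },
  who: { trigger: 'rule', ruleId: 'refund-mentioned' },
  when: { iteration: 2, afterEvent: 'tool_end:lookup' },
  how: { slot: 'system', position: 0, cache: 'prefix', applied: true },
};

const answers = backtrack(sampleRecord);
console.log(answers.join('\n'));
```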

Connected evidence — built declaratively

agentfootprint gives you connected evidence: grounded, auditable, LLM-readable. Every iteration of the ReAct loop, every tool call with its args and result, every context injection, and every decision branch is captured as a typed event during a single depth-first (DFS) traversal of the run, with no instrumentation and no post-processing.

The agent + tool registration is built declaratively. The framework owns the loop, so it can record everything that happens inside it:

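// `provider`, `exampleProvider`, and `weatherRespond` are defined by the surrounding example (not shown).
const agent = Agent.create({
  // Fall back to the example provider when no provider is supplied.
  provider: provider ?? exampleProvider('feature', { respond: weatherRespond }),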
  model: 'mock',
  maxIterations: 5,
})
  .system('You answer weather questions using the `weather` tool.')
  .tool({
    schema: {
      name: 'weather',
      description: 'Get current weather for a city.',
      inputSchema: {
        type: 'object',
        properties: { city: { type: 'string' } },
        required: ['city'],
      },
    },
    execute: async (args) => `${(args as { city: string }).city}: sunny, 72°F`,
  })
  .build();

Observability is just attaching listeners — no SDK, no agent.observe wrapper, no separate tracing pipeline:

agent.on('agentfootprint.stream.tool_start', (e) =>
  console.log(`→ tool ${e.payload.toolName}(${JSON.stringify(e.payload.args)})`),
);
agent.on('agentfootprint.stream.tool_end', (e) =>
  console.log(`← tool result: ${e.payload.result}`),
);

Every event is typed. Every payload carries the fields you need: toolName and args on tool_start, result on tool_end, iterationCount and token totals on turn_end. No string-parsing logs. No “I think this means the tool was called.”
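To show what typed payloads buy you, here is a minimal, self-contained sketch of a typed event emitter. It is not agentfootprint's implementation: the two stream event names and their payload fields come from the listeners above, while the `turn_end` key, the `totalTokens` field, and the `TypedEmitter` class are assumptions made for illustration.

```typescript
// Hypothetical event map. The tool_start / tool_end names and fields appear on
// this page; the 'turn_end' key and `totalTokens` field are assumed.
interface EventMap {
  'agentfootprint.stream.tool_start': { toolName: string; args: Record<string, unknown> };
  'agentfootprint.stream.tool_end': { result: string };
  'turn_end': { iterationCount: number; totalTokens: number };
}

type Listener<K extends keyof EventMap> = (e: { type: K; payload: EventMap[K] }) => void;
type AnyListener = (e: unknown) => void;

// Minimal emitter: the event name determines the payload type at compile time.
class TypedEmitter {
  private listeners = new Map<string, AnyListener[]>();

  on<K extends keyof EventMap>(type: K, listener: Listener<K>): void {
    const list = this.listeners.get(type) ?? [];
    list.push(listener as AnyListener);
    this.listeners.set(type, list);
  }

  emit<K extends keyof EventMap>(type: K, payload: EventMap[K]): void {
    for (const listener of this.listeners.get(type) ?? []) {
      listener({ type, payload });
    }
  }
}

const emitter = new TypedEmitter();
const eventLog: string[] = [];

emitter.on('agentfootprint.stream.tool_start', (e) => {
  // e.payload.toolName and e.payload.args are known to exist; no parsing needed.
  eventLog.push(`tool ${e.payload.toolName}(${JSON.stringify(e.payload.args)})`);
});
emitter.on('turn_end', (e) => {
  eventLog.push(`done after ${e.payload.iterationCount} iterations`);
});

emitter.emit('agentfootprint.stream.tool_start', { toolName: 'weather', args: { city: 'Paris' } });
emitter.emit('turn_end', { iterationCount: 2, totalTokens: 128 });
```

Misspelling a payload field or reading `result` on a `tool_start` event becomes a compile error rather than a silent `undefined` at runtime.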

Three workflows from one trace

The same trace serves three workflows:

Mode	What you do	What the trace gives you
Live	Debug as you build	Exactly which injection produced which token; which predicate fired this iteration; which prefix actually got cached
Offline	Monitor what shipped	Replay any past run from its trace. Alert on drift. Attribute cost per injection.
Detailed	Improve via export	Every successful trajectory is labeled training data for SFT, DPO, or process-RL — no separate data-collection phase

And a fourth, novel one: the agent can read its own trace. Six months after the agent rejected loan #42, “why did you reject it?” is answered from the recorded evidence (creditScore=580, threshold=600), not from a rerun. Causal memory turns the trace into the agent’s working memory.

One agent run produces a JSON-portable causal trace. Three downstream consumers fan out: audit replay, cheap-model triage, training data export.
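The loan example can be sketched as a lookup over a stored trace. Everything about the trace shape here is hypothetical (`TraceEvent`, `DecisionEvent`, the `credit_lookup` tool name, and the `explain` helper); only the values creditScore=580 and threshold=600 come from the text above.

```typescript
// Hypothetical trace shape; only creditScore=580 and threshold=600 come from
// the page. Event kinds, field names, and the tool name are assumptions.
interface ToolEvent {
  kind: 'tool';
  name: string;
  result: string;
}
interface DecisionEvent {
  kind: 'decision';
  subject: string;
  outcome: 'approved' | 'rejected';
  evidence: Record<string, number>;
}
type TraceEvent = ToolEvent | DecisionEvent;

const recordedTrace: TraceEvent[] = [
  { kind: 'tool', name: 'credit_lookup', result: '580' },
  {
    kind: 'decision',
    subject: 'loan-42',
    outcome: 'rejected',
    evidence: { creditScore: 580, threshold: 600 },
  },
];

// The trace is plain data, so it survives a JSON round-trip unchanged.
const storedTrace = JSON.stringify(recordedTrace);

// Answer "why?" from the stored evidence instead of rerunning the agent.
function explain(serialized: string, subject: string): string {
  const events = JSON.parse(serialized) as TraceEvent[];
  for (const event of events) {
    if (event.kind === 'decision' && event.subject === subject) {
      const facts = Object.entries(event.evidence)
        .map(([key, value]) => `${key}=${value}`)
        .join(', ');
      return `${subject} was ${event.outcome} because ${facts}`;
    }
  }
  return `no recorded decision for ${subject}`;
}

const explanation = explain(storedTrace, 'loan-42');
console.log(explanation); // loan-42 was rejected because creditScore=580, threshold=600
```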

Dynamic ReAct — context recomposes every iteration

The corollary that makes “context engineering” worth the name. Other frameworks assemble the prompt once per turn; agentfootprint re-runs every Injection trigger every iteration. Tool schemas, system prompts, skill bodies, memory recall — all recompute against the freshest state.

Classic ReAct loops back to CallLLM (slots frozen). Dynamic ReAct loops back to SystemPrompt (slots recompose every iteration).

Iteration	Classic ReAct	Dynamic ReAct (agentfootprint)
1	12 tools shown	1 tool (`read_skill`)
2	12 tools shown	5 tools (skill activated)
3	12 tools shown	5 tools

Per-iteration recomposition is also the structural prerequisite for the cache layer — cache markers can’t track active injections in lockstep without it.
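The table above can be reproduced with a small sketch in which the tool list is recomputed from loop state on every iteration. This is an assumption-laden illustration, not the library's API: `LoopState`, `ToolInjection`, `composeTools`, the `billing` skill, and the four skill tool names are all hypothetical.

```typescript
// Hypothetical sketch of per-iteration recomposition. All names except
// `read_skill` are assumptions made for illustration.
interface LoopState {
  activeSkill: string | null;
}

interface ToolInjection {
  name: string;
  when: (state: LoopState) => boolean; // predicate re-evaluated every iteration
}

const toolInjections: ToolInjection[] = [
  { name: 'read_skill', when: () => true },
  { name: 'lookup_order', when: (s) => s.activeSkill === 'billing' },
  { name: 'issue_refund', when: (s) => s.activeSkill === 'billing' },
  { name: 'check_payment', when: (s) => s.activeSkill === 'billing' },
  { name: 'send_receipt', when: (s) => s.activeSkill === 'billing' },
];

// Recompute the visible tools from the freshest state.
function composeTools(state: LoopState): string[] {
  return toolInjections.filter((t) => t.when(state)).map((t) => t.name);
}

const loopState: LoopState = { activeSkill: null };
const toolCounts: number[] = [];

for (let iteration = 1; iteration <= 3; iteration++) {
  toolCounts.push(composeTools(loopState).length);
  // Assume the model calls read_skill('billing') during iteration 1.
  if (iteration === 1) loopState.activeSkill = 'billing';
}

console.log(toolCounts); // [1, 5, 5]
```

A loop that froze its slots after the first composition would keep showing the same tool list regardless of which skill was activated.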

📖 See the Dynamic ReAct guide for the full taxonomy of what this unlocks (tool-by-tool steering, adaptive tool exposure, cost guardrails, iterative format refinement, failure adaptation).

$0 test runs — scripted ReAct

Tool-using agents are notoriously hard to test because the LLM’s behavior is non-deterministic. mock({ replies }) solves that — script the LLM’s decisions turn-by-turn, run the agent against the script, get a deterministic, free, instant test:

mock({
  replies: [
    {
      toolCalls: [
        {
          id: 'call-1',
          name: 'lookup',
          args: { topic: 'refunds' } as Record<string, unknown>,
        },
      ],
    },
    { content: 'Refunds take 3 business days.' },
  ],
});

Iteration 1 calls the tool with the scripted args; iteration 2 returns the final answer. The tool actually executes; the agent loop is real. Only the LLM is mocked. Swap mock(...) for anthropic(...) to ship — the rest of the agent is identical.
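To make the mechanics concrete, the following self-contained sketch shows how a scripted reply list can drive a real tool loop. It is a simplified, synchronous stand-in, not agentfootprint's `mock` or its agent loop: `scripted`, `runLoop`, the `Reply` shape, and the `lookup` tool body are assumptions for illustration.

```typescript
// Hypothetical, simplified stand-in for a scripted provider plus agent loop.
// Synchronous for brevity; a real provider and real tools would be async.
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}
interface Reply {
  content?: string;
  toolCalls?: ToolCall[];
}
type Tool = (args: Record<string, unknown>) => string;

// Returns the scripted replies one per call, in order.
function scripted(replies: Reply[]): () => Reply {
  let turn = 0;
  return () => {
    const reply = replies[turn++];
    if (reply === undefined) throw new Error('script exhausted');
    return reply;
  };
}

// Runs tools for real; only the "LLM" decision comes from the script.
function runLoop(
  nextReply: () => Reply,
  tools: Record<string, Tool>,
  maxIterations: number,
): { answer: string; toolResults: string[] } {
  const toolResults: string[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const reply = nextReply();
    if (reply.toolCalls && reply.toolCalls.length > 0) {
      for (const call of reply.toolCalls) {
        const tool = tools[call.name];
        if (!tool) throw new Error(`unknown tool: ${call.name}`);
        toolResults.push(tool(call.args));
      }
      continue; // loop back for the next scripted decision
    }
    return { answer: reply.content ?? '', toolResults };
  }
  throw new Error('maxIterations reached without a final answer');
}

const run = runLoop(
  scripted([
    { toolCalls: [{ id: 'call-1', name: 'lookup', args: { topic: 'refunds' } }] },
    { content: 'Refunds take 3 business days.' },
  ]),
  { lookup: (args) => `policy for ${String(args['topic'])}` },
  5,
);

console.log(run.answer); // Refunds take 3 business days.
```

The same input always produces the same tool calls and the same answer, which is what makes the run usable as a unit test.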

No graph DSL required

Composition is just chained .create() builders, not a separate graph language. Two agents in series? Sequence. Multiple critics with merge? Parallel. Iterate until quality bar? Loop. The compositions are how multi-agent systems are built — there’s no MultiAgentSystem class, no Orchestrator to learn:

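// `provider`, `exampleProvider`, `replies`, and the counter `i` are defined by the surrounding example (not shown).
// The fallback provider cycles through `replies`, returning the next one on each call.
const runner = reflection({
  provider: provider ?? exampleProvider('pattern', { respond: () => replies[i++ % replies.length]! }),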
  model: 'mock',
  proposerPrompt: 'Write or revise a short poem about night.',
  criticPrompt:
    'Critique the poem. When it is good enough include the marker DONE.',
  maxIterations: 5,
});

That’s Reflexion (Shinn et al., 2023), expressed as Sequence(Agent, critique-LLM, Agent). Tree-of-Thoughts is Parallel(Agent × N) + LLM-rank. Debate is Loop(Agent × 2 + judge). Every named pattern in the agent literature is a composition of the same 2 primitives + 4 compositions. You learn the substrate; the field’s growth lands as recipes on top.
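The idea that named patterns reduce to a few combinators can be sketched with plain functions. This is a conceptual illustration under simplifying assumptions, not the library's Sequence, Parallel, or Loop: steps are synchronous string-to-string functions, only three of the compositions are shown, and `sequence`, `parallel`, `loop`, `proposer`, and `critic` are hypothetical names.

```typescript
// Conceptual sketch only: a "step" stands in for an agent or LLM call.
// Synchronous string functions are a simplifying assumption.
type Step = (input: string) => string;

// Run steps one after another, feeding each output into the next step.
const sequence = (...steps: Step[]): Step => (input) =>
  steps.reduce((acc, step) => step(acc), input);

// Run every step on the same input, then merge the outputs.
const parallel = (steps: Step[], merge: (outputs: string[]) => string): Step => (input) =>
  merge(steps.map((step) => step(input)));

// Repeat the body until `done` accepts the output or the budget runs out.
const loop = (body: Step, done: (output: string) => boolean, maxIterations: number): Step =>
  (input) => {
    let current = input;
    for (let i = 0; i < maxIterations; i++) {
      current = body(current);
      if (done(current)) break;
    }
    return current;
  };

// A toy proposer/critic pair: the critic adds DONE once the draft is long enough.
const proposer: Step = (draft) => `${draft}!`;
const critic: Step = (draft) => (draft.length >= 6 ? `${draft} DONE` : draft);

const reflect = loop(sequence(proposer, critic), (out) => out.includes('DONE'), 5);
const reflected = reflect('poem');
console.log(reflected); // poem!! DONE

// A toy fan-out: two variants of the same draft, merged by picking the longest.
const pickLongest = (outputs: string[]): string =>
  outputs.reduce((best, next) => (next.length > best.length ? next : best), '');
const fanOut = parallel([proposer, sequence(proposer, proposer)], pickLongest);
const ranked = fanOut('poem');
```

Here the reflection loop is just `loop` wrapped around `sequence`, which is the sense in which a pattern becomes a recipe rather than a new class.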

Built on footprintjs

agentfootprint is the agent layer. footprintjs is the substrate: the flowchart-pattern execution engine that makes our typed-event stream and replayable traces automatic. Because footprintjs supplies composition primitives, state-machine semantics, durable workflow checkpoints, and 57+ typed events out of the box, we spent the effort those abstractions would otherwise have cost on the injection loop, the layer every other framework leaves to the developer.

You don’t need to learn footprintjs to use agentfootprint — but if you want to build your own primitives at this depth, start there.

Next steps

Quick Start — build your first agent
Key Concepts — the 5-layer taxonomy
Reliability gate — rules-based retry / fallback / fail-fast around every LLM call
Causal memory deep dive — how the trace becomes the agent’s working memory