Git bisect for context — point the debugging harness at a wrong answer and it walks the run back to the input that caused it, then proves it by removing it.

Beta

The context-bug localizer is a beta feature. It works and ships with a tested example, but the API may still change before GA.

Your agent gave the wrong answer. The trace shows you what happened — every injection, every tool call, every token. It doesn't show you which one of them caused it. That's a different question, and the one you actually need answered at 2 AM.

What observability can't tell you

Observability answers what happened: which fact was injected, which tool ran, what the model returned. But when the answer is wrong, you're left scrolling a trace and asking "which piece of context did this?" A trace, however detailed, only correlates. It can't tell you that removing the suspect would have fixed it.

That's a debugging problem, not a logging problem. So we built a debugging harness for it.

Debugging, not evaluation

Evaluation grades many scenarios at scale (does the agent pass?); it is the development-phase harness you build before shipping. Debugging takes the one output that failed, finds why, and then proves it. The book draws the same line: "a single scenario is useful for debugging, but real evaluation requires scale." This guide is the debugging half: it hands you the backtrace and a trigger to connect a failed case to its cause. The grader is yours: we ship the machinery, not a built-in judge. (For production monitoring, see Observability.)

Git bisect for context

localizeContextBug is git bisect for context. Point it at the step that went wrong; it walks the run backward to the inputs that fed that step, ranks them, and — if you give it a way to re-run — proves the culprit by removing it and watching the answer flip.

Five stages, all over the run you already recorded:

Trigger — the step that's wrong (the answer-producing LLM call, an explicit step, a custom rule, or the lowest-quality step).
Slice — walk the commit log backward from the trigger (data and control edges) to everything that fed it.
Weigh — score how strongly each LLM-call edge pulled on its inputs.
Rank — turn the slice into ablatable suspects (tool / injection / memory / arg), ordered by influence.
Ablate — optional, but the whole point: re-run the scenario without each top suspect. If the answer flips, that's a causal verdict.
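
The Slice stage can be pictured as a backward reachability walk over recorded edges. The following is a minimal sketch of that idea, not the library's implementation: the `SliceEdge` shape, the step ids, and `sliceBackward` are hypothetical names chosen for illustration.

```typescript
// Hypothetical edge record: `from` fed `to` through a data or control dependency.
type SliceEdge = { from: string; to: string; kind: "data" | "control" };

// Walk backward from the trigger and collect every step that fed it,
// directly or transitively. Steps that never reach the trigger are left out.
function sliceBackward(edges: SliceEdge[], trigger: string): string[] {
  const parents = new Map<string, string[]>();
  edges.forEach((e) => {
    const list = parents.get(e.to) ?? [];
    list.push(e.from);
    parents.set(e.to, list);
  });

  const seen = new Set<string>();
  const queue: string[] = [trigger];
  while (queue.length > 0) {
    const step = queue.shift() as string;
    (parents.get(step) ?? []).forEach((p) => {
      if (p !== trigger && !seen.has(p)) {
        seen.add(p);
        queue.push(p);
      }
    });
  }
  return Array.from(seen);
}

// Toy run: two inputs feed the LLM call, one of them chosen by a control edge.
const exampleEdges: SliceEdge[] = [
  { from: "inject-vip-override", to: "call-llm", kind: "data" },
  { from: "tool-lookup", to: "call-llm", kind: "data" },
  { from: "decide-route", to: "tool-lookup", kind: "control" },
  { from: "unrelated-log", to: "metrics", kind: "data" },
];

const slice = sliceBackward(exampleEdges, "call-llm");
console.log(slice);
```

In this toy graph the slice contains the injection, the tool call, and the decision that routed to the tool, while `unrelated-log` is excluded because nothing connects it to the trigger.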

Honesty first: ranking is a guess, ablation is proof

This is the line we hold that most tools blur. The ranking is a proxy — embedding geometry, not a causal claim. Only the ablation verdict is causal, because it's the only step that actually removed the thing and saw what changed. Every report says which mode it's in (correlational vs causal) and never dresses a guess up as proof.

So a report without a re-runner stops at a ranked list of suspects, honestly labeled correlational. Hand it a re-runner and the top suspects get verdicts — confirmed or not confirmed, with the flip count.
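
To make the distinction concrete, here is a small sketch of what an ablation verdict amounts to: re-run with and without the suspect, and count how often the outcome flips. Everything in it is illustrative. The `ToyRerun` signature, the `AblationVerdict` shape, and the rule that a suspect is confirmed only when every seeded rerun flips are assumptions of this sketch, not the library's API.

```typescript
// Hypothetical re-runner: runs the scenario with `removed` left out of the
// context (or nothing removed when null) and returns the final answer.
type ToyRerun = (removed: string | null, seed: number) => string;

type AblationVerdict = { suspect: string; flips: number; runs: number; causal: boolean };

function ablateSuspect(
  suspect: string,
  rerun: ToyRerun,
  seeds: number[],
  flipped: (baseline: string, ablated: string) => boolean,
): AblationVerdict {
  let flips = 0;
  for (const seed of seeds) {
    const baseline = rerun(null, seed); // nothing removed
    const ablated = rerun(suspect, seed); // suspect removed
    if (flipped(baseline, ablated)) flips += 1;
  }
  // Assumption of this sketch: confirmed only if every seeded rerun flipped.
  return { suspect, flips, runs: seeds.length, causal: flips === seeds.length };
}

// Toy scenario: the answer is APPROVED only while 'vip-override' is in context.
const toyRerun: ToyRerun = (removed) =>
  removed === "vip-override" ? "DENIED" : "APPROVED (vip)";

const outcomeFlipped = (a: string, b: string): boolean =>
  a.includes("APPROVED") !== b.includes("APPROVED");

const verdicts = ["vip-override", "style-rule"].map((s) =>
  ablateSuspect(s, toyRerun, [1, 2, 3], outcomeFlipped),
);
console.log(verdicts);
```

Removing `vip-override` flips the toy outcome in all three reruns, so it earns a causal verdict; removing `style-rule` changes nothing, so it stays unconfirmed no matter how it ranked.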

Quickstart

Point the localizer at the answer-producing call and give it a re-runner so it can confirm suspects by ablation:

const report = await localizeContextBug({
  , // { snapshot, events } from the recorded run
  : (()),
  atStep: , // the step that produced the bad answer
  : {
    // give it a re-runner → it confirms suspects by ablation
    : ,
    : ,
    : 3,
    : (, ) => .('APPROVED') !== .('APPROVED'),
  },
});

console.log(formatContextBugReport(report));

The report names the planted cause as the confirmed, causal root — and clears the decoys:

CONTEXT BUG LOCALIZATION — trigger call-llm#18 "CallLLM" (explicit)
mode: CAUSAL — ranked proxies + counterfactual ablation verdicts (verdicts are the only causal claims)
...
 3. [injection 'vip-override'] — score 0.787
    verdict: CAUSAL: ablating injection 'vip-override' flipped the outcome in 3/3 seeded reruns.
 4. [injection 'style-rule']  — score 0.746
    verdict: NOT CONFIRMED: ablating 'style-rule' did not change the outcome in 3 reruns.

ROOT CAUSE: 'vip-override' — ablating it flipped the answer in 3/3 reruns (the only CAUSAL verdict).

Notice the proxy ranking even floats some plumbing above the real culprit — and the harness doesn't care, because it doesn't trust the ranking. It trusts the verdict: the planted fact is the only thing that, when removed, changed the answer.

The full runnable, tested example is 17-localize-quickstart.ts (and the fuller 05-context-bisect.ts, with a tool suspect and a decide() control-edge walk).

Correlational vs causal — pick your cost

Ablation re-runs your scenario N times, so it costs N runs. Skip the re-runner and you get the ranked slice for free — useful triage, honestly labeled a guess. Add the re-runner when you need proof.
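
The free, correlational half can be pictured as a similarity ranking. The sketch below assumes cosine similarity between an answer embedding and each suspect's embedding; the text above says only that the proxy is "embedding geometry", so the exact measure, the `ToySuspect` shape, and the toy vectors are all assumptions made for illustration.

```typescript
// Cosine similarity between two equal-length vectors (0 when either is all zeros).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

type ToySuspect = { id: string; embedding: number[] };
type RankedSuspect = { id: string; score: number; mode: "correlational" };

// Rank suspects by similarity to the answer. Every entry is labeled
// correlational: a high score is a place to look, not a verdict.
function rankSuspects(answer: number[], suspects: ToySuspect[]): RankedSuspect[] {
  return suspects
    .map((s) => ({
      id: s.id,
      score: cosine(answer, s.embedding),
      mode: "correlational" as const,
    }))
    .sort((x, y) => y.score - x.score);
}

const ranked = rankSuspects(
  [1, 0],
  [
    { id: "style-rule", embedding: [0, 1] },
    { id: "vip-override", embedding: [1, 1] },
    { id: "system-prompt", embedding: [1, 0] },
  ],
);
console.log(ranked.map((r) => r.id));
```

With these toy vectors the plumbing (`system-prompt`) outranks `vip-override`, which is the kind of ordering that only an ablation run can correct.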

When you don't know which step is wrong

atStep is the explicit door. Two others:

Custom trigger — trigger: (artifacts) => stepId: your own rule for "which step looks wrong."
Quality-driven — attach a QualityRecorder and the localizer defaults to its lowest-scoring step.
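
One way to picture how the doors relate is a small resolver that tries each in turn. This is a sketch under stated assumptions: the precedence order (explicit step, then custom rule, then lowest score, then the last step as a stand-in for the answer-producing call) and the `ToyArtifacts` and `ToyTriggerOptions` shapes are hypothetical, not the library's documented behavior.

```typescript
type ToyStepScore = { stepId: string; score: number };
type ToyArtifacts = { steps: string[]; scores: ToyStepScore[] };

type ToyTriggerOptions = {
  atStep?: string; // explicit door
  trigger?: (artifacts: ToyArtifacts) => string; // custom rule
};

function resolveTrigger(artifacts: ToyArtifacts, opts: ToyTriggerOptions): string {
  if (opts.atStep !== undefined) return opts.atStep;
  if (opts.trigger) return opts.trigger(artifacts);
  if (artifacts.scores.length > 0) {
    // Quality-driven: start at the lowest-scoring step.
    return artifacts.scores.reduce((low, s) => (s.score < low.score ? s : low)).stepId;
  }
  if (artifacts.steps.length === 0) throw new Error("run has no steps");
  // Assumed fallback: treat the last step as the answer-producing call.
  return artifacts.steps[artifacts.steps.length - 1];
}

const toyRun: ToyArtifacts = {
  steps: ["plan#1", "tool#2", "call-llm#3"],
  scores: [
    { stepId: "plan#1", score: 0.9 },
    { stepId: "tool#2", score: 0.2 },
    { stepId: "call-llm#3", score: 0.6 },
  ],
};

const explicitStep = resolveTrigger(toyRun, { atStep: "call-llm#3" });
const customStep = resolveTrigger(toyRun, { trigger: (a) => a.steps[0] });
const qualityStep = resolveTrigger(toyRun, {});
const fallbackStep = resolveTrigger({ steps: toyRun.steps, scores: [] }, {});
console.log(explicitStep, customStep, qualityStep, fallbackStep);
```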

Bring your own judge

The quality-driven trigger is where you plug in a score. Attach a QualityRecorder with your scoring function — an LLM-as-judge, a schema check, a regex assertion — and the localizer starts the bisect at the lowest-scoring step. Two honest notes:

The score is a guess, not a verdict. The lowest-scoring step is where to start looking — correlational. Only the ablation that follows is causal. We ship the scoring machinery, not the judge: there is no built-in metric.
The scoring function is synchronous. It can't await an LLM judge inline — pre-score into a map (or judge after the run and feed a custom lookup), then return the number.
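
The pre-scoring pattern from the second note might look like the sketch below: judge asynchronously up front, then hand the recorder a synchronous function that only looks numbers up. `AsyncJudge`, `preScore`, `makeSyncScorer`, and the neutral fallback of 0.5 for unscored steps are hypothetical names and choices for illustration; how the resulting function is attached to a QualityRecorder is not shown here.

```typescript
// Hypothetical async judge; an LLM-as-judge call would live behind this.
type AsyncJudge = (stepId: string, output: string) => Promise<number>;

// Phase 1: judge every step output ahead of time, outside the sync scoring path.
async function preScore(
  outputs: Record<string, string>,
  judge: AsyncJudge,
): Promise<Map<string, number>> {
  const scores = new Map<string, number>();
  for (const stepId of Object.keys(outputs)) {
    scores.set(stepId, await judge(stepId, outputs[stepId]));
  }
  return scores;
}

// Phase 2: a synchronous scoring function that never awaits anything.
// Steps without a pre-computed score get the fallback (an assumption of this sketch).
function makeSyncScorer(
  scores: Map<string, number>,
  fallback = 0.5,
): (stepId: string) => number {
  return (stepId) => scores.get(stepId) ?? fallback;
}

// Usage with a hand-built map; in practice the map would come from preScore().
const demoScores = new Map<string, number>([
  ["tool#2", 0.8],
  ["call-llm#3", 0.1],
]);
const scoreStep = makeSyncScorer(demoScores);
console.log(scoreStep("call-llm#3"), scoreStep("never-judged"));
```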

The standalone quality stack trace — qualityTrace / formatQualityTrace, which renders the score-annotated backtrace — is a footprintjs primitive. See Backward causal chain.

Other debugging doors

Same engine, different entry points (all beta):

finders — traceSteps (dependency-guided backward search), removeAndRetry (leave-one-out ablation), shrinkToCause.
walkToRoot — walk a symptom backward across loop iterations to the true root cause (see Read the run loop-by-loop below).
missing context — supply what was available vs what was sent; restoring a dropped unit that flips the answer is the mirror of ablation (also causal).
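
The missing-context door mirrors ablation: instead of removing a suspect, restore a unit that was available but never sent, and check whether the answer flips. The sketch below illustrates that idea only; `ContextRerun`, `findMissingCauses`, and the toy scenario are hypothetical and do not reflect the library's signatures.

```typescript
// Hypothetical re-runner: runs the scenario with exactly this context.
type ContextRerun = (context: string[]) => string;

// Restore each dropped unit on its own and keep the ones whose
// restoration flips the outcome relative to the original (sent) context.
function findMissingCauses(
  available: string[],
  sent: string[],
  rerun: ContextRerun,
  flipped: (before: string, after: string) => boolean,
): string[] {
  const dropped = available.filter((unit) => sent.indexOf(unit) === -1);
  const baseline = rerun(sent);
  return dropped.filter((unit) => flipped(baseline, rerun([...sent, unit])));
}

// Toy scenario: the refund is wrongly APPROVED unless the refund window is in context.
const toyContextRerun: ContextRerun = (context) =>
  context.indexOf("refund-window") !== -1 ? "DENIED" : "APPROVED";

const missingCauses = findMissingCauses(
  ["policy", "refund-window", "greeting", "tone"],
  ["policy", "tone"],
  toyContextRerun,
  (before, after) => before !== after,
);
console.log(missingCauses);
```

Here two units were dropped, but only restoring `refund-window` changes the answer, so it is the one reported.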

Read the run loop-by-loop

When the bug isn't in the last step but a few iterations back, read the run as a trajectory. assembleTrajectory(artifacts) slices a ReAct run into one LoopFrame per iteration — each carrying its LLM call, the context that fed it, and where its tool output came from. walkToRoot is just walkTrajectory(assembleTrajectory(...)): it walks that per-loop structure backward from the symptom to the iteration that planted the cause.

This is the same substrate the localizer already walks — a structured view of the run, not a score. There is no trajectory grade here: "did the whole run succeed?" is a metric you supply. Runnable, tested examples: 13-per-loop-trajectory.ts and 15-walk-to-root.ts.
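
As a rough picture of the backward walk, the sketch below follows a symptom through per-iteration frames to the unit that has nothing upstream. The `FrameSketch` shape is a deliberately simplified stand-in for LoopFrame, whose real fields are not shown in this guide, and `walkBack` follows only the first parent of each unit; both are assumptions for illustration, not the behavior of walkTrajectory.

```typescript
// Simplified, hypothetical frame: the units that entered the context in this
// iteration, each with the units it was built from (empty for injections).
type UnitSketch = { id: string; from: string[] };
type FrameSketch = { iteration: number; introduced: UnitSketch[] };
type RootSketch = { unit: string; iteration: number; path: string[] };

function walkBack(frames: FrameSketch[], symptom: string): RootSketch | null {
  const index = new Map<string, { iteration: number; from: string[] }>();
  frames.forEach((f) =>
    f.introduced.forEach((u) => index.set(u.id, { iteration: f.iteration, from: u.from })),
  );

  const path: string[] = [];
  const visited = new Set<string>();
  let current = symptom;
  while (!visited.has(current)) {
    visited.add(current);
    path.push(current);
    const entry = index.get(current);
    if (!entry) return null; // unit was never recorded in any frame
    if (entry.from.length === 0) {
      return { unit: current, iteration: entry.iteration, path };
    }
    current = entry.from[0]; // simplification: follow the first parent only
  }
  return null; // a cycle was detected
}

const toyFrames: FrameSketch[] = [
  {
    iteration: 1,
    introduced: [
      { id: "inject:vip-override", from: [] },
      { id: "tool:lookup#1", from: ["inject:vip-override"] },
    ],
  },
  { iteration: 2, introduced: [{ id: "note:customer-is-vip", from: ["tool:lookup#1"] }] },
  { iteration: 3, introduced: [{ id: "answer", from: ["note:customer-is-vip"] }] },
];

const root = walkBack(toyFrames, "answer");
console.log(root);
```

The wrong answer surfaces in iteration 3, but the walk ends at the injection planted in iteration 1.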

API

Everything is exported from agentfootprint/observe: localizeContextBug, LocalizeContextBugOptions, ContextBugReport, formatContextBugReport.

Next steps

Observability — the trace this debugger reads: the full event taxonomy
Debugging — the three debug surfaces (live status, logs, replay) this sits on top of
Testing — the mocks-first $0 runs the ablation re-runs are built on

Localize a context bug