Error handling
Tool failure recovery, retry/fallback/circuit-breaker decorators, resume-on-error. Build agents that degrade gracefully instead of crashing on the first 429.
Your production agent makes 50,000 LLM calls a day. One in a thousand gets a 429. One in ten thousand gets a 500. At that volume, "throw and crash" is not a strategy — it's a paging incident waiting to happen. The library ships retry + fallback + circuit-breaker decorators, rules-based reliability, and resume-on-error so you can compose graceful degradation without writing per-provider error glue.
Errors at the boundary
Provider adapters surface the underlying SDK error — there is no wrapper LLMError class with err.code/err.retryable fields. A failed agent.run(...) rejects with the raw error, and observability captures it (the run-failed boundary fires so a monitor shows a humanized, actionable message rather than a stuck "thinking" state):
try {
  await agent.run({ message: 'Hello' });
} catch (err) {
  // `err` is the underlying provider/SDK error (or a wrapped one carrying
  // `.cause`). Inspect `err.status` / `err.statusCode` / `err.message` to
  // classify it yourself, or let the decorators below do it for you.
  console.error(err);
}
The retry/fallback decorators carry their own classification heuristics (HTTP 4xx vs 5xx vs AbortError), which you can override per call via shouldRetry / shouldFallback.
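If you choose to classify errors yourself, a small helper can normalize the shapes mentioned above. The following is a minimal sketch, not part of the library: `statusOf`, `classify`, and the category labels are hypothetical names chosen for illustration, and it assumes the SDK error exposes a numeric `status` or `statusCode`, possibly behind a `.cause` link.

```typescript
// Hypothetical helper, not an agentfootprint API: pull an HTTP status out of
// an unknown error, following `.cause` links when the error was wrapped.
type ErrorClass = 'aborted' | 'rate-limited' | 'client' | 'server' | 'unknown';

function statusOf(err: unknown, depth = 0): number | undefined {
  // The depth guard protects against accidental `.cause` cycles.
  if (typeof err !== 'object' || err === null || depth > 5) return undefined;
  const e = err as { status?: unknown; statusCode?: unknown; cause?: unknown };
  if (typeof e.status === 'number') return e.status;
  if (typeof e.statusCode === 'number') return e.statusCode;
  return statusOf(e.cause, depth + 1);
}

function classify(err: unknown): ErrorClass {
  if ((err as { name?: string } | null)?.name === 'AbortError') return 'aborted';
  const status = statusOf(err);
  if (status === undefined) return 'unknown'; // network error or unfamiliar shape
  if (status === 429) return 'rate-limited';
  if (status >= 500) return 'server';
  if (status >= 400) return 'client';
  return 'unknown';
}
```

A `catch` block can then branch on `classify(err)` instead of repeating status checks at every call site.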
Tool errors
When a tool's execute function throws, the error becomes a tool-result message tagged as an error. The LLM sees it and can decide to retry, try a different tool, or surface it to the user:
import { defineTool } from 'agentfootprint';
const riskyTool = defineTool({
  name: 'fetch_data',
  description: 'Fetch text from a URL',
  inputSchema: {
    type: 'object',
    properties: { url: { type: 'string' } },
    required: ['url'],
  },
  execute: async ({ url }: { url: string }) => {
    const res = await fetch(url);
    if (!res.ok) {
      // Throw to surface as tool-error to the LLM:
      throw new Error(`HTTP ${res.status}: ${res.statusText}`);
    }
    return await res.text();
  },
});
The agent's narrative captures the error event (agentfootprint.stream.tool_end with the error payload) so observability shows exactly which tool failed and when. No silent swallowing.
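To make the "throw becomes a tool-result" behavior concrete, here is a minimal sketch of the idea. It is not the library's internal code: `runTool` and the `ToolResultMessage` shape (including the `isError` flag) are assumptions made up for this illustration.

```typescript
// Illustrative only: `runTool` and this message shape are assumptions, not
// agentfootprint's internal types.
interface ToolResultMessage {
  role: 'tool';
  toolName: string;
  content: string;
  isError: boolean;
}

async function runTool<I>(
  toolName: string,
  execute: (input: I) => Promise<string>,
  input: I,
): Promise<ToolResultMessage> {
  try {
    const content = await execute(input);
    return { role: 'tool', toolName, content, isError: false };
  } catch (err) {
    // The failure is converted into data the model can read instead of
    // being rethrown, so the loop continues and the LLM picks the next step.
    const content = err instanceof Error ? err.message : String(err);
    return { role: 'tool', toolName, content, isError: true };
  }
}
```

Under this sketch, the `HTTP 503: Service Unavailable` message thrown by a tool like `fetch_data` would reach the model as ordinary message content tagged as an error.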
Retry decorator
withRetry(provider, options?) wraps any LLMProvider with retry-on-transient-error and exponential backoff, and respects an AbortSignal. It lives at the agentfootprint/resilience subpath; the vendor-SDK providers (anthropic, openai, bedrock) live at agentfootprint/llm-providers:
import { withRetry } from 'agentfootprint/resilience';
import { anthropic } from 'agentfootprint/llm-providers';
const provider = withRetry(anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }), {
  maxAttempts: 5, // total attempts incl. the first (default 3)
  initialDelayMs: 200, // delay before first retry (default 200)
  backoffFactor: 2, // multiplier per attempt (default 2 → 200ms, 400ms, 800ms)
  maxDelayMs: 10_000, // cap on backoff delay (default 10s)
  // Override the default heuristic (skip AbortError + 4xx except 429):
  shouldRetry: (err, attempt) =>
    (err as { status?: number }).status !== 401 && attempt < 5,
  onRetry: (err, attempt, delayMs) =>
    console.warn(`retry ${attempt} in ${delayMs}ms`, err),
});
The default shouldRetry skips AbortError and HTTP 4xx, because client mistakes don't benefit from a retry. The one exception is 429 Too Many Requests, which is retried. Everything else (5xx, network errors, unknown shapes) is retried as well. Override it to implement per-error policies.
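The backoff schedule and the default heuristic described above can be sketched in a few lines. This is an approximation under stated assumptions, not the library's source: `backoffDelays` and `defaultShouldRetry` are hypothetical names, the error is assumed to expose a numeric `status`, and a real implementation may add jitter.

```typescript
// Sketch of the documented defaults; names here are hypothetical.
interface BackoffOptions {
  maxAttempts: number; // total attempts, including the first
  initialDelayMs: number; // delay before the first retry
  backoffFactor: number; // multiplier applied per retry
  maxDelayMs: number; // upper bound on any single delay
}

// Returns the delay before each retry; there are maxAttempts - 1 retries.
function backoffDelays(o: BackoffOptions): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < o.maxAttempts - 1; retry++) {
    const raw = o.initialDelayMs * o.backoffFactor ** retry;
    delays.push(Math.min(raw, o.maxDelayMs));
  }
  return delays;
}

// Skip AbortError and 4xx (except 429); retry everything else.
function defaultShouldRetry(err: unknown): boolean {
  const e = (err ?? {}) as { name?: string; status?: number };
  if (e.name === 'AbortError') return false;
  const status = e.status;
  if (typeof status === 'number' && status >= 400 && status < 500) {
    return status === 429;
  }
  return true; // 5xx, network errors, unknown shapes
}
```

With maxAttempts: 5, initialDelayMs: 200, backoffFactor: 2, and maxDelayMs: 10_000, this yields delays of 200, 400, 800, and 1600 ms, so the 10-second cap only starts to matter at higher attempt counts.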
Fallback decorator
withFallback(primary, fallback, options?) swaps providers when the primary fails. Useful for cross-provider resilience — try Anthropic; if it's down, try OpenAI:
import { withFallback } from 'agentfootprint/resilience';
import { anthropic, openai } from 'agentfootprint/llm-providers';
const provider = withFallback(
  anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
  openai({ apiKey: process.env.OPENAI_API_KEY! }),
  { onFallback: (err) => console.warn('primary failed, falling back:', err) },
);
The fallback activates when the primary throws an error that meets the fallback predicate. The default shouldFallback falls back on every error except AbortError; override it to gate on specific status codes. Successful streams pin to the active provider: once the primary's stream has yielded a chunk, a later error re-throws rather than restarting on the fallback (no provider-flip mid-stream).
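As an illustration of gating on status codes, a custom predicate might look like the sketch below. The source names the shouldFallback option but does not show its signature, so the `(err) => boolean` shape, the function name `shouldFallbackOnOutage`, and the chosen status list are all assumptions.

```typescript
// Assumed shape: shouldFallback(err) => boolean. Treat this as a sketch.
const FALLBACK_STATUSES = new Set<number>([429, 500, 502, 503, 504]);

function shouldFallbackOnOutage(err: unknown): boolean {
  const e = (err ?? {}) as { name?: string; status?: number };
  // The caller cancelled the request: never switch providers.
  if (e.name === 'AbortError') return false;
  // No status at all usually means a network failure or an unknown shape.
  if (typeof e.status !== 'number') return true;
  // Only outage-like statuses switch providers; a 401 or 400 surfaces as-is.
  return FALLBACK_STATUSES.has(e.status);
}
```

If the option accepts a function of this shape, it would be passed as `{ shouldFallback: shouldFallbackOnOutage }` in the third argument of withFallback, so that a bad API key on the primary is reported rather than silently masked by the fallback.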
For chaining three or more providers, use fallbackProvider(p1, p2, p3, ...) — sugar for nested withFallback that tries each in order, first success wins:
import { fallbackProvider } from 'agentfootprint/resilience';
import { anthropic, openai, mock } from 'agentfootprint/llm-providers';
const provider = fallbackProvider(
  anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
  openai({ apiKey: process.env.OPENAI_API_KEY! }),
  mock({ reply: '[degraded] all upstream providers failed' }),
);
Compose: retry + fallback together
Decorators stack — each preserves the LLMProvider interface, so they're drop-in. Build a chain — try anthropic; on failure fall back to openai; the whole chain wrapped in retry:
import { withRetry, withFallback } from 'agentfootprint/resilience';
import { anthropic, openai } from 'agentfootprint/llm-providers';
const provider = withRetry(
  withFallback(
    anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
    openai({ apiKey: process.env.OPENAI_API_KEY! }),
  ),
  { maxAttempts: 5 },
);
Add a circuit breaker per provider so retries stop hammering a downed vendor. Once it trips, complete() fails fast with CircuitOpenError, which withFallback catches and routes onward:
import {
  withRetry,
  withFallback,
  withCircuitBreaker,
} from 'agentfootprint/resilience';
import { anthropic, openai } from 'agentfootprint/llm-providers';
const provider = withRetry(
  withFallback(
    withCircuitBreaker(anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }), {
      failureThreshold: 5,
      cooldownMs: 30_000,
    }),
    withCircuitBreaker(openai({ apiKey: process.env.OPENAI_API_KEY! })),
  ),
  { maxAttempts: 3 },
);
Pass the composed provider to the agent: Agent.create({ provider, ... }).
Circuit breaker
withCircuitBreaker(provider, options?) (from agentfootprint/resilience) trips after N consecutive failures, opens for a cooldown, then half-open probes. It prevents thundering-herd retry on a downed provider — once OPEN, complete() throws CircuitOpenError in <1ms with no network round-trip:
import { withCircuitBreaker, CircuitOpenError } from 'agentfootprint/resilience';
import { anthropic } from 'agentfootprint/llm-providers';
const provider = withCircuitBreaker(anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }), {
  failureThreshold: 5, // open after 5 consecutive failures (default 5)
  cooldownMs: 30_000, // stay open this long before probing (default 30s)
  halfOpenSuccessThreshold: 1, // probe successes needed to close (default 1)
});
The breaker is per-instance and in-process (no Redis dependency), so each wrapped provider in each process tracks its own failures. See the resilience guide for composing it with retry + fallback.
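The closed, open, and half-open cycle can be sketched as a small state machine. This is an illustration of the general pattern using the option names shown above, not the library's implementation: `BreakerSketch`, its methods, and the injectable `now` clock are hypothetical.

```typescript
// A minimal, synchronous sketch of closed -> open -> half-open -> closed.
type BreakerState = 'closed' | 'open' | 'half-open';

interface BreakerOptions {
  failureThreshold?: number;
  cooldownMs?: number;
  halfOpenSuccessThreshold?: number;
  now?: () => number; // injectable clock (an assumption, handy for tests)
}

class BreakerSketch {
  private current: BreakerState = 'closed';
  private failures = 0;
  private probeSuccesses = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly halfOpenSuccessThreshold: number;
  private readonly now: () => number;

  constructor(opts: BreakerOptions = {}) {
    this.failureThreshold = opts.failureThreshold ?? 5;
    this.cooldownMs = opts.cooldownMs ?? 30_000;
    this.halfOpenSuccessThreshold = opts.halfOpenSuccessThreshold ?? 1;
    this.now = opts.now ?? Date.now;
  }

  // An open breaker becomes half-open once the cooldown has elapsed.
  state(): BreakerState {
    if (this.current === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.current = 'half-open';
      this.probeSuccesses = 0;
    }
    return this.current;
  }

  // False means "fail fast without touching the network".
  allowRequest(): boolean {
    return this.state() !== 'open';
  }

  recordSuccess(): void {
    if (this.state() === 'half-open') {
      this.probeSuccesses += 1;
      if (this.probeSuccesses >= this.halfOpenSuccessThreshold) {
        this.current = 'closed';
        this.failures = 0;
      }
    } else {
      this.failures = 0; // a success ends the consecutive-failure streak
    }
  }

  recordFailure(): void {
    if (this.state() === 'half-open') {
      this.trip(); // a failed probe reopens immediately
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.trip();
  }

  private trip(): void {
    this.current = 'open';
    this.openedAt = this.now();
  }
}
```

A wrapper built on this would check `allowRequest()` before each call, throw its fail-fast error when the answer is false, and report the outcome with `recordSuccess()` or `recordFailure()`.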
Resume from the failure point
agent.run() rejects with RunCheckpointError when a fault occurs mid-run, carrying both the .cause (the underlying error) and a JSON-serializable .checkpoint (the last-known-good conversation history). Persist it and call agent.resumeOnError(checkpoint) to replay the run with that history restored — the next iteration retries the call that originally failed:
import { Agent, RunCheckpointError } from 'agentfootprint';
try {
  const result = await agent.run({ message: 'long task' });
} catch (err) {
  if (err instanceof RunCheckpointError) {
    await checkpointStore.put(sessionId, err.checkpoint);
    // hours / restart later, after the vendor recovers:
    const checkpoint = await checkpointStore.get(sessionId);
    const result = await agent.resumeOnError(checkpoint);
  } else {
    throw err; // not a recoverable fault, so propagate
  }
}
Rules-based reliability
For declarative retry/fail-fast policies inside the agent loop, configure Agent.create({...}).reliability({...}) with rules evaluated each iteration. A matching rule decides 'retry' vs 'fail-fast'; on fail-fast the run throws ReliabilityFailFastError (from agentfootprint/reliability):
import { Agent } from 'agentfootprint';
import { ReliabilityFailFastError } from 'agentfootprint/reliability';
const agent = Agent.create({ provider, model })
  .reliability({
    postDecide: [
      { when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3,
        then: 'retry', kind: 'transient-retry' },
      { when: (s) => s.error !== undefined,
        then: 'fail-fast', kind: 'unrecoverable' },
    ],
  })
  .build();
try {
  await agent.run({ message: '...' });
} catch (e) {
  if (e instanceof ReliabilityFailFastError) console.log(e.kind, e.reason);
}
Anti-patterns
- Don't catch a provider error and swallow it. Always re-throw or take a deliberate action (fallback, log, escalate).
- Don't retry terminal errors. The decorators' default heuristic skips 4xx (except 429) and AbortError for a reason: retrying an auth error will not magically produce a key. If you override shouldRetry, keep that contract.
- Don't put retry logic in the tool's execute. Tool errors are agent-visible by design, so let the LLM decide whether to retry. If the tool itself needs retry on a flaky API, that's the tool's internal concern.
Next steps
- Resilience guide — the full decorator surface
- Reliability guide — rules-based retry / fail-fast inside the agent loop
- Pause / resume — JSON-checkpointed human-in-the-loop
