Error handling
Your production agent makes 50,000 LLM calls a day. One in a thousand gets a 429. One in ten thousand gets a 500. That is roughly 50 rate-limit errors and 5 server errors every day, so “throw and crash” is not a strategy; it’s a paging incident waiting to happen. The library ships typed errors plus retry and fallback decorators so you can compose graceful degradation without writing per-provider error glue.
Typed errors at the boundary
Provider adapters wrap underlying SDK errors in LLMError with a classification:
    import { LLMError } from 'agentfootprint';

    try {
      await agent.run({ message: 'Hello' });
    } catch (err) {
      if (err instanceof LLMError) {
        err.code;       // 'auth' | 'rate_limit' | 'context_length' | 'invalid_request'
                        // | 'server' | 'timeout' | 'aborted' | 'network' | 'unknown'
        err.provider;   // 'anthropic' | 'openai' | 'bedrock' | ...
        err.statusCode; // HTTP status (429, 401, 500, etc.) when applicable
        err.retryable;  // true for rate_limit, server, timeout, network
      }
    }

The retryable flag is the cheap heuristic for “should I try again?”: true for transient failures (rate limit, 5xx, timeout, network), false for terminal ones (auth, invalid request). The decorators below honor it.
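As a rough illustration of how such a classification could be derived, the sketch below maps an HTTP status to an error code and a retryable flag. The names ErrorCode, classifyStatus, and isRetryable are hypothetical and are not part of agentfootprint. The status-to-code mapping is an assumption; only the retryable set mirrors the comment in the snippet above.

```typescript
// Hypothetical sketch: not the library's implementation.
type ErrorCode =
  | 'auth' | 'rate_limit' | 'context_length' | 'invalid_request'
  | 'server' | 'timeout' | 'aborted' | 'network' | 'unknown';

// Codes treated as transient, mirroring the retryable comment above.
const RETRYABLE: ReadonlySet<ErrorCode> = new Set<ErrorCode>([
  'rate_limit', 'server', 'timeout', 'network',
]);

// Assumed mapping from HTTP status to a code. A real adapter would likely
// also inspect provider-specific error bodies (e.g. for context_length).
function classifyStatus(status: number | undefined): ErrorCode {
  if (status === undefined) return 'unknown';
  if (status === 401 || status === 403) return 'auth';
  if (status === 408) return 'timeout';
  if (status === 429) return 'rate_limit';
  if (status >= 500 && status <= 599) return 'server';
  if (status >= 400 && status <= 499) return 'invalid_request';
  return 'unknown';
}

function isRetryable(code: ErrorCode): boolean {
  return RETRYABLE.has(code);
}
```

Keeping the classification in one place means callers only need to branch on the code or the retryable flag, never on provider-specific status handling.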
Tool errors
When a tool’s execute function throws, the error becomes a tool-result message tagged as an error. The LLM sees it and can decide to retry, try a different tool, or surface it to the user:
    const riskyTool = {
      schema: { name: 'fetch_data', description: '...', inputSchema: {...} },
      execute: async ({ url }: { url: string }) => {
        const res = await fetch(url);
        if (!res.ok) {
          // Throw to surface as tool-error to the LLM:
          throw new Error(`HTTP ${res.status}: ${res.statusText}`);
        }
        return await res.text();
      },
    };

The agent’s narrative captures the error event (agentfootprint.stream.tool_end with the error payload) so observability shows exactly which tool failed and when. No silent swallowing.
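To make the "thrown error becomes a tool-result message" step concrete, here is a minimal sketch of what such a conversion might look like. The ToolResultMessage shape and the errorToToolResult and runTool helpers are hypothetical names used for illustration; the library's actual message types may differ.

```typescript
// Hypothetical shapes for illustration; the real message types may differ.
interface ToolResultMessage {
  role: 'tool';
  toolName: string;
  content: string;
  isError: boolean;
}

// Convert a thrown value into an error-tagged tool result.
function errorToToolResult(toolName: string, err: unknown): ToolResultMessage {
  const message = err instanceof Error ? err.message : String(err);
  return { role: 'tool', toolName, content: message, isError: true };
}

// Run a tool and always resolve with a message the LLM can read,
// whether the tool succeeded or threw.
async function runTool(
  toolName: string,
  execute: (args: unknown) => Promise<string>,
  args: unknown,
): Promise<ToolResultMessage> {
  try {
    const content = await execute(args);
    return { role: 'tool', toolName, content, isError: false };
  } catch (err) {
    return errorToToolResult(toolName, err);
  }
}
```

Because runTool never rejects, the agent loop in this sketch keeps going after a tool failure and the model receives the error text as ordinary context.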
Retry decorator
withRetry(provider, options) wraps any LLMProvider with retry-on-retryable-error. It honors the LLMError.retryable classification and respects an AbortSignal:
    import { withRetry, anthropic } from 'agentfootprint';

    const provider = withRetry(anthropic({ apiKey }), {
      maxAttempts: 5,
      backoff: 'exponential', // 250ms, 500ms, 1s, 2s, 4s
      shouldRetry: (err, attempt) => err.retryable && attempt < 5,
    });

shouldRetry is overridable so you can implement per-error policies (e.g., “retry rate_limit forever within budget; cap server errors at 3”).
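The per-error policy mentioned above could be sketched roughly as follows. The makeShouldRetry and backoffDelayMs helpers, the PolicyOptions shape, and the injectable clock are all hypothetical; the sketch also assumes that attempt is the 1-based number of the attempt that just failed, which the text above does not confirm.

```typescript
// Hypothetical shapes; only `code` and `retryable` come from the text above.
interface RetryableError {
  code: string;
  retryable: boolean;
}

interface PolicyOptions {
  budgetMs: number;          // total time allowed for rate-limit retries
  maxServerAttempts: number; // cap on attempts for server errors
  now?: () => number;        // injectable clock, defaults to Date.now
}

// Build a shouldRetry predicate. `attempt` is assumed to be the 1-based
// number of the attempt that just failed.
function makeShouldRetry(opts: PolicyOptions) {
  const now = opts.now ?? Date.now;
  const startedAt = now();
  return (err: RetryableError, attempt: number): boolean => {
    if (!err.retryable) return false;
    if (err.code === 'rate_limit') return now() - startedAt < opts.budgetMs;
    if (err.code === 'server') return attempt < opts.maxServerAttempts;
    return attempt < 5; // assumed default cap for other transient errors
  };
}

// Exponential backoff delay before the retry that follows `attempt`
// (1 -> 250ms, 2 -> 500ms, 3 -> 1s, ...), assuming a 250ms base.
function backoffDelayMs(attempt: number, baseMs = 250): number {
  return baseMs * 2 ** (attempt - 1);
}
```

The returned predicate could be passed as the shouldRetry option, provided the real option accepts the same (err, attempt) arguments.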
Fallback decorator
withFallback(primary, fallback) swaps providers when the primary fails. This is useful for cross-provider resilience. For example, try Anthropic and, if it’s down, try OpenAI:
    import { withFallback, anthropic, openai } from 'agentfootprint';

    const provider = withFallback(
      anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
      openai({ apiKey: process.env.OPENAI_API_KEY! }),
    );

The fallback only activates when the primary throws an error meeting the fallback predicate (default: any retryable error, plus auth and server errors). Successful streams pin to the active provider, so there is no provider flip mid-stream.
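The predicate-gated switch described above can be sketched as below. SketchProvider, SketchError, defaultShouldFallback, and withFallbackSketch are hypothetical names, and the single complete method is a simplification; the real LLMProvider interface and its streaming behavior (including pinning to the active provider) are not modeled here.

```typescript
// Hypothetical minimal provider shape, for illustration only.
interface SketchProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

interface SketchError {
  code: string;
  retryable: boolean;
}

function isSketchError(err: unknown): err is SketchError {
  return typeof err === 'object' && err !== null && 'code' in err && 'retryable' in err;
}

// Default predicate as described above: retryable, auth, or server errors.
function defaultShouldFallback(err: SketchError): boolean {
  return err.retryable || err.code === 'auth' || err.code === 'server';
}

// Try the primary; hand the same request to the fallback only when the
// predicate accepts the error. All other errors propagate unchanged.
function withFallbackSketch(
  primary: SketchProvider,
  fallback: SketchProvider,
  shouldFallback: (err: SketchError) => boolean = defaultShouldFallback,
): SketchProvider {
  return {
    name: `${primary.name}->${fallback.name}`,
    async complete(prompt: string): Promise<string> {
      try {
        return await primary.complete(prompt);
      } catch (err) {
        if (isSketchError(err) && shouldFallback(err)) {
          return fallback.complete(prompt);
        }
        throw err;
      }
    },
  };
}
```

Gating on a predicate keeps terminal request errors, such as an invalid request, from being replayed against a second provider where they would fail the same way.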
Compose: retry + fallback together
Decorators stack. Right-fold to build a chain that tries anthropic, falls back to openai on failure, and wraps the whole chain in retry:
    const provider = withRetry(
      withFallback(
        anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
        openai({ apiKey: process.env.OPENAI_API_KEY! }),
      ),
      { maxAttempts: 5 },
    );

Or use the convenience factory:
    import { resilientProvider } from 'agentfootprint';

    const provider = resilientProvider({
      primary: anthropic({...}),
      fallbacks: [openai({...}), bedrock({...})],
      retry: { maxAttempts: 3 },
    });

Coming in v2.5: Reliability subsystem
The Reliability subsystem (deferred from v2.4) will add three more primitives that compose on top:
- CircuitBreaker: trip after N consecutive failures, open for cooldown, half-open probe. Prevents thundering-herd retry on a downed provider.
- 3-tier output fallback, outputFallback(primary, fallback, canned): primary LLM fails → fallback LLM tries → canned response if both fail.
- agent.resumeOnError(checkpoint, input): auto-checkpoint at iteration boundaries; resume from the failure point with corrected input.
These primitives are currently on the v2.5 roadmap; the existing decorators above cover the production-critical 80% today.
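For readers unfamiliar with the circuit-breaker pattern named in the list above, the following is a conceptual sketch of the closed, open, and half-open cycle. It is not the planned v2.5 API: CircuitBreakerSketch, its method names, and the injectable clock are all hypothetical, and this version lets any call through while half-open rather than limiting it to a single probe.

```typescript
// Conceptual sketch only; this is not the v2.5 CircuitBreaker API.
type BreakerState = 'closed' | 'open' | 'half_open';

class CircuitBreakerSketch {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = 'closed';
  private readonly threshold: number;  // N consecutive failures to trip
  private readonly cooldownMs: number; // how long to stay open
  private readonly now: () => number;  // injectable clock

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // May a call go through right now?
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half_open'; // cooldown elapsed: let probe traffic through
      return true;
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe, or N failures in a row, (re)opens the circuit.
    if (this.state === 'half_open' || this.failures >= this.threshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

While the circuit is open, callers skip the provider entirely instead of retrying against it, which is how the pattern avoids piling retries onto a provider that is already down.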
Anti-patterns
- Don’t catch LLMError and swallow it. Always re-throw or take a deliberate action (fallback, log, escalate).
- Don’t retry non-retryable errors. err.retryable is the contract; honor it. Retrying an auth error will not magically produce a key.
- Don’t put retry logic in the tool’s execute. Tool errors are agent-visible by design, so let the LLM decide whether to retry. If the tool itself needs retry on a flaky API, that’s the tool’s internal concern.
Next steps
- Resilience guide: the full decorator surface
- Pause / resume: JSON-checkpointed human-in-the-loop (lays groundwork for v2.5 resumeOnError)