Error handling

Your production agent makes 50,000 LLM calls a day. One in a thousand gets a 429. One in ten thousand gets a 500. At that volume, “throw and crash” is not a strategy — it’s a paging incident waiting to happen. The library ships typed errors + retry + fallback decorators so you can compose graceful degradation without writing per-provider error glue.

Typed errors at the boundary

Provider adapters wrap underlying SDK errors in LLMError with classification:

import { LLMError } from 'agentfootprint';

try {
  await agent.run({ message: 'Hello' });
} catch (err) {
  if (err instanceof LLMError) {
    err.code;        // 'auth' | 'rate_limit' | 'context_length' | 'invalid_request'
                     // | 'server' | 'timeout' | 'aborted' | 'network' | 'unknown'
    err.provider;    // 'anthropic' | 'openai' | 'bedrock' | ...
    err.statusCode;  // HTTP status (429, 401, 500, etc.) when applicable
    err.retryable;   // true for rate_limit, server, timeout, network
  }
}

The retryable flag is the cheap heuristic for “should I try again?” — true for transient failures (rate limit, 5xx, network), false for terminal ones (auth, invalid request). The decorators below honor it.

Tool errors

When a tool’s execute function throws, the error becomes a tool-result message tagged as an error. The LLM sees it and can decide to retry, try a different tool, or surface it to the user:

const riskyTool = {
  schema: { name: 'fetch_data', description: '...', inputSchema: {...} },
  execute: async ({ url }: { url: string }) => {
    const res = await fetch(url);
    if (!res.ok) {
      // Throw to surface as tool-error to the LLM:
      throw new Error(`HTTP ${res.status}: ${res.statusText}`);
    }
    return await res.text();
  },
};

The agent’s narrative captures the error event (agentfootprint.stream.tool_end with the error payload) so observability shows exactly which tool failed when. No silent swallowing.

Retry decorator

withRetry(provider, options) wraps any LLMProvider with retry-on-retryable-error. It honors the LLMError.retryable classification and respects an AbortSignal:

import { withRetry, anthropic } from 'agentfootprint';

const provider = withRetry(anthropic({ apiKey }), {
  maxAttempts: 5,
  backoff: 'exponential',  // 250ms, 500ms, 1s, 2s, 4s
  shouldRetry: (err, attempt) => err.retryable && attempt < 5,
});

shouldRetry is overridable so you can implement per-error policies (e.g., “retry rate_limit forever within budget; cap server errors at 3”).

Fallback decorator

withFallback(primary, fallback) swaps providers when the primary fails. Useful for cross-provider resilience — try Anthropic; if it’s down, try OpenAI:

import { withFallback, anthropic, openai } from 'agentfootprint';

const provider = withFallback(
  anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
  openai({ apiKey: process.env.OPENAI_API_KEY! }),
);

The fallback only activates when the primary throws an error meeting the fallback predicate (default: any retryable + auth + server errors). Successful streams pin to the active provider — no provider-flip mid-stream.

Compose: retry + fallback together

Decorators stack. Right-fold to build a chain — try anthropic; on failure fall back to openai; the whole chain wrapped in retry:

const provider = withRetry(
  withFallback(
    anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
    openai({ apiKey: process.env.OPENAI_API_KEY! }),
  ),
  { maxAttempts: 5 },
);

Or use the convenience factory:

import { resilientProvider } from 'agentfootprint';

const provider = resilientProvider({
  primary: anthropic({...}),
  fallbacks: [openai({...}), bedrock({...})],
  retry: { maxAttempts: 3 },
});

Coming in v2.5 — Reliability subsystem

The Reliability subsystem (deferred from v2.4) will add three more primitives that compose on top:

CircuitBreaker — trip after N consecutive failures, open for cooldown, half-open probe. Prevents thundering-herd retry on a downed provider.
3-tier output fallback — outputFallback(primary, fallback, canned) — primary LLM fails → fallback LLM tries → canned response if both fail.
agent.resumeOnError(checkpoint, input) — auto-checkpoint at iteration boundaries; resume from the failure point with corrected input.

Currently on the v2.5 roadmap — the existing decorators above cover the production-critical 80% today.

Anti-patterns

Don’t catch LLMError and swallow it. Always re-throw or take a deliberate action (fallback, log, escalate).
Don’t retry non-retryable errors. err.retryable is the contract; honor it. Retrying an auth error will not magically produce a key.
Don’t put retry logic in the tool’s execute. Tool errors are agent-visible by design — let the LLM decide whether to retry. If the tool itself needs retry on a flaky API, that’s the tool’s internal concern.

Next steps

Resilience guide — the full decorator surface
Pause / resume — JSON-checkpointed human-in-the-loop (lays groundwork for v2.5 resumeOnError)