Skip to content

Error handling

Your production agent makes 50,000 LLM calls a day. One in a thousand gets a 429. One in ten thousand gets a 500. At that volume, “throw and crash” is not a strategy — it’s a paging incident waiting to happen. The library ships typed errors + retry + fallback decorators so you can compose graceful degradation without writing per-provider error glue.

Provider adapters wrap underlying SDK errors in LLMError with classification:

import { LLMError } from 'agentfootprint';
try {
await agent.run({ message: 'Hello' });
} catch (err) {
if (err instanceof LLMError) {
err.code; // 'auth' | 'rate_limit' | 'context_length' | 'invalid_request'
// | 'server' | 'timeout' | 'aborted' | 'network' | 'unknown'
err.provider; // 'anthropic' | 'openai' | 'bedrock' | ...
err.statusCode; // HTTP status (429, 401, 500, etc.) when applicable
err.retryable; // true for rate_limit, server, timeout, network
}
}

The retryable flag is the cheap heuristic for “should I try again?” — true for transient failures (rate limit, 5xx, network), false for terminal ones (auth, invalid request). The decorators below honor it.

When a tool’s execute function throws, the error becomes a tool-result message tagged as an error. The LLM sees it and can decide to retry, try a different tool, or surface it to the user:

const riskyTool = {
schema: { name: 'fetch_data', description: '...', inputSchema: {...} },
execute: async ({ url }: { url: string }) => {
const res = await fetch(url);
if (!res.ok) {
// Throw to surface as tool-error to the LLM:
throw new Error(`HTTP ${res.status}: ${res.statusText}`);
}
return await res.text();
},
};

The agent’s narrative captures the error event (agentfootprint.stream.tool_end with the error payload) so observability shows exactly which tool failed when. No silent swallowing.

withRetry(provider, options) wraps any LLMProvider with retry-on-retryable-error. It honors the LLMError.retryable classification and respects an AbortSignal:

import { withRetry, anthropic } from 'agentfootprint';
const provider = withRetry(anthropic({ apiKey }), {
maxAttempts: 5,
backoff: 'exponential', // 250ms, 500ms, 1s, 2s, 4s
shouldRetry: (err, attempt) => err.retryable && attempt < 5,
});

shouldRetry is overridable so you can implement per-error policies (e.g., “retry rate_limit forever within budget; cap server errors at 3”).

withFallback(primary, fallback) swaps providers when the primary fails. Useful for cross-provider resilience — try Anthropic; if it’s down, try OpenAI:

import { withFallback, anthropic, openai } from 'agentfootprint';
const provider = withFallback(
anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
openai({ apiKey: process.env.OPENAI_API_KEY! }),
);

The fallback only activates when the primary throws an error meeting the fallback predicate (default: any retryable + auth + server errors). Successful streams pin to the active provider — no provider-flip mid-stream.

Decorators stack. Right-fold to build a chain — try anthropic; on failure fall back to openai; the whole chain wrapped in retry:

const provider = withRetry(
withFallback(
anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
openai({ apiKey: process.env.OPENAI_API_KEY! }),
),
{ maxAttempts: 5 },
);

Or use the convenience factory:

import { resilientProvider } from 'agentfootprint';
const provider = resilientProvider({
primary: anthropic({...}),
fallbacks: [openai({...}), bedrock({...})],
retry: { maxAttempts: 3 },
});

The Reliability subsystem (deferred from v2.4) will add three more primitives that compose on top:

  • CircuitBreaker — trip after N consecutive failures, open for cooldown, half-open probe. Prevents thundering-herd retry on a downed provider.
  • 3-tier output fallbackoutputFallback(primary, fallback, canned) — primary LLM fails → fallback LLM tries → canned response if both fail.
  • agent.resumeOnError(checkpoint, input) — auto-checkpoint at iteration boundaries; resume from the failure point with corrected input.

Currently on the v2.5 roadmap — the existing decorators above cover the production-critical 80% today.

  • Don’t catch LLMError and swallow it. Always re-throw or take a deliberate action (fallback, log, escalate).
  • Don’t retry non-retryable errors. err.retryable is the contract; honor it. Retrying an auth error will not magically produce a key.
  • Don’t put retry logic in the tool’s execute. Tool errors are agent-visible by design — let the LLM decide whether to retry. If the tool itself needs retry on a flaky API, that’s the tool’s internal concern.