
Resilience

Production traffic peaks Monday morning. Anthropic returns 429s for the next 90 seconds. Your support agent has 200 concurrent users and zero patience for a backoff loop. The framework’s resilience decorators wrap any provider with retry + fallback so your agent degrades gracefully instead of throwing user-visible errors.

| Decorator | What it does |
| --- | --- |
| withRetry(provider, opts) | Wraps a provider with retry-on-retryable-error. Honors LLMError.retryable classification. AbortSignal-aware sleep. |
| withFallback(primary, fallback) | If primary throws a fallback-eligible error, retry on fallback. Stream pinning prevents provider-flip mid-stream. |
| fallbackProvider([providers]) / resilientProvider({...}) | Convenience composers: chain N fallbacks + retry in one factory call. |

All three preserve the LLMProvider interface — drop-in replacements for the underlying provider. They compose freely:

import { withRetry, withFallback, anthropic, openai } from 'agentfootprint';

const provider = withRetry(
  withFallback(
    anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
    openai({ apiKey: process.env.OPENAI_API_KEY! }),
  ),
  { maxAttempts: 5 },
);

Read it as: try anthropic; on a fallback-eligible failure, fall back to openai; the whole chain is wrapped in a retry with maxAttempts: 5. Nesting withFallback calls as a right fold and wrapping the result in a single outer withRetry is the standard production composition.

For the common 1-primary-N-fallbacks-with-retry shape, use the factory:

import { resilientProvider, anthropic, openai, bedrock } from 'agentfootprint';

const provider = resilientProvider({
  primary: anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! }),
  fallbacks: [
    openai({ apiKey: process.env.OPENAI_API_KEY! }),
    bedrock({ region: 'us-west-2' }),
  ],
  retry: { maxAttempts: 3, backoff: 'exponential' },
});
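The page does not spell out what backoff: 'exponential' computes. As a rough illustration of the general technique only, the sketch below doubles a base delay on each attempt and caps it. The helper names (backoffDelayMs, withFullJitter) and the base and cap values are hypothetical and are not part of agentfootprint; the library's actual delays and jitter may differ.

```typescript
// Hypothetical helpers, NOT agentfootprint API. They illustrate the general
// shape of exponential backoff: the delay doubles per attempt, up to a cap.
function backoffDelayMs(attempt: number, baseMs = 250, capMs = 8000): number {
  if (attempt < 1) throw new RangeError('attempt is 1-based');
  // attempt 1 -> base, attempt 2 -> 2x base, attempt 3 -> 4x base, ...
  const uncapped = baseMs * Math.pow(2, attempt - 1);
  return Math.min(uncapped, capMs);
}

// Optional "full jitter": pick a random delay in [0, delayMs) so that many
// clients that hit the same 429 do not all retry at the same instant.
function withFullJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

// Delays for attempts 1..7 with the assumed defaults.
const delays = [1, 2, 3, 4, 5, 6, 7].map((attempt) => backoffDelayMs(attempt));
console.log(delays.join(', ')); // 250, 500, 1000, 2000, 4000, 8000, 8000
```

The cap matters in the Monday-morning scenario: without it, a 90-second outage could push later attempts to delays longer than the outage itself.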

Every provider adapter wraps SDK errors in LLMError with retryable flagging. The default withRetry policy retries when err.retryable === true AND attempt < maxAttempts. Override shouldRetry to implement custom policies:

withRetry(provider, {
  maxAttempts: 5,
  shouldRetry: (err, attempt) => {
    if (err.code === 'rate_limit') return true; // always retry rate limits
    if (err.code === 'server') return attempt < 3; // cap server errors at 3
    return false;
  },
});
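To make the interplay of maxAttempts, shouldRetry, onRetry, and an AbortSignal-aware sleep concrete, here is a minimal sketch of a retry loop. It is an illustration, not agentfootprint's implementation: retrySketch, abortableSleep, RetryableError, and the delayMs and signal options are hypothetical names, and treating maxAttempts as a hard cap even when a custom shouldRetry returns true is an assumption.

```typescript
// Hypothetical sketch, NOT agentfootprint's implementation.
interface RetryableError extends Error {
  retryable?: boolean;
  code?: string;
}

interface RetrySketchOptions {
  maxAttempts: number;
  delayMs?: number; // fixed delay, to keep the sketch short
  signal?: AbortSignal;
  shouldRetry?: (err: RetryableError, attempt: number) => boolean;
  onRetry?: (err: RetryableError, attempt: number) => void;
}

// Sleep that rejects as soon as the signal aborts, so a cancelled request
// does not sit out the remaining delay.
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error('aborted'));
      return;
    }
    const timer = setTimeout(() => {
      signal?.removeEventListener('abort', onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      clearTimeout(timer);
      reject(new Error('aborted'));
    }
    signal?.addEventListener('abort', onAbort, { once: true });
  });
}

async function retrySketch<T>(call: () => Promise<T>, opts: RetrySketchOptions): Promise<T> {
  // Default policy as described above: retry only errors flagged retryable.
  const defaultPolicy = (err: RetryableError): boolean => err.retryable === true;
  const shouldRetry = opts.shouldRetry ?? defaultPolicy;
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (e) {
      const err = e as RetryableError;
      // Assumption: maxAttempts caps the loop even if the policy says "retry".
      if (attempt >= opts.maxAttempts || !shouldRetry(err, attempt)) throw err;
      opts.onRetry?.(err, attempt);
      await abortableSleep(opts.delayMs ?? 0, opts.signal);
    }
  }
}
```

In this sketch an aborted signal surfaces as an 'aborted' error instead of the provider error, which is one plausible way for callers to tell a user cancellation apart from an exhausted retry budget.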

shouldFallback works the same way for withFallback.
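No shouldFallback example appears in the text, so the following is a sketch under stated assumptions: that shouldFallback is passed in the options object of withFallback (the same position as onFallback), and that it receives the error and returns a boolean. The ClassifiedError interface is a hypothetical stand-in for LLMError; only the code and retryable fields are taken from the text.

```typescript
// Hypothetical stand-in for LLMError; only `code` and `retryable` are
// mentioned in the text, the interface name is made up for this sketch.
interface ClassifiedError {
  message: string;
  code?: string;
  retryable?: boolean;
}

// Fall back on rate limits and server errors. Anything else (for example a
// malformed request) would most likely fail on the fallback too, so surface it.
const shouldFallback = (err: ClassifiedError): boolean =>
  err.code === 'rate_limit' || err.code === 'server';

// Assumed usage, mirroring the shouldRetry option shown earlier:
// withFallback(primary, fallback, { shouldFallback });
```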

Both decorators call optional hooks so your recorders can log retry attempts + provider switches:

withRetry(provider, {
  maxAttempts: 5,
  onRetry: (err, attempt) => console.log(`retry ${attempt}: ${err.message}`),
});

withFallback(primary, fallback, {
  onFallback: (err) => console.log(`falling back: ${err.message}`),
});

Both also fire typed events through the standard event dispatcher — listen to agentfootprint.cost.tick to see cost across both primary and fallback providers.

What’s NOT here yet (Reliability subsystem — v2.5)


The deferred Reliability subsystem adds three more primitives that compose ON TOP of these decorators:

  • CircuitBreaker — trip after N consecutive failures, open for cooldown period, half-open probe before re-closing. Prevents thundering-herd retry on a downed provider.
  • 3-tier output fallback: outputFallback(primary, fallback, canned). If both providers fail, return a canned response (or escalate).
  • agent.resumeOnError(checkpoint, input) — auto-checkpoint at iteration boundaries; resume from the failure point with corrected input.
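Since CircuitBreaker has not shipped, the sketch below only illustrates the general closed / open / half-open pattern the first bullet describes. It makes no claim about the eventual agentfootprint API: the class name CircuitBreakerSketch, its constructor parameters, and its methods are all hypothetical, and it simplifies the half-open state by allowing any request through as a probe.

```typescript
// Hypothetical sketch of the generic circuit-breaker pattern,
// NOT the planned agentfootprint CircuitBreaker.
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreakerSketch {
  private failures = 0;
  private openedAt = 0;
  private current: BreakerState = 'closed';
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Once the cooldown has elapsed, an open breaker moves to half-open.
  private refresh(): BreakerState {
    if (this.current === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.current = 'half-open';
    }
    return this.current;
  }

  get state(): BreakerState {
    return this.refresh();
  }

  // Callers check this before hitting the provider; open means "skip it".
  canRequest(): boolean {
    return this.refresh() !== 'open';
  }

  // Any success (including a half-open probe) closes the breaker.
  recordSuccess(): void {
    this.failures = 0;
    this.current = 'closed';
  }

  // Trip after `threshold` consecutive failures; a failed half-open probe
  // re-opens immediately and restarts the cooldown.
  recordFailure(): void {
    if (this.refresh() === 'half-open' || ++this.failures >= this.threshold) {
      this.current = 'open';
      this.openedAt = this.now();
      this.failures = 0;
    }
  }
}
```

Injecting the clock through the now parameter keeps the state machine deterministic in tests, which is a design choice of this sketch and not something the text specifies.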

The current decorators cover the production-critical 80%. The Reliability subsystem covers the long tail.

  • Don’t retry non-retryable errors. err.retryable is the contract; honor it.
  • Don’t put withRetry BELOW (inside) withFallback. Wrong order: every retry on the primary delays the fallback. Right order: an outer withRetry retries the WHOLE fallback chain.
  • Don’t compose decorators inside the provider’s hot path. Build the chain ONCE at app startup; pass the composed provider into every Agent.create({ provider }).
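The ordering rule in the list above can be observed with a small simulation. The two helpers below are hypothetical, synchronous stand-ins for withRetry and withFallback (no delays, no error classification), written only to trace which provider is called and in what order while the primary keeps returning 429s.

```typescript
type Call = () => string;

// Hypothetical stand-in for a retry decorator: try up to maxAttempts times.
function retryTimes(call: Call, maxAttempts: number): Call {
  return () => {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call();
      } catch (err) {
        lastError = err;
      }
    }
    throw lastError;
  };
}

// Hypothetical stand-in for a fallback decorator: any error triggers fallback.
function orElse(primary: Call, fallback: Call): Call {
  return () => {
    try {
      return primary();
    } catch {
      return fallback();
    }
  };
}

// Record the provider call order for one request under each composition.
function trace(order: 'retry-outside' | 'retry-inside'): string {
  const log: string[] = [];
  const primary: Call = () => { log.push('primary'); throw new Error('429'); };
  const fallback: Call = () => { log.push('fallback'); return 'ok'; };
  const composed = order === 'retry-outside'
    ? retryTimes(orElse(primary, fallback), 3)
    : orElse(retryTimes(primary, 3), fallback);
  composed();
  return log.join(' -> ');
}

console.log(trace('retry-outside')); // primary -> fallback
console.log(trace('retry-inside'));  // primary -> primary -> primary -> fallback
```

With real backoff delays between attempts, the retry-inside trace means the user waits through every primary retry before the fallback is even tried.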