# Reliability
Your agent is in production. The LLM provider 503s for 4 minutes. Half your requests crash; the other half take 30 seconds because `withRetry` keeps hammering. Then a model rollout starts emitting prose instead of JSON for 2% of outputs. Then a container restarts mid-iteration and 47 in-flight runs vanish. agentfootprint v2.10.x ships three primitives that make each of those a non-event.
## The Reliability subsystem (v2.10.x — three primitives, three failure modes)

| Primitive | Release | What it solves | Where it lives |
|---|---|---|---|
| `withCircuitBreaker(provider, opts)` | v2.10.0 | Vendor outage detection; fail-fast in <5µs after N consecutive failures | `agentfootprint/resilience` |
| `.outputFallback({ fallback, canned })` | v2.10.1 | Schema-validation failure; 3-tier degradation (primary → fallback → canned) | builder method on `Agent.create({...})` |
| `agent.resumeOnError(checkpoint)` | v2.10.2 | Mid-run failure recovery; checkpoint at iteration boundaries, persist anywhere, resume hours/restarts later | `agentfootprint` (main barrel) |
All three compose. All three follow the same library convention: opt-in (no behavior change for existing consumers), typed events for observability, fail-open by construction when configured.
## End-to-end example

The single example below exercises all three primitives. The docs site live-imports the source, so this snippet stays in sync with the runnable file.
```ts
/**
 * 08 — Reliability subsystem: CircuitBreaker + outputFallback + resumeOnError.
 *
 * Demonstrates all 3 pieces of the v2.10.x Reliability subsystem
 * end-to-end. Each piece solves a distinct production failure mode:
 *
 * 1. **CircuitBreaker** — vendor outage detection. Wrap the LLM
 *    provider in `withCircuitBreaker(...)`. After N consecutive
 *    failures, the breaker OPENS and fails fast (sub-µs) so
 *    `withFallback` can route to the secondary provider without
 *    wasting 3 retries × backoff per request.
 *
 * 2. **outputFallback** — schema-validation failure. Pair with
 *    `.outputSchema(parser)`. When the LLM emits malformed JSON
 *    after maxIterations, fall through to the consumer's
 *    `fallback(err, raw)` function, then to the static `canned`
 *    safety net. Agent NEVER throws on output failure when canned
 *    is set.
 *
 * 3. **resumeOnError** — mid-run failure recovery. When the LLM 503s
 *    mid-iteration, the agent throws `RunCheckpointError` carrying
 *    the conversation history at the last completed iteration.
 *    Persist the checkpoint to Redis/Postgres/S3, restart the
 *    process, call `agent.resumeOnError(checkpoint)` to continue
 *    from where it failed.
 *
 * Run: npx tsx examples/features/08-reliability.ts
 */
import { z } from 'zod';
import { Agent, RunCheckpointError } from '../../src/index.js';
import {
  withCircuitBreaker,
  withFallback,
  CircuitOpenError,
} from '../../src/resilience/index.js';
import type { AgentRunCheckpoint } from '../../src/index.js';
import type { LLMProvider, LLMRequest, LLMResponse } from '../../src/adapters/types.js';
import { mock } from '../../src/adapters/llm/MockProvider.js';
import { isCliEntry, printResult, type ExampleMeta } from '../helpers/cli.js';

export const meta: ExampleMeta = {
  id: 'features/08-reliability',
  title: 'Reliability — CircuitBreaker + outputFallback + resumeOnError',
  group: 'features',
  description:
    'End-to-end demo of the v2.10.x Reliability subsystem: vendor-outage circuit breaker, 3-tier output-schema degradation, and fault-tolerant mid-run resume from JSON-serializable checkpoint.',
  defaultInput: 'process refund #1234 for $50',
  providerSlots: ['feature'],
  tags: ['feature', 'reliability', 'circuit-breaker', 'output-fallback', 'resume-on-error'],
};

// ── Schema for the agent's structured output ─────────────────────────
const Refund = z.object({
  amount: z.number().nonnegative(),
  reason: z.string().min(1),
});
type RefundOutput = z.infer<typeof Refund>;

// ── Helper: provider that fails N times then recovers ────────────────
function flakyProvider(failuresBeforeRecovery: number, name = 'flaky'): LLMProvider {
  let calls = 0;
  return {
    name,
    async complete(_req: LLMRequest): Promise<LLMResponse> {
      calls += 1;
      if (calls <= failuresBeforeRecovery) {
        throw new Error(`vendor 503 (call ${calls})`);
      }
      // After recovery: emit valid JSON for the Refund schema.
      return mock({
        replies: [{ content: JSON.stringify({ amount: 50, reason: 'product defect' }) }],
      }).complete(_req);
    },
  };
}

// ── Three demonstrations ─────────────────────────────────────────────
async function demoCircuitBreaker(): Promise<{ primaryCalls: number; fallbackCalls: number }> {
  // Primary that fails forever, fallback that always succeeds.
  let primaryCalls = 0;
  const primary: LLMProvider = {
    name: 'primary',
    async complete(): Promise<LLMResponse> {
      primaryCalls += 1;
      throw new Error('vendor 503');
    },
  };
  let fallbackCalls = 0;
  const fallback: LLMProvider = {
    name: 'fallback',
    async complete(): Promise<LLMResponse> {
      fallbackCalls += 1;
      return mock({
        replies: [{ content: JSON.stringify({ amount: 0, reason: 'fallback path' }) }],
      }).complete({} as LLMRequest);
    },
  };

  // Wrap primary in a breaker; fallback handles any thrown error.
  const provider = withFallback(
    withCircuitBreaker(primary, { failureThreshold: 2, cooldownMs: 60_000 }),
    fallback,
  );

  const agent = Agent.create({ provider, model: 'mock' })
    .system('You answer refund questions.')
    .outputSchema(Refund)
    .build();

  // Run 5 turns. After 2 primary failures, breaker opens; remaining
  // 3 turns route directly to fallback (primary not called).
  for (let i = 0; i < 5; i++) {
    try {
      await agent.runTyped<RefundOutput>({ message: `query ${i}` });
    } catch {
      // Some early turns may surface CircuitOpenError if the breaker
      // happens to fire before the fallback engages — that's fine.
    }
  }
  return { primaryCalls, fallbackCalls };
}

async function demoOutputFallback(): Promise<{
  result: RefundOutput;
  cannedFired: boolean;
}> {
  // LLM emits prose instead of JSON. With outputFallback, the agent
  // tier-2's into the consumer's fallback fn; if THAT fails, tier-3
  // returns the canned safety-net.
  const provider = mock({ replies: [{ content: 'Sorry, I cannot help with that.' }] });
  let cannedFired = false;

  const agent = Agent.create({ provider, model: 'mock' })
    .system('You decide refund amounts.')
    .outputSchema(Refund)
    .outputFallback({
      // Tier 2: try to recover; let's simulate it failing too.
      fallback: () => {
        throw new Error('fallback also failed (simulated)');
      },
      // Tier 3: guaranteed-valid safety net.
      canned: { amount: 0, reason: 'unable to process — please retry' },
    })
    .build();

  // The resilience event is consumer-side / informational — not in
  // the typed AgentfootprintEventMap. Cast to satisfy the typed
  // dispatcher without losing runtime behavior.
  agent.on('agentfootprint.resilience.output_canned_used' as never, () => {
    cannedFired = true;
  });

  // Caller never sees OutputSchemaError; gets a typed Refund either way.
  const result = await agent.runTyped<RefundOutput>({ message: 'refund please' });
  return { result, cannedFired };
}

async function demoResumeOnError(): Promise<{
  failedAt: string;
  resumeResult: string;
  serializedCheckpointBytes: number;
}> {
  // Provider that succeeds on call 1 (tool call), fails on call 2,
  // then succeeds on call 3 (after resume).
  let calls = 0;
  const provider: LLMProvider = {
    name: 'flaky-then-recovers',
    async complete(): Promise<LLMResponse> {
      calls += 1;
      if (calls === 1) {
        return {
          content: '',
          toolCalls: [{ id: 't1', name: 'lookup', args: { id: '1234' } }],
          usage: { input: 1, output: 1 },
          stopReason: 'tool_use',
        };
      }
      if (calls === 2) {
        throw new Error('transient vendor 503 (mid-iteration)');
      }
      return {
        content: 'refund processed: $50 for product defect',
        toolCalls: [],
        usage: { input: 1, output: 1 },
        stopReason: 'end_turn',
      };
    },
  };

  const agent = Agent.create({ provider, model: 'mock' })
    .system('You process refunds.')
    .tool({
      schema: { name: 'lookup', description: '', inputSchema: { type: 'object' } },
      execute: () => 'order #1234 found',
    })
    .build();

  let captured: AgentRunCheckpoint | undefined;
  let failedAt = '';
  try {
    await agent.run({ message: meta.defaultInput ?? 'process refund' });
  } catch (err) {
    if (err instanceof RunCheckpointError) {
      captured = err.checkpoint;
      failedAt = `iteration ${err.checkpoint.failurePoint?.iteration} (${err.checkpoint.failurePoint?.phase})`;
    } else {
      throw err;
    }
  }
  if (!captured) throw new Error('expected checkpoint');

  // Persist the checkpoint anywhere — JSON-serializable, tiny payload.
  const serialized = JSON.stringify(captured);

  // hours / restart / next deploy later: resume from the checkpoint.
  const result = await agent.resumeOnError(captured);
  return {
    failedAt,
    resumeResult: typeof result === 'string' ? result : '(paused)',
    serializedCheckpointBytes: serialized.length,
  };
}

// ── Main runner ──────────────────────────────────────────────────────
export async function run(input: string): Promise<unknown> {
  void input;
  console.log('\n=== Reliability subsystem demo ===\n');

  console.log('1. CircuitBreaker — vendor outage detection');
  const cb = await demoCircuitBreaker();
  console.log(`  primary calls:  ${cb.primaryCalls} (capped by breaker)`);
  console.log(`  fallback calls: ${cb.fallbackCalls} (took over after breaker opened)`);
  // Regression guard: breaker MUST cap primary calls
  if (cb.primaryCalls >= 5) {
    console.error('REGRESSION: breaker did not cap primary calls.');
    process.exit(1);
  }

  console.log('\n2. outputFallback — 3-tier degradation on schema failure');
  const of = await demoOutputFallback();
  console.log(`  result: ${JSON.stringify(of.result)}`);
  console.log(`  canned fired: ${of.cannedFired}`);
  // Regression guard: agent must NOT throw, must return canned shape
  if (of.result.amount !== 0 || !of.result.reason.includes('unable')) {
    console.error('REGRESSION: canned safety-net did not engage.');
    process.exit(1);
  }
  if (!of.cannedFired) {
    console.error('REGRESSION: output_canned_used event did not fire.');
    process.exit(1);
  }

  console.log('\n3. resumeOnError — mid-run failure recovery');
  const ro = await demoResumeOnError();
  console.log(`  failed at: ${ro.failedAt}`);
  console.log(`  checkpoint size: ${ro.serializedCheckpointBytes} bytes (JSON)`);
  console.log(`  resume result: ${ro.resumeResult.slice(0, 60)}…`);
  // Regression guard: resume must complete the run
  if (!ro.resumeResult.includes('refund processed')) {
    console.error('REGRESSION: resumeOnError did not complete the run.');
    process.exit(1);
  }
  if (ro.serializedCheckpointBytes < 50) {
    console.error('REGRESSION: checkpoint suspiciously small.');
    process.exit(1);
  }

  // Touch the unused import so it's clearly part of the example
  // surface even when not exercised in this code path (the
  // CircuitOpenError type is what `withCircuitBreaker` throws when
  // the breaker is OPEN; consumers can `instanceof` check it).
  void CircuitOpenError;

  console.log('\nOK — all 3 reliability primitives behaved as documented.');
  return { circuitBreaker: cb, outputFallback: of, resumeOnError: ro };
}

if (isCliEntry(import.meta.url)) {
  run(meta.defaultInput ?? '').then(printResult).catch(console.error);
}
```

Sample output:
```text
1. CircuitBreaker — vendor outage detection
  primary calls:  2 (capped by breaker)
  fallback calls: 5 (took over after breaker opened)

2. outputFallback — 3-tier degradation on schema failure
  result: {"amount":0,"reason":"unable to process — please retry"}
  canned fired: true

3. resumeOnError — mid-run failure recovery
  failed at: iteration 2 (iteration)
  checkpoint size: 461 bytes (JSON)
  resume result: refund processed: $50 for product defect…
```

## 1. CircuitBreaker — vendor outage detection

`withCircuitBreaker` wraps any `LLMProvider` and tracks consecutive failures. After `failureThreshold` consecutive failures, the breaker OPENS and rejects calls immediately with `CircuitOpenError` — no network round-trip. After `cooldownMs` it enters HALF-OPEN and admits probe calls; `halfOpenSuccessThreshold` successes close it; one failure re-opens it.
```ts
import { anthropic, openai } from 'agentfootprint/llm-providers';
import { withCircuitBreaker, withFallback } from 'agentfootprint/resilience';

const provider = withFallback(
  withCircuitBreaker(anthropic({ apiKey }), {
    failureThreshold: 5,
    cooldownMs: 30_000,
    halfOpenSuccessThreshold: 2,
    onStateChange: (state, why) => log.info(`circuit ${state}: ${why}`),
  }),
  withCircuitBreaker(openai({ apiKey })),
);
```

### Why on top of withRetry?

`withRetry` keeps hammering one provider with backoff during a multi-minute Anthropic outage. Each request burns 3 retries × backoff (roughly 3 seconds wasted) before falling back; multiplied by your QPS, that is a lot of wasted time and tokens. The breaker says "we just saw 5 failures in a row; stop calling for 30 seconds." Subsequent requests fail in <5µs (10k OPEN-state rejections in <50ms), and `withFallback` routes to OpenAI immediately.
### Three states with explicit transitions

```text
CLOSED ──[ N consecutive failures ]──► OPEN
   ▲                                    │
   │                                    │ [cooldownMs elapsed]
   │                                    ▼
   └──[ M probe successes ]──── HALF-OPEN
```

### Per-instance, NOT distributed

Each `withCircuitBreaker(...)` call holds its own breaker state in process memory. If you run 100 server replicas, each has its own independent breaker (matching the Hystrix default). For cluster-wide coordination, layer your own Redis-backed counter via the `onStateChange` hook and `shouldCount` predicate.
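The transitions above are small enough to sketch end-to-end. Below is a minimal, self-contained state machine illustrating the CLOSED → OPEN → HALF-OPEN cycle; it is not `withCircuitBreaker`'s actual internals, and the injectable clock is an addition so the cooldown can be exercised without sleeping.

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Illustrative sketch only — not agentfootprint's implementation.
class TinyBreaker {
  private state: CircuitState = 'CLOSED';
  private consecutiveFailures = 0;
  private probeSuccesses = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number,
    private cooldownMs: number,
    private halfOpenSuccessThreshold: number,
    private now: () => number = Date.now,
  ) {}

  current(): CircuitState {
    // OPEN lazily becomes HALF_OPEN once the cooldown elapses.
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN';
      this.probeSuccesses = 0;
    }
    return this.state;
  }

  allowRequest(): boolean {
    return this.current() !== 'OPEN'; // OPEN rejects instantly, no network call
  }

  onSuccess(): void {
    if (this.current() === 'HALF_OPEN') {
      this.probeSuccesses += 1;
      if (this.probeSuccesses >= this.halfOpenSuccessThreshold) this.state = 'CLOSED';
    }
    this.consecutiveFailures = 0;
  }

  onFailure(): void {
    const s = this.current();
    this.consecutiveFailures += 1;
    // One failure in HALF_OPEN, or N consecutive in CLOSED, opens the breaker.
    if (s === 'HALF_OPEN' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
      this.consecutiveFailures = 0;
    }
  }
}
```

Injecting a fake clock is also how you would unit-test your own breaker configuration deterministically.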
## 2. outputFallback — 3-tier degradation on schema failure

When the agent has an `outputSchema` AND the LLM emits malformed JSON, the agent normally throws `OutputSchemaError`. With `.outputFallback({ fallback, canned })`:
```ts
import { z } from 'zod';

const Refund = z.object({
  amount: z.number().nonnegative(),
  reason: z.string().min(1),
});

const agent = Agent.create({...})
  .outputSchema(Refund)
  .outputFallback({
    fallback: async (err, raw) => ({ amount: 0, reason: 'manual review' }),
    canned: { amount: 0, reason: 'unable to process' },
  })
  .build();

// Caller never sees OutputSchemaError; gets a typed Refund either way.
const refund = await agent.runTyped({ message: '...' });
```

### Three tiers
| Tier | When | Returns |
|---|---|---|
| Primary | LLM emits valid JSON | LLM’s parsed value |
| Fallback | Schema validation failure | `fallback(err, raw)` — re-validated against schema |
| Canned | Fallback throws OR returns invalid | Static safety-net (validated at builder time) |
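The tier order in the table can be sketched as a small pure function. This is an illustration of the resolution logic, not the agent's actual internals; a plain `validate` function stands in for the zod schema's `.parse`.

```typescript
type Validator<T> = (raw: unknown) => T; // throws on invalid input

// Illustrative sketch of the 3-tier resolution order.
function resolveOutput<T>(
  raw: string,
  validate: Validator<T>,
  opts: { fallback?: (err: unknown, raw: string) => T; canned?: T },
): T {
  try {
    return validate(JSON.parse(raw)); // Tier 1: primary — LLM emitted valid JSON
  } catch (err) {
    if (opts.fallback) {
      try {
        return validate(opts.fallback(err, raw)); // Tier 2: re-validated against schema
      } catch {
        /* fall through to tier 3 */
      }
    }
    if (opts.canned !== undefined) return opts.canned; // Tier 3: static safety net
    throw err; // fail-closed: no canned value configured
  }
}
```

Note that with `canned` set the function cannot throw on output failure (fail-open); without it, errors propagate (fail-closed).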
### Two typed events for observability

- `agentfootprint.resilience.output_fallback_triggered`
- `agentfootprint.resilience.output_canned_used`
### Builder-time canned validation

The canned value is parsed against the schema at `.outputFallback({...})` time, throwing `TypeError` immediately on misconfiguration. The error surfaces in CI or dev, not at 3 a.m. when the fallback engages.
### Fail-open vs fail-closed

- With `canned` → the agent NEVER throws on output failure (fail-open)
- Without `canned` → fallback errors propagate to the caller (fail-closed)

The consumer chooses.
## 3. resumeOnError — mid-run failure recovery

When `agent.run()` throws on a recoverable error mid-iteration, you get back a `RunCheckpointError` carrying a JSON-serializable checkpoint of the conversation history at the last completed iteration boundary. Persist it anywhere; resume later.
```ts
import { Agent, RunCheckpointError } from 'agentfootprint';

try {
  const result = await agent.run({ message: 'long task' });
} catch (err) {
  if (err instanceof RunCheckpointError) {
    // Persist anywhere — Redis, Postgres, S3, queue, file.
    await checkpointStore.put(sessionId, err.checkpoint);

    // hours / restart / new process / next deploy later:
    const checkpoint = await checkpointStore.get(sessionId);
    const result = await agent.resumeOnError(checkpoint);
  } else {
    throw err; // not recoverable — propagate
  }
}
```

### AgentRunCheckpoint shape (JSON-serializable, stable)

| Field | Type | Purpose |
|---|---|---|
| `version` | `1` (literal) | Forward-compat guard |
| `runId` | `string` | Original runId — observability correlation |
| `history` | `LLMMessage[]` | Conversation at last completed iteration |
| `lastCompletedIteration` | `number` | Where to resume from |
| `originalInput` | `{ message: string }` | User's original ask |
| `checkpointedAt` | `number` (ms) | Wall clock when captured |
| `failurePoint?` | `{ iteration, phase }` | Where the failure happened (oncall triage) |
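The snippets in this guide persist checkpoints through a `checkpointStore` that is never defined. Since the checkpoint is plain JSON, any keyed string store works; here is a minimal in-memory sketch with the same `put`/`get` surface. The `Checkpoint` interface below is a simplified stand-in for the exported `AgentRunCheckpoint` type, not the library's definition.

```typescript
// Simplified stand-in mirroring the table above (illustrative only).
interface Checkpoint {
  version: 1;
  runId: string;
  history: unknown[];
  lastCompletedIteration: number;
  originalInput: { message: string };
  checkpointedAt: number;
  failurePoint?: { iteration: number; phase: string };
}

// Minimal store: serialize on put, parse on get. The string round-trip
// is the point — anything that can hold a string (Redis value, S3
// object, a DB column) can hold a checkpoint.
class MemoryCheckpointStore {
  private rows = new Map<string, string>();

  async put(sessionId: string, cp: Checkpoint): Promise<void> {
    this.rows.set(sessionId, JSON.stringify(cp));
  }

  async get(sessionId: string): Promise<Checkpoint | undefined> {
    const raw = this.rows.get(sessionId);
    return raw === undefined ? undefined : (JSON.parse(raw) as Checkpoint);
  }
}
```

In production you would swap the `Map` for Redis, Postgres, or S3 without touching the callers.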
### Failure-phase classifier

`RunCheckpointError.checkpoint.failurePoint.phase` is one of `'llm' | 'tool' | 'iteration' | 'unknown'`. The classifier recognizes `CircuitOpenError` (v2.10.0), `AnthropicError`, `OpenAIError`, and `BedrockError`. It feeds straight into oncall postmortems — "how often did we lose runs to LLM-phase failures last week?"
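For intuition, a name-based classifier could look roughly like this. This is a hypothetical sketch: the library's real classifier is internal, `ToolExecutionError` is an invented name for illustration, and which bucket a generic `Error` lands in is an assumption.

```typescript
type FailurePhase = 'llm' | 'tool' | 'iteration' | 'unknown';

// Hypothetical sketch — the error names match the ones the guide lists,
// but the real classifier's logic is internal to the library. Matching
// on constructor.name breaks under class-name minification; a real
// implementation would use instanceof against the exported classes.
function classifyPhase(err: unknown): FailurePhase {
  const name = err instanceof Error ? err.constructor.name : '';
  const llmErrors = ['CircuitOpenError', 'AnthropicError', 'OpenAIError', 'BedrockError'];
  if (llmErrors.includes(name)) return 'llm';
  if (name === 'ToolExecutionError') return 'tool'; // invented name, illustrative
  if (err instanceof Error) return 'iteration'; // assumption: generic errors → iteration
  return 'unknown';
}
```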
### Tradeoffs (be honest)

- ✅ Survives process restart (JSON-serializable, tiny payload — typically <1 KB)
- ✅ Works with any LLM provider — replay starts from history
- ✅ No footprintjs core changes
- ⚠️ Loses mid-iteration partial state (acceptable — iterations are atomic)
- ⚠️ Tools inside the failed iteration re-execute on resume. For idempotent tools (read-only DB queries) this is fine; for non-idempotent tools (charge card, send email) consumers MUST add their own idempotency keys. v2.10.3+ may add `toolCallId`-based dedup.
## Composing all three

The three primitives stack naturally — each sits at a different layer of the failure stack:
```ts
import { Agent, RunCheckpointError } from 'agentfootprint';
import { anthropic, openai } from 'agentfootprint/llm-providers';
import { withCircuitBreaker, withFallback } from 'agentfootprint/resilience';
import { z } from 'zod';

// Layer 1: provider — circuit-breaker + multi-vendor fallback
const provider = withFallback(
  withCircuitBreaker(anthropic({ apiKey })),
  withCircuitBreaker(openai({ apiKey })),
);

// Layer 2: agent — output-schema + canned safety-net
const agent = Agent.create({ provider, model: 'claude-3-5-sonnet' })
  .system('You decide refund amounts.')
  .outputSchema(z.object({ amount: z.number(), reason: z.string() }))
  .outputFallback({
    fallback: async () => ({ amount: 0, reason: 'manual review' }),
    canned: { amount: 0, reason: 'unable to process' },
  })
  .build();

// Layer 3: caller — fault-tolerant resume from checkpoint
async function processRefund(message: string, sessionId: string) {
  try {
    return await agent.runTyped({ message });
  } catch (err) {
    if (err instanceof RunCheckpointError) {
      await checkpointStore.put(sessionId, err.checkpoint);
      throw err; // queue picks up, retries with resumeOnError
    }
    throw err;
  }
}

// Worker that drains the queue (separate process, fresh container, hours later)
async function resumeWorker(sessionId: string) {
  const checkpoint = await checkpointStore.get(sessionId);
  return await agent.resumeOnError(checkpoint);
}
```

Three primitives, three failure modes, one substrate.
## See also

- `examples/features/08-reliability.ts` — runnable end-to-end example (this guide's source of truth)
- Error handling guide — typed errors at the SDK boundary; precursor to the Reliability primitives
- Observability guide — wire `onStateChange` / `output_canned_used` / `failurePoint.phase` into your observability backend