Reliability

Three primitives for production agents — circuit breaker for vendor outages, output-fallback for malformed LLM JSON, resume-on-error for mid-run failure recovery.

Your agent is in production. The LLM provider 503s for 4 minutes. Half your requests crash, the other half take 30 seconds because withRetry keeps hammering. Then a model rollout starts emitting prose instead of JSON for 2% of outputs. Then a container restarts mid-iteration and 47 in-flight runs vanish. agentfootprint v2.10.x ships three primitives that make each of those a non-event.

The Reliability subsystem (v2.10.x — three primitives, three failure modes)

| Primitive | Release | What it solves | Where it lives |
| --- | --- | --- | --- |
| withCircuitBreaker(provider, opts) | v2.10.0 | Vendor outage detection; fail-fast in <5µs after N consecutive failures | agentfootprint/resilience |
| .outputFallback({ fallback, canned }) | v2.10.1 | Schema-validation failure; 3-tier degradation (primary → fallback → canned) | builder method on Agent.create({...}) |
| agent.resumeOnError(checkpoint) | v2.10.2 | Mid-run failure recovery; checkpoint at iteration boundaries, persist anywhere, resume hours/restarts later | agentfootprint (main barrel) |

All three compose. All three follow the same library convention: opt-in (no behavior change for existing consumers), typed events for observability, fail-open by construction when configured.

End-to-end example

The single example below exercises all three primitives. The docs site live-imports the source so this snippet stays in sync with the runnable file.

/**
 * 08 — Reliability subsystem: CircuitBreaker + outputFallback + resumeOnError.
 *
 * Demonstrates all 3 pieces of the v2.10.x Reliability subsystem
 * end-to-end. Each piece solves a distinct production failure mode:
 *
 *   1. **CircuitBreaker** — vendor outage detection. Wrap the LLM
 *      provider in `withCircuitBreaker(...)`. After N consecutive
 *      failures, the breaker OPENS and fails fast (sub-µs) so
 *      `withFallback` can route to the secondary provider without
 *      wasting 3 retries × backoff per request.
 *
 *   2. **outputFallback** — schema-validation failure. Pair with
 *      `.outputSchema(parser)`. When the LLM emits malformed JSON
 *      after maxIterations, fall through to the consumer's
 *      `fallback(err, raw)` function, then to the static `canned`
 *      safety net. Agent NEVER throws on output failure when canned
 *      is set.
 *
 *   3. **resumeOnError** — mid-run failure recovery. When LLM 503s
 *      mid-iteration, the agent throws `RunCheckpointError` carrying
 *      the conversation history at the last completed iteration.
 *      Persist the checkpoint to Redis/Postgres/S3, restart the
 *      process, call `agent.resumeOnError(checkpoint)` to continue
 *      from where it failed.
 *
 * Run:  npx tsx examples/features/08-reliability.ts
 */
import { z } from 'zod';
import { Agent, RunCheckpointError } from '../../src/index.js';
import {
  withCircuitBreaker,
  withFallback,
  CircuitOpenError,
} from '../../src/resilience/index.js';
import type { AgentRunCheckpoint } from '../../src/index.js';
import type { LLMProvider, LLMRequest, LLMResponse } from '../../src/adapters/types.js';
import { mock } from '../../src/adapters/llm/MockProvider.js';
import { isCliEntry, printResult, type ExampleMeta } from '../helpers/cli.js';

export const meta: ExampleMeta = {
  id: 'features/08-reliability',
  title: 'Reliability — CircuitBreaker + outputFallback + resumeOnError',
  group: 'features',
  description:
    'End-to-end demo of the v2.10.x Reliability subsystem: vendor-outage circuit breaker, 3-tier output-schema degradation, and fault-tolerant mid-run resume from JSON-serializable checkpoint.',
  defaultInput: 'process refund #1234 for $50',
  providerSlots: ['feature'],
  tags: ['feature', 'reliability', 'circuit-breaker', 'output-fallback', 'resume-on-error'],
};

// ── Schema for the agent's structured output ─────────────────────────
const Refund = z.object({
  amount: z.number().nonnegative(),
  reason: z.string().min(1),
});
type RefundOutput = z.infer<typeof Refund>;

// ── Helper: provider that fails N times then recovers ────────────────
function flakyProvider(failuresBeforeRecovery: number, name = 'flaky'): LLMProvider {
  let calls = 0;
  return {
    name,
    async complete(_req: LLMRequest): Promise<LLMResponse> {
      calls += 1;
      if (calls <= failuresBeforeRecovery) {
        throw new Error(`vendor 503 (call ${calls})`);
      }
      // After recovery: emit valid JSON for the Refund schema.
      return mock({
        replies: [{ content: JSON.stringify({ amount: 50, reason: 'product defect' }) }],
      }).complete(_req);
    },
  };
}

// ── Three demonstrations ─────────────────────────────────────────────
async function demoCircuitBreaker(): Promise<{ primaryCalls: number; fallbackCalls: number }> {
  // Primary that fails forever, fallback that always succeeds.
  let primaryCalls = 0;
  const primary: LLMProvider = {
    name: 'primary',
    async complete(): Promise<LLMResponse> {
      primaryCalls += 1;
      throw new Error('vendor 503');
    },
  };
  let fallbackCalls = 0;
  const fallback: LLMProvider = {
    name: 'fallback',
    async complete(): Promise<LLMResponse> {
      fallbackCalls += 1;
      return mock({
        replies: [{ content: JSON.stringify({ amount: 0, reason: 'fallback path' }) }],
      }).complete({} as LLMRequest);
    },
  };

  // Wrap primary in a breaker; fallback handles any thrown error.
  const provider = withFallback(
    withCircuitBreaker(primary, { failureThreshold: 2, cooldownMs: 60_000 }),
    fallback,
  );
  const agent = Agent.create({ provider, model: 'mock' })
    .system('You answer refund questions.')
    .outputSchema(Refund)
    .build();

  // Run 5 turns. After 2 primary failures, breaker opens; remaining
  // 3 turns route directly to fallback (primary not called).
  for (let i = 0; i < 5; i++) {
    try {
      await agent.runTyped<RefundOutput>({ message: `query ${i}` });
    } catch {
      // Some early turns may surface CircuitOpenError if the order
      // happens to fire before fallback engages — that's fine.
    }
  }
  return { primaryCalls, fallbackCalls };
}

async function demoOutputFallback(): Promise<{
  result: RefundOutput;
  cannedFired: boolean;
}> {
  // LLM emits prose instead of JSON. With outputFallback, the agent
  // tier-2's into the consumer's fallback fn; if THAT fails, tier-3
  // returns the canned safety-net.
  const provider = mock({ replies: [{ content: 'Sorry, I cannot help with that.' }] });
  let cannedFired = false;
  const agent = Agent.create({ provider, model: 'mock' })
    .system('You decide refund amounts.')
    .outputSchema(Refund)
    .outputFallback({
      // Tier 2: try to recover; let's simulate it failing too.
      fallback: () => {
        throw new Error('fallback also failed (simulated)');
      },
      // Tier 3: guaranteed-valid safety net.
      canned: { amount: 0, reason: 'unable to process — please retry' },
    })
    .build();

  // The resilience event is consumer-side / informational — not in
  // the typed AgentfootprintEventMap. Cast to satisfy the typed
  // dispatcher without losing runtime behavior.
  agent.on('agentfootprint.resilience.output_canned_used' as never, () => {
    cannedFired = true;
  });

  // Caller never sees OutputSchemaError; gets a typed Refund either way.
  const result = await agent.runTyped<RefundOutput>({ message: 'refund please' });
  return { result, cannedFired };
}

async function demoResumeOnError(): Promise<{
  failedAt: string;
  resumeResult: string;
  serializedCheckpointBytes: number;
}> {
  // Provider that succeeds on call 1 (tool call), fails on call 2,
  // then succeeds on call 3 (after resume).
  let calls = 0;
  const provider: LLMProvider = {
    name: 'flaky-then-recovers',
    async complete(): Promise<LLMResponse> {
      calls += 1;
      if (calls === 1) {
        return {
          content: '',
          toolCalls: [{ id: 't1', name: 'lookup', args: { id: '1234' } }],
          usage: { input: 1, output: 1 },
          stopReason: 'tool_use',
        };
      }
      if (calls === 2) {
        throw new Error('transient vendor 503 (mid-iteration)');
      }
      return {
        content: 'refund processed: $50 for product defect',
        toolCalls: [],
        usage: { input: 1, output: 1 },
        stopReason: 'end_turn',
      };
    },
  };
  const agent = Agent.create({ provider, model: 'mock' })
    .system('You process refunds.')
    .tool({
      schema: { name: 'lookup', description: '', inputSchema: { type: 'object' } },
      execute: () => 'order #1234 found',
    })
    .build();

  let captured: AgentRunCheckpoint | undefined;
  let failedAt = '';
  try {
    await agent.run({ message: meta.defaultInput ?? 'process refund' });
  } catch (err) {
    if (err instanceof RunCheckpointError) {
      captured = err.checkpoint;
      failedAt = `iteration ${err.checkpoint.failurePoint?.iteration} (${err.checkpoint.failurePoint?.phase})`;
    } else {
      throw err;
    }
  }
  if (!captured) throw new Error('expected checkpoint');

  // Persist the checkpoint anywhere — JSON-serializable, tiny payload.
  const serialized = JSON.stringify(captured);

  // hours / restart / next deploy later: resume from the checkpoint.
  const result = await agent.resumeOnError(captured);
  return {
    failedAt,
    resumeResult: typeof result === 'string' ? result : '(paused)',
    serializedCheckpointBytes: serialized.length,
  };
}

// ── Main runner ──────────────────────────────────────────────────────
export async function run(input: string): Promise<unknown> {
  void input;
  console.log('\n=== Reliability subsystem demo ===\n');

  console.log('1. CircuitBreaker — vendor outage detection');
  const cb = await demoCircuitBreaker();
  console.log(`   primary calls: ${cb.primaryCalls} (capped by breaker)`);
  console.log(`   fallback calls: ${cb.fallbackCalls} (took over after breaker opened)`);
  // Regression guard: breaker MUST cap primary calls
  if (cb.primaryCalls >= 5) {
    console.error('REGRESSION: breaker did not cap primary calls.');
    process.exit(1);
  }

  console.log('\n2. outputFallback — 3-tier degradation on schema failure');
  const of = await demoOutputFallback();
  console.log(`   result: ${JSON.stringify(of.result)}`);
  console.log(`   canned fired: ${of.cannedFired}`);
  // Regression guard: agent must NOT throw, must return canned shape
  if (of.result.amount !== 0 || !of.result.reason.includes('unable')) {
    console.error('REGRESSION: canned safety-net did not engage.');
    process.exit(1);
  }
  if (!of.cannedFired) {
    console.error('REGRESSION: output_canned_used event did not fire.');
    process.exit(1);
  }

  console.log('\n3. resumeOnError — mid-run failure recovery');
  const ro = await demoResumeOnError();
  console.log(`   failed at: ${ro.failedAt}`);
  console.log(`   checkpoint size: ${ro.serializedCheckpointBytes} bytes (JSON)`);
  console.log(`   resume result: ${ro.resumeResult.slice(0, 60)}…`);
  // Regression guard: resume must complete the run
  if (!ro.resumeResult.includes('refund processed')) {
    console.error('REGRESSION: resumeOnError did not complete the run.');
    process.exit(1);
  }
  if (ro.serializedCheckpointBytes < 50) {
    console.error('REGRESSION: checkpoint suspiciously small.');
    process.exit(1);
  }

  // Touch the unused import so it's clearly part of the example
  // surface even when not exercised in this code path (the
  // CircuitOpenError type is what `withCircuitBreaker` throws when
  // the breaker is OPEN; consumers can `instanceof` check it).
  void CircuitOpenError;

  console.log('\nOK — all 3 reliability primitives behaved as documented.');
  return { circuitBreaker: cb, outputFallback: of, resumeOnError: ro };
}

if (isCliEntry(import.meta.url)) {
  run(meta.defaultInput ?? '').then(printResult).catch(console.error);
}

Sample output:

1. CircuitBreaker — vendor outage detection
   primary calls: 2 (capped by breaker)
   fallback calls: 5 (took over after breaker opened)

2. outputFallback — 3-tier degradation on schema failure
   result: {"amount":0,"reason":"unable to process — please retry"}
   canned fired: true

3. resumeOnError — mid-run failure recovery
   failed at: iteration 2 (iteration)
   checkpoint size: 461 bytes (JSON)
   resume result: refund processed: $50 for product defect…

1. CircuitBreaker — vendor outage detection

withCircuitBreaker wraps any LLMProvider and tracks consecutive failures. After failureThreshold consecutive failures, the breaker OPENS and rejects calls immediately with CircuitOpenError, with no network round-trip. After cooldownMs, it enters HALF-OPEN and admits probe calls: halfOpenSuccessThreshold successes close it, and a single failure re-opens it.

import { anthropic, openai } from 'agentfootprint/llm-providers';
import { withCircuitBreaker, withFallback } from 'agentfootprint/resilience';

const provider = withFallback(
  withCircuitBreaker(anthropic({ apiKey }), {
    failureThreshold: 5,
    cooldownMs: 30_000,
    halfOpenSuccessThreshold: 2,
    onStateChange: (state, why) => log.info(`circuit ${state}: ${why}`),
  }),
  withCircuitBreaker(openai({ apiKey })),
);
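To see why the sample output reports 2 primary calls and 5 fallback calls, here is a minimal, library-free sketch of the same routing. `breakerSketch`, `fallbackSketch`, and `OpenError` are hypothetical stand-ins, not agentfootprint's implementation: the calls are synchronous for brevity, and the cooldown and HALF-OPEN probing are left out.

```typescript
// Illustrative only: hypothetical stand-ins, not agentfootprint's code.
class OpenError extends Error {}

// Synchronous stand-in for an async provider call.
type Call = (prompt: string) => string;

// Counts consecutive failures; once the threshold is hit, rejects without calling.
function breakerSketch(call: Call, failureThreshold: number): Call {
  let consecutiveFailures = 0;
  return (prompt) => {
    if (consecutiveFailures >= failureThreshold) {
      throw new OpenError('circuit open'); // fail fast: the wrapped call is skipped
    }
    try {
      const out = call(prompt);
      consecutiveFailures = 0; // any success resets the streak
      return out;
    } catch (err) {
      consecutiveFailures += 1;
      throw err;
    }
  };
}

// Tries the primary; on any thrown error, routes to the secondary.
function fallbackSketch(primary: Call, secondary: Call): Call {
  return (prompt) => {
    try {
      return primary(prompt);
    } catch {
      return secondary(prompt);
    }
  };
}

let primaryCalls = 0;
let fallbackCalls = 0;
const alwaysDown: Call = () => {
  primaryCalls += 1;
  throw new Error('vendor 503');
};
const alwaysUp: Call = () => {
  fallbackCalls += 1;
  return 'ok from fallback';
};

const route = fallbackSketch(breakerSketch(alwaysDown, 2), alwaysUp);
for (let i = 0; i < 5; i++) route(`query ${i}`);

console.log(primaryCalls, fallbackCalls); // 2 5
```

The first two requests reach the failing primary and then fall back; the remaining three are rejected by the open breaker before the primary is touched, so only the fallback runs.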

Why on top of withRetry?

During a multi-minute Anthropic outage, withRetry keeps hammering the same provider with backoff. Each request burns 3 retries × backoff (roughly 3 seconds) before falling back; multiplied by your QPS, that is a lot of wasted time and tokens. The breaker says "we just saw 5 failures in a row; stop calling for 30 seconds." Subsequent requests fail in <5µs (10k OPEN-state rejections in <50ms), and withFallback routes to OpenAI immediately.
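A back-of-the-envelope calculation puts rough numbers on that. It uses the figures quoted above (about 3 seconds wasted per request, <5µs per OPEN-state rejection) and the 4-minute outage from the introduction. The 50 QPS value and the function name are assumptions for illustration, and the breaker figure ignores the first few failures that trip the breaker as well as any HALF-OPEN probes.

```typescript
// Rough arithmetic only; QPS is a hypothetical traffic level.
function wastedRequestSeconds(
  qps: number,
  outageSeconds: number,
  perRequestSeconds: number,
): number {
  return qps * outageSeconds * perRequestSeconds;
}

const QPS = 50; // assumption
const OUTAGE_SECONDS = 4 * 60; // the 4-minute outage from the introduction

const retryOnly = wastedRequestSeconds(QPS, OUTAGE_SECONDS, 3); // ~3 s per request
const withBreaker = wastedRequestSeconds(QPS, OUTAGE_SECONDS, 5e-6); // <5µs per rejection

console.log(`retry only:   ${retryOnly} request-seconds (${retryOnly / 3600} hours)`);
console.log(`with breaker: ${withBreaker.toFixed(2)} request-seconds`);
```

Under these assumptions the retry-only path accumulates 36,000 request-seconds (10 hours) of waiting across the outage, while the OPEN-state rejections add up to well under one second.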

Three states with explicit transitions

CLOSED ──[ N consecutive failures ]──► OPEN
   ▲                                    │
   │                                    │ [cooldownMs elapsed]
   │                                    ▼
   └──[ M probe successes ]──── HALF-OPEN
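The diagram can be read as a small state machine. The sketch below is one way to write it down and is not the library's implementation: `SketchBreaker` and its method names are hypothetical, only the three option names come from the docs above, and the injectable clock is an assumption added so the transitions can be exercised without waiting.

```typescript
// Illustrative state machine only; NOT agentfootprint's implementation.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface BreakerOptions {
  failureThreshold: number; // N consecutive failures that open the breaker
  cooldownMs: number; // how long OPEN lasts before probing
  halfOpenSuccessThreshold: number; // M probe successes that close the breaker
}

class SketchBreaker {
  private state: BreakerState = 'CLOSED';
  private consecutiveFailures = 0;
  private probeSuccesses = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions, now: () => number = () => Date.now()) {
    this.opts = opts;
    this.now = now;
  }

  /** Current state; applies OPEN -> HALF_OPEN once the cooldown has elapsed. */
  current(): BreakerState {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.opts.cooldownMs) {
      this.state = 'HALF_OPEN';
      this.probeSuccesses = 0;
    }
    return this.state;
  }

  /** True when a call may go through (CLOSED or HALF_OPEN). */
  allows(): boolean {
    return this.current() !== 'OPEN';
  }

  recordSuccess(): void {
    if (this.current() === 'HALF_OPEN') {
      this.probeSuccesses += 1;
      if (this.probeSuccesses >= this.opts.halfOpenSuccessThreshold) {
        this.state = 'CLOSED'; // M probe successes close the breaker
        this.consecutiveFailures = 0;
      }
    } else {
      this.consecutiveFailures = 0; // a success breaks the failure streak
    }
  }

  recordFailure(): void {
    const state = this.current();
    if (state === 'HALF_OPEN') {
      this.open(); // one failed probe re-opens
      return;
    }
    if (state === 'OPEN') return; // already open; nothing to count
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.opts.failureThreshold) this.open();
  }

  private open(): void {
    this.state = 'OPEN';
    this.openedAt = this.now();
    this.consecutiveFailures = 0;
  }
}
```

A wrapper would call `allows()` before each provider call, reject immediately when it returns false, and report the outcome through `recordSuccess()` or `recordFailure()`.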

Per-instance, NOT distributed

Each withCircuitBreaker(...) call holds its own breaker state in process memory. If you run 100 server replicas, each has its own independent breaker (matches Hystrix default). For cluster-wide coordination, layer your own Redis-backed counter via the onStateChange hook + shouldCount predicate.

2. outputFallback — 3-tier degradation on schema failure

When the agent has an outputSchema AND the LLM emits malformed JSON, the agent normally throws OutputSchemaError. With .outputFallback({ fallback, canned }), it degrades through up to three tiers instead:

import { Agent } from 'agentfootprint';
import { z } from 'zod';
const Refund = z.object({ amount: z.number().nonnegative(), reason: z.string().min(1) });

const agent = Agent.create({...})
  .outputSchema(Refund)
  .outputFallback({
    fallback: async (err, raw) => ({ amount: 0, reason: 'manual review' }),
    canned:   { amount: 0, reason: 'unable to process' },
  })
  .build();

// Caller never sees OutputSchemaError; gets a typed Refund either way.
const refund = await agent.runTyped({ message: '...' });

Three tiers

| Tier | When | Returns |
| --- | --- | --- |
| Primary | LLM emits valid JSON | LLM's parsed value |
| Fallback | Schema validation failure | fallback(err, raw) — re-validated against schema |
| Canned | Fallback throws OR returns invalid | Static safety-net (validated at builder time) |
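The tier order can be sketched without the library. `resolveOutput`, `validateRefund`, and the `tier` field below are hypothetical names for illustration; the hand-written validator stands in for the zod schema, and the fallback is synchronous here although the example above passes an async function.

```typescript
// Illustrative only: a library-free sketch of primary -> fallback -> canned.
type Validator<T> = (value: unknown) => T; // throws when the value is invalid
type Tier = 'primary' | 'fallback' | 'canned';

interface FallbackConfig<T> {
  fallback?: (err: unknown, raw: string) => unknown;
  canned?: T;
}

function resolveOutput<T>(
  raw: string,
  validate: Validator<T>,
  cfg: FallbackConfig<T>,
): { value: T; tier: Tier } {
  let lastErr: unknown;
  try {
    // Tier 1: the LLM's own output, parsed and validated.
    return { value: validate(JSON.parse(raw)), tier: 'primary' };
  } catch (err) {
    lastErr = err;
  }
  if (cfg.fallback) {
    try {
      // Tier 2: the consumer's fallback, re-validated against the schema.
      return { value: validate(cfg.fallback(lastErr, raw)), tier: 'fallback' };
    } catch (err) {
      lastErr = err;
    }
  }
  // Tier 3: the static safety net, if one was configured.
  if (cfg.canned !== undefined) return { value: cfg.canned, tier: 'canned' };
  throw lastErr; // no canned value: the error reaches the caller
}

interface Refund {
  amount: number;
  reason: string;
}

// Hand-written stand-in for the zod Refund schema.
function validateRefund(value: unknown): Refund {
  if (typeof value !== 'object' || value === null) throw new Error('expected an object');
  const v = value as Partial<Refund>;
  if (typeof v.amount !== 'number' || v.amount < 0) throw new Error('invalid amount');
  if (typeof v.reason !== 'string' || v.reason.length === 0) throw new Error('invalid reason');
  return { amount: v.amount, reason: v.reason };
}

const degraded = resolveOutput('Sorry, I cannot help with that.', validateRefund, {
  fallback: () => {
    throw new Error('fallback also failed');
  },
  canned: { amount: 0, reason: 'unable to process' },
});
console.log(degraded.tier); // canned
```

Prose fails `JSON.parse`, the fallback throws, and the canned value is returned, which mirrors the path exercised in the end-to-end example.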

Two typed events for observability

  • agentfootprint.resilience.output_fallback_triggered
  • agentfootprint.resilience.output_canned_used

Builder-time canned validation

The canned value is parsed against the schema when .outputFallback({...}) is called, and a misconfigured value throws TypeError immediately. The mistake surfaces in CI or dev, not at 3am when the fallback engages.
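As a rough illustration of that eager check, the sketch below validates a canned value at configuration time and converts any failure into a TypeError. `checkedCanned` and `validateRefundShape` are hypothetical helpers, not library functions, and the hand-written validator stands in for a zod schema.

```typescript
// Illustrative only: eager validation of the canned value at configuration time.
type Validator<T> = (value: unknown) => T; // throws when the value is invalid

function checkedCanned<T>(validate: Validator<T>, canned: unknown): T {
  try {
    return validate(canned);
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    // Fail now, while a developer is watching, rather than when the fallback engages.
    throw new TypeError(`canned value does not satisfy the output schema: ${detail}`);
  }
}

interface RefundShape {
  amount: number;
  reason: string;
}

// Hand-written stand-in for the zod Refund schema.
function validateRefundShape(value: unknown): RefundShape {
  if (typeof value !== 'object' || value === null) throw new Error('expected an object');
  const v = value as Partial<RefundShape>;
  if (typeof v.amount !== 'number' || v.amount < 0) {
    throw new Error('amount must be a non-negative number');
  }
  if (typeof v.reason !== 'string' || v.reason.length === 0) {
    throw new Error('reason must be a non-empty string');
  }
  return { amount: v.amount, reason: v.reason };
}

// Valid: returned as a typed value.
const safeNet = checkedCanned(validateRefundShape, { amount: 0, reason: 'unable to process' });
console.log(safeNet.reason);

// Invalid: a negative amount is rejected before any request is served.
try {
  checkedCanned(validateRefundShape, { amount: -5, reason: 'oops' });
} catch (err) {
  console.log(err instanceof TypeError); // true
}
```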

Fail-open vs fail-closed

  • With canned → agent NEVER throws on output failure (fail-open)
  • Without canned → fallback errors propagate to caller (fail-closed)

Consumer chooses.

3. resumeOnError — mid-run failure recovery

When agent.run() throws on a recoverable error mid-iteration, you get back a RunCheckpointError carrying a JSON-serializable checkpoint of the conversation history at the last completed iteration boundary. Persist it anywhere; resume later.

import { Agent, RunCheckpointError } from 'agentfootprint';

try {
  const result = await agent.run({ message: 'long task' });
} catch (err) {
  if (err instanceof RunCheckpointError) {
    // Persist anywhere — Redis, Postgres, S3, queue, file.
    await checkpointStore.put(sessionId, err.checkpoint);

    // hours / restart / new process / next deploy later:
    const checkpoint = await checkpointStore.get(sessionId);
    const result = await agent.resumeOnError(checkpoint);
  } else {
    throw err; // not recoverable — propagate
  }
}
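The snippet above leaves `checkpointStore` undefined. One possible shape for it, as a sketch: the `CheckpointStore` interface and the in-memory class below are hypothetical (the library does not prescribe a store), and the class keeps JSON text rather than live objects so that it behaves like a Redis, Postgres, or S3 backend would.

```typescript
// Hypothetical store shape; agentfootprint does not define this interface.
interface CheckpointStore<C> {
  put(sessionId: string, checkpoint: C): Promise<void>;
  get(sessionId: string): Promise<C | undefined>;
}

class InMemoryCheckpointStore<C> implements CheckpointStore<C> {
  private readonly rows = new Map<string, string>();

  async put(sessionId: string, checkpoint: C): Promise<void> {
    // Store serialized JSON, exactly what an external backend would receive.
    this.rows.set(sessionId, JSON.stringify(checkpoint));
  }

  async get(sessionId: string): Promise<C | undefined> {
    const row = this.rows.get(sessionId);
    return row === undefined ? undefined : (JSON.parse(row) as C);
  }
}
```

Because `get` can return `undefined` for an unknown or expired session, a caller would want to handle that case before passing the value to `agent.resumeOnError(...)`.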

AgentRunCheckpoint shape (JSON-serializable, stable)

| Field | Type | Purpose |
| --- | --- | --- |
| version | 1 (literal) | Forward-compat guard |
| runId | string | Original runId — observability correlation |
| history | LLMMessage[] | Conversation at last completed iteration |
| lastCompletedIteration | number | Where to resume from |
| originalInput | { message: string } | User's original ask |
| checkpointedAt | number (ms) | Wall clock when captured |
| failurePoint? | { iteration, phase } | Where the failure happened (oncall triage) |
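Transcribed into TypeScript, the table reads roughly as follows. This is a sketch reconstructed from the table, not the library's exported type: the `Sketch` names are hypothetical, the `role`/`content` message shape and the numeric `iteration` are assumptions, and `parseCheckpoint` only illustrates how the `version` field can guard deserialization.

```typescript
// Reconstructed from the table above; NOT the library's exported type.
type FailurePhase = 'llm' | 'tool' | 'iteration' | 'unknown';

// Assumption: the real LLMMessage type is richer than role + content.
interface LLMMessageSketch {
  role: string;
  content: string;
}

interface AgentRunCheckpointSketch {
  version: 1;
  runId: string;
  history: LLMMessageSketch[];
  lastCompletedIteration: number;
  originalInput: { message: string };
  checkpointedAt: number; // wall clock, in ms
  failurePoint?: { iteration: number; phase: FailurePhase };
}

// Hypothetical helper: reject payloads this code does not understand.
function parseCheckpoint(json: string): AgentRunCheckpointSketch {
  const value: unknown = JSON.parse(json);
  if (typeof value !== 'object' || value === null) {
    throw new TypeError('checkpoint must be a JSON object');
  }
  const c = value as Partial<AgentRunCheckpointSketch>;
  if (c.version !== 1) {
    throw new TypeError(`unsupported checkpoint version: ${String(c.version)}`);
  }
  if (typeof c.runId !== 'string' || !Array.isArray(c.history)) {
    throw new TypeError('checkpoint is missing runId or history');
  }
  return c as AgentRunCheckpointSketch;
}

const sample: AgentRunCheckpointSketch = {
  version: 1,
  runId: 'run-1',
  history: [{ role: 'user', content: 'process refund #1234 for $50' }],
  lastCompletedIteration: 1,
  originalInput: { message: 'process refund #1234 for $50' },
  checkpointedAt: 0,
  failurePoint: { iteration: 2, phase: 'iteration' },
};
const roundTripped = parseCheckpoint(JSON.stringify(sample));
console.log(roundTripped.failurePoint?.phase); // iteration
```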

Failure-phase classifier

RunCheckpointError.checkpoint.failurePoint.phase is one of 'llm' | 'tool' | 'iteration' | 'unknown'. The classifier recognizes CircuitOpenError (v2.10.0), AnthropicError, OpenAIError, and BedrockError. The phase feeds straight into oncall postmortems: "how often did we lose runs to LLM phase failures last week?"
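Answering that postmortem question is a matter of tallying stored checkpoints by phase. A minimal sketch, assuming the checkpoints have already been loaded from wherever they were persisted; `countByPhase` and `CheckpointLike` are hypothetical names, and treating a missing `failurePoint` as 'unknown' is a choice made here, not documented library behavior.

```typescript
// Illustrative tally for postmortems; not part of agentfootprint.
type FailurePhase = 'llm' | 'tool' | 'iteration' | 'unknown';

interface CheckpointLike {
  failurePoint?: { iteration: number; phase: FailurePhase };
}

function countByPhase(checkpoints: CheckpointLike[]): Record<FailurePhase, number> {
  const counts: Record<FailurePhase, number> = { llm: 0, tool: 0, iteration: 0, unknown: 0 };
  for (const checkpoint of checkpoints) {
    // Assumption: a checkpoint without failurePoint is counted as 'unknown'.
    counts[checkpoint.failurePoint?.phase ?? 'unknown'] += 1;
  }
  return counts;
}

const lastWeek: CheckpointLike[] = [
  { failurePoint: { iteration: 2, phase: 'llm' } },
  { failurePoint: { iteration: 1, phase: 'llm' } },
  { failurePoint: { iteration: 3, phase: 'tool' } },
  {},
];
console.log(countByPhase(lastWeek)); // { llm: 2, tool: 1, iteration: 0, unknown: 1 }
```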

Tradeoffs (be honest)

  • ✅ Survives process restart (JSON-serializable, tiny payload — typically <1 KB)
  • ✅ Works with any LLM provider — replay starts from history
  • ✅ No footprintjs core changes
  • ⚠️ Loses mid-iteration partial state (acceptable — iterations are atomic)
  • ⚠️ Tools inside the failed iteration re-execute on resume. For idempotent tools (read-only DB queries) this is fine; for non-idempotent tools (charge card, send email) consumers MUST add their own idempotency keys. v2.10.3+ may add toolCallId-based dedup.
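For the non-idempotent case, one common approach is to derive an idempotency key from the tool arguments and record completed calls against it. The sketch below shows the idea with a hypothetical `withIdempotency` wrapper around a synchronous tool function. It is not a library feature, and the in-memory `Map` is an assumption for brevity: to survive a process restart, the record would have to live in a durable store alongside the checkpoint.

```typescript
// Illustrative only: consumer-side dedup for tools that must not run twice.
type ToolFn<A, R> = (args: A) => R;

function withIdempotency<A, R>(
  execute: ToolFn<A, R>,
  keyOf: (args: A) => string,
  completed: Map<string, R> = new Map<string, R>(), // assumption: swap for a durable store
): ToolFn<A, R> {
  return (args) => {
    const key = keyOf(args);
    if (completed.has(key)) {
      // Replay after resume: return the recorded result, skip the side effect.
      return completed.get(key) as R;
    }
    const result = execute(args);
    completed.set(key, result);
    return result;
  };
}

let charges = 0;
const chargeCard = withIdempotency(
  (args: { orderId: string; cents: number }) => {
    charges += 1; // stands in for the real side effect
    return `charged ${args.cents} cents for order ${args.orderId}`;
  },
  (args) => `charge:${args.orderId}`,
);

chargeCard({ orderId: '1234', cents: 5000 }); // original run: executes
chargeCard({ orderId: '1234', cents: 5000 }); // re-executed on resume: deduplicated
console.log(charges); // 1
```

Keying on the order ID rather than a per-call ID is what makes the replayed call collapse into the original one; a key that changes on every attempt would defeat the purpose.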

Composing all three

The three primitives stack naturally, each covering a different layer of the failure stack:

import { Agent, RunCheckpointError } from 'agentfootprint';
import { anthropic, openai } from 'agentfootprint/llm-providers';
import { withCircuitBreaker, withFallback } from 'agentfootprint/resilience';
import { z } from 'zod';

// Layer 1: provider — circuit-breaker + multi-vendor fallback
const provider = withFallback(
  withCircuitBreaker(anthropic({ apiKey })),
  withCircuitBreaker(openai({ apiKey })),
);

// Layer 2: agent — output-schema + canned safety-net
const agent = Agent.create({ provider, model: 'claude-3-5-sonnet' })
  .system('You decide refund amounts.')
  .outputSchema(z.object({ amount: z.number(), reason: z.string() }))
  .outputFallback({
    fallback: async () => ({ amount: 0, reason: 'manual review' }),
    canned:   { amount: 0, reason: 'unable to process' },
  })
  .build();

// Layer 3: caller — fault-tolerant resume from checkpoint
async function processRefund(message: string, sessionId: string) {
  try {
    return await agent.runTyped({ message });
  } catch (err) {
    if (err instanceof RunCheckpointError) {
      await checkpointStore.put(sessionId, err.checkpoint);
      throw err; // queue picks up, retries with resumeOnError
    }
    throw err;
  }
}

// Worker that drains the queue (separate process, fresh container, hours later)
async function resumeWorker(sessionId: string) {
  const checkpoint = await checkpointStore.get(sessionId);
  return await agent.resumeOnError(checkpoint);
}

Three primitives, three failure modes, one substrate.
