# Reliability
Your agent is in production. The LLM provider 503s for 4 minutes. Half your requests crash; the other half take 30 seconds because `withRetry` keeps hammering. Then a model rollout starts emitting prose instead of JSON for 2% of outputs. Then a container restarts mid-iteration and 47 in-flight runs vanish. agentfootprint v2.10.x ships three primitives that make each of those a non-event.
## The Reliability subsystem (v2.10.x — three primitives, three failure modes)

| Primitive | Release | What it solves | Where it lives |
|---|---|---|---|
| `withCircuitBreaker(provider, opts)` | v2.10.0 | Vendor outage detection; fail-fast in <5µs after N consecutive failures | `agentfootprint/resilience` |
| `.outputFallback({ fallback, canned })` | v2.10.1 | Schema-validation failure; 3-tier degradation (primary → fallback → canned) | builder method on `Agent.create({...})` |
| `agent.resumeOnError(checkpoint)` | v2.10.2 | Mid-run failure recovery; checkpoint at iteration boundaries, persist anywhere, resume hours/restarts later | `agentfootprint` (main barrel) |
All three compose. All three follow the same library convention: opt-in (no behavior change for existing consumers), typed events for observability, fail-open by construction when configured.
## End-to-end example

The single example below exercises all three primitives. The docs site live-imports the source, so this snippet stays in sync with the runnable file.
```ts
/**
 * 08 — Reliability subsystem: CircuitBreaker + outputFallback + resumeOnError.
 *
 * Demonstrates all 3 pieces of the v2.10.x Reliability subsystem
 * end-to-end. Each piece solves a distinct production failure mode:
 *
 * 1. **CircuitBreaker** — vendor outage detection. Wrap the LLM
 *    provider in `withCircuitBreaker(...)`. After N consecutive
 *    failures, the breaker OPENS and fails fast (sub-µs) so
 *    `withFallback` can route to the secondary provider without
 *    wasting 3 retries × backoff per request.
 *
 * 2. **outputFallback** — schema-validation failure. Pair with
 *    `.outputSchema(parser)`. When the LLM emits malformed JSON
 *    after maxIterations, fall through to the consumer's
 *    `fallback(err, raw)` function, then to the static `canned`
 *    safety net. Agent NEVER throws on output failure when canned
 *    is set.
 *
 * 3. **resumeOnError** — mid-run failure recovery. When the LLM 503s
 *    mid-iteration, the agent throws `RunCheckpointError` carrying
 *    the conversation history at the last completed iteration.
 *    Persist the checkpoint to Redis/Postgres/S3, restart the
 *    process, call `agent.resumeOnError(checkpoint)` to continue
 *    from where it failed.
 *
 * Run: npx tsx examples/features/08-reliability.ts
 */
import { z } from 'zod';
import { Agent, RunCheckpointError } from '../../src/index.js';
import {
  withCircuitBreaker,
  withFallback,
  CircuitOpenError,
} from '../../src/resilience/index.js';
import type { AgentRunCheckpoint } from '../../src/index.js';
import type { LLMProvider, LLMRequest, LLMResponse } from '../../src/adapters/types.js';
import { mock } from '../../src/adapters/llm/MockProvider.js';
import { isCliEntry, printResult, type ExampleMeta } from '../helpers/cli.js';

export const meta: ExampleMeta = {
  id: 'features/08-reliability',
  title: 'Reliability — CircuitBreaker + outputFallback + resumeOnError',
  group: 'features',
  description:
    'End-to-end demo of the v2.10.x Reliability subsystem: vendor-outage circuit breaker, 3-tier output-schema degradation, and fault-tolerant mid-run resume from JSON-serializable checkpoint.',
  defaultInput: 'process refund #1234 for $50',
  providerSlots: ['feature'],
  tags: ['feature', 'reliability', 'circuit-breaker', 'output-fallback', 'resume-on-error'],
};

// ── Schema for the agent's structured output ─────────────────────────
const Refund = z.object({
  amount: z.number().nonnegative(),
  reason: z.string().min(1),
});
type RefundOutput = z.infer<typeof Refund>;

// ── Helper: provider that fails N times then recovers ────────────────
function flakyProvider(failuresBeforeRecovery: number, name = 'flaky'): LLMProvider {
  let calls = 0;
  return {
    name,
    async complete(_req: LLMRequest): Promise<LLMResponse> {
      calls += 1;
      if (calls <= failuresBeforeRecovery) {
        throw new Error(`vendor 503 (call ${calls})`);
      }
      // After recovery: emit valid JSON for the Refund schema.
      return mock({
        replies: [{ content: JSON.stringify({ amount: 50, reason: 'product defect' }) }],
      }).complete(_req);
    },
  };
}

// ── Three demonstrations ─────────────────────────────────────────────
async function demoCircuitBreaker(): Promise<{ primaryCalls: number; fallbackCalls: number }> {
  // Primary that fails forever, fallback that always succeeds.
  let primaryCalls = 0;
  const primary: LLMProvider = {
    name: 'primary',
    async complete(): Promise<LLMResponse> {
      primaryCalls += 1;
      throw new Error('vendor 503');
    },
  };
  let fallbackCalls = 0;
  const fallback: LLMProvider = {
    name: 'fallback',
    async complete(): Promise<LLMResponse> {
      fallbackCalls += 1;
      return mock({
        replies: [{ content: JSON.stringify({ amount: 0, reason: 'fallback path' }) }],
      }).complete({} as LLMRequest);
    },
  };

  // Wrap primary in a breaker; fallback handles any thrown error.
  const provider = withFallback(
    withCircuitBreaker(primary, { failureThreshold: 2, cooldownMs: 60_000 }),
    fallback,
  );

  const agent = Agent.create({ provider, model: 'mock' })
    .system('You answer refund questions.')
    .outputSchema(Refund)
    .build();

  // Run 5 turns. After 2 primary failures, breaker opens; remaining
  // 3 turns route directly to fallback (primary not called).
  for (let i = 0; i < 5; i++) {
    try {
      await agent.runTyped<RefundOutput>({ message: `query ${i}` });
    } catch {
      // Some early turns may surface CircuitOpenError if the breaker
      // happens to fire before the fallback engages — that's fine.
    }
  }
  return { primaryCalls, fallbackCalls };
}

async function demoOutputFallback(): Promise<{
  result: RefundOutput;
  cannedFired: boolean;
}> {
  // LLM emits prose instead of JSON. With outputFallback, the agent
  // tier-2's into the consumer's fallback fn; if THAT fails, tier-3
  // returns the canned safety-net.
  const provider = mock({ replies: [{ content: 'Sorry, I cannot help with that.' }] });
  let cannedFired = false;

  const agent = Agent.create({ provider, model: 'mock' })
    .system('You decide refund amounts.')
    .outputSchema(Refund)
    .outputFallback({
      // Tier 2: try to recover; let's simulate it failing too.
      fallback: () => {
        throw new Error('fallback also failed (simulated)');
      },
      // Tier 3: guaranteed-valid safety net.
      canned: { amount: 0, reason: 'unable to process — please retry' },
    })
    .build();

  // The resilience event is consumer-side / informational — not in
  // the typed AgentfootprintEventMap. Cast to satisfy the typed
  // dispatcher without losing runtime behavior.
  agent.on('agentfootprint.resilience.output_canned_used' as never, () => {
    cannedFired = true;
  });

  // Caller never sees OutputSchemaError; gets a typed Refund either way.
  const result = await agent.runTyped<RefundOutput>({ message: 'refund please' });
  return { result, cannedFired };
}

async function demoResumeOnError(): Promise<{
  failedAt: string;
  resumeResult: string;
  serializedCheckpointBytes: number;
}> {
  // Provider that succeeds on call 1 (tool call), fails on call 2,
  // then succeeds on call 3 (after resume).
  let calls = 0;
  const provider: LLMProvider = {
    name: 'flaky-then-recovers',
    async complete(): Promise<LLMResponse> {
      calls += 1;
      if (calls === 1) {
        return {
          content: '',
          toolCalls: [{ id: 't1', name: 'lookup', args: { id: '1234' } }],
          usage: { input: 1, output: 1 },
          stopReason: 'tool_use',
        };
      }
      if (calls === 2) {
        throw new Error('transient vendor 503 (mid-iteration)');
      }
      return {
        content: 'refund processed: $50 for product defect',
        toolCalls: [],
        usage: { input: 1, output: 1 },
        stopReason: 'end_turn',
      };
    },
  };

  const agent = Agent.create({ provider, model: 'mock' })
    .system('You process refunds.')
    .tool({
      schema: { name: 'lookup', description: '', inputSchema: { type: 'object' } },
      execute: () => 'order #1234 found',
    })
    .build();

  let captured: AgentRunCheckpoint | undefined;
  let failedAt = '';
  try {
    await agent.run({ message: meta.defaultInput ?? 'process refund' });
  } catch (err) {
    if (err instanceof RunCheckpointError) {
      captured = err.checkpoint;
      failedAt = `iteration ${err.checkpoint.failurePoint?.iteration} (${err.checkpoint.failurePoint?.phase})`;
    } else {
      throw err;
    }
  }
  if (!captured) throw new Error('expected checkpoint');

  // Persist the checkpoint anywhere — JSON-serializable, tiny payload.
  const serialized = JSON.stringify(captured);

  // hours / restart / next deploy later: resume from the checkpoint.
  const result = await agent.resumeOnError(captured);
  return {
    failedAt,
    resumeResult: typeof result === 'string' ? result : '(paused)',
    serializedCheckpointBytes: serialized.length,
  };
}

// ── Main runner ──────────────────────────────────────────────────────
export async function run(input: string): Promise<unknown> {
  void input;
  console.log('\n=== Reliability subsystem demo ===\n');

  console.log('1. CircuitBreaker — vendor outage detection');
  const cb = await demoCircuitBreaker();
  console.log(`  primary calls:  ${cb.primaryCalls} (capped by breaker)`);
  console.log(`  fallback calls: ${cb.fallbackCalls} (took over after breaker opened)`);
  // Regression guard: breaker MUST cap primary calls
  if (cb.primaryCalls >= 5) {
    console.error('REGRESSION: breaker did not cap primary calls.');
    process.exit(1);
  }

  console.log('\n2. outputFallback — 3-tier degradation on schema failure');
  const of = await demoOutputFallback();
  console.log(`  result: ${JSON.stringify(of.result)}`);
  console.log(`  canned fired: ${of.cannedFired}`);
  // Regression guard: agent must NOT throw, must return canned shape
  if (of.result.amount !== 0 || !of.result.reason.includes('unable')) {
    console.error('REGRESSION: canned safety-net did not engage.');
    process.exit(1);
  }
  if (!of.cannedFired) {
    console.error('REGRESSION: output_canned_used event did not fire.');
    process.exit(1);
  }

  console.log('\n3. resumeOnError — mid-run failure recovery');
  const ro = await demoResumeOnError();
  console.log(`  failed at: ${ro.failedAt}`);
  console.log(`  checkpoint size: ${ro.serializedCheckpointBytes} bytes (JSON)`);
  console.log(`  resume result: ${ro.resumeResult.slice(0, 60)}…`);
  // Regression guard: resume must complete the run
  if (!ro.resumeResult.includes('refund processed')) {
    console.error('REGRESSION: resumeOnError did not complete the run.');
    process.exit(1);
  }
  if (ro.serializedCheckpointBytes < 50) {
    console.error('REGRESSION: checkpoint suspiciously small.');
    process.exit(1);
  }

  // Touch the unused import so it's clearly part of the example
  // surface even when not exercised in this code path (the
  // CircuitOpenError type is what `withCircuitBreaker` throws when
  // the breaker is OPEN; consumers can `instanceof` check it).
  void CircuitOpenError;

  console.log('\nOK — all 3 reliability primitives behaved as documented.');
  return { circuitBreaker: cb, outputFallback: of, resumeOnError: ro };
}

if (isCliEntry(import.meta.url)) {
  run(meta.defaultInput ?? '').then(printResult).catch(console.error);
}
```

Sample output:
```text
1. CircuitBreaker — vendor outage detection
  primary calls:  2 (capped by breaker)
  fallback calls: 5 (took over after breaker opened)

2. outputFallback — 3-tier degradation on schema failure
  result: {"amount":0,"reason":"unable to process — please retry"}
  canned fired: true

3. resumeOnError — mid-run failure recovery
  failed at: iteration 2 (iteration)
  checkpoint size: 461 bytes (JSON)
  resume result: refund processed: $50 for product defect…
```

## 1. CircuitBreaker — vendor outage detection

`withCircuitBreaker` wraps any `LLMProvider` and tracks consecutive failures. After `failureThreshold` consecutive failures, the breaker OPENS and rejects calls immediately with `CircuitOpenError` — no network round-trip. After `cooldownMs` it enters HALF-OPEN and admits probe calls; `halfOpenSuccessThreshold` successes close it; one failure re-opens it.
```ts
import { anthropic, openai } from 'agentfootprint/llm-providers';
import { withCircuitBreaker, withFallback } from 'agentfootprint/resilience';

const provider = withFallback(
  withCircuitBreaker(anthropic({ apiKey }), {
    failureThreshold: 5,
    cooldownMs: 30_000,
    halfOpenSuccessThreshold: 2,
    onStateChange: (state, why) => log.info(`circuit ${state}: ${why}`),
  }),
  withCircuitBreaker(openai({ apiKey })),
);
```

### Why on top of withRetry?

`withRetry` keeps hammering one provider with backoff during a multi-minute Anthropic outage. Each request burns 3 retries × backoff (roughly 3 seconds wasted) before falling back; multiplied by your QPS, that is a lot of wasted time and tokens. The breaker says "we just saw 5 failures in a row; stop calling for 30 seconds." Subsequent requests fail in <5µs (10k OPEN-state rejections in <50ms), and `withFallback` routes to OpenAI immediately.
### Three states with explicit transitions

```text
CLOSED ──[ N consecutive failures ]──► OPEN
   ▲                                    │
   │                                    │ [cooldownMs elapsed]
   │                                    ▼
   └──[ M probe successes ]──── HALF-OPEN
```

### Per-instance, NOT distributed

Each `withCircuitBreaker(...)` call holds its own breaker state in process memory. If you run 100 server replicas, each has its own independent breaker (matching the Hystrix default). For cluster-wide coordination, layer your own Redis-backed counter via the `onStateChange` hook and `shouldCount` predicate.
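The transitions above are small enough to sketch end-to-end. Below is a minimal, self-contained state machine illustrating the CLOSED → OPEN → HALF-OPEN cycle; it is not `withCircuitBreaker`'s actual internals, and the injectable clock is an addition so the cooldown can be exercised without sleeping.

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Illustrative sketch only — not agentfootprint's implementation.
class TinyBreaker {
  private state: CircuitState = 'CLOSED';
  private consecutiveFailures = 0;
  private probeSuccesses = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number,
    private cooldownMs: number,
    private halfOpenSuccessThreshold: number,
    private now: () => number = Date.now,
  ) {}

  current(): CircuitState {
    // OPEN lazily becomes HALF_OPEN once the cooldown elapses.
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN';
      this.probeSuccesses = 0;
    }
    return this.state;
  }

  allowRequest(): boolean {
    return this.current() !== 'OPEN'; // OPEN rejects instantly, no network call
  }

  onSuccess(): void {
    if (this.current() === 'HALF_OPEN') {
      this.probeSuccesses += 1;
      if (this.probeSuccesses >= this.halfOpenSuccessThreshold) this.state = 'CLOSED';
    }
    this.consecutiveFailures = 0;
  }

  onFailure(): void {
    const s = this.current();
    this.consecutiveFailures += 1;
    // One failure in HALF_OPEN, or N consecutive in CLOSED, opens the breaker.
    if (s === 'HALF_OPEN' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
      this.consecutiveFailures = 0;
    }
  }
}
```

Injecting a fake clock is also how you would unit-test your own breaker configuration deterministically.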
## 2. outputFallback — 3-tier degradation on schema failure

When the agent has an `outputSchema` AND the LLM emits malformed JSON, the agent normally throws `OutputSchemaError`. With `.outputFallback({ fallback, canned })`:
```ts
import { z } from 'zod';

const Refund = z.object({
  amount: z.number().nonnegative(),
  reason: z.string().min(1),
});

const agent = Agent.create({...})
  .outputSchema(Refund)
  .outputFallback({
    fallback: async (err, raw) => ({ amount: 0, reason: 'manual review' }),
    canned: { amount: 0, reason: 'unable to process' },
  })
  .build();

// Caller never sees OutputSchemaError; gets a typed Refund either way.
const refund = await agent.runTyped({ message: '...' });
```

### Three tiers
| Tier | When | Returns |
|---|---|---|
| Primary | LLM emits valid JSON | LLM’s parsed value |
| Fallback | Schema validation failure | `fallback(err, raw)` — re-validated against schema |
| Canned | Fallback throws OR returns invalid | Static safety-net (validated at builder time) |
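The tier order in the table can be sketched as a small pure function. This is an illustration of the resolution logic, not the agent's actual internals; a plain `validate` function stands in for the zod schema's `.parse`.

```typescript
type Validator<T> = (raw: unknown) => T; // throws on invalid input

// Illustrative sketch of the 3-tier resolution order.
function resolveOutput<T>(
  raw: string,
  validate: Validator<T>,
  opts: { fallback?: (err: unknown, raw: string) => T; canned?: T },
): T {
  try {
    return validate(JSON.parse(raw)); // Tier 1: primary — LLM emitted valid JSON
  } catch (err) {
    if (opts.fallback) {
      try {
        return validate(opts.fallback(err, raw)); // Tier 2: re-validated against schema
      } catch {
        /* fall through to tier 3 */
      }
    }
    if (opts.canned !== undefined) return opts.canned; // Tier 3: static safety net
    throw err; // fail-closed: no canned value configured
  }
}
```

Note that with `canned` set the function cannot throw on output failure (fail-open); without it, errors propagate (fail-closed).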
### Two typed events for observability

- `agentfootprint.resilience.output_fallback_triggered`
- `agentfootprint.resilience.output_canned_used`
### Builder-time canned validation

The canned value is parsed against the schema at `.outputFallback({...})` time, throwing `TypeError` immediately on misconfiguration. The error surfaces in CI or dev, not at 3 a.m. when the fallback engages.
### Fail-open vs fail-closed

- With `canned` → the agent NEVER throws on output failure (fail-open)
- Without `canned` → fallback errors propagate to the caller (fail-closed)

The consumer chooses.
## 3. resumeOnError — mid-run failure recovery

When `agent.run()` throws on a recoverable error mid-iteration, you get back a `RunCheckpointError` carrying a JSON-serializable checkpoint of the conversation history at the last completed iteration boundary. Persist it anywhere; resume later.
```ts
import { Agent, RunCheckpointError } from 'agentfootprint';

try {
  const result = await agent.run({ message: 'long task' });
} catch (err) {
  if (err instanceof RunCheckpointError) {
    // Persist anywhere — Redis, Postgres, S3, queue, file.
    await checkpointStore.put(sessionId, err.checkpoint);

    // hours / restart / new process / next deploy later:
    const checkpoint = await checkpointStore.get(sessionId);
    const result = await agent.resumeOnError(checkpoint);
  } else {
    throw err; // not recoverable — propagate
  }
}
```

### AgentRunCheckpoint shape (JSON-serializable, stable)

| Field | Type | Purpose |
|---|---|---|
| `version` | `1` (literal) | Forward-compat guard |
| `runId` | `string` | Original runId — observability correlation |
| `history` | `LLMMessage[]` | Conversation at last completed iteration |
| `lastCompletedIteration` | `number` | Where to resume from |
| `originalInput` | `{ message: string }` | User's original ask |
| `checkpointedAt` | `number` (ms) | Wall clock when captured |
| `failurePoint?` | `{ iteration, phase }` | Where the failure happened (oncall triage) |
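The snippets in this guide persist checkpoints through a `checkpointStore` that is never defined. Since the checkpoint is plain JSON, any keyed string store works; here is a minimal in-memory sketch with the same `put`/`get` surface. The `Checkpoint` interface below is a simplified stand-in for the exported `AgentRunCheckpoint` type, not the library's definition.

```typescript
// Simplified stand-in mirroring the table above (illustrative only).
interface Checkpoint {
  version: 1;
  runId: string;
  history: unknown[];
  lastCompletedIteration: number;
  originalInput: { message: string };
  checkpointedAt: number;
  failurePoint?: { iteration: number; phase: string };
}

// Minimal store: serialize on put, parse on get. The string round-trip
// is the point — anything that can hold a string (Redis value, S3
// object, a DB column) can hold a checkpoint.
class MemoryCheckpointStore {
  private rows = new Map<string, string>();

  async put(sessionId: string, cp: Checkpoint): Promise<void> {
    this.rows.set(sessionId, JSON.stringify(cp));
  }

  async get(sessionId: string): Promise<Checkpoint | undefined> {
    const raw = this.rows.get(sessionId);
    return raw === undefined ? undefined : (JSON.parse(raw) as Checkpoint);
  }
}
```

In production you would swap the `Map` for Redis, Postgres, or S3 without touching the callers.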
### Failure-phase classifier

`RunCheckpointError.checkpoint.failurePoint.phase` is one of `'llm' | 'tool' | 'iteration' | 'unknown'`. The classifier recognizes `CircuitOpenError` (v2.10.0), `AnthropicError`, `OpenAIError`, and `BedrockError`. It feeds straight into oncall postmortems — "how often did we lose runs to LLM-phase failures last week?"
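For intuition, a name-based classifier could look roughly like this. This is a hypothetical sketch: the library's real classifier is internal, `ToolExecutionError` is an invented name for illustration, and which bucket a generic `Error` lands in is an assumption.

```typescript
type FailurePhase = 'llm' | 'tool' | 'iteration' | 'unknown';

// Hypothetical sketch — the error names match the ones the guide lists,
// but the real classifier's logic is internal to the library. Matching
// on constructor.name breaks under class-name minification; a real
// implementation would use instanceof against the exported classes.
function classifyPhase(err: unknown): FailurePhase {
  const name = err instanceof Error ? err.constructor.name : '';
  const llmErrors = ['CircuitOpenError', 'AnthropicError', 'OpenAIError', 'BedrockError'];
  if (llmErrors.includes(name)) return 'llm';
  if (name === 'ToolExecutionError') return 'tool'; // invented name, illustrative
  if (err instanceof Error) return 'iteration'; // assumption: generic errors → iteration
  return 'unknown';
}
```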
### Tradeoffs (be honest)

- ✅ Survives process restart (JSON-serializable, tiny payload — typically <1 KB)
- ✅ Works with any LLM provider — replay starts from history
- ✅ No footprintjs core changes
- ⚠️ Loses mid-iteration partial state (acceptable — iterations are atomic)
- ⚠️ Tools inside the failed iteration re-execute on resume. For idempotent tools (read-only DB queries) this is fine; for non-idempotent tools (charge card, send email) consumers MUST add their own idempotency keys. v2.10.3+ may add `toolCallId`-based dedup.
## Composing all three

The three primitives stack naturally — each sits at a different layer of the failure stack:
```ts
import { Agent, RunCheckpointError } from 'agentfootprint';
import { anthropic, openai } from 'agentfootprint/llm-providers';
import { withCircuitBreaker, withFallback } from 'agentfootprint/resilience';
import { z } from 'zod';

// Layer 1: provider — circuit-breaker + multi-vendor fallback
const provider = withFallback(
  withCircuitBreaker(anthropic({ apiKey })),
  withCircuitBreaker(openai({ apiKey })),
);

// Layer 2: agent — output-schema + canned safety-net
const agent = Agent.create({ provider, model: 'claude-3-5-sonnet' })
  .system('You decide refund amounts.')
  .outputSchema(z.object({ amount: z.number(), reason: z.string() }))
  .outputFallback({
    fallback: async () => ({ amount: 0, reason: 'manual review' }),
    canned: { amount: 0, reason: 'unable to process' },
  })
  .build();

// Layer 3: caller — fault-tolerant resume from checkpoint
async function processRefund(message: string, sessionId: string) {
  try {
    return await agent.runTyped({ message });
  } catch (err) {
    if (err instanceof RunCheckpointError) {
      await checkpointStore.put(sessionId, err.checkpoint);
      throw err; // queue picks up, retries with resumeOnError
    }
    throw err;
  }
}

// Worker that drains the queue (separate process, fresh container, hours later)
async function resumeWorker(sessionId: string) {
  const checkpoint = await checkpointStore.get(sessionId);
  return await agent.resumeOnError(checkpoint);
}
```

Three primitives, three failure modes, one substrate.
## See also

- `examples/features/08-reliability.ts` — runnable end-to-end example (this guide's source of truth)
- Error handling guide — typed errors at the SDK boundary; precursor to the Reliability primitives
- Observability guide — wire `onStateChange` / `output_canned_used` / `failurePoint.phase` into your observability backend