# Reliability gate
Your agent ships. Once a week, the LLM provider 503s; the cumulative cost on one tenant exceeds budget; a malformed tool-call response sneaks through. Without rules, every one of these is a bespoke `try/catch` somewhere in your call site. agentfootprint v2.11.5 lets you declare the rules.
## What it is

Rules-based retry, fallback, and fail-fast around every LLM call inside an Agent's ReAct loop. You declare rules; the framework wraps the call in a loop driven by those rules.
```ts
import { Agent } from 'agentfootprint';
import { ReliabilityFailFastError } from 'agentfootprint/reliability';

const agent = Agent.create({ provider, model: 'claude-sonnet-4-5-20250929' })
  .system('You triage support tickets.')
  .reliability({
    postDecide: [
      // Transient 5xx → retry up to 3 attempts.
      { when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3, then: 'retry', kind: 'transient-retry' },
      // Anything else → fail-fast with typed error.
      { when: (s) => s.error !== undefined, then: 'fail-fast', kind: 'unrecoverable' },
    ],
    circuitBreaker: { failureThreshold: 3 },
  })
  .build();

try {
  await agent.run({ message: 'help' });
} catch (e) {
  if (e instanceof ReliabilityFailFastError) {
    console.log(e.kind, e.reason, e.payload);
  }
}
```

## Decision verbs
A rule's `then` field is one of six verbs:
| Verb | Phase | Effect |
|---|---|---|
| `continue` | pre-check | No issues — proceed to the call |
| `ok` | post-decide | Call succeeded — commit response and return |
| `retry` | post-decide | Re-run the same provider (bumps `attempt`) |
| `retry-other` | post-decide | Advance to the next provider in `providers[]` |
| `fallback` | post-decide | Invoke `config.fallback(req, lastError)` to repair |
| `fail-fast` | both | Throw `ReliabilityFailFastError` at `agent.run()` |
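The post-decide verbs reduce to a small dispatch loop over providers and attempts. The sketch below is only an illustration of those semantics, not agentfootprint's implementation; `runWithRules` and its type shapes are hypothetical names invented for this example.

```typescript
// Illustrative sketch of the post-decide loop: not agentfootprint internals.
// A successful call commits and returns (the implicit `ok`); rules arbitrate errors.
type Verb = 'retry' | 'retry-other' | 'fallback' | 'fail-fast';
type Snapshot = { attempt: number; error?: Error };
type Rule = { when: (s: Snapshot) => boolean; then: Verb; kind: string };

async function runWithRules(
  providers: Array<() => Promise<string>>,
  rules: Rule[],
  fallback?: (lastError: Error) => string,
): Promise<string> {
  let providerIdx = 0;
  let attempt = 1;
  for (;;) {
    if (providerIdx >= providers.length) throw new Error('fail-fast: providers-exhausted');
    try {
      return await providers[providerIdx]();            // success: implicit ok
    } catch (e) {
      const snapshot: Snapshot = { attempt, error: e as Error };
      const rule = rules.find((r) => r.when(snapshot)); // first matching rule wins
      switch (rule?.then) {
        case 'retry':       attempt += 1; break;        // same provider, bump attempt
        case 'retry-other': providerIdx += 1; attempt = 1; break; // next provider in the list
        case 'fallback':    return fallback!(snapshot.error!);    // repair via the fallback hook
        default:            throw new Error(`fail-fast: ${rule?.kind ?? 'no-rule-matched'}`);
      }
    }
  }
}
```

With a rule like `{ when: (s) => s.attempt < 3, then: 'retry', kind: 'transient-retry' }` and a provider that fails once, the loop re-invokes the same provider and succeeds on the second attempt; with no matching rule, the error escalates to fail-fast.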
## Streaming + reliability — first-chunk arbitration

Streaming and retry are fundamentally hard to compose. If a stream errors after token 5, retrying re-emits tokens 1–5 (duplicates) — or you buffer the whole stream first and lose progressive UX. There is no clean answer at the LLM provider boundary today.
agentfootprint adopts first-chunk arbitration — the same pattern LangChain uses in `RunnableWithFallbacks`:
- Pre-first-chunk failures (connection, headers, breaker-open): the full rule set fires. `retry`, `retry-other`, `fallback`, and `fail-fast` are all available.
- Post-first-chunk failures (mid-stream): rules can only fire `ok` or `fail-fast`. Rules wanting `retry`, `retry-other`, or `fallback` are escalated to fail-fast with kind `'mid-stream-not-retryable'`.
Why: once tokens have been delivered to the consumer, neither retrying nor falling back is correct without coordination the LLM provider doesn’t offer (no resume tokens, no idempotency for stream content). The honest behavior is to commit what was delivered, or fail-fast.
Whether to stream stays the consumer's choice; reliability adapts to it.
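First-chunk arbitration reduces to one question: has anything been delivered yet? The following self-contained sketch shows that split using plain async generators; `withArbitration` is a hypothetical name for illustration, not the framework's streaming code.

```typescript
// Sketch of first-chunk arbitration: illustrative, not agentfootprint internals.
async function* withArbitration(
  makeStream: () => AsyncGenerator<string>,
  maxAttempts: number,
): AsyncGenerator<string> {
  for (let attempt = 1; ; attempt++) {
    const stream = makeStream();
    let first: IteratorResult<string>;
    try {
      first = await stream.next();         // pull one chunk before committing
    } catch (e) {
      if (attempt < maxAttempts) continue; // pre-first-chunk: full rule set, retry is safe
      throw e;
    }
    if (first.done) return;
    yield first.value;                     // committed: no retry past this point
    try {
      for await (const chunk of stream) yield chunk;
    } catch (e) {
      // Post-first-chunk: only ok / fail-fast; escalate instead of re-emitting tokens.
      throw new Error(`mid-stream-not-retryable: ${(e as Error).message}`);
    }
    return;
  }
}
```

A stream that dies on connection is retried transparently with no duplicate tokens; a stream that dies after chunk 1 surfaces a `mid-stream-not-retryable` failure to the consumer.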
## Industry context

Why this design? Because every adjacent framework has confronted the same tradeoff:
| Framework / SDK | Mid-stream retry | Pattern |
|---|---|---|
| Anthropic SDK | No | Skip retry on streams (silently — `#ended`/reject promise) |
| OpenAI SDK | No | Skip retry on streams (explicit `isRetryableBody` guard) |
| LangChain `RunnableRetry` | No | Skip retry on streams (source comment: "not very intuitive") |
| LangChain `RunnableWithFallbacks` | Pre-first-chunk only | First-chunk arbitration — pull one chunk, commit on first success |
| LangGraph Pregel | Yes (whole node) | Atomic node retry; `task.writes.clear()`; accepts duplicate tokens |
| Strands | Yes (whole stream) | `while True` retry; visible duplicate tokens |
| LlamaIndex | No | Docs: "set `streaming=False`" to get retry |
| Llama Stack | Pre-first-token only | Client-SDK retry on HTTP errors |
agentfootprint sits with the most disciplined cohort (LangChain RunnableWithFallbacks, LangGraph’s atomic boundary, the OpenAI/Anthropic SDKs): retry honestly when retry is correct, escalate to fail-fast otherwise. The consumer never sees duplicate tokens.
## How it composes with the v2.10.x primitives

v2.11.5's gate is additive — the v2.10.x reliability primitives still work and compose cleanly:
| Primitive | Layer | Use when |
|---|---|---|
| Reliability gate (v2.11.5) | per-call | Rules-based retry / fallback / fail-fast inside the loop |
| `withCircuitBreaker(provider)` | per-call | Low-level breaker around any `LLMProvider` |
| `.outputFallback({...})` | per-turn | Malformed JSON degradation (3-tier) |
| `agent.resumeOnError(checkpoint)` | cross-process | Mid-run failure recovery; resume on a different host |
The gate handles in-loop transient failures. `outputFallback` handles end-of-turn schema failure. `resumeOnError` handles process-level crash recovery. Compose all three for full coverage.
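The `failureThreshold` behavior behind the breaker can be modeled as a pure state transition, which is the shape the exported breaker functions suggest for hydrating state from a store. Below is a minimal self-contained sketch assuming a closed/open/half-open machine with a fixed cooldown; every name here is illustrative, not the library's actual exports.

```typescript
// Hypothetical names: a sketch of circuit-breaker semantics, not agentfootprint's exports.
type BreakerState = { status: 'closed' | 'open'; failures: number; openedAt?: number };

const THRESHOLD = 3;        // consecutive failures before the breaker opens
const COOLDOWN_MS = 30_000; // how long the breaker refuses calls once open

// Pure transition: fold each call result into the next breaker state.
function onResult(state: BreakerState, ok: boolean, now: number): BreakerState {
  if (ok) return { status: 'closed', failures: 0 }; // any success resets the breaker
  const failures = state.failures + 1;
  if (failures >= THRESHOLD) return { status: 'open', failures, openedAt: now };
  return { ...state, failures };
}

// Pure query: may we attempt a call right now?
function canCall(state: BreakerState, now: number): boolean {
  if (state.status !== 'open') return true;
  return now - (state.openedAt ?? 0) >= COOLDOWN_MS; // half-open probe after cooldown
}
```

Because both functions are pure over `(state, now)`, a consumer can persist `BreakerState` alongside a checkpoint and rehydrate it on another host, which is the use case the Reference section attributes to the exported breaker functions.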
## Future work

- Tee'd buffer mode — internal buffer + atomic-deliver-on-success. Makes mid-stream retry safe at the cost of first-token latency. Opt-in config flag.
- Token idempotency keys — moot until LLM providers expose them.
- Stage-level granularity — every retry attempt as a separate stage in the trace (currently one stage execution with an internal loop). Trades streaming compatibility for a richer trace; available today via the `buildReliabilityGateChart` chart-builder for consumers composing raw `LLMCall + gate` patterns.
## End-to-end example

The single example below exercises the happy, retry, and fail-fast scenarios. The docs site live-imports the source, so this snippet stays in sync with the runnable file.
```ts
/**
 * 09 — Reliability gate (v2.11.5): rules-based retry / fallback / fail-fast
 * around every LLM call inside an Agent's ReAct loop.
 *
 * Where 08 covers the v2.10.x reliability *primitives* (`withCircuitBreaker`,
 * `.outputFallback`, `agent.resumeOnError`), this example covers the
 * v2.11.5 reliability *gate* — declarative rules wrapping every CallLLM
 * inside the agent's loop:
 *
 *   PreCheck rules → continue / fail-fast
 *        ↓
 *   provider call → response or error
 *        ↓
 *   PostDecide rules → ok / retry / retry-other / fallback / fail-fast
 *
 * Three scenarios run:
 *
 * 1. Happy path — reliability configured, first call succeeds.
 *    Agent returns final answer; no rule fires fail-fast.
 *
 * 2. Retry path — provider throws a transient 5xx once; postDecide's
 *    `retry` rule fires; second attempt succeeds.
 *
 * 3. Fail-fast — provider throws; postDecide's `fail-fast` rule
 *    fires; `agent.run()` throws ReliabilityFailFastError.
 *    Caller branches on `e.kind` and `e.payload.phase`.
 *
 * Run: npx tsx examples/features/09-reliability-gate.ts
 */
import { Agent } from '../../src/index.js';
import { ReliabilityFailFastError } from '../../src/reliability/types.js';
import type { LLMProvider, LLMRequest, LLMResponse } from '../../src/adapters/types.js';
import { type ExampleMeta } from '../helpers/cli.js';

export const meta: ExampleMeta = {
  id: 'features/09-reliability-gate',
  title: 'Reliability gate — rules-based retry / fallback / fail-fast around CallLLM',
  group: 'features',
  description:
    'v2.11.5 — declarative reliability rules wrapping every LLM call inside an Agent loop. Demonstrates happy path, transient-retry recovery, and post-decide fail-fast → typed ReliabilityFailFastError. Streaming + reliability uses first-chunk arbitration: pre-first-chunk failures honor the full rule set; mid-stream failures only honor ok / fail-fast.',
  defaultInput: 'demo all three reliability paths',
  providerSlots: ['feature'],
  tags: ['feature', 'reliability', 'reliability-gate', 'retry', 'fail-fast', 'fallback'],
};

// ─── Test providers ──────────────────────────────────────────────

/** Always succeeds. */
function okProvider(reply: string): LLMProvider {
  return {
    name: 'mock',
    complete: async (): Promise<LLMResponse> => ({
      content: reply,
      toolCalls: [],
      usage: { input: 1, output: 1 },
      stopReason: 'end_turn',
    }),
  };
}

/** Throws `failTimes` times then succeeds. Useful for retry scenarios. */
function flakyProvider(opts: {
  failTimes: number;
  error: Error;
  successReply: string;
}): { provider: LLMProvider; getCalls: () => number } {
  let calls = 0;
  const provider: LLMProvider = {
    name: 'flaky',
    complete: async (_req: LLMRequest): Promise<LLMResponse> => {
      calls += 1;
      if (calls <= opts.failTimes) throw opts.error;
      return {
        content: opts.successReply,
        toolCalls: [],
        usage: { input: 1, output: 1 },
        stopReason: 'end_turn',
      };
    },
  };
  return { provider, getCalls: () => calls };
}

/** Always throws. Useful for fail-fast scenarios. */
function alwaysThrowsProvider(error: Error): LLMProvider {
  return {
    name: 'broken',
    complete: async (): Promise<LLMResponse> => {
      throw error;
    },
  };
}

// ─── Scenario 1: happy path — rules configured, first call succeeds ─

async function happyPath(): Promise<{ result: string }> {
  const agent = Agent.create({ provider: okProvider('all good'), model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        // Rule: any error → fail-fast. Doesn't fire here because the
        // call succeeds; the agent returns the LLM's response.
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
        },
      ],
    })
    .build();
  const result = (await agent.run({ message: 'hi' })) as string;
  return { result };
}

// ─── Scenario 2: retry — first call fails transient 5xx, second succeeds ─

async function retryPath(): Promise<{ result: string; providerCalls: number }> {
  const transient = new Error('Service Unavailable');
  (transient as Error & { status?: number }).status = 503;
  const flaky = flakyProvider({
    failTimes: 1,
    error: transient,
    successReply: 'recovered',
  });

  const agent = Agent.create({ provider: flaky.provider, model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        // Retry up to 3 attempts on 5xx. After that, fail-fast on
        // subsequent errors.
        {
          when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3,
          then: 'retry',
          kind: 'transient-retry',
          label: 'transient 5xx, retrying',
        },
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
        },
      ],
    })
    .build();

  const result = (await agent.run({ message: 'go' })) as string;
  return { result, providerCalls: flaky.getCalls() };
}

// ─── Scenario 3: fail-fast — error → typed ReliabilityFailFastError ─

async function failFastPath(): Promise<{
  thrown: boolean;
  kind?: string;
  reason?: string;
  phase?: string;
}> {
  const fatal = new Error('schema violation');
  const agent = Agent.create({ provider: alwaysThrowsProvider(fatal), model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
          label: 'unrecoverable error from provider',
        },
      ],
    })
    .build();

  try {
    await agent.run({ message: 'go' });
    return { thrown: false };
  } catch (e) {
    if (e instanceof ReliabilityFailFastError) {
      return {
        thrown: true,
        kind: e.kind,
        reason: e.reason,
        phase: e.payload?.phase,
      };
    }
    throw e;
  }
}

// ─── Entry point ──────────────────────────────────────────────────

export async function run(): Promise<{
  happy: { result: string };
  retry: { result: string; providerCalls: number };
  failFast: { thrown: boolean; kind?: string; reason?: string; phase?: string };
}> {
  const happy = await happyPath();
  const retry = await retryPath();
  const failFast = await failFastPath();
  return { happy, retry, failFast };
}

// Run as a script: regression-guard the example so the CI integration
// test catches drift if the API changes.
if (import.meta.url === `file://${process.argv[1]}`) {
  run()
    .then((out) => {
      console.log('=== reliability gate scenarios ===');
      console.log('happy: ', out.happy);
      console.log('retry: ', out.retry);
      console.log('failFast: ', out.failFast);

      // Sanity: each scenario engaged as designed.
      if (out.happy.result !== 'all good') {
        console.error('happy path: unexpected result');
        process.exit(1);
      }
      if (out.retry.providerCalls !== 2 || out.retry.result !== 'recovered') {
        console.error('retry path: unexpected calls/result');
        process.exit(1);
      }
      if (
        !out.failFast.thrown ||
        out.failFast.kind !== 'unrecoverable' ||
        out.failFast.phase !== 'post-decide'
      ) {
        console.error('fail-fast: did not engage as designed');
        process.exit(1);
      }
      console.log('OK');
    })
    .catch((err) => {
      console.error(err);
      process.exit(1);
    });
}
```

## Reference
- `Agent.create({...}).reliability(config)` — fluent builder method
- `ReliabilityConfig` — the rule list shape (see `agentfootprint/reliability` types)
- `ReliabilityFailFastError` — thrown at `agent.run()` on `fail-fast` decisions; carries `kind`, `reason`, `cause`, `payload`, `snapshot`
- `CircuitBreaker` pure functions — exported from `agentfootprint/reliability` for consumers hydrating breaker state from a persistence store