# Reliability gate
Your agent ships. Once a week, the LLM provider 503s; the cumulative cost on one tenant exceeds budget; a malformed tool-call response sneaks through. Without rules, every one of these is a bespoke `try/catch` somewhere in your call site. agentfootprint v2.11.5 lets you declare the rules.
## What it is

Rules-based retry, fallback, and fail-fast around every LLM call inside an Agent's ReAct loop. You declare rules; the framework wraps the call in a loop driven by those rules.
```ts
import { Agent } from 'agentfootprint';
import { ReliabilityFailFastError } from 'agentfootprint/reliability';

const agent = Agent.create({ provider, model: 'claude-sonnet-4-5-20250929' })
  .system('You triage support tickets.')
  .reliability({
    postDecide: [
      // Transient 5xx → retry up to 3 attempts.
      { when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3, then: 'retry', kind: 'transient-retry' },
      // Anything else → fail-fast with typed error.
      { when: (s) => s.error !== undefined, then: 'fail-fast', kind: 'unrecoverable' },
    ],
    circuitBreaker: { failureThreshold: 3 },
  })
  .build();

try {
  await agent.run({ message: 'help' });
} catch (e) {
  if (e instanceof ReliabilityFailFastError) {
    console.log(e.kind, e.reason, e.payload);
  }
}
```

## Decision verbs
A rule's `then` field is one of six verbs:
| Verb | Phase | Effect |
|---|---|---|
| `continue` | pre-check | No issues — proceed to the call |
| `ok` | post-decide | Call succeeded — commit response and return |
| `retry` | post-decide | Re-run the same provider (bumps `attempt`) |
| `retry-other` | post-decide | Advance to the next provider in `providers[]` |
| `fallback` | post-decide | Invoke `config.fallback(req, lastError)` to repair |
| `fail-fast` | both | Throw `ReliabilityFailFastError` at `agent.run()` |
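The post-decide verbs reduce to a small dispatch loop over providers and attempts. The sketch below is only an illustration of those semantics, not agentfootprint's implementation; `runWithRules` and its type shapes are hypothetical names invented for this example.

```typescript
// Illustrative sketch of the post-decide loop: not agentfootprint internals.
// A successful call commits and returns (the implicit `ok`); rules arbitrate errors.
type Verb = 'retry' | 'retry-other' | 'fallback' | 'fail-fast';
type Snapshot = { attempt: number; error?: Error };
type Rule = { when: (s: Snapshot) => boolean; then: Verb; kind: string };

async function runWithRules(
  providers: Array<() => Promise<string>>,
  rules: Rule[],
  fallback?: (lastError: Error) => string,
): Promise<string> {
  let providerIdx = 0;
  let attempt = 1;
  for (;;) {
    if (providerIdx >= providers.length) throw new Error('fail-fast: providers-exhausted');
    try {
      return await providers[providerIdx]();            // success: implicit ok
    } catch (e) {
      const snapshot: Snapshot = { attempt, error: e as Error };
      const rule = rules.find((r) => r.when(snapshot)); // first matching rule wins
      switch (rule?.then) {
        case 'retry':       attempt += 1; break;        // same provider, bump attempt
        case 'retry-other': providerIdx += 1; attempt = 1; break; // next provider in the list
        case 'fallback':    return fallback!(snapshot.error!);    // repair via the fallback hook
        default:            throw new Error(`fail-fast: ${rule?.kind ?? 'no-rule-matched'}`);
      }
    }
  }
}
```

With a rule like `{ when: (s) => s.attempt < 3, then: 'retry', kind: 'transient-retry' }` and a provider that fails once, the loop re-invokes the same provider and succeeds on the second attempt; with no matching rule, the error escalates to fail-fast.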
## Streaming + reliability — first-chunk arbitration

Streaming and retry are fundamentally hard to compose. If a stream errors after token 5, retrying re-emits tokens 1–5 (duplicates) — or you buffer the whole stream first and lose progressive UX. There is no clean answer at the LLM provider boundary today.
agentfootprint adopts first-chunk arbitration — the same pattern LangChain uses in `RunnableWithFallbacks`:
- Pre-first-chunk failures (connection, headers, breaker-open): the full rule set fires. `retry`, `retry-other`, `fallback`, and `fail-fast` are all available.
- Post-first-chunk failures (mid-stream): rules can only fire `ok` or `fail-fast`. Rules wanting `retry`, `retry-other`, or `fallback` are escalated to fail-fast with kind `'mid-stream-not-retryable'`.
Why: once tokens have been delivered to the consumer, neither retrying nor falling back is correct without coordination the LLM provider doesn’t offer (no resume tokens, no idempotency for stream content). The honest behavior is to commit what was delivered, or fail-fast.
Whether to stream stays the consumer's choice; reliability adapts to it.
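First-chunk arbitration reduces to one question: has anything been delivered yet? The following self-contained sketch shows that split using plain async generators; `withArbitration` is a hypothetical name for illustration, not the framework's streaming code.

```typescript
// Sketch of first-chunk arbitration: illustrative, not agentfootprint internals.
async function* withArbitration(
  makeStream: () => AsyncGenerator<string>,
  maxAttempts: number,
): AsyncGenerator<string> {
  for (let attempt = 1; ; attempt++) {
    const stream = makeStream();
    let first: IteratorResult<string>;
    try {
      first = await stream.next();         // pull one chunk before committing
    } catch (e) {
      if (attempt < maxAttempts) continue; // pre-first-chunk: full rule set, retry is safe
      throw e;
    }
    if (first.done) return;
    yield first.value;                     // committed: no retry past this point
    try {
      for await (const chunk of stream) yield chunk;
    } catch (e) {
      // Post-first-chunk: only ok / fail-fast; escalate instead of re-emitting tokens.
      throw new Error(`mid-stream-not-retryable: ${(e as Error).message}`);
    }
    return;
  }
}
```

A stream that dies on connection is retried transparently with no duplicate tokens; a stream that dies after chunk 1 surfaces a `mid-stream-not-retryable` failure to the consumer.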
## Industry context

Why this design? Because every adjacent framework has confronted the same tradeoff:
| Framework / SDK | Mid-stream retry | Pattern |
|---|---|---|
| Anthropic SDK | No | Skip retry on streams (silently — `#ended`/reject promise) |
| OpenAI SDK | No | Skip retry on streams (explicit `isRetryableBody` guard) |
| LangChain `RunnableRetry` | No | Skip retry on streams (source comment: "not very intuitive") |
| LangChain `RunnableWithFallbacks` | Pre-first-chunk only | First-chunk arbitration — pull one chunk, commit on first success |
| LangGraph Pregel | Yes (whole node) | Atomic node retry; `task.writes.clear()`; accepts duplicate tokens |
| Strands | Yes (whole stream) | `while True` retry; visible duplicate tokens |
| LlamaIndex | No | Docs: "set `streaming=False`" to get retry |
| Llama Stack | Pre-first-token only | Client-SDK retry on HTTP errors |
agentfootprint sits with the most disciplined cohort (LangChain RunnableWithFallbacks, LangGraph’s atomic boundary, the OpenAI/Anthropic SDKs): retry honestly when retry is correct, escalate to fail-fast otherwise. The consumer never sees duplicate tokens.
## How it composes with the v2.10.x primitives

v2.11.5's gate is additive — the v2.10.x reliability primitives still work and compose cleanly:
| Primitive | Layer | Use when |
|---|---|---|
| Reliability gate (v2.11.5) | per-call | Rules-based retry / fallback / fail-fast inside the loop |
| `withCircuitBreaker(provider)` | per-call | Low-level breaker around any `LLMProvider` |
| `.outputFallback({...})` | per-turn | Malformed JSON degradation (3-tier) |
| `agent.resumeOnError(checkpoint)` | cross-process | Mid-run failure recovery; resume on a different host |
The gate handles in-loop transient failures. `outputFallback` handles end-of-turn schema failure. `resumeOnError` handles process-level crash recovery. Compose all three for full coverage.
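The `failureThreshold` behavior behind the breaker can be modeled as a pure state transition, which is the shape the exported breaker functions suggest for hydrating state from a store. Below is a minimal self-contained sketch assuming a closed/open/half-open machine with a fixed cooldown; every name here is illustrative, not the library's actual exports.

```typescript
// Hypothetical names: a sketch of circuit-breaker semantics, not agentfootprint's exports.
type BreakerState = { status: 'closed' | 'open'; failures: number; openedAt?: number };

const THRESHOLD = 3;        // consecutive failures before the breaker opens
const COOLDOWN_MS = 30_000; // how long the breaker refuses calls once open

// Pure transition: fold each call result into the next breaker state.
function onResult(state: BreakerState, ok: boolean, now: number): BreakerState {
  if (ok) return { status: 'closed', failures: 0 }; // any success resets the breaker
  const failures = state.failures + 1;
  if (failures >= THRESHOLD) return { status: 'open', failures, openedAt: now };
  return { ...state, failures };
}

// Pure query: may we attempt a call right now?
function canCall(state: BreakerState, now: number): boolean {
  if (state.status !== 'open') return true;
  return now - (state.openedAt ?? 0) >= COOLDOWN_MS; // half-open probe after cooldown
}
```

Because both functions are pure over `(state, now)`, a consumer can persist `BreakerState` alongside a checkpoint and rehydrate it on another host, which is the use case the Reference section attributes to the exported breaker functions.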
## Future work

- Tee'd buffer mode — internal buffer + atomic-deliver-on-success. Makes mid-stream retry safe at the cost of first-token latency. Opt-in config flag.
- Token idempotency keys — moot until LLM providers expose them.
- Stage-level granularity — every retry attempt as a separate stage in the trace (currently one stage execution with an internal loop). Trades streaming compatibility for a richer trace; available today via the `buildReliabilityGateChart` chart-builder for consumers composing raw `LLMCall + gate` patterns.
## End-to-end example

The single example below exercises the happy, retry, and fail-fast scenarios. The docs site live-imports the source, so this snippet stays in sync with the runnable file.
```ts
/**
 * 09 — Reliability gate (v2.11.5): rules-based retry / fallback / fail-fast
 * around every LLM call inside an Agent's ReAct loop.
 *
 * Where 08 covers the v2.10.x reliability *primitives* (`withCircuitBreaker`,
 * `.outputFallback`, `agent.resumeOnError`), this example covers the
 * v2.11.5 reliability *gate* — declarative rules wrapping every CallLLM
 * inside the agent's loop:
 *
 *   PreCheck rules → continue / fail-fast
 *        ↓
 *   provider call → response or error
 *        ↓
 *   PostDecide rules → ok / retry / retry-other / fallback / fail-fast
 *
 * Three scenarios run:
 *
 * 1. Happy path — reliability configured, first call succeeds.
 *    Agent returns final answer; no rule fires fail-fast.
 *
 * 2. Retry path — provider throws a transient 5xx once; postDecide's
 *    `retry` rule fires; second attempt succeeds.
 *
 * 3. Fail-fast — provider throws; postDecide's `fail-fast` rule
 *    fires; `agent.run()` throws ReliabilityFailFastError.
 *    Caller branches on `e.kind` and `e.payload.phase`.
 *
 * Run: npx tsx examples/features/09-reliability-gate.ts
 */
import { Agent } from '../../src/index.js';
import { ReliabilityFailFastError } from '../../src/reliability/types.js';
import type { LLMProvider, LLMRequest, LLMResponse } from '../../src/adapters/types.js';
import { type ExampleMeta } from '../helpers/cli.js';

export const meta: ExampleMeta = {
  id: 'features/09-reliability-gate',
  title: 'Reliability gate — rules-based retry / fallback / fail-fast around CallLLM',
  group: 'features',
  description:
    'v2.11.5 — declarative reliability rules wrapping every LLM call inside an Agent loop. Demonstrates happy path, transient-retry recovery, and post-decide fail-fast → typed ReliabilityFailFastError. Streaming + reliability uses first-chunk arbitration: pre-first-chunk failures honor the full rule set; mid-stream failures only honor ok / fail-fast.',
  defaultInput: 'demo all three reliability paths',
  providerSlots: ['feature'],
  tags: ['feature', 'reliability', 'reliability-gate', 'retry', 'fail-fast', 'fallback'],
};

// ─── Test providers ──────────────────────────────────────────────

/** Always succeeds. */
function okProvider(reply: string): LLMProvider {
  return {
    name: 'mock',
    complete: async (): Promise<LLMResponse> => ({
      content: reply,
      toolCalls: [],
      usage: { input: 1, output: 1 },
      stopReason: 'end_turn',
    }),
  };
}

/** Throws `failTimes` times then succeeds. Useful for retry scenarios. */
function flakyProvider(opts: {
  failTimes: number;
  error: Error;
  successReply: string;
}): { provider: LLMProvider; getCalls: () => number } {
  let calls = 0;
  const provider: LLMProvider = {
    name: 'flaky',
    complete: async (_req: LLMRequest): Promise<LLMResponse> => {
      calls += 1;
      if (calls <= opts.failTimes) throw opts.error;
      return {
        content: opts.successReply,
        toolCalls: [],
        usage: { input: 1, output: 1 },
        stopReason: 'end_turn',
      };
    },
  };
  return { provider, getCalls: () => calls };
}

/** Always throws. Useful for fail-fast scenarios. */
function alwaysThrowsProvider(error: Error): LLMProvider {
  return {
    name: 'broken',
    complete: async (): Promise<LLMResponse> => {
      throw error;
    },
  };
}

// ─── Scenario 1: happy path — rules configured, first call succeeds ─

async function happyPath(): Promise<{ result: string }> {
  const agent = Agent.create({ provider: okProvider('all good'), model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        // Rule: any error → fail-fast. Doesn't fire here because the
        // call succeeds; the agent returns the LLM's response.
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
        },
      ],
    })
    .build();
  const result = (await agent.run({ message: 'hi' })) as string;
  return { result };
}

// ─── Scenario 2: retry — first call fails transient 5xx, second succeeds ─

async function retryPath(): Promise<{ result: string; providerCalls: number }> {
  const transient = new Error('Service Unavailable');
  (transient as Error & { status?: number }).status = 503;
  const flaky = flakyProvider({
    failTimes: 1,
    error: transient,
    successReply: 'recovered',
  });

  const agent = Agent.create({ provider: flaky.provider, model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        // Retry up to 3 attempts on 5xx. After that, fail-fast on
        // subsequent errors.
        {
          when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3,
          then: 'retry',
          kind: 'transient-retry',
          label: 'transient 5xx, retrying',
        },
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
        },
      ],
    })
    .build();

  const result = (await agent.run({ message: 'go' })) as string;
  return { result, providerCalls: flaky.getCalls() };
}

// ─── Scenario 3: fail-fast — error → typed ReliabilityFailFastError ─

async function failFastPath(): Promise<{
  thrown: boolean;
  kind?: string;
  reason?: string;
  phase?: string;
}> {
  const fatal = new Error('schema violation');
  const agent = Agent.create({ provider: alwaysThrowsProvider(fatal), model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
          label: 'unrecoverable error from provider',
        },
      ],
    })
    .build();

  try {
    await agent.run({ message: 'go' });
    return { thrown: false };
  } catch (e) {
    if (e instanceof ReliabilityFailFastError) {
      return {
        thrown: true,
        kind: e.kind,
        reason: e.reason,
        phase: e.payload?.phase,
      };
    }
    throw e;
  }
}

// ─── Entry point ──────────────────────────────────────────────────

export async function run(): Promise<{
  happy: { result: string };
  retry: { result: string; providerCalls: number };
  failFast: { thrown: boolean; kind?: string; reason?: string; phase?: string };
}> {
  const happy = await happyPath();
  const retry = await retryPath();
  const failFast = await failFastPath();
  return { happy, retry, failFast };
}

// Run as a script: regression-guard the example so the CI integration
// test catches drift if the API changes.
if (import.meta.url === `file://${process.argv[1]}`) {
  run()
    .then((out) => {
      console.log('=== reliability gate scenarios ===');
      console.log('happy: ', out.happy);
      console.log('retry: ', out.retry);
      console.log('failFast: ', out.failFast);

      // Sanity: each scenario engaged as designed.
      if (out.happy.result !== 'all good') {
        console.error('happy path: unexpected result');
        process.exit(1);
      }
      if (out.retry.providerCalls !== 2 || out.retry.result !== 'recovered') {
        console.error('retry path: unexpected calls/result');
        process.exit(1);
      }
      if (
        !out.failFast.thrown ||
        out.failFast.kind !== 'unrecoverable' ||
        out.failFast.phase !== 'post-decide'
      ) {
        console.error('fail-fast: did not engage as designed');
        process.exit(1);
      }
      console.log('OK');
    })
    .catch((err) => {
      console.error(err);
      process.exit(1);
    });
}
```

## Reference
- `Agent.create({...}).reliability(config)` — fluent builder method
- `ReliabilityConfig` — the rule list shape (see `agentfootprint/reliability` types)
- `ReliabilityFailFastError` — thrown at `agent.run()` on `fail-fast` decisions; carries `kind`, `reason`, `cause`, `payload`, `snapshot`
- `CircuitBreaker` pure functions — exported from `agentfootprint/reliability` for consumers hydrating breaker state from a persistence store