
Reliability gate

Your agent ships. Once a week, the LLM provider 503s; the cumulative cost on one tenant exceeds budget; a malformed tool-call response sneaks through. Without rules, every one of these is a bespoke try/catch somewhere in your call site. agentfootprint v2.11.5 lets you declare the rules.

The gate provides rules-based retry, fallback, and fail-fast around every LLM call inside an Agent's ReAct loop: you declare the rules, and the framework wraps each call in a loop driven by them.

import { Agent } from 'agentfootprint';
import { ReliabilityFailFastError } from 'agentfootprint/reliability';

const agent = Agent.create({ provider, model: 'claude-sonnet-4-5-20250929' })
  .system('You triage support tickets.')
  .reliability({
    postDecide: [
      // Transient 5xx → retry up to 3 attempts.
      { when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3,
        then: 'retry', kind: 'transient-retry' },
      // Anything else → fail-fast with typed error.
      { when: (s) => s.error !== undefined,
        then: 'fail-fast', kind: 'unrecoverable' },
    ],
    circuitBreaker: { failureThreshold: 3 },
  })
  .build();

try {
  await agent.run({ message: 'help' });
} catch (e) {
  if (e instanceof ReliabilityFailFastError) {
    console.log(e.kind, e.reason, e.payload);
  }
}

A rule’s then field is one of six verbs:

| Verb | Phase | Effect |
| --- | --- | --- |
| continue | pre-check | No issues — proceed to the call |
| ok | post-decide | Call succeeded — commit response and return |
| retry | post-decide | Re-run the same provider (bumps attempt) |
| retry-other | post-decide | Advance to next provider in providers[] |
| fallback | post-decide | Invoke config.fallback(req, lastError) to repair |
| fail-fast | both | Throw ReliabilityFailFastError at agent.run() |
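To see several verbs together, a hypothetical rule set might look like the sketch below. Everything here is an assumption extrapolated from the table, not confirmed API: `primary` and `secondary` stand in for LLMProvider instances, a `providers` array is assumed from the retry-other row, and `'429'` is an illustrative errorKind value.

```typescript
import { Agent } from 'agentfootprint';

const agent = Agent.create({ providers: [primary, secondary], model: 'claude-sonnet-4-5-20250929' })
  .reliability({
    postDecide: [
      // Transient 5xx on the same provider: retry up to 3 attempts.
      { when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3,
        then: 'retry', kind: 'transient-retry' },
      // Rate-limited (hypothetical errorKind): advance to the next provider in providers[].
      { when: (s) => s.errorKind === '429',
        then: 'retry-other', kind: 'rate-limited' },
      // Anything else: let config.fallback(req, lastError) repair the request.
      { when: (s) => s.error !== undefined,
        then: 'fallback', kind: 'repair' },
    ],
    fallback: (req, _lastError) => ({ ...req, temperature: 0 }),
  })
  .build();
```

Rules are evaluated in order, so the most specific (and cheapest) recovery should come first and the catch-all last.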

Streaming + reliability — first-chunk arbitration


Streaming and retry are fundamentally hard to compose. If a stream errors after token 5, retrying re-emits tokens 1-5 (duplicates) — or you buffer the whole stream first and lose progressive UX. There is no clean answer at the LLM provider boundary today.

agentfootprint adopts first-chunk arbitration — the same pattern LangChain uses in RunnableWithFallbacks:

  • Pre-first-chunk failures (connection, headers, breaker-open): the full rule set fires; retry, retry-other, fallback, and fail-fast are all available.
  • Post-first-chunk failures (mid-stream): rules can only fire ok or fail-fast. Rules that want retry, retry-other, or fallback are escalated to fail-fast with kind 'mid-stream-not-retryable'.

Why: once tokens have been delivered to the consumer, neither retrying nor falling back is correct without coordination the LLM provider doesn’t offer (no resume tokens, no idempotency for stream content). The honest behavior is to commit what was delivered, or fail-fast.

Whether to stream remains the consumer's choice; reliability adapts either way.
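Stripped of the framework, first-chunk arbitration reduces to pulling one chunk before committing to a source. The generator below is a library-independent illustration (`firstChunkArbitrate` is an invented name, not agentfootprint API):

```typescript
type Chunk = string;

// Try each source in order. A failure BEFORE the first chunk is retryable:
// move on to the next source (retry-other). Once a chunk has been yielded
// to the consumer, we are committed: mid-stream errors propagate (fail-fast).
async function* firstChunkArbitrate(
  sources: Array<() => AsyncGenerator<Chunk>>,
): AsyncGenerator<Chunk> {
  let lastError: unknown;
  for (const source of sources) {
    const stream = source();
    let first: IteratorResult<Chunk>;
    try {
      first = await stream.next(); // pre-first-chunk: failure is retryable
    } catch (e) {
      lastError = e;
      continue; // arbitrate: try the next source
    }
    if (first.done) return; // empty stream counts as success
    yield first.value; // commit point — no retry beyond here
    yield* stream; // mid-stream errors reach the consumer unmodified
    return;
  }
  throw lastError ?? new Error('all sources failed before the first chunk');
}
```

The commit point is the single `yield first.value`: everything before it can be rolled back invisibly, and nothing after it can.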

Why this design? Because every adjacent framework has confronted the same tradeoff:

| Framework / SDK | Mid-stream retry | Pattern |
| --- | --- | --- |
| Anthropic SDK | No | Skip retry on streams (silently — #ended/reject promise) |
| OpenAI SDK | No | Skip retry on streams (explicit isRetryableBody guard) |
| LangChain RunnableRetry | No | Skip retry on streams (source comment: “not very intuitive”) |
| LangChain RunnableWithFallbacks | Pre-first-chunk only | First-chunk arbitration — pull one chunk, commit on first success |
| LangGraph Pregel | Yes (whole node) | Atomic node retry; task.writes.clear(); accepts duplicate tokens |
| Strands | Yes (whole stream) | while True retry; visible duplicate tokens |
| LlamaIndex | No | Docs: “set streaming=False” to get retry |
| Llama Stack | Pre-first-token only | Client-SDK retry on HTTP errors |

agentfootprint sits with the most disciplined cohort (LangChain RunnableWithFallbacks, LangGraph’s atomic boundary, the OpenAI/Anthropic SDKs): retry honestly when retry is correct, escalate to fail-fast otherwise. The consumer never sees duplicate tokens.

How it composes with the v2.10.x primitives


v2.11.5’s gate is additive — the v2.10.x reliability primitives still work and compose cleanly:

| Primitive | Layer | Use when |
| --- | --- | --- |
| Reliability gate (v2.11.5) | per-call | Rules-based retry / fallback / fail-fast inside the loop |
| withCircuitBreaker(provider) | per-call | Low-level breaker around any LLMProvider |
| .outputFallback({...}) | per-turn | Malformed JSON degradation (3-tier) |
| agent.resumeOnError(checkpoint) | cross-process | Mid-run failure recovery; resume on a different host |

The gate handles in-loop transient failures. outputFallback handles end-of-turn schema failure. resumeOnError handles process-level crash recovery. Compose all three for full coverage.
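A sketch of composing the three layers on one agent. The builder calls follow the shapes in the table above, but the details are assumptions: `provider`, `runId`, and `loadCheckpoint` are presumed to exist in scope, and the `.outputFallback` argument is elided.

```typescript
import { Agent } from 'agentfootprint';
import { withCircuitBreaker } from 'agentfootprint/reliability';

// Per-call: breaker-wrapped provider plus the rules-based gate.
const agent = Agent.create({ provider: withCircuitBreaker(provider), model: 'claude-sonnet-4-5-20250929' })
  .reliability({
    postDecide: [
      { when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3,
        then: 'retry', kind: 'transient-retry' },
      { when: (s) => s.error !== undefined,
        then: 'fail-fast', kind: 'unrecoverable' },
    ],
  })
  // Per-turn: degrade malformed JSON output (3-tier; config elided here).
  .outputFallback({ /* ... */ })
  .build();

// Cross-process: after a crash, a fresh host can resume mid-run from a
// persisted checkpoint instead of replaying the whole run.
const checkpoint = await loadCheckpoint(runId);
if (checkpoint) await agent.resumeOnError(checkpoint);
```

Each layer is independent; omitting one does not change how the others behave.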

Considered extensions:

  • Tee’d buffer mode — internal buffer + atomic-deliver-on-success. Makes mid-stream retry safe at the cost of first-token latency. Opt-in config flag.
  • Token idempotency keys — moot until LLM providers expose them.
  • Stage-level granularity — every retry attempt as a separate stage in the trace (currently one stage execution, internal loop). Trades streaming compatibility for richer trace; available today via the buildReliabilityGateChart chart-builder for consumers composing raw LLMCall + gate patterns.
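The tee'd-buffer mode described above can be illustrated library-independently: buffer the whole stream, deliver only on full success, and discard partial buffers on failure so a retry never surfaces duplicate tokens (`bufferedAttempt` is an invented name, not agentfootprint API).

```typescript
// Buffer a stream attempt internally; deliver the chunks atomically only if
// the source completes. A mid-stream failure discards the partial buffer,
// making retry safe — at the cost of losing progressive first-token delivery.
async function bufferedAttempt(
  source: () => AsyncGenerator<string>,
  maxAttempts: number,
): Promise<string[]> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const buffer: string[] = [];
    try {
      for await (const chunk of source()) buffer.push(chunk); // tee into buffer
      return buffer; // atomic deliver: only on full success
    } catch (e) {
      lastError = e; // partial buffer dropped; the consumer saw nothing
    }
  }
  throw lastError;
}
```

This is the mirror image of first-chunk arbitration: full retryability, zero progressive UX.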

The single example below exercises happy / retry / fail-fast scenarios. The docs site live-imports the source so this snippet stays in sync with the runnable file.

examples/features/09-reliability-gate.ts
/**
 * 09 — Reliability gate (v2.11.5): rules-based retry / fallback / fail-fast
 * around every LLM call inside an Agent's ReAct loop.
 *
 * Where 08 covers the v2.10.x reliability *primitives* (`withCircuitBreaker`,
 * `.outputFallback`, `agent.resumeOnError`), this example covers the
 * v2.11.5 reliability *gate* — declarative rules wrapping every CallLLM
 * inside the agent's loop:
 *
 *   PreCheck rules → continue / fail-fast
 *         ↓
 *   provider call → response or error
 *         ↓
 *   PostDecide rules → ok / retry / retry-other / fallback / fail-fast
 *
 * Three scenarios run:
 *
 * 1. Happy path — reliability configured, first call succeeds.
 *    Agent returns final answer; no rule fires fail-fast.
 *
 * 2. Retry path — provider throws a transient 5xx once; postDecide's
 *    `retry` rule fires; second attempt succeeds.
 *
 * 3. Fail-fast — provider throws; postDecide's `fail-fast` rule
 *    fires; `agent.run()` throws ReliabilityFailFastError.
 *    Caller branches on `e.kind` and `e.payload.phase`.
 *
 * Run: npx tsx examples/features/09-reliability-gate.ts
 */
import { Agent } from '../../src/index.js';
import { ReliabilityFailFastError } from '../../src/reliability/types.js';
import type { LLMProvider, LLMRequest, LLMResponse } from '../../src/adapters/types.js';
import { type ExampleMeta } from '../helpers/cli.js';

export const meta: ExampleMeta = {
  id: 'features/09-reliability-gate',
  title: 'Reliability gate — rules-based retry / fallback / fail-fast around CallLLM',
  group: 'features',
  description:
    'v2.11.5 — declarative reliability rules wrapping every LLM call inside an Agent loop. Demonstrates happy path, transient-retry recovery, and post-decide fail-fast → typed ReliabilityFailFastError. Streaming + reliability uses first-chunk arbitration: pre-first-chunk failures honor the full rule set; mid-stream failures only honor ok / fail-fast.',
  defaultInput: 'demo all three reliability paths',
  providerSlots: ['feature'],
  tags: ['feature', 'reliability', 'reliability-gate', 'retry', 'fail-fast', 'fallback'],
};
// ─── Test providers ──────────────────────────────────────────────

/** Always succeeds. */
function okProvider(reply: string): LLMProvider {
  return {
    name: 'mock',
    complete: async (): Promise<LLMResponse> => ({
      content: reply,
      toolCalls: [],
      usage: { input: 1, output: 1 },
      stopReason: 'end_turn',
    }),
  };
}

/** Throws `failTimes` times then succeeds. Useful for retry scenarios. */
function flakyProvider(opts: {
  failTimes: number;
  error: Error;
  successReply: string;
}): { provider: LLMProvider; getCalls: () => number } {
  let calls = 0;
  const provider: LLMProvider = {
    name: 'flaky',
    complete: async (_req: LLMRequest): Promise<LLMResponse> => {
      calls += 1;
      if (calls <= opts.failTimes) throw opts.error;
      return {
        content: opts.successReply,
        toolCalls: [],
        usage: { input: 1, output: 1 },
        stopReason: 'end_turn',
      };
    },
  };
  return { provider, getCalls: () => calls };
}

/** Always throws. Useful for fail-fast scenarios. */
function alwaysThrowsProvider(error: Error): LLMProvider {
  return {
    name: 'broken',
    complete: async (): Promise<LLMResponse> => {
      throw error;
    },
  };
}
// ─── Scenario 1: happy path — rules configured, first call succeeds ─
async function happyPath(): Promise<{ result: string }> {
  const agent = Agent.create({ provider: okProvider('all good'), model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        // Rule: any error → fail-fast. Doesn't fire here because the
        // call succeeds; the agent returns the LLM's response.
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
        },
      ],
    })
    .build();
  const result = (await agent.run({ message: 'hi' })) as string;
  return { result };
}

// ─── Scenario 2: retry — first call fails transient 5xx, second succeeds ─
async function retryPath(): Promise<{ result: string; providerCalls: number }> {
  const transient = new Error('Service Unavailable');
  (transient as Error & { status?: number }).status = 503;
  const flaky = flakyProvider({
    failTimes: 1,
    error: transient,
    successReply: 'recovered',
  });
  const agent = Agent.create({ provider: flaky.provider, model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        // Retry up to 3 attempts on 5xx. After that, fail-fast on
        // subsequent errors.
        {
          when: (s) => s.errorKind === '5xx-transient' && s.attempt < 3,
          then: 'retry',
          kind: 'transient-retry',
          label: 'transient 5xx, retrying',
        },
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
        },
      ],
    })
    .build();
  const result = (await agent.run({ message: 'go' })) as string;
  return { result, providerCalls: flaky.getCalls() };
}
// ─── Scenario 3: fail-fast — error → typed ReliabilityFailFastError ─
async function failFastPath(): Promise<{
  thrown: boolean;
  kind?: string;
  reason?: string;
  phase?: string;
}> {
  const fatal = new Error('schema violation');
  const agent = Agent.create({ provider: alwaysThrowsProvider(fatal), model: 'mock' })
    .system('You echo.')
    .reliability({
      postDecide: [
        {
          when: (s) => s.error !== undefined,
          then: 'fail-fast',
          kind: 'unrecoverable',
          label: 'unrecoverable error from provider',
        },
      ],
    })
    .build();
  try {
    await agent.run({ message: 'go' });
    return { thrown: false };
  } catch (e) {
    if (e instanceof ReliabilityFailFastError) {
      return {
        thrown: true,
        kind: e.kind,
        reason: e.reason,
        phase: e.payload?.phase,
      };
    }
    throw e;
  }
}
// ─── Entry point ──────────────────────────────────────────────────
export async function run(): Promise<{
  happy: { result: string };
  retry: { result: string; providerCalls: number };
  failFast: { thrown: boolean; kind?: string; reason?: string; phase?: string };
}> {
  const happy = await happyPath();
  const retry = await retryPath();
  const failFast = await failFastPath();
  return { happy, retry, failFast };
}

// Run as a script: regression-guard the example so the CI integration
// test catches drift if the API changes.
if (import.meta.url === `file://${process.argv[1]}`) {
  run()
    .then((out) => {
      console.log('=== reliability gate scenarios ===');
      console.log('happy:    ', out.happy);
      console.log('retry:    ', out.retry);
      console.log('failFast: ', out.failFast);
      // Sanity: each scenario engaged as designed.
      if (out.happy.result !== 'all good') {
        console.error('happy path: unexpected result');
        process.exit(1);
      }
      if (out.retry.providerCalls !== 2 || out.retry.result !== 'recovered') {
        console.error('retry path: unexpected calls/result');
        process.exit(1);
      }
      if (
        !out.failFast.thrown ||
        out.failFast.kind !== 'unrecoverable' ||
        out.failFast.phase !== 'post-decide'
      ) {
        console.error('fail-fast: did not engage as designed');
        process.exit(1);
      }
      console.log('OK');
    })
    .catch((err) => {
      console.error(err);
      process.exit(1);
    });
}
  • Agent.create({...}).reliability(config) — fluent builder method
  • ReliabilityConfig — the rule list shape (see agentfootprint/reliability types)
  • ReliabilityFailFastError — thrown at agent.run() on fail-fast decisions; carries kind, reason, cause, payload, snapshot
  • CircuitBreaker pure functions — exported from agentfootprint/reliability for consumers hydrating breaker state from a persistence store
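To illustrate the pure-function breaker style mentioned above, the sketch below keeps breaker state as a plain serializable object advanced by pure transitions, so a consumer can hydrate it from a persistence store and write it back. The names and thresholds here are illustrative, not the library's actual exports.

```typescript
// Breaker state as plain data: trivially (de)serializable for any store.
type BreakerState = { failures: number; openedAt?: number };

const FAILURE_THRESHOLD = 3; // illustrative defaults, not library values
const COOLDOWN_MS = 30_000;

// Pure transition: fold one call outcome into the state.
function recordOutcome(s: BreakerState, ok: boolean, now: number): BreakerState {
  if (ok) return { failures: 0 };
  const failures = s.failures + 1;
  return failures >= FAILURE_THRESHOLD ? { failures, openedAt: now } : { failures };
}

// Pure query: should calls be short-circuited at time `now`?
function isOpen(s: BreakerState, now: number): boolean {
  return s.openedAt !== undefined && now - s.openedAt < COOLDOWN_MS;
}
```

Because both functions are pure, breaker behavior is deterministic and testable, and the state can live in Redis, Postgres, or anywhere else the consumer already persists data.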