Testing AI Agents for $0: The Adapter-Swap Pattern

April 2026 · Sanjay Krishna Anbalagan


We ran our agent test suite. Fifty-three tests. Each one called Claude with real prompts, real tools, real multi-turn conversations. Every test passed. The bill: $47.12.

The next day, two tests failed. Same code, same prompts. Different responses — because LLMs are non-deterministic. We re-ran. They passed. We re-ran again. One failed. Another $14.

This is the testing problem nobody talks about in the AI agent space: your tests are expensive, slow, flaky, and non-deterministic. Every run costs money. Every assertion is probabilistic. CI pipelines become unreliable. Developers stop writing tests because the feedback loop is broken.

We fixed this with a pattern we call adapter swapping.

The idea is simple: your agent code doesn’t know (or care) which LLM provider is behind it. In tests, you use mock(). In production, you use anthropic() or openai(). The agent code is identical.

import assert from 'node:assert';
import { Agent, defineTool, mock } from 'agentfootprint';
// Real providers live on a peer-dep-isolated subpath:
import { anthropic } from 'agentfootprint/llm-providers';

// The tool: same in tests and production
const calculator = defineTool({
  name: 'calculator',
  description: 'Evaluate a math expression',
  inputSchema: {
    type: 'object',
    properties: { expression: { type: 'string' } },
    required: ['expression'],
  },
  // eval() keeps the demo short; use a real expression parser for untrusted input.
  execute: async (input) => String(eval(input.expression)),
});

// ── Test code ────────────────────────────────────────
const testProvider = mock({
  replies: [
    {
      content: 'Let me calculate that.',
      toolCalls: [{
        id: 'tc1',
        name: 'calculator',
        args: { expression: '42 * 17' },
      }],
    },
    { content: 'The answer is 714.' },
  ],
});

// ── Production code ──────────────────────────────────
const prodProvider = anthropic({ defaultModel: 'claude-sonnet-4-20250514' });

// ── Agent code: IDENTICAL in both cases ──────────────
function buildAgent(provider) {
  return Agent.create({ provider, model: 'anthropic' })
    .system('You are a helpful calculator assistant.')
    .tool(calculator)
    .maxIterations(5)
    .build();
}

// Test: $0, instant, deterministic
const testAgent = buildAgent(testProvider);
const result = await testAgent.run({ message: 'What is 42 times 17?' });
assert(result.includes('714')); // run() resolves to the final-answer string

// Production: real LLM, real cost
const prodAgent = buildAgent(prodProvider);

The mock provider returns exactly the responses you specify. Tool calls happen deterministically. The agent’s ReAct loop executes the same way — calling tools, processing results, generating the next turn — but without any API calls.
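To make the seam concrete, here is a minimal, library-free sketch of the same idea. The names `Provider`, `scriptedProvider`, and `runLoop` are hypothetical and are not agentfootprint's real types or internals. The sketch only assumes that a provider is something that turns a message history into a reply.

```typescript
// Hypothetical shapes for illustration only; not agentfootprint's real types.
type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type Reply = { content: string; toolCalls?: ToolCall[] };
type ChatMessage = { role: 'user' | 'assistant' | 'tool'; content: string };
type Tool = { name: string; execute: (args: Record<string, unknown>) => Promise<string> };

// The seam: the loop below depends on this interface and nothing else.
interface Provider {
  complete(messages: ChatMessage[]): Promise<Reply>;
}

// Scripted provider: returns the queued replies, one per call, with no network.
function scriptedProvider(replies: Reply[]): Provider {
  let turn = 0;
  return {
    async complete() {
      const reply = replies[turn];
      if (reply === undefined) throw new Error('scripted provider ran out of replies');
      turn++;
      return reply;
    },
  };
}

// Tiny ReAct-style loop: ask the provider, run any requested tools, repeat.
async function runLoop(
  provider: Provider,
  tools: Tool[],
  message: string,
  maxIterations = 5,
): Promise<string> {
  const messages: ChatMessage[] = [{ role: 'user', content: message }];
  for (let i = 0; i < maxIterations; i++) {
    const reply = await provider.complete(messages);
    messages.push({ role: 'assistant', content: reply.content });
    if (!reply.toolCalls || reply.toolCalls.length === 0) return reply.content;
    for (const call of reply.toolCalls) {
      const tool = tools.find((t) => t.name === call.name);
      if (!tool) throw new Error(`unknown tool: ${call.name}`);
      // The real tool handler runs; only the model's decisions are scripted.
      messages.push({ role: 'tool', content: await tool.execute(call.args) });
    }
  }
  throw new Error('max iterations reached');
}

// Usage: a real tool driven by two scripted turns.
const multiply: Tool = {
  name: 'multiply',
  execute: async (args) => String(Number(args['a']) * Number(args['b'])),
};
const demo = runLoop(
  scriptedProvider([
    {
      content: 'Let me calculate that.',
      toolCalls: [{ id: 'tc1', name: 'multiply', args: { a: 42, b: 17 } }],
    },
    { content: 'The answer is 714.' },
  ]),
  [multiply],
  'What is 42 times 17?',
);
demo.then((answer) => console.log(answer)); // prints "The answer is 714."
```

A production adapter in this sketch would implement the same `complete` method on top of an HTTP client, and `runLoop` would not change.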

This isn’t just “mock the HTTP call.” The mock adapter participates in the full agent lifecycle:

Tool call orchestration. The mock returns a tool call → the agent executes the real tool handler → the result goes back to the mock → the mock returns the next response. Your tool handlers run for real. Your error handling runs for real. Only the LLM is mocked.

const provider = mock({
  replies: [
    // Turn 1: LLM decides to search
    {
      content: 'Searching for information...',
      toolCalls: [{ id: 'tc1', name: 'search', args: { query: 'AI trends' } }],
    },
    // Turn 2: LLM processes search results and responds
    { content: 'Based on my research, here are the top AI trends...' },
  ],
});

const agent = Agent.create({ provider, model: 'mock' })
  .tool(searchTool) // Real tool: actually executes
  .build();

const result = await agent.run({ message: 'What are the AI trends?' });
// searchTool.execute() was called with { query: 'AI trends' }
// The full ReAct loop ran: mock → tool → mock → response

Multi-turn conversations. Each entry in the mock replies array is one LLM turn. The agent processes them in sequence, exactly like it would with a real provider.

Recorder verification. Attach a recorder to mock runs and watch the real emit-channel events stream by. Verify that your observability pipeline captures the right data. A recorder is just { id, onEmit } — agentfootprint emits typed events like agentfootprint.stream.llm_end (carries usage token counts) and agentfootprint.agent.turn_end.

let llmCalls = 0;

const provider = mock({
  replies: [
    { content: 'thinking...', toolCalls: [{ id: 't1', name: 'noop', args: {} }] },
    { content: 'done' },
  ],
});

const agent = Agent.create({ provider, model: 'mock' })
  .tool(noopTool)
  .recorder({
    id: 'observability',
    onEmit: (e) => {
      if (e.name === 'agentfootprint.stream.llm_end') llmCalls++;
    },
  })
  .build();

await agent.run({ message: 'Hello' });
assert(llmCalls === 2); // Two LLM turns

For ready-made observability, the costRecorder, agentRecorder, and toolsRecorder factories wrap this same emit channel.

Sequence pipelines. Mock individual agents within a composition. Test the orchestration logic without any API calls.

import { Sequence } from 'agentfootprint';

const pipeline = Sequence.create()
  .step('research', mockResearchAgent) // each step is any Runner
  .step('write', mockWriterAgent)
  .build();

const result = await pipeline.run({ message: 'Write about AI safety' });
// Both agents ran with mocks, so pipeline orchestration was tested for $0

Error handling. Resilience wrappers decorate the provider, so a mock that throws once lets you test retry/fallback for $0:

import { withRetry } from 'agentfootprint/resilience';

// Simulate API failure on the first call, success after that.
let calls = 0;
const flakyProvider = mock({
  respond: () => {
    if (calls++ === 0) throw new Error('rate_limit: too many requests');
    return { content: 'Success on retry!' };
  },
});

// withRetry wraps the PROVIDER (drop-in LLMProvider), not the agent.
const robustProvider = withRetry(flakyProvider, {
  maxAttempts: 3,
  initialDelayMs: 100,
});

const agent = Agent.create({ provider: robustProvider, model: 'mock' }).build();
const result = await agent.run({ message: 'Hello' });
assert(result === 'Success on retry!');

withFallback and fallbackProvider (also from agentfootprint/resilience) compose the same way — they wrap one provider and fall through to another on failure.
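Because these wrappers take a provider and return a provider, they can be stacked in any order. The sketch below shows that decorator shape in isolation. `retrying` and `fallingBack` are hypothetical stand-ins written for this example, not the signatures of agentfootprint's `withRetry` or `withFallback`, and the doubling backoff schedule is an assumption.

```typescript
// Hypothetical shapes for illustration only; not agentfootprint's real types.
type Reply = { content: string };
interface Provider {
  complete(prompt: string): Promise<Reply>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry decorator: same interface in, same interface out.
// Assumed backoff: the delay doubles after each failed attempt.
function retrying(inner: Provider, maxAttempts: number, initialDelayMs: number): Provider {
  return {
    async complete(prompt) {
      let lastError: unknown;
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          return await inner.complete(prompt);
        } catch (err) {
          lastError = err;
          if (attempt < maxAttempts) await sleep(initialDelayMs * 2 ** (attempt - 1));
        }
      }
      throw lastError;
    },
  };
}

// Fallback decorator: try the primary, fall through to the secondary on any error.
function fallingBack(primary: Provider, secondary: Provider): Provider {
  return {
    async complete(prompt) {
      try {
        return await primary.complete(prompt);
      } catch {
        return secondary.complete(prompt);
      }
    },
  };
}

// Usage: a provider that fails once, wrapped in retry, then in fallback.
let attempts = 0;
const flaky: Provider = {
  async complete() {
    if (attempts++ === 0) throw new Error('rate_limit: too many requests');
    return { content: 'Success on retry!' };
  },
};
const backup: Provider = {
  async complete() {
    return { content: 'Answer from backup' };
  },
};
const robust = fallingBack(retrying(flaky, 3, 10), backup);
```

Here the retry layer absorbs the single failure, so the backup is never consulted. If every retry failed, the fallback layer would take over.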

agentfootprint's concepts build on one another: LLMCall → Agent → RAG → Sequence/Parallel → Swarm. Each one accepts a provider, so each one works with mock().

| Concept | API | Testing pattern |
|---|---|---|
| LLMCall | LLMCall.create(...) | Mock one response |
| Agent | Agent.create(...) | Mock response sequence with tool calls |
| RAG | defineRAG(...) | Mock retriever + LLM response |
| Composition | Sequence / Parallel / Conditional / Loop | Mock each step in the pipeline |
| Swarm | swarm(...) | Mock router decisions + specialist responses |

You start simple (test an LLMCall), compose up (test an Agent with tools), and eventually test full swarm orchestrations — all at $0.

Mock tests verify your orchestration logic, tool integrations, and error handling. They don’t verify prompt quality or response appropriateness. For that, you still need real LLM calls — but far fewer.

Our recommended split:

  • 90% mock tests — orchestration, tools, error handling, recorders, pipelines
  • 10% real LLM tests — prompt quality, response format, edge cases

The mock tests run in CI on every commit (fast, free, deterministic). The real LLM tests run nightly or before release (slow, costly, but necessary).
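One way to keep the two tiers apart is to gate the live tests behind an explicit opt-in flag, so a normal CI run can never spend money by accident. The helper below is a hypothetical sketch: `tiersToRun` and the `RUN_LIVE_LLM_TESTS` variable are names invented for this example, and checking `ANTHROPIC_API_KEY` assumes the Anthropic provider is the one under test.

```typescript
// Hypothetical helper: decides which test tiers to run from environment flags.
type Env = Record<string, string | undefined>;
type Tier = 'mock' | 'live';

function tiersToRun(env: Env): Tier[] {
  // Live tests need both an explicit opt-in and a credential.
  const optedIn = env['RUN_LIVE_LLM_TESTS'] === '1';
  const hasKey = Boolean(env['ANTHROPIC_API_KEY']);
  return optedIn && hasKey ? ['mock', 'live'] : ['mock'];
}

// Every commit: no flag set, so only the mock tier runs.
const onCommit = tiersToRun({});

// Nightly job: flag and key are both present, so both tiers run.
const nightly = tiersToRun({ RUN_LIVE_LLM_TESTS: '1', ANTHROPIC_API_KEY: 'sk-placeholder' });
```

In a real suite you would pass `process.env` and feed the result into your test runner's skip mechanism, for example the `skip` option of `node:test`.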

npm install agentfootprint

import { Agent, mock } from 'agentfootprint';

const agent = Agent.create({
  provider: mock({ reply: 'Hello! How can I help?' }),
  model: 'mock',
})
  .system('You are a helpful assistant.')
  .build();

const result = await agent.run({ message: 'Hi there' });
console.log(result); // "Hello! How can I help?"

Zero API calls. Zero cost. Deterministic. Your CI pipeline will thank you.