Pause / Resume
Human-in-the-loop with JSON-checkpointed state. Pause for hours mid-run via askHuman or pauseHere; resume on a different process, day, or server.
An agent processes a refund request. Mid-run, the LLM calls askOperator({ question: 'approve $500 refund?' }). The agent has to wait for a human: could be 30 seconds, could be 3 hours, could be tomorrow. You can't keep the process running. The framework hands you back a JSON checkpoint; you persist it (Redis, Postgres, S3); when the human responds, you resume. Different process, different server, same conversation.
What "pausable" means here
Two halves:
- Pause: a tool calls pauseHere(...) (or the agent uses the built-in askHuman tool); the framework throws a PauseRequest; the agent loop catches it; agent.run() returns a RunnerPauseOutcome containing a JSON-serializable checkpoint plus the pause data.
- Resume: your code persists the checkpoint anywhere; when the human's reply is ready, agent.resume(checkpoint, humanAnswer) rebuilds the state, returns the answer to the paused tool, and continues the agent loop from exactly where it stopped.
The checkpoint is JSON — no functions, no class instances, no closures. Cross-server safe.
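The throw-to-pause control flow described above can be modelled in a few lines. This is a toy sketch only: every name in it (ToyPauseRequest, ToyCheckpoint, run, resume, the three tools) is hypothetical and stands in for the framework's real internals, which are not shown in this guide. It assumes a run is a fixed sequence of tool calls, which a real agent loop is not.

```typescript
// Toy model of throw-to-pause. All names are hypothetical; this is not
// the framework's implementation.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Thrown by a tool to signal "stop here and wait for a human".
class ToyPauseRequest {
  constructor(readonly data: Json) {}
}

// Plain data only, so it survives JSON.stringify / JSON.parse.
interface ToyCheckpoint {
  input: string;
  nextTool: number; // index of the tool that paused
  results: string[]; // outputs of the tools that already ran
}

type ToyOutcome =
  | { paused: true; checkpoint: ToyCheckpoint; pauseData: Json }
  | { paused: false; output: string };

type ToyTool = (input: string) => string; // may throw ToyPauseRequest

const tools: ToyTool[] = [
  (input) => `looked up ${input}`,
  (input) => {
    throw new ToyPauseRequest({ question: `approve refund for ${input}?` });
  },
  () => 'refund issued',
];

function runFrom(checkpoint: ToyCheckpoint): ToyOutcome {
  const results = [...checkpoint.results];
  for (let i = checkpoint.nextTool; i < tools.length; i++) {
    try {
      results.push(tools[i](checkpoint.input));
    } catch (err) {
      if (!(err instanceof ToyPauseRequest)) throw err; // real errors propagate
      // Capture where we stopped and hand the pause data to the caller.
      return {
        paused: true,
        checkpoint: { input: checkpoint.input, nextTool: i, results },
        pauseData: err.data,
      };
    }
  }
  return { paused: false, output: results.join('; ') };
}

function run(input: string): ToyOutcome {
  return runFrom({ input, nextTool: 0, results: [] });
}

function resume(checkpoint: ToyCheckpoint, humanAnswer: Json): ToyOutcome {
  // The human's answer stands in for the paused tool's return value,
  // then the loop continues with the tool after it.
  return runFrom({
    input: checkpoint.input,
    nextTool: checkpoint.nextTool + 1,
    results: [...checkpoint.results, `operator said ${JSON.stringify(humanAnswer)}`],
  });
}

// Round trip: pause, serialize, "move" to another process, resume.
const firstOutcome = run('order 123');
const resumed = firstOutcome.paused
  ? resume(JSON.parse(JSON.stringify(firstOutcome.checkpoint)) as ToyCheckpoint, { approved: true })
  : firstOutcome;
```

The point of the sketch is that nothing in ToyCheckpoint refers to live objects, so the resume call works on a copy that has been through JSON.stringify and JSON.parse.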
A pausing tool
return Agent.create({
  provider: provider ?? exampleProvider('feature'),
  model: 'mock',
})
  .system('You process refunds. Use askOperator to request approval.')
  .tool({
    schema: {
      name: 'askOperator',
      description: 'Ask a human operator for approval.',
      inputSchema: {
        type: 'object',
        properties: { question: { type: 'string' } },
      },
    },
    execute: (args) => {
      const q = (args as { question: string }).question;
      // pauseHere throws a PauseRequest; the Agent catches it,
      // captures the checkpoint, and surfaces a RunnerPauseOutcome
      // up to whoever called .run().
      pauseHere({ question: q, severity: 'high' });
      return ''; // unreachable: pauseHere always throws
    },
  })
  .build();

pauseHere({ question, severity }) throws a special PauseRequest. The agent catches it, captures the checkpoint, and surfaces a RunnerPauseOutcome up to whoever called .run().
The tool's execute looks like it never returns — that's correct. pauseHere always throws. The "return" happens later via resume().
Process A → checkpoint → Process B
// Process A
const result = await agent.run({ message: 'refund order 123' });
if (isPaused(result)) {
// result.checkpoint is JSON-serializable
await db.save('pauses:' + sessionId, JSON.stringify(result.checkpoint));
notifyHuman(result.pauseData);
return; // process A is done
}
// Process B (later, different server, different day)
const checkpoint = JSON.parse(await db.get('pauses:' + sessionId));
const humanAnswer = { approved: true, amount: 500 };
const finalResult = await agent.resume(checkpoint, humanAnswer);
// finalResult is the agent's final string output

Build the agent fresh in Process B: same factory function, NOT the same instance. The checkpoint is the only thing that crosses the process boundary.
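The db object in the snippet above is left undefined by the guide. As one possible shape for it, here is a minimal in-memory stand-in. InMemoryPauseStore and its save/take methods are hypothetical names, not part of the framework, and the read-and-delete behaviour of take is a design assumption added here so that a duplicate human reply cannot resume the same pause twice.

```typescript
// Hypothetical in-memory stand-in for Redis/Postgres/S3. Synchronous for
// brevity; a real store would expose async methods.
class InMemoryPauseStore {
  private readonly rows = new Map<string, string>();

  // Process A: persist the checkpoint as a JSON string.
  save(sessionId: string, checkpoint: unknown): void {
    this.rows.set(`pauses:${sessionId}`, JSON.stringify(checkpoint));
  }

  // Process B: read and delete in one step. Returns undefined when no
  // pause is stored for this session (never saved, or already taken).
  take(sessionId: string): unknown {
    const key = `pauses:${sessionId}`;
    const raw = this.rows.get(key);
    if (raw === undefined) return undefined;
    this.rows.delete(key);
    return JSON.parse(raw);
  }
}

// The checkpoint contents here are made up for illustration.
const store = new InMemoryPauseStore();
store.save('session-1', { step: 3, history: ['refund order 123'] });
const firstTake = store.take('session-1'); // the parsed checkpoint
const secondTake = store.take('session-1'); // undefined: already consumed
```

With a real database the same idea would typically be a single atomic operation (for example a delete that returns the row), so two concurrent webhook deliveries cannot both obtain the checkpoint.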
When to use pause/resume vs error vs callback
| Situation | Use |
|---|---|
| Long-running approval workflow | pause/resume (this guide) |
| Synchronous tool error → LLM retries | tool throws; see Error handling |
| External webhook → trigger something | pause/resume + webhook handler calls agent.resume() |
| Background task fires multiple times | not an agent — use a queue + per-job agent |
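For the webhook row of the table, the handler's job is small: load the stored checkpoint, build a fresh agent, and call agent.resume(). The sketch below shows that shape without tying it to any HTTP framework. Only agent.resume(checkpoint, humanAnswer) comes from this guide; ResumableAgent, WebhookDeps, ApprovalWebhookBody, WebhookResponse, and handleApprovalWebhook are hypothetical names, and the 404 response for an unknown session is an assumption.

```typescript
// Hypothetical types; only resume(checkpoint, humanAnswer) is from the guide.
interface ResumableAgent {
  resume(checkpoint: unknown, humanAnswer: unknown): Promise<unknown>;
}

interface WebhookDeps {
  // Returns the stored checkpoint as raw JSON, or null if none exists.
  loadCheckpoint(sessionId: string): Promise<string | null>;
  // The same factory that built the agent before the pause.
  buildAgent(): ResumableAgent;
}

interface ApprovalWebhookBody {
  sessionId: string;
  approved: boolean;
}

interface WebhookResponse {
  status: number;
  body: unknown;
}

async function handleApprovalWebhook(
  deps: WebhookDeps,
  body: ApprovalWebhookBody,
): Promise<WebhookResponse> {
  const raw = await deps.loadCheckpoint(body.sessionId);
  if (raw === null) {
    // Nothing is paused for this session: unknown id or already resumed.
    return { status: 404, body: { error: 'no paused run for this session' } };
  }
  // Fresh agent instance; the checkpoint carries all the state.
  const agent = deps.buildAgent();
  const result = await agent.resume(JSON.parse(raw), { approved: body.approved });
  return { status: 200, body: { result } };
}
```

Passing loadCheckpoint and buildAgent in as dependencies keeps the handler testable with fakes and makes it explicit that the live agent instance never crosses the process boundary.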
Anti-patterns
- Don't store closures in the checkpoint. The serializer rejects them. Build the agent fresh in Process B from the same factory.
- Don't pass the live agent instance across processes. Pass the checkpoint. The framework rebuilds state from it.
- Don't poll the agent for "is it paused yet?". The result of .run() tells you; isPaused(result) is a typed predicate.
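To catch the first anti-pattern early, you can check any data you attach to a pause before it gets near the serializer. The helper below is a hypothetical pre-check, not the framework's own serializer: findNonJsonPath is an invented name, and its rules (plain objects, arrays, strings, booleans, finite numbers, and null only) are an assumption about what "JSON-serializable" should mean in practice.

```typescript
// Hypothetical helper: returns the path of the first value that would not
// survive a JSON round trip unchanged, or null if everything is plain JSON.
function findNonJsonPath(value: unknown, path: string = '$'): string | null {
  if (value === null) return null;
  if (typeof value === 'string' || typeof value === 'boolean') return null;
  if (typeof value === 'number') return isFinite(value) ? null : path; // NaN/Infinity become null
  if (Array.isArray(value)) {
    for (let i = 0; i < value.length; i++) {
      const bad = findNonJsonPath(value[i], `${path}[${i}]`);
      if (bad !== null) return bad;
    }
    return null;
  }
  if (typeof value === 'object') {
    // Only plain objects qualify: class instances, Maps, Dates and the
    // like lose their type when they pass through JSON.
    const proto = Object.getPrototypeOf(value);
    if (proto !== Object.prototype && proto !== null) return path;
    const obj = value as Record<string, unknown>;
    for (const key of Object.keys(obj)) {
      const bad = findNonJsonPath(obj[key], `${path}.${key}`);
      if (bad !== null) return bad;
    }
    return null;
  }
  return path; // function, undefined, symbol, bigint
}

const plainData = { question: 'approve $500 refund?', severity: 'high' };
const withClosure = { question: 'approve?', onApprove: () => 'done' };

const plainCheck = findNonJsonPath(plainData); // null
const closureCheck = findNonJsonPath(withClosure); // '$.onApprove'
```

Returning a path rather than a boolean makes the failure actionable: the offending field can be named in an error message at the point where the pause data is built.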
Next steps
- Error handling — what's recoverable via retry vs what needs human escalation
- Security guide — permission-gated pauses for sensitive operations
- Reliability guide: agent.resumeOnError(checkpoint) auto-checkpoints on an uncaught mid-run error (the failure throws a RunCheckpointError carrying a JSON-serializable checkpoint), so any failure becomes resumable, not just intentional pauses.
