Skills are context engineering for instructions — abstracted so you don't do it by hand and get it wrong. A conceptual walk through what they actually are, and why the abstraction exists.

What ships today

Every API surface this essay describes — defineSkill({ id, description, body, tools, surfaceMode, refreshPolicy }), resolveSurfaceMode(provider, model), and SkillRegistry — is shipped and runnable. Per-mode runtime dispatch is live: surfaceMode: 'tool-only' now suppresses the body from the system slot and delivers it via the read_skill tool result; 'both' does both. The one field still reserved is refreshPolicy — the option is typed and non-breaking today, but the long-context re-injection hook is not yet wired into the engine.

In one line

Skills are context engineering for instructions — abstracted so you don't do it by hand and get it wrong.

That's the whole thing. The rest of this doc is why that's the right frame, what "by hand and wrong" looks like, and how agentfootprint's Skills machinery guarantees the context engineering stays correct across providers and context lengths.

Unpacking it

"Context engineering" is the named practice in the modern agent stack: deciding what goes into the LLM's context window on each turn. What lives in the system prompt. What lives in tool descriptions. What's retrieved and injected. What's pruned. Whether a piece of content sits ahead of the conversation or inside the recency window. It's one of the handful of disciplines that separates a working agent from a broken one.

"For instructions" narrows it to one sub-domain. There are four kinds of content you engineer into a turn:

| Kind of content | The named abstraction | Example |
| --- | --- | --- |
| Instructions: how to do something | Skills (+ `defineInstruction`) | "When investigating port errors, fetch metrics first." |
| Data: what the world looks like | RAG, retrieval | "Here are the 5 most-similar past tickets." |
| History: what has happened so far | Memory | "The user said earlier that their name is Alice." |
| Capabilities: what can be done | Tools | "You may call `getMetrics(interface)`." |

Skills are the context-engineering abstraction for the instructions row. Memory (which we shipped in 1.14–1.15) is the abstraction for history. Tools is the abstraction for capabilities. RAG is the abstraction for data. Each row has its own answer because each row has a different shape of correctness problem.

"Abstracted so you don't do it by hand and get it wrong" is the value proposition in one phrase. Doing this by hand is not conceptually hard — it's just unusually error-prone because the right answer changes with conditions you don't control:

Put instructions in the system prompt and they drift out of attention on non-Claude providers.
Put them all in tool results and you need a registry to dispatch them.
Put every instruction everywhere and you pay the token cost per turn forever — even when none apply.
Put them too early in the conversation and they're stale by turn 40.
Put them too late and the model hasn't seen them when it needs them.

Getting this right is non-obvious, provider-specific, and context-length-dependent. Skills are the abstraction that makes it correct by construction. An app author shouldn't need to know which generation of which provider has strong system-prompt adherence, or which tool-result positions survive long-context attention decay. A library can know that. Skills is where we encode it.
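As a rough illustration of the token-cost failure mode, the arithmetic below compares sending every skill body on every turn against advertising one-line descriptions and loading a single body on demand. All sizes are hypothetical, prompt caching is ignored, and the sketch assumes an activated body stays in the message history (and is re-sent) from the turn it was loaded.

```typescript
// Hypothetical catalog sizes, chosen only to make the arithmetic concrete.
interface CatalogShape {
  skillCount: number;        // skills in the catalog
  bodyTokens: number;        // tokens in one skill body
  descriptionTokens: number; // tokens in one one-line description
}

// Every body lives in the system prompt, so every body is re-sent each turn.
function alwaysOnCost(c: CatalogShape, turns: number): number {
  return c.skillCount * c.bodyTokens * turns;
}

// Descriptions are re-sent each turn; one body is loaded at `activatedAtTurn`
// and then rides along in the message history for the remaining turns.
function onDemandCost(c: CatalogShape, turns: number, activatedAtTurn: number): number {
  const descriptionCost = c.skillCount * c.descriptionTokens * turns;
  const bodyCost = c.bodyTokens * (turns - activatedAtTurn + 1);
  return descriptionCost + bodyCost;
}

const exampleCatalog: CatalogShape = { skillCount: 20, bodyTokens: 800, descriptionTokens: 30 };
const alwaysOn = alwaysOnCost(exampleCatalog, 40);    // 20 * 800 * 40 = 640000
const onDemand = onDemandCost(exampleCatalog, 40, 1); // 20 * 30 * 40 + 800 * 40 = 56000
```

Under these assumed numbers the on-demand path sends under a tenth of the instruction tokens; the real ratio depends on your catalog and on caching.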

Why this framing matters

The first question most teams ask about Skills is reasonable and almost always wrong:

"Is Skills a new LLM protocol we need to support?"

No. Skills is not a protocol. Skills is not a new field on the Messages API. Skills is not a model-level capability Anthropic trained into Claude 4. At the protocol layer, Skills is a folder convention plus a loading discipline, riding entirely on top of fields that have been there since the Messages API shipped: system, tools, tool_use, tool_result. Everything interesting is happening one layer up — in how a framework decides what to put where.

Context-engineering-for-instructions is what that "one layer up" is doing. That's the whole game.

This doc exists because when that framing is missing, people reinvent Skills badly. They treat it as a protocol feature (it isn't). They build it in parallel with their existing instruction system (it's the same thing). They build it Claude-first and ship to OpenAI (the context engineering silently drifts). Understanding the three-stage anatomy and the cross-provider-correctness tradeoff is the difference between "we built a Skills system" and "we packaged Claude Agent SDK's wrapper with the quiet bug that broke on our OpenAI deployment."

Companion doc

For the actual API — defineSkill, SkillRegistry, AgentBuilder.skills() — see the Skills guide. This page is the why, that page is the how.

Three things that look the same, aren't

Before the anatomy, clear up a frequent conflation:

| Thing | Shipped by | What it is | Delivery mechanism |
| --- | --- | --- | --- |
| Skills (Claude Agent SDK) | Anthropic | A folder convention: `SKILL.md` + YAML frontmatter (`name`, `description`) loaded progressively on demand. | Descriptions advertised upfront; body fetched via tool call, returned as tool result. |
| Steering Docs (AWS Strands) | AWS | Always-on behavioral priors attached to the agent. | Static injection into the system prompt. |
| `Injection` (agentfootprint) | agentfootprint | The one primitive every flavor reduces to: `{ trigger, inject, flavor }`. A per-iteration `trigger` decides inclusion; `inject` carries content to one or more slots. | `inject` can target all three slots: `systemPrompt`, `tools`, `messages`. |

These are not equivalent primitives. Skills is a discovery-and-dispatch pattern. Steering Docs are always-on context. An Injection (via defineInstruction) is a predicate-gated rule.

A Skill, in our library, is just an Injection with an llm-activated trigger — built by defineSkill. It is not a peer of the primitive. That hierarchy keeps the primitive count honest.
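To make the hierarchy concrete, here is a minimal, self-contained sketch of "a Skill is an Injection with an llm-activated trigger". The type names, the `flavor` values, and the `isActive` helper are hypothetical, modelled on the table above; they are not agentfootprint's actual types.

```typescript
// Hypothetical shapes for illustration; not the library's real definitions.
type Slot = 'systemPrompt' | 'tools' | 'messages';

interface TurnContext {
  activatedIds: string[]; // ids the model has activated through the loader tool
  userMessage: string;
}

type Trigger =
  | { kind: 'always' }                                          // steering-doc style
  | { kind: 'rule'; activeWhen: (ctx: TurnContext) => boolean } // predicate-gated
  | { kind: 'llm-activated'; viaTool: string };                 // skill style

interface Injection {
  id: string;
  flavor: 'steering' | 'instruction' | 'skill';
  trigger: Trigger;
  inject: Partial<Record<Slot, string>>;
}

// Evaluated once per iteration to decide whether the injection is included.
function isActive(inj: Injection, ctx: TurnContext): boolean {
  const t = inj.trigger;
  if (t.kind === 'always') return true;
  if (t.kind === 'rule') return t.activeWhen(ctx);
  return ctx.activatedIds.indexOf(inj.id) !== -1; // llm-activated
}

const steeringDoc: Injection = {
  id: 'tone',
  flavor: 'steering',
  trigger: { kind: 'always' },
  inject: { systemPrompt: 'Be concise.' },
};

const billingRule: Injection = {
  id: 'billing-rule',
  flavor: 'instruction',
  trigger: { kind: 'rule', activeWhen: (ctx: TurnContext) => /refund/i.test(ctx.userMessage) },
  inject: { systemPrompt: 'Confirm the order id before refunding.' },
};

// The skill differs from the two above only in its trigger.
const billingSkillSketch: Injection = {
  id: 'billing',
  flavor: 'skill',
  trigger: { kind: 'llm-activated', viaTool: 'read_skill' },
  inject: { systemPrompt: 'When handling billing: confirm the order id first.' },
};
```

All three values share one shape; only the trigger differs, which is why the Skill does not count as a separate primitive.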

What ships today (the runnable surface)

The runnable surface is defineSkill({ id, description, body, tools }). The library auto-attaches a read_skill tool to your agent's tool registry; the LLM activates the skill by calling it. When activated, the skill's body is delivered to the model (in the system slot, in the read_skill tool result, or both, depending on surfaceMode) and the skill's tools become available for the rest of the turn:

const billingSkill = defineSkill({
  id: 'billing',
  description: 'Read for refund / charge / billing questions. Unlocks process_refund.',
  body: 'When handling billing: confirm the order id, then call process_refund. Always state the amount + payment method in the final reply.',
  tools: [refundTool],
});

That's the foundation. The rest of this essay explains why the design looks this way and what is built on top of it (surfaceMode, refreshPolicy, SkillRegistry).

The actual anatomy — three stages over Messages API

A Skill call, end-to-end, decomposes into three stages. Every stage maps to an existing Messages API primitive.

Stage A — Discovery: "what Skills exist?"

The model needs to know a Skill is available before it can invoke one. The Agent SDK's choice: inject an <available_skills> system-reminder block into the system prompt on every turn:

You have these skills available:
- pptx:  Create PowerPoint presentations from a template.
- xlsx:  Parse and manipulate Excel spreadsheets.
- pdf:   PDF manipulation toolkit.
- ...

This is just text in the system field. No new API.
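A minimal sketch of Stage A, assuming skills are held as simple `{ id, description }` records. The helper names and the exact wrapping of the block are illustrative assumptions, not the Agent SDK's actual formatting code.

```typescript
// Hypothetical record for one advertised skill.
interface SkillSummary {
  id: string;
  description: string;
}

// Renders the discovery list as plain text, one line per skill.
function renderAvailableSkills(skills: SkillSummary[]): string {
  const lines = skills.map((s) => `- ${s.id}: ${s.description}`);
  return ['You have these skills available:'].concat(lines).join('\n');
}

// Appends the discovery block to whatever system prompt the app already has.
// The result is an ordinary string destined for the `system` field.
function buildSystemPrompt(basePrompt: string, skills: SkillSummary[]): string {
  if (skills.length === 0) return basePrompt;
  const block = `<available_skills>\n${renderAvailableSkills(skills)}\n</available_skills>`;
  return `${basePrompt}\n\n${block}`;
}

const exampleSystem = buildSystemPrompt('You are a helpful assistant.', [
  { id: 'pptx', description: 'Create PowerPoint presentations from a template.' },
  { id: 'pdf', description: 'PDF manipulation toolkit.' },
]);
```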

Stage B — Invocation: "I want to use the `pptx` skill"

The model emits a regular tool call to a built-in Skill-loader tool:

{ "type": "tool_use", "name": "Skill", "input": { "skill": "pptx" } }

Identical protocol to any other tool use. The SDK registered the loader tool as one line in the tools field alongside whatever else the app registered.

Stage C — Execution: "here's the skill body"

The SDK looks up the pptx folder, reads SKILL.md, and returns its body as the tool_result:

You are now following skill: pptx (v2.1).
Purpose: Create PowerPoint presentations from a template.
Procedure:
  1. Read the template from templates/default.pptx
  2. Extract user intent from the current turn
  3. Render slides by applying substitutions
  ...

The tool result lands in the message stream. The model reads it in the freshest possible position and follows the instructions.

That's it. Three stages. Three primitives. No new anything.
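Stages B and C can be sketched as one function from a `tool_use` block to a `tool_result` block. The block shapes follow the Messages API fields named above; the in-memory `skillBodies` map stands in for reading `SKILL.md` from disk, and the handler name and example body are assumptions made for illustration.

```typescript
// Content-block shapes as used by the Messages API tool-use flow.
interface ToolUseBlock {
  type: 'tool_use';
  id: string;
  name: string;
  input: { skill?: string };
}

interface ToolResultBlock {
  type: 'tool_result';
  tool_use_id: string;
  content: string;
  is_error?: boolean;
}

// Stand-in for the skill folders on disk (hypothetical content).
const skillBodies: Record<string, string> = {
  pptx: 'You are now following skill: pptx.\nProcedure: read the template, then render slides.',
};

// Stage B arrives as `call`; Stage C is the returned tool_result.
function handleSkillCall(call: ToolUseBlock, bodies: Record<string, string>): ToolResultBlock {
  const id = call.input.skill;
  const known = id !== undefined && Object.prototype.hasOwnProperty.call(bodies, id);
  if (id === undefined || !known) {
    return {
      type: 'tool_result',
      tool_use_id: call.id,
      content: `Unknown skill: ${String(id)}`,
      is_error: true,
    };
  }
  // The body travels back in the message stream, next to the model's latest turn.
  return { type: 'tool_result', tool_use_id: call.id, content: bodies[id] };
}

const exampleResult = handleSkillCall(
  { type: 'tool_use', id: 'toolu_01', name: 'Skill', input: { skill: 'pptx' } },
  skillBodies,
);
```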

The recency question — what we cracked

The intuition most people arrive at first (including me):

"Putting skill descriptions in the system prompt gives them a worse recency profile than putting them via the Tools API. We can do better by routing discovery through a tool."

This is wrong, and the correction is worth pinning down because it reshapes the whole design argument.

Tool descriptions are already upfront context. Every Messages API call includes the full tools field — every tool's name, description, and input schema — alongside the system prompt. Both sit ahead of the conversation. Both arrive with every request. Their recency profile is identical from the model's positional perspective.

So routing skill discovery through a tool description vs through the system prompt makes no architectural difference to recency. In both flows:

Skill descriptions are upfront on every request.
The model picks one based on the user's turn.
Only the chosen skill's body travels via tool_result (the actually-fresh position).
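The equivalence can be seen by building both request variants side by side. This is a simplified sketch of a request body; the interface and function names are hypothetical and only the `system`, `tools`, and `messages` fields are modelled.

```typescript
interface ToolDef {
  name: string;
  description: string;
  input_schema: { type: 'object'; properties: object };
}

interface ChatTurn {
  role: 'user' | 'assistant';
  content: string;
}

interface RequestSketch {
  system: string;
  tools: ToolDef[];
  messages: ChatTurn[];
}

const catalogText =
  '- pptx: Create PowerPoint presentations from a template.\n- pdf: PDF manipulation toolkit.';

function loaderTool(description: string): ToolDef {
  return {
    name: 'read_skill',
    description,
    input_schema: { type: 'object', properties: { id: { type: 'string' } } },
  };
}

// Variant 1: the catalog is advertised in the system prompt.
function discoveryViaSystem(messages: ChatTurn[]): RequestSketch {
  return {
    system: 'You are a support agent.\n' + catalogText,
    tools: [loaderTool('Load a skill by id.')],
    messages,
  };
}

// Variant 2: the catalog is advertised in the loader tool's description.
function discoveryViaTool(messages: ChatTurn[]): RequestSketch {
  return {
    system: 'You are a support agent.',
    tools: [loaderTool('Load a skill by id. Available skills:\n' + catalogText)],
    messages,
  };
}

const conversation: ChatTurn[] = [{ role: 'user', content: 'Make me a slide deck.' }];
const viaSystem = discoveryViaSystem(conversation);
const viaTool = discoveryViaTool(conversation);
```

In both variants the catalog sits outside `messages` and is sent with every request; only the skill body ever enters the message stream.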

Why did Anthropic pick the system prompt, then? Not for recency — for organization:

Namespacing. Skill entries don't pollute the ordinary tools list in the UI or the model's mental model. The user's Read tool and the Skill loader shouldn't be peers.
List bloat. Apps might expose 50 skills; stuffing them into tools crowds every turn's payload. Putting them in a system-prompt block renders them as documentation instead.
Uniform loader signature. One Skill(name) tool handles any skill; adding a new skill doesn't change the advertised tool surface.

These are ergonomic wins. They are not architectural ones. The recency profile is the same either way — assuming the provider respects system-prompt position.

That assumption is where cross-provider correctness comes in.

The real problem: cross-provider adherence is a training thing

The protocol says nothing about recency bias. Training does.

Claude ≥ 3.5 is trained hard on system-prompt adherence. Instructions placed there are followed even in long contexts.
Claude pre-3.5, GPT, Llama, Mistral, and smaller models have a well-documented recency bias. Under long contexts (or adversarial tool-result interleaving), instructions placed in the system prompt lose out to instructions placed near the conversation's tail.

So the Skills design that works brilliantly on Claude — descriptions in the system prompt, bodies via tool result — degrades silently on other providers. Your OpenAI-backed eval passes on turn 1 and drifts by turn 40. Your local-model deployment never quite gets there.

This is what makes Skills a correctness concern for a framework — not just an ergonomics one. An application author can't reasonably be expected to know which generation of which provider was trained to honor system-prompt anchoring vs which one leans on recency. A library can.

What agentfootprint does differently

Given the above, our Skills implementation makes four choices. All of them fit in ~200 LOC of sugar on top of primitives we already had (the Injection primitive, the on-tool-return trigger, the tool registry).

1. `surfaceMode: 'auto'` — the library picks per provider

import { resolveSurfaceMode } from 'agentfootprint';

// Pure function — no side effects. Inspect what 'auto' will resolve to.
resolveSurfaceMode('anthropic', 'claude-sonnet-4-5-20250929'); // → 'both'
resolveSurfaceMode('anthropic', 'claude-3-haiku-20240307');    // → 'tool-only'
resolveSurfaceMode('openai', 'gpt-4o');                        // → 'tool-only'
resolveSurfaceMode('mock');                                    // → 'tool-only'

Four modes: 'system-prompt', 'tool-only', 'both', 'auto'. On Claude ≥ 3.5 we use 'both' (cheap to cache, high adherence). Everywhere else we fall back to 'tool-only' — recency-first delivery via tool result, which is a protocol-level guarantee and doesn't rely on training.
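The rule described above can be approximated in a few lines. This is a hypothetical re-implementation that reproduces the four documented results, not the library's source; the model-id parsing in particular is an assumption about how version numbers appear in Anthropic model ids.

```typescript
type ResolvedSurfaceMode = 'system-prompt' | 'tool-only' | 'both';

// Extracts [major, minor] from a Claude model id. One- or two-digit segments
// are treated as version parts; the 8-digit date suffix is ignored.
function claudeVersion(model: string): [number, number] | null {
  const nums = model
    .split('-')
    .filter((part) => /^\d{1,2}$/.test(part))
    .map((part) => parseInt(part, 10));
  if (nums.length === 0) return null;
  return [nums[0], nums.length > 1 ? nums[1] : 0];
}

// Sketch of what 'auto' resolves to: 'both' on Claude >= 3.5, else 'tool-only'.
function sketchResolveSurfaceMode(provider: string, model?: string): ResolvedSurfaceMode {
  if (provider !== 'anthropic' || model === undefined) return 'tool-only';
  const version = claudeVersion(model);
  if (version === null) return 'tool-only';
  const major = version[0];
  const minor = version[1];
  const atLeast35 = major > 3 || (major === 3 && minor >= 5);
  return atLeast35 ? 'both' : 'tool-only';
}
```

Defaulting every unrecognized provider or model to 'tool-only' keeps the fallback on the path that does not depend on training.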

This is the difference between a Skills pattern that happens to work on Claude and one that is correct across every provider the library supports, mock() included (evals match production).

2. The `on-tool-return` trigger as the recency-first injection point

agentfootprint already shipped the Injection primitive with an on-tool-return trigger ({ kind: 'on-tool-return', toolName }, built via defineInstruction({ activeWhen: (ctx) => ctx.lastToolResult?.toolName === '…' })) — the cross-provider-correct path. Skills ride on top of it:

A tool result lands in the message stream and is exposed as ctx.lastToolResult.
on-tool-return triggers (and rule predicates inspecting ctx.lastToolResult) evaluate on that result.
Matched injections deliver text in the freshest position the model is about to read.

That text lands in the freshest possible position by protocol, not by training. Skills inherit this for free because every Skill is an Injection and shares the same trigger machinery.
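The three steps above can be sketched as a single pure function. The `activeWhen` predicate over `ctx.lastToolResult` mirrors the form quoted earlier; every other name here (`ToolResultInfo`, `RuleInjection`, `applyOnToolReturn`) is a hypothetical stand-in rather than the library's API.

```typescript
interface ToolResultInfo {
  toolName: string;
  content: string;
  isError: boolean;
}

interface ReturnContext {
  lastToolResult?: ToolResultInfo;
}

interface RuleInjection {
  id: string;
  activeWhen: (ctx: ReturnContext) => boolean;
  text: string;
}

// Evaluates every rule against the result that just landed and appends the
// matched text to that result, so it shares the result's position.
function applyOnToolReturn(result: ToolResultInfo, rules: RuleInjection[]): string {
  const ctx: ReturnContext = { lastToolResult: result };
  const matched = rules.filter((r) => r.activeWhen(ctx)).map((r) => r.text);
  if (matched.length === 0) return result.content;
  return [result.content].concat(matched).join('\n\n');
}

const metricsFailureRule: RuleInjection = {
  id: 'metrics-failure',
  activeWhen: (ctx: ReturnContext) =>
    ctx.lastToolResult?.toolName === 'getMetrics' && ctx.lastToolResult.isError,
  text: 'When getMetrics fails, retry once before reporting the error.',
};

const enriched = applyOnToolReturn(
  { toolName: 'getMetrics', content: 'timeout', isError: true },
  [metricsFailureRule],
);
```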

3. Refresh policy for long contexts

defineSkill({
  id: 'critical-rule',
  description: 'Critical reasoning rule for long-context runs',
  body: '...',
  refreshPolicy: { afterTokens: 50_000, via: 'tool-result' },
});

Belt-and-suspenders: even if system-prompt anchoring was honored at turn 1, by turn 40 on a non-Claude provider it's drifted out of effective attention. Re-inject the body as a fresh tool result past a token threshold. The API surface is shipped today; the runtime hook is still reserved (the engine accepts but ignores refreshPolicy until the long-context attention work lands) — so specifying it today is non-breaking.
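Since the runtime hook is still reserved, the following is only a guess at what a token-threshold check could look like. The policy shape matches the option shown above; the function, the reset-on-refresh behavior, and the per-turn token counts are assumptions.

```typescript
interface RefreshPolicySketch {
  afterTokens: number;
  via: 'tool-result';
}

// Returns the 1-based turns at which a re-injection would fire, given how many
// tokens each turn adds. The counter resets whenever the body is re-injected.
function refreshTurns(policy: RefreshPolicySketch, tokensPerTurn: number[]): number[] {
  const fired: number[] = [];
  let sinceLastInjection = 0;
  tokensPerTurn.forEach((tokens, index) => {
    sinceLastInjection += tokens;
    if (sinceLastInjection >= policy.afterTokens) {
      fired.push(index + 1);
      sinceLastInjection = 0; // the body is fresh again
    }
  });
  return fired;
}

const examplePolicy: RefreshPolicySketch = { afterTokens: 50_000, via: 'tool-result' };
// Six turns of 20,000 tokens each: the threshold is crossed on turns 3 and 6.
const firedOn = refreshTurns(examplePolicy, [20_000, 20_000, 20_000, 20_000, 20_000, 20_000]);
```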

4. Auto-generated loader tools + centralized governance

defineSkill(...) auto-attaches a single read_skill tool to the agent's tool registry. The consumer never hand-writes it.

For shared skill catalogs across multiple agents, SkillRegistry is the centralized-governance answer:

import { Agent, SkillRegistry } from 'agentfootprint';

const registry = new SkillRegistry();
registry.register(billingSkill).register(refundSkill).register(complianceSkill);

const supportAgent = Agent.create({ provider }).skills(registry).build();
const escalationAgent = Agent.create({ provider }).skills(registry).build();

// Add a new skill — every consumer Agent picks it up at next build.
registry.register(newSkill);

agent.skills(registry) is the bulk-register companion to .skill(...). Use the registry pattern when 2+ agents share overlapping skills; use .skill(...) directly when one agent has its own catalog.

Why `read_skill` not `Skill`

Agent SDK calls its loader Skill. We mirror the verb-object convention of every other tool in agentfootprint (search, load_file, get_trace) and name it read_skill. Same protocol, clearer name.

End-to-end flow

User turn arrives
         │
         ▼
┌────────────────────────────────────────────────────────┐
│ registry.resolveForSkill(skill, provider, model)       │
│   surfaceMode 'system-prompt' | 'both'                 │
│     → body lands in the system slot next iteration     │
│   surfaceMode 'tool-only'                              │
│     → body suppressed from system; tool result IS body │
└────────────────────────────────────────────────────────┘
         │
         ▼
┌────────────────────────────────────────────────────────┐
│ registry.toTools() → { listSkills, readSkill }         │
│   list_skills + read_skill                             │
│   these are advertised in `tools` on every turn        │
└────────────────────────────────────────────────────────┘
         │
         ▼
  LLM turn 1 → recognizes match → tool_use: read_skill({ id })
         │
         ▼
┌────────────────────────────────────────────────────────┐
│ read_skill handler (buildReadSkillTool)                │
│   registry.get(id) → confirmation OR body (per mode)   │
│   tool-only/both → returns body (recency-first)        │
│   system-prompt/auto → "Skill '{id}' activated…"       │
│   id appended to scope.activatedInjectionIds           │
└────────────────────────────────────────────────────────┘
         │
         ▼
  LLM turn 2 → follows the skill body, calls domain tools
         │
         ▼
  [ tool calls + results flow as normal ]
         │
         ▼
┌────────────────────────────────────────────────────────┐
│ on-tool-return / rule injections fire per result       │
│   • tool X fails → inject "when X fails…" snippet      │
│   • trigger match → inject relevant skill hints        │
│   → same tool-result payload, same recency window      │
└────────────────────────────────────────────────────────┘
         │
         ▼
  [ reserved: after N tokens, refreshPolicy re-injects ]
  [ the body into a fresh tool_result breadcrumb        ]

No primitive is invented. Every step uses a field the Messages API has always had. The library's work is packaging — putting the right descriptions in the right surface for the right provider, with recency guarantees where provider training can't carry them.
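The read_skill box in the flow above can be sketched as follows, working on already-resolved modes ('auto' is assumed to have been resolved to one of the three concrete modes beforehand). The function and type names are hypothetical; only the per-mode behavior and the `activatedInjectionIds` field come from the diagram.

```typescript
type ResolvedMode = 'system-prompt' | 'tool-only' | 'both';

interface SkillDef {
  id: string;
  body: string;
  mode: ResolvedMode;
}

interface Scope {
  activatedInjectionIds: string[];
}

// Handler for the loader tool: records the activation, then returns either
// the body (recency-first) or a short confirmation, depending on the mode.
function readSkillSketch(id: string, catalog: SkillDef[], scope: Scope): string {
  const skill = catalog.filter((s) => s.id === id)[0];
  if (skill === undefined) return `Unknown skill '${id}'.`;
  if (scope.activatedInjectionIds.indexOf(id) === -1) {
    scope.activatedInjectionIds.push(id);
  }
  return skill.mode === 'system-prompt' ? `Skill '${id}' activated.` : skill.body;
}

// Bodies that land in the system slot on the next iteration:
// activated skills whose mode is not 'tool-only'.
function systemSlotBodies(catalog: SkillDef[], scope: Scope): string[] {
  return catalog
    .filter((s) => scope.activatedInjectionIds.indexOf(s.id) !== -1)
    .filter((s) => s.mode !== 'tool-only')
    .map((s) => s.body);
}

const exampleSkills: SkillDef[] = [
  { id: 'billing', body: 'Confirm the order id, then refund.', mode: 'tool-only' },
  { id: 'tone', body: 'Keep replies short.', mode: 'system-prompt' },
  { id: 'compliance', body: 'Never share card numbers.', mode: 'both' },
];
const exampleScope: Scope = { activatedInjectionIds: [] };
```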

Routing between skills — the skill graph

A single skill is one playbook. Real agents have several — a front desk that hands off to billing or to tech support — and you want the handoffs to be declared and drawable, not buried in prompt prose. skillGraph() is that: a fluent builder where .entry() marks where a turn starts and .route(from, to, { when }) declares a deterministic handoff. The SAME built object feeds the agent (Agent.create()....skillGraph(graph)) and draws as a graph — declared is drawn.

The picture below is built from the exact skillGraph() builder shown above it — its boxes and edges are a build product of these .entry() / .route() calls, not a hand-drawn diagram. Click a skill to see its playbook and the tools it unlocks.

[Interactive figure: the skill graph rendered from the skillGraph() builder]

Solid edges are the route() transitions you declared; dashed edges are skills the model can still reach on its own with read_skill. The graph is also the single source of two runtime guarantees: graph.nextSkill(ctx) advances the cursor exactly as the drawn routes say, and graph.reachableSkills(ctx) is the gate that stops the model from read_skill-jumping outside the graph. One object — the agent runs it, the docs draw it.
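A minimal stand-in for the builder, to show how one object can serve both the routing cursor and the reachability gate. `MiniSkillGraph` is a hypothetical class written for this page; the method names follow the prose above, but the context shape and the exact semantics (first matching route wins, reachable means the current skill plus its declared targets) are assumptions.

```typescript
interface GraphCtx {
  current?: string;   // skill the cursor is on; undefined at the start of a turn
  userMessage: string;
}

interface RouteDef {
  from: string;
  to: string;
  when: (ctx: GraphCtx) => boolean;
}

class MiniSkillGraph {
  private entryId: string | undefined;
  private routes: RouteDef[] = [];

  entry(id: string): this {
    this.entryId = id;
    return this;
  }

  route(from: string, to: string, opts: { when: (ctx: GraphCtx) => boolean }): this {
    this.routes.push({ from, to, when: opts.when });
    return this;
  }

  private position(ctx: GraphCtx): string {
    const here = ctx.current !== undefined ? ctx.current : this.entryId;
    if (here === undefined) throw new Error('No entry skill declared.');
    return here;
  }

  // Follows the first declared route whose predicate holds; otherwise stays put.
  nextSkill(ctx: GraphCtx): string {
    const here = this.position(ctx);
    for (const r of this.routes) {
      if (r.from === here && r.when(ctx)) return r.to;
    }
    return here;
  }

  // The gate: the current skill plus every skill it has a declared route to.
  reachableSkills(ctx: GraphCtx): string[] {
    const here = this.position(ctx);
    const out = [here];
    for (const r of this.routes) {
      if (r.from === here && out.indexOf(r.to) === -1) out.push(r.to);
    }
    return out;
  }
}

const supportGraph = new MiniSkillGraph()
  .entry('front-desk')
  .route('front-desk', 'billing', { when: (ctx) => /refund|charge/i.test(ctx.userMessage) })
  .route('front-desk', 'tech-support', { when: (ctx) => /error|crash/i.test(ctx.userMessage) });
```

Because the routes are plain data on the built object, the same list that drives `nextSkill` can be handed to a renderer to draw the edges.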

The one-line takeaway

Skills are context engineering for instructions, abstracted so you don't do it by hand and get it wrong. The protocol layer stays the Messages API's existing system + tools + tool_result fields — nothing new there. The framework layer's job is deciding where each instruction lives on every turn, per provider, per context length, so the model actually follows it. The Agent SDK's Claude-first answer is right for Claude; agentfootprint's surfaceMode: 'auto' + on-tool-return delivery is right for every provider, mock() included.

When this understanding matters

If you're:

Building a Skills-style system without Claude in the stack — the Agent SDK pattern degrades. Use surfaceMode: 'tool-only' and put skill content in tool results, not the system prompt.
Writing evals on mock() that should predict production — surfaceMode: 'auto' resolves to 'tool-only' on mock(), matching the behavior you'll see on any provider whose training you can't bank on. If you'd coded descriptions into the system prompt yourself, mock() would trivially follow them and production would drift.
Running agents for 6+ hours with large contexts — even on Claude, system-prompt anchoring decays past ~100K dense tokens. This is why refreshPolicy exists (its runtime hook is still reserved). Design your Skill bodies to be re-surfaceable via tool result, not just discoverable via system prompt.
Wondering why your framework's "Skills" feature doesn't compose with your instruction system — because they're the same thing. Skills are Injections (the same primitive defineInstruction builds) with a discovery layer bolted on. If your framework treats them as separate primitives, you're paying for parallel abstractions.
