
Skills, explained

Skills are context engineering for instructions — abstracted so you don’t do it by hand and get it wrong.

That’s the whole thing. The rest of this doc is why that’s the right frame, what “by hand and wrong” looks like, and how agentfootprint’s Skills machinery guarantees the context engineering stays correct across providers and context lengths.

“Context engineering” is the named practice in the modern agent stack: deciding what goes into the LLM’s context window on each turn. What lives in the system prompt. What lives in tool descriptions. What’s retrieved and injected. What’s pruned. Whether a piece of content sits ahead of the conversation or inside the recency window. It’s one of the handful of disciplines that separates a working agent from a broken one.

“For instructions” narrows it to one sub-domain. There are four kinds of content you engineer into a turn:

| Kind of content | The named abstraction | Example |
| --- | --- | --- |
| Instructions — how to do something | Skills (+ AgentInstruction) | “When investigating port errors, fetch metrics first.” |
| Data — what the world looks like | RAG, retrieval | “Here are the 5 most-similar past tickets.” |
| History — what has happened so far | Memory | “The user said earlier that their name is Alice.” |
| Capabilities — what can be done | Tools | “You may call getMetrics(interface).” |

Skills are the context-engineering abstraction for the instructions row. Memory (which we shipped in 1.14–1.15) is the abstraction for history. Tools is the abstraction for capabilities. RAG is the abstraction for data. Each row has its own answer because each row has a different shape of correctness problem.

“Abstracted so you don’t do it by hand and get it wrong” is the value proposition in one phrase. Doing this by hand is not conceptually hard — it’s just unusually error-prone because the right answer changes with conditions you don’t control:

  • Put instructions in the system prompt and they drift out of attention on non-Claude providers.
  • Put them all in tool results and you need a registry to dispatch them.
  • Put every instruction everywhere and you pay the token cost per turn forever β€” even when none apply.
  • Put them too early in the conversation and they’re stale by turn 40.
  • Put them too late and the model hasn’t seen them when it needs them.

Getting this right is non-obvious, provider-specific, and context-length-dependent. Skills are the abstraction that makes it correct by construction. An app author shouldn’t need to know which generation of which provider has strong system-prompt adherence, or which tool-result positions survive long-context attention decay. A library can know that. Skills is where we encode it.

The first question most teams ask about Skills is reasonable and almost always wrong:

“Is Skills a new LLM protocol we need to support?”

No. Skills is not a protocol. Skills is not a new field on the Messages API. Skills is not a model-level capability Anthropic trained into Claude 4. At the protocol layer, Skills is a folder convention plus a loading discipline, riding entirely on top of fields that have been there since the Messages API shipped: system, tools, tool_use, tool_result. Everything interesting is happening one layer up — in how a framework decides what to put where.

Context-engineering-for-instructions is what that “one layer up” is doing. That’s the whole game.

This doc exists because when that framing is missing, people reinvent Skills badly. They treat it as a protocol feature (it isn’t). They build it in parallel with their existing instruction system (it’s the same thing). They build it Claude-first and ship to OpenAI (the context engineering silently drifts). Understanding the three-stage anatomy and the cross-provider-correctness tradeoff is the difference between “we built a Skills system” and “we packaged Claude Agent SDK’s wrapper with the quiet bug that broke on our OpenAI deployment.”

Before the anatomy, clear up a frequent conflation:

| Thing | Shipped by | What it is | Delivery mechanism |
| --- | --- | --- | --- |
| Skills (Claude Agent SDK) | Anthropic | A folder convention — SKILL.md + YAML frontmatter (name, description) loaded progressively on demand. | Descriptions advertised upfront; body fetched via tool call, returned as tool result. |
| Steering Docs (AWS Strands) | AWS | Always-on behavioral priors attached to the agent. | Static injection into the system prompt. |
| AgentInstruction (agentfootprint) | agentfootprint | A conditional rule: {activeWhen, prompt?, tools?, onToolResult?}. Per-turn predicate decides inclusion. | Can inject into all three positions: system prompt, tools list, tool-result stream. |

These are not equivalent primitives. Skills is a discovery-and-dispatch pattern. Steering Docs are always-on context. AgentInstruction is a predicate-gated rule.

A Skill, in our library, is composed from AgentInstructions. It is not a peer. That hierarchy keeps the primitive count honest.
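That composition can be sketched in types. This is a hedged sketch, not agentfootprint’s actual type definitions: the AgentInstruction field names ({activeWhen, prompt?, tools?, onToolResult?}) and the Skill fields (id, description, body) come from this doc; everything else is illustrative.

```ts
// Hypothetical shapes, sketching the hierarchy rather than the library's real types.
type ToolDef = { name: string; description: string };

// The base primitive: a predicate-gated rule (field names quoted from this doc).
type AgentInstruction = {
  activeWhen?: (turn: { userText: string }) => boolean;
  prompt?: string;                                  // system-prompt injection
  tools?: ToolDef[];                                // tools-list injection
  onToolResult?: (result: string) => string | null; // tool-result injection
};

// A Skill is an AgentInstruction plus a discovery layer, not a peer primitive.
type Skill = AgentInstruction & {
  id: string;
  description: string; // advertised upfront for discovery
  body: string;        // delivered on activation
};

const billing: Skill = {
  id: 'billing',
  description: 'Read for refund / charge / billing questions.',
  body: 'When handling billing: confirm the order id, then call process_refund.',
};
```

The intersection type is the point: anything that consumes AgentInstructions can consume Skills unchanged.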

The authoring surface is defineSkill({ id, description, body, tools }). The library auto-attaches a read_skill tool to your agent’s tool registry; the LLM activates the skill by calling it. When activated, the skill’s body lands in the system slot and the skill’s tools become available for the rest of the turn:

examples/context-engineering/02-skill.ts (region: define-skill)

```ts
const billingSkill = defineSkill({
  id: 'billing',
  description: 'Read for refund / charge / billing questions. Unlocks process_refund.',
  body: 'When handling billing: confirm the order id, then call process_refund. Always state the amount + payment method in the final reply.',
  tools: [refundTool],
});
```

That’s the foundation. The rest of this essay explains why this design is right and what we’re building on top of it (surfaceMode, refreshPolicy, SkillRegistry — all v2.4 Phase 4).

The actual anatomy — three stages over the Messages API


A Skill call, end-to-end, decomposes into three stages. Every stage maps to an existing Messages API primitive.

Stage A — Discovery: “what Skills exist?”


The model needs to know a Skill is available before it can invoke one. The Agent SDK’s choice: inject an <available_skills> system-reminder block into the system prompt on every turn:

```text
You have these skills available:
- pptx: Create PowerPoint presentations from a template.
- xlsx: Parse and manipulate Excel spreadsheets.
- pdf: PDF manipulation toolkit.
- ...
```

This is just text in the system field. No new API.
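Stage A is plain string-building. A minimal sketch (the function name and skill shape are hypothetical, not the SDK’s API):

```ts
type SkillMeta = { id: string; description: string };

// Render the <available_skills>-style block that gets appended to the `system` field.
function renderAvailableSkills(skills: SkillMeta[]): string {
  const lines = skills.map((s) => `- ${s.id}: ${s.description}`);
  return ['You have these skills available:', ...lines].join('\n');
}

const block = renderAvailableSkills([
  { id: 'pptx', description: 'Create PowerPoint presentations from a template.' },
  { id: 'xlsx', description: 'Parse and manipulate Excel spreadsheets.' },
]);
// `block` is ordinary text destined for the system field; no new API surface.
```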

Stage B — Invocation: “I want to use the pptx skill”


The model emits a regular tool call to a built-in Skill-loader tool:

```json
{ "type": "tool_use", "name": "Skill", "input": { "skill": "pptx" } }
```

Identical protocol to any other tool use. The SDK registers the loader as one more entry in the tools field, alongside whatever else the app registered.
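A sketch of what that single loader entry might look like in the tools field (the tool name Skill and the input key skill come from the example above; the description and schema wording are assumptions):

```ts
// One uniform loader entry in `tools`; adding skills never changes this surface.
const skillLoaderTool = {
  name: 'Skill',
  description: 'Load a skill by name and follow its instructions.',
  input_schema: {
    type: 'object',
    properties: { skill: { type: 'string', description: 'Skill id to load' } },
    required: ['skill'],
  },
} as const;
```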

Stage C — Execution: “here’s the skill body”


The SDK looks up the pptx folder, reads SKILL.md, and returns its body as the tool_result:

```text
You are now following skill: pptx (v2.1).
Purpose: Create PowerPoint presentations from a template.
Procedure:
1. Read the template from templates/default.pptx
2. Extract user intent from the current turn
3. Render slides by applying substitutions
...
```

Tool result lands in the message stream. The model reads it in the freshest possible position and follows the instructions.
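Stage C is a file read plus a string wrap. A sketch under stated assumptions: the frontmatter-between-fences layout follows the folder convention described above, but the parsing code, the version field, and the function name are illustrative, not the SDK’s implementation.

```ts
// Minimal SKILL.md parser: YAML-ish frontmatter between --- fences, then the body.
function renderToolResult(skillMd: string, id: string): string {
  const m = /^---\n([\s\S]*?)\n---\n([\s\S]*)$/.exec(skillMd);
  if (!m) throw new Error(`malformed SKILL.md for ${id}`);
  const [, frontmatter, body] = m;
  const version = /version:\s*(\S+)/.exec(frontmatter)?.[1] ?? 'unversioned';
  return `You are now following skill: ${id} (${version}).\n${body.trim()}`;
}

const md = `---
name: pptx
description: Create PowerPoint presentations from a template.
version: v2.1
---
Procedure:
1. Read the template from templates/default.pptx`;

// renderToolResult(md, 'pptx') produces the "You are now following skill: ..." preamble
// followed by the body, delivered verbatim as the tool_result.
```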

That’s it. Three stages. Three primitives. No new anything.

The intuition most people arrive at first (including me):

“Putting skill descriptions in the system prompt gives them a worse recency profile than putting them via the Tools API. We can do better by routing discovery through a tool.”

This is wrong, and the correction is worth pinning down because it reshapes the whole design argument.

Tool descriptions are already upfront context. Every Messages API call includes the full tools field — every tool’s name, description, and input schema — alongside the system prompt. Both sit ahead of the conversation. Both arrive with every request. Their recency profile is identical from the model’s positional perspective.

So routing skill discovery through a tool description vs through the system prompt makes no architectural difference to recency. In both flows:

  1. Skill descriptions are upfront on every request.
  2. The model picks one based on the user’s turn.
  3. Only the chosen skill’s body travels via tool_result (the actually-fresh position).

Why did Anthropic pick the system prompt, then? Not for recency — for organization:

  • Namespacing. Skill entries don’t pollute the ordinary tools list in the UI or the model’s mental model. The user’s Read tool and the Skill loader shouldn’t be peers.
  • List bloat. Apps might expose 50 skills; stuffing them into tools crowds every turn’s payload. Putting them in a system-prompt block renders them as documentation instead.
  • Uniform loader signature. One Skill(name) tool handles any skill; adding a new skill doesn’t change the advertised tool surface.

These are ergonomic wins. They are not architectural ones. The recency profile is the same either way — assuming the provider respects system-prompt position.

That assumption is where cross-provider correctness breaks down.

The real problem: cross-provider adherence is a training thing


The protocol says nothing about recency bias. Training does.

  • Claude β‰₯ 3.5 is trained hard on system-prompt adherence. Instructions placed there are followed even in long contexts.
  • Claude pre-3.5, GPT, Llama, Mistral, smaller models have well-documented recency bias. Under long contexts (or adversarial tool-result interleaving), instructions placed in the system prompt get out-followed by instructions placed near the conversation’s tail.

So the Skills design that works brilliantly on Claude — descriptions in the system prompt, bodies via tool result — degrades silently on other providers. Your OpenAI-backed eval passes on turn 1 and drifts by turn 40. Your local-model deployment never quite gets there.

This is what makes Skills a correctness concern for a framework — not just an ergonomics one. An application author can’t reasonably be expected to know which generation of which provider was trained to honor system-prompt anchoring vs which one leans on recency. A library can.

Given the above, our Skills implementation makes four choices. All of them fit in ~200 LOC of sugar on top of primitives we already had (AgentInstruction, onToolResult, tool registry).

1. surfaceMode: 'auto' — the library picks per provider

```ts
import { resolveSurfaceMode } from 'agentfootprint';

// Pure function — no side effects. Inspect what 'auto' will resolve to.
resolveSurfaceMode('anthropic', 'claude-sonnet-4-5-20250929'); // → 'both'
resolveSurfaceMode('anthropic', 'claude-3-haiku-20240307');    // → 'tool-only'
resolveSurfaceMode('openai', 'gpt-4o');                        // → 'tool-only'
resolveSurfaceMode('mock');                                    // → 'tool-only'
```

Four modes: 'system-prompt', 'tool-only', 'both', 'auto'. On Claude ≥ 3.5 we use 'both' (cheap to cache, high adherence). Everywhere else we fall back to 'tool-only' — recency-first delivery via tool result, which is a protocol-level guarantee and doesn’t rely on training.

This is the difference between a Skills pattern that happens to work on Claude and one that is correct across every provider the library supports, mock() included (evals match production).
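The decision rule is small enough to sketch. This is not agentfootprint’s implementation (the model-name heuristic in particular is an assumption), but it reproduces the resolutions shown above:

```ts
type SurfaceMode = 'system-prompt' | 'tool-only' | 'both' | 'auto';

// Heuristic sketch: 'claude-3-<family>' ids predate 3.5; anything newer
// (claude-3-5-*, claude-sonnet-4-*, ...) is assumed trained for system-prompt adherence.
function resolveSurfaceModeSketch(provider: string, model?: string): SurfaceMode {
  if (provider !== 'anthropic' || !model) return 'tool-only';
  const pre35 = /^claude-3-(opus|sonnet|haiku)/.test(model);
  return pre35 ? 'tool-only' : 'both';
}

resolveSurfaceModeSketch('anthropic', 'claude-sonnet-4-5-20250929'); // → 'both'
resolveSurfaceModeSketch('anthropic', 'claude-3-haiku-20240307');    // → 'tool-only'
resolveSurfaceModeSketch('openai', 'gpt-4o');                        // → 'tool-only'
resolveSurfaceModeSketch('mock');                                    // → 'tool-only'
```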

2. AgentInstruction.onToolResult as the recency-first injection point


agentfootprint already shipped AgentInstruction with an onToolResult hook — the cross-provider-correct path. Skills ride on top of it:

  • A tool result lands in the message stream.
  • onToolResult rules evaluate on that result.
  • Matched rules inject text into the same tool-result payload the model is about to read.

That text lands in the freshest possible position by protocol, not by training. Skills inherit this for free because every Skill IS-A AgentInstruction and inherits onToolResult.
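Mechanically, the hook reduces to an append on the tool-result payload before the model reads it. A sketch under stated assumptions: the rule shape follows the onToolResult hook described above, but the runner function and example rule are hypothetical.

```ts
type Rule = {
  // Fires on a tool result; returns text to inject, or null to pass.
  onToolResult?: (toolName: string, result: string) => string | null;
};

// Run every rule against a fresh tool result and append matched injections.
// The appended text rides in the same payload, i.e. the freshest position.
function applyOnToolResult(rules: Rule[], toolName: string, result: string): string {
  const injected = rules
    .map((r) => r.onToolResult?.(toolName, result) ?? null)
    .filter((t): t is string => t !== null);
  return injected.length ? `${result}\n\n${injected.join('\n')}` : result;
}

const portRule: Rule = {
  onToolResult: (tool, res) =>
    tool === 'getMetrics' && res.includes('error')
      ? 'Reminder: when investigating port errors, fetch metrics first.'
      : null,
};
```

A non-matching result passes through untouched, so unmatched rules cost zero tokens.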

3. refreshPolicy — re-surface instructions past a token threshold

```ts
defineSkill({
  id: 'critical-rule',
  description: 'Critical reasoning rule for long-context runs',
  body: '...',
  refreshPolicy: { afterTokens: 50_000, via: 'tool-result' },
});
```
Belt-and-suspenders: even if system-prompt anchoring was honored at turn 1, by turn 40 on a non-Claude provider it’s drifted out of effective attention. Re-inject the registry as a fresh tool result past a token threshold. The API surface is shipped today; the runtime hook lands in v2.5 as part of the long-context attention work — specifying refreshPolicy today is non-breaking.
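Since the runtime hook hasn’t landed yet, this is only a sketch of the described semantics, not shipped code; the class and method names are hypothetical.

```ts
type RefreshPolicy = { afterTokens: number; via: 'tool-result' };

// Track tokens since the skill content was last surfaced; past the threshold,
// signal a re-injection as a fresh tool-result breadcrumb and reset the counter.
class RefreshTracker {
  private sinceInjection = 0;
  constructor(private policy: RefreshPolicy) {}

  observe(turnTokens: number): boolean {
    this.sinceInjection += turnTokens;
    if (this.sinceInjection >= this.policy.afterTokens) {
      this.sinceInjection = 0;
      return true; // caller re-delivers the skill body via tool_result
    }
    return false;
  }
}

const tracker = new RefreshTracker({ afterTokens: 50_000, via: 'tool-result' });
tracker.observe(30_000); // → false: under threshold
tracker.observe(25_000); // → true: 55k ≥ 50k, re-inject and reset
```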

4. Auto-generated loader tools + centralized governance


defineSkill(...) auto-attaches a single read_skill tool to the agent’s tool registry. The consumer never hand-writes it.

For shared skill catalogs across multiple agents, SkillRegistry is the centralized-governance answer:

```ts
import { SkillRegistry } from 'agentfootprint';

const registry = new SkillRegistry();
registry.register(billingSkill).register(refundSkill).register(complianceSkill);

const supportAgent = Agent.create({ provider }).skills(registry).build();
const escalationAgent = Agent.create({ provider }).skills(registry).build();

// Add a new skill — every consumer Agent picks it up at next build.
registry.register(newSkill);
```

agent.skills(registry) is the bulk-register companion to .skill(t). Use the registry pattern when 2+ agents share overlapping skills; use .skill(...) directly when one agent has its own catalog.
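Internally, a registry needs little more than a map plus two renderers. A sketch: the method names toPromptFragment, toTools, and getById appear in the flow diagram below, but these bodies are illustrative, not agentfootprint’s implementation.

```ts
type RegSkill = { id: string; description: string; body: string };

// Sketch of a shared catalog: register, look up, and render the discovery surfaces.
class SkillRegistrySketch {
  private skills = new Map<string, RegSkill>();

  register(s: RegSkill): this {
    this.skills.set(s.id, s);
    return this; // chainable, as in the usage above
  }

  getById(id: string): RegSkill | undefined {
    return this.skills.get(id);
  }

  // Stage A surface: only emitted when surfaceMode is 'system-prompt' | 'both'.
  toPromptFragment(): string {
    const lines = [...this.skills.values()].map((s) => `- ${s.id}: ${s.description}`);
    return ['You have these skills available:', ...lines].join('\n');
  }

  // Stage B surface: the loader tools advertised in `tools` on every turn.
  toTools(): { name: string }[] {
    return [{ name: 'list_skills' }, { name: 'read_skill' }];
  }
}
```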

```text
User turn arrives
│
▼
┌────────────────────────────────────────────────────────┐
│ registry.toPromptFragment(provider)                    │
│   surfaceMode 'system-prompt' | 'both'                 │
│     → embed {id, title, description} list in system    │
│   surfaceMode 'tool-only'                              │
│     → no system-prompt block                           │
└────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────┐
│ registry.toTools()                                     │
│   always: list_skills, read_skill                      │
│   these are advertised in `tools` on every turn        │
└────────────────────────────────────────────────────────┘
│
▼
LLM turn 1 → recognizes match → tool_use: read_skill({id})
│
▼
┌────────────────────────────────────────────────────────┐
│ read_skill handler                                     │
│   registry.getById(id) → renderSkillBody()             │
│   → "You are now following skill: {id} ({version})."   │
│   delivered as tool_result (recency-first by protocol) │
└────────────────────────────────────────────────────────┘
│
▼
LLM turn 2 → follows the skill body, calls domain tools
│
▼
[ tool calls + results flow as normal ]
│
▼
┌────────────────────────────────────────────────────────┐
│ AgentInstruction.onToolResult fires per skill rule     │
│   • tool X fails → inject "when X fails…" snippet      │
│   • trigger match → inject relevant skill hints        │
│   → same tool-result payload, same recency window      │
└────────────────────────────────────────────────────────┘
│
▼
[ phase-2: after N tokens, refreshPolicy re-injects ]
[ the registry into a fresh tool_result breadcrumb  ]
```

No primitive is invented. Every step uses a field the Messages API has always had. The library’s work is packaging β€” putting the right descriptions in the right surface for the right provider, with recency guarantees where provider training can’t carry them.

Skills are context engineering for instructions, abstracted so you don’t do it by hand and get it wrong. The protocol layer stays the Messages API’s existing system + tools + tool_result fields — nothing new there. The framework layer’s job is deciding where each instruction lives on every turn, per provider, per context length, so the model actually follows it. The Agent SDK’s Claude-first answer is right for Claude; agentfootprint’s surfaceMode: 'auto' + onToolResult delivery is right for every provider, mock() included.

If you’re:

  • Building a Skills-style system without Claude in the stack β€” the Agent SDK pattern degrades. Use surfaceMode: 'tool-only' and put skill content in tool results, not the system prompt.
  • Writing evals on mock() that should predict production β€” surfaceMode: 'auto' resolves to 'tool-only' on mock(), matching the behavior you’ll see on any provider whose training you can’t bank on. If you’d coded descriptions into the system prompt yourself, mock() would trivially follow them and production would drift.
  • Running agents for 6+ hours with large contexts β€” even on Claude, system-prompt anchoring decays past ~100K dense tokens. This is why refreshPolicy exists (Phase 2). Design your Skill bodies to be re-surfaceable via tool result, not just discoverable via system prompt.
  • Wondering why your framework’s β€œSkills” feature doesn’t compose with your instruction system β€” because they’re the same thing. Skills are AgentInstructions with a discovery layer bolted on. If your framework treats them as separate primitives, you’re paying for parallel abstractions.