# Skills, explained
## In one line
Skills are context engineering for instructions, abstracted so you don't do it by hand and get it wrong.
That's the whole thing. The rest of this doc is why that's the right frame, what "by hand and wrong" looks like, and how agentfootprint's Skills machinery guarantees the context engineering stays correct across providers and context lengths.
## Unpacking it
"Context engineering" is the named practice in the modern agent stack: deciding what goes into the LLM's context window on each turn. What lives in the system prompt. What lives in tool descriptions. What's retrieved and injected. What's pruned. Whether a piece of content sits ahead of the conversation or inside the recency window. It's one of the handful of disciplines that separates a working agent from a broken one.
"For instructions" narrows it to one sub-domain. There are four kinds of content you engineer into a turn:
| Kind of content | The named abstraction | Example |
|---|---|---|
| Instructions - how to do something | Skills (+ AgentInstruction) | "When investigating port errors, fetch metrics first." |
| Data - what the world looks like | RAG, retrieval | "Here are the 5 most-similar past tickets." |
| History - what has happened so far | Memory | "The user said earlier that their name is Alice." |
| Capabilities - what can be done | Tools | "You may call getMetrics(interface)." |
Skills are the context-engineering abstraction for the instructions row. Memory (which we shipped in 1.14-1.15) is the abstraction for history. Tools is the abstraction for capabilities. RAG is the abstraction for data. Each row has its own answer because each row has a different shape of correctness problem.
"Abstracted so you don't do it by hand and get it wrong" is the value proposition in one phrase. Doing this by hand is not conceptually hard; it's just unusually error-prone, because the right answer changes with conditions you don't control:
- Put instructions in the system prompt and they drift out of attention on non-Claude providers.
- Put them all in tool results and you need a registry to dispatch them.
- Put every instruction everywhere and you pay the token cost per turn forever β even when none apply.
- Put them too early in the conversation and they're stale by turn 40.
- Put them too late and the model hasn't seen them when it needs them.
Getting this right is non-obvious, provider-specific, and context-length-dependent. Skills are the abstraction that makes it correct by construction. An app author shouldn't need to know which generation of which provider has strong system-prompt adherence, or which tool-result positions survive long-context attention decay. A library can know that. Skills is where we encode it.
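The bullet list above is, in effect, a placement decision an app author would otherwise hand-roll. A minimal sketch of that decision, with hypothetical names (`Placement`, `choosePlacement`) and an illustrative 100K threshold - none of this is agentfootprint API:

```typescript
// Hypothetical sketch, not library API: the placement decision an app author
// would otherwise maintain by hand for every provider and run length.
type Placement = 'system-prompt' | 'tool-result' | 'both';

function choosePlacement(opts: {
  systemPromptAdherent: boolean; // does this model's training honor system-prompt anchoring?
  contextTokens: number;         // context accumulated so far in this run
}): Placement {
  // No trained adherence: only the tool-result position is reliable.
  if (!opts.systemPromptAdherent) return 'tool-result';
  // Even adherent models drift in very long contexts; re-deliver via tool result too.
  if (opts.contextTokens > 100_000) return 'both';
  return 'system-prompt';
}
```

The point is not this particular function; it is that every app would grow a subtly different, subtly wrong version of it.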
## Why this framing matters
The first question most teams ask about Skills is reasonable and almost always wrong:
> "Is Skills a new LLM protocol we need to support?"
No. Skills is not a protocol. Skills is not a new field on the Messages API. Skills is not a model-level capability Anthropic trained into Claude 4. At the protocol layer, Skills is a folder convention plus a loading discipline, riding entirely on top of fields that have been there since the Messages API shipped: `system`, `tools`, `tool_use`, `tool_result`. Everything interesting happens one layer up, in how a framework decides what to put where.
Context-engineering-for-instructions is what that "one layer up" is doing. That's the whole game.
This doc exists because when that framing is missing, people reinvent Skills badly. They treat it as a protocol feature (it isn't). They parallel it with their existing instruction system (it's the same thing). They build it Claude-first and ship to OpenAI (the context engineering silently drifts). Understanding the three-stage anatomy and the cross-provider-correctness tradeoff is the difference between "we built a Skills system" and "we packaged Claude Agent SDK's wrapper with the quiet bug that broke on our OpenAI deployment."
## Three things that look the same, aren't
Before the anatomy, clear up a frequent conflation:
| Thing | Shipped by | What it is | Delivery mechanism |
|---|---|---|---|
| Skills (Claude Agent SDK) | Anthropic | A folder convention: SKILL.md + YAML frontmatter (name, description), loaded progressively on demand. | Descriptions advertised upfront; body fetched via tool call, returned as tool result. |
| Steering Docs (AWS Strands) | AWS | Always-on behavioral priors attached to the agent. | Static injection into the system prompt. |
| AgentInstruction (agentfootprint) | agentfootprint | A conditional rule: {activeWhen, prompt?, tools?, onToolResult?}. Per-turn predicate decides inclusion. | Can inject into all three positions: system prompt, tools list, tool-result stream. |
These are not equivalent primitives. Skills is a discovery-and-dispatch pattern. Steering Docs are always-on context. AgentInstruction is a predicate-gated rule.
A Skill, in our library, is composed from AgentInstructions. It is not a peer. That hierarchy keeps the primitive count honest.
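A minimal shape sketch of that hierarchy, using the `{activeWhen, prompt?, tools?, onToolResult?}` fields from the table above; everything else (type names, the regex trigger) is illustrative, not the shipped types:

```typescript
// Assumed shapes for illustration only.
type ToolResult = { toolName: string; content: string };

interface AgentInstruction {
  activeWhen: (turn: string) => boolean;
  prompt?: string;
  tools?: string[];
  onToolResult?: (result: ToolResult) => string | undefined;
}

// A Skill is an AgentInstruction plus a discovery layer: an id and description
// advertised upfront, and a body delivered on activation.
interface Skill extends AgentInstruction {
  id: string;
  description: string;
  body: string;
}

const billing: Skill = {
  id: 'billing',
  description: 'Read for refund / charge / billing questions.',
  body: 'When handling billing: confirm the order id first.',
  activeWhen: (turn) => /refund|charge|billing/i.test(turn),
};
```

The `extends` relationship is the whole claim: a Skill is an AgentInstruction with extra discovery metadata, not a sibling primitive.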
## What ships today (the runnable surface)
`defineSkill({ id, description, body, tools })`. The library auto-attaches a `read_skill` tool to your agent's tool registry; the LLM activates the skill by calling it. When activated, the skill's body lands in the system slot and the skill's tools become available for the rest of the turn:
```ts
const billingSkill = defineSkill({
  id: 'billing',
  description: 'Read for refund / charge / billing questions. Unlocks process_refund.',
  body: 'When handling billing: confirm the order id, then call process_refund. Always state the amount + payment method in the final reply.',
  tools: [refundTool],
});
```

That's the foundation. The rest of this essay explains why this design, and what we're building on top of it (`surfaceMode`, `refreshPolicy`, `SkillRegistry` - all v2.4 Phase 4).
## The actual anatomy - three stages over Messages API
A Skill call, end-to-end, decomposes into three stages. Every stage maps to an existing Messages API primitive.
### Stage A - Discovery: "what Skills exist?"
The model needs to know a Skill is available before it can invoke one. The Agent SDK's choice: inject an `<available_skills>` system-reminder block into the system prompt on every turn:
```text
You have these skills available:
- pptx: Create PowerPoint presentations from a template.
- xlsx: Parse and manipulate Excel spreadsheets.
- pdf: PDF manipulation toolkit.
- ...
```

This is just text in the `system` field. No new API.
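Because the discovery block is plain text in `system`, rendering it is ordinary string assembly. A sketch with a made-up helper name:

```typescript
// Sketch only: the helper name renderAvailableSkills is illustrative.
const skills = [
  { id: 'pptx', description: 'Create PowerPoint presentations from a template.' },
  { id: 'xlsx', description: 'Parse and manipulate Excel spreadsheets.' },
];

function renderAvailableSkills(list: { id: string; description: string }[]): string {
  // One header line, then one "- id: description" line per skill.
  return [
    'You have these skills available:',
    ...list.map((s) => `- ${s.id}: ${s.description}`),
  ].join('\n');
}
```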
### Stage B - Invocation: "I want to use the pptx skill"
The model emits a regular tool call to a built-in Skill-loader tool:
```json
{ "type": "tool_use", "name": "Skill", "input": { "skill": "pptx" } }
```

Identical protocol to any other tool use. The SDK registered the loader tool as one line in the `tools` field, alongside whatever else the app registered.
### Stage C - Execution: "here's the skill body"
The SDK looks up the `pptx` folder, reads `SKILL.md`, and returns its body as the `tool_result`:
```text
You are now following skill: pptx (v2.1).
Purpose: Create PowerPoint presentations from a template.
Procedure:
  1. Read the template from templates/default.pptx
  2. Extract user intent from the current turn
  3. Render slides by applying substitutions
  ...
```

The tool result lands in the message stream. The model reads it in the freshest possible position and follows the instructions.
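The dispatch itself can be sketched as a lookup keyed by skill id (names and shapes assumed; this is not the SDK's internals):

```typescript
// Sketch: resolve a skill id to its body and hand it back as tool_result content.
const skillBodies = new Map<string, string>([
  ['pptx', 'Purpose: Create PowerPoint presentations from a template.'],
]);

function handleSkillToolUse(input: { skill: string }): { type: 'tool_result'; content: string } {
  const body = skillBodies.get(input.skill);
  return {
    type: 'tool_result',
    content: body
      ? `You are now following skill: ${input.skill}.\n${body}`
      : `Unknown skill: ${input.skill}`,
  };
}
```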
That's it. Three stages. Three primitives. No new anything.
## The recency question - what we cracked
The intuition most people arrive at first (including me):
> "Putting skill descriptions in the system prompt gives them a worse recency profile than putting them via the Tools API. We can do better by routing discovery through a tool."
This is wrong, and the correction is worth pinning down because it reshapes the whole design argument.
Tool descriptions are already upfront context. Every Messages API call includes the full `tools` field - every tool's name, description, and input schema - alongside the system prompt. Both sit ahead of the conversation. Both arrive with every request. Their recency profile is identical from the model's positional perspective.
So routing skill discovery through a tool description vs through the system prompt makes no architectural difference to recency. In both flows:
- Skill descriptions are upfront on every request.
- The model picks one based on the user's turn.
- Only the chosen skill's body travels via `tool_result` (the actually-fresh position).
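To make the positional claim concrete, here is the shape of one request (field names per the Messages API; values illustrative). Both discovery surfaces travel in the prefix, ahead of the conversation:

```typescript
// Both candidate discovery surfaces sit in the request prefix; neither is
// "fresher" than the other. Values are illustrative.
const request = {
  system: 'You have these skills available:\n- pptx: Create PowerPoint presentations from a template.',
  tools: [
    { name: 'Skill', description: 'Load a named skill.', input_schema: { type: 'object' } },
  ],
  messages: [{ role: 'user', content: 'Turn these notes into slides.' }],
};

// Only a chosen skill's body, returned later as a tool_result inside
// `messages`, occupies the actually-fresh position.
```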
Why did Anthropic pick the system prompt, then? Not for recency, but for organization:
- Namespacing. Skill entries don't pollute the ordinary tools list in the UI or the model's mental model. The user's `Read` tool and the Skill loader shouldn't be peers.
- List bloat. Apps might expose 50 skills; stuffing them into `tools` crowds every turn's payload. Putting them in a system-prompt block renders them as documentation instead.
- Uniform loader signature. One `Skill(name)` tool handles any skill; adding a new skill doesn't change the advertised tool surface.
These are ergonomic wins. They are not architectural ones. The recency profile is the same either way, assuming the provider respects system-prompt position.
That assumption is exactly where cross-provider correctness breaks down.
## The real problem: cross-provider adherence is a training thing
The protocol says nothing about recency bias. Training does.
- Claude ≥ 3.5 is trained hard on system-prompt adherence. Instructions placed there are followed even in long contexts.
- Claude pre-3.5, GPT, Llama, Mistral, and smaller models have well-documented recency bias. Under long contexts (or adversarial tool-result interleaving), instructions placed in the system prompt get out-followed by instructions placed near the conversation's tail.
So the Skills design that works brilliantly on Claude - descriptions in the system prompt, bodies via tool result - degrades silently on other providers. Your OpenAI-backed eval passes on turn 1 and drifts by turn 40. Your local-model deployment never quite gets there.
This is what makes Skills a correctness concern for a framework, not just an ergonomics one. An application author can't reasonably be expected to know which generation of which provider was trained to honor system-prompt anchoring vs which one leans on recency. A library can.
## What agentfootprint does differently
Given the above, our Skills implementation makes four choices. All of them fit in ~200 LOC of sugar on top of primitives we already had (`AgentInstruction`, `onToolResult`, tool registry).
### 1. `surfaceMode: 'auto'` - the library picks per provider
```ts
import { resolveSurfaceMode } from 'agentfootprint';

// Pure function - no side effects. Inspect what 'auto' will resolve to.
resolveSurfaceMode('anthropic', 'claude-sonnet-4-5-20250929'); // → 'both'
resolveSurfaceMode('anthropic', 'claude-3-haiku-20240307');    // → 'tool-only'
resolveSurfaceMode('openai', 'gpt-4o');                        // → 'tool-only'
resolveSurfaceMode('mock');                                    // → 'tool-only'
```

Four modes: `'system-prompt'`, `'tool-only'`, `'both'`, `'auto'`. On Claude ≥ 3.5 we use `'both'` (cheap to cache, high adherence). Everywhere else we fall back to `'tool-only'`: recency-first delivery via tool result, which is a protocol-level guarantee and doesn't rely on training.
This is the difference between a Skills pattern that happens to work on Claude and one that is correct across every provider the library supports, mock() included (evals match production).
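For intuition, a simplified re-implementation of the resolution rule - not the shipped `resolveSurfaceMode`, which presumably carries a fuller model table; this just encodes the mapping shown above:

```typescript
// Illustrative only; not the shipped implementation.
type SurfaceMode = 'system-prompt' | 'tool-only' | 'both';

function resolveSurfaceModeSketch(provider: string, model?: string): SurfaceMode {
  // Claude >= 3.5 has trained system-prompt adherence, so 'both' is safe and cache-cheap.
  const isPre35Claude = model !== undefined && /^claude-3-(haiku|sonnet|opus)-/.test(model);
  if (provider === 'anthropic' && model !== undefined && !isPre35Claude) return 'both';
  // Everywhere else, tool-result delivery is the protocol-level guarantee.
  return 'tool-only';
}
```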
### 2. `AgentInstruction.onToolResult` as the recency-first injection point
agentfootprint already shipped `AgentInstruction` with an `onToolResult` hook - the cross-provider-correct path. Skills ride on top of it:
- A tool result lands in the message stream.
- `onToolResult` rules evaluate on that result.
- Matched rules inject text into the same tool-result payload the model is about to read.
That text lands in the freshest possible position by protocol, not by training. Skills inherit this for free, because every Skill IS-A AgentInstruction and inherits `onToolResult`.
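Under assumed shapes (the real hook signature may differ), that path can be sketched as: matched rules append their snippet to the very payload the model is about to read.

```typescript
// Assumed shapes for illustration; not the shipped AgentInstruction types.
type ToolResult = { toolName: string; content: string };
type OnToolResultRule = {
  match: (result: ToolResult) => boolean;
  snippet: string;
};

function injectOnToolResult(result: ToolResult, rules: OnToolResultRule[]): string {
  const snippets = rules.filter((r) => r.match(result)).map((r) => r.snippet);
  // Matched snippets ride in the same tool-result payload: fresh by protocol.
  return [result.content, ...snippets].join('\n');
}

const rules: OnToolResultRule[] = [
  {
    match: (r) => r.content.includes('ERROR'),
    snippet: 'When a tool fails, retry once before reporting the failure.',
  },
];
```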
### 3. Refresh policy for long contexts
```ts
defineSkill({
  id: 'critical-rule',
  description: 'Critical reasoning rule for long-context runs',
  body: '...',
  refreshPolicy: { afterTokens: 50_000, via: 'tool-result' },
});
```

Belt-and-suspenders: even if system-prompt anchoring was honored at turn 1, by turn 40 on a non-Claude provider it has drifted out of effective attention. Re-inject the registry as a fresh tool result past a token threshold. The API surface is shipped today; the runtime hook lands in v2.5 as part of the long-context attention work, so specifying `refreshPolicy` today is non-breaking.
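The runtime check that hook implies is small; a sketch with shapes assumed from the API surface above (the v2.5 runtime may track token counts differently):

```typescript
// Sketch of the refresh check; shapes assumed from the declared API surface.
type RefreshPolicy = { afterTokens: number; via: 'tool-result' };

function needsRefresh(policy: RefreshPolicy, tokensSinceLastInjection: number): boolean {
  // Past the threshold, the skill body is re-delivered as a fresh tool result.
  return tokensSinceLastInjection >= policy.afterTokens;
}
```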
### 4. Auto-generated loader tools + centralized governance
`defineSkill(...)` auto-attaches a single `read_skill` tool to the agent's tool registry. The consumer never hand-writes it.
For shared skill catalogs across multiple agents, SkillRegistry is the centralized-governance answer:
```ts
import { SkillRegistry } from 'agentfootprint';

const registry = new SkillRegistry();
registry.register(billingSkill).register(refundSkill).register(complianceSkill);

const supportAgent = Agent.create({ provider }).skills(registry).build();
const escalationAgent = Agent.create({ provider }).skills(registry).build();

// Add a new skill - every consumer Agent picks it up at next build.
registry.register(newSkill);
```

`agent.skills(registry)` is the bulk-register companion to `.skill(t)`. Use the registry pattern when 2+ agents share overlapping skills; use `.skill(...)` directly when one agent has its own catalog.
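For intuition, a stripped-down registry with the same chainable surface; the shipped `SkillRegistry` carries more (provider-aware rendering, loader-tool generation):

```typescript
// Stripped-down sketch of a chainable skill registry; illustrative only.
type SkillEntry = { id: string; description: string; body: string };

class SkillRegistrySketch {
  private skills = new Map<string, SkillEntry>();

  register(skill: SkillEntry): this {
    this.skills.set(skill.id, skill); // last registration for an id wins
    return this;
  }

  getById(id: string): SkillEntry | undefined {
    return this.skills.get(id);
  }

  list(): SkillEntry[] {
    return [...this.skills.values()];
  }
}
```

Returning `this` from `register` is what makes the chained `registry.register(a).register(b)` style above possible.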
## End-to-end flow
```text
User turn arrives
        │
        ▼
┌──────────────────────────────────────────────────────┐
│ registry.toPromptFragment(provider)                  │
│  surfaceMode 'system-prompt' | 'both'                │
│   → embed {id, title, description} list in system    │
│  surfaceMode 'tool-only'                             │
│   → no system-prompt block                           │
└──────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────┐
│ registry.toTools()                                   │
│  always: list_skills, read_skill                     │
│  these are advertised in `tools` on every turn       │
└──────────────────────────────────────────────────────┘
        │
        ▼
LLM turn 1 - recognizes match → tool_use: read_skill({id})
        │
        ▼
┌──────────────────────────────────────────────────────┐
│ read_skill handler                                   │
│  registry.getById(id) → renderSkillBody()            │
│   → "You are now following skill: {id} ({version})." │
│  delivered as tool_result (recency-first by protocol)│
└──────────────────────────────────────────────────────┘
        │
        ▼
LLM turn 2 - follows the skill body, calls domain tools
        │
        ▼
[ tool calls + results flow as normal ]
        │
        ▼
┌──────────────────────────────────────────────────────┐
│ AgentInstruction.onToolResult fires per skill rule   │
│  • tool X fails → inject "when X fails…" snippet     │
│  • trigger match → inject relevant skill hints       │
│   → same tool-result payload, same recency window    │
└──────────────────────────────────────────────────────┘
        │
        ▼
[ phase-2: after N tokens, refreshPolicy re-injects ]
[ the registry into a fresh tool_result breadcrumb  ]
```

No primitive is invented. Every step uses a field the Messages API has always had. The library's work is packaging: putting the right descriptions in the right surface for the right provider, with recency guarantees where provider training can't carry them.
## The one-line takeaway
Skills are context engineering for instructions, abstracted so you don't do it by hand and get it wrong. The protocol layer stays the Messages API's existing `system` + `tools` + `tool_result` fields - nothing new there. The framework layer's job is deciding where each instruction lives on every turn, per provider, per context length, so the model actually follows it. The Agent SDK's Claude-first answer is right for Claude; agentfootprint's `surfaceMode: 'auto'` + `onToolResult` delivery is right for every provider, `mock()` included.
## When this understanding matters
If you're:
- Building a Skills-style system without Claude in the stack - the Agent SDK pattern degrades. Use `surfaceMode: 'tool-only'` and put skill content in tool results, not the system prompt.
- Writing evals on `mock()` that should predict production - `surfaceMode: 'auto'` resolves to `'tool-only'` on `mock()`, matching the behavior you'll see on any provider whose training you can't bank on. If you'd coded descriptions into the system prompt yourself, `mock()` would trivially follow them and production would drift.
- Running agents for 6+ hours with large contexts - even on Claude, system-prompt anchoring decays past ~100K dense tokens. This is why `refreshPolicy` exists (Phase 2). Design your Skill bodies to be re-surfaceable via tool result, not just discoverable via system prompt.
- Wondering why your framework's "Skills" feature doesn't compose with your instruction system - because they're the same thing. Skills are `AgentInstruction`s with a discovery layer bolted on. If your framework treats them as separate primitives, you're paying for parallel abstractions.
## See also
- Skills - the API reference
- Instructions & decisions - the `AgentInstruction` primitive Skills compose over
- Tool use - the protocol Skills ride on