Agent Skills Best Practices

Community-sourced patterns, techniques, and pitfalls from practitioners and official documentation.

Progressive Disclosure Architecture

Three-tier information model: Discovery → Activation → Execution

Discovery Layer (~50 tokens)

YAML frontmatter that helps agents find the right skill without loading full content.

---
name: pdf-processing
description: Extracts text and tables from PDF files, fills forms, and merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---

Keys to effective discovery:

  • Include WHAT the skill does AND WHEN to use it
  • Use third-person voice
  • Include specific trigger terms users might mention
  • Keep under 100 tokens
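The checklist above can be approximated with a small lint. This is a minimal sketch, not an official validator: the names `lintDescription` and `DescriptionReport`, the four-characters-per-token estimate, and the regexes are all assumptions chosen for illustration.

```typescript
// Hypothetical lint for a skill description (heuristics are assumptions).
interface DescriptionReport {
  approxTokens: number;
  problems: string[];
}

function lintDescription(description: string): DescriptionReport {
  const problems: string[] = [];
  // Rough estimate only: ~4 characters per token is a rule of thumb, not a tokenizer.
  const approxTokens = Math.ceil(description.length / 4);

  if (approxTokens > 100) {
    problems.push("longer than ~100 tokens");
  }
  // WHEN: look for an explicit trigger clause.
  if (!/\buse when\b/i.test(description)) {
    problems.push('missing a "Use when ..." clause');
  }
  // Third person: flag first-person pronouns such as "me" in "Use me when".
  if (/\b(I|me|my)\b/.test(description)) {
    problems.push("uses first-person voice");
  }
  return { approxTokens, problems };
}
```

Running it over the `pdf-processing` description above reports no problems, while "Use me when you need to debug" is flagged for both voice and the missing trigger clause.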

Activation Layer (~2-5K tokens)

Core SKILL.md instructions loaded when skill is invoked.

Structure:

# Skill Name

<when_to_use>
Clear criteria for when this skill applies
</when_to_use>

<workflow>
Step-by-step process (numbered or structured)
</workflow>

<rules>
- ALWAYS: Mandatory behaviors
- NEVER: Prohibited actions
- PREFER: Recommended approaches
</rules>

<references>
Links to deep-dive docs in references/ subdirectory
</references>

Keys to effective activation:

  • Assume intelligence: Claude doesn't need basic concepts explained
  • Be directive, not comprehensive: Focus on what makes THIS approach different
  • Keep under 500 lines: Move details to references/
  • Use examples sparingly: Only for non-obvious patterns
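To act on the 500-line guideline, it helps to know which sections are the best candidates for moving into references/. The sketch below is one hypothetical way to measure that; `largestSections` and `overBudget` are illustrative names, and the heading detection is naive (it would also count `#` lines inside code samples).

```typescript
interface SectionSize {
  heading: string;
  lines: number;
}

// Splits markdown on ATX headings and returns the `top` largest sections.
// Assumption: any line starting with 1-6 '#' characters and a space is a heading.
function largestSections(markdown: string, top: number): SectionSize[] {
  const sections: SectionSize[] = [];
  let current: SectionSize = { heading: "(preamble)", lines: 0 };
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      sections.push(current);
      current = { heading: line.replace(/^#+\s*/, ""), lines: 0 };
    }
    current.lines += 1;
  }
  sections.push(current);
  return sections
    .filter((s) => s.lines > 0)
    .sort((a, b) => b.lines - a.lines)
    .slice(0, top);
}

// True when the file exceeds the line budget (500 by default, per the guideline above).
function overBudget(markdown: string, maxLines: number = 500): boolean {
  return markdown.split("\n").length > maxLines;
}
```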

Execution Layer (dynamic)

Deep-dive content loaded on-demand from references/ subdirectory.

Pattern from practitioners:

skill-name/
├── SKILL.md                    # Core workflow (500 lines max)
├── references/
│   ├── configuration.md        # Detailed config options
│   ├── error-handling.md       # Edge cases and recovery
│   ├── advanced-patterns.md    # Expert techniques
│   └── examples.md             # Worked examples
└── scripts/                    # Helper utilities

Why this works (source: Juan C Olamendy, skillmatic-ai):

  • Prevents context rot from loading irrelevant information
  • Allows targeted follow-up ("show me the advanced patterns")
  • Keeps initial load fast and focused
  • Scales to complex domains without overwhelming context

Skill Composition Patterns

Skills Invoking Skills

Pattern: Reference other skills in instructions rather than duplicating methodology.

## Error Investigation

Load the **outfitter:debugging** skill using the Skill tool to investigate
this authentication failure systematically.

Pass these parameters to the debugging workflow:
- Error context: [collected error details]
- Hypothesis: Token validation timing issue

Why this works:

  • Reuses established methodologies
  • Maintains single source of truth
  • Allows skills to evolve independently
  • Reduces duplication across skill library

Anti-pattern: Embedding another skill's instructions inline.

Subagent Architecture

For orchestrating specialized work with context isolation, see claude-code.md for Claude Code-specific patterns.

Skill + External Service Integration

Skills can integrate with external services (APIs, MCP servers) by separating concerns:

  • External service: Handles authentication, rate limiting, data access
  • Skill: Handles business logic, formatting, workflows

This separation enables reuse across similar domains.

Description Optimization

Goal: Help Claude discover your skill without loading it.

Include Both WHAT and WHEN

Vague: "Processes PDFs"

Specific: "Extracts text and tables from PDF files, fills forms, and merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction."

Use Third-Person Voice

Avoid: "Use me when you need to debug"

Prefer: "Debugs issues using systematic root cause analysis. Use when encountering errors, unexpected behavior, or test failures."

Include Trigger Terms

Think about what users actually say:

description: Creates weekly team status reports with wins, challenges, and priorities.
  Use when the user asks for a team update, standup report, weekly summary, or status
  email. Keywords: standup, weekly update, team report, status.

Be Specific About Scope

Vague: "Helps with testing"

Specific: "Implements test-driven development using Red-Green-Refactor cycles. Use when implementing new features with tests first, refactoring with test coverage, or reproducing bugs as failing tests."

Source: Official Anthropic best practices emphasize that specific descriptions keep Claude from loading irrelevant skills.

Common Pitfalls

1. Making SKILL.md Too Verbose

Symptom: 1000+ line SKILL.md files with exhaustive explanations.

Why it's a problem:

  • Wastes context window on every invocation
  • Buries key directives in noise
  • Assumes Claude needs basic concepts explained

Fix:

  • Keep SKILL.md under 500 lines
  • Move deep dives to references/
  • Trust Claude's base knowledge
  • Focus on WHAT makes THIS approach unique

Example (source: Anthropic best practices):

Verbose:

## What is Test-Driven Development?

Test-Driven Development (TDD) is a software development methodology where you write
tests before writing the actual code. This approach was popularized by Kent Beck
and has become a cornerstone of modern software engineering practices...

[500 lines of TDD philosophy]

Concise:

## TDD Workflow

1. **Red**: Write a failing test for the next small piece of functionality
2. **Green**: Write minimal code to make the test pass
3. **Refactor**: Improve code while keeping tests green

ALWAYS write the test first. NEVER skip the refactor step.

2. Negative-Only Constraints

Symptom: Instructions full of "NEVER do X" without alternatives.

Problem:

- NEVER use any types
- NEVER skip error handling
- NEVER commit without tests

Why it's a problem: Tells Claude what NOT to do but not what TO do.

Fix: Pair constraints with positive alternatives:

- ALWAYS use strict types; NEVER use `any`
- ALWAYS handle errors with Result types; NEVER let exceptions propagate silently
- ALWAYS run tests before committing; NEVER push untested code
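A rule list can be checked for this pitfall mechanically. The following is a minimal sketch under the assumption that rules use the ALWAYS / NEVER / PREFER keywords shown above; `findNegativeOnlyRules` is a hypothetical helper name.

```typescript
// Returns rules that contain NEVER but no positive alternative (ALWAYS or PREFER).
function findNegativeOnlyRules(rules: string[]): string[] {
  return rules.filter((rule) => {
    const hasNegative = /\bNEVER\b/.test(rule);
    const hasPositive = /\b(ALWAYS|PREFER)\b/.test(rule);
    return hasNegative && !hasPositive;
  });
}
```

For example, "NEVER use any types" is returned, while "ALWAYS use strict types; NEVER use `any`" is not.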

3. Deeply Nested File References

Symptom: Skills referencing files that reference other files 3+ levels deep.

Why it's a problem:

  • Context explosion
  • Circular references
  • Hard to maintain

Fix (source: skillmatic-ai research):

  • Keep references ONE level deep
  • Use table of contents in long reference files
  • Let Claude request additional detail if needed

Deep nesting:

SKILL.md → references/patterns.md → references/examples/auth.md → references/examples/auth/jwt.md

Flat structure:

SKILL.md → references/auth-patterns.md (with ToC for JWT, OAuth, etc.)
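Nesting depth can be measured if the links between files are known. The sketch below assumes the link graph has already been extracted into a plain object (the `LinkGraph` shape and `referenceDepth` name are hypothetical); it uses a breadth-first walk so circular references do not loop forever.

```typescript
// Maps each file to the files it links to. How this map is built is out of scope here.
type LinkGraph = { [file: string]: string[] };

// Returns how many link hops are needed to reach the most distant file from `root`.
function referenceDepth(graph: LinkGraph, root: string): number {
  const seen: { [file: string]: boolean } = {};
  seen[root] = true;
  let frontier: string[] = [root];
  let depth = 0;
  while (frontier.length > 0) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const target of graph[file] || []) {
        if (!seen[target]) {
          seen[target] = true; // visiting each file once also guards against cycles
          next.push(target);
        }
      }
    }
    if (next.length > 0) depth += 1;
    frontier = next;
  }
  return depth;
}
```

Under these assumptions the deeply nested layout above yields a depth of 3 and the flat layout a depth of 1, so a check such as `referenceDepth(graph, "SKILL.md") > 1` could flag violations.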

4. Not Treating Skills Like Code

Symptom: Skills maintained as loose documents without version control, testing, or reviews.

Why it's a problem:

  • Skills drift from reality
  • Breaking changes go unnoticed
  • No way to roll back problematic versions

Fix (source: blog.sshh.io, Nate's newsletter):

  • Version control: Skills in git repos with semantic versioning
  • Testing: Build evaluations to validate skill behavior
  • Reviews: Treat skill PRs like code reviews
  • Changelog: Document what changed and why

Pattern from practitioners:

---
name: api-integration
version: 2.1.0
changelog: |
  2.1.0 - Added retry logic for rate limiting
  2.0.0 - Switched to streaming responses (breaking)
  1.5.0 - Added webhook verification
---

5. Over-Relying on Auto-Compaction

Symptom: Never manually clearing context, letting auto-compaction handle everything.

Why it's a problem (source: blog.sshh.io practitioner experience):

  • Important context gets compressed or dropped
  • Skill instructions get summarized incorrectly
  • Debugging becomes harder when full skill isn't visible

Fix: Manual context management strategy:

  1. Start complex tasks with /clear for clean slate
  2. Use /catchup with explicit context about what skills are active
  3. Let auto-compaction handle routine continuations
  4. Force reload skills after compaction if behavior seems off

When to manually clear:

  • Starting new major feature
  • Switching between unrelated tasks
  • After hitting context limits on complex debugging
  • When skill behavior seems inconsistent

6. Unclear Skill Boundaries

Symptom: Skill tries to do too many unrelated things.

Example: "code-helper" that does linting, testing, documentation, deployment, and debugging.

Why it's a problem:

  • Hard to discover (description too generic)
  • Loads unnecessary context
  • Becomes a maintenance nightmare

Fix: One skill, one job.

Well-scoped skills:

  • linting-workflow: Code quality checks and fixes
  • tdd: TDD methodology
  • api-documentation: API reference generation
  • deployment-automation: Deploy and rollback workflows
  • debugging: Root cause investigation

Exception: Orchestrator skills that explicitly load other skills (like feature-development that loads TDD → documentation → deployment in sequence).

7. No Usage Examples

Symptom: Skill has abstract instructions but no concrete examples.

Why it's a problem: Claude may misinterpret intent without seeing desired output.

Fix: Include 1-2 examples in references/examples.md

Pattern:

# Examples

## Example 1: Simple Case

**Input**: User asks to add login endpoint

**Workflow**:
1. Load TDD skill
2. Write failing test for /login POST
3. Implement minimal auth logic
4. Refactor to service layer

**Output**: [Show actual test code + implementation]

## Example 2: Edge Case

**Input**: User asks to add login with OAuth and JWT and refresh tokens

**Workflow**:
1. Load pathfinding skill to break down requirements
2. Load TDD skill for each component separately
3. OAuth integration → JWT generation → Refresh logic
4. Each gets its own test cycle

**Output**: [Show breakdown and test structure]

Source: Official Anthropic best practices recommend examples for non-obvious patterns.

Testing Strategies

Eval-Driven Development

Pattern: Build evaluations BEFORE extensive documentation (source: Nate's newsletter).

Workflow:

  1. Create minimal skill version
  2. Build test suite with target inputs/outputs
  3. Iterate skill until evals pass consistently
  4. THEN write comprehensive docs

Why this works:

  • Prevents documenting the wrong approach
  • Faster iteration cycles
  • Forces clarity about success criteria
  • Builds regression test suite automatically

Implementation (from Nate's debugging toolkit):

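// skill-testing-framework pattern: one eval describes one scenario
interface SkillEval {
  name: string;                // unique identifier for the eval
  input: string;               // user prompt that should trigger the skill
  expectedBehavior: string[];  // behaviors that must be observed
  forbiddenBehavior: string[]; // behaviors that fail the eval if observed
  targetModels: ('haiku' | 'sonnet' | 'opus')[]; // models to run the eval against
}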

const tddSkillEvals: SkillEval[] = [
  {
    name: "basic-tdd-workflow",
    input: "Add a login endpoint",
    expectedBehavior: [
      "Writes test first",
      "Test fails initially (red phase)",
      "Implements minimal solution",
      "Test passes (green phase)",
      "Refactors with tests passing"
    ],
    forbiddenBehavior: [
      "Writes implementation before test",
      "Skips refactor step",
      "Makes test pass by modifying test"
    ],
    targetModels: ['haiku', 'sonnet', 'opus']
  }
];
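The source does not show how such an eval is scored. One possible approach, sketched here as an assumption, is to have a grader (human or model) label the behaviors observed in a transcript using the same strings, then compare them against the spec. `BehaviorEval`, `EvalResult`, and `scoreEval` are hypothetical names; `BehaviorEval` restates only the fields the scorer needs.

```typescript
interface BehaviorEval {
  name: string;
  expectedBehavior: string[];
  forbiddenBehavior: string[];
}

interface EvalResult {
  name: string;
  passed: boolean;
  missing: string[];    // expected behaviors that were not observed
  violations: string[]; // forbidden behaviors that were observed
}

// Assumption: `observed` holds behavior labels assigned by a separate grading step.
function scoreEval(spec: BehaviorEval, observed: string[]): EvalResult {
  const missing = spec.expectedBehavior.filter((b) => observed.indexOf(b) === -1);
  const violations = spec.forbiddenBehavior.filter((b) => observed.indexOf(b) !== -1);
  return {
    name: spec.name,
    passed: missing.length === 0 && violations.length === 0,
    missing,
    violations,
  };
}
```

Keeping `missing` and `violations` separate makes it easier to tell whether a skill is under-specified or actively steering toward the wrong behavior.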

Multi-Model Testing

Pattern: Test skills with all target models.

Why: Haiku, Sonnet, and Opus interpret instructions differently:

  • Haiku: Needs more explicit instructions, less inference
  • Sonnet: Balanced reasoning, good for most workflows
  • Opus: Handles complex context, better with ambiguity

Testing strategy (source: Anthropic best practices):

| Aspect | Haiku Test | Sonnet Test | Opus Test |
|---|---|---|---|
| Clarity | Do instructions work with minimal reasoning? | Do instructions balance brevity and clarity? | Do instructions leverage advanced reasoning? |
| Context | Works with small context? | Handles moderate references? | Manages large cross-references? |
| Edge cases | Explicit handling? | Reasonable inference? | Sophisticated judgment? |

Fix pattern: If Haiku fails but Sonnet passes, instructions likely assume too much inference.
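The fix pattern can be expressed as a small triage function over per-model results. Only the "Haiku fails, Sonnet passes" rule comes from the text above; the other two outcomes and the names `ModelResults` and `triage` are assumptions added to make the sketch complete.

```typescript
type Model = "haiku" | "sonnet" | "opus";
type ModelResults = { [M in Model]: boolean }; // true means the eval passed on that model

type Triage = "ok" | "too-much-inference" | "needs-investigation";

function triage(results: ModelResults): Triage {
  if (results.haiku && results.sonnet && results.opus) {
    return "ok";
  }
  // From the fix pattern above: instructions likely assume too much inference.
  if (!results.haiku && results.sonnet) {
    return "too-much-inference";
  }
  // Assumption: any other failure mix has no single known cause.
  return "needs-investigation";
}
```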

Real-World Usage Testing

Pattern: Test skills with actual users/agents in production-like scenarios.

Anti-pattern: Only testing with constructed examples.

Strategy (from practitioner experience):

  1. Dogfooding: Use your own skills for real work
  2. Iteration tracking: Log when skills are loaded but not followed
  3. Confusion signals: Detect when Claude asks for clarification (skill might be unclear)
  4. Outcome validation: Did the skill achieve its intended result?

Metrics to track:

  • Skill load frequency (is it discoverable?)
  • Completion rate (do workflows finish?)
  • User satisfaction (did it solve the problem?)
  • Iteration count (how many tries to get it right?)
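Three of these metrics can be derived from a simple usage log. The sketch below assumes a hypothetical `SkillRun` record per skill load; user satisfaction is left out because it needs a separate feedback signal.

```typescript
// Hypothetical log record: one entry per time a skill was loaded.
interface SkillRun {
  skill: string;
  completed: boolean; // did the workflow finish?
  attempts: number;   // how many tries until the result was accepted
}

interface SkillMetrics {
  loads: number;
  completionRate: number; // fraction between 0 and 1
  meanAttempts: number;
}

function summarizeRuns(runs: SkillRun[], skill: string): SkillMetrics {
  const own = runs.filter((r) => r.skill === skill);
  if (own.length === 0) {
    return { loads: 0, completionRate: 0, meanAttempts: 0 };
  }
  const completed = own.filter((r) => r.completed).length;
  const attempts = own.reduce((sum, r) => sum + r.attempts, 0);
  return {
    loads: own.length,
    completionRate: completed / own.length,
    meanAttempts: attempts / own.length,
  };
}
```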

From blog.sshh.io: "Built 10 debugging tools after watching 100 people hit the same problems in their first week."

Systematic Evaluation Framework

Components (source: Nate's newsletter, skillmatic-ai research):

  1. skill-debugging-assistant: Identifies where skills fail
  2. skill-security-analyzer: Checks for security risks in skill code
  3. skill-gap-analyzer: Finds missing skills in your library
  4. skill-performance-profiler: Tracks context usage and latency
  5. prompt-optimization-analyzer: Improves skill descriptions for discovery
  6. skill-testing-framework: Automated test runner for skills

Pattern: Build tools to test tools.

Advanced Techniques

Hook-Based Validation

For platform-specific hook implementation patterns, see claude-code.md.

General principle: Use hooks to enforce constraints at decision points—prevent destructive operations, enforce testing requirements, validate configuration before deployment.
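As a platform-neutral illustration of that principle, a guard can inspect a proposed command before it runs. This is a sketch only and not any platform's hook API: `ToolCall`, `GuardDecision`, `guardToolCall`, and the pattern list are hypothetical, and the regexes are deliberately simple.

```typescript
interface ToolCall {
  tool: string;
  input: string; // the command the agent wants to run
}

interface GuardDecision {
  allow: boolean;
  reason: string;
}

// Illustrative patterns for destructive operations; a real list would be project-specific.
const blockedPatterns: { pattern: RegExp; reason: string }[] = [
  { pattern: /\brm\s+-rf\b/, reason: "recursive force delete" },
  { pattern: /\bgit\s+push\b.*--force\b/, reason: "force push" },
  { pattern: /\bdrop\s+table\b/i, reason: "destructive SQL" },
];

function guardToolCall(call: ToolCall, testsPassed: boolean): GuardDecision {
  for (const rule of blockedPatterns) {
    if (rule.pattern.test(call.input)) {
      return { allow: false, reason: "blocked: " + rule.reason };
    }
  }
  // Enforce a testing requirement at the commit decision point.
  if (/\bgit\s+commit\b/.test(call.input) && !testsPassed) {
    return { allow: false, reason: "blocked: run tests before committing" };
  }
  return { allow: true, reason: "ok" };
}
```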

Organization-Wide Skill Libraries

Pattern: Centralized skill repository as institutional knowledge (source: Juan C Olamendy, Medium).

Structure:

company-skills/
├── engineering/
│   ├── deployment-workflow/
│   ├── incident-response/
│   └── architecture-review/
├── product/
│   ├── user-story-creation/
│   └── feature-planning/
└── business/
    ├── team-standup/
    └── quarterly-planning/

Benefits:

  • Codifies company processes
  • Onboarding material becomes executable
  • Process improvements propagate automatically
  • Consistency across teams

Implementation (from practitioners):

  1. Central registry: Marketplace or internal skill server
  2. Contribution guidelines: Templates for creating company skills
  3. Review process: Skills reviewed like code before publishing
  4. Version management: Semantic versioning for breaking changes
  5. Deprecation policy: How to sunset old patterns

Pattern from blog.sshh.io:

# Company Skill Manifest

## Deployment
- `deployment-staging`: Deploy to staging with rollback plan
- `deployment-production`: Production deploy with checklist
- `deployment-rollback`: Emergency rollback procedures

## Code Review
- `pr-review-backend`: Backend code review checklist
- `pr-review-frontend`: Frontend code review standards
- `security-review`: Security-focused code review

## Documentation
- `api-documentation`: OpenAPI spec generation
- `readme-maintenance`: README updates for features

Anti-pattern: Every team building their own version of the same workflows.

Progressive Skill Disclosure in Practice

Advanced pattern: Table of contents in reference files for targeted loading.

Example (source: skillmatic-ai architecture):

# API Integration Patterns

## Table of Contents

- [REST Basics](#rest-basics) - Standard CRUD operations
- [GraphQL](#graphql) - Query and mutation patterns
- [Webhooks](#webhooks) - Event-driven integrations
- [Rate Limiting](#rate-limiting) - Backoff and retry
- [Authentication](#authentication) - OAuth, JWT, API keys
- [Error Handling](#error-handling) - Retry logic and fallbacks

## REST Basics

[Focused content on REST]

## GraphQL

[Focused content on GraphQL]

Usage: Skill says "See references/api-patterns.md#rate-limiting for retry logic" rather than loading entire file.

Why it works:

  • Claude can navigate to specific section
  • Preserves context for other tasks
  • User can request more depth if needed
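To make the targeted loading concrete, the sketch below pulls a single section out of a reference file by its anchor. It is an illustration under assumptions: `slugify` only approximates common heading-anchor rules, and `extractSection` is a hypothetical helper, not part of any skill runtime.

```typescript
// Approximates heading anchors: lowercase, strip punctuation, spaces to hyphens.
function slugify(heading: string): string {
  return heading
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "")
    .replace(/\s+/g, "-");
}

// Returns the section whose heading matches `anchor`, up to the next heading
// of the same or higher level, or null when no heading matches.
function extractSection(markdown: string, anchor: string): string | null {
  const lines = markdown.split("\n");
  let start = -1;
  let level = 0;
  for (let i = 0; i < lines.length; i++) {
    const match = /^(#{1,6})\s+(.*)$/.exec(lines[i]);
    if (!match) continue;
    if (start === -1) {
      if (slugify(match[2]) === anchor) {
        start = i;
        level = match[1].length;
      }
    } else if (match[1].length <= level) {
      return lines.slice(start, i).join("\n").trim();
    }
  }
  return start === -1 ? null : lines.slice(start).join("\n").trim();
}
```

With the table of contents above, `extractSection(file, "rate-limiting")` would return only the Rate Limiting section, leaving the rest of the file out of context.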

Skills as Living Documentation

Pattern: Skills replace static documentation that goes stale.

Traditional docs: "Here's how to deploy" (written once, outdated quickly)

Skill: Executes deployment with current best practices

Benefits (source: Juan C Olamendy):

  • Always current: If process changes, skill changes
  • Executable: Not just instructions but enforcement
  • Testable: Verify the process actually works
  • Discoverable: Claude can find relevant process

Example transformation:

Static doc (docs/deployment.md):

# Deployment Process

1. Run tests
2. Update version number
3. Build production bundle
4. Upload to S3
5. Clear CDN cache
6. Notify team in Slack

[This gets outdated when we switch to Vercel]

Skill (skills/deployment/SKILL.md):

---
name: deployment-production
description: Deploys to production with safety checks
---

# Production Deployment

1. Verify all tests pass: `bun test`
2. Run build: `bun run build`
3. Deploy to Vercel: `vercel --prod`
4. Verify deployment: Check /api/health
5. Notify team: Use Slack MCP to post to #deployments

ALWAYS wait for health check before considering deploy complete.

When process changes: Update skill, test it, deploy new version. Documentation stays current.

Skill Chaining for Complex Workflows

Pattern: Master skill orchestrates sequence of specialized skills.

Example (source: practitioner patterns):

---
name: feature-development
description: End-to-end feature development workflow
---

# Feature Development Workflow

## Stage 1: Planning
Load **pathfinding** skill to clarify requirements and architecture.

## Stage 2: Implementation
Load **tdd** skill to implement with tests.

## Stage 3: Documentation
Load **api-documentation** skill to generate API docs.

## Stage 4: Review
Load **code-review** skill to validate implementation.

## Stage 5: Deployment
Load **deployment-staging** skill to deploy for testing.

Each stage must complete successfully before proceeding to next.

Advantage: Each specialized skill can evolve independently. Feature-development orchestrates but doesn't duplicate.

Related pattern - Conditional chaining:

## Error Recovery

If tests fail in Stage 2:
  Load **debugging** skill to investigate
  Return to Stage 2 after fixes

If code review finds issues in Stage 4:
  Return to Stage 2 for fixes
  Re-run Stage 3 to update docs
  Re-run Stage 4 to re-review
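The staged workflow and its recovery rules amount to a small state machine. The sketch below models only the transitions; the `Stage` names and `nextStage` are hypothetical, and retrying a failed stage in place (other than review) is an assumption generalized from the Stage 2 rule.

```typescript
type Stage = "planning" | "implementation" | "documentation" | "review" | "deployment";

const stageOrder: Stage[] = ["planning", "implementation", "documentation", "review", "deployment"];

// Decides which stage runs next, given the outcome of the current one.
function nextStage(current: Stage, succeeded: boolean): Stage | "done" {
  if (!succeeded) {
    // Review issues send the work back to implementation; docs and review
    // are then re-run by the normal forward order.
    if (current === "review") return "implementation";
    // Otherwise stay on the same stage (for implementation: debug, then retry).
    return current;
  }
  const index = stageOrder.indexOf(current);
  return index === stageOrder.length - 1 ? "done" : stageOrder[index + 1];
}
```

Because a failed review returns to implementation, the forward order automatically re-runs documentation and review afterwards, matching the recovery steps above.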

Security Considerations

Critical warning (source: Sid Bharath tutorial, security research): Skills can execute arbitrary code and access files. Only use skills from trusted sources.

Risks

  1. Code execution: Skills can include scripts that run on your machine
  2. File access: Skills can read/write files in project
  3. Network access: Skills can make HTTP requests
  4. Credential access: Skills can access environment variables, config files
  5. Social engineering: Malicious skills disguised as helpful tools

Protection Strategies

1. Source verification:

  • Only install skills from trusted authors
  • Review skill code before using
  • Check community reputation and reviews
  • Verify skill matches description (no hidden behavior)

2. Code review checklist (from security research):

## Skill Security Review

- [ ] Review all scripts in scripts/ directory
- [ ] Check for file system access patterns
- [ ] Verify network requests are legitimate
- [ ] Confirm no credential harvesting
- [ ] Check for obfuscated code
- [ ] Validate external dependencies
- [ ] Test in isolated environment first

3. Sandbox testing:

  • Test new skills in isolated project first
  • Use throwaway credentials for initial testing
  • Monitor file system and network activity
  • Check for unexpected side effects

4. Minimal permissions:

# Proposed security metadata (from research)
permissions:
  file_read: ['src/**', 'docs/**']
  file_write: ['docs/**']
  network: ['https://api.company.com']
  environment: []
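Since this metadata is only a proposal, nothing enforces it today. The sketch below shows how a wrapper could check requests against it, under stated assumptions: the `SkillPermissions` shape mirrors the fields above, only `dir/**` patterns and exact paths are supported, and all function names are hypothetical.

```typescript
interface SkillPermissions {
  file_read: string[];
  file_write: string[];
  network: string[];
  environment: string[];
}

// Rejects absolute paths and any ".." segment so "docs/../src/x" cannot escape.
function isSafeRelativePath(path: string): boolean {
  return path.charAt(0) !== "/" && path.split("/").indexOf("..") === -1;
}

// Simplified matcher: "dir/**" matches everything under dir/, otherwise exact match.
function matchesPattern(path: string, pattern: string): boolean {
  if (pattern.slice(-3) === "/**") {
    return path.indexOf(pattern.slice(0, -2)) === 0;
  }
  return path === pattern;
}

function isPathAllowed(path: string, patterns: string[]): boolean {
  return isSafeRelativePath(path) && patterns.some((p) => matchesPattern(path, p));
}

// A URL is allowed when it equals a listed origin or extends it with a path.
function isUrlAllowed(url: string, origins: string[]): boolean {
  return origins.some((origin) => url === origin || url.indexOf(origin + "/") === 0);
}

const examplePermissions: SkillPermissions = {
  file_read: ["src/**", "docs/**"],
  file_write: ["docs/**"],
  network: ["https://api.company.com"],
  environment: [],
};
```

Requiring the trailing slash in `isUrlAllowed` keeps look-alike hosts that merely start with an allowed origin from passing the check.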

5. Audit logging: Track what skills do in production:

  • What files were accessed?
  • What commands were executed?
  • What network requests were made?

From security papers: "Skills are code execution with conversational interface. Treat them with same security rigor as any code dependency."

Organization-Wide Patterns

Skill as Institutional Knowledge

Pattern: Replace tribal knowledge with executable skills (source: Juan C Olamendy).

Traditional problem:

  • "How do we deploy?" → Ask Sarah, she knows
  • "What's the PR review process?" → Different on every team
  • "How do we handle incidents?" → Check the wiki (outdated)

Skill solution:

  • deployment-production skill: Encodes Sarah's knowledge
  • pr-review skill: Standardizes review process
  • incident-response skill: Current playbook, always up to date

Implementation strategy:

  1. Identify critical workflows: What knowledge is locked in people's heads?
  2. Interview experts: How do they actually do the work?
  3. Create skills: Encode process as executable workflow
  4. Test with novices: Can someone unfamiliar complete the task?
  5. Iterate: Refine based on real usage
  6. Deprecate docs: Point to skills instead of wikis

Example from blog.sshh.io:

---
name: internal-deploy
description: Company deployment process with all safety checks
---

# Internal Deployment Workflow

## Pre-Deploy Checklist
1. Verify Jira ticket is in "Ready for Deploy" status
2. Confirm tests pass in CI: `check-ci-status`
3. Get approval in #deploy-requests Slack channel

## Deploy
1. Run staging deploy: `npm run deploy:staging`
2. Verify staging health: `curl https://staging.company.com/health`
3. Run smoke tests: `npm run smoke-test:staging`
4. Deploy to production: `npm run deploy:prod`
5. Monitor for 5 minutes: Watch Datadog dashboard

## Post-Deploy
1. Verify production health: `curl https://company.com/health`
2. Post to #deployments: "Deployed [feature] to prod"
3. Update Jira ticket to "Deployed"

NEVER skip smoke tests. ALWAYS monitor after deploy.

Benefit: New team members can deploy safely on day one.

Contribution Guidelines

Pattern: Treat skills like open source contributions.

Template (from ComposioHQ awesome-claude-skills):

# Contributing Skills

## Before Submitting

1. **Test thoroughly**: Run skill with Haiku, Sonnet, and Opus
2. **Follow structure**: Use provided skill template
3. **Document clearly**: Include description, when to use, examples
4. **Security review**: No malicious code or credential access
5. **License**: MIT or Apache 2.0

## Skill Requirements

- [ ] Descriptive name (kebab-case)
- [ ] Clear description with trigger terms
- [ ] SKILL.md under 500 lines
- [ ] References in references/ subdirectory
- [ ] At least one example in examples/
- [ ] Testing results documented
- [ ] README.md with usage instructions

## Review Process

1. Submit PR with skill in skills/your-skill-name/
2. Maintainers review for quality and security
3. Address feedback
4. Approved skills merged and published

Versioning Strategy

Pattern: Semantic versioning for skills (from practitioners).

Format: MAJOR.MINOR.PATCH

---
name: api-integration
version: 2.1.0
---

Versioning rules:

  • MAJOR: Breaking changes (workflow steps changed, different inputs required)
  • MINOR: New features (additional optional steps, new references added)
  • PATCH: Bug fixes (typos, clarifications, small improvements)
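A release check can classify the bump between two versions so reviewers know whether a migration guide is expected. This is a minimal sketch assuming plain MAJOR.MINOR.PATCH strings with no pre-release suffixes; `parseVersion` and `classifyBump` are hypothetical names, and downgrades are not treated specially.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

function parseVersion(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) {
    throw new Error("Not MAJOR.MINOR.PATCH: " + version);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Reports the most significant component that differs between two versions.
function classifyBump(from: string, to: string): Bump {
  const [fromMajor, fromMinor, fromPatch] = parseVersion(from);
  const [toMajor, toMinor, toPatch] = parseVersion(to);
  if (toMajor !== fromMajor) return "major";
  if (toMinor !== fromMinor) return "minor";
  if (toPatch !== fromPatch) return "patch";
  return "none";
}
```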

Breaking change example:

# Version 1.x: Required user to provide API key
---
name: api-client
version: 1.5.0
description: Make API calls with provided credentials
---

# Version 2.x: Uses MCP server for authentication (breaking)
---
name: api-client
version: 2.0.0
description: Make API calls using Linear MCP server
---

Migration guide pattern:

# Migration Guide: 1.x → 2.0

## Breaking Changes

- No longer accepts `api_key` parameter
- Now requires Linear MCP server configured
- Response format changed from JSON to structured objects

## Migration Steps

1. Install Linear MCP server: `/mcp install linear`
2. Update skill invocations to remove `api_key`
3. Update code expecting JSON to handle structured objects

Summary: Hierarchy of Best Practices

Essential (Do These Always)

  1. Progressive disclosure: Keep SKILL.md under 500 lines, use references/
  2. Clear descriptions: Include what AND when, with trigger terms
  3. Assume intelligence: Claude doesn't need basics explained
  4. Test with real usage: Dogfood your own skills
  5. Version control: Track changes, review like code

Important (Do These Usually)

  1. Multi-model testing: Verify Haiku, Sonnet, Opus behavior
  2. Positive constraints: Say what TO do, not just what NOT to do
  3. Examples for non-obvious: Show expected behavior
  4. Composition over duplication: Reference other skills
  5. Security review: Audit code execution and file access

Advanced (Do These for Scale)

  1. Eval-driven development: Build tests before extensive docs
  2. Hook-based enforcement: Use PreToolUse for quality gates
  3. Organization-wide libraries: Centralized skill registry
  4. Semantic versioning: Track breaking changes
  5. Skills as living docs: Replace static documentation

Expert (Do These for Excellence)

  1. Systematic evaluation framework: Build tools to test tools
  2. Master-Clone architecture: Optimize context usage
  3. Conditional skill chaining: Orchestrate complex workflows
  4. Audit logging: Track skill execution in production
  5. Community contribution: Share patterns, learn from others

Sources

Research synthesized from:

  • Official Documentation: Anthropic Claude Agent Skills Best Practices
  • Community Repositories: ComposioHQ/awesome-claude-skills, skillmatic-ai/awesome-agent-skills
  • Practitioner Blogs: blog.sshh.io (Claude Code at scale), Juan C Olamendy (Medium), Sid Bharath
  • Research: Security considerations from academic papers, progressive disclosure architecture
  • Tooling: Nate's Newsletter (debugging toolkit), evaluation frameworks

Last updated: 2026-01-10