Trigger And Eval Playbook

Use this playbook for skills that matter enough to test.

A. Trigger Evaluation

Create three prompt buckets:

1. Should Trigger

Prompts that clearly need the skill.

Goal:

verify recall: the skill triggers on every prompt that needs it

2. Should Not Trigger

Prompts that are clearly outside the skill boundary.

Goal:

verify precision: the skill stays silent on prompts that do not need it

3. Near Neighbors

Prompts that look similar but should use another skill or no skill.

Goal:

catch false positives and ambiguous routing
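
The three buckets can be scored with a few lines of code. The sketch below is illustrative only: `keyword_router` is a hypothetical stand-in for whatever actually decides whether the skill fires, and the prompts and the PDF-themed skill are invented for the example.

```python
from typing import Callable, Dict, List


def score_trigger_eval(
    should_trigger: List[str],
    should_not_trigger: List[str],
    near_neighbors: List[str],
    triggers: Callable[[str], bool],
) -> Dict[str, float]:
    """Summarize trigger behavior across the three prompt buckets."""
    true_pos = sum(1 for p in should_trigger if triggers(p))
    false_pos_far = sum(1 for p in should_not_trigger if triggers(p))
    false_pos_near = sum(1 for p in near_neighbors if triggers(p))
    fired = true_pos + false_pos_far + false_pos_near
    return {
        # Share of "should trigger" prompts that actually fired.
        "recall": true_pos / len(should_trigger) if should_trigger else 0.0,
        # Share of all firings that were wanted.
        "precision": true_pos / fired if fired else 0.0,
        # Share of near-neighbor prompts that fired by mistake.
        "near_neighbor_false_positive_rate": (
            false_pos_near / len(near_neighbors) if near_neighbors else 0.0
        ),
    }


def keyword_router(prompt: str) -> bool:
    """Hypothetical router: fires whenever the prompt mentions a PDF."""
    return "pdf" in prompt.lower()


result = score_trigger_eval(
    should_trigger=["Fill out this PDF form", "Extract the tables from report.pdf"],
    should_not_trigger=["Write a haiku about autumn", "Refactor this Python function"],
    near_neighbors=[
        "Convert this Word document to Markdown",
        "Explain the history of the PDF format",
    ],
    triggers=keyword_router,
)
print(result)
```

In this toy run the second near-neighbor prompt mentions a PDF and fires the skill, which is exactly the kind of false positive the third bucket exists to surface.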

B. Execution Evaluation

For each important use case, create 1 to 3 realistic prompts with:

user-like phrasing
representative inputs or file types
expected output description
key checks
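
One way to record an execution case is a small data structure holding the four parts listed above. This is a minimal sketch under assumptions: the `ExecutionCase` name, its fields, and the sample prompt and checks are all hypothetical, and the key checks are modeled as simple predicates over the output text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExecutionCase:
    prompt: str          # user-like phrasing
    inputs: List[str]    # representative inputs or file types
    expected: str        # expected output description
    checks: Dict[str, Callable[[str], bool]] = field(default_factory=dict)  # key checks

    def run_checks(self, output: str) -> Dict[str, bool]:
        """Apply every key check to the skill's output and report pass/fail."""
        return {name: check(output) for name, check in self.checks.items()}


case = ExecutionCase(
    prompt="can you turn the attached sales csv into a quick summary table?",
    inputs=["sales.csv"],
    expected="A Markdown table with one row per region and a total row",
    checks={
        "has_table": lambda out: "|" in out,
        "has_total_row": lambda out: "total" in out.lower(),
    },
)

# Stand-in for real skill output, used only to exercise the checks.
sample_output = "| Region | Sales |\n|---|---|\n| North | 10 |\n| Total | 10 |"
report = case.run_checks(sample_output)
print(report)
```

Keeping the checks as named predicates makes a failed run easy to read: the report names which expectation broke rather than giving a single pass/fail.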

C. Revision Loop

When a skill underperforms:

Fix boundary or description problems before adding more body text.
Move brittle logic into scripts or templates.
Split reference content if SKILL.md becomes bloated.
Re-run the same eval set before expanding scope.
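
Re-running the same eval set is only useful if the two runs are compared case by case. The sketch below assumes each run is stored as a mapping from a case ID to pass/fail; the `find_regressions` helper and the case IDs are hypothetical.

```python
from typing import Dict, List


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return cases that passed before the revision and fail after it."""
    if before.keys() != after.keys():
        # Comparing different eval sets would hide or invent regressions.
        raise ValueError("eval sets differ; re-run the same cases before comparing")
    return sorted(case for case in before if before[case] and not after[case])


before = {"trigger-01": True, "trigger-02": False, "exec-01": True}
after = {"trigger-01": True, "trigger-02": True, "exec-01": False}

regressions = find_regressions(before, after)
print(regressions)
```

Here the revision fixed `trigger-02` but broke `exec-01`, so scope should not expand until that regression is resolved.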

D. Minimum QA By Skill Tier

Personal skill

2 realistic prompts
manual review

Team skill

3 to 5 realistic prompts
trigger positives and negatives
one revision loop

Infrastructure or meta-skill

5+ execution prompts
trigger positives, negatives, and near neighbors
benchmark notes across revisions
ownership and drift review
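
The tier minimums can also be written down as data and checked mechanically. This is a sketch only: the dictionary keys, the `missing_qa` helper, and the use of the lower bound (3) for the team tier's "3 to 5" prompts are assumptions made for illustration.

```python
from typing import Dict, List, Set

# Minimums transcribed from the tiers above; key names are hypothetical.
TIER_MINIMUMS: Dict[str, Dict] = {
    "personal": {
        "execution_prompts": 2,
        "requires": ["manual_review"],
    },
    "team": {
        "execution_prompts": 3,
        "requires": ["trigger_positives", "trigger_negatives", "revision_loop"],
    },
    "infrastructure": {
        "execution_prompts": 5,
        "requires": [
            "trigger_positives",
            "trigger_negatives",
            "near_neighbors",
            "benchmark_notes",
            "ownership_and_drift_review",
        ],
    },
}


def missing_qa(tier: str, execution_prompts: int, completed: Set[str]) -> List[str]:
    """List the QA items a skill still lacks for its tier."""
    minimum = TIER_MINIMUMS[tier]
    gaps: List[str] = []
    if execution_prompts < minimum["execution_prompts"]:
        gaps.append(
            f"execution_prompts: have {execution_prompts}, "
            f"need {minimum['execution_prompts']}"
        )
    gaps.extend(item for item in minimum["requires"] if item not in completed)
    return gaps


gaps = missing_qa("team", 2, {"trigger_positives"})
print(gaps)
```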