# Evaluation Path

Use evaluation to decide whether the skill actually changes agent behavior.

## Lightweight qualitative check

Run this by default:

1. Read the skill as an agent would.
2. Simulate one realistic task.
3. Confirm the output contract is clear.
4. Check that validation is possible.
5. List residual gaps.

## Depth rubric

Score each dimension as pass, partial, or fail:

- Trigger precision.
- Workflow completeness.
- Safety and permission boundaries.
- Output determinism.
- Validation strength.
- Progressive disclosure.
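
One way to record the rubric is a small scorecard keyed by dimension. The sketch below is illustrative only: the `DIMENSIONS` identifiers mirror the list above, while the `summarize` helper and its aggregation rule (the worst score decides the overall result) are assumptions, not part of the skill.

```python
from collections import Counter

# Identifiers mirror the rubric dimensions; the names themselves are illustrative.
DIMENSIONS = (
    "trigger_precision",
    "workflow_completeness",
    "safety_and_permission_boundaries",
    "output_determinism",
    "validation_strength",
    "progressive_disclosure",
)
ALLOWED = ("pass", "partial", "fail")


def summarize(scores: dict[str, str]) -> dict:
    """Validate a scorecard and tally it (hypothetical helper)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    invalid = {d: s for d, s in scores.items() if s not in ALLOWED}
    if invalid:
        raise ValueError(f"scores must be one of {ALLOWED}: {invalid}")

    counts = Counter(scores[d] for d in DIMENSIONS)
    # Assumed rule: any fail means fail, otherwise any partial means partial.
    if counts["fail"]:
        overall = "fail"
    elif counts["partial"]:
        overall = "partial"
    else:
        overall = "pass"
    return {"overall": overall, "counts": {s: counts[s] for s in ALLOWED}}
```

Calling `summarize` with a missing dimension raises `ValueError`, which keeps a partially filled scorecard from being read as a pass.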

## Baseline comparison

Run a deeper baseline-versus-with-skill comparison only when requested or when risk is high. Use the same task and the same inputs for both runs, and include a holdout case that was not used while editing the skill.
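
A minimal sketch of such a comparison follows. It assumes two callables that you supply yourself: a hypothetical `run_agent(task, inputs, use_skill)` that returns the agent's output as a string, and a hypothetical `check(output)` that returns `True` when the output meets the contract. Neither is a real API, and the `Case` structure is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Case:
    """One evaluation case (hypothetical structure)."""
    name: str
    task: str
    inputs: str
    holdout: bool = False  # True if the case was not used while editing


def compare(
    cases: Iterable[Case],
    run_agent: Callable[[str, str, bool], str],
    check: Callable[[str], bool],
) -> list[dict]:
    """Run every case without and with the skill and record both results."""
    cases = list(cases)
    if not any(c.holdout for c in cases):
        raise ValueError("add at least one holdout case")

    rows = []
    for case in cases:
        # Same task and same inputs for both runs; only the skill flag differs.
        baseline_ok = check(run_agent(case.task, case.inputs, False))
        with_skill_ok = check(run_agent(case.task, case.inputs, True))
        rows.append(
            {
                "case": case.name,
                "holdout": case.holdout,
                "baseline": baseline_ok,
                "with_skill": with_skill_ok,
                "improved": with_skill_ok and not baseline_ok,
            }
        )
    return rows
```

Rows where `improved` is true on a holdout case are the ones least likely to reflect tuning the skill to its own examples.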