741 B
741 B
Evaluation Path
Use evaluation to decide whether the skill actually changes agent behavior.
Lightweight qualitative check
Run this by default:
- Read the skill as an agent would.
- Simulate one realistic task.
- Confirm the output contract is clear.
- Check that validation is possible.
- List residual gaps.
Depth rubric
Score each dimension as pass, partial, or fail:
- Trigger precision.
- Workflow completeness.
- Safety and permission boundaries.
- Output determinism.
- Validation strength.
- Progressive disclosure.
Baseline comparison
Only run a deeper baseline-vs-with-skill comparison when requested or when risk is high. Use the same task, same inputs, and a holdout case that was not used while editing.