# Evaluation Path

Use evaluation to decide whether the skill actually changes agent behavior.

## Lightweight qualitative check

Run this by default:

1. Read the skill as an agent would.
2. Simulate one realistic task.
3. Confirm the output contract is clear.
4. Check that validation is possible.
5. List residual gaps.

## Depth rubric

Score each dimension as pass, partial, or fail:

- Trigger precision.
- Workflow completeness.
- Safety and permission boundaries.
- Output determinism.
- Validation strength.
- Progressive disclosure.

## Baseline comparison

Only run a deeper baseline-vs-with-skill comparison when requested or when risk is high. Use the same task, same inputs, and a holdout case that was not used while editing.
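
One way to keep the depth rubric and the baseline-comparison condition consistent across reviews is to record them in a small script. The following is a minimal sketch, not a prescribed tool: the names `Score`, `DIMENSIONS`, `summarize`, and `should_run_baseline` are hypothetical, and grouping dimensions by score is an assumed reporting format rather than something this document defines.

```python
from enum import Enum


class Score(Enum):
    """The three rubric outcomes named in the depth rubric."""
    PASS = "pass"
    PARTIAL = "partial"
    FAIL = "fail"


# The six rubric dimensions, in the order the rubric lists them.
DIMENSIONS = (
    "Trigger precision",
    "Workflow completeness",
    "Safety and permission boundaries",
    "Output determinism",
    "Validation strength",
    "Progressive disclosure",
)


def summarize(scores: dict[str, Score]) -> dict[str, list[str]]:
    """Group dimensions by score (assumed reporting format).

    Raises ValueError if any rubric dimension was left unscored,
    so a partial review cannot be mistaken for a complete one.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    summary: dict[str, list[str]] = {s.value: [] for s in Score}
    for dimension in DIMENSIONS:
        summary[scores[dimension].value].append(dimension)
    return summary


def should_run_baseline(requested: bool = False, high_risk: bool = False) -> bool:
    """Deeper baseline-vs-with-skill comparison: only on request or high risk."""
    return requested or high_risk


if __name__ == "__main__":
    # Hypothetical review: everything passes except one dimension.
    review = {d: Score.PASS for d in DIMENSIONS}
    review["Output determinism"] = Score.PARTIAL
    for score, dimensions in summarize(review).items():
        print(f"{score}: {', '.join(dimensions) or 'none'}")
    print("run baseline:", should_run_baseline(requested=False, high_risk=False))
```

The sketch deliberately does not compute a single overall grade, since the rubric only asks for a per-dimension score; any roll-up rule would be an additional policy decision.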