741 B

Raw Blame History

Evaluation Path

Use evaluation to decide whether the skill actually changes agent behavior.

Lightweight qualitative check

Run this by default:

Read the skill as an agent would.
Simulate one realistic task.
Confirm the output contract is clear.
Check that validation is possible.
List residual gaps.

Depth rubric

Score each dimension as pass, partial, or fail:

Trigger precision.
Workflow completeness.
Safety and permission boundaries.
Output determinism.
Validation strength.
Progressive disclosure.

Baseline comparison

Only run a deeper baseline-vs-with-skill comparison when requested or when risk is high. Use the same task, same inputs, and a holdout case that was not used while editing.

741 B Raw Blame History

Evaluation Path

Lightweight qualitative check

Depth rubric

Baseline comparison

741 B

Raw Blame History