Evals — does the skill actually route to the right answer?
A skill is only as good as an agent's ability to find and apply the right entry under a real
problem. These evals test that, in two tiers, against a fixed set of realistic scenarios
(cases.jsonl) spanning both halves of the skill (remote-GPU operations on every
platform family + the DL-training-debug layer, including the convergence-debugging and
data-pipeline files).
Tier 1 — structural reachability (runnable, no API key)
python evals/run_evals.py # exits non-zero if any case regresses
For each scenario it asserts that the answer is still present at the documented location, with the
expected entry IDs and keywords intact: every expect_files path exists, every expect_ids entry is
still a ### <ID> header in those files, and every expect_grep term still appears in the text. This
is a drift guard: it catches a renamed or removed entry, a moved section, a deleted file, or a fact
rewritten away from its key term. Run it in CI; it needs nothing but Python 3.
What it does not prove: that an agent actually navigates there (Tier 2), or that the platform facts are correct on a live box (see Verification status).
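The three assertions above can be sketched as a single function. This is an illustration only, not the contents of evals/run_evals.py: the function name check_case, the failure-message wording, and the choice of case-insensitive substring matching for expect_grep are all assumptions.

```python
import re
from pathlib import Path


def check_case(case, skill_root):
    """Return a list of failure messages for one case (empty list = pass).

    Hypothetical helper: the real runner may structure its checks differently.
    """
    failures = []
    texts = []
    for rel in case.get("expect_files", []):
        path = Path(skill_root) / rel
        if not path.is_file():
            failures.append(f"missing file: {rel}")
            continue
        texts.append(path.read_text(encoding="utf-8"))
    combined = "\n".join(texts)

    for entry_id in case.get("expect_ids", []):
        # The ID must still open a level-3 header; \b stops O7 matching O70.
        pattern = rf"^###\s+{re.escape(entry_id)}\b"
        if not re.search(pattern, combined, flags=re.MULTILINE):
            failures.append(f"missing header: ### {entry_id}")

    for term in case.get("expect_grep", []):
        # Assumption: terms are matched as case-insensitive substrings.
        if term.lower() not in combined.lower():
            failures.append(f"missing term: {term}")
    return failures
```

A runner built on this sketch would call check_case for every line of cases.jsonl and exit non-zero if any returned list is non-empty.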
Tier 2 — agentic navigation (the gold standard)
The real test: give a fresh agent the skill and one scenario's prompt, let it navigate from
SKILL.md only (following the documented routing, not blind grep), and check it reaches a correct,
specific answer covering the case's must_cover points within ~2 hops. Each case records its last
such run in the agentic field; the collected runs are in RESULTS.md.
To re-run Tier 2 with any agent/harness: load the skill, paste a case prompt, and grade the
answer against expect_files / expect_ids / must_cover. (Anthropic's skill best-practices
recommend ≥3 evals across Haiku/Sonnet/Opus — re-running these cases per model is the way to meet
that bar; results to date were gathered on the development model and are labelled as such.)
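When re-running Tier 2 across several models, it can help to tally the recorded verdicts. The sketch below assumes the agentic field literally begins with PASS or FAIL, followed by a parenthesised date and a colon, as in the case template; the function names parse_agentic and tally are hypothetical and not part of the repo.

```python
import json
import re
from collections import Counter

# Assumed shape: "PASS (date): navigation path" or "FAIL (date): navigation path".
AGENTIC_RE = re.compile(r"^(PASS|FAIL)\s*\(([^)]*)\)\s*:\s*(.*)$", re.DOTALL)


def parse_agentic(field):
    """Split an agentic field into verdict, date, and path; None if it does not match."""
    match = AGENTIC_RE.match(field.strip())
    if match is None:
        return None
    verdict, date, path = match.groups()
    return {"verdict": verdict, "date": date.strip(), "path": path.strip()}


def tally(jsonl_text):
    """Count PASS / FAIL / UNPARSED verdicts over the text of a cases.jsonl file."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        parsed = parse_agentic(case.get("agentic", ""))
        counts[parsed["verdict"] if parsed else "UNPARSED"] += 1
    return counts
```

Cases counted as UNPARSED are ones with no recorded agentic run, or with a field that does not follow the assumed shape.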
Adding a case
Append one JSON object per line to cases.jsonl (the example is wrapped here for readability; in the file each case must occupy a single line):
{"id": "kebab-id", "prompt": "the user's situation, verbatim-ish",
"expect_files": ["references/training/<file>.md"], "expect_ids": ["O7"],
"expect_grep": ["lr finder"], "must_cover": "the key points a correct answer must hit",
"agentic": "PASS/FAIL (date): the navigation path observed"}
Use expect_ids for the training catalogs (they have ### O7 / DP1 / M17 … headers) and
expect_grep for platform profiles (which are section-structured). Then run python evals/run_evals.py to confirm the new case passes Tier 1.
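Before appending, a new case can be sanity-checked with a small validator. This is a sketch under stated assumptions, not something the repo is known to ship: the set of required keys, the rule that at least one of expect_ids or expect_grep must be non-empty, and the name validate_case_line are all illustrative.

```python
import json

# Assumption: these four keys are mandatory; the repo may be more lenient.
REQUIRED_KEYS = {"id", "prompt", "expect_files", "must_cover"}
LIST_KEYS = ("expect_files", "expect_ids", "expect_grep")


def validate_case_line(line, existing_ids=()):
    """Return a list of problems with one candidate cases.jsonl line (empty = OK)."""
    problems = []
    if "\n" in line.strip():
        # JSONL needs one object per physical line, even though json.loads
        # would happily parse a wrapped object.
        problems.append("case must occupy a single line")
    try:
        case = json.loads(line)
    except json.JSONDecodeError as exc:
        return problems + [f"invalid JSON: {exc.msg}"]
    if not isinstance(case, dict):
        return problems + ["case must be a JSON object"]

    for key in sorted(REQUIRED_KEYS - case.keys()):
        problems.append(f"missing key: {key}")
    for key in LIST_KEYS:
        if key in case and not isinstance(case[key], list):
            problems.append(f"{key} must be a list")
    if not case.get("expect_ids") and not case.get("expect_grep"):
        problems.append("need expect_ids or expect_grep")
    if case.get("id") in existing_ids:
        problems.append(f"duplicate id: {case['id']}")
    return problems
```

Passing the IDs already present in cases.jsonl as existing_ids catches an accidental reuse of a kebab-id.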
Verification status (important)
These evals test retrieval and routing inside the skill — not the truth of the platform facts
on a live instance. Only the AutoDL profile is battle-tested by the author; the other six platform
profiles are researched from official docs + community reports and not yet live-validated (see
the repo README's "Verification status" and references/self-improvement.md §5). A case passing
here means "the skill leads an agent to this documented answer," not "this answer was confirmed on
a rented box."