playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/profiles/_schema.md

5.0 KiB
Raw Blame History

Platform Profile Schema

Every profiles/<platform>.md describes ONE platform with the same 8 sections in the same order, so they are scannable and diffable. A profile owns all the slow-changing, per-platform substrate that the SKILL.md phases delegate to. It does not describe a specific job (that's the portable job request, below) and never repeats the universal gotchas (those live in references/gotchas_universal.md — link, don't restate).

Design rule borrowed from SkyPilot / dstack / Ray: hardware is a CONSTRAINT, not a SKU. A job asks for gpu: A100:8; the profile owns how that maps to this platform's instance types. Secrets are referenced by env-var NAME or file path only — never inline a key.


Required structure of profiles/<platform>.md

Start each profile with a compact frontmatter block (the machine-readable facts), then the 8 prose sections.

---
platform: <name>            # e.g. runpod
kind: ssh-rental            # ssh-rental | cloud-api | kubernetes | slurm
meter_stop_verb: terminate  # the action that STOPS billing (stop | terminate | destroy | release | 关机 | manual)
meter_stop_irreversible: true
detach_primitive: tmux      # tmux | sbatch | k8s-job | nohup | kaggle-commit
spot_available: true
spot_grace: ~5s             # SIGTERM→SIGKILL window, or n/a
shared_fs: false            # is there a cross-instance shared filesystem?
inode_cap: none             # ~200K | none | host-dependent
free_egress: true           # download/upload to the wire free?
china_mirror_needed: false  # does it sit behind the GFW?
host_driver_cuda_max: "12.x"
local_nvme: true
---

1. LAUNCH

Entry points (web console / CLI / REST API / SSH), the canonical create command, and the env contract — what IS the Python env (prebuilt base? a Docker image you choose? Lambda Stack?). State the rule "the image/base IS the env — do not conda create on a rental" if it applies.

2. STORAGE MODEL (the survival matrix — principle #4)

List every storage tier with its path, speed, and size/inode cap. Then a survival matrix:

Tier Path Survives STOP? Survives DESTROY? Cap

State region/DC-lock for any shared/network volume. Name the mount checkpoints MUST go to for the teardown verb in §5.

3. NETWORK

Egress/proxy story, China-mirror relevance (link references/china-network.md if applicable), how ports/services are exposed (TB/Jupyter), and the SSH flavor(s) — note if proxied/basic SSH cannot scp/rsync (then direct-TCP is required) and whether ports change on restart.

4. SPOT / INTERRUPTION + RESUME (principle #7/#8)

The interruption model (spot bid? capacity? auto-shutdown clock? auto-release?), the detection signal + grace window, and the resume hook. Link references/spot-resilience.md for the cadence formula.

5. TEARDOWN / BILLING (principle #9 + the Iron Law)

Exactly what stops the meter (stop vs terminate vs destroy vs 关机), what each preserves, what is irreversible, and the cost trap (e.g. "stop still bills storage 2×"). This is the most error-prone section — be precise.

6. DAEMON TOOL

The detach primitive (tmux / sbatch / Job manifest / commit), whether it survives an instance restart (not just an SSH drop), and any native queue/scheduler. Note if tmux must be apt install-ed or is absent (use nohup … </dev/null >log 2>&1 &).

7. TOP GOTCHAS (48, platform-pinned)

Only the platform-specific ones, Symptom → Root cause → Fix. Universal gotchas are referenced, not repeated. Give each a stable local id (e.g. RP1, VAST2).

8. SCRIPT OVERRIDES

The exact values to parameterize the scripts/ templates for this platform: DATA_DIR= (fast scratch) · DURABLE_DIR= (survives teardown) · PROXY_HOOK= · CRED_FILE= (file path; "" if the key is an env var/secret) · SCRATCH= (what to prune) · HF_HOME= · DETACH=. The templates read exactly these env-var names. Two further knobs derive rather than being set per platform: RUN_ONE (the queue runner's path to run_one.sh) defaults to $DURABLE_DIR/run_one.sh, and PROJECT_REPO_DIR (where this run's code lives) is a per-run value — see "Portable job request" below; set either explicitly only if your layout differs.


Portable job request (NOT in the profile — keep it per-run)

A job is described separately so the same job runs against any profile. Document it in references/parallel_ablation.md; the shape:

resources:
  gpu: {name: A100, count: 8, memory: 40GB+}   # a CONSTRAINT (ranges ok), never a platform SKU
  disk: 200GB
candidates: [autodl, china, runpod]            # ordered fallback → "describe once, run anywhere"
file_mounts: {/data: {source: ..., mode: MOUNT_CACHED}}   # MOUNT | COPY | MOUNT_CACHED
run: "bash run_queue.sh queue.txt"

The launcher resolves a job against a profile; the profile supplies paths/verbs, the job supplies the work. Keeping them separate is what makes a profile reusable across every job.