playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/profiles/_schema.md

101 lines
5.0 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Platform Profile Schema
Every `profiles/<platform>.md` describes ONE platform with the **same 8 sections in the same order**, so
they are scannable and diffable. A profile owns all the *slow-changing, per-platform* substrate that the
SKILL.md phases delegate to. It does **not** describe a specific job (that's the portable job request,
below) and never repeats the universal gotchas (those live in `references/gotchas_universal.md` — link,
don't restate).
Design rule borrowed from SkyPilot / dstack / Ray: **hardware is a CONSTRAINT, not a SKU.** A job asks
for `gpu: A100:8`; the profile owns how that maps to this platform's instance types. **Secrets are
referenced by env-var NAME or file path only — never inline a key**.
---
## Required structure of `profiles/<platform>.md`
Start each profile with a compact frontmatter block (the machine-readable facts), then the 8 prose
sections.
```yaml
---
platform: <name> # e.g. runpod
kind: ssh-rental # ssh-rental | cloud-api | kubernetes | slurm
meter_stop_verb: terminate # the action that STOPS billing (stop | terminate | destroy | release | 关机 | manual)
meter_stop_irreversible: true
detach_primitive: tmux # tmux | sbatch | k8s-job | nohup | kaggle-commit
spot_available: true
spot_grace: ~5s # SIGTERM→SIGKILL window, or n/a
shared_fs: false # is there a cross-instance shared filesystem?
inode_cap: none # ~200K | none | host-dependent
free_egress: true # download/upload to the wire free?
china_mirror_needed: false # does it sit behind the GFW?
host_driver_cuda_max: "12.x"
local_nvme: true
---
```
### 1. LAUNCH
Entry points (web console / CLI / REST API / SSH), the canonical create command, and the **env
contract** — what IS the Python env (prebuilt base? a Docker image you choose? Lambda Stack?). State the
rule "the image/base IS the env — do not `conda create` on a rental" if it applies.
### 2. STORAGE MODEL *(the survival matrix — principle #4)*
List every storage tier with its path, speed, and size/inode cap. Then a **survival matrix**:
| Tier | Path | Survives STOP? | Survives DESTROY? | Cap |
|---|---|---|---|---|
State region/DC-lock for any shared/network volume. Name the mount checkpoints MUST go to for the
teardown verb in §5.
### 3. NETWORK
Egress/proxy story, China-mirror relevance (link `references/china-network.md` if applicable), how
ports/services are exposed (TB/Jupyter), and the **SSH flavor(s)** — note if proxied/basic SSH cannot
`scp`/`rsync` (then direct-TCP is required) and whether ports change on restart.
### 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
The interruption model (spot bid? capacity? auto-shutdown clock? auto-release?), the **detection signal +
grace window**, and the resume hook. Link `references/spot-resilience.md` for the cadence formula.
### 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*
Exactly **what stops the meter** (stop vs terminate vs destroy vs 关机), what each preserves, what is
**irreversible**, and the cost trap (e.g. "stop still bills storage 2×"). This is the most error-prone
section — be precise.
### 6. DAEMON TOOL
The detach primitive (`tmux` / `sbatch` / Job manifest / commit), whether it survives an instance restart
(not just an SSH drop), and any native queue/scheduler. Note if `tmux` must be `apt install`-ed or is
absent (use `nohup … </dev/null >log 2>&1 &`).
### 7. TOP GOTCHAS (48, platform-pinned)
Only the *platform-specific* ones, Symptom → Root cause → Fix. Universal gotchas are referenced, not
repeated. Give each a stable local id (e.g. `RP1`, `VAST2`).
### 8. SCRIPT OVERRIDES
The exact values to parameterize the `scripts/` templates for this platform:
`DATA_DIR=` (fast scratch) · `DURABLE_DIR=` (survives teardown) · `PROXY_HOOK=` · `CRED_FILE=` (file path; `""` if the key is an env var/secret) · `SCRATCH=` (what to prune) · `HF_HOME=` · `DETACH=`.
The templates read exactly these env-var names. Two further knobs *derive* rather than being set per
platform: `RUN_ONE` (the queue runner's path to `run_one.sh`) defaults to `$DURABLE_DIR/run_one.sh`, and
`PROJECT_REPO_DIR` (where *this run's* code lives) is a per-run value — see "Portable job request" below;
set either explicitly only if your layout differs.
---
## Portable job request (NOT in the profile — keep it per-run)
A job is described separately so the *same* job runs against any profile. Document it in
`references/parallel_ablation.md`; the shape:
```yaml
resources:
gpu: {name: A100, count: 8, memory: 40GB+} # a CONSTRAINT (ranges ok), never a platform SKU
disk: 200GB
candidates: [autodl, china, runpod] # ordered fallback → "describe once, run anywhere"
file_mounts: {/data: {source: ..., mode: MOUNT_CACHED}} # MOUNT | COPY | MOUNT_CACHED
run: "bash run_queue.sh queue.txt"
```
The launcher resolves a job against a profile; the profile supplies paths/verbs, the job supplies
the work. Keeping them separate is what makes a profile reusable across every job.