# Platform Profile Schema Every `profiles/.md` describes ONE platform with the **same 8 sections in the same order**, so they are scannable and diffable. A profile owns all the *slow-changing, per-platform* substrate that the SKILL.md phases delegate to. It does **not** describe a specific job (that's the portable job request, below) and never repeats the universal gotchas (those live in `references/gotchas_universal.md` — link, don't restate). Design rule borrowed from SkyPilot / dstack / Ray: **hardware is a CONSTRAINT, not a SKU.** A job asks for `gpu: A100:8`; the profile owns how that maps to this platform's instance types. **Secrets are referenced by env-var NAME or file path only — never inline a key**. --- ## Required structure of `profiles/.md` Start each profile with a compact frontmatter block (the machine-readable facts), then the 8 prose sections. ```yaml --- platform: # e.g. runpod kind: ssh-rental # ssh-rental | cloud-api | kubernetes | slurm meter_stop_verb: terminate # the action that STOPS billing (stop | terminate | destroy | release | 关机 | manual) meter_stop_irreversible: true detach_primitive: tmux # tmux | sbatch | k8s-job | nohup | kaggle-commit spot_available: true spot_grace: ~5s # SIGTERM→SIGKILL window, or n/a shared_fs: false # is there a cross-instance shared filesystem? inode_cap: none # ~200K | none | host-dependent free_egress: true # download/upload to the wire free? china_mirror_needed: false # does it sit behind the GFW? host_driver_cuda_max: "12.x" local_nvme: true --- ``` ### 1. LAUNCH Entry points (web console / CLI / REST API / SSH), the canonical create command, and the **env contract** — what IS the Python env (prebuilt base? a Docker image you choose? Lambda Stack?). State the rule "the image/base IS the env — do not `conda create` on a rental" if it applies. ### 2. STORAGE MODEL *(the survival matrix — principle #4)* List every storage tier with its path, speed, and size/inode cap. Then a **survival matrix**: | Tier | Path | Survives STOP? | Survives DESTROY? | Cap | |---|---|---|---|---| State region/DC-lock for any shared/network volume. Name the mount checkpoints MUST go to for the teardown verb in §5. ### 3. NETWORK Egress/proxy story, China-mirror relevance (link `references/china-network.md` if applicable), how ports/services are exposed (TB/Jupyter), and the **SSH flavor(s)** — note if proxied/basic SSH cannot `scp`/`rsync` (then direct-TCP is required) and whether ports change on restart. ### 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)* The interruption model (spot bid? capacity? auto-shutdown clock? auto-release?), the **detection signal + grace window**, and the resume hook. Link `references/spot-resilience.md` for the cadence formula. ### 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)* Exactly **what stops the meter** (stop vs terminate vs destroy vs 关机), what each preserves, what is **irreversible**, and the cost trap (e.g. "stop still bills storage 2×"). This is the most error-prone section — be precise. ### 6. DAEMON TOOL The detach primitive (`tmux` / `sbatch` / Job manifest / commit), whether it survives an instance restart (not just an SSH drop), and any native queue/scheduler. Note if `tmux` must be `apt install`-ed or is absent (use `nohup … log 2>&1 &`). ### 7. TOP GOTCHAS (4–8, platform-pinned) Only the *platform-specific* ones, Symptom → Root cause → Fix. Universal gotchas are referenced, not repeated. Give each a stable local id (e.g. `RP1`, `VAST2`). ### 8. SCRIPT OVERRIDES The exact values to parameterize the `scripts/` templates for this platform: `DATA_DIR=` (fast scratch) · `DURABLE_DIR=` (survives teardown) · `PROXY_HOOK=` · `CRED_FILE=` (file path; `""` if the key is an env var/secret) · `SCRATCH=` (what to prune) · `HF_HOME=` · `DETACH=`. The templates read exactly these env-var names. Two further knobs *derive* rather than being set per platform: `RUN_ONE` (the queue runner's path to `run_one.sh`) defaults to `$DURABLE_DIR/run_one.sh`, and `PROJECT_REPO_DIR` (where *this run's* code lives) is a per-run value — see "Portable job request" below; set either explicitly only if your layout differs. --- ## Portable job request (NOT in the profile — keep it per-run) A job is described separately so the *same* job runs against any profile. Document it in `references/parallel_ablation.md`; the shape: ```yaml resources: gpu: {name: A100, count: 8, memory: 40GB+} # a CONSTRAINT (ranges ok), never a platform SKU disk: 200GB candidates: [autodl, china, runpod] # ordered fallback → "describe once, run anywhere" file_mounts: {/data: {source: ..., mode: MOUNT_CACHED}} # MOUNT | COPY | MOUNT_CACHED run: "bash run_queue.sh queue.txt" ``` The launcher resolves a job against a profile; the profile supplies paths/verbs, the job supplies the work. Keeping them separate is what makes a profile reusable across every job.