# Platform Profile Schema

Every `profiles/<platform>.md` describes ONE platform with the **same 8 sections in the same order**, so
profiles are scannable and diffable. A profile owns all the *slow-changing, per-platform* substrate that the
SKILL.md phases delegate to. It does **not** describe a specific job (that's the portable job request,
below) and never repeats the universal gotchas (those live in `references/gotchas_universal.md`: link,
don't restate).

Design rule borrowed from SkyPilot / dstack / Ray: **hardware is a CONSTRAINT, not a SKU.** A job asks
for `gpu: A100:8`; the profile owns how that maps to this platform's instance types. **Secrets are
referenced by env-var NAME or file path only; never inline a key.**

---

## Required structure of `profiles/<platform>.md`

Start each profile with a compact frontmatter block (the machine-readable facts), then the 8 prose
sections.

```yaml
---
platform: <name> # e.g. runpod
kind: ssh-rental # ssh-rental | cloud-api | kubernetes | slurm
meter_stop_verb: terminate # the action that STOPS billing (stop | terminate | destroy | release | 关机 | manual)
meter_stop_irreversible: true
detach_primitive: tmux # tmux | sbatch | k8s-job | nohup | kaggle-commit
spot_available: true
spot_grace: ~5s # SIGTERM→SIGKILL window, or n/a
shared_fs: false # is there a cross-instance shared filesystem?
inode_cap: none # ~200K | none | host-dependent
free_egress: true # download/upload to the wire free?
china_mirror_needed: false # does it sit behind the GFW?
host_driver_cuda_max: "12.x"
local_nvme: true
---
```

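
Because every profile carries the same frontmatter keys, a launcher could check them mechanically before trusting a profile. The following is a minimal sketch, not part of the schema itself: the function names `parse_frontmatter` and `validate` are hypothetical, the parser only handles the flat `key: value # comment` lines shown above (a real tool would more likely use a YAML library), and only the two fields whose allowed values are enumerated above are checked against a closed set.

```python
# Hypothetical validator for the frontmatter block (all names are illustrative).
ALLOWED = {
    "kind": {"ssh-rental", "cloud-api", "kubernetes", "slurm"},
    "detach_primitive": {"tmux", "sbatch", "k8s-job", "nohup", "kaggle-commit"},
}
REQUIRED = [
    "platform", "kind", "meter_stop_verb", "meter_stop_irreversible",
    "detach_primitive", "spot_available", "spot_grace", "shared_fs",
    "inode_cap", "free_egress", "china_mirror_needed",
    "host_driver_cuda_max", "local_nvme",
]


def parse_frontmatter(text):
    """Parse flat `key: value  # comment` lines between the two `---` markers."""
    fields, inside = {}, False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "---":
            if inside:
                break          # closing marker: stop reading
            inside = True      # opening marker: start reading
            continue
        if not inside or not stripped or stripped.startswith("#"):
            continue
        key, _, rest = stripped.partition(":")
        # Drop a trailing comment, surrounding whitespace, and optional quotes.
        value = rest.split(" #", 1)[0].strip().strip('"')
        fields[key.strip()] = value
    return fields


def validate(fields):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing field: {k}" for k in REQUIRED if k not in fields]
    for key, allowed in ALLOWED.items():
        if key in fields and fields[key] not in allowed:
            problems.append(f"{key}={fields[key]!r} not in {sorted(allowed)}")
    return problems
```

Returning a list of problems rather than raising on the first one keeps the check usable in a linter that reports every issue in a profile at once.
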
### 1. LAUNCH
Entry points (web console / CLI / REST API / SSH), the canonical create command, and the **env
contract**: what IS the Python env (a prebuilt base? a Docker image you choose? Lambda Stack?). State the
rule "the image/base IS the env; do not `conda create` on a rental" if it applies.

### 2. STORAGE MODEL *(the survival matrix — principle #4)*
List every storage tier with its path, speed, and size/inode cap. Then a **survival matrix**:

| Tier | Path | Survives STOP? | Survives DESTROY? | Cap |
|---|---|---|---|---|

State the region/DC lock for any shared/network volume. Name the mount that checkpoints MUST go to for the
teardown verb in §5.

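
One way to read the survival matrix is as a lookup: given the teardown verb, which paths still hold data afterwards? The sketch below illustrates that reading under stated assumptions. The `Tier` class, the `checkpoint_mounts` helper, and the three example tiers (including their paths) are invented for illustration, and any verb other than `stop` is conservatively treated as destructive.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    path: str
    survives_stop: bool
    survives_destroy: bool


def checkpoint_mounts(tiers, teardown_verb):
    """Return the paths that still hold data after `teardown_verb`.

    Assumption: verbs other than "stop" (terminate / destroy / release)
    are treated as destructive, which is the conservative reading.
    """
    if teardown_verb == "stop":
        return [t.path for t in tiers if t.survives_stop]
    return [t.path for t in tiers if t.survives_destroy]


# Hypothetical platform with three tiers (paths are made up for illustration).
tiers = [
    Tier("container disk", "/root", survives_stop=False, survives_destroy=False),
    Tier("instance volume", "/workspace", survives_stop=True, survives_destroy=False),
    Tier("network volume", "/mnt/durable", survives_stop=True, survives_destroy=True),
]
```

With these example tiers, `checkpoint_mounts(tiers, "terminate")` leaves only `/mnt/durable`, which is the kind of answer this section of a profile should make obvious at a glance.
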
### 3. NETWORK
Egress/proxy story, China-mirror relevance (link `references/china-network.md` if applicable), how
ports/services are exposed (TensorBoard/Jupyter), and the **SSH flavor(s)**: note if proxied/basic SSH cannot
`scp`/`rsync` (then direct-TCP is required) and whether ports change on restart.

### 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
The interruption model (spot bid? capacity? auto-shutdown clock? auto-release?), the **detection signal +
grace window**, and the resume hook. Link `references/spot-resilience.md` for the cadence formula.

### 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*
Exactly **what stops the meter** (stop vs terminate vs destroy vs 关机, i.e. "shut down"), what each preserves, what is
**irreversible**, and the cost trap (e.g. "stop still bills storage 2×"). This is the most error-prone
section, so be precise.

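
The two frontmatter fields `meter_stop_verb` and `meter_stop_irreversible` are enough for a launcher to refuse an unsafe teardown. The following is a sketch of one possible guard, not a rule stated by this schema: `plan_teardown`, `checkpoints_synced`, and `confirmed` are hypothetical names, and the frontmatter is assumed to be already parsed into typed values (a real `bool`, not the string `"true"`).

```python
def plan_teardown(frontmatter, checkpoints_synced, confirmed=False):
    """Return the verb to run, or raise if tearing down now would be unsafe.

    `frontmatter` is a dict holding the profile's `meter_stop_verb` (str) and
    `meter_stop_irreversible` (bool). The other arguments are hypothetical
    inputs a launcher might collect before teardown.
    """
    verb = frontmatter["meter_stop_verb"]
    irreversible = bool(frontmatter["meter_stop_irreversible"])
    if verb == "manual":
        raise RuntimeError("meter must be stopped by hand on this platform")
    if irreversible and not checkpoints_synced:
        raise RuntimeError(f"refusing to {verb}: checkpoints not on durable storage")
    if irreversible and not confirmed:
        raise RuntimeError(f"{verb} is irreversible: explicit confirmation required")
    return verb
```

Raising instead of returning a status forces the caller to handle the unsafe cases explicitly, which suits a section described as the most error-prone.
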
### 6. DAEMON TOOL
The detach primitive (`tmux` / `sbatch` / Job manifest / commit), whether it survives an instance restart
(not just an SSH drop), and any native queue/scheduler. Note if `tmux` must be `apt install`-ed or is
absent (in which case use `nohup … </dev/null >log 2>&1 &`).

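
The tmux-or-nohup choice can be sketched as a small command builder. This is an illustration under assumptions: `detach_command` is a hypothetical helper, the `session` and `log` defaults are placeholders rather than names the templates use, and the fallback follows the `nohup` form quoted above.

```python
import shlex
import shutil


def detach_command(cmd, session="job", log="log", tmux_available=None):
    """Build a shell line that keeps `cmd` running after the SSH session drops.

    Prefers tmux; falls back to the nohup form when tmux is absent.
    `session` and `log` are illustrative defaults.
    """
    if tmux_available is None:
        # Probe the local PATH; pass the flag explicitly when building for a remote host.
        tmux_available = shutil.which("tmux") is not None
    if tmux_available:
        # -d: start detached, -s: session name; tmux runs the quoted string via a shell.
        return f"tmux new-session -d -s {shlex.quote(session)} {shlex.quote(cmd)}"
    # stdin from /dev/null, stdout and stderr to the log, run in the background.
    return f"nohup {cmd} </dev/null >{shlex.quote(log)} 2>&1 &"
```

Neither form outlives an instance restart on its own, so the profile still has to say what relaunches the job after a reboot.
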
### 7. TOP GOTCHAS (4–8, platform-pinned)
Only the *platform-specific* ones, each written as Symptom → Root cause → Fix. Universal gotchas are referenced, not
repeated. Give each a stable local id (e.g. `RP1`, `VAST2`).

### 8. SCRIPT OVERRIDES
The exact values to parameterize the `scripts/` templates for this platform:
`DATA_DIR=` (fast scratch) · `DURABLE_DIR=` (survives teardown) · `PROXY_HOOK=` · `CRED_FILE=` (file path; `""` if the key is an env var/secret) · `SCRATCH=` (what to prune) · `HF_HOME=` · `DETACH=`.
The templates read exactly these env-var names. Two further knobs are *derived* rather than set per
platform: `RUN_ONE` (the queue runner's path to `run_one.sh`) defaults to `$DURABLE_DIR/run_one.sh`, and
`PROJECT_REPO_DIR` (where *this run's* code lives) is a per-run value (see "Portable job request" below).
Set either explicitly only if your layout differs.

---

## Portable job request (NOT in the profile — keep it per-run)

A job is described separately so the *same* job runs against any profile. Document it in
`references/parallel_ablation.md`; the shape:

```yaml
resources:
  gpu: {name: A100, count: 8, memory: 40GB+} # a CONSTRAINT (ranges ok), never a platform SKU
  disk: 200GB
candidates: [autodl, china, runpod] # ordered fallback → "describe once, run anywhere"
file_mounts: {/data: {source: ..., mode: MOUNT_CACHED}} # MOUNT | COPY | MOUNT_CACHED
run: "bash run_queue.sh queue.txt"
```

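
As an illustration of treating `gpu` as a constraint rather than a SKU, the sketch below matches the constraint against a list of offers that a profile might map it to. Everything here is assumed for the example: the helper names, the offer dictionaries, and the instance-type names are invented, and only the `NNGB` / `NNGB+` memory forms are parsed.

```python
import re


def parse_memory(spec):
    """Parse '40GB' or '40GB+' into (gigabytes, is_minimum)."""
    match = re.fullmatch(r"(\d+)GB(\+?)", spec)
    if match is None:
        raise ValueError(f"unsupported memory spec: {spec!r}")
    return int(match.group(1)), match.group(2) == "+"


def satisfies(offer, gpu):
    """True if a platform offer meets the job's GPU constraint."""
    need_gb, is_minimum = parse_memory(gpu["memory"])
    if is_minimum:
        memory_ok = offer["memory_gb"] >= need_gb
    else:
        memory_ok = offer["memory_gb"] == need_gb
    return offer["gpu"] == gpu["name"] and offer["count"] >= gpu["count"] and memory_ok


def pick_instance(offers, gpu):
    """Return the first offer (in profile order) that satisfies the constraint, else None."""
    return next((o["instance_type"] for o in offers if satisfies(o, gpu)), None)


# Hypothetical offers a profile might list; instance names are invented.
offers = [
    {"instance_type": "a100-40g-x4", "gpu": "A100", "count": 4, "memory_gb": 40},
    {"instance_type": "a100-80g-x8", "gpu": "A100", "count": 8, "memory_gb": 80},
]
gpu = {"name": "A100", "count": 8, "memory": "40GB+"}
```
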
The launcher resolves a job against a profile; the profile supplies paths/verbs, the job supplies
the work. Keeping them separate is what makes a profile reusable across every job.
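
The resolution step can be sketched in a few lines. This is an assumption-laden illustration, not the launcher's actual code: `resolve` and its return shape are hypothetical, `profiles` stands in for the parsed SCRIPT OVERRIDES of each platform, and a real launcher would presumably also check capacity before settling on a candidate.

```python
def resolve(job, profiles, project_repo_dir):
    """Pick the first candidate with a known profile and build the script environment.

    `profiles` maps platform name -> dict of SCRIPT OVERRIDES values.
    The return shape is an assumption made for this sketch.
    """
    for name in job["candidates"]:
        overrides = profiles.get(name)
        if overrides is None:
            continue  # no profile for this candidate: try the next one
        env = dict(overrides)
        # RUN_ONE is derived from DURABLE_DIR unless the profile sets it explicitly.
        env.setdefault("RUN_ONE", f"{env['DURABLE_DIR']}/run_one.sh")
        # PROJECT_REPO_DIR is a per-run value and never comes from the profile.
        env["PROJECT_REPO_DIR"] = project_repo_dir
        return {"platform": name, "env": env, "run": job["run"]}
    raise LookupError(f"no profile found for any of {job['candidates']}")
```

The job dictionary never mentions a path or a teardown verb, and the profile dictionary never mentions the command to run, which is the separation this section argues for.
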