# Lifecycle Checklist — the 6-phase runbook as a per-platform checklist Purpose: a platform-parameterized, copy-pasteable checkbox runbook for one remote-GPU job, Phase 0 (environment audit) through Phase 5 (aggregate + verify + teardown). Substrate is delegated to **your platform profile** (`profiles/.md`, 8-field schema in `profiles/_schema.md`) — this file never hardcodes a mount, verb, or proxy. Each phase ends in the runnable check from `SKILL.md`. `grep -in references/lifecycle_checklist.md` to jump. ## Table of contents - Phase −1 — one-time setup (skip if reused) - Phase 0 — environment audit - Phase 1 — SSH + credentials - Phase 2 — wrapper + CPU-smoke gate - Phase 3 — detached launch - Phase 4 — durable monitoring - Phase 5 — aggregate + verify + teardown - Cost-saving teardown table - Failure handling (inline, any phase) > **How to use:** pick the profile FIRST (`SKILL.md` → "Pick your platform profile"). Wherever a step > says *profile data mount* / *profile durable mount* / *profile meter-stop verb* / *profile detach > primitive*, read the literal value out of that profile's STORAGE / TEARDOWN / DAEMON sections. Skip any > phase already done. Universal gotchas referenced by id live in `references/gotchas_universal.md` — not > restated here. --- ## Phase −1 — One-time setup *(skip if reused from a past project)* - [ ] **First contact — surface the profile's "Surface to the user" block** to the user: the platform's non-obvious **conveniences** (one-click SSH-key registration, GPU-availability notifications, built-in panels) and its **danger clocks** (auto-release/auto-delete timers, stop-still-bills, low-balance purge). Don't assume they know — the danger clocks cost data or money (principle #10). - [ ] Local SSH keypair exists (`~/.ssh/id_ed25519`); public key registered with the platform. - [ ] `~/.ssh/config` alias per instance, with keepalive (`ServerAliveInterval`/`ServerAliveCountMax`) — see `references/ssh_transport.md`. - [ ] Durable storage provisioned per profile (shared FS / network volume / persistent disk), sized ≥ `ckpt_size × N + buffer`. - [ ] Reusable image/snapshot saved with the project's env + code, IF the profile supports it (saves repeated cold-build time × N instances). - [ ] `.gitattributes` sets `*.sh text eol=lf` so Windows-authored scripts don't ship CRLF (gotcha U26). > **verify:** `ssh 'echo reachable'` returns `reachable` and the durable mount appears in the profile's survival matrix. --- ## Phase 0 — Environment audit - [ ] Read the profile's STORAGE survival-matrix: know which tier survives *stop* vs *destroy* (principle #4). Write checkpoints to the tier that survives the intended *profile meter-stop verb*. - [ ] Read the profile's region/DC-lock: if a shared/network volume is region-scoped, confirm all instances share that region BEFORE launch. - [ ] Measure live, do not assume: `df -h && df -i ` (inodes die before bytes — principle #5), cgroup `cat /sys/fs/cgroup/memory.max`, `nvidia-smi`. - [ ] `du -sh` the real space hogs on the actual mount (symlinked caches hide — principle #5), not the dir assumed. - [ ] Pre-compute the checkpoint disk budget: `ckpt_size × N + scratch`; confirm it fits the *profile data mount* cap. > **verify:** `nvidia-smi` shows the expected GPU and `df -i ` is not near 100%. --- ## Phase 1 — SSH + credentials - [ ] Set the SSH alias/env per the profile's NETWORK section; note the SSH flavor (a proxied/basic SSH may not `scp`/`rsync` — direct-TCP required; ports may change on restart). - [ ] Use the prebuilt image/base AS the env — do NOT `conda create` on a rental (throwaway-instance exception: the image IS the env). - [ ] Push secrets via **stdin, never onto a shared/durable FS** (a shared FS is multi-project) — pattern in `references/ssh_transport.md`. Reference creds by env-var NAME / file path, never inline a key. - [ ] If the profile sits behind the GFW, wire the China-mirror endpoint now (`references/china-network.md`); validate the speed test on the SAME route the real transfer uses (principle #7). > **verify:** `ssh 'python -c "import torch;print(torch.cuda.is_available())"'` prints `True`. --- ## Phase 2 — Wrapper + CPU-smoke gate - [ ] Build an idempotent `run_one` / `run_queue` from `scripts/` (`run_one.sh.template`, `run_queue.sh.template`), parameterized from the profile's SCRIPT OVERRIDES (`DATA_DIR=`, `DURABLE_DIR=`, `PROXY_HOOK=`, `CRED_FILE=`, `SCRATCH=`, `HF_HOME=`, `DETACH=`). - [ ] Wrappers are resumable: load-latest-on-startup unconditionally so the identical launch command resumes, not restarts (principle #8). - [ ] Build the per-cell queue/config files with one isolated write path per cell (no shared mutable output — parallel ablation needs this; `references/parallel_ablation.md`). - [ ] **Run the cheap CPU smoke LOCALLY, BEFORE renting** — 1–2 batches, logger disabled, tiny shapes; it kills import/config/shape/scale bugs for ~free (principle #2). Smoke *content* → **verifying-dl-experiments** (REQUIRED). > **verify:** smoke exits 0 on 2 batches with the logger disabled, no Traceback. --- ## Phase 3 — Detached launch - [ ] Launch via the *profile detach primitive* (tmux / `sbatch` / k8s Job / commit) — survives an SSH drop; confirm whether it also survives an instance restart (profile DAEMON section). - [ ] Push code/data with a resumable transfer (`rsync --partial` or `timeout`+resume loop — principle #7); never edit a script under a live run — version filenames (principle #6). - [ ] Probe briefly: log head + process alive + no traceback, then **hand control back**. Never a blocking foreground `sleep` (foreground Bash hard-caps at 600 s). > **verify:** within 60 s, the detach session is alive and the first log line shows the expected step/epoch. --- ## Phase 4 — Durable monitoring - [ ] For anything over ~1–2 h, deploy the **four-layer architecture** (`references/monitoring_patterns.md`): on-box self-completion chain + session patrol loop + event sentinels + recovery handbook. A session-bound watcher alone dies with the session (principle #3). - [ ] Use `run_in_background` (no duration cap, notifies on exit; a Claude Code primitive — other hosts map per `monitoring_patterns.md` §7) for long waits; never foreground-poll. NEVER an unquoted `|` inside a poll-regex — it reads stdin and hangs forever. - [ ] Watch `df -i` trend (not just `df -h`), cgroup memory %, new FINISHED/ERROR/Traceback markers, and fast-finish (< ~50% expected duration → probable failure). - [ ] Reconcile each watcher against the job's REAL process/artifact (`tmux ls`/`squeue`/`pgrep` + output `mtime`) — a watcher's own state is a claim, not ground truth (principle #3). Tear a watcher down when its job is superseded. - [ ] Classify each failure → its fixed remediation (see Failure handling below); **never blind-retry**. > **verify:** the patrol reports a status line even when nothing changed (proves it's alive, not silently dead). --- ## Phase 5 — Aggregate + verify + teardown - [ ] Run the aggregation step (`scripts/aggregate_to_fs.sh`, idempotent — safe to re-run) to checked-sync results to the *profile durable mount*. **Gate the success line on the actual copy result** — `cp …; echo synced` lies on a full/inode-exhausted FS (principle #3, gotcha U33). - [ ] Confirm the durable mount has the expected artifact count: `ssh 'ls /final_ckpts/ | wc -l'`. - [ ] Pull results to local (resumable per-dir scp/rsync loop); HARD-sanitize the local target to `/path/to/local` — never a real personal path. - [ ] **Load-and-verify each artifact** before teardown: `python scripts/verify_local.py /path/to/local/final_ckpts/` → expect `OK N/N, errors 0`. Re-pull + re-verify any error. - [ ] Record disclosable run facts for the paper: CLI overrides, tracker summary URL (a hosted tracker survives teardown — **huggingface-skills:huggingface-trackio**). Transport verbs (`hf download --resume`, `hf upload-large-folder`) → **huggingface-skills:hf-cli**. - [ ] ONLY THEN perform the *profile meter-stop verb*, AFTER explicit user approval of the specific cost-affecting action. > **verify:** `verify_local.py` reports 100% OK *before* any teardown. > **Iron Law — teardown gate:** NO `stop` / `release` / `terminate` / `destroy` / file-delete until > checkpoints are **pulled to local AND verified by load**, AND the user has explicitly approved the > cost-affecting action. "It looked done in the log" is not evidence (principle #3). On most platforms the > meter-stopping verb is **irreversible** (deletes the disk) — confirmation matters *more*, not less. The > general form is **superpowers:verification-before-completion** (REQUIRED). --- ## Cost-saving teardown table The verb that stops the meter, what each preserves, and irreversibility — bind the platform-specific verb from the profile's TEARDOWN section. **The biggest portability trap: "stop" rarely stops the meter, and the action that does is usually irreversible** (principle #4/#9). | Action | Stops GPU meter? | What survives | Reversible? | |---|---|---|---| | **stop / 关机 (power-off)** | Sometimes — depends on profile (AutoDL: yes, keeps disk; RunPod/vast: still bills storage 1–2×) | Disk tier per profile survival-matrix | Yes — instance restartable | | **release idle instance** | Yes | Only the durable/shared mount (data disk gone) | No — instance + container disk destroyed | | **terminate** | Yes | Only a network/persistent volume, if one was mounted | **No — irreversible**, disk deleted | | **destroy** | Yes | Nothing on the box | **No — irreversible**, total loss | | **delete durable files (keep subscription)** | Storage trickle only | Subscription survives for new data | No — those files gone | | **cancel durable storage subscription** | Storage cost only | Nothing | **No — irreversible**, all durable data lost | **Default conservative plan:** stop/release the GPU instance first (immediate $ saving, low risk once artifacts are verified-local). Keep durable storage 1–3 months until the paper is submitted. Cancel the durable subscription LAST, only after the local copy is verified and the user approves. --- ## Failure handling *(inline, any phase)* Categorize before reacting; retry the **identical** config — hand-patching one run destroys comparability (principle #7; **verifying-dl-experiments** owns is-it-a-bug-or-real). - [ ] **Probabilistic** (epoch-1 stall, transient `wandb.init` blip, spot preemption): queue a retry with the SAME config, no safeguards. Resume works because of checkpoint-load (principle #8). - [ ] **Disk-full** (exit 1 + `iostream` / "No space left"): prune the *profile scratch* (`SCRATCH=` — periodic checkpoints, unused caches), keep `best`; if cleanup can't free enough, **ask to expand the disk**, never silently shrink the experiment (principle #9). Then retry. - [ ] **Real bug** (CUDA OOM, code error, all-zero metric): stop, investigate code — do NOT retry blindly. > Symptom → root cause → fix for each, plus the full catalog: `references/gotchas_universal.md` > (`grep -in references/gotchas_universal.md` to jump).