# Operating Principles — the 10 invariants, expanded

These are the *why* behind every phase and gotcha. They hold on any **metered, isolated, rented GPU** — AutoDL, RunPod, vast.ai, Lambda, Paperspace, a Chinese platform, a bare SSH box, Slurm, or K8s. Only the concrete paths/CLI change (those live in `profiles/.md`). Internalize these; the recipes follow.

The one-line form is in `SKILL.md`; this file carries the cross-platform nuance. To jump: `grep -n '^## ' references/principles.md`.

---

## 1. Minimize paid wall-clock

The meter runs the *entire* time the box is up, not just while the GPU computes. Three consequences: smoke-test correctness **locally on CPU before renting** (principle #2); **launch detached and hand control back** rather than babysitting a blocking `sleep`; and **release the instant verification passes** (principle #9 governs the *who-decides*). Every idle paid minute — a stuck download, a forgotten box overnight, a human-in-the-loop pause on a live instance — is money.

*Universal.* Even on Slurm where the "meter" is walltime/fairshare quota rather than dollars, the same discipline applies: don't hold an allocation idle.

---

## 2. Cheap checks before expensive compute

A CPU smoke (1–2 batches, logger disabled, tiny shapes) kills import errors, config drift, tensor-shape and measurement-**scale** bugs for ~free, **before** they bill GPU-hours. It is *necessary, not sufficient* — it won't catch convergence — but it catches the dumb-and-expensive failures that otherwise only surface after an instance spins up.

*Boundary:* this skill owns *when* to run the smoke (the pre-rent gate). The smoke's *content* — what to assert, how to shrink the problem — belongs to **`verifying-dl-experiments`**. Don't duplicate it here.

---

## 3. Trust artifacts you loaded, not log lines that claim success

"synced / saved / done / 100% complete" is a **claim**, and claims lie under a silently-failed write — a full disk, exhausted inodes, a swallowed error, a half-uploaded blob. Confirm the file **exists and loads** before releasing the only copy.

**A watcher's own state is also a claim**, not ground truth. An async condition-waiter whose job you superseded polls a marker that will never arrive (a zombie that loops forever). A session-scoped monitor dies on context reset while the job runs on. Reconcile watchers against the job's *real* process and artifacts (`tmux ls` / `squeue` / `pgrep`, output `mtime`, a load-test), tear a watcher down when you supersede its job, and match a watcher's lifetime to the wait's duration.

> **Monitoring physics this rests on:** foreground Bash hard-caps at 600 s (a long foreground wait is
> killed at 10 min); `run_in_background` has **no** cap and notifies on exit; a never-*exiting* watcher
> never notifies; an unquoted `|` inside a poll regex splits into piped commands and the first reads
> stdin → hangs forever. See `references/monitoring_patterns.md`.

*Universal — the load-bearing spine.* It is the platform instance of `superpowers:verification-before-completion`'s Iron Law ("no completion claim without fresh verification evidence"). Shared with `verifying-dl-experiments`.

---

## 4. Know what survives stop vs destroy

**The single biggest portability trap.** AutoDL persists `/root` across a power-off — so the AutoDL habit is "just power off (关机), my data's fine." That assumption is **false almost everywhere else**:

- **RunPod** wipes the *container disk* on stop; only the *volume disk* (`/workspace`) survives a stop, and only a **Network Volume** survives a terminate.
- **vast.ai** keeps disk across a stop but **bills it forever**; a destroy loses everything.
- **K8s** wipes the pod filesystem on every reschedule unless a PVC is mounted.
- **Colab** loses `/content` and RAM on disconnect.

So the principle is not a path — it's a **discipline**: for each platform, before Phase 0, read the profile's STORAGE survival-matrix and write your checkpoints to the mount that survives the teardown verb you intend to use. The data you need most often lives on the *volatile* tier by default.

*Mixed:* the *rule* is universal; the *which-mount* value is a profile fact.

---

## 5. Storage fails on the dimension — and the location — you're not watching

Disk dies on **inodes before bytes** (`df -h` shows 34% while `cp` fails "No space left" because `df -i` is at 100% — classic on a shared FS full of many-small-files eval output). The real space hog often lives where you didn't look — a **symlinked cache** (`~/.cache/huggingface` mapped onto the data disk) can outweigh the `runs/` you created. **Audit with `du` on the actual mount, not assumptions.**

Clean by **value**: keep the tiny irreplaceable evidence (metric/eval JSONs), discard the large reproducible scratch (periodic checkpoints, unused model caches — one observed sweep left **179 GB** of superseded `latest.pt`/`epoch_*.pt` while the real evidence was **<200 MB** of JSON). Pre-compute the budget; monitor `df -i`, not just `df -h`.

*Mixed:* the inode-cap *number* is a profile fact (AutoDL/China enforce ~200K; RunPod/vast/Lambda spec GB quotas with no documented inode cap). The "audit the real mount, clean by value" discipline is core. The general form of the many-small-files trap is **shard into tar** (WebDataset) — see `references/gotchas_universal.md` U25.

---

## 6. Never mutate inputs under a live run

A running job holds its scripts **in memory by byte-offset**. tmux keeps `run_queue.sh` as-loaded; bash reads a script by seeking to a saved offset, so `scp`-ing a new version mid-run makes bash land in the middle of a *different* file and re-execute blocks (duplicate runs, stalled queues). Version filenames; edit only when nothing is reading them (`pgrep -af