180 lines
10 KiB
Markdown
180 lines
10 KiB
Markdown
# Operating Principles — the 10 invariants, expanded
|
||
|
||
These are the *why* behind every phase and gotcha. They hold on any **metered, isolated, rented GPU**
|
||
— AutoDL, RunPod, vast.ai, Lambda, Paperspace, a Chinese platform, a bare SSH box, Slurm, or K8s. Only
|
||
the concrete paths/CLI change (those live in `profiles/<platform>.md`). Internalize these; the recipes
|
||
follow. The one-line form is in `SKILL.md`; this file carries the cross-platform nuance.
|
||
|
||
To jump: `grep -n '^## ' references/principles.md`.
|
||
|
||
---
|
||
|
||
## 1. Minimize paid wall-clock
|
||
|
||
The meter runs the *entire* time the box is up, not just while the GPU computes. Three consequences:
|
||
smoke-test correctness **locally on CPU before renting** (principle #2); **launch detached and hand
|
||
control back** rather than babysitting a blocking `sleep`; and **release the instant verification
|
||
passes** (principle #9 governs the *who-decides*). Every idle paid minute — a stuck download, a forgotten
|
||
box overnight, a human-in-the-loop pause on a live instance — is money.
|
||
|
||
*Universal.* Even on Slurm where the "meter" is walltime/fairshare quota rather than dollars, the same
|
||
discipline applies: don't hold an allocation idle.
|
||
|
||
---
|
||
|
||
## 2. Cheap checks before expensive compute
|
||
|
||
A CPU smoke (1–2 batches, logger disabled, tiny shapes) kills import errors, config drift, tensor-shape
|
||
and measurement-**scale** bugs for ~free, **before** they bill GPU-hours. It is *necessary, not
|
||
sufficient* — it won't catch convergence — but it catches the dumb-and-expensive failures that otherwise
|
||
only surface after an instance spins up.
|
||
|
||
*Boundary:* this skill owns *when* to run the smoke (the pre-rent gate). The smoke's *content* — what to
|
||
assert, how to shrink the problem — belongs to **`verifying-dl-experiments`**. Don't duplicate it here.
|
||
|
||
---
|
||
|
||
## 3. Trust artifacts you loaded, not log lines that claim success
|
||
|
||
"synced / saved / done / 100% complete" is a **claim**, and claims lie under a silently-failed write —
|
||
a full disk, exhausted inodes, a swallowed error, a half-uploaded blob. Confirm the file **exists and
|
||
loads** before releasing the only copy.
|
||
|
||
**A watcher's own state is also a claim**, not ground truth. An async condition-waiter whose job you
|
||
superseded polls a marker that will never arrive (a zombie that loops forever). A session-scoped monitor
|
||
dies on context reset while the job runs on. Reconcile watchers against the job's *real* process and
|
||
artifacts (`tmux ls` / `squeue` / `pgrep`, output `mtime`, a load-test), tear a watcher down when you
|
||
supersede its job, and match a watcher's lifetime to the wait's duration.
|
||
|
||
> **Monitoring physics this rests on:** foreground Bash hard-caps at 600 s (a long foreground wait is
|
||
> killed at 10 min); `run_in_background` has **no** cap and notifies on exit; a never-*exiting* watcher
|
||
> never notifies; an unquoted `|` inside a poll regex splits into piped commands and the first reads
|
||
> stdin → hangs forever. See `references/monitoring_patterns.md`.
|
||
|
||
*Universal — the load-bearing spine.* It is the platform instance of
|
||
`superpowers:verification-before-completion`'s Iron Law ("no completion claim without fresh verification
|
||
evidence"). Shared with `verifying-dl-experiments`.
|
||
|
||
---
|
||
|
||
## 4. Know what survives stop vs destroy
|
||
|
||
**The single biggest portability trap.** AutoDL persists `/root` across a power-off — so the AutoDL
|
||
habit is "just 关机, my data's fine." That assumption is **false almost everywhere else**:
|
||
|
||
- **RunPod** wipes the *container disk* on stop; only the *volume disk* (`/workspace`) survives a stop,
|
||
and only a **Network Volume** survives a terminate.
|
||
- **vast.ai** keeps disk across a stop but **bills it forever**; a destroy loses everything.
|
||
- **K8s** wipes the pod filesystem on every reschedule unless a PVC is mounted.
|
||
- **Colab** loses `/content` and RAM on disconnect.
|
||
|
||
So the principle is not a path — it's a **discipline**: for each platform, before Phase 0, read the
|
||
profile's STORAGE survival-matrix and write your checkpoints to the mount that survives the teardown verb
|
||
you intend to use. The data you need most often lives on the *volatile* tier by default.
|
||
|
||
*Mixed:* the *rule* is universal; the *which-mount* value is a profile fact.
|
||
|
||
---
|
||
|
||
## 5. Storage fails on the dimension — and the location — you're not watching
|
||
|
||
Disk dies on **inodes before bytes** (`df -h` shows 34% while `cp` fails "No space left" because `df -i`
|
||
is at 100% — classic on a shared FS full of many-small-files eval output). The real space hog often
|
||
lives where you didn't look — a **symlinked cache** (`~/.cache/huggingface` mapped onto the data disk)
|
||
can outweigh the `runs/` you created. **Audit with `du` on the actual mount, not assumptions.** Clean by
|
||
**value**: keep the tiny irreplaceable evidence (metric/eval JSONs), discard the large reproducible
|
||
scratch (periodic checkpoints, unused model caches — one observed sweep left **179 GB** of superseded `latest.pt`/`epoch_*.pt` while the real evidence was **<200 MB** of JSON). Pre-compute the budget; monitor `df -i`, not just
|
||
`df -h`.
|
||
|
||
*Mixed:* the inode-cap *number* is a profile fact (AutoDL/China enforce ~200K; RunPod/vast/Lambda spec
|
||
GB quotas with no documented inode cap). The "audit the real mount, clean by value" discipline is core.
|
||
The general form of the many-small-files trap is **shard into tar** (WebDataset) — see
|
||
`references/gotchas_universal.md` U25.
|
||
|
||
---
|
||
|
||
## 6. Never mutate inputs under a live run
|
||
|
||
A running job holds its scripts **in memory by byte-offset**. tmux keeps `run_queue.sh` as-loaded; bash
|
||
reads a script by seeking to a saved offset, so `scp`-ing a new version mid-run makes bash land in the
|
||
middle of a *different* file and re-execute blocks (duplicate runs, stalled queues). Version filenames;
|
||
edit only when nothing is reading them (`pgrep -af <script>` empty).
|
||
|
||
*Universal — pure bash/tmux physics.* Identical across every SSH backend.
|
||
|
||
---
|
||
|
||
## 7. Design for retry — failure is probabilistic, transfers are flaky, mirrors are route-specific
|
||
|
||
Some fraction of identical launches die (a network blip during `wandb.init`, a transient kernel fault, a
|
||
spot preemption). Wrappers must be **idempotent and resumable**; retry the **identical config** rather
|
||
than hand-patching one run (which destroys comparability — see `verifying-dl-experiments`).
|
||
|
||
**Bulk transfers are the prototypical flaky step:** wrap them in `timeout`+resume retry loops — a stall
|
||
≠ permanent failure, and resumable downloads accumulate progress across kills. An acceleration
|
||
**mirror/proxy/cache speeds ONE route, not all** — it may cover the metadata/API path while the bulk-data
|
||
path (a CDN/blob backend) still fails, and a *domestic* source routed through a *foreign*-acceleration
|
||
proxy is slower. Match the route to the origin; validate a speed test on the **same route** the real
|
||
transfer uses (a no-proxy probe of a proxied transfer measures nothing).
|
||
|
||
*Universal.* The **spot/preemption** sub-case is profile-parameterized (central on vast/RunPod; on
|
||
Lambda/Paperspace/China the interruption is auto-shutdown/auto-release/capacity instead) — see principle
|
||
#8 and `references/spot-resilience.md`.
|
||
|
||
---
|
||
|
||
## 8. Checkpoint-to-durable + idempotent resume is the universal spine
|
||
|
||
Detaching the job is necessary but not sufficient. The **one** mechanism that survives every failure
|
||
mode — SSH drop, Slurm walltime kill, K8s pod reschedule, spot preemption, Colab disconnect — is:
|
||
|
||
1. **Checkpoint full state to the platform's durable location** on a periodic timer (model + optimizer +
|
||
LR-scheduler + epoch/step + RNG + dataloader position), written **atomically** (`tmp`→`fsync`→
|
||
`os.rename`) so a mid-write kill never corrupts the latest good checkpoint.
|
||
2. **Load-latest-on-startup unconditionally**, so the *identical launch command* resumes instead of
|
||
restarting. This is what makes principle #7's "retry the identical config" actually resume progress.
|
||
|
||
The **detach primitive is the swappable plug** — tmux on a bare box, `sbatch --requeue` on Slurm, a Job
|
||
manifest on K8s, a Save&Run commit on Kaggle, a checkpoint-to-Drive loop on Colab. Checkpoint+resume is
|
||
the invariant underneath all of them.
|
||
|
||
*Universal.* Cadence is a formula, not a guess — Young/Daly `W = √(2·μ·C)` (μ = mean time between
|
||
preemptions, C = checkpoint write time); round *down* to an iteration boundary. Managed frameworks
|
||
(SkyPilot Managed Jobs, SageMaker) move the box for you but **restart your process from scratch — your
|
||
checkpoint-load is what restores progress.** Details + worked numbers in `references/spot-resilience.md`.
|
||
|
||
---
|
||
|
||
## 9. Cost and destructive actions are the user's call
|
||
|
||
Never auto-release/terminate an instance, never delete durable/shared files without explicit
|
||
confirmation, and if your own cleanup can't free enough space, **ask to expand the disk** (state the GB
|
||
needed) rather than silently shrinking the experiment (fewer seeds, smaller eval, capped vis).
|
||
|
||
This is sharpened, not softened, by going multi-platform: on RunPod/vast/Lambda the meter-stopping action
|
||
is the **irreversible** `terminate`/`destroy` that deletes the disk — so the confirmation gate matters
|
||
*more*. Operationalize it as the **teardown Iron Law** (SKILL.md Phase 5): no teardown before checkpoints
|
||
are pulled to local AND verified by load AND the user approves the specific cost-affecting action.
|
||
|
||
*Universal.* A shared FS is also multi-project: work inside your project's own folder, delete only your
|
||
own redundancy, never a top-level dir you didn't create.
|
||
|
||
---
|
||
|
||
## 10. Teach the user the platform, don't just drive it
|
||
|
||
Most users — especially on a platform they rent only occasionally — don't know its non-obvious
|
||
**conveniences** or its **danger clocks**, and the skill's job is not just to operate the box but to *tell
|
||
them*. On first contact with a platform, proactively surface:
|
||
- **Conveniences they'd otherwise miss:** one-click SSH-key registration (so the agent can connect
|
||
non-interactively), GPU-availability notifications, the built-in panels (JupyterLab / the TensorBoard tile).
|
||
- **Danger clocks that cost data or money:** auto-release / auto-delete timers on *stopped* instances
|
||
(AutoDL releases a 关机 box after **15 days** → the data disk is gone; several CN platforms in ~10), a
|
||
stop that keeps billing (vast.ai forever, RunPod 2×), low-balance / arrears purge.
|
||
|
||
The per-platform list lives in each profile's **Surface to the user** block. This pairs with #9: #9 stops
|
||
the agent from *doing* the dangerous thing; #10 makes the agent *warn the human* about the danger clock
|
||
before it fires. The most expensive surprises on rented hardware are the silent timers (a parked box
|
||
released, a stopped disk still billing), not the visible failures — surfacing them early is the cheapest
|
||
insurance.
|