# Operating Principles — the 10 invariants, expanded
These are the *why* behind every phase and gotcha. They hold on any **metered, isolated, rented GPU**
— AutoDL, RunPod, vast.ai, Lambda, Paperspace, a Chinese platform, a bare SSH box, Slurm, or K8s. Only
the concrete paths/CLI change (those live in `profiles/<platform>.md`). Internalize these; the recipes
follow. The one-line form is in `SKILL.md`; this file carries the cross-platform nuance.
To jump: `grep -n '^## ' references/principles.md`.
---
## 1. Minimize paid wall-clock
The meter runs the *entire* time the box is up, not just while the GPU computes. Three consequences:
smoke-test correctness **locally on CPU before renting** (principle #2); **launch detached and hand
control back** rather than babysitting a blocking `sleep`; and **release the instant verification
passes** (principle #9 governs the *who-decides*). Every idle paid minute — a stuck download, a forgotten
box overnight, a human-in-the-loop pause on a live instance — is money.
*Universal.* Even on Slurm where the "meter" is walltime/fairshare quota rather than dollars, the same
discipline applies: don't hold an allocation idle.
---
## 2. Cheap checks before expensive compute
A CPU smoke (12 batches, logger disabled, tiny shapes) kills import errors, config drift, tensor-shape
and measurement-**scale** bugs for ~free, **before** they bill GPU-hours. It is *necessary, not
sufficient* — it won't catch convergence — but it catches the dumb-and-expensive failures that otherwise
only surface after an instance spins up.
*Boundary:* this skill owns *when* to run the smoke (the pre-rent gate). The smoke's *content* — what to
assert, how to shrink the problem — belongs to **`verifying-dl-experiments`**. Don't duplicate it here.
---
## 3. Trust artifacts you loaded, not log lines that claim success
"synced / saved / done / 100% complete" is a **claim**, and claims lie under a silently-failed write —
a full disk, exhausted inodes, a swallowed error, a half-uploaded blob. Confirm the file **exists and
loads** before releasing the only copy.
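
A minimal sketch of the "exists and loads" check, assuming the artifact is a JSON metrics file. `verify_artifact` and `load_json` are hypothetical helper names, not part of this skill. For a checkpoint you would pass whatever loader your framework provides in place of `load_json`.

```python
import json
import os


def verify_artifact(path, loader):
    """Return the loaded object, or raise if the file is missing, empty, or unloadable."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"artifact missing: {path}")
    if os.path.getsize(path) == 0:
        raise ValueError(f"artifact is empty (0 bytes): {path}")
    # The load is the real test: a truncated upload can exist and be non-empty.
    return loader(path)


def load_json(path):
    """Loader for JSON evidence files; raises json.JSONDecodeError on a partial write."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```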
**A watcher's own state is also a claim**, not ground truth. An async condition-waiter whose job you
superseded polls a marker that will never arrive (a zombie that loops forever). A session-scoped monitor
dies on context reset while the job runs on. Reconcile watchers against the job's *real* process and
artifacts (`tmux ls` / `squeue` / `pgrep`, output `mtime`, a load-test), tear a watcher down when you
supersede its job, and match a watcher's lifetime to the wait's duration.
> **Monitoring physics this rests on:** foreground Bash hard-caps at 600 s (a long foreground wait is
> killed at 10 min); `run_in_background` has **no** cap and notifies on exit; a never-*exiting* watcher
> never notifies; an unquoted `|` inside a poll regex splits into piped commands and the first reads
> stdin → hangs forever. See `references/monitoring_patterns.md`.
*Universal — the load-bearing spine.* It is the platform instance of
`superpowers:verification-before-completion`'s Iron Law ("no completion claim without fresh verification
evidence"). Shared with `verifying-dl-experiments`.
---
## 4. Know what survives stop vs destroy
**The single biggest portability trap.** AutoDL persists `/root` across a power-off, so the AutoDL
habit is "just 关机 (power off), my data's fine." That assumption is **false almost everywhere else**:
- **RunPod** wipes the *container disk* on stop; only the *volume disk* (`/workspace`) survives a stop,
and only a **Network Volume** survives a terminate.
- **vast.ai** keeps disk across a stop but **bills it forever**; a destroy loses everything.
- **K8s** wipes the pod filesystem on every reschedule unless a PVC is mounted.
- **Colab** loses `/content` and RAM on disconnect.
So the principle is not a path — it's a **discipline**: for each platform, before Phase 0, read the
profile's STORAGE survival-matrix and write your checkpoints to the mount that survives the teardown verb
you intend to use. The data you need most often lives on the *volatile* tier by default.
*Mixed:* the *rule* is universal; the *which-mount* value is a profile fact.
---
## 5. Storage fails on the dimension — and the location — you're not watching
Disk dies on **inodes before bytes** (`df -h` shows 34% while `cp` fails "No space left" because `df -i`
is at 100% — classic on a shared FS full of many-small-files eval output). The real space hog often
lives where you didn't look — a **symlinked cache** (`~/.cache/huggingface` mapped onto the data disk)
can outweigh the `runs/` you created. **Audit with `du` on the actual mount, not assumptions.** Clean by
**value**: keep the tiny irreplaceable evidence (metric/eval JSONs), discard the large reproducible
scratch (periodic checkpoints, unused model caches). One observed sweep left **179 GB** of superseded
`latest.pt`/`epoch_*.pt` while the real evidence was **<200 MB** of JSON. Pre-compute the budget; monitor
`df -i`, not just `df -h`.
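
The two dimensions can be watched together from a script. The sketch below assumes a POSIX box and uses `os.statvfs`; `mount_usage` and `storage_alarm` are hypothetical names, and the 90% threshold is an arbitrary illustration. A platform-enforced quota may not be visible through `statvfs`, so treat this as a complement to the profile's documented cap, not a replacement.

```python
import os


def mount_usage(path):
    """Return (byte_fraction_used, inode_fraction_used) for the filesystem holding path."""
    st = os.statvfs(path)  # POSIX only
    bytes_used = 1.0 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    # Some filesystems report 0 total inodes (no fixed inode table); treat that as no inode limit.
    inodes_used = 1.0 - st.f_favail / st.f_files if st.f_files else 0.0
    clamp = lambda x: min(1.0, max(0.0, x))
    return clamp(bytes_used), clamp(inodes_used)


def storage_alarm(path, threshold=0.90):
    """List which dimensions ("bytes", "inodes") are at or above the threshold."""
    byte_frac, inode_frac = mount_usage(path)
    pairs = (("bytes", byte_frac), ("inodes", inode_frac))
    return [name for name, frac in pairs if frac >= threshold]
```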
*Mixed:* the inode-cap *number* is a profile fact (AutoDL/China enforce ~200K; RunPod/vast/Lambda spec
GB quotas with no documented inode cap). The "audit the real mount, clean by value" discipline is core.
The general fix for the many-small-files trap is to **shard into tar** (WebDataset); see
`references/gotchas_universal.md` U25.
---
## 6. Never mutate inputs under a live run
A running bash job reads its script **from disk by byte offset**; it does not keep a snapshot in memory.
The shell that tmux started for `run_queue.sh` resumes reading at a saved offset after each command, so
`scp`-ing a new version over the same path mid-run makes bash land in the middle of a *different* file
and re-execute or skip blocks (duplicate runs, stalled queues). Version filenames; edit only when nothing
is reading them (`pgrep -af <script>` empty).
*Universal — pure bash/tmux physics.* Identical across every SSH backend.
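
One way to make "version filenames" mechanical is to stage each script under a name derived from its content, so an upload can never overwrite a file that a live shell is reading. This is a sketch under that assumption; `stage_versioned` and the 12-character hash suffix are illustrative choices, not something this skill prescribes.

```python
import hashlib
import shutil
from pathlib import Path


def stage_versioned(src, dest_dir):
    """Copy src into dest_dir as <stem>.<content-hash><suffix>; never overwrite in place."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()[:12]
    dest = dest_dir / f"{src.stem}.{digest}{src.suffix}"
    # Identical content maps to the identical name, so re-staging is a no-op.
    if not dest.exists():
        shutil.copy2(src, dest)
    return dest
```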
---
## 7. Design for retry — failure is probabilistic, transfers are flaky, mirrors are route-specific
Some fraction of identical launches die (a network blip during `wandb.init`, a transient kernel fault, a
spot preemption). Wrappers must be **idempotent and resumable**; retry the **identical config** rather
than hand-patching one run (which destroys comparability — see `verifying-dl-experiments`).
**Bulk transfers are the prototypical flaky step:** wrap them in `timeout`+resume retry loops — a stall
≠ permanent failure, and resumable downloads accumulate progress across kills. An acceleration
**mirror/proxy/cache speeds ONE route, not all** — it may cover the metadata/API path while the bulk-data
path (a CDN/blob backend) still fails, and a *domestic* source routed through a *foreign*-acceleration
proxy is slower. Match the route to the origin; validate a speed test on the **same route** the real
transfer uses (a no-proxy probe of a proxied transfer measures nothing).
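
A `timeout`+resume loop can be sketched as below. `run_with_retry` is a hypothetical helper, and the defaults are placeholders. The loop only accumulates progress if the wrapped command is itself resumable (for example `rsync --partial` or `wget -c`); the wrapper just kills a stalled attempt and starts the identical command again.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, timeout_s=600, backoff_s=10.0):
    """Run cmd until it exits 0; treat a stall (timeout) or non-zero exit as retryable.

    Returns the 1-based attempt number that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            pass  # stalled: subprocess.run has already killed the child
        if attempt < attempts:
            time.sleep(backoff_s)
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")
```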
*Universal.* The **spot/preemption** sub-case is profile-parameterized (central on vast/RunPod; on
Lambda/Paperspace/China the interruption is auto-shutdown/auto-release/capacity instead) — see principle
#8 and `references/spot-resilience.md`.
---
## 8. Checkpoint-to-durable + idempotent resume is the universal spine
Detaching the job is necessary but not sufficient. The **one** mechanism that survives every failure
mode — SSH drop, Slurm walltime kill, K8s pod reschedule, spot preemption, Colab disconnect — is:
1. **Checkpoint full state to the platform's durable location** on a periodic timer (model + optimizer +
LR-scheduler + epoch/step + RNG + dataloader position), written **atomically** (`tmp`→`fsync`→
`os.rename`) so a mid-write kill never corrupts the latest good checkpoint.
2. **Load-latest-on-startup unconditionally**, so the *identical launch command* resumes instead of
restarting. This is what makes principle #7's "retry the identical config" actually resume progress.
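
The two steps above can be sketched with the standard library alone. The example uses `pickle` as a stand-in serializer so it runs anywhere; a real training job would swap in its framework's save/load calls. `save_atomic` and `load_latest` are hypothetical names, and `os.replace` is used instead of `os.rename` because it also overwrites atomically on non-POSIX systems.

```python
import os
import pickle
import tempfile


def save_atomic(state, path):
    """Write state to path via tmp file, fsync, then atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target for the rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename publishes them
        os.replace(tmp, path)  # readers see either the old or the new checkpoint, never a partial one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_latest(path, default):
    """Unconditional load-on-startup: resume if a checkpoint exists, else start fresh."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)
```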
The **detach primitive is the swappable plug** — tmux on a bare box, `sbatch --requeue` on Slurm, a Job
manifest on K8s, a Save&Run commit on Kaggle, a checkpoint-to-Drive loop on Colab. Checkpoint+resume is
the invariant underneath all of them.
*Universal.* Cadence is a formula, not a guess — Young/Daly `W = √(2·μ·C)` (μ = mean time between
preemptions, C = checkpoint write time); round *down* to an iteration boundary. Managed frameworks
(SkyPilot Managed Jobs, SageMaker) move the box for you but **restart your process from scratch — your
checkpoint-load is what restores progress.** Details + worked numbers in `references/spot-resilience.md`.
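
As a quick illustration of the cadence formula, the sketch below computes `W` and rounds it down to a whole number of iterations. The function names and the numbers in the comments (μ = 2 h, C = 100 s, 7 s per iteration) are made up for this example and are not taken from `references/spot-resilience.md`.

```python
import math


def young_daly_interval(mtbf_s, write_s):
    """Young/Daly checkpoint interval W = sqrt(2 * mu * C), all in seconds."""
    return math.sqrt(2.0 * mtbf_s * write_s)


def checkpoint_every_n_iters(mtbf_s, write_s, iter_s):
    """Round W down to an iteration boundary; never return less than 1 iteration."""
    n = math.floor(young_daly_interval(mtbf_s, write_s) / iter_s)
    return max(1, n)


# Hypothetical inputs: mu = 7200 s between preemptions, C = 100 s per checkpoint write.
# young_daly_interval(7200, 100) -> 1200.0 seconds (20 minutes)
# checkpoint_every_n_iters(7200, 100, 7) -> 171 iterations (1197 s, just under W)
```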
---
## 9. Cost and destructive actions are the user's call
Never auto-release/terminate an instance, never delete durable/shared files without explicit
confirmation, and if your own cleanup can't free enough space, **ask to expand the disk** (state the GB
needed) rather than silently shrinking the experiment (fewer seeds, smaller eval, capped vis).
This is sharpened, not softened, by going multi-platform: on RunPod/vast/Lambda the meter-stopping action
is the **irreversible** `terminate`/`destroy` that deletes the disk — so the confirmation gate matters
*more*. Operationalize it as the **teardown Iron Law** (SKILL.md Phase 5): no teardown before checkpoints
are pulled to local AND verified by load AND the user approves the specific cost-affecting action.
*Universal.* A shared FS is also multi-project: work inside your project's own folder, delete only your
own redundancy, never a top-level dir you didn't create.
---
## 10. Teach the user the platform, don't just drive it
Most users — especially on a platform they rent only occasionally — don't know its non-obvious
**conveniences** or its **danger clocks**, and the skill's job is not just to operate the box but to *tell
them*. On first contact with a platform, proactively surface:
- **Conveniences they'd otherwise miss:** one-click SSH-key registration (so the agent can connect
non-interactively), GPU-availability notifications, the built-in panels (JupyterLab / the TensorBoard tile).
- **Danger clocks that cost data or money:** auto-release / auto-delete timers on *stopped* instances
(AutoDL releases a powered-off (关机) box after **15 days** → the data disk is gone; several CN platforms
in ~10 days), a stop that keeps billing (vast.ai forever, RunPod 2×), low-balance / arrears purge.
The per-platform list lives in each profile's **Surface to the user** block. This pairs with #9: #9 stops
the agent from *doing* the dangerous thing; #10 makes the agent *warn the human* about the danger clock
before it fires. The most expensive surprises on rented hardware are the silent timers (a parked box
released, a stopped disk still billing), not the visible failures — surfacing them early is the cheapest
insurance.