Operating Principles — the 10 invariants, expanded

These are the why behind every phase and gotcha. They hold on any metered, isolated, rented GPU — AutoDL, RunPod, vast.ai, Lambda, Paperspace, a Chinese platform, a bare SSH box, Slurm, or K8s. Only the concrete paths/CLI change (those live in profiles/<platform>.md). Internalize these; the recipes follow. The one-line form is in SKILL.md; this file carries the cross-platform nuance.

To jump: grep -n '^## ' references/principles.md.


1. Minimize paid wall-clock

The meter runs the entire time the box is up, not just while the GPU computes. Three consequences: smoke-test correctness locally on CPU before renting (principle #2); launch detached and hand control back rather than babysitting a blocking sleep; and release the instant verification passes (principle #9 governs the who-decides). Every idle paid minute — a stuck download, a forgotten box overnight, a human-in-the-loop pause on a live instance — is money.

Universal. Even on Slurm where the "meter" is walltime/fairshare quota rather than dollars, the same discipline applies: don't hold an allocation idle.


2. Cheap checks before expensive compute

A CPU smoke (12 batches, logger disabled, tiny shapes) kills import errors, config drift, tensor-shape and measurement-scale bugs for ~free, before they bill GPU-hours. It is necessary, not sufficient — it won't catch convergence — but it catches the dumb-and-expensive failures that otherwise only surface after an instance spins up.

Boundary: this skill owns when to run the smoke (the pre-rent gate). The smoke's content — what to assert, how to shrink the problem — belongs to verifying-dl-experiments. Don't duplicate it here.


3. Trust artifacts you loaded, not log lines that claim success

"synced / saved / done / 100% complete" is a claim, and claims lie under a silently-failed write — a full disk, exhausted inodes, a swallowed error, a half-uploaded blob. Confirm the file exists and loads before releasing the only copy.

A watcher's own state is also a claim, not ground truth. An async condition-waiter whose job you superseded polls a marker that will never arrive (a zombie that loops forever). A session-scoped monitor dies on context reset while the job runs on. Reconcile watchers against the job's real process and artifacts (tmux ls / squeue / pgrep, output mtime, a load-test), tear a watcher down when you supersede its job, and match a watcher's lifetime to the wait's duration.
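The "exists and loads" check that this principle asks for (and that a watcher should be reconciled against) can be sketched as follows. This is a minimal illustration, assuming the artifact is a JSON metrics file; the name `verify_artifact` is hypothetical, and for a model checkpoint the framework's own load call would take the place of `json.load`.

```python
import json
import os


def verify_artifact(path):
    """Return (ok, reason); ok is True only if the file exists, is non-empty, and parses."""
    if not os.path.isfile(path):
        return False, "missing"
    if os.path.getsize(path) == 0:
        return False, "empty"
    try:
        with open(path, "r", encoding="utf-8") as fh:
            json.load(fh)  # the successful load is the evidence, not a log line
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        return False, f"unloadable: {exc}"
    return True, "ok"
```

A truncated upload fails the parse step even though the file exists, which is exactly the case a "saved" log line cannot distinguish.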

Monitoring physics this rests on: foreground Bash hard-caps at 600 s (a long foreground wait is killed at 10 min); run_in_background has no cap and notifies on exit; a never-exiting watcher never notifies; an unquoted | inside a poll regex splits into piped commands and the first reads stdin → hangs forever. See references/monitoring_patterns.md.
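One way to keep a watcher from becoming the never-exiting kind is to give every poll loop a deadline, so it always returns and the caller is always notified. A minimal sketch; `wait_for` and its parameters are hypothetical names, and the predicate is assumed to check real evidence (a process listing, an output file's mtime) rather than a marker that may never arrive.

```python
import time


def wait_for(predicate, timeout_s, interval_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout, so the loop always ends
    and the caller can reconcile against the job's real state.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Setting `timeout_s` to the expected duration of the wait (plus slack) matches the watcher's lifetime to the wait, and a `False` return is a prompt to inspect the job instead of waiting longer.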

Universal — the load-bearing spine. It is the platform instance of superpowers:verification-before-completion's Iron Law ("no completion claim without fresh verification evidence"). Shared with verifying-dl-experiments.


4. Know what survives stop vs destroy

The single biggest portability trap. AutoDL persists /root across a power-off, so the AutoDL habit is "just shut down (关机), my data's fine." That assumption is false almost everywhere else:

  • RunPod wipes the container disk on stop; only the volume disk (/workspace) survives a stop, and only a Network Volume survives a terminate.
  • vast.ai keeps disk across a stop but bills it forever; a destroy loses everything.
  • K8s wipes the pod filesystem on every reschedule unless a PVC is mounted.
  • Colab loses /content and RAM on disconnect.

So the principle is not a path — it's a discipline: for each platform, before Phase 0, read the profile's STORAGE survival-matrix and write your checkpoints to the mount that survives the teardown verb you intend to use. The data you need most often lives on the volatile tier by default.
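The survival facts listed above can be held as a small lookup keyed by platform and teardown verb. This sketch only transcribes the statements in this section; the names `SURVIVES` and `surviving_locations` are hypothetical, and the profile's STORAGE survival-matrix remains the authoritative source.

```python
# Which storage survives which teardown verb, transcribed from the list above.
# An empty tuple means the list names nothing on the box that survives that verb.
SURVIVES = {
    ("autodl", "power-off"): ("/root",),
    ("runpod", "stop"): ("volume disk (/workspace)",),
    ("runpod", "terminate"): ("Network Volume",),
    ("vast.ai", "stop"): ("instance disk (still billed)",),
    ("vast.ai", "destroy"): (),
    ("k8s", "reschedule"): ("PVC",),
    ("colab", "disconnect"): (),
}


def surviving_locations(platform, verb):
    """Return the surviving locations, or raise if the pair is not in the table."""
    key = (platform.lower(), verb.lower())
    if key not in SURVIVES:
        raise KeyError(f"no survival entry for {key}; read the platform profile")
    return SURVIVES[key]
```

Raising on an unknown pair is deliberate: an unlisted platform or verb should send you back to the profile, not default to "probably fine".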

Mixed: the rule is universal; the which-mount value is a profile fact.


5. Storage fails on the dimension — and the location — you're not watching

Disk dies on inodes before bytes (df -h shows 34% while cp fails "No space left" because df -i is at 100% — classic on a shared FS full of many-small-files eval output). The real space hog often lives where you didn't look — a symlinked cache (~/.cache/huggingface mapped onto the data disk) can outweigh the runs/ you created. Audit with du on the actual mount, not assumptions. Clean by value: keep the tiny irreplaceable evidence (metric/eval JSONs), discard the large reproducible scratch (periodic checkpoints, unused model caches — one observed sweep left 179 GB of superseded latest.pt/epoch_*.pt while the real evidence was <200 MB of JSON). Pre-compute the budget; monitor df -i, not just df -h.
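Watching both dimensions takes one `os.statvfs` call. A minimal sketch (Unix-only); `usage_percent` and `storage_alarm` are hypothetical names, the 90% threshold is an arbitrary assumption, and the percentages approximate what `df -h` and `df -i` report.

```python
import os


def usage_percent(path):
    """Return (bytes_used_pct, inodes_used_pct) for the filesystem holding path.

    A value is None when the filesystem reports a zero total for that dimension.
    """
    st = os.statvfs(path)
    bytes_pct = 100.0 * (st.f_blocks - st.f_bavail) / st.f_blocks if st.f_blocks else None
    inodes_pct = 100.0 * (st.f_files - st.f_favail) / st.f_files if st.f_files else None
    return bytes_pct, inodes_pct


def storage_alarm(path, threshold_pct=90.0):
    """List the dimensions ('bytes', 'inodes') at or above threshold_pct."""
    bytes_pct, inodes_pct = usage_percent(path)
    alarms = []
    if bytes_pct is not None and bytes_pct >= threshold_pct:
        alarms.append("bytes")
    if inodes_pct is not None and inodes_pct >= threshold_pct:
        alarms.append("inodes")
    return alarms
```

Call it on the actual mount the job writes to (following any symlinked cache), since a check on the wrong mount repeats the "where you didn't look" mistake.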

Mixed: the inode-cap number is a profile fact (AutoDL/China enforce ~200K; RunPod/vast/Lambda spec GB quotas with no documented inode cap). The "audit the real mount, clean by value" discipline is core. The general form of the many-small-files trap is shard into tar (WebDataset) — see references/gotchas_universal.md U25.


6. Never mutate inputs under a live run

A running job does not hold a private copy of its scripts. The bash that tmux launched on run_queue.sh reads the file incrementally, seeking to a saved byte offset, so scp-ing a new version over it mid-run makes bash land in the middle of a different file and re-execute blocks (duplicate runs, stalled queues). Version filenames; edit only when nothing is reading them (pgrep -af <script> empty).
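"Version filenames" can be made mechanical: write the edited script to the next unused versioned name so the file a live bash is reading is never touched. A small sketch; `next_version_path` and the `_vN` suffix scheme are hypothetical choices, not a convention this skill prescribes.

```python
import os
import re


def next_version_path(path):
    """Return the first unused sibling named <stem>_v<N><ext>, starting at N=2."""
    directory, name = os.path.split(path)
    stem, ext = os.path.splitext(name)
    stem = re.sub(r"_v\d+$", "", stem)  # drop an existing version suffix
    n = 2
    while True:
        candidate = os.path.join(directory, f"{stem}_v{n}{ext}")
        if not os.path.exists(candidate):
            return candidate
        n += 1
```

Upload the new version under the returned name and point the next launch at it; the running queue keeps reading its original, unmodified file.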

Universal — pure bash/tmux physics. Identical across every SSH backend.


7. Design for retry — failure is probabilistic, transfers are flaky, mirrors are route-specific

Some fraction of identical launches die (a network blip during wandb.init, a transient kernel fault, a spot preemption). Wrappers must be idempotent and resumable; retry the identical config rather than hand-patching one run (which destroys comparability — see verifying-dl-experiments).

Bulk transfers are the prototypical flaky step: wrap them in timeout+resume retry loops — a stall ≠ permanent failure, and resumable downloads accumulate progress across kills. An acceleration mirror/proxy/cache speeds ONE route, not all — it may cover the metadata/API path while the bulk-data path (a CDN/blob backend) still fails, and a domestic source routed through a foreign-acceleration proxy is slower. Match the route to the origin; validate a speed test on the same route the real transfer uses (a no-proxy probe of a proxied transfer measures nothing).
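The timeout+resume loop described above has a simple shape. A minimal sketch under stated assumptions: `retry_resumable` is a hypothetical helper, `step` is assumed to enforce its own timeout and to continue from existing partial progress on each call, and the exponential backoff is an added choice rather than something this document specifies.

```python
import time


def retry_resumable(step, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call step() until it returns True; return the attempt number that succeeded.

    step must be resumable: each call picks up the progress earlier calls
    left behind, and returns True only once the transfer is complete.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if step():
                return attempt
        except OSError as exc:  # TimeoutError and ConnectionError are OSError subclasses
            print(f"attempt {attempt} failed: {exc}")
        if attempt < max_attempts:
            sleep(base_delay_s * 2 ** (attempt - 1))  # back off before resuming
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Because progress accumulates inside `step`, a stalled attempt costs only its own timeout, not the bytes already transferred.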

Universal. The spot/preemption sub-case is profile-parameterized (central on vast/RunPod; on Lambda/Paperspace/China the interruption is auto-shutdown/auto-release/capacity instead) — see principle #8 and references/spot-resilience.md.


8. Checkpoint-to-durable + idempotent resume is the universal spine

Detaching the job is necessary but not sufficient. The one mechanism that survives every failure mode — SSH drop, Slurm walltime kill, K8s pod reschedule, spot preemption, Colab disconnect — is:

  1. Checkpoint full state to the platform's durable location on a periodic timer (model + optimizer + LR-scheduler + epoch/step + RNG + dataloader position), written atomically (tmp → fsync → os.rename) so a mid-write kill never corrupts the latest good checkpoint.
  2. Load-latest-on-startup unconditionally, so the identical launch command resumes instead of restarting. This is what makes principle #7's "retry the identical config" actually resume progress.
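The two steps above can be sketched as a pair of functions. This is an illustration, not the skill's implementation: `save_checkpoint_atomic` and `load_latest` are hypothetical names, `pickle` stands in for the training framework's own save/load, and `os.replace` is used for the rename step (it behaves like `os.rename` on POSIX and also overwrites on Windows). The sketch does not fsync the containing directory.

```python
import os
import pickle
import tempfile


def save_checkpoint_atomic(state, path):
    """Write state so that path always holds a complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # force the bytes to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_latest(path, default):
    """Return the saved state if a checkpoint exists, else default (fresh start)."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

A training loop would start with `state = load_latest(path, {"step": 0})` unconditionally, so the same launch command serves as both first run and resume.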

The detach primitive is the swappable plug — tmux on a bare box, sbatch --requeue on Slurm, a Job manifest on K8s, a Save&Run commit on Kaggle, a checkpoint-to-Drive loop on Colab. Checkpoint+resume is the invariant underneath all of them.

Universal. Cadence is a formula, not a guess — Young/Daly W = √(2·μ·C) (μ = mean time between preemptions, C = checkpoint write time); round down to an iteration boundary. Managed frameworks (SkyPilot Managed Jobs, SageMaker) move the box for you but restart your process from scratch — your checkpoint-load is what restores progress. Details + worked numbers in references/spot-resilience.md.
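A short sketch of the cadence formula with the round-down step. The function name and the input numbers (a 4-hour mean time between preemptions, a 30 s checkpoint write, 2 s per iteration) are illustrative assumptions and are not the worked numbers from references/spot-resilience.md.

```python
import math


def checkpoint_interval(mtbf_s, ckpt_write_s, iter_s):
    """Young/Daly W = sqrt(2 * mu * C), rounded down to whole iterations.

    Returns (iterations_between_checkpoints, interval_in_seconds).
    """
    w = math.sqrt(2.0 * mtbf_s * ckpt_write_s)
    iters = max(1, math.floor(w / iter_s))  # never less than one iteration
    return iters, iters * iter_s


iters, seconds = checkpoint_interval(mtbf_s=4 * 3600, ckpt_write_s=30, iter_s=2.0)
print(iters, seconds)  # 464 928.0
```

Here W = √(2·14400·30) ≈ 929.5 s, which rounds down to 464 iterations, or 928 s between checkpoints.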


9. Cost and destructive actions are the user's call

Never auto-release/terminate an instance, never delete durable/shared files without explicit confirmation, and if your own cleanup can't free enough space, ask to expand the disk (state the GB needed) rather than silently shrinking the experiment (fewer seeds, smaller eval, capped vis).

This is sharpened, not softened, by going multi-platform: on RunPod/vast/Lambda the meter-stopping action is the irreversible terminate/destroy that deletes the disk — so the confirmation gate matters more. Operationalize it as the teardown Iron Law (SKILL.md Phase 5): no teardown before checkpoints are pulled to local AND verified by load AND the user approves the specific cost-affecting action.
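The three-way AND in the teardown rule can be written as an explicit gate. A minimal sketch; `TeardownEvidence` and `may_teardown` are hypothetical names, and how each flag gets set (the pull, the load-test, the user's answer) is outside this snippet.

```python
from dataclasses import dataclass


@dataclass
class TeardownEvidence:
    pulled_to_local: bool      # checkpoints copied off the instance
    verified_by_load: bool     # the local copy was actually loaded
    user_approved_action: str  # the exact action the user approved, "" if none


def may_teardown(evidence, action):
    """Allow `action` only if all three conditions hold for that specific action."""
    return (
        evidence.pulled_to_local
        and evidence.verified_by_load
        and evidence.user_approved_action == action
    )
```

Comparing the approved action by name means an approval to stop does not silently authorize a terminate or destroy.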

Universal. A shared FS is also multi-project: work inside your project's own folder, delete only your own redundancy, never a top-level dir you didn't create.


10. Teach the user the platform, don't just drive it

Most users — especially on a platform they rent only occasionally — don't know its non-obvious conveniences or its danger clocks, and the skill's job is not just to operate the box but to tell them. On first contact with a platform, proactively surface:

  • Conveniences they'd otherwise miss: one-click SSH-key registration (so the agent can connect non-interactively), GPU-availability notifications, the built-in panels (JupyterLab / the TensorBoard tile).
  • Danger clocks that cost data or money: auto-release / auto-delete timers on stopped instances (AutoDL releases a powered-off (关机) box after 15 days → the data disk is gone; several CN platforms do so in ~10 days), a stop that keeps billing (vast.ai forever, RunPod 2×), low-balance / arrears purge.

The per-platform list lives in each profile's Surface to the user block. This pairs with #9: #9 stops the agent from doing the dangerous thing; #10 makes the agent warn the human about the danger clock before it fires. The most expensive surprises on rented hardware are the silent timers (a parked box released, a stopped disk still billing), not the visible failures — surfacing them early is the cheapest insurance.