451 lines
35 KiB
Markdown
451 lines
35 KiB
Markdown
---
|
||
platform: generic-ssh # the DEFAULT profile; Slurm / K8s / Colab-Kaggle are thin diffs below
|
||
kind: ssh # ssh | slurm | kubernetes | notebook (per sub-section)
|
||
meter_stop_verb: manual # nothing reclaims the box — a forgotten instance bills 24/7
|
||
meter_stop_irreversible: true # destroying the box deletes its disk; no platform undo
|
||
detach_primitive: tmux # tmux/nohup (bare) | sbatch (Slurm) | k8s-job (K8s) | kaggle-commit
|
||
spot_available: false # bare box: none by default; Slurm scavenger + spot rentals override
|
||
spot_grace: n/a # bare: n/a · Slurm: SIGTERM→KillWait(default 30s)→SIGKILL · K8s: terminationGracePeriodSeconds(default 30s)
|
||
shared_fs: host-dependent # bare: one disk you own · Slurm: parallel /scratch · K8s: a PVC
|
||
inode_cap: host-dependent # measure with df -i; do NOT assume an AutoDL ~200K constant
|
||
free_egress: host-dependent
|
||
china_mirror_needed: host-dependent # only if the box sits behind the GFW
|
||
host_driver_cuda_max: host-dependent
|
||
local_nvme: host-dependent
|
||
---
|
||
|
||
# Profile: generic-SSH — the DEFAULT (bare box) + Slurm / Kubernetes / Colab-Kaggle diffs
|
||
|
||
One-line purpose: the lowest-common-denominator profile for a box where **SSH is the only control
|
||
channel and teardown is manual** — every other platform profile is a *diff* against this baseline.
|
||
|
||
> **Surface to the user up front (principle #10):** ⚠️ Danger clock — there is usually **no auto-release / idle timer to save you**: a forgotten box **bills 24/7** until you tear it down, and teardown is entirely manual (no platform safety net). Reality — you **expose ports yourself** (an `ssh -L` tunnel for TB/Jupyter); on Slurm a job dies at **walltime** — design the requeue.
|
||
|
||
Read this whole file before Phase 0 on any unbranded rental, then jump to the matching sub-section
|
||
(Slurm / Kubernetes / Colab-Kaggle) if the backend is a scheduler, a cluster, or a notebook.
|
||
**Universal gotchas are NOT restated here** — see `references/gotchas_universal.md`.
|
||
|
||
**Table of contents** (`grep -in '<keyword>' profiles/generic-ssh.md` to jump):
|
||
- BASELINE: 8-field schema for the bare-SSH box (sections 1–8)
|
||
- THIN DIFF — SLURM (sbatch replaces tmux)
|
||
- THIN DIFF — KUBERNETES (a Job manifest replaces the shell)
|
||
- THIN DIFF — COLAB / KAGGLE (not SSH-orchestratable)
|
||
|
||
The one load-bearing abstraction every backend below solves differently: **detach the job from the
|
||
connection, and make the result survive the session ending.** Checkpoint-to-durable + idempotent
|
||
resume (principle #8) is the invariant; the detach primitive (tmux / sbatch / Job / commit) is the
|
||
swappable plug.
|
||
|
||
---
|
||
|
||
## 1. LAUNCH
|
||
|
||
- **Entry point:** `ssh user@host` — key-based, fronted by an `~/.ssh/config` alias so the rest of
|
||
the workflow says `ssh gpu-box`. There is **no platform API, console, or CLI** — SSH is the *only*
|
||
control channel (this is what makes the box "generic"). Set the alias per `references/ssh_transport.md`.
|
||
- **Push code:** `rsync -avz --partial ./proj/ gpu-box:~/proj/` — resumable, delta-only on re-syncs;
|
||
prefer over `scp` (a reset `scp` restarts from zero). Pull results the same way, reversed.
|
||
- **Download weights/datasets ON the box**, not over the local uplink: `ssh gpu-box 'cd ~/proj &&
|
||
hf download <repo> --local-dir data'` (or `aws s3 cp`, `wget`). The box almost always has a fatter,
|
||
cheaper pipe to HF/S3 than a home connection — pushing a 50 GB checkpoint over a residential uplink
|
||
is the classic self-inflicted stall. Transport verbs → **REQUIRED:** `huggingface-skills:hf-cli`.
|
||
- **Env contract:** whatever the host ships. There is no prebuilt "base" guarantee — inspect
|
||
`which python && python -V && nvidia-smi` first. If the image has a usable env, treat it as AutoDL's
|
||
base (do not `conda create` on a throwaway box); if it is bare, `conda create` / `venv` once and
|
||
pin it. State the seed/determinism in the run itself — no platform does it here (**REQUIRED:**
|
||
`verifying-dl-experiments`).
|
||
|
||
→ **verify:** `ssh gpu-box 'python -c "import torch;print(torch.cuda.is_available())"'` prints `True`.
|
||
|
||
## 2. STORAGE MODEL *(the survival matrix — principle #4)*
|
||
|
||
The box gives **one persistent disk that is yours to manage** — no shared FS, no platform quota
|
||
service, no automatic reclamation. *Measure, never assume:* run `df -h && df -i <mount>` live on the
|
||
box. Caps are host-dependent — do **not** carry over an AutoDL ~200K-inode or ~200 GB constant.
|
||
|
||
| Tier | Path | Survives STOP? | Survives DESTROY? | Cap |
|
||
|---|---|---|---|---|
|
||
| Root / home disk | `/` , `~` | yes (box keeps running) | **no** (destroy deletes the box) | host-dependent — `df -h`/`df -i` |
|
||
| Attached block volume (if any) | `/path/to/mount` | yes | depends on provider — verify before destroy | host-dependent |
|
||
|
||
The only "survival matrix" subtlety on a bare box: there is **no stop/destroy distinction the
|
||
platform enforces** — the box runs until *manually* stopped, and a destroy wipes the disk with no
|
||
undo. So checkpoints must land on a mount that gets `rsync`-pulled to local **before** teardown
|
||
(§5). Disk fails on inodes before bytes and the real hog hides in a symlinked cache — audit the
|
||
actual mount with `du`, clean by value (keep tiny eval JSONs, prune large periodic checkpoints).
|
||
|
||
## 3. NETWORK
|
||
|
||
- **Egress/proxy:** host-dependent; there is no platform proxy hook. If the box sits behind the GFW,
|
||
set the mirror manually — `export HF_ENDPOINT=https://hf-mirror.com` (or `HF_HUB_ENABLE_HF_TRANSFER=1`
|
||
off-GFW) — and validate the speed test on the **same route** the real transfer uses (principle #7).
|
||
- **Port exposure:** expose services yourself. TensorBoard / Jupyter ride an SSH tunnel from the
|
||
local machine: `ssh -L 6006:localhost:6006 gpu-box` then open `http://<localhost>:6006`. There is
|
||
no console port-forward button.
|
||
- **SSH flavor:** direct-TCP key-based SSH — `scp`/`rsync` work normally (unlike the proxied SSH on
|
||
some rental platforms). If the provider hands out a non-standard port, pin it in the alias.
|
||
|
||
## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
|
||
|
||
A bare on-demand box has **no spot/preemption model by default** — it runs until manually stopped, so
|
||
the interruption to design against is an **SSH drop**, not an eviction. Without a detach primitive an
|
||
SSH drop sends SIGHUP and kills the job; `tmux` (§6) is what severs the job from the connection.
|
||
|
||
Resume is **self-built**: checkpoint full state (model + optimizer + scheduler + epoch/step + RNG +
|
||
dataloader position) atomically (`tmp`→`fsync`→`os.rename`) on a periodic timer, and load-latest
|
||
unconditionally on startup so the *identical launch command* resumes. Cadence formula + atomic-write
|
||
pattern → `references/spot-resilience.md`. (Spot-rented bare boxes exist — if the provider can evict,
|
||
treat it like the vast.ai profile: tiny/zero grace, checkpoint continuously.)
|
||
|
||
## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*
|
||
|
||
**Teardown is MANUAL and is the number-one cost failure on this profile.** Nothing reclaims the box:
|
||
no idle timer, no auto-release, no scheduler that ends the job. **A forgotten box bills 24/7** — an
|
||
overnight idle instance is the most expensive single mistake on metered hardware.
|
||
|
||
- The meter-stopping action is **provider-manual** (a console "stop"/"destroy", a `terminate` API, or
|
||
a phone call) — and on most bare rentals it is **irreversible** (deletes the disk).
|
||
- "Stop after pulling results" is a **mandatory final phase**, not an afterthought. Honor the
|
||
**teardown Iron Law**: no stop/destroy until checkpoints are **pulled to local AND verified by
|
||
load** (`scripts/verify_local.py`) **AND** the user has approved the cost-affecting action.
|
||
"It looked done in the log" is not evidence (principle #3). **REQUIRED:**
|
||
`superpowers:verification-before-completion`.
|
||
|
||
## 6. DAEMON TOOL
|
||
|
||
- **`tmux`** is the detach primitive: `tmux new -s train` → run inside → `Ctrl-b d` to detach;
|
||
`tmux attach -t train` to reattach, `tmux ls` to reconcile a watcher against the real session
|
||
(principle #3). It survives an SSH drop; it does **not** survive a box reboot — relaunch after one.
|
||
- **Fallback** when tmux is absent and cannot be installed: `nohup <cmd> </dev/null >log 2>&1 &` then
|
||
`disown`. Always redirect stdin from `/dev/null` so the job never blocks reading the terminal.
|
||
- **No native queue** — the operator IS the scheduler, monitor, and janitor. Use the parameterized
|
||
`scripts/run_queue.sh.template` for a resumable serial queue; never edit a queue script while it is
|
||
being read (principle #6 — version the filename).
|
||
|
||
## 7. TOP GOTCHAS (platform-pinned; universal ones → `references/gotchas_universal.md`)
|
||
|
||
- **GEN1 — Forgotten box bills 24/7.** Symptom: a week-old invoice for an instance that finished
|
||
training on day one. → Root cause: nothing on a bare box reclaims it; the human is the only janitor.
|
||
→ Fix: make teardown a tracked Phase-5 step; after the verified pull, prompt the user to stop/destroy
|
||
(never auto-act — principle #9); for cross-session safety set a `/schedule` reminder to re-check.
|
||
- **GEN2 — SSH drop kills the run (no tmux).** Symptom: training dies the moment the laptop sleeps or
|
||
the network blips. → Root cause: the job is a child of the SSH shell; the drop sends SIGHUP.
|
||
→ Fix: launch inside `tmux` (or `nohup … & disown`) **before** the long run starts — not after it is
|
||
already orphaned.
|
||
- **GEN3 — `scp` restarts from zero on a reset; `rsync` does not.** Symptom: a 40 GB re-sync that
|
||
never finishes over a flaky link. → Root cause: `scp` has no resume. → Fix: `rsync -avz --partial`
|
||
for every code/data/result transfer; wrap bulk pulls in a `timeout`+resume loop (principle #7).
|
||
- **GEN4 — CRLF breaks `.sh` on the Linux box.** Symptom: `bash: $'\r': command not found`, or a
|
||
shebang that "isn't found." → Root cause: a script authored on Windows carries CRLF line endings.
|
||
→ Fix: `.gitattributes` with `*.sh text eol=lf`; on-box unblock `sed -i 's/\r$//' run.sh`.
|
||
- **GEN5 — Heavy DL static-checked on the wrong machine.** Symptom: an OOM or a CUDA mismatch only
|
||
reproduces on the box. → Root cause: static/import checks ran locally, the real compute is remote.
|
||
→ Fix: run the cheap CPU smoke locally (Phase 2), run the heavy DL **on the box**; for the
|
||
bug-vs-effect call once it runs, defer to **REQUIRED:** `verifying-dl-experiments`.
|
||
- **GEN6 — A box reboot silently orphans the run (`tmux` does not survive it).** Symptom: a detached
|
||
job vanishes with a clean `dmesg`, idle GPU, and low `uptime`; `tmux ls` shows no sessions.
|
||
→ Root cause: `tmux`/`nohup` survive an SSH drop but **not** a host reboot — the rental rebooted (host
|
||
maintenance, kernel update, or an OOM that took the box) and every session died. → Fix: treat reboot
|
||
as one of the four "vanished process" causes (cross-link `references/gotchas_universal.md` U3); make
|
||
resume idempotent (§4) so the *same* launch command continues from the last checkpoint; for a box that
|
||
reboots often, add an `@reboot` cron or a systemd unit that re-launches the detached queue.
|
||
- **GEN7 — A second concurrent run silently halves throughput by oversubscribing the GPU.** Symptom: two
|
||
training runs on the "same idle GPU" both crawl, or the second OOMs on a card that looked free.
|
||
→ Root cause: a bare box has **no scheduler** — nothing prevents two processes sharing one GPU, so they
|
||
contend for VRAM and SM time. → Fix: the operator *is* the scheduler — serialize with the
|
||
`run_queue.sh` template, or pin each run to a distinct card with `CUDA_VISIBLE_DEVICES=<n>`; check
|
||
`nvidia-smi` for an existing holder before every launch (zombie holders → U11).
|
||
- **GEN8 — Watching a poll connection, not the run, declares a false death.** Symptom: the ssh-poll
|
||
drops and the run is pronounced dead, but the job finished fine and wrote `best.pth`. → Root cause: a
|
||
dropped *poll* connection ≠ the training dying; the two failure modes are conflated. → Fix: on any poll
|
||
drop, re-ssh and check ground truth directly (`pgrep -af train`, log tail, `best.pth` mtime) before
|
||
concluding anything (principle #3); robust short-connection poll template → U17.
|
||
|
||
### Platform-specific debugging (bare SSH)
|
||
|
||
The box has no console — every diagnostic is an ssh one-liner. Run these *separately* (a kill drops the
|
||
SSH, U1/U4), and bound each with `ssh -o ConnectTimeout=15 -o ServerAliveInterval=10` so a blip
|
||
self-kills instead of half-open hanging:
|
||
|
||
- **Is the run alive or orphaned?** `ssh gpu-box 'tmux ls; pgrep -af <train-script> | head'` — empty
|
||
`tmux ls` after a vanished log ⇒ reboot/HUP (GEN6); reconcile the watcher against the real session.
|
||
- **Why did it die (the 4-cause ladder)?** `ssh gpu-box 'dmesg 2>/dev/null | grep -iE "killed process|out of memory|Xid" | tail; uptime'` — OOM line ⇒ U9/U10; clean dmesg + low uptime ⇒ reboot (GEN6); `Xid 48/79` ⇒ dead GPU, re-rent (U22).
|
||
- **GPU health, not just util%:** `ssh gpu-box 'nvidia-smi dmon -s pucvmet -d 1 -c 5'` — read SM clock + power, not `GPU-Util` (a liar, U21); a holder `nvidia-smi` cannot see ⇒ `fuser -v /dev/nvidia*` (U11).
|
||
- **Disk before it bites:** `ssh gpu-box 'df -h <mount>; df -i <mount>'` — inodes hit 100% before bytes (U7); the byte-hog often hides in `~/.cache/huggingface` (`du -sh ~/.cache/huggingface/hub/models--* | sort -rh`).
|
||
- **Stuck download?** A transfer with a live process but a flat `df` is stalled, not progressing —
|
||
`ssh gpu-box 'ls -la --time-style=+%H:%M data/*.tmp; df -h <mount>'`; if the size has not moved, kill and
|
||
resume the per-dir loop (`scripts/download_loop.sh`, U12), never restart from zero.
|
||
|
||
## 8. SCRIPT OVERRIDES
|
||
|
||
Values to parameterize the `scripts/` templates for a bare-SSH box:
|
||
|
||
```
|
||
DATA_DIR=$HOME/proj (working dir / data disk on the box)
|
||
DURABLE_DIR=$HOME/proj (durable mount = the measured persistent disk; pull to local before teardown)
|
||
PROXY_HOOK= (none by default; set HF_ENDPOINT=https://hf-mirror.com only if behind the GFW)
|
||
CRED_FILE=~/.netrc on the box's local disk, streamed in via stdin — never onto a shared/durable FS
|
||
SCRATCH=*.latest.pth and periodic checkpoints (prune on success; keep best + tiny eval JSONs)
|
||
HF_HOME=$HOME/proj/.hf (redirect off the default ~/.cache so it lands on the data disk)
|
||
DETACH=tmux (the swappable plug — replaced by sbatch / Job / commit in the diffs below)
|
||
```
|
||
|
||
---
|
||
|
||
# THIN DIFF — SLURM *(sbatch replaces tmux)*
|
||
|
||
`kind: slurm` · meter = walltime/fairshare **quota, not dollars** · detach = `sbatch` · no teardown.
|
||
|
||
The scheduler owns the job's lifecycle: the operator **submits**, Slurm runs and detaches it.
|
||
`tmux+nohup` is **replaced** (not supplemented) by `sbatch` — a submitted batch job survives logout
|
||
with no tmux. A bare `srun` still **blocks and dies on terminal close** like a foreground process, so
|
||
wrap `srun` *inside* an `sbatch` script for long runs.
|
||
|
||
- **Submit / monitor / kill:** `sbatch job.sh` (returns a jobid immediately) · `squeue -u $USER`
|
||
(status — replaces "reattach tmux") · `sacct -j <jobid>` (post-mortem: exit code, maxRSS, elapsed)
|
||
· `scancel <jobid>` (kill). Logs go to `slurm-%j.out` (arrays: `slurm-%A_%a.out`) — file-based, same
|
||
logs-to-file contract as the baseline.
|
||
- **GPUs are declarative:** `#SBATCH --gres=gpu:a100:2` (or `--gpus=volta:3`); request, do not place.
|
||
Slurm's GRES plugin sets `CUDA_VISIBLE_DEVICES` per step (verified slurm.schedmd.com/gres.html 2026-06).
|
||
- **Walltime ceiling — the hard new constraint:** `#SBATCH --time=HH:MM:SS` and at the limit each task
|
||
is sent **SIGTERM, then SIGKILL after `KillWait` (default 30 s)** (verified slurm.schedmd.com/sbatch.html
|
||
+ slurm.conf 2026-06). Long training MUST checkpoint and requeue, not "run until done."
|
||
- **Preemption + checkpoint-on-signal:** on time-limit or scavenger-partition eviction the same
|
||
SIGTERM→KillWait→SIGKILL sequence applies. Arm `#SBATCH --signal=B:SIGTERM@360` for a ~6-minute warning
|
||
(the `B:` prefix signals the **batch shell**, not the steps; **Slurm may fire it up to 60 s EARLY** —
|
||
size the warning with that slack, verified slurm.schedmd.com/sbatch.html 2026-06), trap it to set a flag,
|
||
and `#SBATCH --requeue` to auto-return to the queue (the script restarts **from its beginning with the
|
||
same job ID**) and resume from the last checkpoint. Cadence formula → `references/spot-resilience.md`.
|
||
- **Native orchestration replaces hand-rolled fan-out:** `--array=0-15` (rate-limit with `%4`) fans out
|
||
ablation cells, `--dependency=afterok:<jobid>` chains stages (runs only on exit-code-0).
|
||
- **No per-hour teardown — watch fairshare.** Nodes are not `shutdown`; the job just ends. The
|
||
baseline's #1 risk (forgotten box) **disappears**, replaced by "don't blow the walltime/fairshare
|
||
allocation." There is nothing to stop.
|
||
- **No root, shared multi-tenant node:** cannot `apt install`. Use `module load cuda` or a container
|
||
(**Apptainer/Singularity** — Docker is usually banned).
|
||
- **Filesystem split:** the shared parallel FS (`$HOME`, `/scratch`) persists and is where checkpoints
|
||
go; node-local **`$TMPDIR` is wiped when the job ends** — stage scratch to `$TMPDIR`, checkpoint to
|
||
`/scratch`. Multi-node NCCL/fabric specifics → `references/multinode.md`.
|
||
|
||
### Slurm gotchas (platform-pinned; universal → `references/gotchas_universal.md`)
|
||
|
||
- **SLURM1 — Checkpoint *inside* the signal handler corrupts the checkpoint.** Symptom: `--requeue`
|
||
works most of the time, then intermittently writes a corrupt `hpc_ckpt` and the requeued job won't
|
||
load. → Root cause: a Python signal handler can fire **after any bytecode instruction** — including
|
||
mid-backward-pass — so checkpointing directly in the handler races with training (verified
|
||
github.com/Lightning-AI/pytorch-lightning#21406 2026-06). → Fix: the handler does the **minimum** —
|
||
set a flag; poll the flag in the training loop and checkpoint at a **safe point** (end of step), then
|
||
`scontrol requeue $SLURM_JOB_ID` or exit so `--requeue` returns it.
|
||
- **SLURM2 — Warning signal arrives too late; the SIGKILL lands mid-write.** Symptom: the
|
||
`--signal@360` trap fires but the checkpoint is half-written when SIGKILL hits. → Root cause: two
|
||
slacks compound — Slurm may send the warning **up to 60 s early OR late**, and at the actual wall the
|
||
`KillWait` grace is only ~30 s (verified slurm.schedmd.com 2026-06). → Fix: budget the warning so a
|
||
full checkpoint fits *before* the wall even with the 60 s jitter; checkpoint *periodically* too (never
|
||
rely on the one signal); make the write atomic (`tmp`→`fsync`→`rename`, U6) so a truncated file is
|
||
never loaded.
|
||
- **SLURM3 — `srun` inside `sbatch` no longer inherits `--cpus-per-task` (Slurm ≥ 22.05).** Symptom: a
|
||
nested `srun` hangs, sees one CPU, or under-threads the dataloader. → Root cause: since 22.05 `srun`
|
||
stopped reading `SLURM_CPUS_PER_TASK` and must be told explicitly (verified docs.icer.msu.edu 2026-06).
|
||
→ Fix: `srun -c $SLURM_CPUS_PER_TASK …`, or set `export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK`; pass
|
||
`--gpus-per-task`/`--gres` on the `srun` too — a step does not inherit the allocation's GRES by default.
|
||
- **SLURM4 — OOM is a job STATE, not a Python traceback.** Symptom: the job dies with no error in the
|
||
log; `sacct` shows `State=OUT_OF_MEMORY` (or `slurmstepd: Detected 1 oom-kill event(s)`). → Root cause:
|
||
Slurm cgroup sets a hard memory limit at (a fraction of) the requested `--mem`; exceeding it is an
|
||
OOM-kill the kernel performs (verified osc.edu / icer.msu.edu 2026-06). → Fix: read `sacct -o
|
||
MaxRSS,ReqMem` and raise `--mem`/`--mem-per-cpu` to MaxRSS×1.2; this is the cgroup-RAM OOM of U9
|
||
(dataloader workers × a big tensor), distinct from VRAM OOM (U10) — **do not** shrink batch for a
|
||
host-RAM OOM.
|
||
- **SLURM5 — `$TMPDIR` checkpoints evaporate when the job ends.** Symptom: a requeued/array job finds an
|
||
empty checkpoint dir. → Root cause: node-local `$TMPDIR` is wiped at job end; only the shared parallel
|
||
FS persists across a requeue or a different node. → Fix: stage *scratch* to `$TMPDIR` for speed, but
|
||
write **checkpoints to `/scratch/$USER`**; never point `DURABLE_DIR` at node-local storage.
|
||
|
||
### Slurm debugging (squeue / sacct / cgroup triage)
|
||
|
||
- **Still queued or running?** `squeue -u $USER -o '%i %T %r %M %l %R'` — the `%r` Reason column explains
|
||
a `PENDING` (e.g. `Resources`, `Priority`, `QOSMaxGPUPerUserLimit`); `%R` on a running job is the nodelist.
|
||
- **Post-mortem (why it ended):** `sacct -j <jobid> --format=JobID,State,ExitCode,DerivedExitCode,Elapsed,MaxRSS,ReqMem,Timelimit,NodeList`
|
||
— `State=TIMEOUT` ⇒ walltime kill (raise `--time` or requeue); `OUT_OF_MEMORY` ⇒ SLURM4; `PREEMPTED`/`NODE_FAIL`
|
||
⇒ requeue territory; `ExitCode` like `0:9` means killed by **signal 9** (SIGKILL — the KillWait expired).
|
||
- **Live resource use:** `sstat -j <jobid>.batch --format=JobID,MaxRSS,MaxVMSize` on a *running* step
|
||
(sacct only finalizes at exit); cross-check against `ReqMem` to catch a creeping leak before the cgroup kills it.
|
||
- **GPU actually allocated to the step?** inside the job: `echo $CUDA_VISIBLE_DEVICES && nvidia-smi -L`
|
||
— a mismatch ⇒ SLURM3 (`--gres`/`--gpus-per-task` not on the `srun`).
|
||
- **Multi-node hang** (job RUNNING, no progress) ⇒ NCCL/fabric, not Slurm → `references/multinode.md`.
|
||
|
||
**Slurm OVERRIDES:** `DETACH=sbatch` · `DURABLE_DIR=/scratch/$USER/proj` (durable) + `DATA_DIR=$TMPDIR`
|
||
(node-local, wiped) · `PROXY_HOOK=module load cuda` · teardown=`n/a (watch sacct + fairshare)`.
|
||
|
||
---
|
||
|
||
# THIN DIFF — KUBERNETES *(a Job manifest replaces the shell)*
|
||
|
||
`kind: kubernetes` · detach = a `Job` manifest (no shell) · persistence = a **PVC, non-optional**.
|
||
|
||
The unit of work is a **manifest**, not a session: `kubectl apply -f job.yaml`; the control plane
|
||
schedules a pod and a `Job` controller **replaces it on failure** up to `backoffLimit` (**default 4** —
|
||
each failure creates a *new* pod, it does not restart the old one; verified kubernetes.io Jobs doc
|
||
2026-06). The "detach from my connection" problem vanishes — the pod never had a connection to the shell.
|
||
|
||
- **GPUs:** `resources.limits: nvidia.com/gpu: 1`. Quirk (verified kubernetes.io scheduling-gpus 2026-06):
|
||
GPUs go in **`limits` only**; if `requests` is set it must **equal** `limits`, and you cannot set
|
||
`requests` without `limits`; GPUs are **integer, not shared or overcommitted** — one whole GPU per
|
||
container (absent MIG/time-slicing, which K8s does not provide out of the box). Provided by the NVIDIA
|
||
device-plugin DaemonSet.
|
||
- **Code delivery is different — no `rsync` into a pod.** Code is **baked into a container image**
|
||
(build → push to a registry) or pulled at pod start. This is the biggest workflow shift from the
|
||
baseline; pin the base image by `@sha256:` digest, not `:latest` (U30).
|
||
- **Persistence is the headline risk:** the **pod filesystem is EPHEMERAL by design.** On
|
||
death/restart/reschedule, anything written outside a mounted volume is **gone**. Checkpoints **must**
|
||
mount a **PersistentVolumeClaim** (or object storage) at `/checkpoints` — this is non-optional and is
|
||
the single most common way ML-on-K8s loses work.
|
||
- **Monitor:** `kubectl get pods` · `kubectl logs -f <pod>` (replaces `tail -f`). `kubectl exec -it …
|
||
-- bash` is a debugging tool, not the run mechanism — an exec session is not durable.
|
||
- **Declarative parallelism:** `Job` `parallelism`/`completions` (both default 1) for fan-out (the K8s
|
||
analog of Slurm arrays).
|
||
- **Lifecycle knobs:** `activeDeadlineSeconds` is the walltime analog (terminates the Job past the
|
||
deadline); `ttlSecondsAfterFinished` auto-GCs a finished Job; `terminationGracePeriodSeconds` (**default
|
||
30 s**, verified kubernetes.io 2026-06) is the SIGTERM→SIGKILL window — the K8s analog of Slurm
|
||
`KillWait`, so the same checkpoint-on-SIGTERM discipline applies.
|
||
- **Teardown is two-layered:** `kubectl delete job <name>` frees the *pod* (cheap), but the underlying
|
||
**node/cluster keeps costing** unless an autoscaler scales it down. **delete ≠ scale-down** — the
|
||
node release is the real cost lever, distinct from the baseline's single "destroy the box."
|
||
|
||
### Kubernetes gotchas (platform-pinned; universal → `references/gotchas_universal.md`)
|
||
|
||
- **K8S1 — Pod stuck `Pending`: `Insufficient nvidia.com/gpu`.** Symptom: `kubectl get pods` shows
|
||
`Pending`; the events read `0/N nodes are available: N Insufficient nvidia.com/gpu`. → Root cause:
|
||
*usually not* missing hardware — the **device-plugin DaemonSet** isn't running, so no node advertises
|
||
allocatable GPUs; or a taint blocks scheduling (verified kubenatives.com + GKE troubleshooting 2026-06).
|
||
→ Fix: `kubectl describe node <n> | grep -A4 -E 'Capacity|Allocatable'` — if `nvidia.com/gpu` is `0`,
|
||
the plugin is down: `kubectl get ds -n kube-system | grep nvidia` and `kubectl logs -n kube-system -l
|
||
k8s-app=nvidia-device-plugin`; add the matching toleration if the GPU nodes are tainted.
|
||
- **K8S2 — `RestartPolicy: Always` is rejected on a Job.** Symptom: `kubectl apply` errors that a Job's
|
||
pod template may only use `Never` or `OnFailure`. → Root cause: a Job is not a Deployment; only those
|
||
two restart policies are legal (verified kubernetes.io Jobs doc 2026-06). → Fix: use `OnFailure`
|
||
(restart the *container* in place — keeps `/checkpoints` warm) or `Never` (a fresh pod per attempt,
|
||
cleaner logs); never copy a Deployment's `Always`.
|
||
- **K8S3 — `ImagePullBackOff` / `ErrImagePull` after a registry push.** Symptom: the pod never starts;
|
||
events show `Back-off pulling image`. → Root cause: a private registry without an `imagePullSecrets`,
|
||
a wrong tag/digest, or a too-big layer timing out the pull. → Fix: `kubectl describe pod <p>` reads the
|
||
exact pull error; attach `imagePullSecrets`, pin a real `@sha256:` digest (U30), and pre-warm large
|
||
images onto the node pool.
|
||
- **K8S4 — `Multi-Attach error` on a rescheduled pod (RWO PVC).** Symptom: a pod stuck
|
||
`ContainerCreating` after a node failure: `Volume is already exclusively attached to one node`. → Root
|
||
cause: a **ReadWriteOnce** PVC can attach to **one node at a time**; on failover the old attachment
|
||
hasn't released, and two distributed-training pods on different nodes can never share an RWO volume
|
||
(verified discuss.kubernetes.io / bobcares.com 2026-06). → Fix: for multi-node training use
|
||
**ReadWriteMany** (NFS/EFS/CephFS) for the shared checkpoint dir, or pin co-dependent pods to one node
|
||
with affinity; on a stuck failover, force-detach via the cloud console or delete the old `VolumeAttachment`.
|
||
- **K8S5 — Pod `Evicted` mid-training under node disk pressure.** Symptom: a long run dies with
|
||
`status: Evicted, reason: The node was low on resource: ephemeral-storage`. → Root cause: container
|
||
logs, the writable layer, and `emptyDir` count as **ephemeral storage**; checkpoints/caches written
|
||
outside the PVC fill the node and the kubelet evicts the pod (verified jorijn.com / oneuptime.com
|
||
2026-06). → Fix: write **everything large to the PVC**, set `resources.limits.ephemeral-storage`,
|
||
rotate logs, and back `emptyDir` scratch with `sizeLimit`; this is the K8s face of the disk-full crash
|
||
(U6/U7).
|
||
- **K8S6 — Container runs but trains on CPU (GPU never attached).** Symptom: a pod runs to completion,
|
||
loss curves normal, ~100× too slow. → Root cause: the GPU limit was omitted, or `nvidia-smi` works on
|
||
the *node* but the container lacks the runtime/library path. → Fix: **validate `kubectl exec <p> --
|
||
nvidia-smi` before trusting a run**; ensure `resources.limits.nvidia.com/gpu` is set and the NVIDIA
|
||
container runtime is the default (this is U31 surfaced through K8s).
|
||
|
||
### Kubernetes debugging (kubectl triage)
|
||
|
||
- **Why is it Pending / not starting?** `kubectl describe pod <p>` — the **Events** section names it
|
||
directly (Insufficient GPU ⇒ K8S1; FailedScheduling taint; ImagePullBackOff ⇒ K8S3; FailedMount ⇒ K8S4).
|
||
- **Why did it die?** `kubectl get pod <p> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'`
|
||
— `reason: OOMKilled` ⇒ raise `resources.limits.memory` (cgroup-RAM, U9); `Error` + exit code ⇒ read logs.
|
||
- **Logs of a crashed/previous attempt:** `kubectl logs <p> --previous` (the current pod may be a fresh
|
||
retry with an empty log); `kubectl get events --sort-by=.lastTimestamp` for the cluster-wide timeline.
|
||
- **Did the node even offer GPUs?** `kubectl describe node <n> | grep -A4 Allocatable` — `nvidia.com/gpu: 0`
|
||
⇒ device plugin down (K8S1).
|
||
- **Is the PVC bound and mounted?** `kubectl get pvc` (`Bound`?) and `kubectl describe pod <p>` Volumes
|
||
section — an unbound PVC stalls the pod in `Pending`.
|
||
|
||
**K8s OVERRIDES:** `DETACH=k8s-job` · `DURABLE_DIR=/checkpoints` (PVC mount — required; RWX for multi-node)
|
||
· `CRED_FILE=""` — credentials arrive as a K8s Secret mounted as an env var (WANDB_API_KEY / HF_TOKEN),
|
||
never a file on disk and never baked into the image layer, so run_one's `[ -n "$CRED_FILE" ]` guard skips
|
||
the file read and the env var passes through · teardown=`kubectl delete` **+** scale the node pool down.
|
||
|
||
---
|
||
|
||
# THIN DIFF — COLAB / KAGGLE *(not SSH-orchestratable)*
|
||
|
||
`kind: notebook` · **no SSH, no tmux, no persistent disk, no real job abstraction.** The generic
|
||
core's central primitive ("detach + survive the session") cannot be satisfied directly — degrade to
|
||
**checkpoint-to-cloud + idempotent resume**. Teardown is automatic and free; the *opposite* problem to
|
||
the baseline — the work cannot be kept alive long enough.
|
||
|
||
**Colab (free tier):**
|
||
- **Idle timeout ~90 min** (no cell activity) and a hard **~12 h max VM lifetime**; on disconnect all
|
||
RAM, variables, models, and the local `/content` filesystem are **lost**. Limits are **dynamic and
|
||
unpublished** — GPU type/availability and the exact ceilings "vary over time" and GPU is best-effort,
|
||
can be denied or downgraded (verified research.google.com/colaboratory/faq.html 2026-06).
|
||
- **Free tier requires the browser tab to STAY OPEN** — *(verified — corrects the draft's "anti-idle
|
||
tricks are unreliable" framing)*: **background execution is a Pro+ paid feature**; on free tier closing
|
||
the tab stops the runtime shortly after (verified github.com/googlecolab/colabtools#4151 + community
|
||
reports 2026-06). So keep-alive hacks aren't merely *unreliable* — there is **no supported headless
|
||
background run at all** on free Colab. Design for the disconnect, do not fight it.
|
||
- **Only survival mechanism:** mount Google Drive and **checkpoint every epoch to Drive**; make the
|
||
entrypoint **resume-from-Drive idempotent** so the inevitable reconnect continues, not restarts.
|
||
|
||
**Kaggle (free tier) — slightly better, because of one real primitive:**
|
||
- **30 GPU-hours/week** floating quota (T4×2 or P100; resets weekly); **interactive idle timeout ~60 min**
|
||
and a **~9 h** session cap (verified kaggle.com/docs/efficient-gpu-usage + product-feedback 2026-06).
|
||
- **The one genuine headless-background primitive: "Save Version → Save & Run All (commit)."** It
|
||
snapshots the notebook and runs it **on a separate machine with no idle timer, surviving browser
|
||
close**, and **persists `/kaggle/working` (20 GB) as the committed version's output** (commit times out
|
||
at ~9 h GPU / ~12 h CPU). This is the closest thing to `sbatch` in the free-tier world — single it out
|
||
as Kaggle's detach primitive. Live monitoring is weak (Colab: watch the cell; Kaggle commit: inspect
|
||
only the finished version's logs).
|
||
- **Code delivery:** clone from GitHub or pull the platform's dataset mounts — no scp.
|
||
|
||
### Colab / Kaggle gotchas (platform-pinned; universal → `references/gotchas_universal.md`)
|
||
|
||
- **NB1 — Drive sync lag silently loses the "saved" checkpoint.** Symptom: training logs
|
||
`saved best.pth to /content/drive/...`, the runtime disconnects an hour later, and the file is **0 bytes
|
||
or absent** in Drive. → Root cause: writes to mounted Drive are **buffered and sync asynchronously** —
|
||
large files can take up to ~30 min to actually land, and an unmount/disconnect before the flush loses
|
||
them (verified github.com/googlecolab/colabtools#2607 + #4426 2026-06). → Fix: call
|
||
`drive.flush_and_unmount()` (or `os.fsync`) right after each checkpoint, keep checkpoints small, and
|
||
treat a checkpoint as durable **only after** it is visible in Drive — re-list it before trusting resume.
|
||
- **NB2 — Kaggle commit fails if any cell errors → the whole output is lost.** Symptom: "Save & Run All"
|
||
shows `committing…` forever or fails with a non-zero/`Code 0` error, and **nothing** in `/kaggle/working`
|
||
is saved. → Root cause: a commit re-runs the notebook **top-to-bottom on a fresh machine**; one failing
|
||
cell (or an interactive-only state, or a flaky cell) aborts the commit and discards its output (verified
|
||
kaggle.com/product-feedback/334753 + 59557 2026-06). → Fix: before committing, **Run All interactively**
|
||
end-to-end on a clean kernel (catch order/state bugs); guard long sections so a late failure still writes
|
||
partial results to `/kaggle/working`; rely on `/kaggle/working` (persisted), not in-memory variables.
|
||
- **NB3 — Kaggle batch (commit) run picks the WRONG accelerator / has no internet.** Symptom: a committed
|
||
run is glacial (ran on CPU) or fails to `pip install`/download. → Root cause: the **accelerator and
|
||
internet toggle are notebook settings the commit inherits** — a notebook left on "None"/internet-off
|
||
commits that way; internet also requires phone verification on the account. → Fix: set Accelerator =
|
||
GPU and Internet = On in the notebook *before* committing; verify with `torch.cuda.is_available()` in an
|
||
early cell so a CPU commit fails fast instead of wasting the 9 h.
|
||
- **NB4 — `/content` (Colab) and `/kaggle/temp` are scratch, not durable.** Symptom: results written to
|
||
`/content/...` or `/kaggle/temp` vanish on disconnect. → Root cause: only Drive (Colab) and
|
||
`/kaggle/working` (Kaggle committed output) survive the session; everything else is ephemeral. → Fix:
|
||
point `DURABLE_DIR` at the surviving path; never let the final artifact land only on scratch.
|
||
- **NB5 — Free Colab disconnect mid-epoch with no warning.** Symptom: the session simply dies; there is
|
||
**no SIGTERM, no grace window** to catch. → Root cause: unlike Slurm/K8s, a notebook eviction gives no
|
||
signal — the resume contract is the *only* defense. → Fix: checkpoint every N steps to Drive
|
||
(NB1-safe), make cell-1 resume-from-latest idempotent, and chain runs across sessions under the
|
||
per-session ceiling. There is no checkpoint-on-signal here (contrast Slurm `--signal` / K8s SIGTERM).
|
||
|
||
### Colab / Kaggle debugging (session-death triage)
|
||
|
||
- **What am I actually on?** First cell: `import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))`
|
||
and `!nvidia-smi` — catches a CPU-only Colab assignment or a CPU Kaggle commit (NB3) before wasting the session.
|
||
- **Is the checkpoint really in Drive?** `!ls -la /content/drive/MyDrive/proj/*.pth` *after* a
|
||
`drive.flush_and_unmount()` — a 0-byte or missing file ⇒ sync lag (NB1), do not teardown trusting it.
|
||
- **Did the Kaggle commit succeed?** Open the Version's **Logs** tab (the only post-mortem for a committed
|
||
run) — a failed cell shows there; the committed `/kaggle/working` is the artifact, not the editor state.
|
||
- **Disk full inside the notebook?** `!df -h` — `/kaggle/working` caps at 20 GB; HF cache and intermediate
|
||
files exhaust it fast (U6/U7), prune before the commit's final write.
|
||
|
||
**Colab/Kaggle OVERRIDES:** `DETACH=`Drive-checkpoint loop (Colab) / Save&Run-All commit (Kaggle) ·
|
||
`DURABLE_DIR=`Drive `/content/drive/MyDrive/proj` (Colab) / `/kaggle/working` (Kaggle) · teardown=`automatic`
|
||
· the pattern, every run: checkpoint every N steps → idempotent resume from cell 1 → keep each run
|
||
under the per-session ceiling → chain runs across sessions.
|