playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/profiles/generic-ssh.md

platform: generic-ssh
kind: ssh
meter_stop_verb: manual
meter_stop_irreversible: true
detach_primitive: tmux
spot_available: false
spot_grace: n/a
shared_fs: host-dependent
inode_cap: host-dependent
free_egress: host-dependent
china_mirror_needed: host-dependent
host_driver_cuda_max: host-dependent
local_nvme: host-dependent

Profile: generic-SSH — the DEFAULT (bare box) + Slurm / Kubernetes / Colab-Kaggle diffs

One-line purpose: the lowest-common-denominator profile for a box where SSH is the only control channel and teardown is manual — every other platform profile is a diff against this baseline.

Surface to the user up front (principle #10): ⚠️ Danger clock — there is usually no auto-release / idle timer to save you: a forgotten box bills 24/7 until you tear it down, and teardown is entirely manual (no platform safety net). Reality — you expose ports yourself (an ssh -L tunnel for TB/Jupyter); on Slurm a job dies at walltime — design the requeue.

Read this whole file before Phase 0 on any unbranded rental, then jump to the matching sub-section (Slurm / Kubernetes / Colab-Kaggle) if the backend is a scheduler, a cluster, or a notebook. Universal gotchas are NOT restated here — see references/gotchas_universal.md.

Table of contents (grep -in '<keyword>' profiles/generic-ssh.md to jump):

  • BASELINE: 8-field schema for the bare-SSH box (sections 1–8)
  • THIN DIFF — SLURM (sbatch replaces tmux)
  • THIN DIFF — KUBERNETES (a Job manifest replaces the shell)
  • THIN DIFF — COLAB / KAGGLE (not SSH-orchestratable)

The one load-bearing abstraction every backend below solves differently: detach the job from the connection, and make the result survive the session ending. Checkpoint-to-durable + idempotent resume (principle #8) is the invariant; the detach primitive (tmux / sbatch / Job / commit) is the swappable plug.


1. LAUNCH

  • Entry point: ssh user@host — key-based, fronted by an ~/.ssh/config alias so the rest of the workflow says ssh gpu-box. There is no platform API, console, or CLI — SSH is the only control channel (this is what makes the box "generic"). Set the alias per references/ssh_transport.md.
  • Push code: rsync -avz --partial ./proj/ gpu-box:~/proj/ — resumable, delta-only on re-syncs; prefer over scp (a reset scp restarts from zero). Pull results the same way, reversed.
  • Download weights/datasets ON the box, not over the local uplink: ssh gpu-box 'cd ~/proj && hf download <repo> --local-dir data' (or aws s3 cp, wget). The box almost always has a fatter, cheaper pipe to HF/S3 than a home connection — pushing a 50 GB checkpoint over a residential uplink is the classic self-inflicted stall. Transport verbs → REQUIRED: huggingface-skills:hf-cli.
  • Env contract: whatever the host ships. There is no prebuilt "base" guarantee — inspect which python && python -V && nvidia-smi first. If the image has a usable env, treat it as AutoDL's base (do not conda create on a throwaway box); if it is bare, conda create / venv once and pin it. State the seed/determinism in the run itself — no platform does it here (REQUIRED: verifying-dl-experiments).

verify: ssh gpu-box 'python -c "import torch;print(torch.cuda.is_available())"' prints True.
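
The rsync push above can be wrapped in a bounded retry loop so a flaky link resumes instead of stalling forever. This is a minimal sketch, not part of the skill's scripts: the function name `rsync_with_retry`, the attempt count, the timeout, and the backoff are all illustrative assumptions. Only the `rsync -avz --partial` command itself comes from the text.

```python
import subprocess
import time
from typing import Callable, Sequence


def rsync_with_retry(
    src: str,
    dst: str,
    attempts: int = 5,                 # assumption: tune to the link
    per_try_timeout_s: float = 600.0,  # assumption: bound each attempt
    backoff_s: float = 10.0,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Retry `rsync -avz --partial` until it exits 0 or the attempts run out."""
    cmd: Sequence[str] = ["rsync", "-avz", "--partial", src, dst]
    for attempt in range(1, attempts + 1):
        try:
            # --partial keeps a half-transferred file, so the next attempt
            # reuses it rather than starting that file from zero.
            result = run(cmd, timeout=per_try_timeout_s)
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # a hung link counts as a failed attempt; retry below
        if attempt < attempts:
            time.sleep(backoff_s)
    return False
```

A call such as `rsync_with_retry("./proj/", "gpu-box:~/proj/")` returns `True` only after a clean exit, which gives the caller something concrete to check before moving to the next phase.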

2. STORAGE MODEL (the survival matrix — principle #4)

The box gives one persistent disk that is yours to manage — no shared FS, no platform quota service, no automatic reclamation. Measure, never assume: run df -h && df -i <mount> live on the box. Caps are host-dependent — do not carry over an AutoDL ~200K-inode or ~200 GB constant.

| Tier | Path | Survives STOP? | Survives DESTROY? | Cap |
| --- | --- | --- | --- | --- |
| Root / home disk | /, ~ | yes (box keeps running) | no (destroy deletes the box) | host-dependent — df -h/df -i |
| Attached block volume (if any) | /path/to/mount | yes | depends on provider — verify before destroy | host-dependent |

The only "survival matrix" subtlety on a bare box: there is no stop/destroy distinction the platform enforces — the box runs until manually stopped, and a destroy wipes the disk with no undo. So checkpoints must land on a mount that gets rsync-pulled to local before teardown (§5). Disk fails on inodes before bytes and the real hog hides in a symlinked cache — audit the actual mount with du, clean by value (keep tiny eval JSONs, prune large periodic checkpoints).
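
For a training script that wants to check headroom itself before writing a checkpoint, the same "measure bytes and inodes" idea can be sketched with `os.statvfs`. The function name and the returned keys are illustrative assumptions. The fractions are computed from the space available to an unprivileged user, so they may differ by a point or two from the Use% column that `df` prints.

```python
import os


def disk_headroom(mount: str = "/") -> dict:
    """Return used fractions of bytes and inodes for the filesystem at `mount`."""
    st = os.statvfs(mount)
    total_bytes = st.f_blocks * st.f_frsize
    free_bytes = st.f_bavail * st.f_frsize  # space available to non-root users
    # Some filesystems report 0 total inodes; treat that as "no inode cap".
    inodes_used = 1 - st.f_favail / st.f_files if st.f_files else 0.0
    return {
        "bytes_used_frac": 1 - free_bytes / total_bytes if total_bytes else 0.0,
        "inodes_used_frac": inodes_used,
        "free_gib": free_bytes / 2**30,
    }
```

Checking both fractions matters because, as noted above, the inode count can run out while plenty of bytes remain.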

3. NETWORK

  • Egress/proxy: host-dependent; there is no platform proxy hook. If the box sits behind the GFW, set the mirror manually — export HF_ENDPOINT=https://hf-mirror.com (or HF_HUB_ENABLE_HF_TRANSFER=1 off-GFW) — and validate the speed test on the same route the real transfer uses (principle #7).
  • Port exposure: expose services yourself. TensorBoard / Jupyter ride an SSH tunnel from the local machine: ssh -L 6006:localhost:6006 gpu-box then open http://localhost:6006. There is no console port-forward button.
  • SSH flavor: direct-TCP key-based SSH — scp/rsync work normally (unlike the proxied SSH on some rental platforms). If the provider hands out a non-standard port, pin it in the alias.

4. SPOT / INTERRUPTION + RESUME (principle #7/#8)

A bare on-demand box has no spot/preemption model by default — it runs until manually stopped, so the interruption to design against is an SSH drop, not an eviction. Without a detach primitive an SSH drop sends SIGHUP and kills the job; tmux (§6) is what severs the job from the connection.

Resume is self-built: checkpoint full state (model + optimizer + scheduler + epoch/step + RNG + dataloader position) atomically (tmp → fsync → os.rename) on a periodic timer, and load-latest unconditionally on startup so the identical launch command resumes. Cadence formula + atomic-write pattern → references/spot-resilience.md. (Spot-rented bare boxes exist — if the provider can evict, treat it like the vast.ai profile: tiny/zero grace, checkpoint continuously.)
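
The atomic write and the unconditional load-latest can be sketched with the standard library alone. This is an illustration under assumptions, not the pattern from references/spot-resilience.md: `save_atomic` and `load_latest` are hypothetical names, `pickle` stands in for `torch.save` / `torch.load` so the sketch runs without PyTorch, and `os.replace` is swapped in for `os.rename` because it also overwrites an existing target on Windows.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


def save_atomic(state: Any, path: Path) -> None:
    """Write `state` so a reader sees the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)   # stand-in for torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())    # force the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)     # never leave a stray temp file behind
        raise


def load_latest(path: Path) -> Optional[Any]:
    """Return the saved state, or None on a fresh start."""
    if not path.exists():
        return None
    with path.open("rb") as f:
        return pickle.load(f)
```

Because `load_latest` returns `None` on a fresh box and the last state otherwise, the training entrypoint can call it unconditionally, which is what lets the identical launch command either start or resume.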

5. TEARDOWN / BILLING (principle #9 + the Iron Law)

Teardown is MANUAL and is the number-one cost failure on this profile. Nothing reclaims the box: no idle timer, no auto-release, no scheduler that ends the job. A forgotten box bills 24/7 — an overnight idle instance is the most expensive single mistake on metered hardware.

  • The meter-stopping action is provider-manual (a console "stop"/"destroy", a terminate API, or a phone call) — and on most bare rentals it is irreversible (deletes the disk).
  • "Stop after pulling results" is a mandatory final phase, not an afterthought. Honor the teardown Iron Law: no stop/destroy until checkpoints are pulled to local AND verified by load (scripts/verify_local.py) AND the user has approved the cost-affecting action. "It looked done in the log" is not evidence (principle #3). REQUIRED: superpowers:verification-before-completion.

6. DAEMON TOOL

  • tmux is the detach primitive: tmux new -s train → run inside → Ctrl-b d to detach; tmux attach -t train to reattach, tmux ls to reconcile a watcher against the real session (principle #3). It survives an SSH drop; it does not survive a box reboot — relaunch after one.
  • Fallback when tmux is absent and cannot be installed: nohup <cmd> </dev/null >log 2>&1 & then disown. Always redirect stdin from /dev/null so the job never blocks reading the terminal.
  • No native queue — the operator IS the scheduler, monitor, and janitor. Use the parameterized scripts/run_queue.sh.template for a resumable serial queue; never edit a queue script while it is being read (principle #6 — version the filename).

7. TOP GOTCHAS (platform-pinned; universal ones → references/gotchas_universal.md)

  • GEN1 — Forgotten box bills 24/7. Symptom: a week-old invoice for an instance that finished training on day one. → Root cause: nothing on a bare box reclaims it; the human is the only janitor. → Fix: make teardown a tracked Phase-5 step; after the verified pull, prompt the user to stop/destroy (never auto-act — principle #9); for cross-session safety set a /schedule reminder to re-check.
  • GEN2 — SSH drop kills the run (no tmux). Symptom: training dies the moment the laptop sleeps or the network blips. → Root cause: the job is a child of the SSH shell; the drop sends SIGHUP. → Fix: launch inside tmux (or nohup … & disown) before the long run starts — not after it is already orphaned.
  • GEN3 — scp restarts from zero on a reset; rsync does not. Symptom: a 40 GB re-sync that never finishes over a flaky link. → Root cause: scp has no resume. → Fix: rsync -avz --partial for every code/data/result transfer; wrap bulk pulls in a timeout+resume loop (principle #7).
  • GEN4 — CRLF breaks .sh on the Linux box. Symptom: bash: $'\r': command not found, or a shebang that "isn't found." → Root cause: a script authored on Windows carries CRLF line endings. → Fix: .gitattributes with *.sh text eol=lf; on-box unblock sed -i 's/\r$//' run.sh.
  • GEN5 — Heavy DL static-checked on the wrong machine. Symptom: an OOM or a CUDA mismatch only reproduces on the box. → Root cause: static/import checks ran locally, the real compute is remote. → Fix: run the cheap CPU smoke locally (Phase 2), run the heavy DL on the box; for the bug-vs-effect call once it runs, defer to REQUIRED: verifying-dl-experiments.
  • GEN6 — A box reboot silently orphans the run (tmux does not survive it). Symptom: a detached job vanishes with a clean dmesg, idle GPU, and low uptime; tmux ls shows no sessions. → Root cause: tmux/nohup survive an SSH drop but not a host reboot — the rental rebooted (host maintenance, kernel update, or an OOM that took the box) and every session died. → Fix: treat reboot as one of the four "vanished process" causes (cross-link references/gotchas_universal.md U3); make resume idempotent (§4) so the same launch command continues from the last checkpoint; for a box that reboots often, add an @reboot cron or a systemd unit that re-launches the detached queue.
  • GEN7 — A second concurrent run silently halves throughput by oversubscribing the GPU. Symptom: two training runs on the "same idle GPU" both crawl, or the second OOMs on a card that looked free. → Root cause: a bare box has no scheduler — nothing prevents two processes sharing one GPU, so they contend for VRAM and SM time. → Fix: the operator is the scheduler — serialize with the run_queue.sh template, or pin each run to a distinct card with CUDA_VISIBLE_DEVICES=<n>; check nvidia-smi for an existing holder before every launch (zombie holders → U11).
  • GEN8 — Watching a poll connection, not the run, declares a false death. Symptom: the ssh-poll drops and the run is pronounced dead, but the job finished fine and wrote best.pth. → Root cause: a dropped poll connection ≠ the training dying; the two failure modes are conflated. → Fix: on any poll drop, re-ssh and check ground truth directly (pgrep -af train, log tail, best.pth mtime) before concluding anything (principle #3); robust short-connection poll template → U17.
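
The pre-launch holder check from GEN7 can be scripted by parsing `nvidia-smi`'s CSV query output. This is a rough sketch: the function names and the 500 MiB "looks idle" threshold are assumptions, and a holder that `nvidia-smi` cannot see (U11) will not show up here either.

```python
import subprocess
from typing import List

# Real nvidia-smi query flags; prints one "index, used_MiB" row per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv,noheader,nounits"]


def free_gpus(csv_text: str, max_used_mib: int = 500) -> List[int]:
    """Return indices of GPUs whose used memory is at or below the threshold."""
    free = []
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        if not used.isdigit():
            continue  # skip rows where the driver reports a non-numeric value
        if int(used) <= max_used_mib:
            free.append(int(index))
    return free


def query_free_gpus(max_used_mib: int = 500) -> List[int]:
    """Run nvidia-smi on this machine and return the GPUs that look idle."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return free_gpus(out, max_used_mib)
```

A launcher could pick the first index from `query_free_gpus()` and export it as `CUDA_VISIBLE_DEVICES`, or refuse to start when the list is empty.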

Platform-specific debugging (bare SSH)

The box has no console — every diagnostic is an ssh one-liner. Run these separately (a kill drops the SSH, U1/U4), and bound each with ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 so a blip self-kills instead of half-open hanging:

  • Is the run alive or orphaned? ssh gpu-box 'tmux ls; pgrep -af <train-script> | head' — empty tmux ls after a vanished log ⇒ reboot/HUP (GEN6); reconcile the watcher against the real session.
  • Why did it die (the 4-cause ladder)? ssh gpu-box 'dmesg 2>/dev/null | grep -iE "killed process|out of memory|Xid" | tail; uptime' — OOM line ⇒ U9/U10; clean dmesg + low uptime ⇒ reboot (GEN6); Xid 48/79 ⇒ dead GPU, re-rent (U22).
  • GPU health, not just util%: ssh gpu-box 'nvidia-smi dmon -s pucvmet -d 1 -c 5' — read SM clock + power, not GPU-Util (a liar, U21); a holder nvidia-smi cannot see ⇒ fuser -v /dev/nvidia* (U11).
  • Disk before it bites: ssh gpu-box 'df -h <mount>; df -i <mount>' — inodes hit 100% before bytes (U7); the byte-hog often hides in ~/.cache/huggingface (du -sh ~/.cache/huggingface/hub/models--* | sort -rh).
  • Stuck download? A transfer with a live process but a flat df is stalled, not progressing — ssh gpu-box 'ls -la --time-style=+%H:%M data/*.tmp; df -h <mount>'; if the size has not moved, kill and resume the per-dir loop (scripts/download_loop.sh, U12), never restart from zero.

8. SCRIPT OVERRIDES

Values to parameterize the scripts/ templates for a bare-SSH box:

DATA_DIR=$HOME/proj    (working dir / data disk on the box)
DURABLE_DIR=$HOME/proj (durable mount = the measured persistent disk; pull to local before teardown)
PROXY_HOOK=        (none by default; set HF_ENDPOINT=https://hf-mirror.com only if behind the GFW)
CRED_FILE=~/.netrc on the box's local disk, streamed in via stdin — never onto a shared/durable FS
SCRATCH=*.latest.pth and periodic checkpoints  (prune on success; keep best + tiny eval JSONs)
HF_HOME=$HOME/proj/.hf  (redirect off the default ~/.cache so it lands on the data disk)
DETACH=tmux            (the swappable plug — replaced by sbatch / Job / commit in the diffs below)

THIN DIFF — SLURM (sbatch replaces tmux)

kind: slurm · meter = walltime/fairshare quota, not dollars · detach = sbatch · no teardown.

The scheduler owns the job's lifecycle: the operator submits, Slurm runs and detaches it. tmux+nohup is replaced (not supplemented) by sbatch — a submitted batch job survives logout with no tmux. A bare srun still blocks and dies on terminal close like a foreground process, so wrap srun inside an sbatch script for long runs.

  • Submit / monitor / kill: sbatch job.sh (returns a jobid immediately) · squeue -u $USER (status — replaces "reattach tmux") · sacct -j <jobid> (post-mortem: exit code, maxRSS, elapsed) · scancel <jobid> (kill). Logs go to slurm-%j.out (arrays: slurm-%A_%a.out) — file-based, same logs-to-file contract as the baseline.
  • GPUs are declarative: #SBATCH --gres=gpu:a100:2 (or --gpus=volta:3); request, do not place. Slurm's GRES plugin sets CUDA_VISIBLE_DEVICES per step (verified slurm.schedmd.com/gres.html 2026-06).
  • Walltime ceiling — the hard new constraint: #SBATCH --time=HH:MM:SS and at the limit each task is sent SIGTERM, then SIGKILL after KillWait (default 30 s) (verified slurm.schedmd.com/sbatch.html + slurm.conf 2026-06). Long training MUST checkpoint and requeue, not "run until done."
  • Preemption + checkpoint-on-signal: on time-limit or scavenger-partition eviction the same SIGTERM→KillWait→SIGKILL sequence applies. Arm #SBATCH --signal=B:SIGTERM@360 for a ~6-minute warning (the B: prefix signals the batch shell, not the steps; Slurm may fire it up to 60 s EARLY — size the warning with that slack, verified slurm.schedmd.com/sbatch.html 2026-06), trap it to set a flag, and #SBATCH --requeue to auto-return to the queue (the script restarts from its beginning with the same job ID) and resume from the last checkpoint. Cadence formula → references/spot-resilience.md.
  • Native orchestration replaces hand-rolled fan-out: --array=0-15 (rate-limit with %4) fans out ablation cells, --dependency=afterok:<jobid> chains stages (runs only on exit-code-0).
  • No per-hour teardown — watch fairshare. Nodes are not shut down; the job just ends. The baseline's #1 risk (forgotten box) disappears, replaced by "don't blow the walltime/fairshare allocation." There is nothing to stop.
  • No root, shared multi-tenant node: cannot apt install. Use module load cuda or a container (Apptainer/Singularity — Docker is usually banned).
  • Filesystem split: the shared parallel FS ($HOME, /scratch) persists and is where checkpoints go; node-local $TMPDIR is wiped when the job ends — stage scratch to $TMPDIR, checkpoint to /scratch. Multi-node NCCL/fabric specifics → references/multinode.md.
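
Sizing the `--signal` warning is simple arithmetic, sketched here under assumptions: the lead time must cover the 60 s scheduling slack mentioned above, the step that is in flight when the signal lands, the checkpoint write itself, and a safety margin. The function names and the 30 s default margin are illustrative, not values from the Slurm documentation.

```python
import math


def signal_lead_seconds(
    ckpt_write_s: float,     # measured time to write one full checkpoint
    step_s: float,           # measured time of one training step
    slack_s: float = 60.0,   # signal-timing slack cited in the text
    margin_s: float = 30.0,  # assumption: extra safety margin
) -> int:
    """Seconds before the walltime limit at which the warning should fire."""
    return math.ceil(slack_s + step_s + ckpt_write_s + margin_s)


def sbatch_signal_line(ckpt_write_s: float, step_s: float) -> str:
    """Format the directive in the same form the text uses."""
    return f"#SBATCH --signal=B:SIGTERM@{signal_lead_seconds(ckpt_write_s, step_s)}"
```

For a checkpoint that takes 120 s to write and a 5 s step, this yields `#SBATCH --signal=B:SIGTERM@215`; the ~6-minute value in the text leaves more room than that.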

Slurm gotchas (platform-pinned; universal → references/gotchas_universal.md)

  • SLURM1 — Checkpoint inside the signal handler corrupts the checkpoint. Symptom: --requeue works most of the time, then intermittently writes a corrupt hpc_ckpt and the requeued job won't load. → Root cause: a Python signal handler can fire after any bytecode instruction — including mid-backward-pass — so checkpointing directly in the handler races with training (verified github.com/Lightning-AI/pytorch-lightning#21406 2026-06). → Fix: the handler does the minimum — set a flag; poll the flag in the training loop and checkpoint at a safe point (end of step), then scontrol requeue $SLURM_JOB_ID or exit so --requeue returns it.
  • SLURM2 — Warning signal arrives too late; the SIGKILL lands mid-write. Symptom: the --signal@360 trap fires but the checkpoint is half-written when SIGKILL hits. → Root cause: two slacks compound — Slurm may send the warning up to 60 s early OR late, and at the actual wall the KillWait grace is only ~30 s (verified slurm.schedmd.com 2026-06). → Fix: budget the warning so a full checkpoint fits before the wall even with the 60 s jitter; checkpoint periodically too (never rely on the one signal); make the write atomic (tmp → fsync → rename, U6) so a truncated file is never loaded.
  • SLURM3 — srun inside sbatch no longer inherits --cpus-per-task (Slurm ≥ 22.05). Symptom: a nested srun hangs, sees one CPU, or under-threads the dataloader. → Root cause: since 22.05 srun stopped reading SLURM_CPUS_PER_TASK and must be told explicitly (verified docs.icer.msu.edu 2026-06). → Fix: srun -c $SLURM_CPUS_PER_TASK …, or set export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK; pass --gpus-per-task/--gres on the srun too — a step does not inherit the allocation's GRES by default.
  • SLURM4 — OOM is a job STATE, not a Python traceback. Symptom: the job dies with no error in the log; sacct shows State=OUT_OF_MEMORY (or slurmstepd: Detected 1 oom-kill event(s)). → Root cause: Slurm cgroup sets a hard memory limit at (a fraction of) the requested --mem; exceeding it is an OOM-kill the kernel performs (verified osc.edu / icer.msu.edu 2026-06). → Fix: read sacct -o MaxRSS,ReqMem and raise --mem/--mem-per-cpu to MaxRSS×1.2; this is the cgroup-RAM OOM of U9 (dataloader workers × a big tensor), distinct from VRAM OOM (U10) — do not shrink batch for a host-RAM OOM.
  • SLURM5 — $TMPDIR checkpoints evaporate when the job ends. Symptom: a requeued/array job finds an empty checkpoint dir. → Root cause: node-local $TMPDIR is wiped at job end; only the shared parallel FS persists across a requeue or a different node. → Fix: stage scratch to $TMPDIR for speed, but write checkpoints to /scratch/$USER; never point DURABLE_DIR at node-local storage.
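
The SLURM1 fix (the handler only sets a flag, the loop checkpoints at a safe point) can be sketched as follows. `StopFlag` and `train` are hypothetical names, and `do_step` / `save` are placeholders for the real training step and the real checkpoint writer.

```python
import os
import signal
from typing import Callable, Optional


class StopFlag:
    """Records that a signal arrived; the handler does nothing else."""

    def __init__(self, signum: int = signal.SIGTERM) -> None:
        self.requested = False
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame) -> None:
        # Minimum work only: no I/O and no checkpointing inside the handler.
        self.requested = True


def train(
    steps: int,
    flag: StopFlag,
    do_step: Callable[[int], None],
    save: Callable[[int], None],
) -> Optional[int]:
    """Run steps; on a pending stop request, checkpoint after the step and return."""
    for step in range(steps):
        do_step(step)
        if flag.requested:
            save(step)   # safe point: the step has fully finished
            return step  # the caller exits here so --requeue can return the job
    return None
```

In a real job the code after `train` would exit (or call `scontrol requeue $SLURM_JOB_ID`, as described above), and `save` would write to the shared filesystem rather than `$TMPDIR`.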

Slurm debugging (squeue / sacct / cgroup triage)

  • Still queued or running? squeue -u $USER -o '%i %T %r %M %l %R' — the %r Reason column explains a PENDING (e.g. Resources, Priority, QOSMaxGPUPerUserLimit); %R on a running job is the nodelist.
  • Post-mortem (why it ended): sacct -j <jobid> --format=JobID,State,ExitCode,DerivedExitCode,Elapsed,MaxRSS,ReqMem,Timelimit,NodeList and read the State column: TIMEOUT ⇒ walltime kill (raise --time or requeue); OUT_OF_MEMORY ⇒ SLURM4; PREEMPTED/NODE_FAIL ⇒ requeue territory; ExitCode like 0:9 means killed by signal 9 (SIGKILL — the KillWait expired).
  • Live resource use: sstat -j <jobid>.batch --format=JobID,MaxRSS,MaxVMSize on a running step (sacct only finalizes at exit); cross-check against ReqMem to catch a creeping leak before the cgroup kills it.
  • GPU actually allocated to the step? inside the job: echo $CUDA_VISIBLE_DEVICES && nvidia-smi -L — a mismatch ⇒ SLURM3 (--gres/--gpus-per-task not on the srun).
  • Multi-node hang (job RUNNING, no progress) ⇒ NCCL/fabric, not Slurm → references/multinode.md.
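
The MaxRSS×1.2 rule from SLURM4 can be turned into a small helper. This sketch assumes `sacct` prints MaxRSS with a K/M/G/T suffix (for example `15728640K`) and that the job has a single task per node, so the peak task RSS is a fair proxy for the node request. The function names are illustrative. `Decimal` is used so the 1.2 factor does not introduce floating-point rounding.

```python
import math
from decimal import Decimal

# Multipliers from each suffix to KiB.
_TO_KIB = {"K": 1, "M": 1024, "G": 1024**2, "T": 1024**3}


def parse_kib(value: str) -> Decimal:
    """Parse a suffixed size such as '15728640K' or '1.5G' into KiB."""
    value = value.strip()
    suffix = value[-1:].upper()
    if suffix not in _TO_KIB:
        raise ValueError(f"expected a K/M/G/T suffix, got {value!r}")
    return Decimal(value[:-1]) * _TO_KIB[suffix]


def suggested_mem_mib(max_rss: str, headroom: str = "1.2") -> int:
    """Return MaxRSS times the headroom factor, rounded up to whole MiB."""
    return math.ceil(parse_kib(max_rss) * Decimal(headroom) / 1024)


def sbatch_mem_flag(max_rss: str) -> str:
    """Format the result as a --mem value in megabytes."""
    return f"--mem={suggested_mem_mib(max_rss)}M"
```

A MaxRSS of `15728640K` (15 GiB) gives `--mem=18432M`.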

Slurm OVERRIDES: DETACH=sbatch · DURABLE_DIR=/scratch/$USER/proj (durable) + DATA_DIR=$TMPDIR (node-local, wiped) · PROXY_HOOK=module load cuda · teardown=n/a (watch sacct + fairshare).


THIN DIFF — KUBERNETES (a Job manifest replaces the shell)

kind: kubernetes · detach = a Job manifest (no shell) · persistence = a PVC, non-optional.

The unit of work is a manifest, not a session: kubectl apply -f job.yaml; the control plane schedules a pod and a Job controller replaces it on failure up to backoffLimit (default 4 — each failure creates a new pod, it does not restart the old one; verified kubernetes.io Jobs doc 2026-06). The "detach from my connection" problem vanishes — the pod never had a connection to the shell.

  • GPUs: resources.limits: nvidia.com/gpu: 1. Quirk (verified kubernetes.io scheduling-gpus 2026-06): GPUs go in limits only; if requests is set it must equal limits, and you cannot set requests without limits; GPUs are integer, not shared or overcommitted — one whole GPU per container (absent MIG/time-slicing, which K8s does not provide out of the box). Provided by the NVIDIA device-plugin DaemonSet.
  • Code delivery is different — no rsync into a pod. Code is baked into a container image (build → push to a registry) or pulled at pod start. This is the biggest workflow shift from the baseline; pin the base image by @sha256: digest, not :latest (U30).
  • Persistence is the headline risk: the pod filesystem is EPHEMERAL by design. On death/restart/reschedule, anything written outside a mounted volume is gone. Checkpoints must mount a PersistentVolumeClaim (or object storage) at /checkpoints — this is non-optional and is the single most common way ML-on-K8s loses work.
  • Monitor: kubectl get pods · kubectl logs -f <pod> (replaces tail -f). kubectl exec -it … -- bash is a debugging tool, not the run mechanism — an exec session is not durable.
  • Declarative parallelism: Job parallelism/completions (both default 1) for fan-out (the K8s analog of Slurm arrays).
  • Lifecycle knobs: activeDeadlineSeconds is the walltime analog (terminates the Job past the deadline); ttlSecondsAfterFinished auto-GCs a finished Job; terminationGracePeriodSeconds (default 30 s, verified kubernetes.io 2026-06) is the SIGTERM→SIGKILL window — the K8s analog of Slurm KillWait, so the same checkpoint-on-SIGTERM discipline applies.
  • Teardown is two-layered: kubectl delete job <name> frees the pod (cheap), but the underlying node/cluster keeps costing unless an autoscaler scales it down. delete ≠ scale-down — the node release is the real cost lever, distinct from the baseline's single "destroy the box."
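
The points above (GPU in limits, a PVC at /checkpoints, a Job-legal restart policy, a deadline) can be put together as a manifest. The sketch builds it as a Python dict and dumps JSON, which `kubectl apply -f -` accepts on stdin. The job name, image reference, claim name, and the helper `training_job` are all hypothetical placeholders.

```python
import json
from typing import List, Optional


def training_job(
    name: str,
    image: str,
    command: List[str],
    pvc_claim: str,
    gpus: int = 1,
    deadline_s: Optional[int] = None,
) -> dict:
    """Build a batch/v1 Job with a GPU limit and a PVC mounted at /checkpoints."""
    container = {
        "name": "train",
        "image": image,
        "command": command,
        # GPUs go under limits; whole integers only.
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
        "volumeMounts": [{"name": "ckpt", "mountPath": "/checkpoints"}],
    }
    spec = {
        "backoffLimit": 4,  # the default, written out for visibility
        "template": {
            "spec": {
                "restartPolicy": "OnFailure",  # Always is not legal on a Job
                "containers": [container],
                "volumes": [
                    {"name": "ckpt", "persistentVolumeClaim": {"claimName": pvc_claim}}
                ],
            }
        },
    }
    if deadline_s is not None:
        spec["activeDeadlineSeconds"] = deadline_s  # the walltime analog
    return {"apiVersion": "batch/v1", "kind": "Job", "metadata": {"name": name}, "spec": spec}


if __name__ == "__main__":
    job = training_job(
        "train-run",                                   # hypothetical name
        "registry.example.com/train@sha256:<digest>",  # placeholder image
        ["python", "train.py"],
        "ckpt-pvc",                                    # hypothetical claim
        deadline_s=6 * 3600,
    )
    print(json.dumps(job, indent=2))
```

Piping the script's output into `kubectl apply -f -` submits the Job; the PVC named by `pvc_claim` has to exist already.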

Kubernetes gotchas (platform-pinned; universal → references/gotchas_universal.md)

  • K8S1 — Pod stuck Pending: Insufficient nvidia.com/gpu. Symptom: kubectl get pods shows Pending; the events read 0/N nodes are available: N Insufficient nvidia.com/gpu. → Root cause: usually not missing hardware — the device-plugin DaemonSet isn't running, so no node advertises allocatable GPUs; or a taint blocks scheduling (verified kubenatives.com + GKE troubleshooting 2026-06). → Fix: kubectl describe node <n> | grep -A4 -E 'Capacity|Allocatable' — if nvidia.com/gpu is 0, the plugin is down: kubectl get ds -n kube-system | grep nvidia and kubectl logs -n kube-system -l k8s-app=nvidia-device-plugin; add the matching toleration if the GPU nodes are tainted.
  • K8S2 — RestartPolicy: Always is rejected on a Job. Symptom: kubectl apply errors that a Job's pod template may only use Never or OnFailure. → Root cause: a Job is not a Deployment; only those two restart policies are legal (verified kubernetes.io Jobs doc 2026-06). → Fix: use OnFailure (restart the container in place — keeps /checkpoints warm) or Never (a fresh pod per attempt, cleaner logs); never copy a Deployment's Always.
  • K8S3 — ImagePullBackOff / ErrImagePull after a registry push. Symptom: the pod never starts; events show Back-off pulling image. → Root cause: a private registry without an imagePullSecrets, a wrong tag/digest, or a too-big layer timing out the pull. → Fix: kubectl describe pod <p> reads the exact pull error; attach imagePullSecrets, pin a real @sha256: digest (U30), and pre-warm large images onto the node pool.
  • K8S4 — Multi-Attach error on a rescheduled pod (RWO PVC). Symptom: a pod stuck ContainerCreating after a node failure: Volume is already exclusively attached to one node. → Root cause: a ReadWriteOnce PVC can attach to one node at a time; on failover the old attachment hasn't released, and two distributed-training pods on different nodes can never share an RWO volume (verified discuss.kubernetes.io / bobcares.com 2026-06). → Fix: for multi-node training use ReadWriteMany (NFS/EFS/CephFS) for the shared checkpoint dir, or pin co-dependent pods to one node with affinity; on a stuck failover, force-detach via the cloud console or delete the old VolumeAttachment.
  • K8S5 — Pod Evicted mid-training under node disk pressure. Symptom: a long run dies with status: Evicted, reason: The node was low on resource: ephemeral-storage. → Root cause: container logs, the writable layer, and emptyDir count as ephemeral storage; checkpoints/caches written outside the PVC fill the node and the kubelet evicts the pod (verified jorijn.com / oneuptime.com 2026-06). → Fix: write everything large to the PVC, set resources.limits.ephemeral-storage, rotate logs, and back emptyDir scratch with sizeLimit; this is the K8s face of the disk-full crash (U6/U7).
  • K8S6 — Container runs but trains on CPU (GPU never attached). Symptom: a pod runs to completion, loss curves normal, ~100× too slow. → Root cause: the GPU limit was omitted, or nvidia-smi works on the node but the container lacks the runtime/library path. → Fix: validate kubectl exec <p> -- nvidia-smi before trusting a run; ensure resources.limits.nvidia.com/gpu is set and the NVIDIA container runtime is the default (this is U31 surfaced through K8s).

Kubernetes debugging (kubectl triage)

  • Why is it Pending / not starting? kubectl describe pod <p> — the Events section names it directly (Insufficient GPU ⇒ K8S1; FailedScheduling taint; ImagePullBackOff ⇒ K8S3; FailedMount ⇒ K8S4).
  • Why did it die? kubectl get pod <p> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' and read the reason: OOMKilled ⇒ raise resources.limits.memory (cgroup-RAM, U9); Error + exit code ⇒ read logs.
  • Logs of a crashed/previous attempt: kubectl logs <p> --previous (the current pod may be a fresh retry with an empty log); kubectl get events --sort-by=.lastTimestamp for the cluster-wide timeline.
  • Did the node even offer GPUs? kubectl describe node <n> | grep -A4 Allocatable and check the count: nvidia.com/gpu: 0 ⇒ device plugin down (K8S1).
  • Is the PVC bound and mounted? kubectl get pvc (Bound?) and kubectl describe pod <p> Volumes section — an unbound PVC stalls the pod in Pending.
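
The "why did it die" triage can also be run over the full pod object instead of a jsonpath one-liner. This is a hedged sketch: `diagnose_pod` is a hypothetical helper that takes the parsed output of `kubectl get pod <p> -o json` and covers only the eviction and terminated-container cases named in this section.

```python
from typing import Any, Dict


def diagnose_pod(pod: Dict[str, Any]) -> str:
    """Map a parsed pod object to a one-line triage hint."""
    status = pod.get("status", {})
    if status.get("reason") == "Evicted":
        return "Evicted: " + status.get("message", "no message")
    for cs in status.get("containerStatuses", []):
        # `state` describes the current attempt, `lastState` the previous one.
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated")
            if not terminated:
                continue
            reason = terminated.get("reason", "Unknown")
            if reason == "OOMKilled":
                return "OOMKilled: raise resources.limits.memory"
            code = terminated.get("exitCode")
            return f"{reason}, exit code {code}: read the logs"
    return "no terminated container recorded"
```

Feeding it `json.load` of the saved kubectl output gives a first hint; `kubectl describe pod <p>` remains the source for scheduling and mount failures.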

K8s OVERRIDES: DETACH=k8s-job · DURABLE_DIR=/checkpoints (PVC mount — required; RWX for multi-node) · CRED_FILE="" — credentials arrive as a K8s Secret mounted as an env var (WANDB_API_KEY / HF_TOKEN), never a file on disk and never baked into the image layer, so run_one's [ -n "$CRED_FILE" ] guard skips the file read and the env var passes through · teardown=kubectl delete + scale the node pool down.


THIN DIFF — COLAB / KAGGLE (not SSH-orchestratable)

kind: notebook · no SSH, no tmux, no persistent disk, no real job abstraction. The generic core's central primitive ("detach + survive the session") cannot be satisfied directly — degrade to checkpoint-to-cloud + idempotent resume. Teardown is automatic and free; the opposite problem to the baseline — the work cannot be kept alive long enough.

Colab (free tier):

  • Idle timeout ~90 min (no cell activity) and a hard ~12 h max VM lifetime; on disconnect all RAM, variables, models, and the local /content filesystem are lost. Limits are dynamic and unpublished — GPU type/availability and the exact ceilings "vary over time" and GPU is best-effort, can be denied or downgraded (verified research.google.com/colaboratory/faq.html 2026-06).
  • Free tier requires the browser tab to STAY OPEN (verified — corrects the draft's "anti-idle tricks are unreliable" framing): background execution is a Pro+ paid feature; on free tier closing the tab stops the runtime shortly after (verified github.com/googlecolab/colabtools#4151 + community reports 2026-06). So keep-alive hacks aren't merely unreliable — there is no supported headless background run at all on free Colab. Design for the disconnect, do not fight it.
  • Only survival mechanism: mount Google Drive and checkpoint every epoch to Drive; make the entrypoint resume-from-Drive idempotent so the inevitable reconnect continues, not restarts.

Kaggle (free tier) — slightly better, because of one real primitive:

  • 30 GPU-hours/week floating quota (T4×2 or P100; resets weekly); interactive idle timeout ~60 min and a ~9 h session cap (verified kaggle.com/docs/efficient-gpu-usage + product-feedback 2026-06).
  • The one genuine headless-background primitive: "Save Version → Save & Run All (commit)." It snapshots the notebook and runs it on a separate machine with no idle timer, surviving browser close, and persists /kaggle/working (20 GB) as the committed version's output (commit times out at ~9 h GPU / ~12 h CPU). This is the closest thing to sbatch in the free-tier world — single it out as Kaggle's detach primitive. Live monitoring is weak (Colab: watch the cell; Kaggle commit: inspect only the finished version's logs).
  • Code delivery: clone from GitHub or pull the platform's dataset mounts — no scp.

Colab / Kaggle gotchas (platform-pinned; universal → references/gotchas_universal.md)

  • NB1 — Drive sync lag silently loses the "saved" checkpoint. Symptom: training logs saved best.pth to /content/drive/..., the runtime disconnects an hour later, and the file is 0 bytes or absent in Drive. → Root cause: writes to mounted Drive are buffered and sync asynchronously — large files can take up to ~30 min to actually land, and an unmount/disconnect before the flush loses them (verified github.com/googlecolab/colabtools#2607 + #4426 2026-06). → Fix: call drive.flush_and_unmount() (or os.fsync) right after each checkpoint, keep checkpoints small, and treat a checkpoint as durable only after it is visible in Drive — re-list it before trusting resume.
  • NB2 — Kaggle commit fails if any cell errors → the whole output is lost. Symptom: "Save & Run All" shows committing… forever or fails with a non-zero/Code 0 error, and nothing in /kaggle/working is saved. → Root cause: a commit re-runs the notebook top-to-bottom on a fresh machine; one failing cell (or an interactive-only state, or a flaky cell) aborts the commit and discards its output (verified kaggle.com/product-feedback/334753 + 59557 2026-06). → Fix: before committing, Run All interactively end-to-end on a clean kernel (catch order/state bugs); guard long sections so a late failure still writes partial results to /kaggle/working; rely on /kaggle/working (persisted), not in-memory variables.
  • NB3 — Kaggle batch (commit) run picks the WRONG accelerator / has no internet. Symptom: a committed run is glacial (ran on CPU) or fails to pip install/download. → Root cause: the accelerator and internet toggle are notebook settings the commit inherits — a notebook left on "None"/internet-off commits that way; internet also requires phone verification on the account. → Fix: set Accelerator = GPU and Internet = On in the notebook before committing; verify with torch.cuda.is_available() in an early cell so a CPU commit fails fast instead of wasting the 9 h.
  • NB4 — /content (Colab) and /kaggle/temp are scratch, not durable. Symptom: results written to /content/... or /kaggle/temp vanish on disconnect. → Root cause: only Drive (Colab) and /kaggle/working (Kaggle committed output) survive the session; everything else is ephemeral. → Fix: point DURABLE_DIR at the surviving path; never let the final artifact land only on scratch.
  • NB5 — Free Colab disconnect mid-epoch with no warning. Symptom: the session simply dies; there is no SIGTERM, no grace window to catch. → Root cause: unlike Slurm/K8s, a notebook eviction gives no signal — the resume contract is the only defense. → Fix: checkpoint every N steps to Drive (NB1-safe), make cell-1 resume-from-latest idempotent, and chain runs across sessions under the per-session ceiling. There is no checkpoint-on-signal here (contrast Slurm --signal / K8s SIGTERM).
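
The "checkpoint every N steps and stop under the per-session ceiling" pattern can be sketched as one loop. Everything named here is an assumption for illustration: `run_session`, its defaults (a 9 h ceiling, a 10-minute margin, every 100 steps), and the `do_step` / `save` callables that stand in for the real training step and the durable checkpoint write.

```python
import time
from typing import Callable


def run_session(
    start_step: int,
    total_steps: int,
    do_step: Callable[[int], None],
    save: Callable[[int], None],
    every_n: int = 100,
    ceiling_s: float = 9 * 3600,  # assumption: set to the platform's session cap
    margin_s: float = 600,        # assumption: stop this long before the cap
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Train from `start_step`; return the step reached so the next session resumes there."""
    began = clock()
    step = start_step
    while step < total_steps:
        do_step(step)
        step += 1
        out_of_time = clock() - began >= ceiling_s - margin_s
        # Save on the cadence, at the very end, and right before a planned stop.
        if step % every_n == 0 or out_of_time or step == total_steps:
            save(step)
        if out_of_time:
            break
    return step
```

The first cell would load the latest checkpoint, pass its step as `start_step`, and rerun the same call in the next session until the returned value equals `total_steps`.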

Colab / Kaggle debugging (session-death triage)

  • What am I actually on? First cell: import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0)) and !nvidia-smi — catches a CPU-only Colab assignment or a CPU Kaggle commit (NB3) before wasting the session.
  • Is the checkpoint really in Drive? !ls -la /content/drive/MyDrive/proj/*.pth after a drive.flush_and_unmount() — a 0-byte or missing file ⇒ sync lag (NB1), do not tear down trusting it.
  • Did the Kaggle commit succeed? Open the Version's Logs tab (the only post-mortem for a committed run) — a failed cell shows there; the committed /kaggle/working is the artifact, not the editor state.
  • Disk full inside the notebook? Run !df -h: /kaggle/working caps at 20 GB; HF cache and intermediate files exhaust it fast (U6/U7), prune before the commit's final write.

Colab/Kaggle OVERRIDES: DETACH=Drive-checkpoint loop (Colab) / Save&Run-All commit (Kaggle) · DURABLE_DIR=Drive /content/drive/MyDrive/proj (Colab) / /kaggle/working (Kaggle) · teardown=automatic · the pattern, every run: checkpoint every N steps → idempotent resume from cell 1 → keep each run under the per-session ceiling → chain runs across sessions.