# Universal & mixed gotcha catalog — every metered remote-GPU rental

The cross-platform gotchas: they bite on **any** metered, isolated, rented GPU — only the concrete
path/proxy/billing-verb changes (those live in `profiles/<platform>.md`). Each entry is
**Symptom → Root cause → Fix**. "Mixed" entries are universal in symptom but carry a *platform-specific
value* in the fix — the rule stays here, the value lives in a profile. Platform-only gotchas (AutoDL's
TB-pin, the wandb-key classifier, the network_turbo proxy literal) do NOT live here — see each profile's
TOP GOTCHAS section.

To jump: `grep -in '<keyword>' references/gotchas_universal.md` (e.g. `inode`, `egress`, `xid`, `crlf`,
`stdin`, `zombie`). Numbering `U1…` is stable; cross-platform additions continue the same series.

## Table of contents (by theme)

- **Process & SSH** — U1 SSH-dies-on-kill · U2 tmux-holds-script-in-memory · U3 vanished-process-4-causes · U4 kill-drops-SSH-before-relaunch · U5 hook-safe-launch
- **Disk & Storage** — U6 disk-full-crashes-torch.save · U7 storage-fails-on-inodes · U8 stage-hot-data-to-NVMe
- **Memory & OOM** — U9 cgroup-OOM-num_workers×tensor · U10 VRAM-OOM-vs-cgroup-OOM · U11 zombie-VRAM-nvidia-smi-cant-see · U41 host-metrics-lie/oom_kill-counter
- **Transfer & Download** — U12 scp-resets→resumable-loop · U13 scp-into-uncreated-dir · U14 egress-surcharge+same-AZ · U15 compress-before-the-wire
- **Monitoring** — U16 stale-waiters/zombie-monitors · U17 unquoted-pipe-grep-hang+robust-poll · U18 two-leg-remote-self-completion · U19 tracker-deletion-lags · U20 hosted-tracker-survives-teardown · U39 live-panel/TB-silently-empty (path/port/process mismatch) · U43 block-buffered-stdout-looks-frozen
- **GPU health** — U21 nvidia-smi-util%-is-a-liar · U22 Xid-48/79-dead-GPU-re-rent · U23 thermal/power-throttle-steals-25-40%
- **Dataloader & IO** — U24 dataloader-starvation-knobs · U25 many-small-files→shard-into-tar · U40 intra-op-thread-oversubscription-starves-GPU
- **Env & Container** — U26 CRLF-breaks-sh · U27 overlay-config-files · U28 CUDA-toolkit-vs-driver-vs-torch · U29 install-from-lockfile · U30 pin-image-by-sha256 · U31 container-runs-but-no-GPU · U42 box-code-drift/verify-deploy
- **Cost & teardown** — U32 task-epoch-default · U33 silent-checked-sync
- **Secrets & trackers** — U34 secrets-via-stdin · U35 tracker-offline-without-key
- **Delegated (cross-link only)** — U36 cuDNN-nondeterminism · U37 matplotlib-2^16 · U38 GPU-0%-util-data-bound
- **Pointers** — spot/preemption → `references/spot-resilience.md`; multi-node/NCCL → `references/multinode.md`

---

## Process & SSH

### U1 — SSH disconnects on `pkill -9` (exit 255, "Connection reset")

**Symptom**: `ssh <host> 'pkill -9 -f train'` returns `Connection reset by peer`, exit 255.

**Root cause**: killing the python tree tears down the PTY chain; the SSH client gets EOF and exits. The
remote command may have run fine.

**Fix**: this is **normal, not an error** — re-ssh and verify state, do not panic-retry.
```bash
ssh <host> "tmux kill-session -t qN 2>/dev/null; sleep 3; pkill -9 -f 'src.train'"  # SSH exits 255 here
ssh <host> "pgrep -af '[s]rc.train' | grep -m1 . || echo CLEAN"                      # separate call verifies; [s] keeps pgrep from matching this shell
```

### U2 — tmux holds the script in memory; editing it mid-run re-executes blocks

**Symptom**: a queue/launcher script is updated mid-run, but the running job still uses the old logic; or
an ablation completes cleanly yet **restarts from epoch 1** with a second tracker run and the queue never
advances.

**Root cause**: bash reads a script **by byte-offset on demand**. tmux keeps the launched script as-loaded;
`scp`-ing a new version mid-run makes bash seek to its saved offset in a *now-different* file, land
mid-command, and re-execute a block (duplicate runs, stalled queue). A child invocation (`bash run_one.sh`)
IS re-read fresh for the *next* item — but only if none is parked mid-script. (principle #6.)

**Fix**: **never overwrite a script any process is executing** — check `pgrep -af <script>` first; version
the filename for hot changes (`run_one_v2.sh`), point only *new* launches at it. Appending lines to a queue
file is safe (a `while read …; done < file` loop sees appended bytes); changing structure is not. To hot-swap, kill +
restart the detach session so fresh bash reads from the top. Recovery: kill the session, copy the finished
`best.pth` to durable storage, restart `run_queue.sh queue.txt <start_index>` to skip done items, delete any
duplicate tracker run (cross-link verifying-dl-experiments **REQUIRED**).

**Related detach trap — a non-exported var doesn't cross into the detach primitive.** A `VAR=x` set in
your shell before `tmux new-session` / `nohup` is **not** in the detached job's environment unless
**exported** (or inlined in the launched command) — the job sees it empty, and a launcher/monitor that
interpolates it silently misdirects (writes output to the wrong path, mis-reports "died"). `export VAR`
before launch, or inline it: `tmux new-session -d "VAR=$VAR bash run.sh"`.
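
The export trap can be reproduced locally without tmux. A minimal sketch, assuming `bash -c` as a stand-in for the detached job; `RUN_TAG` and `child_sees` are hypothetical names:

```bash
# Print what a freshly spawned child shell sees for RUN_TAG.
# `bash -c` stands in for the tmux/nohup job: it inherits only exported variables.
child_sees() {
  bash -c 'echo "${RUN_TAG:-<empty>}"'
}

unset RUN_TAG
RUN_TAG=ablation_7    # plain shell variable: NOT in the child's environment
child_sees            # prints <empty>

export RUN_TAG        # now part of the environment
child_sees            # prints ablation_7
```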

### U3 — A vanished remote process ≠ OOM: enumerate the 4 causes

**Symptom**: a detached run's log stops right after `Starting training` with no epoch output and no
traceback; `pgrep` shows it gone. The reflex is "OOM-killed."

**Root cause is one of four** — OOM is only one:
1. **Machine restart / reboot** — `dmesg` is *clean*, GPU idle, cgroup roomy, `uptime` low. Most-missed: nothing in the log hints at it.
2. **OOM-kill (`-9`)** — `dmesg | grep -i 'killed process'` shows it, memory was tight (U9).
3. **SSH HUP** — a foreground (non-`nohup`/`tmux`/`setsid`) launch dies when its parent SSH drops.
4. **Manual kill** — an earlier `pkill` matched more than intended.

**Fix — diagnose cheap → conclusive before "fixing"**:
```bash
dmesg 2>/dev/null | grep -iE 'killed process|out of memory' | tail   # OOM? empty = not OOM
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader  # idle now = died, not hung
numfmt --to=iec --invalid=ignore < /sys/fs/cgroup/memory.max          # roomy = OOM unlikely; "max" = no limit set
uptime                                                                # low = recent reboot (cause 1)
```
Clean `dmesg` + idle GPU + roomy cgroup + low `uptime` ⇒ **reboot, not OOM**. Do NOT shrink batch size to
"fix" a phantom OOM — that masks the one variable under test. **Separate trap**: a dropped poll connection
≠ the training dying — re-ssh and check the process/artifact directly (`pgrep -af train`, log tail,
`best.pth` mtime) before concluding the run died (principle #3).

### U4 — `kill` drops the SSH before a relaunch in the SAME command runs

**Symptom**: `ssh <host> 'pkill -f X; relaunch X'` kills X but X is **not** relaunched; ssh returns 255.

**Root cause**: killing a session-tied process drops the SSH (U1, normal) at the kill, so everything after
it in that one command never executes.

**Fix**: split — kill in one ssh call, relaunch (with NO kill) in the next. To stop a kill/poll pattern
from matching the matcher's own command line, split the literal: `A=base; B=lines.; pgrep -f "${A}${B}"`
(the contiguous string `baselines.` never appears in the cmdline running `pgrep`).
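
The split can be wrapped in a small helper. This is a sketch only: `kill_then_relaunch`, `SSH_BIN`, and `SETTLE_S` are hypothetical names, and `SSH_BIN` exists purely so the flow can be dry-run without a real host.

```bash
# Kill and relaunch as two separate ssh calls.
kill_then_relaunch() {
  local host=$1 pattern=$2 launch_cmd=$3
  # Call 1: kill only. Exit 255 here is expected (U1), so it is not treated as failure.
  "${SSH_BIN:-ssh}" "$host" "pkill -f '$pattern'" || true
  sleep "${SETTLE_S:-3}"
  # Call 2: relaunch only, with no kill anywhere in the same command.
  "${SSH_BIN:-ssh}" "$host" "$launch_cmd"
}
```

Pass a pattern that cannot match the remote shell's own command line, for example `'[s]rc.train'` or the split-literal form above.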

### U5 — Hook-safe remote launch: keep env activation VISIBLE in the launch command

**Symptom**: an env-guard hook (e.g. "no DL in conda base") blocks or asks on
`ssh <host> 'nohup bash /root/job.sh ...'` even though `job.sh` activates the right env internally; it also
misfires on heredocs that inline `python -m <pkg>.train`.

**Root cause**: the hook scans the **command string** — it cannot see inside an scp'd script, and a bare
`bash job.sh` launch has no visible `conda activate <env>`, so the guard assumes base.

**Fix**: write the heavy script via Write/`scp` (so `python -m ...train` lives in the file, not the command)
and put a VISIBLE activation in the launch ssh command:
`ssh <host> 'source /path/to/conda.sh; conda activate <env>; nohup bash /root/job.sh ...'` — the script
re-activating is harmless. Never `--no-verify` / never bypass the guard. (On a single-tenant rental whose
base IS the env, the right move is to exempt remote/ephemeral base, not to clone it — that's a profile fact.)

---

## Disk & Storage

### U6 — Disk-full crashes `torch.save` with `iostream error`

**Symptom**: mid-training exit=1; log shows `RuntimeError: basic_ios::clear: iostream error` and
`unexpected pos N vs M` from inside `torch.serialization`; a leftover `latest.pth.tmp` sits in the
checkpoint dir; `df` shows the data mount at 100%.

**Root cause**: the checkpoint writer saves atomically (`torch.save` to a `.tmp` file, then rename; `torch.save`
itself writes straight to the path it is given); the `.tmp` write hits disk-full and errors. Any quota'd/cgroup disk on any rental does this.

**Fix — prevent**: pre-budget `ckpt_size × N_runs + worst_case_latest + tracker_local_cache`; if it exceeds
the mount, schedule mid-run aggregation to durable storage + delete completed-and-aggregated dirs; in
`run_one.sh`, on success prune the rolling `latest.pth` and keep only `best.pth` (cross-link
verifying-dl-experiments **REQUIRED** for the keepable-checkpoint policy). **Recover**: delete the
`*.tmp`/`latest.pth` to free several GB — `best.pth` survives, the queue can resume.
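
The pre-budget formula is plain integer arithmetic. A minimal sketch, assuming all sizes are whole GB; `ckpt_budget` is a hypothetical helper, not a shipped script:

```bash
# need = ckpt_size * n_runs + worst-case latest.pth + tracker local cache
ckpt_budget() {
  local ckpt_gb=$1 n_runs=$2 latest_gb=$3 cache_gb=$4 mount_gb=$5
  local need=$(( ckpt_gb * n_runs + latest_gb + cache_gb ))
  if (( need > mount_gb )); then
    echo "OVER: need ${need} GB, mount ${mount_gb} GB"
    return 1
  fi
  echo "OK: need ${need} GB of ${mount_gb} GB"
}
```

For example, `ckpt_budget 2 24 4 5 50` reports 57 GB needed against a 50 GB mount, which is the cue to schedule mid-run aggregation.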

### U7 — Storage fails on the dimension (and location) not being watched

**Symptom**: `cp`/`mkdir` fails `No space left on device`, yet `df -h` shows ~34% used — because `df -i`
reads `100%` (inodes exhausted). Or the data mount fills despite `runs/` looking small.

**Root cause**: disk dies on **inodes before bytes** — the classic trigger is **per-sample eval output**,
which writes on the order of `files_per_sample × N_samples × N_conditions` tiny files. And the real
byte-hog often hides where nobody looks: a **symlinked cache** (`~/.cache/huggingface` mapped onto the data
disk) can outweigh everything the run created.

**Fix**: monitor `df -i`, not just `df -h`, in Phase 0 and every space check. **Audit the real mount with
`du`, not assumptions** (`du -sh ~/.cache/huggingface/hub/models--* | sort -rh`). Clean by **value** — keep
the tiny irreplaceable evidence (metric/eval JSONs), drop the large reproducible scratch (periodic
checkpoints, unused caches). Cap per-sample eval visualization (cross-link verifying-dl-experiments
**REQUIRED** for the sizing policy). The *inode-cap number* is a profile fact (some platforms enforce a hard
~200K cap; GB-quota'd platforms have none); the many-small-files general form is **shard into tar** (U25).
Get explicit user confirmation naming `rm -rf` targets; offer "clean vs expand the disk" (principle #9).
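
A space check that watches both dimensions might look like the sketch below. `space_check` is a hypothetical helper and the 90% default threshold is an assumption, not a platform value.

```bash
# Report byte and inode usage for one mount; fail if either crosses the threshold.
# Uses `df -P` for bytes and `df -Pi` for inodes (column 5 is the use% in both).
space_check() {
  local path=$1 limit=${2:-90} bytes inodes
  bytes=$(df -P "$path"   | awk 'NR==2 {gsub("%","",$5); print $5}')
  inodes=$(df -Pi "$path" | awk 'NR==2 {gsub("%","",$5); print $5}')
  echo "bytes=${bytes}% inodes=${inodes}%"
  # Some filesystems report "-" for inodes; treat non-numeric as "no inode limit".
  [[ $inodes =~ ^[0-9]+$ ]] || inodes=0
  (( bytes < limit && inodes < limit ))
}
```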

### U8 — Stage hot data to local NVMe before training

**Symptom**: training is I/O-bound reading from a network/shared/HDD-backed volume; GPU starves between
batches.

**Root cause**: a remote/networked filesystem (or a spinning data disk) has far lower random-read
throughput than instance-local NVMe — HDD-vs-NVMe gaps reach ~35×.

**Fix**: at job start, copy the working dataset from the durable/shared tier to instance-local NVMe scratch,
train against the local copy, write checkpoints back to durable storage. The local-NVMe path is a profile
fact (`local_nvme` in the frontmatter); the stage-then-train discipline is universal. Pairs with U24/U25.
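
The stage step can be made idempotent with a marker file, so a re-launched job does not re-copy. A sketch under stated assumptions: `stage_dataset` and the `.staged_ok` marker name are hypothetical, and `cp -a` is used only because it is always present (`rsync -a` works where installed).

```bash
# Stage a dataset onto instance-local scratch once, then reuse it.
# A half-finished copy has no marker and is redone on the next call.
stage_dataset() {
  local src=$1 dst=$2 marker
  marker="$dst/.staged_ok"
  if [ -f "$marker" ]; then
    echo "already staged: $dst"
    return 0
  fi
  mkdir -p "$dst" || return 1
  cp -a "$src"/. "$dst"/ || return 1
  touch "$marker"
  echo "staged: $dst"
}
```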

---

## Memory & OOM

### U9 — `num_workers` × a big in-RAM tensor → cgroup OOM-kill (bare "Killed", exit 137)

**Symptom**: training dies early with a bare `Killed` / `killed by signal: Killed (-9)` and **no Python
traceback**; lowering `num_workers` makes it vanish.

**Root cause**: each DataLoader worker is a `fork` that gets its **own copy** of any large object the
dataset holds (a 16384² float32 matrix ≈ 1 GB). `num_workers=W` ⇒ ~`(W+1)×` that footprint, which blows the
instance's cgroup `memory.max` even though a bare-process run fits. The kernel OOM-kills with no
Python-level error, so it reads as a mysterious crash.

**Fix**: size `num_workers` against `memory.max` and the per-worker resident set, **not** CPU count. Share
one copy across workers (memmap / module-level singleton built once) or generate the object on the fly.
Shrinking the problem also fixes it — a smaller matrix dim shrinks footprint *quadratically* (dim 1024 ≈
4 MB, 256× less than 16384). Confirm it's OOM: `dmesg | tail` shows `Out of memory: Killed process`, and the
same config survives `num_workers=0`.
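
The `(W+1)×` rule turns into a one-line bound on `num_workers`. A minimal sketch: `max_workers` is a hypothetical helper, inputs are bytes, and the 80% headroom default is an assumption rather than a measured value.

```bash
# Largest W such that (W + 1) copies of the per-process footprint fit the cgroup.
max_workers() {
  local mem_max=$1 per_proc=$2 headroom_pct=${3:-80}
  local usable=$(( mem_max * headroom_pct / 100 ))
  local w=$(( usable / per_proc - 1 ))
  (( w < 0 )) && w=0
  echo "$w"
}

max_workers 17179869184 1073741824   # 16 GiB cgroup, 1 GiB per process -> 11
```

When `memory.max` reads `max` (no limit set), this bound does not apply; size against physical RAM instead.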

### U10 — VRAM OOM (a big model or a concurrent job) is distinct from cgroup-RAM OOM (U9)

**Symptom**: `torch.OutOfMemoryError: CUDA out of memory` when launching a second train/eval while another
runs, or a big model (deep transformer / unrolled net at high res) OOMs alone.

**Root cause**: **VRAM** — the sum of concurrent jobs' allocations plus fragmentation exceeds the card. NOT
host-RAM (U9).

**Fix**: check free VRAM first (`nvidia-smi --query-gpu=memory.free --format=csv,noheader`); size the batch
to fit *alongside* any concurrent job; set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to cut
fragmentation. (Run heavy DL on the box; do static/shape checks locally — cross-link
verifying-dl-experiments **REQUIRED** for local-OOM rationale.)

### U11 — A zombie holds VRAM `nvidia-smi` cannot see → OOM on an "empty" GPU

**Symptom**: `nvidia-smi` lists no process and shows free memory, yet a fresh job OOMs immediately; common
after a crashed DDP run or a killed container.

**Root cause**: a defunct/orphaned process (or a dead container's namespace) still holds CUDA context and
VRAM, but `nvidia-smi`'s process table can't attribute it — so the GPU *looks* empty while memory is locked.

**Fix**: enumerate the real holders via the device nodes and reap them:
```bash
fuser -v /dev/nvidia* 2>/dev/null   # or: lsof /dev/nvidia*  → kill -9 the listed PIDs
```
If containerized, restart the container. Ship a small reaper that flags any PID with persistent VRAM + ~0%
util beyond a timeout — cross-link `scripts/reap_vram_zombies.sh`.

### U41 — On a shared box, `uptime`/`free` describe the whole physical host, not your container — use cgroup-scoped readings + the `oom_kill` counter

**Symptom**: a detached run looks "dead" or "the host is overloaded" — `uptime` shows load average 400+,
`top`/`free -m` look maxed — so you suspect contention or an OOM-kill. But the job's own checkpoint `mtime`
keeps advancing and its log still grows.

**Root cause**: on a multi-tenant rental, host tools (`uptime`, `top`, `free -m`, `vmstat`) report the
**physical node you share with other tenants**, not your cgroup. A neighbor's job spikes the host load
average to ~490 while your container sits near-idle (your processes in `R`/`S`, none stuck in
uninterruptible `D`). Reading host load as your own → a false "overloaded / OOM-killed" verdict and a
needless kill-and-restart of a healthy run.

**Fix**: judge YOUR container from cgroup-scoped readings, not host tools:
- memory — `/sys/fs/cgroup/memory.current` vs `memory.max` (not `free -m`);
- were YOU OOM-killed — the **`oom_kill` counter** in `/sys/fs/cgroup/memory.events`
  (`grep oom_kill /sys/fs/cgroup/memory.events`); a non-incrementing counter means you were **not**
  OOM-killed, however red host `free` looks;
- CPU pressure — `/sys/fs/cgroup/cpu.stat` / `cpu.pressure`.

A high host load with your cgroup roomy and `oom_kill 0` is a **noisy neighbor**, not your bug — don't
shrink your batch or blame your code (a neighbor genuinely starving you on the shared card is U21/U23
throttle territory or a re-rent, not a code fix). Sharpens the **U3** vanished-process ladder: the
authoritative OOM check is the cgroup `oom_kill` counter, not host `dmesg`/`free` noise.
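
Reading the counter is easiest as a before/after comparison: save the value at launch, compare when the run looks dead. A sketch assuming cgroup v2 and that the container sees its own cgroup at `/sys/fs/cgroup`; both function names are hypothetical.

```bash
# Print the oom_kill counter from a cgroup v2 memory.events file.
oom_kill_count() {
  local events=${1:-/sys/fs/cgroup/memory.events}
  awk '$1 == "oom_kill" {print $2; found=1} END {if (!found) print "unknown"}' "$events"
}

# Succeeds only if the counter rose above the value saved at launch.
oom_killed_since() {
  local before=$1 now
  now=$(oom_kill_count "${2:-/sys/fs/cgroup/memory.events}")
  [[ $now =~ ^[0-9]+$ ]] && (( now > before ))
}
```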

---

## Transfer & Download

### U12 — `scp -r` of a large dir resets mid-transfer → per-dir resumable loop

**Symptom**: 30–60 min into `scp -r host:...130GB ./`, the connection drops
(`Read from remote host ... reset by peer`); local has a few dirs, the rest gone. scp does not resume.

**Root cause**: a single SSH connection carries the whole transfer; any network blip kills all of it.

**Fix**: loop **per-dir**, each its own SSH session — one failure doesn't lose the others, and re-running
skips completed dirs. Prefer `rsync -avz --partial --append-verify` (resumes a half-file). Wrap bulk pulls
in a `timeout … && break` retry loop: a stall ≠ permanent failure, and resumable transfers accumulate
progress across kills. Validate any speed test on the **same route** the real transfer uses (principle #7).
See `scripts/download_loop.sh` for the per-dir pattern.
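
The retry wrapper itself is small. The sketch below is not the shipped `scripts/download_loop.sh`; `retry_transfer` and `RETRY_SLEEP_S` are hypothetical names, and GNU `timeout` is assumed.

```bash
# Usage: retry_transfer <max_tries> <timeout_s> <command...>
retry_transfer() {
  local max=$1 limit=$2 i
  shift 2
  for (( i = 1; i <= max; i++ )); do
    # `timeout` kills a stalled attempt; a resumable command picks up where it left off.
    timeout "$limit" "$@" && return 0
    echo "attempt $i/$max failed, retrying" >&2
    sleep "${RETRY_SLEEP_S:-10}"
  done
  return 1
}
```

Called once per directory (for instance `retry_transfer 20 1800 rsync -az --partial --append-verify "<host>:<remote_dir>/" "<local_dir>/"`), a failure in one directory costs only that directory's progress.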

### U13 — `scp` into a remote dir a sibling command was supposed to create (race)

**Symptom**: a background `scp big.tar host:/root/x/` fails instantly with `dest open "/root/x/": Failure`
— the foreground command that would have `mkdir`-ed `/root/x` ran later, or was blocked/cancelled.

**Root cause**: ordering assumption between parallel/sibling commands; the destination dir didn't exist yet.

**Fix**: make every transfer self-sufficient inside its own retry loop:
`ssh host 'mkdir -p /root/x' && scp … || retry`. Never assume a sibling created the destination.

### U14 — Egress is a silent ~20% surcharge; co-locate and stay same-AZ

**Symptom**: the monthly bill is ~20% over the rented GPU-hours; a large model/dataset re-pulled daily from
a hyperscaler bucket dominates cost (a 140 GB model pulled daily from S3 ≈ $378/mo in egress alone).

**Root cause**: hyperscaler **egress** is metered (AWS ~$0.09/GB, GCP ~$0.08, Azure ~$0.087) while most
GPU-clouds (Lambda/RunPod/vast/CoreWeave) charge $0. Worse, **cross-AZ traffic bills ~$0.01/GB each
direction even inside one provider** — storage in a different zone than compute quietly meters every read.

**Fix**: co-locate storage with compute on the **same provider AND same AZ/region**. Pull a dataset once to
durable local storage, not per-epoch from a remote bucket. Record `free_egress` / `egress_per_gb` /
`cross_az_per_gb` as profile fields and prefer a $0-egress GPU-cloud for transfer-heavy jobs.
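
The egress figure above is easy to re-derive for any profile. A minimal sketch; `egress_monthly_usd` is a hypothetical helper and the rate is whatever the provider actually bills.

```bash
# Monthly egress cost in USD: GB per pull x pulls per month x $/GB.
egress_monthly_usd() {
  awk -v gb="$1" -v pulls="$2" -v rate="$3" 'BEGIN { printf "%.2f\n", gb * pulls * rate }'
}

egress_monthly_usd 140 30 0.09   # 140 GB pulled daily at the AWS rate -> 378.00
```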

### U15 — Compress before the wire

**Symptom**: checkpoint/dataset transfers are slow and (on metered egress) expensive.

**Root cause**: raw tensors and JSON cross the network uncompressed.

**Fix**: zstd/gzip the payload before transfer — cuts checkpoints+datasets 30–60%, JSON 60–80%; store
weights fp16/int8 where the task tolerates it. Compounds with U14 (less egress $) and U12 (fewer bytes to
resume). Pairs with U25 (tar shards compress and transfer as one stream).

---

## Monitoring

### U16 — Stale background waiters pile up; supersede a run → STOP its waiter; pick the right lifetime

**Symptom**: a "Background tasks" panel shows 8+ "Running" wait-loops at 500–740 min elapsed, each
ssh-polling every ~20 s, while the GPU is idle and the experiment finished hours ago.

**Root cause**: every kill+restart of a flaky saga armed a NEW `until ssh grep MARKER; do sleep; done`
waiter but never stopped the OLD one — its marker (in a superseded log) never appears, so it loops forever.
A `run_in_background` waiter is **not** time-capped (a 781 s task ran to completion + notified; the ~600 s
cap is on **foreground** Bash only). The real silent-failure mode is a waiter that never EXITS (U17).

**Fix**: one waiter per live run: when superseding a run, stop the old waiter first (`TaskStop`; cross-session
IDs aren't stoppable from a resumed session — dismiss those from the UI). Multi-hour wait → a **persistent
Monitor** (no 10-min cap) + a stall-detector emit so a hung run still notifies. A persistent Monitor dies on
session resume → after any resume, check the remote ground-truth directly (`tmux ls`, `grep DONE log`,
`nvidia-smi`); never trust a monitor that may be gone (principle #3).

### U17 — A silent background monitor that never returns: usually an unquoted `|` in grep

**Symptom**: a `run_in_background` ssh monitor never returns / never notifies; `pgrep` shows a process
"alive." The run looks hung — but the actual job finished and wrote results fine.

**Root cause**: the wrapper never EXITED because a sub-command blocks forever. The classic bug is an
**unquoted `|` in grep** — `grep -hE noise-sweep|snr=|wrote log` — the shell splits it into THREE piped
commands, and the first (`grep -hE noise-sweep`, no filename) reads **stdin** → blocks forever → the
pipeline never returns → ssh never returns → the local background process never exits → no completion
notification. (Background tasks notify on EXIT only — no 600 s cap; foreground Bash is the capped one, U16.)

**Fix — robust remote-poll template**:
- **Quote every regex AND give grep a filename**: `grep -hE 'noise-sweep|snr=|wrote' log` (a `|` inside quotes is alternation; a filename means read the file, never stdin).
- **Bound the ssh**: `ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 -o ServerAliveCountMax=3 …` — a blip self-kills in ~30 s instead of half-open hanging for minutes.
- **Short-connection poll, not one long-held ssh**: each poll = ssh in → check → disconnect; loop locally with a bounded counter.
- **Verify by artifact, not notification**: when it "looks done," Read the local output + a fresh `ssh 'grep DONE log; tmux ls; nvidia-smi'` to confirm ground truth (cross-link verifying-dl-experiments **REQUIRED**); don't wait on a notification that may never fire.
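
Put together, the first three bullets look roughly like this. A sketch only: `poll_marker` and `SSH_BIN` are hypothetical names, the defaults (60 tries, 30 s pause) are assumptions, and the marker must not contain a single quote.

```bash
# Bounded short-connection poll for a log marker.
# Each try is one short ssh: connect, grep a named file, disconnect.
poll_marker() {
  local host=$1 log=$2 marker=$3 tries=${4:-60} pause=${5:-30} i
  for (( i = 0; i < tries; i++ )); do
    if "${SSH_BIN:-ssh}" -o ConnectTimeout=15 -o ServerAliveInterval=10 \
         -o ServerAliveCountMax=3 "$host" "grep -qF -- '$marker' '$log'" </dev/null; then
      echo "FOUND"
      return 0
    fi
    sleep "$pause"
  done
  echo "NOT_FOUND after $tries tries"   # bounded: the waiter always exits
  return 1
}
```

Because the loop is bounded and grep always gets a filename, the background task exits either way and its completion notification fires.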

### U18 — "I'll check periodically" is a lie unless a trigger is armed; two-leg remote self-completion

**Symptom**: a promise to monitor a multi-hour remote run, then no report for a day — because between turns
the assistant does not run. A cloud scheduler set up to "ssh in and check" silently can't reach the box.

**Root cause**: two conflated things. (a) Making the REMOTE self-complete (a waiter that blocks on a log
marker then runs eval) guarantees RESULTS but gives no *reporting cadence* — nothing re-invokes the
assistant on a timer. (b) A cloud schedule runs in an isolated sandbox with its own checkout and **no access
to the local SSH key or network** → it cannot `ssh` the rented box, and the SSH private key must **never** go
into a cloud agent (secret-leak).

**Fix — the two-leg pattern**:
- **Remote self-completion (guaranteed, survives session/SSH death)**: chain `train → eval → touch marker` under one `nohup ... </dev/null >log 2>&1 &`. Detect "done" by a **log marker** (`grep -q 'QUEUE DONE' master.log`), NEVER by `pgrep` — the waiter's own command line contains the pattern, so `pgrep -f` matches itself and loops forever (U17).
- **Live progress (best-effort)**: a session-bound local loop (e.g. `/loop 30m` / cron `3,33 * * * *`) that ssh-polls with the *local* key. Be honest it dies when the session closes — the remote still finishes; the user re-pings to pull.
- **Don't promise autonomous cross-session polling you can't deliver.** (`tmux` is often absent on a fresh box and `apt-get install` fails offline — `nohup ... </dev/null >log 2>&1 &` is zero-dependency and survives SSH drop; verify with `pgrep -af <script>`.) Full architecture → `references/monitoring_patterns.md`.
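
The remote leg can be sketched as one detached chain plus a marker check. `launch_chain` and `chain_done` are hypothetical helpers; the train and eval commands are placeholders for the real ones.

```bash
# Launch train -> eval -> marker as one detached chain.
# The marker is only printed if both earlier steps succeed.
launch_chain() {
  local log=$1 train_cmd=$2 eval_cmd=$3
  nohup bash -c "$train_cmd && $eval_cmd && echo 'QUEUE DONE'" \
    </dev/null >"$log" 2>&1 &
  echo "launched pid $!"
}

# Completion is read from the log marker, never from pgrep.
chain_done() { grep -q 'QUEUE DONE' "$1"; }
```

A failed train step leaves the log without the marker, so "not done" and "crashed" still have to be told apart by reading the log tail.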

### U19 — Tracker run deletion lags; a fresh export resurrects "deleted" runs

**Symptom**: `run.delete()` returns, but an immediate `api.runs()` still lists every deleted run; a batch
history-export minutes later happily re-downloads `<run>__history.csv` for runs just deleted.

**Root cause**: deletion is asynchronous server-side; list/export endpoints serve stale listings for
minutes.

**Fix**: delete → re-verify on a **later** monitoring tick (not a tight loop; a second
`delete(delete_artifacts=True)` pass is safe). Order matters: do cloud deletions **before** local exports,
then re-check the export dir for resurrected files and remove them. (cross-link verifying-dl-experiments
**REQUIRED** for tracker forensics.)

### U20 — Local logs die with the instance: use a hosted tracker

**Symptom**: TensorBoard event files written to an ephemeral box vanish on teardown — every curve gone after
the meter-stop verb runs.

**Root cause**: a rented box's local disk is not durable past `terminate`/`destroy` (principle #4); the
metric history lived only there.

**Fix**: log metrics to a **hosted tracker** so they survive teardown — `trackio.init(space_id=...)` or
`wandb` online (push under the platform's proxy if behind a firewall). Poll the tracker's structured alerts
as the monitor instead of brittle ssh-tail. Cross-link huggingface-skills:huggingface-trackio **REQUIRED**
for the `init/log/finish/alert` mechanics and `space_id` sync.

### U43 — A detached run's log looks frozen for minutes though training is fine: stdout is block-buffered off a TTY

**Symptom**: a `nohup`/`tmux` run prints a few lines then nothing for many minutes; it reads as
"hung / died" and the reflex is to kill it — but checkpoint `mtime`, TB scalars, and `nvidia-smi` all show
it advancing.

**Root cause**: Python (and libc stdio) **line-buffer when stdout is a TTY but block-buffer (~4–8 KB) when
it is a pipe or file** — exactly the detached case. The log only flushes when the buffer fills, so a
healthy run looks silent and a `grep`-on-log liveness check false-alarms on the gap.

**Fix**: run unbuffered — `python -u` or `PYTHONUNBUFFERED=1` (the shipped `scripts/run_one.sh.template`
already exports it); for a shell pipeline use `stdbuf -oL`. And judge liveness by **artifacts, not stdout
cadence** — checkpoint `mtime`, the TB scalar API, `nvidia-smi` (monitoring_patterns §0 corollary; the
deeper "is it actually hung?" attach is py-spy, throughput-profiling **T21**). A frozen log is the single
most common false "dead run."
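
An artifact-based liveness check can be as small as an mtime comparison. A sketch assuming GNU `stat`; `fresh_artifact` is a hypothetical helper and the age limit is the caller's choice.

```bash
# Succeeds if the file was modified within max_age_s seconds.
fresh_artifact() {
  local file=$1 max_age_s=$2 now mtime
  [ -e "$file" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$file") || return 1
  (( now - mtime <= max_age_s ))
}
```

Pick the limit from the checkpoint cadence: a run that saves every epoch and takes ten minutes per epoch is only suspicious once its checkpoint is well past ten minutes old.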

---

## GPU health

### U21 — `nvidia-smi` GPU-Util % is a liar

**Symptom**: the perf tile reads 100% util but throughput is poor; or util looks "busy" while the job is
actually starved (the inverse of U38, which is the 0%-but-running case).

**Root cause**: `GPU-Util` is the share of the sampling window in which at least one kernel was executing, not
how much of the GPU those kernels occupied. A stream of tiny back-to-back kernels that keeps a single SM busy reads as 100%.

**Fix**: correlate util with **SM clock** (`clocks.current.sm`), memory-bandwidth util, and power draw —
`nvidia-smi dmon -s pucvmet -d 1`. Low SM clock or low power at "100% util" means the GPU is underfed (go to
U24). Always sample over several seconds, never one snapshot.
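
To average several samples rather than eyeball one, pipe a looping query into a small reducer. A sketch: `gpu_sample_mean` is a hypothetical helper that expects CSV lines of utilization, SM clock, and power draw, as produced by `nvidia-smi --query-gpu=utilization.gpu,clocks.sm,power.draw --format=csv,noheader,nounits -l 1`.

```bash
# Average "util %, SM clock MHz, power W" samples read from stdin.
gpu_sample_mean() {
  awk -F', *' '
    NF >= 3 { u += $1; c += $2; p += $3; n++ }
    END {
      if (n) {
        printf "n=%d util=%.0f%% sm_clock=%.0fMHz power=%.0fW\n", n, u/n, c/n, p/n
      } else {
        print "no samples"
      }
    }'
}
```

Wrapping the query in `timeout 10 …` bounds the sampling window; a mean SM clock or power far below the card's normal load figures at "100% util" is the underfed signature.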

### U22 — Xid 48/79 = a dead GPU; on a rental, re-rent

**Symptom**: training crashes or the GPU drops out; `dmesg | grep -i xid` shows an Xid error.

**Root cause**: Xid is NVIDIA's canonical hardware-fault signal. **Xid 48 = double-bit ECC (the GPU is
dead); Xid 79 = "GPU has fallen off the bus."** These are hardware, not code.

**Fix**: on a *rental* the card can't be reseated — **stop the instance and re-rent a different box**; don't
burn hours debugging code for a hardware fault. Check `dmesg | grep -i xid` as part of the "vanished
process" ladder (U3) when the GPU goes idle unexpectedly.

### U23 — Thermal/power throttling silently steals 25–40% with no error

**Symptom**: "the same code is slower than yesterday" — no error, no crash, just lower throughput.

**Root cause**: the GPU is thermal- or power-throttling (an H100 throttles around 83 °C; target <75 °C). On
a shared rental, cooling/power headroom is outside tenant control.

**Fix**: detect — SM clock falling below base while temp >83 °C, or
`nvidia-smi -q -d PERFORMANCE` showing a throttle reason. A tenant can't fix cooling → **flag it and
re-rent** a healthier box; don't read the slowdown as a model/data regression. Pairs with U21 (clocks expose
it where util% hides it).

---

## Dataloader & IO

### U24 — GPU starves at 10–70% waiting on the dataloader, not on compute

**Symptom**: util sits well below 100% (but nonzero), step log advances slowly; profiling shows time spent
in data fetch, not fwd/bwd.

**Root cause**: the input pipeline can't keep the GPU fed — too few workers, no prefetch, host↔device copies
on the critical path. (Distinct from U38's *0%* CPU-data-bound transform case; this is the partial-starve
knob set.)

**Fix — tune in order**: `num_workers = cores − 1` (`cores` = your cgroup vCPU quota, not the host count, U40; also sized against the per-worker footprint, U9),
`persistent_workers=True`, `pin_memory=True`, `prefetch_factor=2`. Pathological cases show >100× gaps from
these alone. If a heavy per-sample transform is the bottleneck, move it to the GPU (cross-link
verifying-dl-experiments **REQUIRED** for the 0%-util diagnosis, U38). Pairs with U8 (stage to NVMe) and U25.

### U25 — Millions of small files on a network/object store → transaction-overhead death; shard into tar

**Symptom**: a dataset of many tiny files streams glacially from a shared/object store; or eval output of
tens of thousands of per-sample files exhausts inodes (U7) or blows a visualization grid (U37).

**Root cause**: per-file open/stat/close overhead dominates on networked/object storage; the inode and
metadata cost scales with file *count*, not bytes.

**Fix**: pack into **sharded tar** (WebDataset), a few hundred MB per shard → 3–10× faster sequential I/O and
the only sane pattern for streaming from S3. This is the **general form** of the inode-exhaustion trap (U7)
and the per-sample-vis trap — cap and shard rather than emitting a file per sample. Pairs with U8 (stage the
shards to local NVMe) and U15 (shards compress as one stream).

### U40 — A vCPU-sliced rental starves its own GPU: torch intra-op threads default to the HOST core count, not your cgroup quota

**Symptom**: GPU `sm%` sits ~5–15% and runs grind, but the dataloader is not the bottleneck (few/no
workers, data already on-device, the U24 knobs don't help); `top` shows dozens of python threads fighting
over a handful of cores.

**Root cause**: you rent a **cgroup CPU slice** (e.g. 12 vCPUs of a 64-core host), but torch/OpenMP size
their intra-op thread pools to the **physical** core count: with `OMP_NUM_THREADS` unset, `torch.get_num_threads()`
comes up ~64. ~57 runnable threads thrashing 12 cores burn the slice on context-switching, so the CPU side
that launches kernels and feeds the GPU can't keep up and the card idles. No OOM, no error — pure scheduler
thrash (the *host scheduling* starves the GPU, the inverse of being data-bound).

**Fix**: cap the pools at or below your **slice's** vCPU count before launch, e.g.
`export OMP_NUM_THREADS=4 MKL_NUM_THREADS=4` (and/or `torch.set_num_threads(4)`); confirm torch honoured it
(`python -c "import torch; print(torch.get_num_threads())"` → 4, not 64). Read the real quota from the
cgroup, not `nproc` (which reports host cores): `cat /sys/fs/cgroup/cpu.max` → `quota period`, vCPUs ≈
quota/period. Bake the cap into the launch wrapper so every queue cell inherits it. Distinct from **U9**
(workers × RAM → cgroup OOM) and **U24** (dataloader starvation); the triage that catches it is
throughput-profiling **T3** (GPU SM% low while a python thread-storm pegs the cores).
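
Deriving the vCPU count from the cgroup can be scripted into the launch wrapper. A sketch assuming cgroup v2; `cgroup_vcpus` is a hypothetical helper that rounds fractional quotas up and falls back to `nproc` when no quota is set.

```bash
# vCPUs granted by a cpu.max file ("<quota> <period>" or "max <period>").
cgroup_vcpus() {
  local f=${1:-/sys/fs/cgroup/cpu.max} quota period
  read -r quota period < "$f" || [ -n "$quota" ] || return 1
  if [ "$quota" = "max" ]; then
    nproc
  else
    echo $(( (quota + period - 1) / period ))
  fi
}
```

The result is an upper bound for `OMP_NUM_THREADS` / `MKL_NUM_THREADS`; leave some of it for dataloader workers if the job uses them.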

---

## Env & Container

### U26 — CRLF breaks `.sh` on Linux (authored on Windows)

**Symptom**: a synced launcher silently does nothing (empty log); run by hand it errors `set: -: invalid
option`, `cd: /path\r: No such file or directory`, `syntax error near unexpected token $'do\r'` — every
line "ends in `\r`."

**Root cause**: Windows `core.autocrlf=true` (or `git archive` exporting working-tree EOL) writes `.sh` with
CRLF; Linux `bash` treats the trailing `\r` as part of each token. `.py` is unaffected (Python's universal
newlines); it is specifically `bash`/`.sh` that breaks.

**Fix**: add `.gitattributes` with `*.sh text eol=lf` (so `git archive`/checkout always emits LF); immediate
on-box unblock: `sed -i 's/\r$//' scripts/*.sh`.
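
A pre-launch check catches the problem before a launcher silently does nothing. A minimal sketch; `find_crlf_scripts` is a hypothetical helper.

```bash
# Print the scripts that still contain CRLF line endings.
# Exit status 0 means at least one offender was found.
find_crlf_scripts() {
  grep -l $'\r$' "$@" 2>/dev/null
}
```

Run it as `find_crlf_scripts scripts/*.sh` after every sync; any output means the `sed` unblock above is needed before launching.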

### U27 — `-o dotted.key=value` overrides explode on null parents → freeze protocols as overlay config FILES

**Symptom**: `-o evaluation.sps_augmentation.enable=true` crashes
`KeyError: Override path '...' is not a mapping` because the base YAML has the parent as `null`. Worse
long-term: protocol variants that exist only as one-off CLI strings are unreproducible months later.

**Root cause**: dotted-key override traversal can't descend through a `null` parent; and a CLI-string-only
protocol has no diffable, reviewable record.

**Fix**: define each protocol variant as a small overlay config (`configs/eval_overlays/<protocol>.yaml` with
`_base_:` pointing at the canonical leaf) and pass it via `-c`. Reviewable, diff-able, immune to null-parent
traversal. This is also the **retry-the-identical-config mechanism** (principle #7): an overlay file is a
stable config a retry re-uses byte-for-byte. To reconstruct a historical protocol, read the artifact
manifest (`*_manifest.json` records the resolved overrides verbatim).

### U28 — The CUDA-toolkit ↔ host-driver ↔ torch-build triangle

**Symptom**: `detected CUDA version mismatches the version used to compile PyTorch`; or `no kernel image is
available for execution` at the first forward on a new-arch GPU.

**Root cause**: three independently-versioned layers must agree — **the host driver is host-global and a
tenant usually cannot change it on a rental; the CUDA toolkit is per-env and changeable; the torch build
must match both.** The toolkit must be ≤ what the host driver supports; a project that pins
`torch<2.9` can *downgrade* the only build with kernels for a new-arch card (e.g. sm_120).

**Fix**: keep the image's working torch — filter framework pins out of the remote install:
```bash
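# Match the three package names exactly, so torchmetrics, torch-geometric, etc. stay in the list.
grep -ivE '^(torch|torchvision|torchaudio)([^A-Za-z0-9_.-]|$)' requirements.txt > /root/req_remote.txt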
pip install -r /root/req_remote.txt
```
Set `LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH` when the per-env toolkit must win. Before launching,
smoke-test `torch.cuda.get_device_capability()` plus a heavy project import; the off-band torch version
lands in the runtime snapshot, so disclose it with results. `host_driver_cuda_max` is a profile field.
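
The "toolkit ≤ driver" rule can be checked mechanically. A minimal sketch, assuming GNU `sort -V`; the helper names `version_le` and `check_toolkit_vs_driver` are hypothetical, and the two version strings are supplied by the caller:

```bash
# Hypothetical check: succeed when version $1 <= version $2 (GNU sort -V ordering).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# $1 = CUDA version of the per-env toolkit / torch build
# $2 = highest CUDA version the host driver supports
check_toolkit_vs_driver() {
    local toolkit="$1" driver_max="$2"
    if version_le "$toolkit" "$driver_max"; then
        echo "OK: toolkit $toolkit <= driver max $driver_max"
    else
        echo "MISMATCH: toolkit $toolkit > driver max $driver_max"
        return 1
    fi
}
```

The first argument can come from `python -c "import torch; print(torch.version.cuda)"`, the second from the `CUDA Version:` field in the `nvidia-smi` banner or from the profile's `host_driver_cuda_max`. The comparison is deliberately strict; it does not model CUDA minor-version compatibility.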

### U29 — "Same version, different result": top-level pins let transitive deps drift → install from a lockfile

**Symptom**: two installs of the "same" `requirements.txt` produce different behavior/results.

**Root cause**: a hand-edited `requirements.txt` pins only top-level packages; transitive dependencies drift
between installs.

**Fix**: install from a **lock file** (`uv lock` / `pip-tools` / `conda-lock`) that pins the full resolved
graph, not a hand-edited top-level list. Pairs with U28 (filter the framework pins, then lock the rest).

### U30 — A Dockerfile is NOT reproducible: pin the base image by `@sha256:` digest

**Symptom**: a container built from the "same" Dockerfile months apart behaves differently.

**Root cause**: `FROM image:latest` (or any moving tag) resolves to a different layer set over time.

**Fix**: pin the base image by content digest — `FROM image@sha256:<digest>`, not `:latest` — so the build
is bit-reproducible. (`pin_image_by_sha256` is a per-platform expectation where the image is the env
contract.)
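
A small lint can enforce the pin before a build. This is a sketch under stated assumptions: the function name `unpinned_from_lines` is hypothetical, and it will also flag `FROM scratch` and multi-stage `FROM <earlier-stage>` lines, which need no digest:

```bash
# Hypothetical lint: print the FROM lines of a Dockerfile that lack an @sha256: digest.
# Exit status is grep's: 0 when at least one unpinned line was printed, 1 when all are pinned.
unpinned_from_lines() {
    local dockerfile="${1:-Dockerfile}"
    grep -inE '^[[:space:]]*FROM[[:space:]]' "$dockerfile" \
        | grep -vE '@sha256:[0-9a-f]{64}([[:space:]]|$)'
}
```

`docker images --digests` shows the digest of an image that has already been pulled, which is the value to paste after `@`.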

### U31 — Container runs but trains 100× slower = the GPU was never attached (CPU-only)

**Symptom**: a containerized job runs to completion but is absurdly slow; loss curves look normal, just
glacial.

**Root cause**: the container has no GPU: it was launched without `--gpus all`, or the NVIDIA Container
Toolkit is missing/too old, so `torch.cuda.is_available()` returned `False` and the code silently fell back
to CPU.

**Fix**: `docker run --gpus all …`, NVIDIA Container Toolkit ≥1.14, and **validate `nvidia-smi` *inside* the
container before training** — never assume GPU attachment from a clean `docker run`.
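
The in-container validation can be a hard gate at the top of the entrypoint. A minimal sketch; `require_gpu` is a hypothetical name, and the optional probe-command argument exists only so the gate can be dry-run on a machine without a GPU:

```bash
# Hypothetical pre-training gate: fail fast unless the probe lists at least one GPU.
require_gpu() {
    local count
    # Default probe is `nvidia-smi -L`, which prints one "GPU <n>: ..." line per visible device.
    [ "$#" -gt 0 ] || set -- nvidia-smi -L
    count="$("$@" 2>/dev/null | grep -c '^GPU ' || true)"
    if [ "${count:-0}" -ge 1 ]; then
        echo "GPU check OK: $count device(s) visible"
    else
        echo "!! no GPU visible in this environment; refusing to train on CPU" >&2
        return 1
    fi
}
```

Calling `require_gpu || exit 1` before the training command turns a 100× slowdown into an immediate, visible failure. From the host, `docker run --rm --gpus all <image> nvidia-smi -L` performs the same probe without entering the container.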

### U42 — The box runs a hand-synced copy with no git remote; a fix you "committed" may not be deployed — verify it is ON the box before trusting a run or tearing down

**Symptom**: a bug you fixed and committed locally still reproduces on the box, or an eval runs on stale
logic (wrong default, missing speedup, pathologically slow), even though local `git log` shows the fix
landed.

**Root cause**: most rentals have **no git remote** — the box holds a working tree you pushed by
`scp`/`rsync`/`tar-over-ssh`, so its code only advances when you re-sync. A local commit changes nothing on
the box; an interrupted or wrong-path sync, or simply forgetting, leaves the box pre-fix. "I committed it"
≠ "it's running on the box."

**Fix**: treat code deploy like the checked-sync (**U33**) — **verify, don't assume**. After syncing, grep
the box for the change before relying on it:
```bash
ssh "$HOST" "grep -n '<new symbol / changed line>' /root/<proj>/path/file.py" || echo 'NOT DEPLOYED'
```
or compare a hash (`ssh host 'sha256sum file'` vs local). Make it a pre-flight for any run whose result
depends on the fix, and part of the **Phase-5 teardown gate** — a verdict produced by stale code is not the
verdict you think it is (principle #3). Pairs with **U29/U30** (pin deps/image): code AND environment must
both be the version you believe.
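
The hash comparison can be wrapped so a failed remote read also counts as "not deployed". A minimal sketch; `verify_deployed` is a hypothetical name, and the remote hash is passed in by the caller:

```bash
# Hypothetical deploy check: compare a local file's SHA-256 with a hash read from the box.
# $1 = local file
# $2 = remote hash, e.g. "$(ssh "$HOST" "sha256sum <remote path>" | cut -d' ' -f1)"
verify_deployed() {
    local file="$1" remote_hash="$2" local_hash
    local_hash="$(sha256sum "$file" | cut -d' ' -f1)"
    # An empty remote hash means the ssh read failed; never treat that as a match.
    if [ -n "$remote_hash" ] && [ "$local_hash" = "$remote_hash" ]; then
        echo "DEPLOYED: $file matches the box"
    else
        echo "NOT DEPLOYED: $file differs from the box (or the remote read failed)"
        return 1
    fi
}
```

Because the function returns non-zero on mismatch, it can gate a launch or a teardown directly: `verify_deployed path/file.py "$remote_hash" || exit 1`.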

---

## Cost & teardown

### U32 — A task's default epochs differ from another task's; CLI `--epochs` silently overrides the right value

**Symptom**: one CLI `--epochs N` is applied to all ablations; a subset (e.g. detection vs recon/seg)
consistently underperforms; a reviewer flags it.

**Root cause**: some task families need more epochs to converge and default to a higher value in their YAML;
a blanket CLI `--epochs` silently overrides that per-task default.

**Fix**: make the queue support a per-line epoch field (e.g. recon/seg `20`, det `50`); audit the codebase's
YAML for `epochs:` declarations before deploying (`grep -rE '^\s*epochs:' configs/ | sort -u`). This is an
instance of config drift, so make it a smoke/sanity target (cross-link verifying-dl-experiments **REQUIRED**).
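
One way a per-line epoch field could look, as a sketch only. The queue format (`<job-name> <config-path> [epochs]`), the function name `queue_to_commands`, and the `train.py` entry point are all assumptions for illustration, not the shipped queue:

```bash
# Hypothetical queue format: "<job-name> <config-path> [epochs]" per line; '#' starts a comment.
# When the epochs field is absent, no --epochs flag is emitted, so the YAML default applies.
queue_to_commands() {
    local queue="$1" name config epochs
    while read -r name config epochs || [ -n "$name" ]; do
        case "$name" in ''|'#'*) continue ;; esac
        if [ -n "$epochs" ]; then
            echo "python train.py -c $config --epochs $epochs  # $name"
        else
            echo "python train.py -c $config  # $name"
        fi
    done < "$queue"
}
```

The design point is that omitting the field is safe: a job without an explicit value falls back to its own task default instead of inheriting a blanket CLI number.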

### U33 — Silent sync failure: gate the success line on the actual copy result

**Symptom**: a wrapper prints `auto-synced <name> to durable storage` for every job, but at download time
the durable dir is missing or empty.

**Root cause**: the sync block does `mkdir -p "$DST"; cp -f ... 2>/dev/null` then `echo synced`
**unconditionally** — it never checks the exit code. When the durable FS is inode-exhausted (U7) `mkdir`
fails but the success line still fires, so monitoring looks green while nothing landed (principle #3).

**Fix — checked, gated sync**:
```bash
if mkdir -p "$DST" && cp -f "$CKPT_DIR/best.pth" "$DST/" && [ -f "$DST/best.pth" ]; then
    echo "[$(date +%H:%M:%S)] auto-synced $NAME to durable storage"
else
    echo "[$(date +%H:%M:%S)] !! SYNC FAILED for $NAME (check df -i) — data disk is still source-of-truth"
fi
```
Until a download is verified locally, trust the **data-disk** copy, not the "synced" log line. The shipped
`scripts/run_one.sh.template` carries the checked version.

---

## Secrets & trackers

### U34 — Move credentials to the box without the secret ever appearing in a command

**Symptom**: pasting a key into an ssh/scp command leaks it into shell history, transcripts, and hook logs;
security hooks (rightly) block scp-ing a whole `~/.netrc` (it carries other machines' credentials).

**Root cause**: any secret inside a command string is captured by history/transcript/hook logging.

**Fix**: stream exactly one machine block via **stdin** — the value flows file→pipe→file and never appears in
any command text or output:
```bash
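# -F matches the host literally; -A 2 assumes login/password sit on the two lines after `machine`.
grep -F -A 2 'machine api.wandb.ai' ~/.netrc | ssh <host> 'umask 077; cat > /root/.netrc && chmod 600 /root/.netrc'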
```
Verify by capability, not by echoing the value:
`python -c "import wandb; print(wandb.Api(timeout=20).default_entity)"`. Never write the secret to a
shared/durable FS that a platform classifier scans (that platform detail is a profile fact).

### U35 — `WANDB_MODE=offline` still dies without an API key in wrapper stacks → zero curves

**Symptom**: a run launched `WANDB_MODE=offline` expecting "log locally, sync later" produces **no offline
run dirs at all**; the train log shows `Disabled WandB due to initialization error: No API key configured`.

**Root cause**: bare-SDK offline mode needs no key, but project logger *wrappers* often probe the API
(`wandb.login()` / `wandb.Api()`) before `init` and treat key-absence as fatal → they flip to fully-disabled,
not offline.

**Fix**: push credentials BEFORE the first launch (U34) and run online under the platform's proxy; verify the
first log lines show `Syncing run <name>` + a run URL — treat the *absence* of that line as a failure. Run
already finished without a tracker? Backfill from the train log (regex per-epoch summaries →
`init(..., tags=["backfilled"]) → run.log(..., step=epoch)`). Still in flight? Kill and relaunch with
`--resume <latest.pth>` (costs ≤1 epoch). Prefer a hosted tracker so metrics survive teardown (U20).
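
The "absence of the line is a failure" rule can be scripted against the train log. A minimal sketch that keys on the two log strings quoted above; `tracker_is_live` is a hypothetical name, and the log path is whatever your launcher writes:

```bash
# Hypothetical gate: succeed only if the train log shows the tracker came up online.
tracker_is_live() {
    local log="$1"
    if grep -q 'Disabled WandB' "$log" 2>/dev/null; then
        echo "!! tracker disabled (see $log)"
        return 1
    fi
    if grep -q 'Syncing run' "$log" 2>/dev/null; then
        echo "tracker live: $(grep -m1 'Syncing run' "$log")"
    else
        echo "!! no 'Syncing run' line in $log yet; treat as failure"
        return 1
    fi
}
```

Run it a minute or so after launch; a non-zero return is the cue to kill and relaunch with `--resume` while the cost is still at most one epoch.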

---

## Delegated — cross-link only, do NOT restate here

### U36 — cuDNN nondeterminism

Same config + seed gives slightly different metrics run-to-run (`cudnn.benchmark=True` picks the fastest
kernel by first-batch timing). Owned by **verifying-dl-experiments** (determinism). Cross-link
verifying-dl-experiments **REQUIRED**; do not restate the fix here.

### U37 — matplotlib `2^16`-per-axis limit on large eval visualization

A composite grid (one row per sample) on a large test set crashes
`Image size … must be less than 2^16`, often aborting the summary save. Owned by
**verifying-dl-experiments** (eval-artifact sizing). Cross-link verifying-dl-experiments **REQUIRED**;
prevent with U25 (cap + shard, don't emit a file/row per sample).

### U38 — GPU at 0% util but training IS running (CPU-data-bound, not stalled)

`nvidia-smi` reads ~0% util yet the step log advances and model memory is loaded — a heavy per-sample CPU
transform with `num_workers=0` serializes data prep and starves the GPU. Owned by
**verifying-dl-experiments** (0%-util diagnosis). Cross-link verifying-dl-experiments **REQUIRED**; the fix
knobs are U24, the move-to-GPU remedy is in that skill.

### U39 — Live monitoring shows nothing (TensorBoard panel empty / `INACTIVE`) but training is fine

**Symptom**: the platform's TensorBoard tile / web panel is blank or `INACTIVE`, or a backgrounded watcher
goes silent — yet the run is healthy: the loss advances on the box and the event/log files exist. You
conclude "monitoring is broken" or, worse, "the run died," and waste a check or restart a fine run.

**Root cause**: live observability breaks in three platform-shaped ways, none of which is a training
failure. (1) **Path mismatch** — the platform's built-in panel reads a FIXED logdir/port and your logger
wrote elsewhere, so the panel sees zero runs (AutoDL pins `tensorboard --logdir /root/tf-logs`; a
`SummaryWriter(log_dir="runs/<exp>")` is invisible to it). (2) **Process died / never backgrounded** — the
TB server or the watcher ran in the foreground or under the session and was killed at the foreground cap
or on session/SSH drop, so nothing serves the curves. (3) **Port not exposed** — the service is up on the
box but the port was never tunnelled / declared, so the panel can't reach it.

**Fix** (the rule is universal; the *value* is per-profile): (1) **align the path** — point your logger at
the panel's pinned dir, OR symlink the pinned dir at your output (`ln -sfn <your-runs>/<exp> <pinned>/<exp>`);
no retrain — the running writer keeps appending and the panel reloads it. The pinned path lives in the
profile (AutoDL `/root/tf-logs`, **AD7**; elsewhere write under the durable mount). (2) **run TB + the
watcher under the detach primitive** (tmux / nohup / the profile's `DETACH`), never foreground, so they
survive the session and the ~600 s cap (`references/monitoring_patterns.md` §1; cross-host background →
§7). (3) **expose the port the platform's way** — CN built-in tiles declare it at rent time (`china.md`),
RunPod via its HTTP proxy (100 s Cloudflare cap, fine for a TB UI, `runpod.md`), Lambda / Paperspace /
bare-SSH via an `ssh -L 6006:localhost:6006` tunnel (`generic-ssh.md`, `lambda.md`). Before blaming the
panel, verify ground truth: the event file is non-empty (`ls -la <logdir>; du -sh <logdir>`) and TB
answers locally (`curl -s localhost:<port>/ | head`). For curves that must **survive teardown**, don't
depend on a box-local panel at all → a hosted tracker (**U20**).
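
The ground-truth check on the event files can be made a one-call test. A minimal sketch, assuming GNU `find` and the standard `events.out.tfevents.*` file naming; `has_event_data` is a hypothetical name:

```bash
# Hypothetical check: does <logdir> hold at least one non-empty TensorBoard event file?
has_event_data() {
    local logdir="$1"
    # -L follows symlinks (the pinned dir may be a symlink to your runs);
    # -size +0c keeps only files larger than zero bytes.
    find -L "$logdir" -type f -name 'events.out.tfevents.*' -size +0c 2>/dev/null | grep -q .
}
```

If `has_event_data <pinned-dir>` succeeds but the panel is still empty, the problem is the server process or the port (causes 2 and 3), not the path.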

---

## Pointers — gotchas catalogued elsewhere

- **Spot / preemption** (grace windows 2 min → ~0 s, Young/Daly cadence, atomic-write resume, managed-spot frameworks restart-your-process) → `references/spot-resilience.md`.
- **Multi-node / NCCL** (fabric-manager hang, wrong NIC, NCCL timeout, jumbo-frame MTU mismatch, torchrun/Horovod elastic state restore) → `references/multinode.md`. Single-box users skip.