# Universal & mixed gotcha catalog — every metered remote-GPU rental

The cross-platform gotchas: they bite on **any** metered, isolated, rented GPU — only the concrete path/proxy/billing-verb changes (those live in `profiles/.md`). Each entry is **Symptom → Root cause → Fix**. "Mixed" entries are universal in symptom but carry a *platform-specific value* in the fix — the rule stays here, the value lives in a profile. Platform-only gotchas (AutoDL's TB-pin, the wandb-key classifier, the network_turbo proxy literal) do NOT live here — see each profile's TOP GOTCHAS section.

To jump: `grep -in '' references/gotchas_universal.md` (e.g. `inode`, `egress`, `xid`, `crlf`, `stdin`, `zombie`). Numbering `U1…` is stable; cross-platform additions continue the same series.

## Table of contents (by theme)

- **Process & SSH** — U1 SSH-dies-on-kill · U2 tmux-holds-script-in-memory · U3 vanished-process-4-causes · U4 kill-drops-SSH-before-relaunch · U5 hook-safe-launch
- **Disk & Storage** — U6 disk-full-crashes-torch.save · U7 storage-fails-on-inodes · U8 stage-hot-data-to-NVMe
- **Memory & OOM** — U9 cgroup-OOM-num_workers×tensor · U10 VRAM-OOM-vs-cgroup-OOM · U11 zombie-VRAM-nvidia-smi-cant-see · U41 host-metrics-lie/oom_kill-counter
- **Transfer & Download** — U12 scp-resets→resumable-loop · U13 scp-into-uncreated-dir · U14 egress-surcharge+same-AZ · U15 compress-before-the-wire
- **Monitoring** — U16 stale-waiters/zombie-monitors · U17 unquoted-pipe-grep-hang+robust-poll · U18 two-leg-remote-self-completion · U19 tracker-deletion-lags · U20 hosted-tracker-survives-teardown · U39 live-panel/TB-silently-empty (path/port/process mismatch) · U43 block-buffered-stdout-looks-frozen
- **GPU health** — U21 nvidia-smi-util%-is-a-liar · U22 Xid-48/79-dead-GPU-re-rent · U23 thermal/power-throttle-steals-25-40%
- **Dataloader & IO** — U24 dataloader-starvation-knobs · U25 many-small-files→shard-into-tar · U40 intra-op-thread-oversubscription-starves-GPU
- **Env & Container** — U26 CRLF-breaks-sh · U27 overlay-config-files · U28 CUDA-toolkit-vs-driver-vs-torch · U29 install-from-lockfile · U30 pin-image-by-sha256 · U31 container-runs-but-no-GPU · U42 box-code-drift/verify-deploy
- **Cost & teardown** — U32 task-epoch-default · U33 silent-checked-sync
- **Secrets & trackers** — U34 secrets-via-stdin · U35 tracker-offline-without-key
- **Delegated (cross-link only)** — U36 cuDNN-nondeterminism · U37 matplotlib-2^16 · U38 GPU-0%-util-data-bound
- **Pointers** — spot/preemption → `references/spot-resilience.md`; multi-node/NCCL → `references/multinode.md`

---

## Process & SSH

### U1 — SSH disconnects on `pkill -9` (exit 255, "Connection reset")

**Symptom**: `ssh 'pkill -9 -f train'` returns `Connection reset by peer`, exit 255.

**Root cause**: killing the python tree tears down the PTY chain; the SSH client gets EOF and exits. The remote command may have run fine.

**Fix**: this is **normal, not an error** — re-ssh and verify state, do not panic-retry.

```bash
ssh "tmux kill-session -t qN 2>/dev/null; sleep 3; pkill -9 -f 'src.train'"  # SSH exits 255 here
# Separate call verifies. The [s] keeps pgrep from matching the remote shell's own
# command line, and dropping `| head -1` lets pgrep's exit status reach `||`.
ssh "pgrep -af '[s]rc.train' || echo CLEAN"
```

### U2 — tmux holds the script in memory; editing it mid-run re-executes blocks

**Symptom**: a queue/launcher script is updated mid-run, but the running job still uses the old logic; or an ablation completes cleanly yet **restarts from epoch 1** with a second tracker run and the queue never advances.

**Root cause**: bash reads a script **by byte-offset on demand**. tmux keeps the launched script as-loaded; `scp`-ing a new version mid-run makes bash seek to its saved offset in a *now-different* file, land mid-command, and re-execute a block (duplicate runs, stalled queue). A child invocation (`bash run_one.sh`) IS re-read fresh for the *next* item — but only if none is parked mid-script. (principle #6.)

**Fix**: **never overwrite a script any process is executing** — check `pgrep -af