# Universal & mixed gotcha catalog — every metered remote-GPU rental

The cross-platform gotchas: they bite on **any** metered, isolated, rented GPU — only the concrete
path/proxy/billing-verb changes (those live in `profiles/<platform>.md`). Each entry is
**Symptom → Root cause → Fix**. "Mixed" entries are universal in symptom but carry a *platform-specific
value* in the fix — the rule stays here, the value lives in a profile. Platform-only gotchas (AutoDL's
TB-pin, the wandb-key classifier, the network_turbo proxy literal) do NOT live here — see each profile's
TOP GOTCHAS section.

To jump: `grep -in '<keyword>' references/gotchas_universal.md` (e.g. `inode`, `egress`, `xid`, `crlf`,
`stdin`, `zombie`). Numbering `U1…` is stable; cross-platform additions continue the same series.

## Table of contents (by theme)

- **Process & SSH** — U1 SSH-dies-on-kill · U2 tmux-holds-script-in-memory · U3 vanished-process-4-causes · U4 kill-drops-SSH-before-relaunch · U5 hook-safe-launch
- **Disk & Storage** — U6 disk-full-crashes-torch.save · U7 storage-fails-on-inodes · U8 stage-hot-data-to-NVMe
- **Memory & OOM** — U9 cgroup-OOM-num_workers×tensor · U10 VRAM-OOM-vs-cgroup-OOM · U11 zombie-VRAM-nvidia-smi-cant-see · U41 host-metrics-lie/oom_kill-counter
- **Transfer & Download** — U12 scp-resets→resumable-loop · U13 scp-into-uncreated-dir · U14 egress-surcharge+same-AZ · U15 compress-before-the-wire
- **Monitoring** — U16 stale-waiters/zombie-monitors · U17 unquoted-pipe-grep-hang+robust-poll · U18 two-leg-remote-self-completion · U19 tracker-deletion-lags · U20 hosted-tracker-survives-teardown · U39 live-panel/TB-silently-empty (path/port/process mismatch) · U43 block-buffered-stdout-looks-frozen
- **GPU health** — U21 nvidia-smi-util%-is-a-liar · U22 Xid-48/79-dead-GPU-re-rent · U23 thermal/power-throttle-steals-25-40%
- **Dataloader & IO** — U24 dataloader-starvation-knobs · U25 many-small-files→shard-into-tar · U40 intra-op-thread-oversubscription-starves-GPU
- **Env & Container** — U26 CRLF-breaks-sh · U27 overlay-config-files · U28 CUDA-toolkit-vs-driver-vs-torch · U29 install-from-lockfile · U30 pin-image-by-sha256 · U31 container-runs-but-no-GPU · U42 box-code-drift/verify-deploy
- **Cost & teardown** — U32 task-epoch-default · U33 silent-checked-sync
- **Secrets & trackers** — U34 secrets-via-stdin · U35 tracker-offline-without-key
- **Delegated (cross-link only)** — U36 cuDNN-nondeterminism · U37 matplotlib-2^16 · U38 GPU-0%-util-data-bound
- **Pointers** — spot/preemption → `references/spot-resilience.md`; multi-node/NCCL → `references/multinode.md`

---

## Process & SSH

### U1 — SSH disconnects on `pkill -9` (exit 255, "Connection reset")

**Symptom**: `ssh <host> 'pkill -9 -f train'` returns `Connection reset by peer`, exit 255.

**Root cause**: killing the python tree tears down the PTY chain; the SSH client gets EOF and exits. With
`pkill -f` the pattern can also match the remote shell that is running the command, because that shell's
command line contains the pattern. The remote command may have run fine.

**Fix**: this is **normal, not an error** — re-ssh and verify state, do not panic-retry.
```bash
ssh <host> "tmux kill-session -t qN 2>/dev/null; sleep 3; pkill -9 -f 'src.train'"  # SSH exits 255 here
# Separate call verifies. [s] stops pgrep -f from matching this command's own shell; without
# `| head -1`, pgrep's exit status reaches `||` (head always exits 0, so CLEAN never printed).
ssh <host> "pgrep -af '[s]rc.train' || echo CLEAN"
```

### U2 — tmux holds the script in memory; editing it mid-run re-executes blocks

**Symptom**: a queue/launcher script is updated mid-run, but the running job still uses the old logic; or
an ablation completes cleanly yet **restarts from epoch 1** with a second tracker run and the queue never
advances.

**Root cause**: bash reads a script **by byte-offset on demand**. The bash process inside the tmux session
keeps the script open at its saved offset; `scp`-ing a new version mid-run makes bash seek to that offset in
a *now-different* file, land mid-command, and re-execute a block (duplicate runs, stalled queue). A child
invocation (`bash run_one.sh`) IS re-read fresh for the *next* item — but only if none is parked mid-script.
(principle #6.)

**Fix**: **never overwrite a script any process is executing** — check `pgrep -af <script>` first; version
the filename for hot changes (`run_one_v2.sh`), point only *new* launches at it. Appending lines to a queue
file is safe (a `while read …; done < file` loop sees appended bytes); changing structure is not. To
hot-swap, kill + restart the detach session so fresh bash reads from the top. Recovery: kill the session,
copy the finished `best.pth` to durable storage, restart `run_queue.sh queue.txt <start_index>` to skip done
items, delete any duplicate tracker run (cross-link verifying-dl-experiments **REQUIRED**).

**Related detach trap — a non-exported var doesn't cross into the detach primitive.** A `VAR=x` set in
your shell before `tmux new-session` / `nohup` is **not** in the detached job's environment unless
**exported** (or inlined in the launched command) — the job sees it empty, and a launcher/monitor that
interpolates it silently misdirects (writes output to the wrong path, mis-reports "died"). `export VAR`
before launch, or inline it: `tmux new-session -d "VAR=$VAR bash run.sh"`.

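
The export trap can be reproduced in any bash without a remote box. This is a minimal sketch: `RUN_TAG` and `child_env_value` are hypothetical names chosen for illustration, and `bash -c` stands in for a child process such as a `nohup` launch.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report what a child process sees for RUN_TAG.
child_env_value() {
  bash -c 'echo "${RUN_TAG:-<empty>}"'
}

unset RUN_TAG
RUN_TAG=ablation_a            # set in this shell, NOT exported
before=$(child_env_value)     # the child does not inherit it

export RUN_TAG                # now part of the environment
after=$(child_env_value)      # the child inherits the value

echo "before export: $before"
echo "after export:  $after"
```

Inlining the value in the launched command, as in the `tmux new-session -d "VAR=$VAR bash run.sh"` form, sidesteps the question of what the detached process inherits.
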
### U3 — A vanished remote process ≠ OOM: enumerate the 4 causes

**Symptom**: a detached run's log stops right after `Starting training` with no epoch output and no
traceback; `pgrep` shows it gone. The reflex is "OOM-killed."

**Root cause is one of four** — OOM is only one:
1. **Machine restart / reboot** — `dmesg` is *clean*, GPU idle, cgroup roomy, `uptime` low. Most-missed: nothing in the log hints at it.
2. **OOM-kill (`-9`)** — `dmesg | grep -i 'killed process'` shows it, memory was tight (U9).
3. **SSH HUP** — a foreground (non-`nohup`/`tmux`/`setsid`) launch dies when its parent SSH drops.
4. **Manual kill** — an earlier `pkill` matched more than intended.

**Fix — diagnose cheap → conclusive before "fixing"**:
```bash
dmesg 2>/dev/null | grep -iE 'killed process|out of memory' | tail     # OOM? empty = not OOM
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader   # idle now = died, not hung
numfmt --to=iec --invalid=ignore < /sys/fs/cgroup/memory.max           # roomy = OOM unlikely; `max` = no limit
uptime                                                                 # low = recent reboot (cause 1)
```
Clean `dmesg` + idle GPU + roomy cgroup + low `uptime` ⇒ **reboot, not OOM**. Do NOT shrink batch size to
"fix" a phantom OOM — that masks the one variable under test. **Separate trap**: a dropped poll connection
≠ the training dying — re-ssh and check the process/artifact directly (`pgrep -af train`, log tail,
`best.pth` mtime) before concluding the run died (principle #3).

### U4 — `kill` drops the SSH before a relaunch in the SAME command runs

**Symptom**: `ssh <host> 'pkill -f X; relaunch X'` kills X but X is **not** relaunched; ssh returns 255.

**Root cause**: killing a session-tied process drops the SSH (U1, normal) at the kill, so everything after
it in that one command never executes.

**Fix**: split — kill in one ssh call, relaunch (with NO kill) in the next. To stop a kill/poll pattern
from matching the matcher's own command line, split the literal: `A=base; B=lines.; pgrep -f "${A}${B}"`
(the contiguous string `baselines.` never appears in the cmdline running `pgrep`).

### U5 — Hook-safe remote launch: keep env activation VISIBLE in the launch command

**Symptom**: an env-guard hook (e.g. "no DL in conda base") blocks or asks on
`ssh <host> 'nohup bash /root/job.sh ...'` even though `job.sh` activates the right env internally; it also
misfires on heredocs that inline `python -m <pkg>.train`.

**Root cause**: the hook scans the **command string** — it cannot see inside an scp'd script, and a bare
`bash job.sh` launch has no visible `conda activate <env>`, so the guard assumes base.

**Fix**: write the heavy script via Write/`scp` (so `python -m ...train` lives in the file, not the command)
and put a VISIBLE activation in the launch ssh command:
`ssh <host> 'source /path/to/conda.sh; conda activate <env>; nohup bash /root/job.sh ...'` — the script
re-activating is harmless. Never `--no-verify` / never bypass the guard. (On a single-tenant rental whose
base IS the env, the right move is to exempt remote/ephemeral base, not to clone it — that's a profile fact.)

---

## Disk & Storage

### U6 — Disk-full crashes `torch.save` with `iostream error`

**Symptom**: mid-training exit=1; log shows `RuntimeError: basic_ios::clear: iostream error` and
`unexpected pos N vs M` from inside `torch.serialization`; a leftover `latest.pth.tmp` sits in the
checkpoint dir; `df` shows the data mount at 100%.

**Root cause**: the checkpoint is written atomically (`torch.save` to a `.tmp` file → rename; the training
code does this, not `torch.save` itself); the `.tmp` write hits disk-full and errors. Any quota'd/cgroup
disk on any rental does this.

**Fix — prevent**: pre-budget `ckpt_size × N_runs + worst_case_latest + tracker_local_cache`; if it exceeds
the mount, schedule mid-run aggregation to durable storage + delete completed-and-aggregated dirs; in
`run_one.sh`, on success prune the rolling `latest.pth` and keep only `best.pth` (cross-link
verifying-dl-experiments **REQUIRED** for the keepable-checkpoint policy). **Recover**: delete the
`*.tmp`/`latest.pth` to free several GB — `best.pth` survives, the queue can resume.

### U7 — Storage fails on the dimension (and location) not being watched

**Symptom**: `cp`/`mkdir` fails `No space left on device`, yet `df -h` shows ~34% used — because `df -i`
reads `100%` (inodes exhausted). Or the data mount fills despite `runs/` looking small.

**Root cause**: disk dies on **inodes before bytes** — the classic trigger is **per-sample eval output**,
which writes on the order of `files_per_sample × N_samples × N_conditions` tiny files. And the real
byte-hog often hides where nobody looks: a **symlinked cache** (`~/.cache/huggingface` mapped onto the data
disk) can outweigh everything the run created.

**Fix**: monitor `df -i`, not just `df -h`, in Phase 0 and every space check. **Audit the real mount with
`du`, not assumptions** (`du -sh ~/.cache/huggingface/hub/models--* | sort -rh`). Clean by **value** — keep
the tiny irreplaceable evidence (metric/eval JSONs), drop the large reproducible scratch (periodic
checkpoints, unused caches). Cap per-sample eval visualization (cross-link verifying-dl-experiments
**REQUIRED** for the sizing policy). The *inode-cap number* is a profile fact (some platforms enforce a hard
~200K cap; GB-quota'd platforms have none); the many-small-files general form is **shard into tar** (U25).
Get explicit user confirmation naming `rm -rf` targets; offer "clean vs expand the disk" (principle #9).

### U8 — Stage hot data to local NVMe before training

**Symptom**: training is I/O-bound reading from a network/shared/HDD-backed volume; GPU starves between
batches.

**Root cause**: a remote/networked filesystem (or a spinning data disk) has far lower random-read
throughput than instance-local NVMe — HDD-vs-NVMe gaps reach ~35×.

**Fix**: at job start, copy the working dataset from the durable/shared tier to instance-local NVMe scratch,
train against the local copy, write checkpoints back to durable storage. The local-NVMe path is a profile
fact (`local_nvme` in the frontmatter); the stage-then-train discipline is universal. Pairs with U24/U25.

---

## Memory & OOM

### U9 — `num_workers` × a big in-RAM tensor → cgroup OOM-kill (bare "Killed", exit 137)

**Symptom**: training dies early with a bare `Killed` / `killed by signal: Killed (-9)` and **no Python
traceback**; lowering `num_workers` makes it vanish.

**Root cause**: each DataLoader worker is a `fork` that gets its **own copy** of any large object the
dataset holds (a 16384² float32 matrix ≈ 1 GB). `num_workers=W` ⇒ ~`(W+1)×` that footprint, which blows the
instance's cgroup `memory.max` even though a bare-process run fits. The kernel OOM-kills with no
Python-level error, so it reads as a mysterious crash.

**Fix**: size `num_workers` against `memory.max` and the per-worker resident set, **not** CPU count. Share
one copy across workers (memmap / module-level singleton built once) or generate the object on the fly.
Shrinking the problem also fixes it — a smaller matrix dim shrinks footprint *quadratically* (dim 1024 ≈
4 MB, 256× less than 16384). Confirm it's OOM: `dmesg | tail` shows `Out of memory: Killed process`, and the
same config survives `num_workers=0`.

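
The `(W+1)×` sizing rule can be turned into a quick pre-launch calculation. The sketch below is illustrative only: `max_workers_for_limit` is a hypothetical helper, the 8 GiB limit and 2 GiB headroom are assumed numbers, and it treats the large object as the only per-process cost.

```bash
#!/usr/bin/env bash
# Hypothetical helper: largest num_workers W such that (W + 1) copies of a
# per-process resident object fit under the cgroup memory limit.
# Args: <limit_bytes> <object_bytes> [headroom_bytes]
max_workers_for_limit() {
  local limit=$1 object=$2 headroom=${3:-0}
  local usable=$(( limit - headroom ))
  if [ "$usable" -lt "$object" ]; then
    echo 0                      # not even the main process fits comfortably
    return
  fi
  echo $(( usable / object - 1 ))
}

GIB=$(( 1024 * 1024 * 1024 ))
# 16384 x 16384 float32 = 16384 * 16384 * 4 bytes = exactly 1 GiB
object_bytes=$(( 16384 * 16384 * 4 ))
# Assumed example: 8 GiB limit, 2 GiB kept free for everything else.
max_workers_for_limit $(( 8 * GIB )) "$object_bytes" $(( 2 * GIB ))   # prints 5
```

On the box the limit would come from `/sys/fs/cgroup/memory.max`; that file reads `max` when no limit is set, so a real wrapper needs to guard for a non-numeric value.
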
### U10 — VRAM OOM (a big model or a concurrent job) is distinct from cgroup-RAM OOM (U9)

**Symptom**: `torch.OutOfMemoryError: CUDA out of memory` when launching a second train/eval while another
runs, or a big model (deep transformer / unrolled net at high res) OOMs alone.

**Root cause**: **VRAM** — the sum of concurrent jobs' allocations plus fragmentation exceeds the card. NOT
host-RAM (U9).

**Fix**: check free VRAM first (`nvidia-smi --query-gpu=memory.free --format=csv,noheader`); size the batch
to fit *alongside* any concurrent job; set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to cut
fragmentation. (Run heavy DL on the box; do static/shape checks locally — cross-link
verifying-dl-experiments **REQUIRED** for local-OOM rationale.)

### U11 — A zombie holds VRAM `nvidia-smi` cannot see → OOM on an "empty" GPU

**Symptom**: `nvidia-smi` lists no process and shows free memory, yet a fresh job OOMs immediately; common
after a crashed DDP run or a killed container.

**Root cause**: a defunct/orphaned process (or a dead container's namespace) still holds CUDA context and
VRAM, but `nvidia-smi`'s process table can't attribute it — so the GPU *looks* empty while memory is locked.

**Fix**: enumerate the real holders via the device nodes and reap them:
```bash
fuser -v /dev/nvidia* 2>/dev/null   # or: lsof /dev/nvidia*  → kill -9 the listed PIDs
```
If containerized, restart the container. Ship a small reaper that flags any PID with persistent VRAM + ~0%
util beyond a timeout — cross-link `scripts/reap_vram_zombies.sh`.

### U41 — On a shared box, `uptime`/`free` describe the whole physical host, not your container — use cgroup-scoped readings + the `oom_kill` counter

**Symptom**: a detached run looks "dead" or "the host is overloaded" — `uptime` shows load average 400+,
`top`/`free -m` look maxed — so you suspect contention or an OOM-kill. But the job's own checkpoint `mtime`
keeps advancing and its log still grows.

**Root cause**: on a multi-tenant rental, host tools (`uptime`, `top`, `free -m`, `vmstat`) report the
**physical node you share with other tenants**, not your cgroup. A neighbor's job spikes the host load
average to ~490 while your container sits near-idle (your processes in `R`/`S`, none stuck in
uninterruptible `D`). Reading host load as your own → a false "overloaded / OOM-killed" verdict and a
needless kill-and-restart of a healthy run.

**Fix**: judge YOUR container from cgroup-scoped readings, not host tools:
- memory — `/sys/fs/cgroup/memory.current` vs `memory.max` (not `free -m`);
- were YOU OOM-killed — the **`oom_kill` counter** in `/sys/fs/cgroup/memory.events`
  (`grep oom_kill /sys/fs/cgroup/memory.events`); a non-incrementing counter means you were **not**
  OOM-killed, however red host `free` looks;
- CPU pressure — `/sys/fs/cgroup/cpu.stat` / `cpu.pressure`.

A high host load with your cgroup roomy and `oom_kill 0` is a **noisy neighbor**, not your bug — don't
shrink your batch or blame your code (a neighbor genuinely starving you on the shared card is U21/U23
throttle territory or a re-rent, not a code fix). Sharpens the **U3** vanished-process ladder: the
authoritative OOM check is the cgroup `oom_kill` counter, not host `dmesg`/`free` noise.

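
The `oom_kill` check is easiest to trust as a before/after comparison. The following is a sketch under assumptions: `oom_kill_count` and `oom_verdict` are hypothetical helpers, and the default path assumes cgroup v2 mounted at `/sys/fs/cgroup`. The exact field match is deliberate, so that a neighbouring `oom_group_kill` line (present on newer kernels) is not picked up by accident.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the oom_kill counter from a cgroup-v2
# memory.events file. Returns non-zero if the field is missing.
oom_kill_count() {
  local events=${1:-/sys/fs/cgroup/memory.events}
  awk '$1 == "oom_kill" { print $2; found = 1 } END { exit !found }' "$events"
}

# Compare two readings taken before and after the suspicious window.
# Args: <count_before> <count_after>
oom_verdict() {
  if [ "$2" -gt "$1" ]; then
    echo "OOM-KILLED: counter rose from $1 to $2"
  else
    echo "NOT OOM-KILLED: counter steady at $2"
  fi
}

# Typical use on the box (record the first reading at launch time):
#   before=$(oom_kill_count); ...run...; oom_verdict "$before" "$(oom_kill_count)"
```
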
---

## Transfer & Download

### U12 — `scp -r` of a large dir resets mid-transfer → per-dir resumable loop

**Symptom**: 30–60 min into `scp -r host:...130GB ./`, the connection drops
(`Read from remote host ... reset by peer`); local has a few dirs, the rest gone. scp does not resume.

**Root cause**: a single SSH connection carries the whole transfer; any network blip kills all of it.

**Fix**: loop **per-dir**, each its own SSH session — one failure doesn't lose the others, and re-running
skips completed dirs. Prefer `rsync -avz --partial --append-verify` (resumes a half-file). Wrap bulk pulls
in a `timeout … && break` retry loop: a stall ≠ permanent failure, and resumable transfers accumulate
progress across kills. Validate any speed test on the **same route** the real transfer uses (principle #7).
See `scripts/download_loop.sh` for the per-dir pattern.

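
As a rough illustration of the per-dir, bounded-retry shape (not the contents of `scripts/download_loop.sh`, which is not shown here), the sketch below uses two hypothetical helpers, `retry` and `pull_dirs`. The attempt count, the 1800 s timeout, and the `RETRY_SLEEP` variable are assumed values to tune per route.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds, at most <max> times.
# RETRY_SLEEP (seconds between attempts) is an assumed knob, default 10.
retry() {
  local max=$1; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "${RETRY_SLEEP:-10}"
  done
}

# Pull each remote directory over its own connection. Re-running is cheap for
# finished directories because rsync only transfers what is missing.
# Args: <host> <remote_parent> <local_parent> <dir>...
pull_dirs() {
  local host=$1 remote=$2 local_parent=$3; shift 3
  local d
  for d in "$@"; do
    mkdir -p "$local_parent/$d"
    # timeout turns a stalled transfer into a failed attempt that retry re-runs.
    retry 5 timeout 1800 rsync -az --partial --append-verify \
      "$host:$remote/$d/" "$local_parent/$d/" || echo "FAILED: $d" >&2
  done
}
```

A call would look like `pull_dirs <host> /root/runs ./runs run_a run_b`; a directory that still fails after the last attempt is reported and the loop moves on, so one bad directory does not block the rest.
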
### U13 — `scp` into a remote dir a sibling command was supposed to create (race)

**Symptom**: a background `scp big.tar host:/root/x/` fails instantly with `dest open "/root/x/": Failure`
— the foreground command that would have `mkdir`-ed `/root/x` ran later, or was blocked/cancelled.

**Root cause**: ordering assumption between parallel/sibling commands; the destination dir didn't exist yet.

**Fix**: make every transfer self-sufficient inside its own retry loop:
`ssh host 'mkdir -p /root/x' && scp … || retry`. Never assume a sibling created the destination.

### U14 — Egress is a silent ~20% surcharge; co-locate and stay same-AZ

**Symptom**: the monthly bill is ~20% over the rented GPU-hours; a large model/dataset re-pulled daily from
a hyperscaler bucket dominates cost (a 140 GB model pulled daily from S3 ≈ $378/mo in egress alone).

**Root cause**: hyperscaler **egress** is metered (AWS ~$0.09/GB, GCP ~$0.08, Azure ~$0.087) while most
GPU-clouds (Lambda/RunPod/vast/CoreWeave) charge $0. Worse, **cross-AZ traffic bills ~$0.01/GB each
direction even inside one provider** — storage in a different zone than compute quietly meters every read.

**Fix**: co-locate storage with compute on the **same provider AND same AZ/region**. Pull a dataset once to
durable local storage, not per-epoch from a remote bucket. Record `free_egress` / `egress_per_gb` /
`cross_az_per_gb` as profile fields and prefer a $0-egress GPU-cloud for transfer-heavy jobs.

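
The $378/mo figure is plain multiplication, and the same arithmetic is worth running before choosing where a dataset lives. A small sketch, with `egress_monthly_usd` as a hypothetical helper and the per-GB rates taken from the approximate figures quoted in this entry:

```bash
#!/usr/bin/env bash
# Hypothetical helper: monthly egress cost in USD.
# Args: <gb_per_pull> <pulls_per_month> <usd_per_gb>
egress_monthly_usd() {
  awk -v gb="$1" -v pulls="$2" -v rate="$3" \
    'BEGIN { printf "%.2f\n", gb * pulls * rate }'
}

egress_monthly_usd 140 30 0.09   # prints 378.00 (140 GB daily at ~$0.09/GB)
egress_monthly_usd 140 30 0      # prints 0.00 on a $0-egress provider
egress_monthly_usd 140 1 0.09    # prints 12.60 when the model is pulled once
```

The third line is the "pull once to durable local storage" case: the same model costs about one thirtieth as much when it is not re-fetched daily.
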
### U15 — Compress before the wire

**Symptom**: checkpoint/dataset transfers are slow and (on metered egress) expensive.

**Root cause**: raw tensors and JSON cross the network uncompressed.

**Fix**: zstd/gzip the payload before transfer — cuts checkpoints+datasets 30–60%, JSON 60–80%; store
weights fp16/int8 where the task tolerates it. Compounds with U14 (less egress $) and U12 (fewer bytes to
resume). Pairs with U25 (tar shards compress and transfer as one stream).

---

## Monitoring

### U16 — Stale background waiters pile up; supersede a run → STOP its waiter; pick the right lifetime

**Symptom**: a "Background tasks" panel shows 8+ "Running" wait-loops at 500–740 min elapsed, each
ssh-polling every ~20 s, while the GPU is idle and the experiment finished hours ago.

**Root cause**: every kill+restart of a flaky saga armed a NEW `until ssh grep MARKER; do sleep; done`
waiter but never stopped the OLD one — its marker (in a superseded log) never appears, so it loops forever.
A `run_in_background` waiter is **not** time-capped (a 781 s task ran to completion + notified; the ~600 s
cap is on **foreground** Bash only). The real silent-failure mode is a waiter that never EXITS (U17).

**Fix**: one waiter per live run — superseding a run, stop the old waiter first (`TaskStop`; cross-session
IDs aren't stoppable from a resumed session — dismiss those from the UI). Multi-hour wait → a **persistent
Monitor** (no 10-min cap) + a stall-detector emit so a hung run still notifies. A persistent Monitor dies on
session resume → after any resume, check the remote ground-truth directly (`tmux ls`, `grep DONE log`,
`nvidia-smi`); never trust a monitor that may be gone (principle #3).

### U17 — A silent background monitor that never returns: usually an unquoted `|` in grep

**Symptom**: a `run_in_background` ssh monitor never returns / never notifies; `pgrep` shows a process
"alive." The run looks hung — but the actual job finished and wrote results fine.

**Root cause**: the wrapper never EXITED because a sub-command blocks forever. The classic bug is an
**unquoted `|` in grep** — `grep -hE noise-sweep|snr=|wrote log` — the shell splits it into THREE piped
commands, and the first (`grep -hE noise-sweep`, no filename) reads **stdin** → blocks forever → the
pipeline never returns → ssh never returns → the local background process never exits → no completion
notification. (Background tasks notify on EXIT only — no 600 s cap; foreground Bash is the capped one, U16.)

**Fix — robust remote-poll template**:
- **Quote every regex AND give grep a filename**: `grep -hE 'noise-sweep|snr=|wrote' log` (a `|` inside quotes is alternation; a filename means read the file, never stdin).
- **Bound the ssh**: `ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 -o ServerAliveCountMax=3 …` — a blip self-kills in ~30 s instead of half-open hanging for minutes.
- **Short-connection poll, not one long-held ssh**: each poll = ssh in → check → disconnect; loop locally with a bounded counter.
- **Verify by artifact, not notification**: when it "looks done," Read the local output + a fresh `ssh 'grep DONE log; tmux ls; nvidia-smi'` to confirm ground truth (cross-link verifying-dl-experiments **REQUIRED**); don't wait on a notification that may never fire.

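
The first and third bullets above can be sketched as two small functions. This is an illustrative outline only: `log_has` and `poll_log` are hypothetical names, and the example runs against a local file so that it stays self-contained. Against a rented box, the `grep` inside `log_has` would run inside a bounded `ssh` call with the timeout options listed above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when <logfile> has a line matching <regex>.
# The regex arrives as ONE quoted argument and grep always gets a filename,
# so it can never fall back to reading stdin.
log_has() {
  local logfile=$1 regex=$2
  [ -r "$logfile" ] || return 2
  grep -qE -- "$regex" "$logfile"
}

# Bounded poll: at most <tries> checks, <interval> seconds apart, then give up
# with a non-zero status instead of looping forever.
# Args: <logfile> <regex> <tries> <interval>
poll_log() {
  local logfile=$1 regex=$2 tries=$3 interval=$4
  local i
  for (( i = 1; i <= tries; i++ )); do
    if log_has "$logfile" "$regex"; then
      echo "MATCH after $i checks"
      return 0
    fi
    sleep "$interval"
  done
  echo "NO MATCH after $tries checks"
  return 1
}
```

Because the loop is bounded, the wrapper always exits, which is what lets a background task fire its completion notification; a "NO MATCH" exit is then the cue to check artifacts directly.
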
### U18 — "I'll check periodically" is a lie unless a trigger is armed; two-leg remote self-completion

**Symptom**: a promise to monitor a multi-hour remote run, then no report for a day — because between turns
the assistant does not run. A cloud scheduler set up to "ssh in and check" silently can't reach the box.

**Root cause**: two conflated things. (a) Making the REMOTE self-complete (a waiter that blocks on a log
marker then runs eval) guarantees RESULTS but gives no *reporting cadence* — nothing re-invokes the
assistant on a timer. (b) A cloud schedule runs in an isolated sandbox with its own checkout and **no access
to the local SSH key or network** → it cannot `ssh` the rented box, and the SSH private key must **never** go
into a cloud agent (secret-leak).

**Fix — the two-leg pattern**:
- **Remote self-completion (guaranteed, survives session/SSH death)**: chain `train → eval → touch marker` under one `nohup ... </dev/null >log 2>&1 &`. Detect "done" by a **log marker** (`grep -q 'QUEUE DONE' master.log`), NEVER by `pgrep` — the waiter's own command line contains the pattern, so `pgrep -f` matches itself and loops forever (U17).
- **Live progress (best-effort)**: a session-bound local loop (e.g. `/loop 30m` / cron `3,33 * * * *`) that ssh-polls with the *local* key. Be honest it dies when the session closes — the remote still finishes; the user re-pings to pull.
- **Don't promise autonomous cross-session polling you can't deliver.** (`tmux` is often absent on a fresh box and `apt-get install` fails offline — `nohup ... </dev/null >log 2>&1 &` is zero-dependency and survives SSH drop; verify with `pgrep -af <script>`.) Full architecture → `references/monitoring_patterns.md`.

### U19 — Tracker run deletion lags; a fresh export resurrects "deleted" runs

**Symptom**: `run.delete()` returns, but an immediate `api.runs()` still lists every deleted run; a batch
history-export minutes later happily re-downloads `<run>__history.csv` for runs just deleted.

**Root cause**: deletion is asynchronous server-side; list/export endpoints serve stale listings for
minutes.

**Fix**: delete → re-verify on a **later** monitoring tick (not a tight loop; a second
`delete(delete_artifacts=True)` pass is safe). Order matters: do cloud deletions **before** local exports,
then re-check the export dir for resurrected files and remove them. (cross-link verifying-dl-experiments
**REQUIRED** for tracker forensics.)

### U20 — Local logs die with the instance: use a hosted tracker

**Symptom**: TensorBoard event files written to an ephemeral box vanish on teardown — every curve gone after
the meter-stop verb runs.

**Root cause**: a rented box's local disk is not durable past `terminate`/`destroy` (principle #4); the
metric history lived only there.

**Fix**: log metrics to a **hosted tracker** so they survive teardown — `trackio.init(space_id=...)` or
`wandb` online (push under the platform's proxy if behind a firewall). Poll the tracker's structured alerts
as the monitor instead of brittle ssh-tail. Cross-link huggingface-skills:huggingface-trackio **REQUIRED**
for the `init/log/finish/alert` mechanics and `space_id` sync.

### U43 — A detached run's log looks frozen for minutes though training is fine: stdout is block-buffered off a TTY

**Symptom**: a `nohup`/`tmux` run prints a few lines then nothing for many minutes; it reads as
"hung / died" and the reflex is to kill it — but checkpoint `mtime`, TB scalars, and `nvidia-smi` all show
it advancing.

**Root cause**: Python (and libc stdio) **line-buffer when stdout is a TTY but block-buffer (~4–8 KB) when
it is a pipe or file** — exactly the detached case. The log only flushes when the buffer fills, so a
healthy run looks silent and a `grep`-on-log liveness check false-alarms on the gap.

**Fix**: run unbuffered — `python -u` or `PYTHONUNBUFFERED=1` (the shipped `scripts/run_one.sh.template`
already exports it); for a shell pipeline use `stdbuf -oL`. And judge liveness by **artifacts, not stdout
cadence** — checkpoint `mtime`, the TB scalar API, `nvidia-smi` (monitoring_patterns §0 corollary; the
deeper "is it actually hung?" attach is py-spy, throughput-profiling **T21**). A frozen log is the single
most common false "dead run."

---

## GPU health

### U21 — `nvidia-smi` GPU-Util % is a liar

**Symptom**: the perf tile reads 100% util but throughput is poor; or util looks "busy" while the job is
actually starved (the inverse of U38, which is the 0%-but-running case).

**Root cause**: `GPU-Util` is the share of the sampling window during which **at least one kernel was
executing**, not how much of the GPU those kernels filled. A stream of tiny kernels that keeps a single SM
busy reads as 100%.

**Fix**: correlate util with **SM clock** (`clocks.current.sm`), memory-bandwidth util, and power draw —
`nvidia-smi dmon -s pucvmet -d 1`. Low SM clock or low power at "100% util" means the GPU is underfed (go to
U24). Always sample over several seconds, never one snapshot.

### U22 — Xid 48/79 = a dead GPU; on a rental, re-rent

**Symptom**: training crashes or the GPU drops out; `dmesg | grep -i xid` shows an Xid error.

**Root cause**: Xid is NVIDIA's canonical hardware-fault signal. **Xid 48 = double-bit ECC (the GPU is
dead); Xid 79 = "GPU has fallen off the bus."** These are hardware, not code.

**Fix**: on a *rental* the card can't be reseated — **stop the instance and re-rent a different box**; don't
burn hours debugging code for a hardware fault. Check `dmesg | grep -i xid` as part of the "vanished
process" ladder (U3) when the GPU goes idle unexpectedly.

### U23 — Thermal/power throttling silently steals 25–40% with no error

**Symptom**: "the same code is slower than yesterday" — no error, no crash, just lower throughput.

**Root cause**: the GPU is thermal- or power-throttling (an H100 throttles around 83 °C; target <75 °C). On
a shared rental, cooling/power headroom is outside tenant control.

**Fix**: detect — SM clock falling below base while temp >83 °C, or
`nvidia-smi -q -d PERFORMANCE` showing a throttle reason. A tenant can't fix cooling → **flag it and
re-rent** a healthier box; don't read the slowdown as a model/data regression. Pairs with U21 (clocks expose
it where util% hides it).

---

## Dataloader & IO

### U24 — GPU starves at 10–70% waiting on the dataloader, not on compute

**Symptom**: util sits well below 100% (but nonzero), step log advances slowly; profiling shows time spent
in data fetch, not fwd/bwd.

**Root cause**: the input pipeline can't keep the GPU fed — too few workers, no prefetch, host↔device copies
on the critical path. (Distinct from U38's *0%* CPU-data-bound transform case; this is the partial-starve
knob set.)

**Fix — tune in order**: `num_workers = cores − 1` (sized against per-worker footprint, U9),
`persistent_workers=True`, `pin_memory=True`, `prefetch_factor=2`. Pathological cases show >100× gaps from
these alone. If a heavy per-sample transform is the bottleneck, move it to the GPU (cross-link
verifying-dl-experiments **REQUIRED** for the 0%-util diagnosis, U38). Pairs with U8 (stage to NVMe) and U25.

### U25 — Millions of small files on a network/object store → transaction-overhead death; shard into tar

**Symptom**: a dataset of many tiny files streams glacially from a shared/object store; or eval output of
tens of thousands of per-sample files exhausts inodes (U7) or blows a visualization grid (U37).

**Root cause**: per-file open/stat/close overhead dominates on networked/object storage; the inode and
metadata cost scales with file *count*, not bytes.

**Fix**: pack into **sharded tar** (WebDataset), a few-hundred-MB per shard → 3–10× faster sequential I/O and
the only sane pattern for streaming from S3. This is the **general form** of the inode-exhaustion trap (U7)
and the per-sample-vis trap — cap and shard rather than emitting a file per sample. Pairs with U8 (stage the
shards to local NVMe) and U15 (shards compress as one stream).

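
A plain-tar version of the sharding step might look like the sketch below. It is an assumption-laden outline, not the WebDataset tooling itself: `shard_dir` is a hypothetical helper, it shards by file *count* as a stand-in for the few-hundred-MB target (tune the count to the average file size), and it assumes GNU `tar`/`split` and file names without newlines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pack every file under <src> into numbered tar shards of
# at most <per_shard> files each, written to <out> (outside <src>).
# Args: <src_dir> <out_dir> <per_shard>
shard_dir() {
  local src=$1 out=$2 per_shard=$3
  mkdir -p "$out"
  # Sorted, relative file list, cut into chunks of <per_shard> lines.
  ( cd "$src" && find . -type f | sort ) \
    | split -l "$per_shard" -d -a 5 - "$out/list."
  local list
  for list in "$out"/list.*; do
    [ -e "$list" ] || continue            # empty source: nothing to pack
    # -C makes the member names in the list resolve relative to <src>.
    tar -C "$src" -cf "$out/shard-${list##*.}.tar" -T "$list"
    rm -f "$list"
  done
}
```

Calling `shard_dir /data/samples /data/shards 10000` (absolute paths are the least surprising) leaves `shard-00000.tar`, `shard-00001.tar`, and so on, which replaces millions of inodes with a handful of sequentially readable files.
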
### U40 — A vCPU-sliced rental starves its own GPU: torch intra-op threads default to the HOST core count, not your cgroup quota
|
||
|
||
**Symptom**: GPU `sm%` sits ~5–15% and runs grind, but the dataloader is not the bottleneck (few/no
|
||
workers, data already on-device, the U24 knobs don't help); `top` shows dozens of python threads fighting
|
||
over a handful of cores.
|
||
|
||
**Root cause**: you rent a **cgroup CPU slice** (e.g. 12 vCPUs of a 64-core host), but torch/OpenMP size
|
||
their intra-op thread pools to the **physical** core count — `torch.get_num_threads()` / `OMP_NUM_THREADS`
|
||
come up ~64. ~57 runnable threads thrashing 12 cores burn the slice on context-switching, so the CPU side
|
||
that launches kernels and feeds the GPU can't keep up and the card idles. No OOM, no error — pure scheduler
|
||
thrash (the *host scheduling* starves the GPU, the inverse of being data-bound).
|
||
|
||
**Fix**: cap the pools to your **slice's** vCPU count before launch —
|
||
`export OMP_NUM_THREADS=4 MKL_NUM_THREADS=4` (and/or `torch.set_num_threads(4)`); confirm torch honoured it
|
||
(`python -c "import torch; print(torch.get_num_threads())"` → 4, not 64). Read the real quota from the
|
||
cgroup, not `nproc` (which reports host cores): `cat /sys/fs/cgroup/cpu.max` → `quota period`, vCPUs ≈
|
||
quota/period. Bake the cap into the launch wrapper so every queue cell inherits it. Distinct from **U9**
|
||
(workers × RAM → cgroup OOM) and **U24** (dataloader starvation); the triage that catches it is
|
||
throughput-profiling **T3** (GPU SM% low while a python thread-storm pegs the cores).
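
The quota arithmetic can be sketched as a small helper for the launch wrapper. `vcpus_from_cpu_max` is a hypothetical name; the sketch assumes the cgroup v2 `cpu.max` format (`quota period` on one line), rounds down with a floor of 1, and falls back to `nproc` when the quota reads `max`.

```bash
# vcpus_from_cpu_max FILE: print the vCPU count implied by a cgroup v2 cpu.max file.
# Hypothetical helper; FILE holds "<quota> <period>" or "max <period>".
vcpus_from_cpu_max() {
    read -r quota period < "$1" || return 1
    if [ "$quota" = "max" ]; then
        # No CPU quota is set, so fall back to what nproc reports.
        nproc
        return
    fi
    # Round down so the pools never exceed the slice, but keep at least 1 thread.
    n=$(( quota / period ))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

# Usage in a launch wrapper (cgroup v2 path):
#   n=$(vcpus_from_cpu_max /sys/fs/cgroup/cpu.max)
#   export OMP_NUM_THREADS="$n" MKL_NUM_THREADS="$n"
```

Rounding down is a deliberate choice in this sketch: a fractional quota such as 1.5 CPUs gets one thread rather than two, which errs on the side of not oversubscribing.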

---

## Env & Container

### U26 — CRLF breaks `.sh` on Linux (authored on Windows)

**Symptom**: a synced launcher silently does nothing (empty log); run by hand it errors `set: -: invalid
option`, `cd: /path\r: No such file or directory`, `syntax error near unexpected token $'do\r'` — every
line "ends in `\r`."

**Root cause**: Windows `core.autocrlf=true` (or `git archive` exporting working-tree EOL) writes `.sh` with
CRLF; Linux `bash` treats the trailing `\r` as part of each token. `.py` is unaffected (Python's universal
newlines); it is specifically `bash`/`.sh` that breaks.

**Fix**: add `.gitattributes` with `*.sh text eol=lf` (so `git archive`/checkout always emits LF); immediate
on-box unblock: `sed -i 's/\r$//' scripts/*.sh`.
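
A pre-flight check can catch the problem before a launcher silently no-ops. The sketch below is one possible shape, with `crlf_files` as a hypothetical helper name; it only reports offenders and leaves the `sed` fix to you.

```bash
# crlf_files FILE...: print the files that contain at least one CRLF line ending.
# Hypothetical helper for a pre-launch check.
crlf_files() {
    cr=$(printf '\r')
    for f in "$@"; do
        # grep -q exits 0 as soon as one line ends in a carriage return.
        if grep -q "${cr}\$" "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage: refuse to launch while any script still has CRLF endings.
#   bad=$(crlf_files scripts/*.sh)
#   [ -z "$bad" ] || { echo "CRLF in: $bad"; exit 1; }
```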

### U27 — `-o dotted.key=value` overrides explode on null parents → freeze protocols as overlay config FILES

**Symptom**: `-o evaluation.sps_augmentation.enable=true` crashes
`KeyError: Override path '...' is not a mapping` because the base YAML has the parent as `null`. Worse
long-term: protocol variants that exist only as one-off CLI strings are unreproducible months later.

**Root cause**: dotted-key override traversal can't descend through a `null` parent; and a CLI-string-only
protocol has no diffable, reviewable record.

**Fix**: define each protocol variant as a small overlay config (`configs/eval_overlays/<protocol>.yaml` with
`_base_:` pointing at the canonical leaf) and pass it via `-c`. Reviewable, diff-able, immune to null-parent
traversal. This is also the **retry-the-identical-config mechanism** (principle #7): an overlay file is a
stable config a retry re-uses byte-for-byte. To reconstruct a historical protocol, read the artifact
manifest (`*_manifest.json` records the resolved overrides verbatim).

### U28 — The CUDA-toolkit ↔ host-driver ↔ torch-build triangle

**Symptom**: `detected CUDA version mismatches the version used to compile PyTorch`; or `no kernel image is
available for execution` at the first forward on a new-arch GPU.

**Root cause**: three independently-versioned layers must agree — **the host driver is host-global and a
tenant usually cannot change it on a rental; the CUDA toolkit is per-env and changeable; the torch build
must match both.** The toolkit must be ≤ what the host driver supports; a project that pins
`torch<2.9` can *downgrade* the only build with kernels for a new-arch card (e.g. sm_120).

**Fix**: keep the image's working torch — filter framework pins out of the remote install:

```bash
# Drop only the framework pins; the trailing class keeps e.g. torchmetrics / torch-geometric.
grep -ivE '^(torch|torchvision|torchaudio)([^A-Za-z0-9_.-]|$)' requirements.txt > /root/req_remote.txt
pip install -r /root/req_remote.txt
```

Set `LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH` when the per-env toolkit must win. Smoke-test
`torch.cuda.get_device_capability()` + a heavy project import before launching; the torch version that falls
outside the project's pin lands in the runtime snapshot — disclose it with results. `host_driver_cuda_max` is
a profile field.
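
The "toolkit ≤ driver" rule can be checked mechanically before launch. This is a rough sketch, assuming both versions are plain `major.minor` strings; `cuda_version_le` is a hypothetical helper, and the usage lines assume `torch.version.cuda` and the `CUDA Version:` field of the `nvidia-smi` header are the two values worth comparing on your box.

```bash
# cuda_version_le A B: succeed when CUDA version A (major.minor) is <= B.
# Hypothetical helper; compares numerically, so 12.10 > 12.9.
cuda_version_le() {
    a_major=${1%%.*}; a_minor=${1#*.}
    b_major=${2%%.*}; b_minor=${2#*.}
    [ "$a_major" -lt "$b_major" ] ||
        { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -le "$b_minor" ]; }
}

# Usage (assumes torch is importable and nvidia-smi prints its usual header):
#   build=$(python -c "import torch; print(torch.version.cuda)")
#   driver=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
#   cuda_version_le "$build" "$driver" || echo "torch CUDA $build > driver max $driver"
```

A passing check does not prove the build has kernels for the card's architecture, so it complements the `get_device_capability()` smoke test rather than replacing it.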

### U29 — "Same version, different result": top-level pins let transitive deps drift → install from a lockfile

**Symptom**: two installs of the "same" `requirements.txt` produce different behavior/results.

**Root cause**: a hand-edited `requirements.txt` pins only top-level packages; transitive dependencies drift
between installs.

**Fix**: install from a **lock file** (`uv lock` / `pip-tools` / `conda-lock`) that pins the full resolved
graph, not a hand-edited top-level list. Pairs with U28 (filter the framework pins, then lock the rest).

### U30 — A Dockerfile is NOT reproducible: pin the base image by `@sha256:` digest

**Symptom**: a container built from the "same" Dockerfile months apart behaves differently.

**Root cause**: `FROM image:latest` (or any moving tag) resolves to a different layer set over time.

**Fix**: pin the base image by content digest — `FROM image@sha256:<digest>`, not `:latest` — so every build
starts from identical base layers. Package installs in later `RUN` steps still need their own pins (U29).
(`pin_image_by_sha256` is a per-platform expectation where the image is the env contract.)
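
A quick lint can flag moving tags before a build. The sketch below uses a hypothetical helper, `unpinned_from_lines`, and is deliberately crude: it also flags `FROM scratch` and `FROM <earlier-stage>` lines in multi-stage builds, which you would whitelist by hand.

```bash
# unpinned_from_lines DOCKERFILE: print FROM lines that are not pinned by digest.
# Hypothetical helper; exits non-zero when every FROM line is pinned.
unpinned_from_lines() {
    # Select FROM instructions case-insensitively, then drop the digest-pinned ones.
    grep -iE '^[[:space:]]*FROM[[:space:]]' "$1" | grep -v '@sha256:'
}

# Usage:
#   if unpinned_from_lines Dockerfile; then echo 'pin the lines above by digest'; fi
```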

### U31 — Container runs but trains 100× slower = the GPU was never attached (CPU-only)

**Symptom**: a containerized job runs to completion but is absurdly slow; loss curves look normal, just
glacial.

**Root cause**: the container has no GPU — launched without `--gpus all`, or the NVIDIA Container Toolkit is
missing/too old, so the code found no CUDA device and silently fell back to CPU.

**Fix**: `docker run --gpus all …`, NVIDIA Container Toolkit ≥1.14, and **validate `nvidia-smi` *inside* the
container before training** — never assume GPU attachment from a clean `docker run`.

### U42 — The box runs a hand-synced copy with no git remote; a fix you "committed" may not be deployed — verify it is ON the box before trusting a run or tearing down

**Symptom**: a bug you fixed and committed locally still reproduces on the box, or an eval runs on stale
logic (wrong default, missing speedup, pathologically slow), even though local `git log` shows the fix
landed.

**Root cause**: most rentals have **no git remote** — the box holds a working tree you pushed by
`scp`/`rsync`/`tar-over-ssh`, so its code only advances when you re-sync. A local commit changes nothing on
the box; an interrupted or wrong-path sync, or simply forgetting, leaves the box pre-fix. "I committed it"
≠ "it's running on the box."

**Fix**: treat code deploy like the checked-sync (**U33**) — **verify, don't assume**. After syncing, grep
the box for the change before relying on it:

```bash
ssh "$HOST" "grep -n '<new symbol / changed line>' /root/<proj>/path/file.py" || echo 'NOT DEPLOYED'
```

or compare a hash (`ssh host 'sha256sum file'` vs local). Make it a pre-flight for any run whose result
depends on the fix, and part of the **Phase-5 teardown gate** — a verdict produced by stale code is not the
verdict you think it is (principle #3). Pairs with **U29/U30** (pin deps/image): code AND environment must
both be the version you believe.
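
The hash comparison can be wrapped so that an empty or failed remote read counts as "not deployed" rather than passing by accident. A minimal sketch, assuming `sha256sum` exists on both ends; `sha256_of` and `is_deployed` are hypothetical helper names, and `HOST` plus the paths in the usage lines are placeholders.

```bash
# sha256_of FILE: print only the hex digest of FILE.
sha256_of() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# is_deployed LOCAL_FILE REMOTE_DIGEST: succeed only when the digests match.
# An empty REMOTE_DIGEST (ssh failed, file missing) is treated as a mismatch.
is_deployed() {
    [ -n "$2" ] && [ "$(sha256_of "$1")" = "$2" ]
}

# Usage (HOST and both paths are placeholders):
#   remote=$(ssh "$HOST" "sha256sum /root/proj/path/file.py" | cut -d ' ' -f 1)
#   is_deployed path/file.py "$remote" || echo 'NOT DEPLOYED'
```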

---

## Cost & teardown

### U32 — A task's default epochs differ from another task's; CLI `--epochs` silently overrides the right value

**Symptom**: one CLI `--epochs N` is applied to all ablations; a subset (e.g. detection vs recon/seg)
consistently underperforms; a reviewer flags it.

**Root cause**: some task families need more epochs to converge and default to a higher value in their YAML;
a blanket CLI `--epochs` silently overrides that per-task default.

**Fix**: make the queue support a per-line epoch field (e.g. recon/seg `20`, det `50`); audit the codebase's
YAML for `epochs:` declarations before deploying (`grep -rE '^\s*epochs:' configs/ | sort -u`). This is a
config-drift instance — really a smoke/sanity target (cross-link verifying-dl-experiments **REQUIRED**).
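
One way a per-line epoch field could look, as a sketch only: the queue format (`<config-path> [epochs]`, with blank lines and `#` comments skipped) and the helper name `print_launch_plan` are assumptions made for illustration, not the project's actual queue syntax. A line without the field emits no `--epochs` flag, so the YAML default applies.

```bash
# print_launch_plan QUEUE_FILE: echo the launch arguments for each queue line.
# Hypothetical queue format: "<config-path> [epochs]".
print_launch_plan() {
    while read -r cfg epochs _rest; do
        case "$cfg" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        if [ -n "$epochs" ]; then
            echo "$cfg --epochs $epochs"
        else
            # No per-line value: pass no flag so the task's YAML default wins.
            echo "$cfg"
        fi
    done < "$1"
}
```

Printing the plan before launching makes the per-task values reviewable in the same way the `grep` audit above makes the YAML defaults reviewable.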

### U33 — Silent sync failure: gate the success line on the actual copy result

**Symptom**: a wrapper prints `auto-synced <name> to durable storage` for every job, but at download time
the durable dir is missing or empty.

**Root cause**: the sync block does `mkdir -p "$DST"; cp -f ... 2>/dev/null` then `echo synced`
**unconditionally** — it never checks the exit code. When the durable FS is inode-exhausted (U7) `mkdir`
fails but the success line still fires, so monitoring looks green while nothing landed (principle #3).

**Fix — checked, gated sync**:

```bash
if mkdir -p "$DST" && cp -f "$CKPT_DIR/best.pth" "$DST/" && [ -f "$DST/best.pth" ]; then
    echo "[$(date +%H:%M:%S)] auto-synced $NAME to durable storage"
else
    echo "[$(date +%H:%M:%S)] !! SYNC FAILED for $NAME (check df -i) — data disk is still source-of-truth"
fi
```

Until a download is verified locally, trust the **data-disk** copy, not the "synced" log line. The shipped
`scripts/run_one.sh.template` carries the checked version.

---

## Secrets & trackers

### U34 — Move credentials to the box without the secret ever appearing in a command

**Symptom**: pasting a key into an ssh/scp command leaks it into shell history, transcripts, and hook logs;
security hooks (rightly) block scp-ing a whole `~/.netrc` (it carries other machines' credentials).

**Root cause**: any secret inside a command string is captured by history/transcript/hook logging.

**Fix**: stream exactly one machine block via **stdin** — the value flows file→pipe→file and never appears in
any command text or output:

```bash
# -F matches the host literally; -A 2 assumes the block is machine/login/password on three lines.
grep -F -A 2 'machine api.wandb.ai' ~/.netrc | ssh "$HOST" 'umask 077; cat > /root/.netrc && chmod 600 /root/.netrc'
```

Verify by capability, not by echoing the value:
`python -c "import wandb; print(wandb.Api(timeout=20).default_entity)"`. Never write the secret to a
shared/durable FS that a platform classifier scans (that platform detail is a profile fact).

### U35 — `WANDB_MODE=offline` still dies without an API key in wrapper stacks → zero curves

**Symptom**: a run launched `WANDB_MODE=offline` expecting "log locally, sync later" produces **no offline
run dirs at all**; the train log shows `Disabled WandB due to initialization error: No API key configured`.

**Root cause**: bare-SDK offline mode needs no key, but project logger *wrappers* often probe the API
(`wandb.login()` / `wandb.Api()`) before `init` and treat key-absence as fatal → they flip to fully-disabled,
not offline.

**Fix**: push credentials BEFORE the first launch (U34) and run online under the platform's proxy; verify the
first log lines show `Syncing run <name>` + a run URL — treat the *absence* of that line as a failure. Run
already finished without a tracker? Backfill from the train log (regex per-epoch summaries →
`init(..., tags=["backfilled"]) → run.log(..., step=epoch)`). Still in flight? Kill and relaunch with
`--resume <latest.pth>` (costs ≤1 epoch). Prefer a hosted tracker so metrics survive teardown (U20).
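
The "absence of the line is a failure" rule can be turned into a gate on the train log. A minimal sketch, assuming the two log strings quoted above appear verbatim in your wrapper's output; `tracker_attached` is a hypothetical helper name.

```bash
# tracker_attached LOG: succeed only if the log shows an attached wandb run.
# Hypothetical helper; the matched strings are the ones quoted in this entry.
tracker_attached() {
    # The wrapper's failure line wins even if another line looked healthy.
    if grep -q 'Disabled WandB' "$1"; then
        return 1
    fi
    # A missing "Syncing run" line counts as a failure, not as "unknown".
    grep -q 'Syncing run' "$1"
}

# Usage shortly after launch (train.log is a placeholder path):
#   tracker_attached train.log || echo 'NO TRACKER: kill and relaunch (U35)'
```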

---

## Delegated — cross-link only, do NOT restate here

### U36 — cuDNN nondeterminism

Same config + seed gives slightly different metrics run-to-run (`cudnn.benchmark=True` picks the fastest
kernel by first-batch timing). Owned by **verifying-dl-experiments** (determinism). Cross-link
verifying-dl-experiments **REQUIRED**; do not restate the fix here.

### U37 — matplotlib `2^16`-per-axis limit on large eval visualization

A composite grid (one row per sample) on a large test set crashes
`Image size … must be less than 2^16`, often aborting the summary save. Owned by
**verifying-dl-experiments** (eval-artifact sizing). Cross-link verifying-dl-experiments **REQUIRED**;
prevent with U25 (cap + shard, don't emit a file/row per sample).

### U38 — GPU at 0% util but training IS running (CPU-data-bound, not stalled)

`nvidia-smi` reads ~0% util yet the step log advances and model memory is loaded — a heavy per-sample CPU
transform with `num_workers=0` serializes data prep and starves the GPU. Owned by
**verifying-dl-experiments** (0%-util diagnosis). Cross-link verifying-dl-experiments **REQUIRED**; the fix
knobs are U24, the move-to-GPU remedy is in that skill.

### U39 — Live monitoring shows nothing (TensorBoard panel empty / `INACTIVE`) but training is fine

**Symptom**: the platform's TensorBoard tile / web panel is blank or `INACTIVE`, or a backgrounded watcher
goes silent — yet the run is healthy: the loss advances on the box and the event/log files exist. You
conclude "monitoring is broken" or, worse, "the run died," and waste a check or restart a fine run.

**Root cause**: live observability breaks in three platform-shaped ways, none of which is a training
failure. (1) **Path mismatch** — the platform's built-in panel reads a FIXED logdir/port and your logger
wrote elsewhere, so the panel sees zero runs (AutoDL pins `tensorboard --logdir /root/tf-logs`; a
`SummaryWriter(log_dir="runs/<exp>")` is invisible to it). (2) **Process died / never backgrounded** — the
TB server or the watcher ran in the foreground or under the session and was killed at the foreground cap
or on session/SSH drop, so nothing serves the curves. (3) **Port not exposed** — the service is up on the
box but the port was never tunnelled / declared, so the panel can't reach it.

**Fix** (the rule is universal; the *value* is per-profile): (1) **align the path** — point your logger at
the panel's pinned dir, OR symlink the pinned dir at your output (`ln -sfn <your-runs>/<exp> <pinned>/<exp>`);
no retrain — the running writer keeps appending and the panel reloads it. The pinned path lives in the
profile (AutoDL `/root/tf-logs`, **AD7**; elsewhere write under the durable mount). (2) **run TB + the
watcher under the detach primitive** (tmux / nohup / the profile's `DETACH`), never foreground, so they
survive the session and the ~600 s cap (`references/monitoring_patterns.md` §1; cross-host background →
§7). (3) **expose the port the platform's way** — CN built-in tiles declare it at rent time (`china.md`),
RunPod via its HTTP proxy (100 s Cloudflare cap, fine for a TB UI, `runpod.md`), Lambda / Paperspace /
bare-SSH via an `ssh -L 6006:localhost:6006` tunnel (`generic-ssh.md`, `lambda.md`). Before blaming the
panel, verify ground truth: the event file is non-empty (`ls -la <logdir>; du -sh <logdir>`) and TB
answers locally (`curl -s localhost:<port>/ | head`). For curves that must **survive teardown**, don't
depend on a box-local panel at all → a hosted tracker (**U20**).
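
The first ground-truth check (a non-empty event file exists) can be scripted so a watcher does not depend on eyeballing `ls` output. This sketch assumes TensorBoard's usual `events.out.tfevents.*` file naming; `logdir_has_events` is a hypothetical helper name.

```bash
# logdir_has_events DIR: succeed if DIR holds at least one non-empty TB event file.
# Hypothetical helper; searches DIR recursively.
logdir_has_events() {
    # -size +0 keeps only non-empty files; grep -q . succeeds if find printed anything.
    find "$1" -type f -name 'events.out.tfevents.*' -size +0 | grep -q .
}

# Usage (the logdir is whatever your profile pins):
#   logdir_has_events /root/tf-logs || echo 'no event data: path mismatch or logger never wrote'
```

If this passes and the panel is still empty, the remaining suspects are the process and the port, which is where the `curl` check comes in.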

---

## Pointers — gotchas catalogued elsewhere

- **Spot / preemption** (grace windows 2 min → ~0 s, Young/Daly cadence, atomic-write resume, managed-spot frameworks restart-your-process) → `references/spot-resilience.md`.
- **Multi-node / NCCL** (fabric-manager hang, wrong NIC, NCCL timeout, jumbo-frame MTU mismatch, torchrun/Horovod elastic state restore) → `references/multinode.md`. Single-box users skip.