# Launching & debugging multi-GPU / multi-node training — torchrun · Accelerate · DeepSpeed · DDP · FSDP

Pick a launcher, get the rank/world-size env right, choose a parallelism (DDP vs FSDP vs ZeRO), and — when 8 processes silently freeze — find *which* rank diverged. This layer owns *making the distributed job RUN, not hang, and not silently mis-shard*; **verifying-dl-experiments** owns *is the resulting number correct* (a run whose LR silently rescaled with world size, or that resumed from step 0 after a restart, is its concern). Cross-link it (**REQUIRED**) wherever a launch fix changes effective batch size, LR, or precision.

Single box, multiple GPUs is DDP/FSDP over NVLink/PCIe and lives here. The **inter-node** transport (NCCL NIC, fabric-manager, timeout, MTU, elastic restart) is `references/multinode.md` (**REQUIRED** for any job spanning ≥2 instances) — this file ends where the wire between boxes begins.

To jump: `grep -in '<keyword>' references/training/distributed-launch.md` (e.g. `rdzv`, `local_rank`, `unused`, `hang`, `desync`, `fsdp`, `zero`, `state_dict`, `port`, `barrier`, `accelerate`).

## Table of contents

- **Launchers & env** — D1 torchrun-env-contract · D2 standalone-vs-rendezvous · D3 LOCAL_RANK-device-bug · D4 port-collision · D5 accelerate-launch · D6 deepspeed-launcher · D7 which-launcher
- **DDP** — D8 find_unused_parameters · D9 uneven-inputs-Join · D10 SyncBN-&-buffers · D11 effective-batch/LR
- **FSDP** — D12 wrapping-policy · D13 sharding-strategy · D14 mixed-precision · D15 state_dict-type
- **DeepSpeed** — D16 ZeRO-stages · D17 config.json-knobs · D18 auto-&-engine.backward
- **The HANGS** (highest-value) — D19 desync-debug-toolkit · D20 one-rank-diverged · D21 rank-conditional-collective · D22 dataloader-length-mismatch · D23 eval/print/save-on-one-rank
- **Pointers** — inter-node NCCL/NIC/timeout → multinode.md · OOM/sharding-to-fit → oom-memory.md · spot-restart → spot-resilience.md

---

## Launchers & env

### D1 — The rank/world-size env contract every launcher must satisfy

**Symptom**: a raw `python train.py` on a 4-GPU box uses **one** GPU; or `init_process_group` hangs forever because `MASTER_ADDR`/`RANK` were never set.

**Root cause**: `torch.distributed` reads its topology from **environment variables**, not from the GPU count. A bare `python` sets none of them, so the process group never forms.

**Fix**: launch through `torchrun`, which sets the full contract per process ([torchrun docs](https://docs.pytorch.org/docs/2.12/elastic/run.html)):

| Var | Meaning |
|---|---|
| `RANK` | global rank `0..WORLD_SIZE-1` (unique across the whole job) |
| `LOCAL_RANK` | rank **within this node** — bind it to the GPU (`cuda:LOCAL_RANK`), NOT `RANK` (D3) |
| `WORLD_SIZE` | total workers = `nnodes × nproc_per_node` |
| `LOCAL_WORLD_SIZE` | workers on this node |
| `GROUP_RANK` | the node's rank (`0..nnodes-1`) |
| `MASTER_ADDR` / `MASTER_PORT` | FQDN + port of rank-0 hosting the c10d TCP store |

The script reads them (`int(os.environ["LOCAL_RANK"])`), calls `init_process_group(backend="nccl")` (NCCL for GPU; `gloo` for CPU), and calls `torch.cuda.set_device(LOCAL_RANK)` before allocating any CUDA tensor.

### D2 — Single-node uses `--standalone`; multi-node needs a shared rendezvous id+endpoint

**Symptom**: copying a single-node `torchrun` line to a second node either hangs at init or both nodes form two separate 1-node groups.

**Root cause**: single-node and multi-node use **different rendezvous**. `--standalone` self-hosts a rendezvous on localhost (no coordination); multi-node requires every node to point at the *same* external rendezvous server with the *same* job id.

**Fix** ([torchrun docs](https://docs.pytorch.org/docs/2.12/elastic/run.html)):

```bash
# single node, 4 GPUs — self-contained, no addr/port to manage
torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py

# multi-node: IDENTICAL command on every node; only env-derived node-rank differs
torchrun --nnodes=2 --nproc-per-node=8 \
  --rdzv-id=$JOB_ID --rdzv-backend=c10d \
  --rdzv-endpoint=$HEAD_IP:29400 train.py
```

`c10d` is the recommended backend (no etcd dependency). `--nnodes=1:4` enables elastic scaling. The inter-node wire health (NIC pinning, fabric-manager, timeout) is `references/multinode.md`.

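
To make the D1 contract concrete, here is a minimal sketch of a startup check that parses the launcher-provided variables and fails loudly when they are absent or inconsistent. `DistEnv` and `read_dist_env` are hypothetical helper names, not part of PyTorch; only the variable names come from the table in D1.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "LOCAL_WORLD_SIZE",
            "MASTER_ADDR", "MASTER_PORT")


@dataclass(frozen=True)
class DistEnv:  # hypothetical container, not a PyTorch type
    rank: int
    local_rank: int
    world_size: int
    local_world_size: int
    master_addr: str
    master_port: int


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Parse the env contract; raise instead of hanging later in init."""
    missing = [k for k in REQUIRED if k not in env]
    if missing:
        raise RuntimeError(f"not started by a distributed launcher, missing: {missing}")
    ctx = DistEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        local_world_size=int(env["LOCAL_WORLD_SIZE"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )
    if not 0 <= ctx.rank < ctx.world_size:
        raise ValueError(f"RANK {ctx.rank} outside 0..{ctx.world_size - 1}")
    if not 0 <= ctx.local_rank < ctx.local_world_size:
        raise ValueError(f"LOCAL_RANK {ctx.local_rank} outside 0..{ctx.local_world_size - 1}")
    return ctx


# Example: the second worker on node 1 of a 2-node x 8-GPU job.
fake_env = {"RANK": "9", "LOCAL_RANK": "1", "WORLD_SIZE": "16",
            "LOCAL_WORLD_SIZE": "8", "MASTER_ADDR": "head-node", "MASTER_PORT": "29400"}
ctx = read_dist_env(fake_env)
print(ctx.rank, ctx.local_rank)  # 9 1
```

In a real script the returned `local_rank` would feed `torch.cuda.set_device(...)` and the process group would be created right after; raising early on a missing variable turns an indefinite init hang into an immediate, readable error.
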
### D3 — Every process lands on GPU 0 (the `RANK` vs `LOCAL_RANK` bug)

**Symptom**: on multi-node, all of node-1's processes pile onto `cuda:0` and OOM, while GPUs 1-7 sit idle; single-node looked fine.

**Root cause**: the script indexed the device by `RANK` (`torch.cuda.set_device(RANK)`) or never bound a device at all. On a single node `RANK == LOCAL_RANK`, so the bug hides. On node 1 of a 2-node job `RANK` is 8-15 but the node only has GPUs 0-7, so `set_device(RANK)` fails with an invalid-device-ordinal error; and when the binding is skipped or the error swallowed, every process stays on the default device, `cuda:0`.

**Fix**: **always index the local device by `LOCAL_RANK`**, never `RANK`: `torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))`. `RANK` selects the *data shard*; `LOCAL_RANK` selects the *physical GPU*.

### D4 — `RuntimeError: Address already in use` when launching a second job on one node

**Symptom**: a second `torchrun` (e.g. a parallel ablation cell) on the same box dies immediately with `errno 98: Address already in use`.

**Root cause**: both jobs default to `MASTER_PORT=29500`; the c10d TCP store can't bind a port the first job holds ([pytorch#85604](https://github.com/pytorch/pytorch/issues/85604)).

**Fix**: give each co-located job a unique port **and** disjoint GPUs:

```bash
# --master-port is honored by the default static rendezvous (no --standalone)
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes=1 --nproc-per-node=2 --master-port=29500 train.py &
CUDA_VISIBLE_DEVICES=2,3 torchrun --nnodes=1 --nproc-per-node=2 --master-port=29600 train.py &
```

Or use `--rdzv-backend=c10d --rdzv-endpoint=localhost:0` to let torchrun pick a free port. Fanning cells across instances instead of one box → `references/parallel_ablation.md`.

### D5 — HF Accelerate: `accelerate launch` reads a config, not torchrun flags

**Symptom**: `accelerate launch train.py` runs single-GPU despite 4 cards, because no config exists or the defaults resolved to a single process.

**Root cause**: Accelerate wraps the same env contract (D1) but sources it from `~/.cache/huggingface/accelerate/default_config.yaml` (written by `accelerate config`) or CLI flags ([launch docs](https://huggingface.co/docs/accelerate/en/basic_tutorials/launch)).

**Fix**: generate a config once, then launch against it — and on a headless rental, write the YAML directly instead of the interactive `accelerate config`:

```bash
accelerate launch --multi_gpu --num_processes=4 --mixed_precision=bf16 train.py
# or a checked-in YAML (reproducible, diffable):
accelerate launch --config_file configs/acc_fsdp.yaml train.py
```

Switching DDP↔FSDP↔DeepSpeed is *only* a config swap — the training script is unchanged. The same `--num_machines`/`--machine_rank`/`--main_process_ip` map onto multi-node (D2 territory).

### D6 — DeepSpeed: `deepspeed` launcher vs `accelerate launch`, and the `hostfile`

**Symptom**: `deepspeed train.py` on multi-node can't find the other host, or `--num_gpus` is ignored.

**Root cause**: the `deepspeed` launcher discovers nodes from a `hostfile` (`worker-1 slots=8`), distinct from torchrun's rendezvous. Under HF it's usually cleaner to let `accelerate launch` (with a DeepSpeed plugin/config) drive it ([HF DeepSpeed](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)).

**Fix**: single-node `deepspeed --num_gpus=8 train.py --deepspeed ds_config.json`; multi-node `deepspeed --hostfile=hostfile --num_gpus=8 train.py ...`. With HF Trainer/Accelerate, pass the config via `--config_file` and let it spawn the workers — don't mix both launchers.

### D7 — Which launcher / parallelism — decision in one breath

- **Model fits on one GPU, just want more throughput** → **DDP** (`torchrun`), simplest, fastest. Each rank holds a full replica.
- **Model does NOT fit (params+optim+grads ≈ 18 B/param, see oom-memory.md M1)** → shard it: **FSDP** (PyTorch-native) or **DeepSpeed ZeRO** (richer offload). Sharding-to-fit ladder → `references/training/oom-memory.md` M9.
- **HF ecosystem / Trainer** → **Accelerate** as the launcher; flip a config field to choose DDP/FSDP/ZeRO.
- **Need CPU/NVMe offload of params *and* optimizer separately, or ZeRO-Infinity** → **DeepSpeed** (FSDP1 offload is all-or-nothing; [HF concept guide](https://github.com/huggingface/accelerate/blob/main/docs/source/concept_guides/fsdp_and_deepspeed.md)).

---

## DDP

### D8 — `find_unused_parameters` — the "Expected to have finished reduction" error vs the silent hang

**Symptom**: `RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. ... parameters that were not used in producing loss` ([HF discuss](https://discuss.huggingface.co/t/runtimeerror-expected-to-have-finished-reduction-in-the-prior-iteration-before-starting-a-new-one-this-error-indicates-that-your-module-has-parameters-that-were-not-used-in-producing-loss/64760)).

**Root cause**: DDP registers an allreduce hook on every parameter and waits for *all* of them each step. If a branch (a frozen head, a conditional layer) produces no gradient, its bucket never fires and the reduction never completes.

**Fix — in priority order**:

1. **Best**: make every output participate in the loss (often the real bug is a dropped/detached head).
2. If a branch is *legitimately* unused some steps, `DDP(model, find_unused_parameters=True)` — but it adds a full graph traversal each step and **can be drastically slower** ([PyTorch forum](https://discuss.pytorch.org/t/process-got-stuck-when-set-find-unused-parameters-true-in-ddp/106078)). Use only if (1) is impossible.
3. If the return value is a dict/list, DDP may not locate the output tensors — flatten or simplify the `forward` return.

> Setting `find_unused_parameters=True` to *paper over* a real bug masks it — confirm the params are intentionally unused, don't silence the diagnostic.

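
Before reaching for `find_unused_parameters=True`, it helps to know exactly which parameters receive no gradient. The following is a minimal single-process sketch (no DDP, CPU only) that runs one backward pass and lists them; `TwoHead` and `unused_parameter_names` are hypothetical names invented for this illustration.

```python
import torch
import torch.nn as nn


class TwoHead(nn.Module):
    """Toy model: `aux_head` is registered but never used in forward."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.main_head = nn.Linear(4, 1)
        self.aux_head = nn.Linear(4, 1)

    def forward(self, x):
        return self.main_head(self.backbone(x))


def unused_parameter_names(model, loss):
    """Backward once; return trainable params whose grad is still None."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]


model = TwoHead()
loss = model(torch.randn(2, 4)).sum()
print(unused_parameter_names(model, loss))
# ['aux_head.weight', 'aux_head.bias']
```

Running this on the bare module (before it is wrapped in DDP) names the branch to fix under option 1; an empty list means every parameter took part in the loss for that batch.
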
### D9 — Ranks have unequal batch counts → hang at the last step (uneven inputs)

**Symptom**: training completes most of an epoch then **freezes on the final batch**; one rank had fewer samples and exited the loop while the others wait in allreduce forever ([PyTorch forum](https://discuss.pytorch.org/t/understanding-distributedsampler-and-dataloader-drop-last/206271)).

**Root cause**: DDP assumes every rank runs the **same number of collectives**. `DistributedSampler` pads (`drop_last=False`) or drops (`drop_last=True`) to equalize, but a custom sampler, a per-rank filter, or an `IterableDataset` can leave counts uneven — the short rank stops calling allreduce.

**Fix**:

- Use `DistributedSampler` (it equalizes by default) and set the **same** `drop_last` on every rank.
- Truly uneven inputs (variable-length, can't pad): wrap the loop in the **Join** context manager — `from torch.distributed.algorithms.join import Join; with Join([model]): for batch in loader: ...` — which mirrors the missing ranks' collectives so finished ranks don't deadlock ([Join tutorial](https://docs.pytorch.org/tutorials/advanced/generic_join.html)).
- Always call `sampler.set_epoch(epoch)` each epoch, or every epoch sees the identical shuffle (a silent correctness bug — **verifying-dl-experiments** **REQUIRED**).

### D10 — BatchNorm stats diverge across ranks; buffers are broadcast, not averaged

**Symptom**: DDP converges worse than single-GPU at the same effective batch, or eval is unstable — each rank computed BN statistics on only its local shard.

**Root cause**: DDP all-reduces **gradients**, not **buffers**: BN running mean/var are at most broadcast from rank 0 (`broadcast_buffers=True`, the default), never averaged, and each rank normalizes with the statistics of its local batch. With small per-GPU batches each replica's BN stats are noisy and inconsistent.

**Fix**: convert BN to synchronized BN before wrapping: `model = nn.SyncBatchNorm.convert_sync_batchnorm(model)` then `DDP(model, ...)`. Adds a collective per BN layer (cost), but BN stats become global. (Whether the metric *needs* SyncBN is a **verifying-dl-experiments** call.)

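
A quick way to confirm the D10 conversion took effect before wrapping in DDP is to count layer types. This is a minimal sketch on a hypothetical toy `Sequential`; it only performs the conversion and inspects the result, and leaves the actual forward pass to the launched job.

```python
import torch.nn as nn

# Hypothetical toy model with two BatchNorm layers.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3), nn.BatchNorm2d(8),
)

# Convert BEFORE wrapping in DDP; the call returns the converted module.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

n_sync = sum(isinstance(m, nn.SyncBatchNorm) for m in model.modules())
n_plain = sum(isinstance(m, nn.BatchNorm2d) for m in model.modules())
print(n_sync, n_plain)  # 2 0
```

A non-zero `n_plain` after conversion would mean the return value was discarded or a BN layer was added afterwards, which is worth asserting on in the training script's setup.
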
### D11 — N GPUs silently N× the effective batch (and the LR is now wrong)

**Symptom**: moving from 1→8 GPUs makes training diverge or plateau; loss curve is shaped differently even with "the same config."

**Root cause**: DDP keeps per-GPU batch size, so **effective batch = per_gpu_batch × world_size**. The LR tuned for the 1-GPU batch is now mismatched (commonly under-scaled). This is the single most common silent multi-GPU regression.

**Fix**: scale LR with effective batch (linear-scaling rule as a baseline, with warmup) and record `world_size`, per-GPU batch, and effective batch in the run manifest. **This changes the science** — declare it; comparing a 1-GPU baseline to an 8-GPU run with unscaled LR is not a clean datapoint (**verifying-dl-experiments** **REQUIRED**).

---

## FSDP (Fully Sharded Data Parallel)

### D12 — FSDP wraps the whole model as one unit → no memory saving (wrapping policy)

**Symptom**: FSDP enabled but VRAM barely drops vs DDP, or it OOMs gathering one giant flat parameter.

**Root cause**: with no `auto_wrap_policy`, FSDP makes the **entire model one FSDP unit** — it must all-gather all parameters at once, defeating sharding ([FSDP tutorial](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html)).

**Fix**: wrap per transformer block so only one block's params are gathered at a time, then pass the policy as `FSDP(model, auto_wrap_policy=policy)`:

```python
import functools
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# one FSDP unit per decoder block
policy = functools.partial(transformer_auto_wrap_policy,
                           transformer_layer_cls={LlamaDecoderLayer})
```

Under Accelerate set `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` + `fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer` ([HF FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp)). FSDP2 (`fully_shard`) is the current API; the wrapping principle is identical.

### D13 — Sharding strategy: FULL_SHARD vs SHARD_GRAD_OP vs HYBRID

**Symptom**: FSDP is communication-bound (allgather/reducescatter dominate the step), or still OOMs.

**Root cause**: the strategy trades memory against comms. `FULL_SHARD` (default, == ZeRO-3) shards params+grads+optimizer — max memory saving, max comms. `SHARD_GRAD_OP` (== ZeRO-2) shards grads+optim only, keeps params resident — less comms, more memory.

**Fix**: pick by the binding constraint — OOM → `FULL_SHARD`; comms-bound but it fits → `SHARD_GRAD_OP`. On a **multi-node** job where intra-node NVLink is fast but inter-node is slow, `HYBRID_SHARD` shards within a node and replicates across nodes (cuts inter-node traffic; pairs with `references/multinode.md` NIC tuning).

### D14 — FSDP mixed precision: loss diverges or buffers stay fp32

**Symptom**: bf16 FSDP run diverges where bf16 DDP was fine; or BN/positional buffers silently run in the wrong dtype.

**Root cause**: FSDP mixed precision is **explicit per-tensor-class** via `MixedPrecision(param_dtype, reduce_dtype, buffer_dtype)` — not a single AMP flag. Setting `param_dtype=bf16` and leaving `reduce_dtype` unset makes gradient reduction inherit bf16 as well, which changes the reduction precision; FSDP keeps fp32 master weights and casts to bf16 for forward ([pytorch#146114](https://github.com/pytorch/pytorch/issues/146114)).

**Fix**: set all three deliberately — a safe default is `param_dtype=bf16, reduce_dtype=fp32` (keep reductions in fp32 for stability), and set `buffer_dtype` explicitly so buffers don't drift. Prefer **bf16 over fp16** for sharded training (no loss-scaler needed). The numerical-correctness check is **verifying-dl-experiments**; this entry only ensures the dtypes are *set*, not left implicit.

### D15 — Checkpoint OOMs or saves an unloadable shard (state_dict type)

**Symptom**: `FSDP.state_dict()` OOMs the host RAM on rank 0; or every rank wrote a `.pt` and reloading on a different world size fails.

**Root cause**: FSDP has three state-dict types.
`FULL_STATE_DICT` gathers + unflattens the whole model to **rank-0 CPU** (peaks host RAM, single-writer); `SHARDED_STATE_DICT` writes one shard per rank (scales, but a plain per-rank `torch.save` of it is tied to the layout); `LOCAL_STATE_DICT` is raw flat params ([HF FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp)).

**Fix**:

- Large models / want resumable-at-scale: **`SHARDED_STATE_DICT`** via Distributed Checkpoint (DCP) — each rank saves its shard, reload reshards to any world size.
- Need a single portable file (export/inference): `FULL_STATE_DICT` with `rank0_only=True, offload_to_cpu=True` so only rank 0 materializes it on CPU (avoids the all-ranks OOM). FSDP2 uses `broadcast_from_rank0=True` to load the full dict on rank 0 then shard out.
- Atomic-write + load-latest-on-startup is the resume spine regardless of type → `references/spot-resilience.md` and `references/multinode.md` MN5 (a torchrun restart restores the *group*, never the *state*).

---

## DeepSpeed

### D16 — ZeRO stage selection (1/2/3) and what each shards

**Symptom**: ZeRO enabled but still OOM, or comms overhead with no memory need.

**Root cause**: stages shard progressively more across data-parallel ranks ([DeepSpeed ZeRO](https://www.deepspeed.ai/tutorials/zero/)): **Stage 1** = optimizer states · **Stage 2** = + gradients · **Stage 3** = + parameters (== FSDP `FULL_SHARD`).

**Fix**: smallest stage that fits — Stage 2 is the common sweet spot for models that *almost* fit; Stage 3 for models that don't fit even with grads sharded; add **ZeRO-Offload** (CPU) or **ZeRO-Infinity** (NVMe) only when Stage 3 alone still OOMs (each offload trades large slowdowns for capacity → `references/training/oom-memory.md` M10).

### D17 — The `ds_config.json` knobs that actually matter

**Symptom**: config applied but behavior unchanged, or a cryptic key error at init.

**Root cause**: DeepSpeed reads from the JSON, and several Accelerate/Trainer fields are **ignored** once a `deepspeed_config_file` is supplied ([HF Accelerate DeepSpeed](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)).

**Fix** — the load-bearing keys:

```jsonc
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},  // or "nvme"
    "offload_param": {"device": "cpu"}
  },
  "bf16": {"enabled": true},                 // prefer over fp16 (no loss-scale tuning)
  "gradient_accumulation_steps": "auto",     // let HF fill from Trainer
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto"
}
```

When the JSON is present, `gradient_accumulation_steps`, `gradient_clipping`, `zero_stage`, `offload_*_device`, and `mixed_precision` from the Accelerate config are **overridden by the JSON** — set them there, not in two places.

### D18 — `"auto"` mismatch and `loss.backward()` vs `engine.backward()`

**Symptom**: optimizer steps far less often than expected (gradient accumulation double-counted), or a `RuntimeError` about unscaled gradients.

**Root cause**: two traps. (a) Setting `gradient_accumulation_steps` in *both* the Trainer/Accelerate config *and* the JSON to non-`"auto"` values multiplies them. (b) With DeepSpeed's own AMP, gradient scaling lives inside the engine — calling bare `loss.backward()` instead of `model_engine.backward(loss)` skips scaling ([DeepSpeed engine](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py)).

**Fix**: set accumulation in **one** place (use `"auto"` in the JSON and let HF fill it); in a manual loop call `model_engine.backward(loss); model_engine.step()` — never `loss.backward()` / `optimizer.step()` directly under DeepSpeed.

---

## The HANGS — debugging a frozen distributed job (highest-value section)

A distributed hang has **no traceback** — every rank sits in a collective waiting for a peer that will never call it.
The job is to identify *which rank* diverged and *which collective* mismatched. (Distinct from a **single-process** vanish — for OOM/reboot/SSH-HUP/kill, see `gotchas_universal.md` U3; for the *inter-node* causes — fabric-manager, wrong NIC, MTU, the 1800 s NCCL timeout that *masks* the real failure — see `references/multinode.md` MN1-MN4.)

### D19 — The desync-debug toolkit: turn a silent freeze into a named mismatch

**Symptom**: all ranks frozen, GPUs at 100% SM util but 0% memory-util (spin-wait), no output.

**Root cause**: a collective desync — ranks enqueued *different* collectives, or one rank never reached the collective the others are blocked in.

**Fix — set these and relaunch the hang**:

- `export TORCH_DISTRIBUTED_DEBUG=DETAIL` + `export TORCH_CPP_LOG_LEVEL=INFO` → on mismatch PyTorch prints `Detected mismatch between collectives on ranks`, naming the op + sequence number per rank ([PyTorch forum](https://discuss.pytorch.org/t/torch-distributed-collectives-call-logging/172726)). (DETAIL itself does collectives — use to *diagnose*, remove for production; it can perturb timing.)
- `export NCCL_DEBUG=INFO` (or `WARN`) → the node whose log **stops first** before others print their topology is the culprit.
- `export TORCH_NCCL_ASYNC_ERROR_HANDLING=1` (older PyTorch: `NCCL_ASYNC_ERROR_HANDLING=1`) → a dead rank tears the group down *promptly* instead of every rank waiting out the NCCL timeout (1800 s on older PyTorch; newer releases default to 600 s for NCCL; `references/multinode.md` MN3).
- **Flight Recorder** (`TORCH_NCCL_TRACE_BUFFER_SIZE=2000`) dumps the last N collectives per rank with stack traces — read it to see which rank's queue is one collective behind.

### D20 — One rank diverged (NaN/OOM) and the survivors hang waiting for it

**Symptom**: training ran for a while, then froze; one rank's last log shows a NaN, an OOM, or a data/CUDA error, the rest are stuck in allreduce.

**Root cause**: a rank that crashes or `return`s early **stops calling collectives**; the others block.
The crash is the cause, the hang is the symptom — and without async error handling (D19) it surfaces 30 min later as a timeout, far from the cause.

**Fix**: with `TORCH_NCCL_ASYNC_ERROR_HANDLING=1` the group aborts near the true failure. Then fix the *diverged rank*, not the hang — common roots: one shard hit a bad sample (rank-dependent data), a per-rank OOM from uneven sequence lengths (longest-batch lands on one rank → `oom-memory.md` M16), or NaN from LR/precision. Don't lower batch size to "fix" a hang that was actually one rank's data bug.

### D21 — A rank-conditional collective (the `if rank == 0:` deadlock)

**Symptom**: hangs reproducibly at the *same* spot — often validation, logging, or checkpoint save.

**Root cause**: a collective (or a `dist.barrier()`, or an op that *implies* one like `all_gather`, SyncBN, or a metric `all_reduce`) placed inside a rank-conditional branch. Rank 0 calls it; others skip it; everyone deadlocks. The classic is "save/log on rank 0 only" where the save path triggers a collective ([Lightning#19604](https://github.com/Lightning-AI/pytorch-lightning/issues/19604)).

**Fix**: collectives must run on **all ranks unconditionally**. Gate only the *side effect*, not the collective: compute the metric's `all_reduce` on every rank, then `if rank == 0: log(value)`. A `barrier()` must be reached by every rank or none. Audit every `if rank/local_rank == 0` block for a hidden collective.

### D22 — Dataloader length mismatch across ranks (and the `set_epoch` shuffle bug)

**Symptom**: hang at end of epoch (D9's mechanism), OR every epoch trains on the identical data order.

**Root cause**: two related dataloader faults. (a) Unequal `len(loader)` per rank → the short rank stops calling collectives. (b) Forgetting `sampler.set_epoch(epoch)` → `DistributedSampler` reshuffles identically every epoch.

**Fix**: identical `batch_size`/`drop_last`/sampler on all ranks; call `set_epoch` each epoch; for genuinely uneven data use **Join** (D9).
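
Both D22 faults can be observed in a single process, because `DistributedSampler` accepts `num_replicas` and `rank` explicitly and then needs no process group. The sketch below is illustrative only: the 10-sample dataset, `WORLD_SIZE = 4`, and the helper `per_rank_indices` are assumptions made for the demonstration.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # 10 samples: not divisible by 4
WORLD_SIZE = 4                             # hypothetical world size


def per_rank_indices(epoch, drop_last):
    """Index list each rank would see for the given epoch."""
    out = []
    for rank in range(WORLD_SIZE):
        sampler = DistributedSampler(dataset, num_replicas=WORLD_SIZE, rank=rank,
                                     shuffle=True, seed=0, drop_last=drop_last)
        sampler.set_epoch(epoch)
        out.append(list(sampler))
    return out


padded = per_rank_indices(epoch=0, drop_last=False)
dropped = per_rank_indices(epoch=0, drop_last=True)
print([len(r) for r in padded])   # [3, 3, 3, 3]  (two samples repeated as padding)
print([len(r) for r in dropped])  # [2, 2, 2, 2]  (tail samples dropped)

# Fault (b): without set_epoch, two passes over the sampler are identical.
sampler = DistributedSampler(dataset, num_replicas=WORLD_SIZE, rank=0,
                             shuffle=True, seed=0)
stale_first, stale_second = list(sampler), list(sampler)
print(stale_first == stale_second)  # True

# With set_epoch the shuffle is re-seeded per epoch.
reshuffled = per_rank_indices(0, False) != per_rank_indices(1, False)
```

Either `drop_last` setting gives every rank the same count, which is what keeps the number of collectives equal; the hang in fault (a) only appears when ranks disagree on these settings or bypass the sampler.
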
The shuffle-staleness is a correctness bug — **verifying-dl-experiments** **REQUIRED**.

### D23 — `print` / `tqdm` / eval / `torch.save` interleaving looks like a hang (but isn't always)

**Symptom**: garbled interleaved logs from 8 ranks; or an apparent freeze during eval where only rank 0 should be working.

**Root cause**: by default **every rank executes everything** — 8× the prints, 8× eval, 8 ranks racing to write the same checkpoint file (corrupting it). If the eval/save path contains a collective and is *also* rank-gated, it's the D21 deadlock; if not, it's just noisy + wasteful + a file race.

**Fix**: gate pure side effects (logging, progress bar, file writes) to `if rank == 0:` — but keep any collective *outside* the gate (D21). Write checkpoints from rank 0 only, to a temp path, atomic-rename (`references/spot-resilience.md`), and `dist.barrier()` (on **all** ranks) before others read the file. A genuine hang vs noisy-but-progressing is told apart by the Flight Recorder / step counter (D19), not by the log soup.

---

## Pointers — handled elsewhere, do not restate

- **Inter-node wire** (NCCL NIC pinning, `nvidia-fabricmanager`, the 1800 s timeout masking a dead rank, jumbo-frame MTU, torchrun/Horovod elastic restart restoring the *group* not the *state*) → `references/multinode.md` (**REQUIRED** for ≥2 instances).
- **Sharding *to fit a model that OOMs*** (the FSDP/ZeRO ladder in cost order, activation checkpointing, offload, LoRA/QLoRA, reading the OOM trace) → `references/training/oom-memory.md`.
- **Restart-and-resume mechanics** (atomic write, load-latest, cadence, preemption signals) → `references/spot-resilience.md`; the spine is `references/principles.md` #8.
- **Single-process vanish** (OOM vs reboot vs SSH-HUP vs manual kill) → `references/gotchas_universal.md` U3; **cgroup host-RAM OOM from `num_workers`** → U9; **zombie VRAM after a crashed DDP run** → U11.
- **Is the resulting number real** (LR-rescaled run, restarted-from-0 run, shuffle staleness, SyncBN necessity, precision change) → **verifying-dl-experiments** (**REQUIRED** at every "this fix changes the science" note above).