# Multi-node NCCL & elastic-training gotchas — ADVANCED **Single-box users skip this entire file.** None of it applies to a single instance — one node, however many GPUs, runs DDP/FSDP over NVLink/PCIe and never touches the inter-node NCCL transport, fabric-manager, or rendezvous logic below. This file is **only** for jobs spanning ≥2 rented instances (multi-node DDP, FSDP, pipeline/tensor parallel, or elastic training). It assumes the checkpoint-to-durable + idempotent-resume spine is already in place (`references/principles.md` #8; cadence + atomic-write in `references/spot-resilience.md`) — multi-node only changes *how the process group forms and breaks*, never the resume mechanism. These are all **[P] platform/topology-specific** gotchas. The universal ones (disk, OOM, CRLF, silent sync, spot grace) are **not** restated here — see `references/gotchas_universal.md`. ## Table of contents - **Fabric-manager** — one bad node hangs the WHOLE job at NCCL init (MN1) - **NIC selection** — NCCL picks docker0/loopback/slow NIC (MN2) - **Timeout masking** — default 1800 s hides a straggler/dead rank (MN3) - **MTU mismatch** — jumbo frames silently dropped, small messages fine (MN4) - **Elastic restart ≠ state restore** — torchrun `--max-restarts` (MN5) - **Elastic Horovod pause-below-min-np** — pauses then errors (MN6) - **First-move checklist** — bring up a healthy multi-node group To jump: `grep -in references/multinode.md` (e.g. `grep -in fabric references/multinode.md`). --- ## MN1 — Fabric-manager down on one node hangs the entire job at NCCL init **Symptom.** Launch a multi-node job; every rank prints up to NCCL init then freezes — no traceback, no progress, no OOM, just a silent hang at the first collective. Killing and relaunching reproduces the exact same stall. A single-node run of the identical code works fine. **Root cause.** On NVSwitch-based nodes (HGX/DGX A100/H100), `nvidia-fabricmanager` must be running and healthy on **every** node for NVLink/NVSwitch routing to come up. One node with a stopped, crashed, or version-mismatched fabric-manager cannot establish its NVLink fabric, so its ranks never join the collective — and because NCCL init is a **global barrier**, all the healthy nodes block waiting for the one that can never arrive. The failure is global; the cause is local to one box. **Fix.** - Check fabric-manager on **every** node before launching, not just the head: `systemctl status nvidia-fabricmanager` (or `nvidia-smi -q | grep -i fabric`). It must be `active (running)` and its version must match the driver on that node. - Turn the silent hang into a diagnosable one: launch with `NCCL_DEBUG=INFO` (optionally `NCCL_DEBUG_SUBSYS=INIT,NET`). The first node whose log **stops** before the others printed their topology is the culprit. - On a rental the fix is operational, not a reseat: restart the service (`systemctl restart nvidia-fabricmanager`) if permitted, otherwise **stop that instance and re-rent a different box** — a fabric-manager that won't start is usually a sick host (overlaps the Xid hardware-failure logic in `references/gotchas_universal.md`). URL: https://support.crusoecloud.com/hc/en-us/articles/46061806112155-NCCL-Hangs-and-Multi-Node-Training-Stalls-Caused-by-Failed-nvidia-fabricmanager --- ## MN2 — NCCL picks the wrong NIC (docker0 / loopback / a slow interface) **Symptom.** Multi-node init hangs forever, OR it connects but inter-node bandwidth is 10× too slow (allreduce dominates the step time; single-node throughput was fine). `NCCL_DEBUG=INFO` shows NCCL binding to `docker0`, `lo`, or a 1 GbE management NIC instead of the fast data-plane interface. **Root cause.** NCCL auto-discovers network interfaces and, with no guidance, can select an unroutable bridge (`docker0`), the loopback, or the slow management NIC — none of which carry traffic between the real nodes. The job either never connects (unroutable) or runs over the wrong, slow path. **Fix.** Pin the transport explicitly on **all** nodes (identical env on every rank): - `export NCCL_SOCKET_IFNAME=` — a prefix filter; e.g. `eth`, `ens`, `bond`. Exclude bad ones with a leading `^`: `NCCL_SOCKET_IFNAME=^docker0,lo`. - On RDMA/InfiniBand fabrics pin the HCA: `export NCCL_IB_HCA=mlx5` (the active adapter prefix). - **No RDMA (TCP-only rental):** `export NCCL_IB_DISABLE=1` so NCCL stops probing for nonexistent IB and falls back to the socket path cleanly instead of stalling on IB discovery. - Confirm the chosen interface with `NCCL_DEBUG=INFO` — the `NET/Socket` or `NET/IB` line names the bound device; verify it is the fast data-plane NIC, not the bridge. URL: https://github.com/NVIDIA/nccl/issues/1580 --- ## MN3 — Default 1800 s NCCL timeout masks a straggler or a dead rank **Symptom.** A run that was progressing freezes at a collective for exactly **30 minutes**, then dies with a `Watchdog ... collective ... timed out` / `NCCL timeout` error — or worse, hangs far longer because the watchdog is off. The real failure (a rank that OOM'd, crashed, or fell behind) happened 30 min earlier and is buried. **Root cause.** NCCL's default collective timeout is **1800 s**. When one rank dies or stalls, the others sit in the collective waiting out the full 30-minute window before anything surfaces — so the symptom appears half an hour after the cause, and a transient straggler can trip a hard abort it should have survived. **Fix.** - **Fail fast on a dead rank:** `export NCCL_ASYNC_ERROR_HANDLING=1` (newer PyTorch: `TORCH_NCCL_ASYNC_ERROR_HANDLING=1`) so a crashed/unreachable rank tears the group down promptly instead of waiting out the timeout — the surviving ranks get an actionable error near the true failure point. - **Tune the window deliberately, both directions.** Genuinely slow collectives (huge allreduce, slow checkpoint barrier) need a **longer** timeout to avoid false aborts: raise it via the process-group init (`torch.distributed.init_process_group(..., timeout=timedelta(minutes=60))`). To surface a hung straggler **sooner**, lower it. The default rarely fits a real job — set it on purpose. - Pair with MN1's `NCCL_DEBUG=INFO` so the timeout error names which rank went silent. URL: https://repost.aws/questions/QURXddiuikQLesRDGz39RhIw/nccl-socket-timeout-when-using-large-dataset-in-multi-node-pretraining --- ## MN4 — Jumbo-frame MTU mismatch silently drops large NCCL frames **Symptom.** Small collectives work (rendezvous succeeds, tiny tensors allreduce fine), but the job hangs or throws a transport error the moment a **large** payload is sent — large gradient buckets, the first big allreduce, or a model broadcast. The break correlates with message **size**, not with which ranks are involved. **Root cause.** The container's interface is configured for jumbo frames (MTU 9000) but the host veth / bridge it attaches to is still at 1500 (or vice-versa). Small packets fit under the smaller MTU and pass; oversized frames are silently dropped at the mismatched hop with no application-level error, so NCCL stalls waiting for data that never lands. Classic on containerized rentals where the container MTU and the host bridge MTU were set independently. **Fix.** - Match MTU end to end: set the **host veth/bridge** to the same MTU as the container interface (9000 ↔ 9000, or drop both to 1500). Inspect with `ip link show` on both the container and the host side. - Quick confirm the path actually carries jumbo frames between nodes: `ping -M do -s 8972 ` (8972 = 9000 − 28 bytes header). If it fails but `-s 1472` succeeds, the large frames are being dropped → fix the MTU, do not blame NCCL. - If the host bridge MTU cannot be changed on the rental, set the container interface **down** to 1500 to match — a uniform-but-smaller MTU works; a mismatch does not. URL: https://github.com/moby/moby/issues/4378 --- ## MN5 — torchrun / TorchElastic `--max-restarts` restarts the process group but does NOT restore training state **Symptom.** A worker dies (preemption, transient fault); torchrun's `--max-restarts=N` dutifully re-runs rendezvous and relaunches **all** workers — but training resumes from **step 0** (or the wrong epoch), silently throwing away the progress before the failure. The restart "worked" yet the run is set back hours. **Root cause.** TorchElastic guarantees only that the **worker group** is reconstituted: it re-runs the c10d rendezvous, re-derives `world_size`/`rank`, and relaunches every worker process. It does **not** persist or reload training state — that is entirely the script's responsibility. A `--max-restarts` with no in-script load-latest-checkpoint just re-runs `main()` from scratch on every restart. **Fix.** - The per-epoch (or per-N-step) snapshot is what restores state, exactly per `references/principles.md` #8: write full state (model + optimizer + LR scheduler + epoch/step + RNG + dataloader position) **atomically** (`tmp`→`fsync`→`os.rename`), and **load-latest unconditionally at the top of every launch** so a torchrun restart resumes instead of restarting. Cadence formula + atomic-write detail live in `references/spot-resilience.md`. - Use the c10d rendezvous backend hosted on a `host:port` (no etcd dependency); set `--max-restarts` to survive the expected number of preemptions, not 0. - **REQUIRED:** treat the restored run as a *resume of the identical config*, never a hand-patched relaunch — a silently-restarted or reshuffled run is the exact contamination `verifying-dl-experiments` guards against; confirm the resumed step/epoch against the loaded checkpoint before trusting any post-restart metric. URL: https://docs.pytorch.org/tutorials/beginner/ddp_series_fault_tolerance.html --- ## MN6 — Elastic Horovod pauses below `--min-np`, then errors at `HOROVOD_ELASTIC_TIMEOUT` **Symptom.** Under elastic Horovod (`horovodrun -np 8 --min-np 4 --max-np 12`), enough workers get preempted to drop the live count below `--min-np`; the job does **not** fail immediately — it appears to hang (paused, no progress) — and then, minutes later, dies with a timeout error. Operators waiting for an instant failure miss the pause window. **Root cause.** Elastic Horovod **pauses** (does not fail) when available workers fall below `--min-np`, waiting for capacity to return. It only errors once `HOROVOD_ELASTIC_TIMEOUT` (default **600 s**) elapses without recovering to `--min-np`. So a too-high `--min-np` turns a routine couple-of-preemptions event into a hard failure after a silent 10-minute wait. **Fix.** - Set `--min-np` **low enough** that the typical number of concurrent preemptions does not breach it — survivors keep training through membership changes instead of pausing. - Raise `HOROVOD_ELASTIC_TIMEOUT` if the spot tier's capacity routinely returns slower than 600 s, so a temporary capacity dip resumes rather than aborts. - **Pin LR-scaling and data-sharding to `--max-np`, not the live worker count** — otherwise the effective learning rate and shard assignment drift on every membership change, quietly corrupting the run (a `verifying-dl-experiments` concern: a metric from a run whose LR silently rescaled is not a clean datapoint). URL: https://horovod.readthedocs.io/en/stable/elastic_include.html --- ## First-move checklist — bring up a healthy multi-node group Run this order **before** trusting any multi-node throughput number; it isolates MN1–MN4 cheaply (no full job needed): - [ ] **Every** node: `systemctl status nvidia-fabricmanager` → `active (running)`, version matches driver (MN1). - [ ] Identical NCCL env exported on **all** ranks: `NCCL_SOCKET_IFNAME`, `NCCL_IB_HCA` (or `NCCL_IB_DISABLE=1`), and the chosen `init_process_group` timeout (MN2, MN3). - [ ] MTU path check between nodes: `ping -M do -s 8972 ` succeeds, or both ends pinned to 1500 (MN4). - [ ] First real launch with `NCCL_DEBUG=INFO` + `NCCL_ASYNC_ERROR_HANDLING=1`; confirm each rank's `NET/...` line names the fast data-plane NIC, not a bridge (MN2, MN3). - [ ] In-script load-latest-checkpoint verified to fire on restart **before** relying on `--max-restarts` / elastic membership recovery (MN5, MN6; spine in `references/principles.md` #8). - [ ] Distributed jobs checkpoint **more** often than single-GPU — one preemption wastes N× compute; cadence per `references/spot-resilience.md`. For fanning a *sweep* across nodes (independent cells, not one job over many nodes), that is `references/parallel_ablation.md` + **REQUIRED** `superpowers:dispatching-parallel-agents`, not this file.