playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/references/multinode.md

12 KiB
Raw Blame History

Multi-node NCCL & elastic-training gotchas — ADVANCED

Single-box users skip this entire file. None of it applies to a single instance — one node, however many GPUs, runs DDP/FSDP over NVLink/PCIe and never touches the inter-node NCCL transport, fabric-manager, or rendezvous logic below. This file is only for jobs spanning ≥2 rented instances (multi-node DDP, FSDP, pipeline/tensor parallel, or elastic training). It assumes the checkpoint-to-durable + idempotent-resume spine is already in place (references/principles.md #8; cadence + atomic-write in references/spot-resilience.md) — multi-node only changes how the process group forms and breaks, never the resume mechanism.

These are all [P] platform/topology-specific gotchas. The universal ones (disk, OOM, CRLF, silent sync, spot grace) are not restated here — see references/gotchas_universal.md.

Table of contents

  • Fabric-manager — one bad node hangs the WHOLE job at NCCL init (MN1)
  • NIC selection — NCCL picks docker0/loopback/slow NIC (MN2)
  • Timeout masking — default 1800 s hides a straggler/dead rank (MN3)
  • MTU mismatch — jumbo frames silently dropped, small messages fine (MN4)
  • Elastic restart ≠ state restore — torchrun --max-restarts (MN5)
  • Elastic Horovod pause-below-min-np — pauses then errors (MN6)
  • First-move checklist — bring up a healthy multi-node group

To jump: grep -in <keyword> references/multinode.md (e.g. grep -in fabric references/multinode.md).


MN1 — Fabric-manager down on one node hangs the entire job at NCCL init

Symptom. Launch a multi-node job; every rank prints up to NCCL init then freezes — no traceback, no progress, no OOM, just a silent hang at the first collective. Killing and relaunching reproduces the exact same stall. A single-node run of the identical code works fine.

Root cause. On NVSwitch-based nodes (HGX/DGX A100/H100), nvidia-fabricmanager must be running and healthy on every node for NVLink/NVSwitch routing to come up. One node with a stopped, crashed, or version-mismatched fabric-manager cannot establish its NVLink fabric, so its ranks never join the collective — and because NCCL init is a global barrier, all the healthy nodes block waiting for the one that can never arrive. The failure is global; the cause is local to one box.

Fix.

  • Check fabric-manager on every node before launching, not just the head: systemctl status nvidia-fabricmanager (or nvidia-smi -q | grep -i fabric). It must be active (running) and its version must match the driver on that node.
  • Turn the silent hang into a diagnosable one: launch with NCCL_DEBUG=INFO (optionally NCCL_DEBUG_SUBSYS=INIT,NET). The first node whose log stops before the others printed their topology is the culprit.
  • On a rental the fix is operational, not a reseat: restart the service (systemctl restart nvidia-fabricmanager) if permitted, otherwise stop that instance and re-rent a different box — a fabric-manager that won't start is usually a sick host (overlaps the Xid hardware-failure logic in references/gotchas_universal.md).

URL: https://support.crusoecloud.com/hc/en-us/articles/46061806112155-NCCL-Hangs-and-Multi-Node-Training-Stalls-Caused-by-Failed-nvidia-fabricmanager


MN2 — NCCL picks the wrong NIC (docker0 / loopback / a slow interface)

Symptom. Multi-node init hangs forever, OR it connects but inter-node bandwidth is 10× too slow (allreduce dominates the step time; single-node throughput was fine). NCCL_DEBUG=INFO shows NCCL binding to docker0, lo, or a 1 GbE management NIC instead of the fast data-plane interface.

Root cause. NCCL auto-discovers network interfaces and, with no guidance, can select an unroutable bridge (docker0), the loopback, or the slow management NIC — none of which carry traffic between the real nodes. The job either never connects (unroutable) or runs over the wrong, slow path.

Fix. Pin the transport explicitly on all nodes (identical env on every rank):

  • export NCCL_SOCKET_IFNAME=<real-iface> — a prefix filter; e.g. eth, ens, bond. Exclude bad ones with a leading ^: NCCL_SOCKET_IFNAME=^docker0,lo.
  • On RDMA/InfiniBand fabrics pin the HCA: export NCCL_IB_HCA=mlx5 (the active adapter prefix).
  • No RDMA (TCP-only rental): export NCCL_IB_DISABLE=1 so NCCL stops probing for nonexistent IB and falls back to the socket path cleanly instead of stalling on IB discovery.
  • Confirm the chosen interface with NCCL_DEBUG=INFO — the NET/Socket or NET/IB line names the bound device; verify it is the fast data-plane NIC, not the bridge.

URL: https://github.com/NVIDIA/nccl/issues/1580


MN3 — Default 1800 s NCCL timeout masks a straggler or a dead rank

Symptom. A run that was progressing freezes at a collective for exactly 30 minutes, then dies with a Watchdog ... collective ... timed out / NCCL timeout error — or worse, hangs far longer because the watchdog is off. The real failure (a rank that OOM'd, crashed, or fell behind) happened 30 min earlier and is buried.

Root cause. NCCL's default collective timeout is 1800 s. When one rank dies or stalls, the others sit in the collective waiting out the full 30-minute window before anything surfaces — so the symptom appears half an hour after the cause, and a transient straggler can trip a hard abort it should have survived.

Fix.

  • Fail fast on a dead rank: export NCCL_ASYNC_ERROR_HANDLING=1 (newer PyTorch: TORCH_NCCL_ASYNC_ERROR_HANDLING=1) so a crashed/unreachable rank tears the group down promptly instead of waiting out the timeout — the surviving ranks get an actionable error near the true failure point.
  • Tune the window deliberately, both directions. Genuinely slow collectives (huge allreduce, slow checkpoint barrier) need a longer timeout to avoid false aborts: raise it via the process-group init (torch.distributed.init_process_group(..., timeout=timedelta(minutes=60))). To surface a hung straggler sooner, lower it. The default rarely fits a real job — set it on purpose.
  • Pair with MN1's NCCL_DEBUG=INFO so the timeout error names which rank went silent.

URL: https://repost.aws/questions/QURXddiuikQLesRDGz39RhIw/nccl-socket-timeout-when-using-large-dataset-in-multi-node-pretraining


MN4 — Jumbo-frame MTU mismatch silently drops large NCCL frames

Symptom. Small collectives work (rendezvous succeeds, tiny tensors allreduce fine), but the job hangs or throws a transport error the moment a large payload is sent — large gradient buckets, the first big allreduce, or a model broadcast. The break correlates with message size, not with which ranks are involved.

Root cause. The container's interface is configured for jumbo frames (MTU 9000) but the host veth / bridge it attaches to is still at 1500 (or vice-versa). Small packets fit under the smaller MTU and pass; oversized frames are silently dropped at the mismatched hop with no application-level error, so NCCL stalls waiting for data that never lands. Classic on containerized rentals where the container MTU and the host bridge MTU were set independently.

Fix.

  • Match MTU end to end: set the host veth/bridge to the same MTU as the container interface (9000 ↔ 9000, or drop both to 1500). Inspect with ip link show on both the container and the host side.
  • Quick confirm the path actually carries jumbo frames between nodes: ping -M do -s 8972 <other-node-ip> (8972 = 9000 28 bytes header). If it fails but -s 1472 succeeds, the large frames are being dropped → fix the MTU, do not blame NCCL.
  • If the host bridge MTU cannot be changed on the rental, set the container interface down to 1500 to match — a uniform-but-smaller MTU works; a mismatch does not.

URL: https://github.com/moby/moby/issues/4378


MN5 — torchrun / TorchElastic --max-restarts restarts the process group but does NOT restore training state

Symptom. A worker dies (preemption, transient fault); torchrun's --max-restarts=N dutifully re-runs rendezvous and relaunches all workers — but training resumes from step 0 (or the wrong epoch), silently throwing away the progress before the failure. The restart "worked" yet the run is set back hours.

Root cause. TorchElastic guarantees only that the worker group is reconstituted: it re-runs the c10d rendezvous, re-derives world_size/rank, and relaunches every worker process. It does not persist or reload training state — that is entirely the script's responsibility. A --max-restarts with no in-script load-latest-checkpoint just re-runs main() from scratch on every restart.

Fix.

  • The per-epoch (or per-N-step) snapshot is what restores state, exactly per references/principles.md #8: write full state (model + optimizer + LR scheduler + epoch/step + RNG + dataloader position) atomically (tmpfsyncos.rename), and load-latest unconditionally at the top of every launch so a torchrun restart resumes instead of restarting. Cadence formula + atomic-write detail live in references/spot-resilience.md.
  • Use the c10d rendezvous backend hosted on a host:port (no etcd dependency); set --max-restarts to survive the expected number of preemptions, not 0.
  • REQUIRED: treat the restored run as a resume of the identical config, never a hand-patched relaunch — a silently-restarted or reshuffled run is the exact contamination verifying-dl-experiments guards against; confirm the resumed step/epoch against the loaded checkpoint before trusting any post-restart metric.

URL: https://docs.pytorch.org/tutorials/beginner/ddp_series_fault_tolerance.html


MN6 — Elastic Horovod pauses below --min-np, then errors at HOROVOD_ELASTIC_TIMEOUT

Symptom. Under elastic Horovod (horovodrun -np 8 --min-np 4 --max-np 12), enough workers get preempted to drop the live count below --min-np; the job does not fail immediately — it appears to hang (paused, no progress) — and then, minutes later, dies with a timeout error. Operators waiting for an instant failure miss the pause window.

Root cause. Elastic Horovod pauses (does not fail) when available workers fall below --min-np, waiting for capacity to return. It only errors once HOROVOD_ELASTIC_TIMEOUT (default 600 s) elapses without recovering to --min-np. So a too-high --min-np turns a routine couple-of-preemptions event into a hard failure after a silent 10-minute wait.

Fix.

  • Set --min-np low enough that the typical number of concurrent preemptions does not breach it — survivors keep training through membership changes instead of pausing.
  • Raise HOROVOD_ELASTIC_TIMEOUT if the spot tier's capacity routinely returns slower than 600 s, so a temporary capacity dip resumes rather than aborts.
  • Pin LR-scaling and data-sharding to --max-np, not the live worker count — otherwise the effective learning rate and shard assignment drift on every membership change, quietly corrupting the run (a verifying-dl-experiments concern: a metric from a run whose LR silently rescaled is not a clean datapoint).

URL: https://horovod.readthedocs.io/en/stable/elastic_include.html


First-move checklist — bring up a healthy multi-node group

Run this order before trusting any multi-node throughput number; it isolates MN1MN4 cheaply (no full job needed):

  • Every node: systemctl status nvidia-fabricmanageractive (running), version matches driver (MN1).
  • Identical NCCL env exported on all ranks: NCCL_SOCKET_IFNAME, NCCL_IB_HCA (or NCCL_IB_DISABLE=1), and the chosen init_process_group timeout (MN2, MN3).
  • MTU path check between nodes: ping -M do -s 8972 <other-node-ip> succeeds, or both ends pinned to 1500 (MN4).
  • First real launch with NCCL_DEBUG=INFO + NCCL_ASYNC_ERROR_HANDLING=1; confirm each rank's NET/... line names the fast data-plane NIC, not a bridge (MN2, MN3).
  • In-script load-latest-checkpoint verified to fire on restart before relying on --max-restarts / elastic membership recovery (MN5, MN6; spine in references/principles.md #8).
  • Distributed jobs checkpoint more often than single-GPU — one preemption wastes N× compute; cadence per references/spot-resilience.md.

For fanning a sweep across nodes (independent cells, not one job over many nodes), that is references/parallel_ablation.md + REQUIRED superpowers:dispatching-parallel-agents, not this file.