playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/references/training/distributed-launch.md

# Launching & debugging multi-GPU / multi-node training — torchrun · Accelerate · DeepSpeed · DDP · FSDP

Pick a launcher, get the rank/world-size env right, choose a parallelism (DDP vs FSDP vs ZeRO),
and — when 8 processes silently freeze — find *which* rank diverged. This layer owns *making the
distributed job RUN, not hang, and not silently mis-shard*; **verifying-dl-experiments** owns *is the
resulting number correct* (a run whose LR silently rescaled with world size, or that resumed from
step 0 after a restart, is its concern). Cross-link it (**REQUIRED**) wherever a launch fix changes
effective batch size, LR, or precision.

Single box, multiple GPUs is DDP/FSDP over NVLink/PCIe and lives here. The **inter-node** transport
(NCCL NIC, fabric-manager, timeout, MTU, elastic restart) is `references/multinode.md` (**REQUIRED**
for any job spanning ≥2 instances) — this file ends where the wire between boxes begins.

To jump: `grep -in '<keyword>' references/training/distributed-launch.md` (e.g. `rdzv`, `local_rank`,
`unused`, `hang`, `desync`, `fsdp`, `zero`, `state_dict`, `port`, `barrier`, `accelerate`).

## Table of contents

- **Launchers & env** — D1 torchrun-env-contract · D2 standalone-vs-rendezvous · D3 LOCAL_RANK-device-bug · D4 port-collision · D5 accelerate-launch · D6 deepspeed-launcher · D7 which-launcher
- **DDP** — D8 find_unused_parameters · D9 uneven-inputs-Join · D10 SyncBN-&-buffers · D11 effective-batch/LR
- **FSDP** — D12 wrapping-policy · D13 sharding-strategy · D14 mixed-precision · D15 state_dict-type
- **DeepSpeed** — D16 ZeRO-stages · D17 config.json-knobs · D18 auto-&-engine.backward
- **The HANGS** (highest-value) — D19 desync-debug-toolkit · D20 one-rank-diverged · D21 rank-conditional-collective · D22 dataloader-length-mismatch · D23 eval/print/save-on-one-rank
- **Pointers** — inter-node NCCL/NIC/timeout → multinode.md · OOM/sharding-to-fit → oom-memory.md · spot-restart → spot-resilience.md

---

## Launchers & env

### D1 — The rank/world-size env contract every launcher must satisfy

**Symptom**: a raw `python train.py` on a 4-GPU box uses **one** GPU; or `init_process_group` hangs
forever because `MASTER_ADDR`/`RANK` were never set.

**Root cause**: `torch.distributed` reads its topology from **environment variables**, not from the GPU
count. A bare `python` sets none of them, so the process group never forms.

**Fix**: launch through `torchrun`, which sets the full contract per process
([torchrun docs](https://docs.pytorch.org/docs/2.12/elastic/run.html)):

| Var | Meaning |
|---|---|
| `RANK` | global rank `0..WORLD_SIZE-1` (unique across the whole job) |
| `LOCAL_RANK` | rank **within this node** — bind it to the GPU (`cuda:LOCAL_RANK`), NOT `RANK` (D3) |
| `WORLD_SIZE` | total workers = `nnodes × nproc_per_node` |
| `LOCAL_WORLD_SIZE` | workers on this node |
| `GROUP_RANK` | the node's rank (`0..nnodes-1`) |
| `MASTER_ADDR` / `MASTER_PORT` | FQDN + port of rank-0 hosting the c10d TCP store |

The script reads them (`int(os.environ["LOCAL_RANK"])`), calls
`init_process_group(backend="nccl")` (NCCL for GPU; `gloo` for CPU), and `set_device(LOCAL_RANK)`
before allocating any CUDA tensor.

### D2 — Single-node uses `--standalone`; multi-node needs a shared rendezvous id+endpoint

**Symptom**: copying a single-node `torchrun` line to a second node either hangs at init or both nodes
form two separate 1-node groups.

**Root cause**: single-node and multi-node use **different rendezvous**. `--standalone` self-hosts a
rendezvous on localhost (no coordination); multi-node requires every node to point at the *same*
external rendezvous server with the *same* job id.

**Fix** ([torchrun docs](https://docs.pytorch.org/docs/2.12/elastic/run.html)):
```bash
# single node, 4 GPUs — self-contained, no addr/port to manage
torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py

# multi-node: IDENTICAL command on every node; only env-derived node-rank differs
torchrun --nnodes=2 --nproc-per-node=8 \
         --rdzv-id=$JOB_ID --rdzv-backend=c10d \
         --rdzv-endpoint=$HEAD_IP:29400 train.py
```
`c10d` is the recommended backend (no etcd dependency). `--nnodes=1:4` enables elastic scaling. The
inter-node wire health (NIC pinning, fabric-manager, timeout) is `references/multinode.md`.

### D3 — Every process lands on GPU 0 (the `RANK` vs `LOCAL_RANK` bug)

**Symptom**: on multi-node, all of node-1's processes pile onto `cuda:0` and OOM, while GPUs 1-7 sit
idle; single-node looked fine.

**Root cause**: the script did `torch.cuda.set_device(RANK)`. On a single node `RANK==LOCAL_RANK` so
the bug hides; on node 1 of a 2-node job `RANK` is 8-15 but the node only has GPUs 0-7, so
`set_device` wraps/collides and everything funnels to device 0.

**Fix**: **always index the local device by `LOCAL_RANK`**, never `RANK`:
`torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))`. `RANK` selects the *data shard*; `LOCAL_RANK`
selects the *physical GPU*.

### D4 — `RuntimeError: Address already in use` when launching a second job on one node

**Symptom**: a second `torchrun` (e.g. a parallel ablation cell) on the same box dies immediately with
`errno 98: Address already in use`.

**Root cause**: both jobs default to `MASTER_PORT=29500`; the c10d TCP store can't bind a port the
first job holds ([pytorch#85604](https://github.com/pytorch/pytorch/issues/85604)).

**Fix**: give each co-located job a unique port **and** disjoint GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nproc-per-node=2 --master-port=29500 train.py &
CUDA_VISIBLE_DEVICES=2,3 torchrun --standalone --nproc-per-node=2 --master-port=29600 train.py &
```
Or use `--rdzv-endpoint=localhost:0` to let torchrun pick a free port. Fanning cells across instances
instead of one box → `references/parallel_ablation.md`.

### D5 — HF Accelerate: `accelerate launch` reads a config, not torchrun flags

**Symptom**: `accelerate launch train.py` runs single-GPU despite 4 cards, because no config exists or
`compute_environment` defaulted to one process.

**Root cause**: Accelerate wraps the same env contract (D1) but sources it from
`~/.cache/huggingface/accelerate/default_config.yaml` (written by `accelerate config`) or CLI flags
([launch docs](https://huggingface.co/docs/accelerate/en/basic_tutorials/launch)).

**Fix**: generate a config once, then launch against it — and on a headless rental, write the YAML
directly instead of the interactive `accelerate config`:
```bash
accelerate launch --multi_gpu --num_processes=4 --mixed_precision=bf16 train.py
# or a checked-in YAML (reproducible, diffable):
accelerate launch --config_file configs/acc_fsdp.yaml train.py
```
Switching DDP↔FSDP↔DeepSpeed is *only* a config swap — the training script is unchanged. The same
`--num_machines`/`--machine_rank`/`--main_process_ip` map onto multi-node (D2 territory).

### D6 — DeepSpeed: `deepspeed` launcher vs `accelerate launch`, and the `hostfile`

**Symptom**: `deepspeed train.py` on multi-node can't find the other host, or `--num_gpus` is ignored.

**Root cause**: the `deepspeed` launcher discovers nodes from a `hostfile`
(`worker-1 slots=8`), distinct from torchrun's rendezvous. Under HF it's usually cleaner to let
`accelerate launch` (with a DeepSpeed plugin/config) drive it
([HF DeepSpeed](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)).

**Fix**: single-node `deepspeed --num_gpus=8 train.py --deepspeed ds_config.json`; multi-node
`deepspeed --hostfile=hostfile --num_gpus=8 train.py ...`. With HF Trainer/Accelerate, pass the config
via `--config_file` and let it spawn the workers — don't mix both launchers.

### D7 — Which launcher / parallelism — decision in one breath

- **Model fits on one GPU, just want more throughput** → **DDP** (`torchrun`), simplest, fastest. Each rank holds a full replica.
- **Model does NOT fit (params+optim+grads ≈ 18 B/param, see oom-memory.md M1)** → shard it: **FSDP** (PyTorch-native) or **DeepSpeed ZeRO** (richer offload). Sharding-to-fit ladder → `references/training/oom-memory.md` M9.
- **HF ecosystem / Trainer** → **Accelerate** as the launcher; flip a config field to choose DDP/FSDP/ZeRO.
- **Need CPU/NVMe offload of params *and* optimizer separately, or ZeRO-Infinity** → **DeepSpeed** (FSDP1 offload is all-or-nothing; [HF concept guide](https://github.com/huggingface/accelerate/blob/main/docs/source/concept_guides/fsdp_and_deepspeed.md)).

---

## DDP

### D8 — `find_unused_parameters` — the "Expected to have finished reduction" error vs the silent hang

**Symptom**: `RuntimeError: Expected to have finished reduction in the prior iteration before starting
a new one. ... parameters that were not used in producing loss`
([HF discuss](https://discuss.huggingface.co/t/runtimeerror-expected-to-have-finished-reduction-in-the-prior-iteration-before-starting-a-new-one-this-error-indicates-that-your-module-has-parameters-that-were-not-used-in-producing-loss/64760)).

**Root cause**: DDP registers an allreduce hook on every parameter and waits for *all* of them each
step. If a branch (a frozen head, a conditional layer) produces no gradient, its bucket never fires and
the reduction never completes.

**Fix — in priority order**:
1. **Best**: make every output participate in the loss (often the real bug is a dropped/detached head).
2. If a branch is *legitimately* unused some steps, `DDP(model, find_unused_parameters=True)` — but it adds a full graph traversal each step and **can be drastically slower** ([PyTorch forum](https://discuss.pytorch.org/t/process-got-stuck-when-set-find-unused-parameters-true-in-ddp/106078)). Use only if (1) is impossible.
3. If the return value is a dict/list, DDP may not locate the output tensors — flatten or simplify the `forward` return.
> Setting `find_unused_parameters=True` to *paper over* a real bug masks it — confirm the params are intentionally unused, don't silence the diagnostic.

### D9 — Ranks have unequal batch counts → hang at the last step (uneven inputs)

**Symptom**: training completes most of an epoch then **freezes on the final batch**; one rank had fewer
samples and exited the loop while the others wait in allreduce forever
([PyTorch forum](https://discuss.pytorch.org/t/understanding-distributedsampler-and-dataloader-drop-last/206271)).

**Root cause**: DDP assumes every rank runs the **same number of collectives**. `DistributedSampler`
pads (`drop_last=False`) or drops (`drop_last=True`) to equalize, but a custom sampler, a per-rank
filter, or a `IterableDataset` can leave counts uneven — the short rank stops calling allreduce.

**Fix**:
- Use `DistributedSampler` (it equalizes by default) and set the **same** `drop_last` on every rank.
- Truly uneven inputs (variable-length, can't pad): wrap the loop in the **Join** context manager —
  `from torch.distributed.algorithms.join import Join; with Join([model]): for batch in loader: ...`
  — which mirrors the missing ranks' collectives so finished ranks don't deadlock
  ([Join tutorial](https://docs.pytorch.org/tutorials/advanced/generic_join.html)).
- Always call `sampler.set_epoch(epoch)` each epoch, or every epoch sees the identical shuffle (a
  silent correctness bug — **verifying-dl-experiments** **REQUIRED**).

### D10 — BatchNorm stats diverge across ranks; buffers aren't synced

**Symptom**: DDP converges worse than single-GPU at the same effective batch, or eval is unstable —
each rank computed BN statistics on only its local shard.

**Root cause**: DDP all-reduces **gradients**, not **buffers** (BN running mean/var). With small
per-GPU batches each replica's BN stats are noisy and inconsistent.

**Fix**: convert BN to synchronized BN before wrapping:
`model = nn.SyncBatchNorm.convert_sync_batchnorm(model)` then `DDP(model, ...)`. Adds a collective per
BN layer (cost), but BN stats become global. (Whether the metric *needs* SyncBN is a
**verifying-dl-experiments** call.)

### D11 — N GPUs silently N× the effective batch (and the LR is now wrong)

**Symptom**: moving from 1→8 GPUs makes training diverge or plateau; loss curve is shaped differently
even with "the same config."

**Root cause**: DDP keeps per-GPU batch size, so **effective batch = per_gpu_batch × world_size**. The
LR tuned for the 1-GPU batch is now mismatched (commonly under-scaled). This is the single most common
silent multi-GPU regression.

**Fix**: scale LR with effective batch (linear-scaling rule as a baseline, with warmup) and record
`world_size`, per-GPU batch, and effective batch in the run manifest. **This changes the science** —
declare it; comparing a 1-GPU baseline to an 8-GPU run with unscaled LR is not a clean datapoint
(**verifying-dl-experiments** **REQUIRED**).

---

## FSDP (Fully Sharded Data Parallel)

### D12 — FSDP wraps the whole model as one unit → no memory saving (wrapping policy)

**Symptom**: FSDP enabled but VRAM barely drops vs DDP, or it OOMs gathering one giant flat parameter.

**Root cause**: with no `auto_wrap_policy`, FSDP makes the **entire model one FSDP unit** — it must
all-gather all parameters at once, defeating sharding
([FSDP tutorial](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html)).

**Fix**: wrap per transformer block so only one block's params are gathered at a time:
```python
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
import functools
policy = functools.partial(transformer_auto_wrap_policy,
                           transformer_layer_cls={LlamaDecoderLayer})
```
Under Accelerate set `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` +
`fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer`
([HF FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp)). FSDP2 (`fully_shard`) is the
current API; the wrapping principle is identical.

### D13 — Sharding strategy: FULL_SHARD vs SHARD_GRAD_OP vs HYBRID

**Symptom**: FSDP is communication-bound (allgather/reducescatter dominate the step), or still OOMs.

**Root cause**: the strategy trades memory against comms. `FULL_SHARD` (default, == ZeRO-3) shards
params+grads+optimizer — max memory saving, max comms. `SHARD_GRAD_OP` (== ZeRO-2) shards grads+optim
only, keeps params resident — less comms, more memory.

**Fix**: pick by the binding constraint — OOM → `FULL_SHARD`; comms-bound but it fits →
`SHARD_GRAD_OP`. On a **multi-node** job where intra-node NVLink is fast but inter-node is slow,
`HYBRID_SHARD` shards within a node and replicates across nodes (cuts inter-node traffic; pairs with
`references/multinode.md` NIC tuning).

### D14 — FSDP mixed precision: loss diverges or buffers stay fp32

**Symptom**: bf16 FSDP run diverges where bf16 DDP was fine; or BN/positional buffers silently run in
the wrong dtype.

**Root cause**: FSDP mixed precision is **explicit per-tensor-class** via `MixedPrecision(param_dtype,
reduce_dtype, buffer_dtype)` — not a single AMP flag. Setting `param_dtype=bf16` but leaving
`reduce_dtype=fp32` (or vice versa) changes gradient-reduction precision; FSDP keeps fp32 master
weights and casts to bf16 for forward
([pytorch#146114](https://github.com/pytorch/pytorch/issues/146114)).

**Fix**: set all three deliberately — a safe default is `param_dtype=bf16, reduce_dtype=fp32` (keep
reductions in fp32 for stability), and set `buffer_dtype` explicitly so buffers don't drift. Prefer
**bf16 over fp16** for sharded training (no loss-scaler needed). The numerical-correctness check is
**verifying-dl-experiments**; this entry only ensures the dtypes are *set*, not left implicit.

### D15 — Checkpoint OOMs or saves an unloadable shard (state_dict type)

**Symptom**: `FSDP.state_dict()` OOMs the host RAM on rank 0; or every rank wrote a `.pt` and reloading
on a different world size fails.

**Root cause**: FSDP has three state-dict types. `FULL_STATE_DICT` gathers + unflattens the whole model
to **rank-0 CPU** (peaks host RAM, single-writer); `SHARDED_STATE_DICT` writes one shard per rank
(scales, but tied to layout); `LOCAL_STATE_DICT` is raw flat params
([HF FSDP](https://huggingface.co/docs/accelerate/en/usage_guides/fsdp)).

**Fix**:
- Large models / want resumable-at-scale: **`SHARDED_STATE_DICT`** via Distributed Checkpoint (DCP) — each rank saves its shard, reload reshards to any world size.
- Need a single portable file (export/inference): `FULL_STATE_DICT` with `rank0_only=True, offload_to_cpu=True` so only rank 0 materializes it on CPU (avoids the all-ranks OOM). FSDP2 uses `broadcast_from_rank0=True` to load the full dict on rank 0 then shard out.
- Atomic-write + load-latest-on-startup is the resume spine regardless of type → `references/spot-resilience.md` and `references/multinode.md` MN5 (a torchrun restart restores the *group*, never the *state*).

---

## DeepSpeed

### D16 — ZeRO stage selection (1/2/3) and what each shards

**Symptom**: ZeRO enabled but still OOM, or comms overhead with no memory need.

**Root cause**: stages shard progressively more across data-parallel ranks
([DeepSpeed ZeRO](https://www.deepspeed.ai/tutorials/zero/)):
**Stage 1** = optimizer states · **Stage 2** = + gradients · **Stage 3** = + parameters (== FSDP
`FULL_SHARD`).

**Fix**: smallest stage that fits — Stage 2 is the common sweet spot for models that *almost* fit;
Stage 3 for models that don't fit even with grads sharded; add **ZeRO-Offload** (CPU) or
**ZeRO-Infinity** (NVMe) only when Stage 3 alone still OOMs (each offload trades large slowdowns for
capacity → `references/training/oom-memory.md` M10).

### D17 — The `ds_config.json` knobs that actually matter

**Symptom**: config applied but behavior unchanged, or a cryptic key error at init.

**Root cause**: DeepSpeed reads from the JSON, and several Accelerate/Trainer fields are **ignored** once
a `deepspeed_config_file` is supplied
([HF Accelerate DeepSpeed](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)).

**Fix** — the load-bearing keys:
```jsonc
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},      // or "nvme"
    "offload_param":     {"device": "cpu"}
  },
  "bf16": {"enabled": true},                       // prefer over fp16 (no loss-scale tuning)
  "gradient_accumulation_steps": "auto",           // let HF fill from Trainer
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto"
}
```
When the JSON is present, `gradient_accumulation_steps`, `gradient_clipping`, `zero_stage`,
`offload_*_device`, and `mixed_precision` from the Accelerate config are **overridden by the JSON** —
set them there, not in two places.

### D18 — `"auto"` mismatch and `loss.backward()` vs `engine.backward()`

**Symptom**: optimizer steps far less often than expected (gradient accumulation double-counted), or a
`RuntimeError` about unscaled gradients.

**Root cause**: two traps. (a) Setting `gradient_accumulation_steps` in *both* the Trainer/Accelerate
config *and* the JSON to non-`"auto"` values multiplies them. (b) With DeepSpeed's own AMP, gradient
scaling lives inside the engine — calling bare `loss.backward()` instead of `model_engine.backward(loss)`
skips scaling ([DeepSpeed engine](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py)).

**Fix**: set accumulation in **one** place (use `"auto"` in the JSON and let HF fill it); in a manual
loop call `model_engine.backward(loss); model_engine.step()` — never `loss.backward()` /
`optimizer.step()` directly under DeepSpeed.

---

## The HANGS — debugging a frozen distributed job (highest-value section)

A distributed hang has **no traceback** — every rank sits in a collective waiting for a peer that will
never call it. The job to do is identify *which rank* diverged and *which collective* mismatched.
(Distinct from a **single-process** vanish — for OOM/reboot/SSH-HUP/kill, see `gotchas_universal.md`
U3; for the *inter-node* causes — fabric-manager, wrong NIC, MTU, the 1800 s NCCL timeout that *masks*
the real failure — see `references/multinode.md` MN1-MN4.)

### D19 — The desync-debug toolkit: turn a silent freeze into a named mismatch

**Symptom**: all ranks frozen, GPUs at 100% SM util but 0% memory-util (spin-wait), no output.

**Root cause**: a collective desync — ranks enqueued *different* collectives, or one rank never reached
the collective the others are blocked in.

**Fix — set these and relaunch the hang**:
- `export TORCH_DISTRIBUTED_DEBUG=DETAIL` + `export TORCH_CPP_LOG_LEVEL=INFO` → on mismatch PyTorch prints `Detected mismatch between collectives on ranks`, naming the op + sequence number per rank ([PyTorch forum](https://discuss.pytorch.org/t/torch-distributed-collectives-call-logging/172726)). (DETAIL itself does collectives — use to *diagnose*, remove for production; it can perturb timing.)
- `export NCCL_DEBUG=INFO` (or `WARN`) → the node whose log **stops first** before others print their topology is the culprit.
- `export TORCH_NCCL_ASYNC_ERROR_HANDLING=1` (older PyTorch: `NCCL_ASYNC_ERROR_HANDLING=1`) → a dead rank tears the group down *promptly* instead of every rank waiting out the 1800 s NCCL timeout (`references/multinode.md` MN3).
- **Flight Recorder** (`TORCH_NCCL_TRACE_BUFFER_SIZE=2000`) dumps the last N collectives per rank with stack traces — read it to see which rank's queue is one collective behind.

### D20 — One rank diverged (NaN/OOM) and the survivors hang waiting for it

**Symptom**: training ran for a while, then froze; one rank's last log shows a NaN, an OOM, or a
data/CUDA error, the rest are stuck in allreduce.

**Root cause**: a rank that crashes or `return`s early **stops calling collectives**; the others block.
The crash is the cause, the hang is the symptom — and without async error handling (D19) it surfaces
30 min later as a timeout, far from the cause.

**Fix**: with `TORCH_NCCL_ASYNC_ERROR_HANDLING=1` the group aborts near the true failure. Then fix the
*diverged rank*, not the hang — common roots: one shard hit a bad sample (rank-dependent data), a
per-rank OOM from uneven sequence lengths (longest-batch lands on one rank → `oom-memory.md` M16), or
NaN from LR/precision. Don't lower batch size to "fix" a hang that was actually one rank's data bug.

### D21 — A rank-conditional collective (the `if rank == 0:` deadlock)

**Symptom**: hangs reproducibly at the *same* spot — often validation, logging, or checkpoint save.

**Root cause**: a collective (or a `dist.barrier()`, or an op that *implies* one like `all_gather`,
SyncBN, or a metric `all_reduce`) placed inside a rank-conditional branch. Rank 0 calls it; others
skip it; everyone deadlocks. The classic is "save/log on rank 0 only" where the save path triggers a
collective ([Lightning#19604](https://github.com/Lightning-AI/pytorch-lightning/issues/19604)).

**Fix**: collectives must run on **all ranks unconditionally**. Gate only the *side effect*, not the
collective: compute the metric's `all_reduce` on every rank, then `if rank == 0: log(value)`. A
`barrier()` must be reached by every rank or none. Audit every `if rank/local_rank == 0` block for a
hidden collective.

### D22 — Dataloader length mismatch across ranks (and the `set_epoch` shuffle bug)

**Symptom**: hang at end of epoch (D9's mechanism), OR every epoch trains on the identical data order.

**Root cause**: two related dataloader faults. (a) Unequal `len(loader)` per rank → the short rank
stops calling collectives. (b) Forgetting `sampler.set_epoch(epoch)` → `DistributedSampler` reshuffles
identically every epoch.

**Fix**: identical `batch_size`/`drop_last`/sampler on all ranks; call `set_epoch` each epoch; for
genuinely uneven data use **Join** (D9). The shuffle-staleness is a correctness bug —
**verifying-dl-experiments** **REQUIRED**.

### D23 — `print` / `tqdm` / eval / `torch.save` interleaving looks like a hang (but isn't always)

**Symptom**: garbled interleaved logs from 8 ranks; or an apparent freeze during eval where only rank 0
should be working.

**Root cause**: by default **every rank executes everything** — 8× the prints, 8× eval, 8 ranks racing
to write the same checkpoint file (corrupting it). If the eval/save path contains a collective and is
*also* rank-gated, it's the D21 deadlock; if not, it's just noisy + wasteful + a file race.

**Fix**: gate pure side effects (logging, progress bar, file writes) to `if rank == 0:` — but keep any
collective *outside* the gate (D21). Write checkpoints from rank 0 only, to a temp path, atomic-rename
(`references/spot-resilience.md`), and `dist.barrier()` (on **all** ranks) before others read the file.
A genuine hang vs noisy-but-progressing is told apart by the Flight Recorder / step counter (D19), not
by the log soup.

---

## Pointers — handled elsewhere, do not restate

- **Inter-node wire** (NCCL NIC pinning, `nvidia-fabricmanager`, the 1800 s timeout masking a dead rank, jumbo-frame MTU, torchrun/Horovod elastic restart restoring the *group* not the *state*) → `references/multinode.md` (**REQUIRED** for ≥2 instances).
- **Sharding *to fit a model that OOMs*** (the FSDP/ZeRO ladder in cost order, activation checkpointing, offload, LoRA/QLoRA, reading the OOM trace) → `references/training/oom-memory.md`.
- **Restart-and-resume mechanics** (atomic write, load-latest, cadence, preemption signals) → `references/spot-resilience.md`; the spine is `references/principles.md` #8.
- **Single-process vanish** (OOM vs reboot vs SSH-HUP vs manual kill) → `references/gotchas_universal.md` U3; **cgroup host-RAM OOM from `num_workers`** → U9; **zombie VRAM after a crashed DDP run** → U11.
- **Is the resulting number real** (LR-rescaled run, restarted-from-0 run, shuffle staleness, SyncBN necessity, precision change) → **verifying-dl-experiments** (**REQUIRED** at every "this fix changes the science" note above).