
# Throughput & profiling — make training FAST, find the one bottleneck

How to tell *why* a rented GPU is underfed (GPU-bound vs data-bound vs comms-bound), then apply the
right speedup in cost order — from a free dataloader knob to `torch.compile` and fused attention. This
layer owns *making it RUN fast + locating the mechanical bottleneck*; **verifying-dl-experiments** owns
*is the resulting number correct*. Cross-link it (**REQUIRED**) wherever a speedup risks changing the
science (a kernel that alters numerics, a precision swap, dropping samples to "go faster").

> **Size the run to the box — then PIN it for any comparison.** Auto-sizing batch/`num_workers` to the
> measured GPU/VRAM/vCPU (Phase 0) to use the card well is fine for a STANDALONE job; but for an ablation
> or baseline-vs-variant comparison, **pin the same batch across all cells** — auto-maximizing per-box
> silently changes a variable and breaks comparability (**verifying-dl-experiments**, REQUIRED).

To jump: `grep -in '<keyword>' references/training/throughput-profiling.md` (e.g. `bound`, `workers`,
`compile`, `recompile`, `flash`, `sdpa`, `nsys`, `py-spy`, `channels_last`, `tf32`, `overlap`).

## Table of contents

- **Diagnose first** — T1 the 3-way split (GPU/data/comms-bound) · T2 util%-is-a-liar pointer · T3 the cheap CPU/GPU-busy triage
- **Dataloader (the #1 cause of a starved GPU)** — T4 num_workers · T5 persistent_workers · T6 pin_memory + non_blocking · T7 prefetch_factor · T8 IO-bound vs CPU-transform-bound
- **Free / near-free knobs** — T9 TF32 + matmul precision · T10 cudnn.benchmark · T11 channels_last · T12 set_to_none + disable debug APIs
- **Mixed precision for speed** — T13 bf16/fp16 throughput
- **Kernels** — T14 SDPA / FlashAttention · T15 torch.compile gains · T16 torch.compile recompilation traps
- **Memory↔speed trades** — T17 activation checkpointing speed cost · T18 batch size vs throughput
- **Profilers** — T19 torch.profiler (is-it-data-bound) · T20 nsys / Nsight Systems · T21 py-spy (live, no restart) · T22 memory-snapshot pointer
- **Multi-GPU / multi-node comms** — T23 DDP/FSDP compute-comm overlap
- **Pointers** — gotchas_universal.md U8/U21/U24/U25/U38 · oom-memory.md · distributed-launch.md · multinode.md · verifying-dl-experiments (skill)

---

## Diagnose first — do NOT tune blind

### T1 — The 3-way split: GPU-bound vs data-bound vs comms-bound (decide before touching a knob)

**Symptom**: training is "slow" and the instinct is to change the model or batch size at random.

**Root cause**: throughput is gated by exactly one of three resources at a time; the fix for each is
disjoint, so guessing wastes paid wall-clock (principle #1).

**Fix — classify with one cheap reading each** (heuristic: util consistently >90% ⇒ GPU-bound;
low/fluctuating ⇒ elsewhere; both CPU+GPU low ⇒ I/O —
https://apxml.com/courses/planning-optimizing-ai-infrastructure/chapter-5-strategies-for-performance-optimization/identifying-performance-bottlenecks):
- **GPU-bound** (the good case): util high *and* SM clock/power high (T2); adding workers doesn't help. Only
  levers left: kernels (T14–T15), precision (T13), a bigger card.
- **Data-bound**: util low-but-nonzero or sawtoothing, host CPU busy in `DataLoader`/transforms; a trace
  shows GPU-idle gaps lining up with CPU data work (T19). Go to T4–T8.
- **Comms-bound** (multi-GPU/-node only): per-GPU util high, scaling efficiency poor; time in
  `nccl:all_reduce`/`all_gather` not overlapped with compute. Go to T23.

The highest-signal instrument is a **profiler trace** (T19) — read it before changing anything.

### T2 — `nvidia-smi` GPU-Util % lies; correlate clock + power → gotchas_universal.md U21

A 100%-util tile can hide a starved GPU (a trickle of tiny kernels reads as 100%). The full diagnosis —
correlate `clocks.current.sm` + mem-bandwidth util + power via `nvidia-smi dmon -s pucvmet -d 1`, and the
thermal/power-throttle slowdown — lives in **gotchas_universal.md U21/U23**; read it before concluding a run
is GPU-bound. The *0%-util-but-running* (CPU-data-bound) inverse is **U38**, owned by verifying-dl-experiments.

### T3 — Cheap triage when no profiler is wired yet: is the host CPU busy?

**Symptom**: need a 30-second answer to "GPU or data?" before instrumenting.

**Fix**: watch GPU and CPU at once for ~10 s —
```bash
nvidia-smi dmon -s pu -d 1 -c 10          # per-second SM% + power; sawtooth/low = starved
top -b -n 1 | grep -i python | head        # a worker pegged at ~100% CPU = CPU-transform-bound
```
GPU SM% high and steady ⇒ GPU-bound (stop here, go to kernels/precision). GPU SM% sawtoothing while a
python worker is CPU-pegged ⇒ data-bound (T4–T8). Both idle ⇒ I/O-bound (stage to NVMe, U8). Then confirm
with a real trace (T19) before investing in a fix. **GPU SM% low while *many* python threads thrash a few
cores (not one worker pegged) ⇒ intra-op thread oversubscription** on a vCPU slice, not data-bound — cap
`OMP_NUM_THREADS` to your cgroup quota (gotchas_universal.md **U40**), don't add dataloader workers.
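
The triage rules above can be written down as a small decision function. This is a minimal sketch, not a calibrated tool: the function name `triage` and every threshold except the >90% figure from T1 are illustrative assumptions, and the thread-oversubscription case deliberately falls through to "unclear".

```python
from statistics import mean, pstdev

def triage(sm_util, max_worker_cpu, *, high=90.0, idle=10.0, jitter=15.0, pegged=90.0):
    """Classify a ~10 s sample window using the T3 rules.

    sm_util: per-second GPU SM% readings (e.g. copied from `nvidia-smi dmon`).
    max_worker_cpu: highest %CPU seen for any python worker (e.g. from `top`).
    Thresholds are assumptions for illustration; tune them per box.
    """
    avg = mean(sm_util)
    spread = pstdev(sm_util)              # large spread = sawtoothing
    if avg >= high and spread < jitter:
        return "gpu-bound"                # high and steady: kernels/precision next
    if max_worker_cpu >= pegged:
        return "data-bound"               # GPU starved while a worker is CPU-pegged
    if avg <= idle and max_worker_cpu <= idle:
        return "io-bound"                 # both idle: waiting on bytes
    return "unclear"                      # take a real trace (T19) before acting
```

Feed it the ten readings from the two commands above. Anything it labels "unclear" should go straight to a profiler trace.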

---

## Dataloader — the #1 reason a rented GPU sits idle

The partial-starve knob set (and its order) is **gotchas_universal.md U24**; this section is the per-knob
*why/when*. Each helps a *different* failure, so apply by symptom, not as a blanket cargo-cult.

### T4 — `num_workers`: 0 means the main process loads serially (the default starves the GPU)

**Symptom**: `DataLoader(num_workers=0)` (the default): every batch is fetched in the main process, and the
GPU waits out the whole fetch.

**Root cause**: with `num_workers=0` "the data will be loaded in the main process" — no overlap between
data prep and compute (https://docs.pytorch.org/docs/2.12/data.html).

**Fix**: set `num_workers > 0` to load asynchronously and overlap fetch with the GPU step. Start at
`cores − 1`, but **size against per-worker RAM, not CPU count**: each forked worker ends up holding its own
copy of any large in-dataset Python object (refcount updates defeat copy-on-write), and too many workers OOM
the cgroup with a bare `Killed` (the quadratic trap + sizing rule are **gotchas_universal.md U9**). Not
monotonic: past the point where the GPU is fed, extra workers only add RAM and startup cost.
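
As a rough illustration of "size against RAM, not CPU count", the worker count can be taken as the smaller of a CPU cap and a RAM cap. The helper name `suggest_num_workers` and its inputs are hypothetical; the per-worker footprint has to be measured on the actual dataset, and U9 remains the authoritative sizing rule.

```python
def suggest_num_workers(cpu_cores, ram_budget_gb, per_worker_gb):
    """Upper bound on DataLoader workers: the smaller of the CPU and RAM caps."""
    if per_worker_gb <= 0:
        raise ValueError("per_worker_gb must be positive")
    cpu_cap = max(cpu_cores - 1, 0)                 # the "cores - 1" starting point
    ram_cap = int(ram_budget_gb // per_worker_gb)   # workers that fit in the RAM budget
    return min(cpu_cap, ram_cap)
```

With 16 cores, a 24 GB budget and 4 GB per worker this returns 6, not 15: RAM is the binding limit.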

### T5 — `persistent_workers=True`: stop paying worker-startup every epoch

**Symptom**: a visible stall at the **start of every epoch** (especially short epochs / many epochs); GPU
idle while workers respawn.

**Root cause**: default `persistent_workers=False` shuts down all workers after the dataset is consumed
once and **re-forks them next epoch** — re-importing, re-opening files, rebuilding the dataset object each
time (https://docs.pytorch.org/docs/2.12/data.html).

**Fix**: `persistent_workers=True` keeps the worker Dataset instances alive between epochs, removing the
per-epoch respawn cost. Requires `num_workers > 0`. Biggest win when epochs are short or the dataset's
`__init__` is heavy (loads an index/manifest).

### T6 — `pin_memory=True` + `non_blocking=True`: overlap the host→device copy

**Symptom**: the H2D copy (`x.to('cuda')`) sits on the critical path between fetch and forward.

**Root cause**: a pageable-memory tensor must be staged through a pinned buffer by the driver before DMA;
a synchronous `.to(device)` blocks the step. "When using a GPU it's better to set `pin_memory=True`"
(https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html).

**Fix**: `DataLoader(pin_memory=True)` allocates batches page-locked, **then** transfer
`x = x.to(device, non_blocking=True)` so the copy is queued asynchronously: the host does not block on it and
it can overlap with GPU work already in flight. Both halves are needed: `pin_memory` alone still blocks;
`non_blocking` without pinned memory silently falls back to a blocking copy. Costs host RAM (pinned pages
aren't swappable), so back off if it pressures the cgroup (U9).
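
The T4–T6 knobs interact (`persistent_workers` is only valid with `num_workers > 0`), so it can help to build the keyword arguments in one place. A minimal sketch; the helper name `loader_kwargs` and its `on_gpu` flag are assumptions for illustration, not part of PyTorch.

```python
def loader_kwargs(num_workers, *, on_gpu=True):
    """Build keyword arguments for torch.utils.data.DataLoader (T4-T6)."""
    kwargs = {
        "num_workers": num_workers,
        "pin_memory": on_gpu,        # page-locked batches only pay off for GPU runs
    }
    if num_workers > 0:
        # DataLoader rejects persistent_workers=True when num_workers == 0.
        kwargs["persistent_workers"] = True
    return kwargs
```

Usage would look like `DataLoader(dataset, batch_size=64, **loader_kwargs(8))`, where `dataset` and the batch size are placeholders, paired with `x.to(device, non_blocking=True)` in the step.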

### T7 — `prefetch_factor`: deepen the queue when fetch time is bursty

**Symptom**: with workers on, the GPU still periodically stalls — every *Nth* step (N = `num_workers`) has
a long idle gap because all workers were busy producing the next batch when the GPU asked
(https://docs.pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html).

**Root cause**: `prefetch_factor` defaults to **2** when `num_workers>0` (None when 0) — "2 means there
will be a total of `2 * num_workers` batches prefetched across all workers"
(https://docs.pytorch.org/docs/2.12/data.html). A shallow queue can't absorb a variance spike in
per-sample fetch/decode time.

**Fix**: raise `prefetch_factor` (3–4) so workers run ahead and bursts hide — at the cost of more resident
batches in RAM (re-check U9). A *smoothing* knob, not a multiplier: if the **average** fetch rate is below
the GPU's consume rate, no depth helps — fix the rate (workers T4, GPU transform T8, NVMe U8) instead.
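
The "smoothing knob, not a multiplier" point comes down to comparing average supply with demand before touching queue depth. A back-of-envelope sketch, assuming each worker produces one batch per `mean_fetch_s` seconds; the function name and the verdict labels are hypothetical.

```python
def prefetch_verdict(num_workers, mean_fetch_s, step_s, prefetch_factor=2):
    """Compare average batch supply with GPU demand, then report queue depth."""
    supply = num_workers / mean_fetch_s      # batches/s the workers can produce
    demand = 1.0 / step_s                    # batches/s the GPU consumes
    queued = prefetch_factor * num_workers   # batches resident across all workers
    if supply < demand:
        return ("rate-bound", queued)        # a deeper queue will not help
    return ("bursts-only", queued)           # depth can hide variance spikes
```

For example, 4 workers at 0.8 s per batch against a 0.1 s step supply 5 batches/s for a demand of 10: the verdict is "rate-bound", and raising `prefetch_factor` would only add resident batches.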

### T8 — IO-bound vs CPU-transform-bound are different data-bound cases (different fix)

**Symptom**: data-bound (T1), but adding workers barely helps.

**Root cause — split the case**:
- **IO-bound**: bytes arrive slowly from network/HDD/object store; workers sit in `read`. Stage the working
  set to instance-local **NVMe** (HDD→NVMe gaps reach ~35×) = **gotchas_universal.md U8**; the many-tiny-files
  transaction death + **shard-into-tar / WebDataset** fix = **U25**.
- **CPU-transform-bound**: a heavy per-sample augment (resize/decode/FFT) saturates CPU; workers CPU-pegged
  (T3), capping at core count. Move the transform to the **GPU** (NVIDIA DALI, `torchvision.transforms.v2`
  on tensors, kornia) onto idle GPU cycles. The *0%-util* serialized-transform variant is **U38**, owned by
  verifying-dl-experiments **REQUIRED** (which also owns whether a GPU-side transform shifted the data
  distribution).

**Fix**: read the trace (T19) — time in `read`/`stat` ⇒ U8/U25; time in a transform fn ⇒ move to GPU.

---

## Free / near-free knobs (set these once at startup on any box)

### T9 — TF32 / `set_float32_matmul_precision("high")` — the "why is my A100 slow" footgun

The biggest free speedup on Ampere+ for any fp32 matmul path; TF32 matmul is **OFF by default since PyTorch
1.12** (cuDNN convolutions keep TF32 on by default). The decision and exact knobs
(`torch.set_float32_matmul_precision("high")`, the legacy `allow_tf32` flags, `--tf32 1` in HF Trainer,
convergence impact) are owned by **references/training/precision-stability.md P2** (cross-link there; do NOT
restate). If a fresh PyTorch 2.x rental's fp32-heavy run is 2–4× slower than expected with no bug, this is
the first suspect.

### T10 — `cudnn.benchmark=True`: autotune conv algorithms (fixed input shapes only)

**Symptom**: a conv-heavy net (CNN/UNet) is slower than it should be; input shapes are constant.

**Root cause**: by default cuDNN picks a generic conv algorithm; the autotuner benchmarks variants on the
first batch of each new shape and caches the fastest
(https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html).

**Fix**: `torch.backends.cudnn.benchmark = True` once at startup. **Only helps when input shapes are
stable** — with variable shapes (dynamic resolution, ragged batches) it re-benchmarks every new shape and
*loses* time. Trade-off: it is **nondeterministic** (picks by first-batch timing), so it fights the
determinism knobs — whether to enable it for a clean datapoint is owned by precision-stability P19 /
verifying-dl-experiments (U36, **REQUIRED**).

### T11 — `channels_last`: free Tensor-Core speedup for conv nets under AMP

**Symptom**: a CNN under mixed precision isn't hitting Tensor Cores; throughput below the card's potential.

**Root cause**: default NCHW contiguous layout forces layout transposes around Tensor-Core convolutions.

**Fix**: convert model and inputs to `memory_format=torch.channels_last` —
`model = model.to(memory_format=torch.channels_last)` and `x = x.to(memory_format=torch.channels_last)`.
Optimizes convolutional networks with Tensor Cores + AMP
(https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html). Marked experimental and CNN-specific
(no benefit for pure transformers). A layout change only: the math is the same, though cuDNN may select a
different kernel, so outputs can differ at the bit level.

### T12 — `set_to_none` + disable debug APIs (two free per-step taxes to remove)

- **`optimizer.zero_grad(set_to_none=True)`** (the **default** since PyTorch 2.0) over zero-filling —
  assigning `None` skips a memory-write kernel per param and lets the next backward write fresh
  (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html). Edge case: code reading `.grad`
  between steps must tolerate `None`.
- **Turn OFF debug APIs for the real run** — `torch.autograd.set_detect_anomaly(True)`,
  `torch.autograd.profiler.profile`, `gradcheck` add per-op bookkeeping (anomaly detection is ~10× slower,
  precision-stability P9). Grep `detect_anomaly` / leftover `with profile(` wrappers before a long launch
  (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html); easy to leave on after a NaN hunt.

---

## Mixed precision for speed

### T13 — bf16/fp16 is a throughput lever, not just a memory lever

**Symptom**: fp32 training under-uses Tensor Cores; the GPU has bf16/fp16 tensor cores.

**Root cause**: 16-bit matmuls run on Tensor Cores at much higher FLOP/s and halve activation
read/write bandwidth — a speedup *on top of* the memory saving (oom-memory.md M6).

**Fix**: `torch.autocast("cuda", dtype=torch.bfloat16)` on Ampere+ (the modern default; no GradScaler —
precision-stability P6) or `bf16=True` in HF `TrainingArguments`. The full precision decision (bf16 vs fp16
vs the V100/T4 fp16-only path, GradScaler mechanics, NaN/overflow) is owned by
**references/training/precision-stability.md P1–P10** (cross-link; do NOT restate). The *memory* angle and
the activation-bucket math is **oom-memory.md M6**. A NaN/divergence after the swap is a numerics question →
precision-stability / verifying-dl-experiments (**REQUIRED**).

---

## Kernels — the levers left once the GPU is fed

### T14 — SDPA / FlashAttention: stop materializing the O(seq²) attention matrix

**Symptom**: a transformer is attention-bound; long sequences are slow and memory-heavy; or `flash_attn`
"installed" but the run is no faster.

**Root cause**: the eager/`math` attention path materializes the full `seq×seq` score matrix. The fused
**FlashAttention** / **memory-efficient** backends never do, but PyTorch's `scaled_dot_product_attention`
**silently falls back to the slow `math` backend** when the fused kernel's input constraints aren't met
(wrong dtype, head dim, mask shape). The reasons surface only once the backend is restricted to a fused
kernel: "if a fused implementation is not available, a warning will be raised"
(https://docs.pytorch.org/docs/2.12/generated/torch.nn.functional.scaled_dot_product_attention.html).

**Fix**:
- Use `F.scaled_dot_product_attention(q,k,v)` (or `attn_implementation="sdpa"`, the HF Transformers default
  on torch ≥ 2.1.1), which auto-picks FlashAttention / memory-efficient / cuDNN / math. Feed it
  **fp16/bf16** inputs: the fused backends need 16-bit (the `math` fallback is what runs in fp32).
- **Force-verify** the fast backend instead of trusting silence:
  ```python
  import torch.nn.functional as F
  from torch.nn.attention import sdpa_kernel, SDPBackend

  # q, k, v: fp16/bf16 CUDA tensors shaped (batch, heads, seq, head_dim)
  with sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):   # errors loudly if it can't be used
      out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
  ```
- **Installing `flash_attn` from source is a trap**: without `ninja` (`pip install ninja`) the CUDA
  extension compiles single-threaded ~2 h; with ninja ~3–5 min on a 64-core box. With many cores but
  `<96 GB` RAM ninja over-parallelizes and OOMs the build — cap `MAX_JOBS=4 pip install flash-attn
  --no-build-isolation`. Prefer a **prebuilt wheel** matching the `cuXX/torchYY/cpZZ` triple
  (https://github.com/Dao-AILab/flash-attention/issues/1038, https://pypi.org/project/flash-attn/). A
  torch/CUDA mismatch is **gotchas_universal.md U28**. Whether the fused kernel changes outputs (causal-mask
  edge cases) is a numerics check → verifying-dl-experiments (**REQUIRED**).
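
To make the O(seq²) cost concrete, the size of one layer's materialized score matrix is `batch × heads × seq² × bytes_per_element`. The sketch below is plain arithmetic; the function name and the example dimensions are illustrative, not taken from any particular model.

```python
def score_matrix_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes needed to materialize one layer's full attention score matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

gib = score_matrix_bytes(batch=8, heads=32, seq_len=8192) / 2**30
print(f"{gib:.0f} GiB")   # prints "32 GiB" for 16-bit scores
```

Doubling `seq_len` quadruples the figure, which is why the fused backends matter most at long context.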

### T15 — `torch.compile`: fuse kernels + cut launch overhead (one line, real gains)

**Symptom**: many small pointwise/elementwise ops; Python/launch overhead dominates between big matmuls.

**Root cause**: eager launches each op separately; Inductor fuses adjacent ops into Triton kernels and
(in CUDA-graph modes) eliminates per-step launch overhead, reusing the execution plan across steps.

**Fix**: wrap the model — `model = torch.compile(model)`. Modes
(https://huggingface.co/docs/transformers/en/perf_torch_compile):
- `default` — balanced speed/memory.
- `mode="reduce-overhead"` — uses **CUDA graphs** to kill Python overhead (best for many tiny ops /
  small-batch / inference), at a little more memory.
- `mode="max-autotune"` — longest compile, fastest steady-state.
- HF `TrainingArguments(torch_compile=True, torch_compile_mode="reduce-overhead")`.

Reported ~2.2× mean-inference speedups; training gains real but model-dependent. **First step(s) are slow**
— compilation is lazy on first call (https://huggingface.co/docs/transformers/en/perf_torch_compile); exclude
warm-up from any throughput measurement. Set `fullgraph=True` while developing to surface graph breaks loudly
instead of silently losing speed. Whether the compiled *numbers* match eager → verifying-dl-experiments
(**REQUIRED**).
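
Excluding warm-up from a throughput number can be as simple as dropping the first few step timestamps. A minimal sketch, assuming the caller records one wall-clock stamp per finished step; the function name and the `warmup_steps` default are hypothetical.

```python
def steady_state_throughput(step_end_times, batch_size, warmup_steps=3):
    """Samples/s measured only after the warm-up steps.

    step_end_times: increasing wall-clock stamps in seconds, one per finished
    step (e.g. time.perf_counter()). On CUDA, call torch.cuda.synchronize()
    before reading the clock, because kernel launches are asynchronous.
    """
    kept = step_end_times[warmup_steps:]   # first kept stamp is the start marker
    if len(kept) < 2:
        raise ValueError("need at least two post-warm-up steps")
    elapsed = kept[-1] - kept[0]
    return (len(kept) - 1) * batch_size / elapsed
```

With stamps `[0, 60, 61, 62, 63, 64]` (a 60 s compile on the first step) and batch 32, skipping one warm-up stamp gives 32 samples/s, whereas averaging over everything would report 2.5.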

### T16 — `torch.compile` recompilation trap: variable shapes silently blow the cache → eager

**Symptom**: a compiled run is *slower* than eager, or stutters periodically; throughput never stabilizes.
Common with variable batch/seq-len, dynamic padding, or per-step changing shapes.

**Root cause**: compile creates **guards** on traced shapes; a new shape violates a guard and triggers a
**recompile**. Past the recompile cap (`torch._dynamo.config.recompile_limit`, default **8**; legacy
`cache_size_limit`) Dynamo **stops compiling that function and runs it eagerly** — paying all the compile
cost and getting none of the benefit
(https://docs.pytorch.org/docs/stable/compile/programming_model.recompilation.html,
https://github.com/pytorch/pytorch/issues/93457).

**Fix**:
- **See it**: `TORCH_LOGS=recompiles python train.py` logs which function recompiled and the failed guard;
  `TORCH_LOGS=graph_breaks` and `torch._dynamo.explain(...)` locate graph breaks
  (https://docs.pytorch.org/docs/stable/torch.compiler_troubleshooting.html).
- **Tame shapes**: pad/bucket to a few fixed shapes so guards stop firing; or mark the varying dim dynamic
  — `torch.compile(model, dynamic=True)` (or `mark_dynamic` / `TORCH_COMPILE_DYNAMIC_SOURCES`) compiles
  one shape-generic graph instead of one per size. `dynamic=False` forces a fresh recompile per distinct
  size (use only with truly few shapes)
  (https://docs.pytorch.org/docs/stable/compile/programming_model.html).
- **Last resort**: raise `torch._dynamo.config.recompile_limit` only if a handful of *stable* extra shapes
  legitimately exist — raising it to mask genuinely unbounded shapes just thrashes.
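
The "pad/bucket to a few fixed shapes" fix can be sketched as rounding each sequence length up to the next bucket boundary, so the compiler only ever sees a handful of sizes. The bucket values and the helper name below are hypothetical; pick boundaries from the real length distribution.

```python
from bisect import bisect_left

BUCKETS = (128, 256, 512, 1024)   # hypothetical bucket boundaries

def padded_length(seq_len, buckets=BUCKETS):
    """Round seq_len up to the nearest bucket so only len(buckets) shapes exist."""
    i = bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence of {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[i]
```

Every length from 1 to 1024 maps onto one of four shapes here, which stays under the default recompile cap of 8 quoted above. The price is extra padding compute, so more buckets means less waste but more compiles.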

---

## Memory ↔ speed trades

### T17 — Activation checkpointing buys memory by spending ~20–30% compute (know the cost)

**Symptom**: gradient/activation checkpointing is on "to be safe" and training is slow — but the model
actually fits without it.

**Fix**: checkpointing **recomputes** activations in backward instead of storing them — trading **~20–30%
extra compute** for a large memory cut (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html,
oom-memory.md M7). Enable it **only when activations actually OOM** (full rationale + `use_reentrant=False` /
`use_cache=False` gotchas = **oom-memory.md M7**); if it fits without, turning it off is a free ~25% speedup.
On the frontier, checkpoint only the *fewest/heaviest* blocks needed to fit, not the whole model.
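
The "~25% speedup" and "~20–30% extra compute" figures describe the same trade from two sides. A small worked calculation, assuming the step is fully compute-bound and the overhead is expressed as a fraction of the un-checkpointed step time (both assumptions; the function name is hypothetical):

```python
def checkpointing_off_gain(overhead):
    """Return (throughput_gain, wall_clock_saving) from disabling checkpointing.

    overhead: extra compute as a fraction of the un-checkpointed step time.
    """
    throughput_gain = overhead                      # steps/s rise by this fraction
    wall_clock_saving = overhead / (1 + overhead)   # run time falls by this fraction
    return throughput_gain, wall_clock_saving
```

At 25% overhead, turning checkpointing off yields 25% more steps per second, which shortens the run by 20%.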

### T18 — Bigger micro-batch ≈ better GPU utilization (up to the memory wall)

**Symptom**: tiny batches under-feed the GPU; util and throughput both low though VRAM is mostly free (small
batches under-fill Tensor Cores and amortize launch/sync overhead poorly).

**Fix**: raise micro-batch toward the VRAM limit; keep the **effective** batch fixed with grad-accum if the
result depends on it (`batch 4 × accum 16` beats `batch 1 × accum 64` — oom-memory.md M5). Accuracy/effective-
batch implications (LR scaling, accumulation loss-weighting) → verifying-dl-experiments (**REQUIRED**).
Sizing alongside a concurrent job + `expandable_segments` = **gotchas_universal.md U10** / oom-memory.md M8.
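
Keeping the effective batch fixed while raising the micro-batch is a one-line identity: `effective = micro_batch × accum_steps × world_size`. A minimal sketch of solving it for the accumulation count; the helper name is hypothetical.

```python
def accum_steps(effective_batch, micro_batch, world_size=1):
    """Gradient-accumulation steps that keep the effective batch unchanged."""
    per_step = micro_batch * world_size
    if effective_batch % per_step:
        raise ValueError(f"{effective_batch} is not a multiple of {per_step}")
    return effective_batch // per_step
```

For an effective batch of 64, micro-batch 4 needs 16 accumulation steps and micro-batch 1 needs 64, matching the comparison above. The helper raises rather than rounding, because a silently changed effective batch is exactly the comparability break this file warns about.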

---

## Profilers — measure the bottleneck, don't guess it

### T19 — `torch.profiler`: the definitive data-bound vs compute-bound verdict

**Symptom**: need to *prove* where time goes (which T1 case), not infer from util%.

**Fix — scheduled profile of a few steps**
(https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html):
```python
from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

# `loader` and `train_step` are your own DataLoader and step function.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),     # skip warm-up; record 3 steps
    on_trace_ready=tensorboard_trace_handler("./tb_trace"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()                                    # advance the profiler schedule
        if step >= 6:
            break
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
```
**Read it**: large **GPU-timeline gaps** with CPU busy in `DataLoader`/transforms during the gap ⇒
**data-bound** (T4–T8); the TensorBoard "Performance Recommendation" panel names the DataLoader directly
(https://docs.pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). Densely-packed GPU
timeline ⇒ GPU-bound; sort by `self_cuda_time_total` for the hottest kernel (T14/T15). Time in `nccl:*` not
overlapped ⇒ comms-bound (T23). On a remote box write the trace and view locally — for raw
`export_chrome_trace("trace.json")` open at `chrome://tracing`; `scp` it down (references/ssh_transport.md),
never run a viewer over ssh.

### T20 — `nsys` / Nsight Systems: system-wide timeline when the gap is below PyTorch's view

**Symptom**: torch.profiler shows GPU-idle gaps but not *why* (CPU launch latency, a hidden sync, a memcpy,
a kernel-launch storm); or want CUDA-API + NVTX + OS-runtime on one timeline.

**Root cause**: torch.profiler sees PyTorch ops; `nsys` traces the whole system — CUDA API, kernels,
memcpy, NVTX ranges, OS-runtime — so it exposes launch-bound stalls and CPU↔GPU sync that PyTorch can't.
"Periodic gaps in the CUDA HW row are moments when the GPU is idle — a red flag"
(https://docs.lxp.lu/howto/pytorch-profiling-with-nsight/).

**Fix — profile a bounded window on the box, view locally** (canonical PyTorch recipe,
https://gist.github.com/mcarilli/376821aa1a7182dfcf59928a7cde3223):
```bash
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu \
  --capture-range=cudaProfilerApi -x true -o report python train.py
```
In the script, bound the window so the `.nsys-rep` stays small:
```python
import torch

torch.cuda.cudart().cudaProfilerStart()   # after warm-up
# ... a handful of steps, optionally wrapped in torch.cuda.nvtx.range_push/pop ...
torch.cuda.cudart().cudaProfilerStop()
```
`scp` the `.nsys-rep` down, open in the Nsight Systems GUI. Nsight **Systems** finds *which* kernel is slow;
Nsight **Compute** (`ncu`) finds *why* (occupancy, bandwidth, warp stalls) — but `ncu` is heavy, reserve it
for one hot kernel (https://www.spheron.network/blog/gpu-profiling-ai-workloads-nsight-compute-pytorch-profiler-guide/).

### T21 — `py-spy`: profile a LIVE training process with no restart, no code change

**Symptom**: a long run is mysteriously slow or apparently hung; restarting it to add a profiler would cost
hours and might not reproduce.

**Root cause**: a Python-side bottleneck or deadlock (a slow transform, a lock, a blocking collective) that
needs inspection *in situ*.

**Fix — attach by PID, zero instrumentation** (https://github.com/benfred/py-spy):
```bash
py-spy dump --pid <PID>            # one-shot stack of every thread → where it's hung RIGHT NOW
py-spy top  --pid <PID>            # live "which functions burn time" (Unix top-style)
py-spy record -o prof.svg --pid <PID>   # flame graph over a window
```
"The profiled program needs no import, no decorator, and no restart." On a rented box mid-run, `py-spy dump`
instantly distinguishes a *hung* process (stuck in `recv`/lock/`all_reduce`) from a *slow* one (busy in a
transform) — pairs with the "is it actually hung?" check (gotchas_universal.md U17, verifying-dl-experiments
**REQUIRED**). May need `--native` for C-extension frames and `sudo`/`SYS_PTRACE` to attach.

### T22 — CUDA memory snapshot/visualizer → oom-memory.md M19

For *what allocated the memory* (not time), the `torch.cuda.memory._record_memory_history` snapshot +
https://pytorch.org/memory_viz timeline is owned by **references/training/oom-memory.md M19/M18**. It is a
memory tool, not a throughput tool — listed here only so the profiler menu is complete. Do NOT restate.

---

## Multi-GPU / multi-node communication

### T23 — Compute-comms overlap: DDP overlaps by default; tune the bucket, watch for breakers

**Symptom**: scaling efficiency is poor — per-GPU util high, but N GPUs deliver far less than N× throughput;
trace shows `all_reduce`/`all_gather` *not* overlapped with backward compute.

**Root cause**: DDP overlaps gradient all-reduce with backward by bucketing gradients and launching each
bucket's reduce on a separate CUDA stream as soon as it's ready
(https://github.com/pytorch/pytorch/issues/67570). Overlap *breaks* when something forces a sync or delays
the reduce: `find_unused_parameters=True` (off by default; when enabled it adds a per-iteration autograd
graph traversal), a `.item()`/print/`.cpu()` in the step, or too-small/too-large buckets.

**Fix (single box, DDP/FSDP — the launch/sharding mechanics live in
references/training/distributed-launch.md, REQUIRED)**:
- Tune `bucket_cap_mb` (DDP) to batch gradient chunks into fewer, larger all-reduces; set
  `gradient_as_bucket_view=True` to cut a copy. Buckets too small = launch overhead; too large = late
  overlap.
- FSDP: enable `backward_prefetch` (prefetch the next layer's all-gather during current backward) and
  `forward_prefetch` so comms hide under compute; `limit_all_gathers` if memory-pressured.
- Remove per-step host syncs (`loss.item()` every step, prints, eager `.cpu()`) that serialize the stream.

**Inter-node** transport (NCCL picking the wrong NIC, fabric-manager hang, 1800 s timeout masking a
straggler, MTU mismatch) is **references/multinode.md** (**REQUIRED** for ≥2 instances) — a comms "slowdown"
across boxes is usually one of those, not a bucket-size tune. Whether a world-size change silently rescaled
the effective batch/LR is a science question → verifying-dl-experiments (**REQUIRED**).
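
"Scaling efficiency" in the symptom above can be pinned to a number: measured N-GPU throughput divided by N times the single-GPU throughput. A minimal sketch; the function name is hypothetical, and both throughputs must be measured at the same per-GPU batch with warm-up excluded.

```python
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Fraction of ideal linear speedup achieved by n_gpus."""
    return throughput_n / (n_gpus * throughput_1)
```

For example, 8 GPUs delivering 5600 samples/s against 1000 samples/s on one GPU is 0.70, meaning 30% of the paid-for capacity is lost to communication or stragglers.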

---

## Pointers — throughput gotchas catalogued elsewhere (do NOT restate)

- **gotchas_universal.md** — **U8** stage hot data to local NVMe (IO-bound) · **U21** `nvidia-smi` util% is
  a liar (+ **U23** thermal/power throttle) · **U24** dataloader-starvation knob order · **U25** millions of
  small files → shard into tar/WebDataset · **U38** GPU 0%-util CPU-data-bound (owned by verifying-dl).
- **references/training/oom-memory.md** — M5 micro-batch/grad-accum · M6 bf16 activations · M7 activation
  checkpointing memory rationale · M8 `expandable_segments` · M19 memory snapshot/visualizer.
- **references/training/precision-stability.md** — P1–P10 the precision decision + AMP mechanics · P2 the
  TF32-off footgun · P19 determinism-vs-`cudnn.benchmark` speed trade.
- **references/training/distributed-launch.md** — torchrun/Accelerate/DeepSpeed launch, DDP/FSDP sharding,
  and the desync/hang toolkit (the launch substrate this file's T23 sits on).
- **references/multinode.md** — inter-node NCCL/NIC/fabric/timeout/MTU (the wire between boxes). Single-box
  users skip.
- **verifying-dl-experiments** (**REQUIRED**) — owns *is-the-number-real*: whether a kernel/precision/compile
  swap changed the result, whether dropping samples or a GPU-side transform shifted the distribution, the
  0%-util diagnosis (U38), determinism (U36). This file makes training *fast*; that skill decides if the
  *faster result is still true*.