Throughput & profiling — make training FAST, find the one bottleneck
How to tell why a rented GPU is underfed (GPU-bound vs data-bound vs comms-bound), then apply the
right speedup in cost order — from a free dataloader knob to torch.compile and fused attention. This
layer owns making it RUN fast and locating the mechanical bottleneck; verifying-dl-experiments owns
whether the resulting number is correct. Cross-link it (REQUIRED) wherever a speedup risks changing the
science (a kernel that alters numerics, a precision swap, dropping samples to "go faster").
Size the run to the box, then PIN it for any comparison. Auto-sizing batch/num_workers to the measured GPU/VRAM/vCPU (Phase 0) to use the card well is fine for a STANDALONE job; but for an ablation or baseline-vs-variant comparison, pin the same batch across all cells: auto-maximizing per box silently changes a variable and breaks comparability (verifying-dl-experiments, REQUIRED).
To jump: grep -in '<keyword>' references/training/throughput-profiling.md (e.g. bound, workers,
compile, recompile, flash, sdpa, nsys, py-spy, channels_last, tf32, overlap).
Table of contents
- Diagnose first — T1 the 3-way split (GPU/data/comms-bound) · T2 util%-is-a-liar pointer · T3 the cheap CPU/GPU-busy triage
- Dataloader (the #1 cause of a starved GPU) — T4 num_workers · T5 persistent_workers · T6 pin_memory + non_blocking · T7 prefetch_factor · T8 IO-bound vs CPU-transform-bound
- Free / near-free knobs — T9 TF32 + matmul precision · T10 cudnn.benchmark · T11 channels_last · T12 set_to_none + disable debug APIs
- Mixed precision for speed — T13 bf16/fp16 throughput
- Kernels — T14 SDPA / FlashAttention · T15 torch.compile gains · T16 torch.compile recompilation traps
- Memory↔speed trades — T17 activation checkpointing speed cost · T18 batch size vs throughput
- Profilers — T19 torch.profiler (is-it-data-bound) · T20 nsys / Nsight Systems · T21 py-spy (live, no restart) · T22 memory-snapshot pointer
- Multi-GPU / multi-node comms — T23 DDP/FSDP compute-comm overlap
- Pointers — gotchas_universal.md U8/U21/U24/U25/U38 · oom-memory.md · distributed-launch.md · multinode.md · verifying-dl-experiments (skill)
Diagnose first — do NOT tune blind
T1 — The 3-way split: GPU-bound vs data-bound vs comms-bound (decide before touching a knob)
Symptom: training is "slow" and the instinct is to change the model or batch size at random.
Root cause: throughput is gated by exactly one of three resources at a time; the fix for each is disjoint, so guessing wastes paid wall-clock (principle #1).
Fix — classify with one cheap reading each (heuristic: util consistently >90% ⇒ GPU-bound; low/fluctuating ⇒ elsewhere; both CPU+GPU low ⇒ I/O — https://apxml.com/courses/planning-optimizing-ai-infrastructure/chapter-5-strategies-for-performance-optimization/identifying-performance-bottlenecks):
- GPU-bound (the good case): util high and SM clock/power high (T2); adding workers doesn't help. Only levers left: kernels (T14–T15), precision (T13), a bigger card.
- Data-bound: util low-but-nonzero or sawtoothing, host CPU busy in DataLoader/transforms; a trace shows GPU-idle gaps lining up with CPU data work (T19). Go to T4–T8.
- Comms-bound (multi-GPU/-node only): per-GPU util high, scaling efficiency poor; time in nccl:all_reduce/all_gather not overlapped with compute. Go to T23.
The highest-signal instrument is a profiler trace (T19) — read it before changing anything.
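The split above can be written down as a small decision function. This is a sketch only: `classify_bottleneck`, its argument names, and the 0.3 / 0.2 cut-offs for host-CPU load and exposed comms time are assumptions made for the example; only the ">90% consistently" heuristic comes from the text. Treat the answer as a hint about which trace to read, not as a verdict.

```python
def classify_bottleneck(gpu_util, cpu_busy, exposed_comms_frac=0.0):
    """Rough first-pass triage (hypothetical helper, assumed thresholds).

    gpu_util: per-second GPU utilization samples, in percent.
    cpu_busy: mean host CPU load of the training process tree, 0..1.
    exposed_comms_frac: share of step time spent in NCCL collectives that
        is NOT overlapped with compute (leave at 0 for single-GPU runs).
    """
    if exposed_comms_frac > 0.2:      # assumed cut-off; util can still look high
        return "comms-bound"
    if min(gpu_util) > 90:            # consistently high => GPU-bound
        return "gpu-bound"
    if cpu_busy > 0.3:                # assumed cut-off: host is busy with data work
        return "data-bound"
    return "io-bound"                 # GPU and CPU both mostly idle
```

Comms is checked first because a comms-bound run can still show high per-GPU util.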
T2 — nvidia-smi GPU-Util % lies; correlate clock + power → gotchas_universal.md U21
A 100%-util tile can hide a starved GPU (a trickle of tiny kernels reads as 100%). The full diagnosis —
correlate clocks.current.sm + mem-bandwidth util + power via nvidia-smi dmon -s pucvmet -d 1, and the
thermal/power-throttle slowdown — lives in gotchas_universal.md U21/U23; read it before concluding a run
is GPU-bound. The 0%-util-but-running (CPU-data-bound) inverse is U38, owned by verifying-dl-experiments.
T3 — Cheap triage when no profiler is wired yet: is the host CPU busy?
Symptom: need a 30-second answer to "GPU or data?" before instrumenting.
Fix: watch GPU and CPU at once for ~10 s —
nvidia-smi dmon -s pu -d 1 -c 10 # per-second SM% + power; sawtooth/low = starved
top -b -n 1 | grep -i python | head # a worker pegged at ~100% CPU = CPU-transform-bound
GPU SM% high and steady ⇒ GPU-bound (stop here, go to kernels/precision). GPU SM% sawtoothing while a
python worker is CPU-pegged ⇒ data-bound (T4–T8). Both idle ⇒ I/O-bound (stage to NVMe, U8). Then confirm
with a real trace (T19) before investing in a fix. GPU SM% low while many python threads thrash a few
cores (not one worker pegged) ⇒ intra-op thread oversubscription on a vCPU slice, not data-bound — cap
OMP_NUM_THREADS to your cgroup quota (gotchas_universal.md U40), don't add dataloader workers.
Dataloader — the #1 reason a rented GPU sits idle
The partial-starve knob set (and its order) is gotchas_universal.md U24; this section is the per-knob why/when. Each helps a different failure, so apply by symptom, not as a blanket cargo-cult.
T4 — num_workers: 0 means the main process loads serially (the default starves the GPU)
Symptom: DataLoader(num_workers=0) (the default) — every batch is fetched on the main thread, GPU
waits the whole fetch.
Root cause: with num_workers=0 "the data will be loaded in the main process" — no overlap between
data prep and compute (https://docs.pytorch.org/docs/2.12/data.html).
Fix: set num_workers > 0 to load asynchronously and overlap fetch with the GPU step. Start at
cores − 1, but size against per-worker RAM, not CPU count — each worker forks a full copy of any
large in-dataset object; too many OOM the cgroup with a bare Killed (the quadratic trap + sizing rule are
gotchas_universal.md U9). Not monotonic: past the point where the GPU is fed, extra workers only add RAM
and startup cost.
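A minimal sizing sketch for the rule "start at cores − 1, but cap by per-worker RAM". The helper name `pick_num_workers`, the `reserve_gb` headroom, and the idea of measuring `per_worker_gb` from one running worker are assumptions for illustration. Note that `os.cpu_count()` reports host cores, which can exceed a cgroup's quota, so pass `cores` explicitly on a vCPU slice.

```python
import os

def pick_num_workers(free_ram_gb, per_worker_gb, cores=None, reserve_gb=4.0):
    """Cap DataLoader workers by RAM first, CPU second (hypothetical helper)."""
    cores = cores or os.cpu_count() or 1
    by_cpu = max(cores - 1, 0)                     # the "cores - 1" starting point
    usable = max(free_ram_gb - reserve_gb, 0.0)    # keep headroom for the main process
    by_ram = int(usable // per_worker_gb) if per_worker_gb > 0 else by_cpu
    return min(by_cpu, by_ram)
```

For example, 32 cores with 16 GB free and 3 GB per worker yields 4 workers, not 31: RAM is the binding limit.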
T5 — persistent_workers=True: stop paying worker-startup every epoch
Symptom: a visible stall at the start of every epoch (especially short epochs / many epochs); GPU idle while workers respawn.
Root cause: default persistent_workers=False shuts down all workers after the dataset is consumed
once and re-forks them next epoch — re-importing, re-opening files, rebuilding the dataset object each
time (https://docs.pytorch.org/docs/2.12/data.html).
Fix: persistent_workers=True keeps the worker Dataset instances alive between epochs, removing the
per-epoch respawn cost. Requires num_workers > 0. Biggest win when epochs are short or the dataset's
__init__ is heavy (loads an index/manifest).
T6 — pin_memory=True + non_blocking=True: overlap the host→device copy
Symptom: the H2D copy (x.to('cuda')) sits on the critical path between fetch and forward.
Root cause: a pageable-memory tensor must be staged through a pinned buffer by the driver before DMA;
a synchronous .to(device) blocks the step. "When using a GPU it's better to set pin_memory=True"
(https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html).
Fix: DataLoader(pin_memory=True) allocates batches page-locked, then transfer with
x = x.to(device, non_blocking=True) so the call returns immediately and the host keeps queuing work while the DMA runs. Both
halves needed: pin_memory alone still blocks; non_blocking without pinned memory silently falls back to
a blocking copy. Costs host RAM (pinned pages aren't swappable); back off if it pressures the cgroup (U9).
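The worker and pinning knobs depend on each other, so it can help to build them in one place. The sketch below is an assumption-laden illustration: `loader_kwargs` is a hypothetical helper, and it encodes only two rules: persistent_workers requires num_workers > 0, and pinning is only useful when a GPU will receive the batch.

```python
def loader_kwargs(num_workers, cuda_available):
    """Build DataLoader keyword arguments that stay valid for any worker count."""
    kwargs = {
        "num_workers": num_workers,
        # Pinned batches only pay off when a GPU will receive them.
        "pin_memory": bool(cuda_available),
    }
    if num_workers > 0:
        # DataLoader rejects persistent_workers=True when num_workers == 0.
        kwargs["persistent_workers"] = True
    return kwargs
```

Typical use would be `DataLoader(dataset, batch_size=..., **loader_kwargs(8, torch.cuda.is_available()))`, followed in the step by `x = x.to(device, non_blocking=True)`.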
T7 — prefetch_factor: deepen the queue when fetch time is bursty
Symptom: with workers on, the GPU still periodically stalls — every Nth step (N = num_workers) has
a long idle gap because all workers were busy producing the next batch when the GPU asked
(https://docs.pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html).
Root cause: prefetch_factor defaults to 2 when num_workers>0 (None when 0) — "2 means there
will be a total of 2 * num_workers batches prefetched across all workers"
(https://docs.pytorch.org/docs/2.12/data.html). A shallow queue can't absorb a variance spike in
per-sample fetch/decode time.
Fix: raise prefetch_factor (3–4) so workers run ahead and bursts hide — at the cost of more resident
batches in RAM (re-check U9). A smoothing knob, not a multiplier: if the average fetch rate is below
the GPU's consume rate, no depth helps — fix the rate (workers T4, GPU transform T8, NVMe U8) instead.
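The RAM cost of a deeper queue follows directly from the "prefetch_factor * num_workers batches" rule. A worked example, where the 8 workers and the 256 MB batch size are made-up figures for illustration:

```python
def prefetch_ram_gb(num_workers, prefetch_factor, batch_mb):
    """Host RAM held by queued batches: prefetch_factor * num_workers batches."""
    queued_batches = prefetch_factor * num_workers
    return queued_batches * batch_mb / 1024

default_depth = prefetch_ram_gb(num_workers=8, prefetch_factor=2, batch_mb=256)  # 4.0 GB
deeper_queue = prefetch_ram_gb(num_workers=8, prefetch_factor=4, batch_mb=256)   # 8.0 GB
```

Doubling the depth doubles the resident batches, which is why the U9 check is worth repeating after changing this knob.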
T8 — IO-bound vs CPU-transform-bound are different data-bound cases (different fix)
Symptom: data-bound (T1), but adding workers barely helps.
Root cause — split the case:
- IO-bound: bytes arrive slowly from network/HDD/object store; workers sit in read. Stage the working set to instance-local NVMe (HDD→NVMe gaps reach ~35×) = gotchas_universal.md U8; the many-tiny-files transaction death + shard-into-tar / WebDataset fix = U25.
- CPU-transform-bound: a heavy per-sample augment (resize/decode/FFT) saturates CPU; workers CPU-pegged (T3), capping at core count. Move the transform to the GPU (NVIDIA DALI, torchvision.transforms.v2 on tensors, kornia) onto idle GPU cycles. The 0%-util serialized-transform variant is U38, owned by verifying-dl-experiments REQUIRED (which also owns whether a GPU-side transform shifted the data distribution).
Fix: read the trace (T19) — time in read/stat ⇒ U8/U25; time in a transform fn ⇒ move to GPU.
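When no trace is at hand, the IO-vs-transform question can be approximated by timing the two halves of a sample fetch separately. This is a rough stand-in for the trace, not a replacement: `fetch_time_split`, `read_fn`, and `transform_fn` are hypothetical names, and it assumes the dataset's fetch can be split into a raw read and a transform.

```python
import time

def fetch_time_split(read_fn, transform_fn, keys):
    """Return (read_seconds, transform_seconds) summed over the sample keys."""
    read_s = transform_s = 0.0
    for key in keys:
        t0 = time.perf_counter()
        raw = read_fn(key)            # bytes from disk / network / object store
        t1 = time.perf_counter()
        transform_fn(raw)             # decode, resize, augment, ...
        t2 = time.perf_counter()
        read_s += t1 - t0
        transform_s += t2 - t1
    return read_s, transform_s
```

Read time dominating points at U8/U25; transform time dominating points at a GPU-side transform. Use keys that have not been read recently, since the OS page cache can make repeated reads look faster than a cold epoch.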
Free / near-free knobs (set these once at startup on any box)
T9 — TF32 / set_float32_matmul_precision("high") — the "why is my A100 slow" footgun
The biggest free speedup on Ampere+ for any fp32 matmul path; OFF by default since PyTorch 1.12. The
decision and exact knobs (torch.set_float32_matmul_precision("high"), the legacy allow_tf32 flags,
--tf32 1 in HF Trainer, convergence impact) are owned by references/training/precision-stability.md P2
(cross-link there; do NOT restate). If a fresh PyTorch 2.x rental's fp32-heavy run is 2–4× slow with no bug,
this is the first suspect.
T10 — cudnn.benchmark=True: autotune conv algorithms (fixed input shapes only)
Symptom: a conv-heavy net (CNN/UNet) is slower than it should be; input shapes are constant.
Root cause: by default cuDNN picks a generic conv algorithm; the autotuner benchmarks variants on the first batch of each new shape and caches the fastest (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html).
Fix: torch.backends.cudnn.benchmark = True once at startup. Only helps when input shapes are
stable — with variable shapes (dynamic resolution, ragged batches) it re-benchmarks every new shape and
loses time. Trade-off: it is nondeterministic (picks by first-batch timing), so it fights the
determinism knobs — whether to enable it for a clean datapoint is owned by precision-stability P19 /
verifying-dl-experiments (U36, REQUIRED).
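One way to keep the two conditions visible at startup is to gate the flag on them explicitly. A sketch under stated assumptions: `configure_cudnn_autotune` and its two boolean inputs are hypothetical; the caller still has to decide whether shapes are fixed and whether the run needs determinism.

```python
import torch

def configure_cudnn_autotune(fixed_input_shapes, need_determinism=False):
    """Enable the cuDNN autotuner only when it can pay off (hypothetical helper)."""
    enable = bool(fixed_input_shapes) and not need_determinism
    torch.backends.cudnn.benchmark = enable
    return enable
```

Call it once before the first forward pass, for example `configure_cudnn_autotune(fixed_input_shapes=True)` for a fixed-resolution CNN.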
T11 — channels_last: free Tensor-Core speedup for conv nets under AMP
Symptom: a CNN under mixed precision isn't hitting Tensor Cores; throughput below the card's potential.
Root cause: default NCHW contiguous layout forces layout transposes around Tensor-Core convolutions.
Fix: convert model and inputs to memory_format=torch.channels_last —
model = model.to(memory_format=torch.channels_last) and x = x.to(memory_format=torch.channels_last).
Optimizes convolutional networks with Tensor Cores + AMP
(https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html). Marked experimental and CNN-specific
(no benefit for pure transformers). No numerics change — purely a layout speedup.
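A minimal sketch of the conversion on a toy conv net (the layer sizes and input shape are arbitrary choices for the example). It runs on CPU too, where it only demonstrates the layout change, not the Tensor-Core speedup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
x = torch.randn(2, 3, 16, 16)              # logical shape stays N, C, H, W

# Only the strides (physical layout) change; shapes and values do not.
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

y = model(x)
print(tuple(x.shape), x.is_contiguous(memory_format=torch.channels_last))
```

Inputs have to be converted every step (or once in the collate/transfer path); converting only the model leaves the layout transposes in place.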
T12 — set_to_none + disable debug APIs (two free per-step taxes to remove)
- optimizer.zero_grad(set_to_none=True) (the default since PyTorch 2.0) over zero-filling: assigning None skips a memory-write kernel per param and lets the next backward write fresh (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html). Edge case: code reading .grad between steps must tolerate None.
- Turn OFF debug APIs for the real run: torch.autograd.set_detect_anomaly(True), torch.autograd.profiler.profile, gradcheck add per-op bookkeeping (anomaly detection is ~10× slower, precision-stability P9). Grep detect_anomaly / leftover with profile( wrappers before a long launch (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html); easy to leave on after a NaN hunt.
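The None edge case is easy to hit in logging code. A small sketch, where `grad_norm` is a hypothetical helper and the toy Linear model is only there to make the example runnable:

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(8, 4)).sum().backward()
opt.step()
opt.zero_grad(set_to_none=True)             # grads become None, not zero tensors

def grad_norm(params):
    """Total gradient L2 norm that tolerates params whose .grad is None."""
    grads = [p.grad for p in params if p.grad is not None]
    if not grads:
        return 0.0
    return torch.norm(torch.stack([g.norm() for g in grads])).item()

norm_after_zero = grad_norm(model.parameters())   # 0.0 instead of an AttributeError
```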
Mixed precision for speed
T13 — bf16/fp16 is a throughput lever, not just a memory lever
Symptom: fp32 training under-uses Tensor Cores; the GPU has bf16/fp16 tensor cores.
Root cause: 16-bit matmuls run on Tensor Cores at much higher FLOP/s and halve activation read/write bandwidth — a speedup on top of the memory saving (oom-memory.md M6).
Fix: torch.autocast("cuda", dtype=torch.bfloat16) on Ampere+ (the modern default; no GradScaler —
precision-stability P6) or bf16=True in HF TrainingArguments. The full precision decision (bf16 vs fp16
vs the V100/T4 fp16-only path, GradScaler mechanics, NaN/overflow) is owned by
references/training/precision-stability.md P1–P10 (cross-link; do NOT restate). The memory angle and
the activation-bucket math is oom-memory.md M6. A NaN/divergence after the swap is a numerics question →
precision-stability / verifying-dl-experiments (REQUIRED).
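A minimal bf16 autocast sketch, assuming an Ampere-or-newer GPU when CUDA is present (it falls back to CPU autocast otherwise, which shows the dtype behaviour but not the speedup). The toy Linear model and the squared-output loss are placeholders.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)           # parameters stay fp32
x = torch.randn(8, 16, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)                                   # matmul runs in bf16
    loss = out.float().pow(2).mean()                 # reduce in fp32

loss.backward()                                      # bf16: no GradScaler involved
```

The weights and their gradients remain fp32; only the ops inside the autocast region run in 16-bit.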
Kernels — the levers left once the GPU is fed
T14 — SDPA / FlashAttention: stop materializing the O(seq²) attention matrix
Symptom: a transformer is attention-bound; long sequences are slow and memory-heavy; or flash_attn
"installed" but the run is no faster.
Root cause: the eager/math attention path materializes the full seq×seq score matrix. The fused
FlashAttention / memory-efficient backends never do, but PyTorch's scaled_dot_product_attention
silently falls back to the slow math backend when the fused kernel's input constraints aren't met
(wrong dtype, head dim, mask shape) — "if a fused implementation is not available, a warning will be
raised" (https://docs.pytorch.org/docs/2.12/generated/torch.nn.functional.scaled_dot_product_attention.html).
Fix:
- Use F.scaled_dot_product_attention(q,k,v) (or attn_implementation="sdpa", the HF default on torch 2.1.1+), which auto-picks FlashAttention / memory-efficient / cuDNN / math. Feed it fp16/bf16 inputs: the fused backends need 16-bit (the math fallback is what runs in fp32).
- Force-verify the fast backend instead of trusting silence:
    import torch.nn.functional as F
    from torch.nn.attention import sdpa_kernel, SDPBackend
    with sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):  # errors loudly if it can't be used
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
- Installing flash_attn from source is a trap: without ninja (pip install ninja) the CUDA extension compiles single-threaded ~2 h; with ninja ~3–5 min on a 64-core box. With many cores but <96 GB RAM ninja over-parallelizes and OOMs the build; cap it with MAX_JOBS=4 pip install flash-attn --no-build-isolation. Prefer a prebuilt wheel matching the cuXX/torchYY/cpZZ triple (https://github.com/Dao-AILab/flash-attention/issues/1038, https://pypi.org/project/flash-attn/). A torch/CUDA mismatch is gotchas_universal.md U28. Whether the fused kernel changes outputs (causal-mask edge cases) is a numerics check → verifying-dl-experiments (REQUIRED).
T15 — torch.compile: fuse kernels + cut launch overhead (one line, real gains)
Symptom: many small pointwise/elementwise ops; Python/launch overhead dominates between big matmuls.
Root cause: eager launches each op separately; Inductor fuses adjacent ops into Triton kernels and (in CUDA-graph modes) eliminates per-step launch overhead, reusing the execution plan across steps.
Fix: wrap the model — model = torch.compile(model). Modes
(https://huggingface.co/docs/transformers/en/perf_torch_compile):
- default: balanced speed/memory.
- mode="reduce-overhead": uses CUDA graphs to kill Python overhead (best for many tiny ops / small-batch / inference), at a little more memory.
- mode="max-autotune": longest compile, fastest steady-state.
- HF: TrainingArguments(torch_compile=True, torch_compile_mode="reduce-overhead").
Reported ~2.2× mean-inference speedups; training gains real but model-dependent. First step(s) are slow
— compilation is lazy on first call (https://huggingface.co/docs/transformers/en/perf_torch_compile); exclude
warm-up from any throughput measurement. Set fullgraph=True while developing to surface graph breaks loudly
instead of silently losing speed. Whether the compiled numbers match eager → verifying-dl-experiments
(REQUIRED).
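Excluding warm-up is easy to get wrong by hand, so a small timer helps. A sketch, not a benchmark harness: `steady_state_throughput` and its parameters are hypothetical, and the default warm-up of 3 steps is an assumption (a run that recompiles later needs more).

```python
import time

def steady_state_throughput(step_fn, samples_per_step, warmup=3, iters=10, sync=None):
    """Samples/second with the first `warmup` steps excluded from the clock.

    `sync` is an optional zero-argument callable; on a GPU run pass
    torch.cuda.synchronize so queued kernels finish before timing stops.
    """
    for _ in range(warmup):            # lazy compilation happens in these steps
        step_fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if sync:
        sync()
    elapsed = time.perf_counter() - start
    return samples_per_step * iters / elapsed
```

Measure eager and compiled runs with the same helper and the same batch, so the comparison differs only in the compile step.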
T16 — torch.compile recompilation trap: variable shapes silently blow the cache → eager
Symptom: a compiled run is slower than eager, or stutters periodically; throughput never stabilizes. Common with variable batch/seq-len, dynamic padding, or per-step changing shapes.
Root cause: compile creates guards on traced shapes; a new shape violates a guard and triggers a
recompile. Past the recompile cap (torch._dynamo.config.recompile_limit, default 8; legacy
cache_size_limit) Dynamo stops compiling that function and runs it eagerly — paying all the compile
cost and getting none of the benefit
(https://docs.pytorch.org/docs/stable/compile/programming_model.recompilation.html,
https://github.com/pytorch/pytorch/issues/93457).
Fix:
- See it: TORCH_LOGS=recompiles python train.py logs which function recompiled and the failed guard; TORCH_LOGS=graph_breaks and torch._dynamo.explain(...) locate graph breaks (https://docs.pytorch.org/docs/stable/torch.compiler_troubleshooting.html).
- Tame shapes: pad/bucket to a few fixed shapes so guards stop firing; or mark the varying dim dynamic: torch.compile(model, dynamic=True) (or mark_dynamic / TORCH_COMPILE_DYNAMIC_SOURCES) compiles one shape-generic graph instead of one per size. dynamic=False forces a fresh recompile per distinct size (use only with truly few shapes) (https://docs.pytorch.org/docs/stable/compile/programming_model.html).
- Last resort: raise torch._dynamo.config.recompile_limit only if a handful of stable extra shapes legitimately exist; raising it to mask genuinely unbounded shapes just thrashes.
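The pad/bucket option can be sketched in a few lines. The bucket sizes, `bucket_length`, and `pad_to_bucket` are assumptions for illustration; pick buckets from the dataset's actual length histogram, and keep the count below the recompile limit.

```python
import bisect

BUCKETS = (64, 128, 256, 512)              # assumed bucket sizes, tune per dataset

def bucket_length(seq_len, buckets=BUCKETS):
    """Smallest bucket that fits seq_len, so the compiler sees few shapes."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"length {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[i]

def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Right-pad one sequence of token ids up to its bucket length."""
    target = bucket_length(len(token_ids), buckets)
    return list(token_ids) + [pad_id] * (target - len(token_ids))
```

Every length from 1 to 512 now maps to one of four shapes. The padded positions still need to be masked in attention and loss, which is a correctness matter for verifying-dl-experiments.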
Memory ↔ speed trades
T17 — Activation checkpointing buys memory by spending ~20–30% compute (know the cost)
Symptom: gradient/activation checkpointing is on "to be safe" and training is slow — but the model actually fits without it.
Fix: checkpointing recomputes activations in backward instead of storing them — trading ~20–30%
extra compute for a large memory cut (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html,
oom-memory.md M7). Enable it only when activations actually OOM (full rationale + use_reentrant=False /
use_cache=False gotchas = oom-memory.md M7); if it fits without, turning it off is a free ~25% speedup.
On the frontier, checkpoint only the fewest/heaviest blocks needed to fit, not the whole model.
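Checkpointing only selected blocks can look like the sketch below. The module, its sizes, and the choice of blocks 1 and 2 as "the heavy ones" are all assumptions for the example; in a real model the set would come from a memory profile.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SelectiveCheckpointMLP(nn.Module):
    """Stack of blocks where only the chosen indices are recomputed in backward."""

    def __init__(self, width=32, depth=4, checkpointed=(1, 2)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.GELU()) for _ in range(depth)]
        )
        self.checkpointed = set(checkpointed)    # assumed: the heaviest blocks

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if self.training and i in self.checkpointed:
                # Activations inside this block are dropped and recomputed later.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x
```

Gradients are the same with or without checkpointing; only memory and recompute time change, so the set can be shrunk until the model just fits.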
T18 — Bigger micro-batch ≈ better GPU utilization (up to the memory wall)
Symptom: tiny batches under-feed the GPU; util and throughput both low though VRAM is mostly free (small batches under-fill Tensor Cores and amortize launch/sync overhead poorly).
Fix: raise micro-batch toward the VRAM limit; keep the effective batch fixed with grad-accum if the
result depends on it (batch 4 × accum 16 beats batch 1 × accum 64 — oom-memory.md M5). Accuracy/effective-
batch implications (LR scaling, accumulation loss-weighting) → verifying-dl-experiments (REQUIRED).
Sizing alongside a concurrent job + expandable_segments = gotchas_universal.md U10 / oom-memory.md M8.
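The "keep the effective batch fixed" arithmetic as a small helper. `plan_accumulation` is a hypothetical name, and it assumes the largest micro-batch that fits in VRAM has already been measured.

```python
def plan_accumulation(effective_batch, max_micro_batch):
    """Largest micro-batch that fits and divides the effective batch exactly."""
    for micro in range(min(max_micro_batch, effective_batch), 0, -1):
        if effective_batch % micro == 0:
            return micro, effective_batch // micro
    raise ValueError("effective_batch must be a positive integer")

plan = plan_accumulation(effective_batch=64, max_micro_batch=4)    # (4, 16)
tight = plan_accumulation(effective_batch=64, max_micro_batch=1)   # (1, 64)
```

Both plans give an effective batch of 64; the first does it in 16 larger steps instead of 64 tiny ones.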
Profilers — measure the bottleneck, don't guess it
T19 — torch.profiler: the definitive data-bound vs compute-bound verdict
Symptom: need to prove where time goes (which T1 case), not infer from util%.
Fix — scheduled profile of a few steps (https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html):
from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),  # skip warm-up; record 3 steps
    on_trace_ready=tensorboard_trace_handler("./tb_trace"),
    record_shapes=True, with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()  # advance the wait/warmup/active schedule
        if step >= 6:
            break
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
Read it: large GPU-timeline gaps with CPU busy in DataLoader/transforms during the gap ⇒
data-bound (T4–T8); the TensorBoard "Performance Recommendation" panel names the DataLoader directly
(https://docs.pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). Densely-packed GPU
timeline ⇒ GPU-bound; sort by self_cuda_time_total for the hottest kernel (T14/T15). Time in nccl:* not
overlapped ⇒ comms-bound (T23). On a remote box write the trace and view locally — for raw
export_chrome_trace("trace.json") open at chrome://tracing; scp it down (references/ssh_transport.md),
never run a viewer over ssh.
T20 — nsys / Nsight Systems: system-wide timeline when the gap is below PyTorch's view
Symptom: torch.profiler shows GPU-idle gaps but not why (CPU launch latency, a hidden sync, a memcpy, a kernel-launch storm); or want CUDA-API + NVTX + OS-runtime on one timeline.
Root cause: torch.profiler sees PyTorch ops; nsys traces the whole system — CUDA API, kernels,
memcpy, NVTX ranges, OS-runtime — so it exposes launch-bound stalls and CPU↔GPU sync that PyTorch can't.
"Periodic gaps in the CUDA HW row are moments when the GPU is idle — a red flag"
(https://docs.lxp.lu/howto/pytorch-profiling-with-nsight/).
Fix — profile a bounded window on the box, view locally (canonical PyTorch recipe, https://gist.github.com/mcarilli/376821aa1a7182dfcf59928a7cde3223):
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu \
--capture-range=cudaProfilerApi -x true -o report python train.py
In the script, bound the window so the .nsys-rep stays small:
torch.cuda.profiler.cudart().cudaProfilerStart() # after warm-up
# ... a handful of steps, optionally wrapped in torch.cuda.nvtx.range_push/pop ...
torch.cuda.profiler.cudart().cudaProfilerStop()
scp the .nsys-rep down, open in the Nsight Systems GUI. Nsight Systems finds which kernel is slow;
Nsight Compute (ncu) finds why (occupancy, bandwidth, warp stalls) — but ncu is heavy, reserve it
for one hot kernel (https://www.spheron.network/blog/gpu-profiling-ai-workloads-nsight-compute-pytorch-profiler-guide/).
T21 — py-spy: profile a LIVE training process with no restart, no code change
Symptom: a long run is mysteriously slow or apparently hung; restarting it to add a profiler would cost hours and might not reproduce.
Root cause: a Python-side bottleneck or deadlock (a slow transform, a lock, a blocking collective) that needs inspection in situ.
Fix — attach by PID, zero instrumentation (https://github.com/benfred/py-spy):
py-spy dump --pid <PID> # one-shot stack of every thread → where it's hung RIGHT NOW
py-spy top --pid <PID> # live "which functions burn time" (Unix top-style)
py-spy record -o prof.svg --pid <PID> # flame graph over a window
"The profiled program needs no import, no decorator, and no restart." On a rented box mid-run, py-spy dump
instantly distinguishes a hung process (stuck in recv/lock/all_reduce) from a slow one (busy in a
transform) — pairs with the "is it actually hung?" check (gotchas_universal.md U17, verifying-dl-experiments
REQUIRED). May need --native for C-extension frames and sudo/SYS_PTRACE to attach.
T22 — CUDA memory snapshot/visualizer → oom-memory.md M19
For what allocated the memory (not time), the torch.cuda.memory._record_memory_history snapshot +
https://pytorch.org/memory_viz timeline is owned by references/training/oom-memory.md M19/M18. It is a
memory tool, not a throughput tool — listed here only so the profiler menu is complete. Do NOT restate.
Multi-GPU / multi-node communication
T23 — Compute-comms overlap: DDP overlaps by default; tune the bucket, watch for breakers
Symptom: scaling efficiency is poor — per-GPU util high, but N GPUs deliver far less than N× throughput;
trace shows all_reduce/all_gather not overlapped with backward compute.
Root cause: DDP overlaps gradient all-reduce with backward by bucketing gradients and launching each
bucket's reduce on a separate CUDA stream as soon as it's ready
(https://github.com/pytorch/pytorch/issues/67570). Overlap breaks when something forces a sync: an
unused-parameter recompute, an off-by-default find_unused_parameters=True, a .item()/print/.cpu() in
the step, or too-small/too-large buckets.
Fix (single box, DDP/FSDP — the launch/sharding mechanics live in references/training/distributed-launch.md, REQUIRED):
- Tune bucket_cap_mb (DDP) to batch gradient chunks into fewer, larger all-reduces; set gradient_as_bucket_view=True to cut a copy. Buckets too small = launch overhead; too large = late overlap.
- FSDP: enable backward_prefetch (prefetch the next layer's all-gather during current backward) and forward_prefetch so comms hide under compute; limit_all_gathers if memory-pressured.
- Remove per-step host syncs (loss.item() every step, prints, eager .cpu()) that serialize the stream.
Inter-node transport (NCCL picking the wrong NIC, fabric-manager hang, 1800 s timeout masking a straggler, MTU mismatch) is references/multinode.md (REQUIRED for ≥2 instances) — a comms "slowdown" across boxes is usually one of those, not a bucket-size tune. Whether a world-size change silently rescaled the effective batch/LR is a science question → verifying-dl-experiments (REQUIRED).
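To put a number on "far less than N× throughput", compare the measured multi-GPU rate against ideal linear scaling. The helper name and the throughput figures below are made up for illustration; both rates should be measured with the same per-GPU batch and with warm-up excluded.

```python
def scaling_efficiency(single_gpu_tput, n_gpus, multi_gpu_tput):
    """Achieved fraction of ideal linear scaling (1.0 means a perfect N x)."""
    return multi_gpu_tput / (single_gpu_tput * n_gpus)

# Illustrative numbers: 1000 samples/s on one GPU, 5600 samples/s on eight.
eff = scaling_efficiency(single_gpu_tput=1000.0, n_gpus=8, multi_gpu_tput=5600.0)
```

Here `eff` comes out at 0.7, meaning 30% of the added hardware is lost to something other than compute; re-measure after each bucket or prefetch change to see whether it moved.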
Pointers — throughput gotchas catalogued elsewhere (do NOT restate)
- gotchas_universal.md: U8 stage hot data to local NVMe (IO-bound) · U21 nvidia-smi util% is a liar (+ U23 thermal/power throttle) · U24 dataloader-starvation knob order · U25 millions of small files → shard into tar/WebDataset · U38 GPU 0%-util CPU-data-bound (owned by verifying-dl).
- references/training/oom-memory.md: M5 micro-batch/grad-accum · M6 bf16 activations · M7 activation checkpointing memory rationale · M8 expandable_segments · M19 memory snapshot/visualizer.
- references/training/precision-stability.md: P1–P10 the precision decision + AMP mechanics · P2 the TF32-off footgun · P19 determinism-vs-cudnn.benchmark speed trade.
- references/training/distributed-launch.md: torchrun/Accelerate/DeepSpeed launch, DDP/FSDP sharding, and the desync/hang toolkit (the launch substrate this file's T23 sits on).
- references/multinode.md: inter-node NCCL/NIC/fabric/timeout/MTU (the wire between boxes). Single-box users skip.
- verifying-dl-experiments (REQUIRED): owns is-the-number-real: whether a kernel/precision/compile swap changed the result, whether dropping samples or a GPU-side transform shifted the distribution, the 0%-util diagnosis (U38), determinism (U36). This file makes training fast; that skill decides if the faster result is still true.