# Throughput & profiling — make training FAST, find the one bottleneck

How to tell *why* a rented GPU is underfed (GPU-bound vs data-bound vs comms-bound), then apply the
right speedup in cost order, from a free dataloader knob to `torch.compile` and fused attention. This
layer owns *making it RUN fast + locating the mechanical bottleneck*; **verifying-dl-experiments** owns
*is the resulting number correct*. Cross-link it (**REQUIRED**) wherever a speedup risks changing the
science (a kernel that alters numerics, a precision swap, dropping samples to "go faster").

> **Size the run to the box, then PIN it for any comparison.** Auto-sizing batch/`num_workers` to the
> measured GPU/VRAM/vCPU (Phase 0) to use the card well is fine for a STANDALONE job; but for an ablation
> or baseline-vs-variant comparison, **pin the same batch across all cells**: auto-maximizing per-box
> silently changes a variable and breaks comparability (**verifying-dl-experiments**, REQUIRED).

To jump: `grep -in '<keyword>' references/training/throughput-profiling.md` (e.g. `bound`, `workers`,
`compile`, `recompile`, `flash`, `sdpa`, `nsys`, `py-spy`, `channels_last`, `tf32`, `overlap`).

## Table of contents

- **Diagnose first**: T1 the 3-way split (GPU/data/comms-bound) · T2 util%-is-a-liar pointer · T3 the cheap CPU/GPU-busy triage
- **Dataloader (the #1 cause of a starved GPU)**: T4 num_workers · T5 persistent_workers · T6 pin_memory + non_blocking · T7 prefetch_factor · T8 IO-bound vs CPU-transform-bound
- **Free / near-free knobs**: T9 TF32 + matmul precision · T10 cudnn.benchmark · T11 channels_last · T12 set_to_none + disable debug APIs
- **Mixed precision for speed**: T13 bf16/fp16 throughput
- **Kernels**: T14 SDPA / FlashAttention · T15 torch.compile gains · T16 torch.compile recompilation traps
- **Memory↔speed trades**: T17 activation checkpointing speed cost · T18 batch size vs throughput
- **Profilers**: T19 torch.profiler (is-it-data-bound) · T20 nsys / Nsight Systems · T21 py-spy (live, no restart) · T22 memory-snapshot pointer
- **Multi-GPU / multi-node comms**: T23 DDP/FSDP compute-comm overlap
- **Pointers**: gotchas_universal.md U8/U21/U24/U25/U38 · oom-memory.md · distributed-launch.md · multinode.md · verifying-dl-experiments (skill)

---

## Diagnose first — do NOT tune blind
### T1 — The 3-way split: GPU-bound vs data-bound vs comms-bound (decide before touching a knob)

**Symptom**: training is "slow" and the instinct is to change the model or batch size at random.

**Root cause**: throughput is gated by exactly one of three resources at a time; the fix for each is
disjoint, so guessing wastes paid wall-clock (principle #1).

**Fix: classify with one cheap reading each** (heuristic: util consistently >90% ⇒ GPU-bound;
low/fluctuating ⇒ elsewhere; both CPU+GPU low ⇒ I/O;
https://apxml.com/courses/planning-optimizing-ai-infrastructure/chapter-5-strategies-for-performance-optimization/identifying-performance-bottlenecks):
- **GPU-bound** (the good case): util high *and* SM clock/power high (T2); adding workers doesn't help. Only
  levers left: kernels (T14–T15), precision (T13), a bigger card.
- **Data-bound**: util low-but-nonzero or sawtoothing, host CPU busy in `DataLoader`/transforms; a trace
  shows GPU-idle gaps lining up with CPU data work (T19). Go to T4–T8.
- **Comms-bound** (multi-GPU/-node only): per-GPU util high, scaling efficiency poor; time in
  `nccl:all_reduce`/`all_gather` not overlapped with compute. Go to T23.

The highest-signal instrument is a **profiler trace** (T19); read it before changing anything.

### T2 — `nvidia-smi` GPU-Util % lies; correlate clock + power → gotchas_universal.md U21

A 100%-util reading can hide a starved GPU (a trickle of tiny kernels reads as 100%). The full diagnosis
(correlate `clocks.current.sm` + mem-bandwidth util + power via `nvidia-smi dmon -s pucvmet -d 1`, plus the
thermal/power-throttle slowdown) lives in **gotchas_universal.md U21/U23**; read it before concluding a run
is GPU-bound. The *0%-util-but-running* (CPU-data-bound) inverse is **U38**, owned by verifying-dl-experiments.

### T3 — Cheap triage when no profiler is wired yet: is the host CPU busy?

**Symptom**: need a 30-second answer to "GPU or data?" before instrumenting.

**Fix**: watch GPU and CPU at once for ~10 s:
```bash
nvidia-smi dmon -s pu -d 1 -c 10          # per-second SM% + power; sawtooth/low = starved
top -b -n 1 | grep -i python | head       # a worker pegged at ~100% CPU = CPU-transform-bound
```
GPU SM% high and steady ⇒ GPU-bound (stop here, go to kernels/precision). GPU SM% sawtoothing while a
python worker is CPU-pegged ⇒ data-bound (T4–T8). Both idle ⇒ I/O-bound (stage to NVMe, U8). Then confirm
with a real trace (T19) before investing in a fix. **GPU SM% low while *many* python threads thrash a few
cores (not one worker pegged) ⇒ intra-op thread oversubscription** on a vCPU slice, not data-bound: cap
`OMP_NUM_THREADS` to your cgroup quota (gotchas_universal.md **U40**), don't add dataloader workers.
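
The three-way read above can be sketched as a small function over the ~10 readings. This is only an illustration: the function name and every threshold (`high`, `jitter`, `cpu_pegged`) are assumptions, not measured cut-offs, and the thread-oversubscription case (U40) is deliberately left out because it needs a per-thread view.

```python
from statistics import mean, pstdev


def triage(sm_util, max_worker_cpu, high=90.0, jitter=15.0, cpu_pegged=90.0):
    """Classify a short window of per-second SM% readings.

    sm_util        -- list of SM utilisation percentages (e.g. from `nvidia-smi dmon`)
    max_worker_cpu -- CPU% of the busiest python worker (e.g. from `top`)
    All thresholds are illustrative assumptions.
    """
    avg, spread = mean(sm_util), pstdev(sm_util)
    if avg >= high and spread < jitter:
        return "gpu-bound"      # high and steady: go to kernels/precision
    if max_worker_cpu >= cpu_pegged:
        return "data-bound"     # GPU starved while a worker is CPU-pegged
    return "io-bound"           # GPU and CPU both mostly idle


print(triage([97, 98, 96, 99, 97], max_worker_cpu=40))   # gpu-bound
print(triage([95, 10, 92, 8, 90], max_worker_cpu=100))   # data-bound
print(triage([12, 9, 15, 11, 10], max_worker_cpu=20))    # io-bound
```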
---
## Dataloader — the #1 reason a rented GPU sits idle

The partial-starve knob set (and its order) is **gotchas_universal.md U24**; this section is the per-knob
*why/when*. Each helps a *different* failure, so apply by symptom, not as a blanket cargo-cult.

### T4 — `num_workers`: 0 means the main process loads serially (the default starves the GPU)

**Symptom**: `DataLoader(num_workers=0)` (the default): every batch is fetched on the main thread, so the
GPU waits out the whole fetch.

**Root cause**: with `num_workers=0` "the data will be loaded in the main process", so there is no overlap
between data prep and compute (https://docs.pytorch.org/docs/2.12/data.html).

**Fix**: set `num_workers > 0` to load asynchronously and overlap fetch with the GPU step. Start at
`cores − 1`, but **size against per-worker RAM, not CPU count**: each `fork`ed worker can end up holding its
own copy of any large in-dataset object, and too many workers OOM the cgroup with a bare `Killed` (the
quadratic trap + sizing rule are **gotchas_universal.md U9**). Not monotonic: past the point where the GPU is
fed, extra workers only add RAM and startup cost.
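
The "cores − 1, capped by RAM" rule can be written down as a two-line bound. A minimal sketch, assuming you have already measured one worker's resident memory; `suggest_num_workers`, `ram_budget_gb` and `per_worker_gb` are hypothetical names, and the authoritative sizing rule remains U9.

```python
import os


def suggest_num_workers(ram_budget_gb, per_worker_gb, cores=None):
    """Upper bound on DataLoader workers: min(cores - 1, RAM budget / per-worker RSS).

    os.cpu_count() may report host cores rather than a cgroup quota,
    so pass `cores` explicitly on a vCPU slice.
    """
    cores = cores or os.cpu_count() or 1
    by_cpu = max(cores - 1, 0)
    by_ram = int(ram_budget_gb // per_worker_gb)
    return max(min(by_cpu, by_ram), 0)


print(suggest_num_workers(ram_budget_gb=24, per_worker_gb=3.0, cores=16))  # 8  (RAM-limited)
print(suggest_num_workers(ram_budget_gb=64, per_worker_gb=1.0, cores=8))   # 7  (CPU-limited)
```

Treat the result as a ceiling to test downward from, since extra workers stop helping once the GPU is fed.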
### T5 — `persistent_workers=True`: stop paying worker-startup every epoch

**Symptom**: a visible stall at the **start of every epoch** (especially short epochs / many epochs); GPU
idle while workers respawn.

**Root cause**: default `persistent_workers=False` shuts down all workers after the dataset is consumed
once and **re-forks them next epoch**, which means re-importing, re-opening files, and rebuilding the
dataset object each time (https://docs.pytorch.org/docs/2.12/data.html).

**Fix**: `persistent_workers=True` keeps the worker Dataset instances alive between epochs, removing the
per-epoch respawn cost. Requires `num_workers > 0`. Biggest win when epochs are short or the dataset's
`__init__` is heavy (loads an index/manifest).

### T6 — `pin_memory=True` + `non_blocking=True`: overlap the host→device copy

**Symptom**: the H2D copy (`x.to('cuda')`) sits on the critical path between fetch and forward.

**Root cause**: a pageable-memory tensor must be staged through a pinned buffer by the driver before DMA;
a synchronous `.to(device)` blocks the step. "When using a GPU it's better to set `pin_memory=True`"
(https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html).

**Fix**: `DataLoader(pin_memory=True)` allocates batches page-locked, **then** transfer with
`x = x.to(device, non_blocking=True)` so the copy is asynchronous with respect to the host and the CPU can
keep queuing work while it runs. Both halves are needed: `pin_memory` alone still blocks; `non_blocking`
without pinned memory silently falls back to a blocking copy. Costs host RAM (pinned pages aren't swappable),
so back off if it pressures the cgroup (U9).

### T7 — `prefetch_factor`: deepen the queue when fetch time is bursty

**Symptom**: with workers on, the GPU still periodically stalls: every *Nth* step (N = `num_workers`) has
a long idle gap because all workers were busy producing the next batch when the GPU asked
(https://docs.pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html).

**Root cause**: `prefetch_factor` defaults to **2** when `num_workers>0` (None when 0): "2 means there
will be a total of `2 * num_workers` batches prefetched across all workers"
(https://docs.pytorch.org/docs/2.12/data.html). A shallow queue can't absorb a variance spike in
per-sample fetch/decode time.

**Fix**: raise `prefetch_factor` (3–4) so workers run ahead and bursts hide, at the cost of more resident
batches in RAM (re-check U9). A *smoothing* knob, not a multiplier: if the **average** fetch rate is below
the GPU's consume rate, no depth helps; fix the rate (workers T4, GPU transform T8, NVMe U8) instead.
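
The knobs in T4–T7 interact (two of them are rejected by `DataLoader` when `num_workers=0`), and the smoothing-versus-rate distinction is plain arithmetic. The sketch below captures both in pure Python; the helper names are hypothetical and the timings in the example are made-up inputs, not measurements.

```python
def loader_kwargs(num_workers, use_cuda, prefetch_factor=2):
    """Build DataLoader keyword arguments that are valid for the given worker count."""
    kwargs = {"num_workers": num_workers, "pin_memory": use_cuda}
    if num_workers > 0:
        # DataLoader raises ValueError if either option is set with num_workers=0.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


def queue_depth(num_workers, prefetch_factor):
    """Batches prefetched across all workers (each one is resident in host RAM)."""
    return num_workers * prefetch_factor


def fetch_keeps_up(fetch_s_per_batch, num_workers, gpu_step_s):
    """True if the average batch production rate meets the GPU's consumption rate."""
    return num_workers / fetch_s_per_batch >= 1.0 / gpu_step_s


print(loader_kwargs(8, use_cuda=True, prefetch_factor=4))
print(queue_depth(8, 4))                                                  # 32 resident batches
print(fetch_keeps_up(fetch_s_per_batch=0.4, num_workers=8, gpu_step_s=0.1))  # True
print(fetch_keeps_up(fetch_s_per_batch=0.4, num_workers=2, gpu_step_s=0.1))  # False
```

The dictionary is meant to be splatted into the constructor, e.g. `DataLoader(dataset, batch_size=..., **loader_kwargs(8, use_cuda=True))`. When `fetch_keeps_up` is false, a deeper queue only delays the stall.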
### T8 — IO-bound vs CPU-transform-bound are different data-bound cases (different fix)

**Symptom**: data-bound (T1), but adding workers barely helps.

**Root cause (split the case)**:
- **IO-bound**: bytes arrive slowly from network/HDD/object store; workers sit in `read`. Stage the working
  set to instance-local **NVMe** (HDD→NVMe gaps reach ~35×) = **gotchas_universal.md U8**; the
  many-tiny-files case, where per-file transaction overhead dominates, and its **shard-into-tar /
  WebDataset** fix = **U25**.
- **CPU-transform-bound**: a heavy per-sample augment (resize/decode/FFT) saturates CPU; workers are
  CPU-pegged (T3) and throughput caps at core count. Move the transform to the **GPU** (NVIDIA DALI,
  `torchvision.transforms.v2` on tensors, kornia) so it runs on idle GPU cycles. The *0%-util*
  serialized-transform variant is **U38**, owned by verifying-dl-experiments **REQUIRED** (which also owns
  whether a GPU-side transform shifted the data distribution).

**Fix**: read the trace (T19): time in `read`/`stat` ⇒ U8/U25; time in a transform fn ⇒ move to GPU.

---

## Free / near-free knobs (set these once at startup on any box)
### T9 — TF32 / `set_float32_matmul_precision("high")` — the "why is my A100 slow" footgun

The biggest free speedup on Ampere+ for any fp32 matmul path; TF32 matmuls have been **OFF by default since
PyTorch 1.12**. The decision and exact knobs (`torch.set_float32_matmul_precision("high")`, the legacy
`allow_tf32` flags, `--tf32 1` in HF Trainer, convergence impact) are owned by
**references/training/precision-stability.md P2** (cross-link there; do NOT restate). If a fresh PyTorch 2.x
rental's fp32-heavy run is 2–4× slower than expected with no bug, this is the first suspect.

### T10 — `cudnn.benchmark=True`: autotune conv algorithms (fixed input shapes only)

**Symptom**: a conv-heavy net (CNN/UNet) is slower than it should be; input shapes are constant.

**Root cause**: by default cuDNN picks a generic conv algorithm; the autotuner benchmarks variants on the
first batch of each new shape and caches the fastest
(https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html).

**Fix**: `torch.backends.cudnn.benchmark = True` once at startup. **Only helps when input shapes are
stable**: with variable shapes (dynamic resolution, ragged batches) it re-benchmarks every new shape and
*loses* time. Trade-off: it is **nondeterministic** (picks by first-batch timing), so it fights the
determinism knobs; whether to enable it for a clean datapoint is owned by precision-stability P19 /
verifying-dl-experiments (U36, **REQUIRED**).

### T11 — `channels_last`: free Tensor-Core speedup for conv nets under AMP

**Symptom**: a CNN under mixed precision isn't hitting Tensor Cores; throughput below the card's potential.

**Root cause**: default NCHW contiguous layout forces layout transposes around Tensor-Core convolutions.

**Fix**: convert model and inputs to `memory_format=torch.channels_last`:
`model = model.to(memory_format=torch.channels_last)` and `x = x.to(memory_format=torch.channels_last)`.
Optimizes convolutional networks with Tensor Cores + AMP
(https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html). Marked experimental and CNN-specific
(no benefit for pure transformers). No numerics change; purely a layout speedup.

### T12 — `set_to_none` + disable debug APIs (two free per-step taxes to remove)

- **`optimizer.zero_grad(set_to_none=True)`** (the **default** since PyTorch 2.0) over zero-filling:
  assigning `None` skips a memory-write kernel per param and lets the next backward write fresh
  (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html). Edge case: code reading `.grad`
  between steps must tolerate `None`.
- **Turn OFF debug APIs for the real run**: `torch.autograd.set_detect_anomaly(True)`,
  `torch.autograd.profiler.profile`, and `gradcheck` add per-op bookkeeping (anomaly detection is ~10× slower,
  precision-stability P9). Grep for `detect_anomaly` / leftover `with profile(` wrappers before a long launch
  (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html); they are easy to leave on after a
  NaN hunt.

---

## Mixed precision for speed
### T13 — bf16/fp16 is a throughput lever, not just a memory lever

**Symptom**: fp32 training under-uses Tensor Cores; the GPU has bf16/fp16 tensor cores.

**Root cause**: 16-bit matmuls run on Tensor Cores at much higher FLOP/s and halve activation
read/write bandwidth, a speedup *on top of* the memory saving (oom-memory.md M6).

**Fix**: `torch.autocast("cuda", dtype=torch.bfloat16)` on Ampere+ (the modern default; no GradScaler,
see precision-stability P6) or `bf16=True` in HF `TrainingArguments`. The full precision decision (bf16 vs
fp16 vs the V100/T4 fp16-only path, GradScaler mechanics, NaN/overflow) is owned by
**references/training/precision-stability.md P1–P10** (cross-link; do NOT restate). The *memory* angle and
the activation-bucket math is **oom-memory.md M6**. A NaN/divergence after the swap is a numerics question →
precision-stability / verifying-dl-experiments (**REQUIRED**).

---

## Kernels — the levers left once the GPU is fed
### T14 — SDPA / FlashAttention: stop materializing the O(seq²) attention matrix

**Symptom**: a transformer is attention-bound; long sequences are slow and memory-heavy; or `flash_attn` is
"installed" but the run is no faster.

**Root cause**: the eager/`math` attention path materializes the full `seq×seq` score matrix. The fused
**FlashAttention** / **memory-efficient** backends never do, but PyTorch's `scaled_dot_product_attention`
**falls back to the slow `math` backend without erroring** when the fused kernel's input constraints aren't
met (wrong dtype, head dim, mask shape): "if a fused implementation is not available, a warning will be
raised" (https://docs.pytorch.org/docs/2.12/generated/torch.nn.functional.scaled_dot_product_attention.html).

**Fix**:
- Use `F.scaled_dot_product_attention(q, k, v)` (or `attn_implementation="sdpa"`, the HF default on torch
  2.1.1+), which auto-picks FlashAttention / memory-efficient / cuDNN / math. Feed it **fp16/bf16** inputs:
  the FlashAttention kernel requires 16-bit, so fp32 inputs land on a slower backend (`math` in the worst
  case).
- **Force-verify** the fast backend instead of trusting silence:
  ```python
  import torch.nn.functional as F
  from torch.nn.attention import sdpa_kernel, SDPBackend

  # q, k, v: 16-bit CUDA tensors shaped (batch, heads, seq, head_dim)
  with sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):  # errors loudly if it can't be used
      out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
  ```
- **Installing `flash_attn` from source is a trap**: without `ninja` (`pip install ninja`) the CUDA
  extension compiles single-threaded (~2 h); with ninja it takes ~3–5 min on a 64-core box. With many cores
  but `<96 GB` RAM, ninja over-parallelizes and OOMs the build; cap it with
  `MAX_JOBS=4 pip install flash-attn --no-build-isolation`. Prefer a **prebuilt wheel** matching the
  `cuXX/torchYY/cpZZ` triple
  (https://github.com/Dao-AILab/flash-attention/issues/1038, https://pypi.org/project/flash-attn/). A
  torch/CUDA mismatch is **gotchas_universal.md U28**. Whether the fused kernel changes outputs (causal-mask
  edge cases) is a numerics check → verifying-dl-experiments (**REQUIRED**).
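
To see why the materialized score matrix dominates at long sequence lengths, it helps to work out its size. The sketch below is plain arithmetic for one layer's `batch × heads × seq × seq` matrix; the function name and the shapes (batch 8, 32 heads, 2-byte elements) are illustrative assumptions, not values from any particular model.

```python
def score_matrix_gib(batch, heads, seq_len, bytes_per_elem=2):
    """Memory for one layer's full seq x seq attention score matrix, in GiB."""
    return batch * heads * seq_len * seq_len * bytes_per_elem / 2**30


for seq_len in (2048, 8192, 32768):
    # 4x the sequence length costs 16x the memory: 2.0, 32.0, 512.0 GiB
    print(seq_len, score_matrix_gib(batch=8, heads=32, seq_len=seq_len))
```

The fused backends avoid allocating this matrix at all, which is why verifying that one of them is actually selected matters more as `seq_len` grows.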
### T15 — `torch.compile`: fuse kernels + cut launch overhead (one line, real gains)

**Symptom**: many small pointwise/elementwise ops; Python/launch overhead dominates between big matmuls.

**Root cause**: eager launches each op separately; Inductor fuses adjacent ops into Triton kernels and
(in CUDA-graph modes) eliminates per-step launch overhead, reusing the execution plan across steps.

**Fix**: wrap the model with `model = torch.compile(model)`. Modes
(https://huggingface.co/docs/transformers/en/perf_torch_compile):
- `default`: balanced speed/memory.
- `mode="reduce-overhead"`: uses **CUDA graphs** to kill Python overhead (best for many tiny ops /
  small-batch / inference), at a little more memory.
- `mode="max-autotune"`: longest compile, fastest steady-state.
- HF `TrainingArguments(torch_compile=True, torch_compile_mode="reduce-overhead")`.

Reported ~2.2× mean-inference speedups; training gains are real but model-dependent. **First step(s) are
slow** because compilation is lazy on first call
(https://huggingface.co/docs/transformers/en/perf_torch_compile); exclude warm-up from any throughput
measurement. Set `fullgraph=True` while developing to surface graph breaks loudly instead of silently losing
speed. Whether the compiled *numbers* match eager → verifying-dl-experiments (**REQUIRED**).
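
Excluding warm-up from a throughput number is easy to get wrong, so here is a minimal timing harness in plain Python. It is a sketch under assumptions: `measure_throughput` and its parameters are hypothetical names, `step_fn` stands in for your own training step, and the optional `sync` callback is where a GPU run would pass something like `torch.cuda.synchronize` so queued kernels are counted.

```python
import time


def measure_throughput(step_fn, samples_per_step, warmup=3, steps=20, sync=None):
    """Return samples/s over `steps` timed calls, after `warmup` untimed calls."""
    for _ in range(warmup):        # compilation / autotuning / cache fill happens here
        step_fn()
    if sync is not None:
        sync()                     # drain queued device work before starting the clock
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    if sync is not None:
        sync()                     # make sure the timed work has actually finished
    elapsed = time.perf_counter() - start
    return samples_per_step * steps / elapsed


# Stand-in step that takes about 1 ms; replace with a real training step.
rate = measure_throughput(lambda: time.sleep(0.001), samples_per_step=32, warmup=2, steps=10)
print(f"{rate:.0f} samples/s")
```

Comparing eager and compiled runs with the same harness and the same `warmup` keeps the first-call compile cost out of both numbers.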
### T16 — `torch.compile` recompilation trap: variable shapes silently blow the cache → eager

**Symptom**: a compiled run is *slower* than eager, or stutters periodically; throughput never stabilizes.
Common with variable batch/seq-len, dynamic padding, or per-step changing shapes.

**Root cause**: compile creates **guards** on traced shapes; a new shape violates a guard and triggers a
**recompile**. Past the recompile cap (`torch._dynamo.config.recompile_limit`, default **8**; legacy
`cache_size_limit`) Dynamo **stops compiling that function and runs it eagerly**, paying all the compile
cost and getting none of the benefit
(https://docs.pytorch.org/docs/stable/compile/programming_model.recompilation.html,
https://github.com/pytorch/pytorch/issues/93457).

**Fix**:
- **See it**: `TORCH_LOGS=recompiles python train.py` logs which function recompiled and the failed guard;
  `TORCH_LOGS=graph_breaks` and `torch._dynamo.explain(...)` locate graph breaks
  (https://docs.pytorch.org/docs/stable/torch.compiler_troubleshooting.html).
- **Tame shapes**: pad/bucket to a few fixed shapes so guards stop firing; or mark the varying dim dynamic:
  `torch.compile(model, dynamic=True)` (or `mark_dynamic` / `TORCH_COMPILE_DYNAMIC_SOURCES`) compiles
  one shape-generic graph instead of one per size. `dynamic=False` forces a fresh recompile per distinct
  size (use only with truly few shapes)
  (https://docs.pytorch.org/docs/stable/compile/programming_model.html).
- **Last resort**: raise `torch._dynamo.config.recompile_limit` only if a handful of *stable* extra shapes
  legitimately exist; raising it to mask genuinely unbounded shapes just thrashes.
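
The pad/bucket option amounts to rounding every sequence length up to one of a few fixed sizes, so the number of distinct shapes the compiler sees is bounded by the number of buckets. A minimal sketch follows; the bucket boundaries and the `bucket_len` helper are hypothetical, chosen only to keep the count well under the default recompile cap of 8.

```python
from bisect import bisect_left

BUCKETS = (128, 256, 512, 1024)   # hypothetical bucket boundaries


def bucket_len(n, buckets=BUCKETS):
    """Smallest bucket >= n, so only len(buckets) distinct shapes ever occur."""
    i = bisect_left(buckets, n)
    if i == len(buckets):
        raise ValueError(f"length {n} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


lengths = [37, 128, 129, 500, 513, 1000]
padded = [bucket_len(n) for n in lengths]
print(padded)             # [128, 128, 256, 512, 1024, 1024]
print(len(set(padded)))   # 4 distinct shapes for 6 distinct raw lengths
```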
---
## Memory ↔ speed trades
### T17 — Activation checkpointing buys memory by spending ~20–30% compute (know the cost)

**Symptom**: gradient/activation checkpointing is on "to be safe" and training is slow, but the model
actually fits without it.

**Fix**: checkpointing **recomputes** activations in backward instead of storing them, trading **~20–30%
extra compute** for a large memory cut (https://docs.pytorch.org/tutorials/recipes/recipes/tuning_guide.html,
oom-memory.md M7). Enable it **only when activations actually OOM** (full rationale + `use_reentrant=False` /
`use_cache=False` gotchas = **oom-memory.md M7**); if it fits without, turning it off is a free ~25% speedup.
When the model only just fails to fit, checkpoint only the *fewest/heaviest* blocks needed to fit, not the
whole model.
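
The "~25% speedup" figure is a throughput gain, which is not the same as the fraction of wall-clock saved. A short worked example, assuming the step is compute-dominated and taking 25% as the recompute overhead (an illustrative midpoint of the 20–30% range above); `step_time` is a hypothetical helper.

```python
def step_time(base_s, recompute_overhead, checkpointing):
    """Step time in seconds, with or without the recompute cost."""
    return base_s * (1.0 + recompute_overhead) if checkpointing else base_s


on = step_time(1.0, 0.25, checkpointing=True)     # 1.25 s per step
off = step_time(1.0, 0.25, checkpointing=False)   # 1.00 s per step

throughput_gain = on / off - 1.0    # steps/s go up by this fraction
time_saved = 1.0 - off / on         # wall-clock per step goes down by this fraction
print(round(throughput_gain, 3), round(time_saved, 3))   # 0.25 0.2
```

So a run that fits without checkpointing finishes in roughly 80% of the time, under these assumptions.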
### T18 — Bigger micro-batch ≈ better GPU utilization (up to the memory wall)

**Symptom**: tiny batches under-feed the GPU; util and throughput both low though VRAM is mostly free (small
batches under-fill Tensor Cores and amortize launch/sync overhead poorly).

**Fix**: raise micro-batch toward the VRAM limit; keep the **effective** batch fixed with grad-accum if the
result depends on it (`batch 4 × accum 16` beats `batch 1 × accum 64`, see oom-memory.md M5).
Accuracy/effective-batch implications (LR scaling, accumulation loss-weighting) → verifying-dl-experiments
(**REQUIRED**). Sizing alongside a concurrent job + `expandable_segments` = **gotchas_universal.md U10** /
oom-memory.md M8.
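
Keeping the effective batch fixed while the micro-batch grows is a one-line division. The sketch below reproduces the `4 × 16` and `1 × 64` pairs from the text; the `accum_steps` helper and its `world_size` parameter are hypothetical additions for illustration.

```python
def accum_steps(effective_batch, micro_batch, world_size=1):
    """Gradient-accumulation steps that keep the effective batch fixed."""
    per_step = micro_batch * world_size
    if effective_batch % per_step:
        raise ValueError("effective batch must be a multiple of micro_batch * world_size")
    return effective_batch // per_step


print(accum_steps(64, micro_batch=1))                 # 64
print(accum_steps(64, micro_batch=4))                 # 16
print(accum_steps(64, micro_batch=4, world_size=2))   # 8
```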
---
## Profilers — measure the bottleneck, don't guess it
### T19 — `torch.profiler`: the definitive data-bound vs compute-bound verdict

**Symptom**: need to *prove* where time goes (which T1 case), not infer from util%.

**Fix: scheduled profile of a few steps**
(https://docs.pytorch.org/tutorials/recipes/recipes/profiler_recipe.html):
```python
from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

# `loader` and `train_step` are your own DataLoader and per-batch training function.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),   # skip warm-up; record 3 steps
    on_trace_ready=tensorboard_trace_handler("./tb_trace"),
    record_shapes=True, with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()          # advance the profiler schedule once per training step
        if step >= 6:
            break
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
```
**Read it**: large **GPU-timeline gaps** with CPU busy in `DataLoader`/transforms during the gap ⇒
**data-bound** (T4–T8); the TensorBoard "Performance Recommendation" panel names the DataLoader directly
(https://docs.pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html). Densely-packed GPU
timeline ⇒ GPU-bound; sort by `self_cuda_time_total` for the hottest kernel (T14/T15). Time in `nccl:*` not
overlapped ⇒ comms-bound (T23). On a remote box, write the trace and view it locally: `scp` it down
(references/ssh_transport.md) and open a raw `export_chrome_trace("trace.json")` file at `chrome://tracing`;
never run a viewer over ssh.

### T20 — `nsys` / Nsight Systems: system-wide timeline when the gap is below PyTorch's view

**Symptom**: torch.profiler shows GPU-idle gaps but not *why* (CPU launch latency, a hidden sync, a memcpy,
a kernel-launch storm); or you want CUDA-API + NVTX + OS-runtime on one timeline.

**Root cause**: torch.profiler sees PyTorch ops; `nsys` traces the whole system (CUDA API, kernels,
memcpy, NVTX ranges, OS-runtime), so it exposes launch-bound stalls and CPU↔GPU sync that PyTorch can't.
"Periodic gaps in the CUDA HW row are moments when the GPU is idle — a red flag"
(https://docs.lxp.lu/howto/pytorch-profiling-with-nsight/).

**Fix: profile a bounded window on the box, view locally** (canonical PyTorch recipe,
https://gist.github.com/mcarilli/376821aa1a7182dfcf59928a7cde3223):
```bash
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu \
    --capture-range=cudaProfilerApi -x true -o report python train.py
```
In the script, bound the window so the `.nsys-rep` stays small:
```python
import torch

torch.cuda.profiler.cudart().cudaProfilerStart()   # after warm-up
# ... a handful of steps, optionally wrapped in torch.cuda.nvtx.range_push/pop ...
torch.cuda.profiler.cudart().cudaProfilerStop()
```
`scp` the `.nsys-rep` down, open in the Nsight Systems GUI. Nsight **Systems** finds *which* kernel is slow;
Nsight **Compute** (`ncu`) finds *why* (occupancy, bandwidth, warp stalls), but `ncu` is heavy, so reserve it
for one hot kernel (https://www.spheron.network/blog/gpu-profiling-ai-workloads-nsight-compute-pytorch-profiler-guide/).

### T21 — `py-spy`: profile a LIVE training process with no restart, no code change

**Symptom**: a long run is mysteriously slow or apparently hung; restarting it to add a profiler would cost
hours and might not reproduce.

**Root cause**: a Python-side bottleneck or deadlock (a slow transform, a lock, a blocking collective) that
needs inspection *in situ*.

**Fix: attach by PID, zero instrumentation** (https://github.com/benfred/py-spy):
```bash
py-spy dump --pid <PID>                  # one-shot stack of every thread → where it's hung RIGHT NOW
py-spy top --pid <PID>                   # live "which functions burn time" (Unix top-style)
py-spy record -o prof.svg --pid <PID>    # flame graph over a window
```
"The profiled program needs no import, no decorator, and no restart." On a rented box mid-run, `py-spy dump`
instantly distinguishes a *hung* process (stuck in `recv`/lock/`all_reduce`) from a *slow* one (busy in a
transform); it pairs with the "is it actually hung?" check (gotchas_universal.md U17,
verifying-dl-experiments **REQUIRED**). May need `--native` for C-extension frames and `sudo`/`SYS_PTRACE`
to attach.

### T22 — CUDA memory snapshot/visualizer → oom-memory.md M19

For *what allocated the memory* (not time), the `torch.cuda.memory._record_memory_history` snapshot +
https://pytorch.org/memory_viz timeline is owned by **references/training/oom-memory.md M19/M18**. It is a
memory tool, not a throughput tool; it is listed here only so the profiler menu is complete. Do NOT restate.

---

## Multi-GPU / multi-node communication
### T23 — Compute-comms overlap: DDP overlaps by default; tune the bucket, watch for breakers

**Symptom**: scaling efficiency is poor: per-GPU util is high, but N GPUs deliver far less than N×
throughput; the trace shows `all_reduce`/`all_gather` *not* overlapped with backward compute.

**Root cause**: DDP overlaps gradient all-reduce with backward by bucketing gradients and launching each
bucket's reduce on a separate CUDA stream as soon as it's ready
(https://github.com/pytorch/pytorch/issues/67570). Overlap *breaks* when something forces a sync or extra
work per step: unused parameters and the `find_unused_parameters=True` flag that handles them (off by
default; it adds a graph traversal every iteration), a `.item()`/print/`.cpu()` in the step, or
too-small/too-large buckets.

**Fix (single box, DDP/FSDP; the launch/sharding mechanics live in
references/training/distributed-launch.md, REQUIRED)**:
- Tune `bucket_cap_mb` (DDP) to batch gradient chunks into fewer, larger all-reduces; set
  `gradient_as_bucket_view=True` to cut a copy. Buckets too small = launch overhead; too large = late
  overlap.
- FSDP: enable `backward_prefetch` (prefetch the next layer's all-gather during current backward) and
  `forward_prefetch` so comms hide under compute; `limit_all_gathers` if memory-pressured.
- Remove per-step host syncs (`loss.item()` every step, prints, eager `.cpu()`) that serialize the stream.

**Inter-node** transport (NCCL picking the wrong NIC, fabric-manager hang, 1800 s timeout masking a
straggler, MTU mismatch) is **references/multinode.md** (**REQUIRED** for ≥2 instances): a comms "slowdown"
across boxes is usually one of those, not a bucket-size tune. Whether a world-size change silently rescaled
the effective batch/LR is a science question → verifying-dl-experiments (**REQUIRED**).
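
"Far less than N× throughput" can be put on a scale by computing scaling efficiency from two measured throughputs. A minimal sketch; the function name is hypothetical and the sample numbers are made up for illustration.

```python
def scaling_efficiency(throughput_n, throughput_1, n_gpus):
    """Fraction of ideal linear scaling achieved: 1.0 means N GPUs give exactly N x."""
    return throughput_n / (n_gpus * throughput_1)


# Hypothetical measurements: 1000 samples/s on 1 GPU, 5600 samples/s on 8 GPUs.
eff = scaling_efficiency(throughput_n=5600.0, throughput_1=1000.0, n_gpus=8)
print(f"{eff:.0%} of linear scaling")   # 70% of linear scaling
```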
---
## Pointers — throughput gotchas catalogued elsewhere (do NOT restate)

- **gotchas_universal.md**: **U8** stage hot data to local NVMe (IO-bound) · **U21** `nvidia-smi` util% is
  a liar (+ **U23** thermal/power throttle) · **U24** dataloader-starvation knob order · **U25** millions of
  small files → shard into tar/WebDataset · **U38** GPU 0%-util CPU-data-bound (owned by verifying-dl).
- **references/training/oom-memory.md**: M5 micro-batch/grad-accum · M6 bf16 activations · M7 activation
  checkpointing memory rationale · M8 `expandable_segments` · M19 memory snapshot/visualizer.
- **references/training/precision-stability.md**: P1–P10 the precision decision + AMP mechanics · P2 the
  TF32-off footgun · P19 determinism-vs-`cudnn.benchmark` speed trade.
- **references/training/distributed-launch.md**: torchrun/Accelerate/DeepSpeed launch, DDP/FSDP sharding,
  and the desync/hang toolkit (the launch substrate this file's T23 sits on).
- **references/multinode.md**: inter-node NCCL/NIC/fabric/timeout/MTU (the wire between boxes). Single-box
  users skip.
- **verifying-dl-experiments** (**REQUIRED**): owns *is-the-number-real*: whether a kernel/precision/compile
  swap changed the result, whether dropping samples or a GPU-side transform shifted the distribution, the
  0%-util diagnosis (U38), determinism (U36). This file makes training *fast*; that skill decides if the
  *faster result is still true*.