# Convergence & optimization debugging — it runs, doesn't crash, but won't learn (or learns badly)

The other training layers cover the run that **crashes** (`oom-memory.md`), **NaNs** (`precision-stability.md`), **hangs** (`distributed-launch.md`), or is **slow** (`throughput-profiling.md`). This file owns the quieter, far more common failure: the job runs cleanly to the end but the **loss is flat, falls too slowly, or the model underfits** — and the bug is in the optimization wiring, not the hardware. Each entry is **Symptom → Root cause → Fix** with the exact knob. **Always start with O1 (overfit one batch)** — it separates "the loop is broken" from "the model/data is weak" in five minutes and tells you which half of this file you need.

Boundary: **verifying-dl-experiments** (**REQUIRED** at every "is the result real" fork) owns collapse, leakage, metric validity, train-vs-val generalization, and seed interpretation; this file owns the *mechanism* of why a correct-looking loop doesn't converge. NaN / loss-spike / LR-too-**HIGH** live next door in `precision-stability.md` (P8–P18) — this file is the LR-too-**LOW** / won't-move / mis-wired side.

To jump: `grep -in '<keyword>' references/training/convergence-debugging.md` (e.g. `overfit`, `requires_grad`, `no_grad`, `optimizer`, `weight decay`, `adamw`, `lr finder`, `scheduler`, `accum`, `cross entropy`, `bcewithlogits`, `nllloss`, `freeze`, `batchnorm`, `discriminative`, `lora`, `update ratio`, `dead relu`).
## Table of contents

- **It isn't learning at all (start here)** — O1 overfit-one-batch · O2 params-not-in-optimizer · O3 loss-detached-from-graph · O4 zero_grad/backward/step-order · O5 train()/eval()-mode
- **Optimizer / LR / weight-decay / schedule** — O6 AdamW-vs-Adam+no-decay-group · O7 LR-too-LOW+finder · O8 scheduler-order/cadence · O9 grad-accum-divide · O10 AdamW-eps-in-bf16 · O11 fused/foreach
- **Loss-function footguns** — O12 double-softmax · O13 BCEWithLogits · O14 CE-target-form · O15 padded-loss-reduction · O16 NLLLoss-needs-log_softmax
- **Fine-tuning / transfer** — O17 frozen-but-still-in-optimizer · O18 frozen-BN-running-stats · O19 discriminative-LR/forgetting · O20 strict=False-shape-mismatch · O21 LoRA/PEFT-wiring
- **Training-dynamics dashboard (instrument it)** — O22 update:weight-ratio · O23 actual-LR · O24 GradScaler-scale · O25 dead-ReLU-fraction · O26 weight/grad/act-histograms
- **Pointers** — precision-stability.md, distributed-launch.md, verifying-dl-experiments (skill)

---

## It isn't learning at all — the first-hour triage

### O1 — Run the overfit-one-batch smoke BEFORE tuning anything (the canonical correctness test)

**Symptom**: training "runs" (no error, normal throughput) but loss plateaus near its init value or wanders without trending down, across LRs/optimizers/architectures. You're tuning hyperparameters blind because nothing proves the loop can learn at all.

**Root cause**: the loop is broken somewhere between forward and weight-update (any of O2–O5, or a label/shape bug) and no single test isolates "can this code memorize?" from "is this a modeling/data problem?".

**Fix**: take ONE fixed mini-batch (2 examples is enough) and loop forward/backward/step on **that same batch** for hundreds of iters — a correct loop drives train loss → ~0. Turn **off** augmentation, shuffling, dropout, and weight decay for the test. Also "verify the loss at init" (e.g. softmax CE should start near `-log(1/n_classes)` then fall).
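
The smoke test above can be sketched roughly as follows. This is a minimal illustration, not a canonical implementation: the `overfit_one_batch` helper, the toy model, and the synthetic two-example batch are all hypothetical stand-ins for your own loop, model, and data.

```python
import math
import torch
import torch.nn as nn

def overfit_one_batch(model, loss_fn, x, y, steps=500, lr=1e-2):
    """Train on ONE fixed batch; a correct loop should drive the loss toward 0."""
    model.train()
    # weight_decay=0.0: no regularization during the smoke test.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=0.0)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

torch.manual_seed(0)
n_classes = 3
# Hypothetical toy model with no dropout or BatchNorm.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, n_classes))
x = torch.randn(2, 8)          # the same 2 examples on every iteration
y = torch.tensor([0, 2])

losses = overfit_one_batch(model, nn.CrossEntropyLoss(), x, y)
uniform_reference = -math.log(1.0 / n_classes)   # loss of a uniform guess
print(f"init loss {losses[0]:.3f} (uniform-guess reference {uniform_reference:.3f})")
print(f"final loss {losses[-1]:.5f}")
```

In a real project you would pass your own model and one cached batch from your dataloader, and treat a final loss that stays near the initial value as the signal to work through O2–O5.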
If it will not reach ~0, *"there is a bug somewhere and we cannot continue"* — debug the loop (O2–O5) before touching hyperparameters. (Smoke *content/interpretation* → **verifying-dl-experiments**; this is the mechanical gate.) ([Karpathy, "A Recipe for Training Neural Networks"](https://karpathy.github.io/2019/04/25/recipe/))

### O2 — Loss flat from step 0, weights byte-identical after `step()` → params aren't in the optimizer

**Symptom**: overfit-one-batch fails; a snapshotted param is unchanged before/after `optimizer.step()`; grad-norm may even be nonzero. No error.

**Root cause**: the optimizer updates a **different** set of tensors than the model forwards through. Four causes: (a) the params have `requires_grad=False` so `.grad` stays `None` and `step()` skips them; (b) a submodule/head was never passed into the optimizer's param iterable; (c) the optimizer was built from `model.parameters()` **before** `model.to(device)`, so it holds stale CPU tensors while the model forwards the GPU copies; (d) freeze/unfreeze toggled `requires_grad` but left the wrong set in the optimizer.

**Fix**: build the optimizer **after** `model.to(device)`. Assert it sees every trainable param: `opt_ids={id(p) for g in optimizer.param_groups for p in g['params']}; assert all(id(p) in opt_ids for p in model.parameters() if p.requires_grad)`. Log `sum(p.requires_grad for p in model.parameters())` at startup. Probe: `w0=next(model.parameters()).detach().clone(); <one train step>; assert not torch.equal(w0, next(model.parameters()))`. ([autograd notes](https://docs.pytorch.org/docs/stable/notes/autograd.html), [torch.optim](https://docs.pytorch.org/docs/stable/optim.html), [stale-optimizer-after-.to bug](https://github.com/pytorch/xla/issues/1623))

### O3 — `backward()` is a no-op / raises "does not require grad" → loss detached from the graph

**Symptom**: overfit fails with every `p.grad is None`; or `loss.backward()` raises *"element 0 of tensors does not require grad and does not have a grad_fn"*.

**Root cause**: the loss tensor was severed from autograd before `backward`. Common severings: (a) the train forward+loss ran inside `with torch.no_grad():` / `@torch.inference_mode()` left over from eval — *"computations in no-grad mode are never recorded in the backward graph"*; (b) `.item()` / `.detach()` / `.cpu().numpy()` / `float(loss)` on the loss path (e.g. back-propping an accumulated `total_loss += loss.item()`); (c) a tensor rebuilt from numpy mid-network; (d) the metric, not the differentiable loss, was passed to `backward()`.

**Fix**: before `backward`, `assert loss.requires_grad and loss.grad_fn is not None`. Keep the differentiable loss tensor distinct from logging scalars (log `loss.item()`, back-prop the raw tensor). Reserve `no_grad`/`inference_mode` for eval only. After `backward`, assert at least one `p.grad is not None`. ([autograd notes](https://docs.pytorch.org/docs/stable/notes/autograd.html))

### O4 — Wrong `zero_grad` / `backward` / `step` order, or a missing `step()`

**Symptom**: overfit fails; weights never move, or training is erratic despite nonzero grads.

**Root cause**: PyTorch's contract is *"gradients by default add up; to prevent double-counting we explicitly zero them each iteration"*, `backward` deposits into `.grad`, `step` reads `.grad`. Failure modes: (a) `optimizer.step()` omitted → grads computed, weights never updated; (b) `zero_grad()` placed **after** `backward()` → wipes the fresh grads; (c) `step()` **before** `backward()` → steps on stale/zero grads; (d) `zero_grad` never called → grads from all iters keep summing → effective LR explodes.

**Fix**: the canonical order, exactly — `optimizer.zero_grad(set_to_none=True)` → forward → `loss=loss_fn(out,y)` → `loss.backward()` → `optimizer.step()` (under AMP: `scaler.scale(loss).backward()` → `scaler.step(optimizer)` → `scaler.update()`). Gradient accumulation is the one exception (O9): `backward` every micro-step, `step`+`zero_grad` only on the boundary.
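
As a rough sketch of the canonical order, with the O3 assertions and the O2 weight-moved probe folded in: the `train_step` helper and the toy regression setup below are hypothetical, chosen only so the snippet runs on CPU.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, loss_fn, x, y):
    """One update in the canonical order: zero_grad -> forward -> backward -> step."""
    optimizer.zero_grad(set_to_none=True)
    out = model(x)
    loss = loss_fn(out, y)
    # The loss must still be attached to the autograd graph (O3).
    assert loss.requires_grad and loss.grad_fn is not None
    loss.backward()
    # At least one parameter should have received a gradient.
    assert any(p.grad is not None for p in model.parameters())
    optimizer.step()
    return loss.item()   # plain float, for logging only

torch.manual_seed(0)
model = nn.Linear(4, 1)                       # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
x, y = torch.randn(16, 4), torch.randn(16, 1)

w_before = model.weight.detach().clone()
first = train_step(model, optimizer, loss_fn, x, y)
weights_moved = not torch.equal(w_before, model.weight.detach())

for _ in range(50):
    last = train_step(model, optimizer, loss_fn, x, y)
print(f"weights moved: {weights_moved}, loss {first:.4f} -> {last:.4f}")
```

Returning `loss.item()` rather than the tensor keeps the logging scalar separate from the differentiable loss, which is the separation the O3 fix asks for.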
([optimization tutorial](https://docs.pytorch.org/tutorials/beginner/basics/optimization_tutorial.html), [torch.optim](https://docs.pytorch.org/docs/stable/optim.html))

### O5 — Forgot `model.train()` / left `model.eval()` on → Dropout & BatchNorm in the wrong mode

**Symptom**: two faces — (1) trained under `eval()`: BN uses frozen running stats and never updates them, Dropout is off → underfits / loss barely moves; (2) evaluated under `train()`: BN uses noisy per-batch stats and Dropout fires → val loss flickers run-to-run and looks worse than train.

**Root cause**: `train()`/`eval()` set a per-module flag that *"has an effect only on certain modules ... e.g. Dropout, BatchNorm"* (`eval()` == `train(False)`). In eval mode BN switches to stored `running_mean/var` and **stops** updating them; Dropout becomes identity. A fresh `nn.Module` defaults to `train()`, but any prior `.eval()` (a reused object, an inference helper, a val loop that didn't switch back) persists.

**Fix**: bracket phases explicitly — `model.train()` atop each train epoch; `model.eval()` + `with torch.no_grad():` for every val/test pass; `model.train()` again before resuming. After build/load, `assert model.training` before the train loop. (Frozen-backbone BN is a *different* axis → O18; tiny-batch BN → by-domain V7.) ([nn.Module.train/eval](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html))

---

## Optimizer / learning-rate / weight-decay / schedule

### O6 — Weight decay "does nothing" / Norm gains destabilize → coupled `Adam(weight_decay=)` + decaying bias & Norm

**Symptom**: `weight_decay` on `torch.optim.Adam` barely regularizes (or hurts) vs the literature's AdamW recipe; or a from-scratch transformer/CNN trains worse than a reference at the "same" wd; or small models destabilize when LayerNorm/BN gains and biases get shrunk toward 0.

**Root cause**: (1) `Adam`'s `weight_decay` is classic **L2** — added into the gradient, so it passes through Adam's `1/(sqrt(v)+eps)` preconditioner and params with large historical grads get **less** decay; the effective strength no longer tracks `wd`. **AdamW** applies decoupled decay directly to the weight (`θ ← θ − lr·wd·θ`), outside the moment path — uniform across params and independent of gradient history (though, as the formula shows, still scaled by `lr`). They are **not** interchangeable at the same `wd`. (2) Decaying 1-D params (biases, LayerNorm/BN weight & bias) shrinks Norm gains toward 0 — these params add little overfitting capacity, and shrinking them degrades training.

**Fix**: use `torch.optim.AdamW`, not `Adam(weight_decay=...)`. Split into two param groups with `weight_decay=0.0` on the no-decay group — nanoGPT's rule: decay `p.dim()>=2` (matmul/embedding weights), no-decay `p.dim()<2` (all biases + all LayerNorm weights); HF/timm exclude by name (`bias`, `LayerNorm.weight`). ([AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html), [Loshchilov & Hutter 2017 "Decoupled Weight Decay"](https://arxiv.org/abs/1711.05101), [nanoGPT configure_optimizers](https://github.com/karpathy/nanoGPT/blob/master/model.py))

### O7 — Loss crawls with no NaN → LR is too **LOW**; find the band with an LR range test

**Symptom**: no divergence, no NaN, grads finite — loss just falls glacially or plateaus high; throughput is fine but the model "won't learn." Often after copying an LR from a different-batch/optimizer recipe or defaulting to a tiny "safe" LR. (The mirror of P12's too-HIGH spike.)

**Root cause**: the LR sits far below the productive band, so each update is negligible relative to the scale of the loss landscape and optimization crawls. The usable band for adaptive optimizers is narrow and architecture-dependent, so a guessed LR is often 1–2 orders of magnitude too small. Distinguishable from vanishing grads — the grad-norm is healthy, just under-applied.

**Fix**: run an **LR range test** (Smith) — from a tiny LR, multiply it geometrically each batch over ~100–1000 steps, plot loss vs LR, pick ~1 decade below where loss starts to diverge. Tools: `pytorch-lr-finder` `LRFinder(model,opt,crit).range_test(loader,end_lr=1,num_iter=100)`, fast.ai `learn.lr_find()`, Lightning `Tuner(trainer).lr_find(model)`. Re-run whenever batch size / optimizer / architecture changes — the band moves; then confirm the LR survives warmup without the P12 spike. ([Smith 2015 "Cyclical Learning Rates"](https://arxiv.org/abs/1506.01186), [pytorch-lr-finder](https://github.com/davidtvs/pytorch-lr-finder), [Smith 2018 disciplined-approach](https://arxiv.org/abs/1803.09820))

### O8 — `lr_scheduler.step()` before `optimizer.step()` skips the first LR; per-step vs per-epoch cadence

**Symptom**: PyTorch warns *"Detected call of `lr_scheduler.step()` before `optimizer.step()`"* and the LR curve is off-by-one; OR a cosine/warmup schedule sized in optimizer steps barely moves (stepped per-epoch) or decays to ~0 far too early (a schedule sized in optimizer steps but stepped every micro-batch under accumulation).

**Root cause**: (1) since PyTorch 1.1 the scheduler must step **after** the optimizer — *"if you ... call scheduler.step() before the optimizer's update ... this will skip the first value of the learning rate schedule."* (2) A scheduler advances one tick per `.step()`; schedulers built around `total_steps`/`num_training_steps` in **optimizer** steps (OneCycleLR, HF `get_cosine_schedule_with_warmup`, Lightning `interval='step'`) must be stepped every optimizer step, and under accumulation an "optimizer step" ≠ a batch.

**Fix**: order it `optimizer.step(); scheduler.step()`. Step at the granularity its `total_steps` was computed in — per optimizer step for warmup/cosine/OneCycle (inside the `if (i+1)%accum==0` block, **not** every micro-batch), per epoch only for epoch schedulers. HF `Trainer` steps it automatically — don't also step it manually.
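
The ordering and cadence rule for O8 can be sketched like this, assuming a toy model and a simple linear-decay `LambdaLR` sized in optimizer steps (both are illustrative choices, not a recommendation from the source):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                        # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4
num_micro_batches = 12
total_optimizer_steps = num_micro_batches // accum_steps   # 3, not 12

# Linear decay whose length is counted in OPTIMIZER steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_optimizer_steps)
)

lrs_used = []
optimizer.zero_grad(set_to_none=True)
for i in range(num_micro_batches):
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                             # every micro-batch
    if (i + 1) % accum_steps == 0:              # accumulation boundary only
        lrs_used.append(optimizer.param_groups[0]["lr"])
        optimizer.step()                        # optimizer first ...
        scheduler.step()                        # ... then the scheduler
        optimizer.zero_grad(set_to_none=True)

print("LR applied at each optimizer step:", lrs_used)
print("LR after the schedule is exhausted:", scheduler.get_last_lr()[0])
```

Moving `scheduler.step()` outside the `if` block would tick the schedule twelve times instead of three, which is the "decays too early" symptom described above.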
([torch.optim — scheduler order](https://docs.pytorch.org/docs/stable/optim.html), [OneCycleLR](https://docs.pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html))

### O9 — Gradient accumulation gives effective N×LR → divide the loss by `accum_steps` (and normalize per token)

**Symptom**: switching from batch `B` to (micro-batch `B/N` × N accumulation) "at the same config" trains hotter/diverges — loss/grad magnitude ~N× too big, i.e. you silently get N× the LR. For token tasks the accumulated loss also differs from the un-accumulated run even after `/N` when micro-batches hold unequal #non-pad tokens.

**Root cause**: each micro-batch loss is `reduction='mean'`; `backward` **adds** grads across the N micro-batches, so the accumulated grad = SUM of N mean-grads = N× the full-batch mean grad → stepping on it ≈ N× LR. Subtler: dividing each mean-loss by N still mis-weights tokens when micro-batches have different valid-token counts (average-of-means ≠ total-loss / total-tokens) — HF found and fixed exactly this in `transformers` in 2024.

**Fix**: divide before backward — `loss = loss_fn(out,y) / accum_steps; loss.backward()`, with `step()`/`zero_grad()` only on the boundary. For token-level losses, normalize by the **total** non-pad tokens across the accumulation window (accumulate `reduction='sum'`, divide by total tokens), not the mean-of-means. Under DDP wrap non-boundary micro-steps in `with model.no_sync():` to skip the all-reduce (correctness-neutral, perf win). (DeepSpeed double-counts accum in some configs → D18; world-size×batch → D11.)
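
To make the mean-of-means versus per-token normalization concrete, here is a small sketch that compares accumulated gradients against a full-batch reference. The toy linear "LM head", the eight synthetic token positions, and the helper names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, accum_steps, PAD = 5, 2, -100           # -100 is cross_entropy's default ignore_index
model = nn.Linear(3, vocab)                    # hypothetical toy "LM head"

x = torch.randn(8, 3)                          # 8 token positions
y = torch.tensor([0, 1, 2, 3, 4, PAD, PAD, PAD])     # last 3 are padding
micro = [(x[:4], y[:4]), (x[4:], y[4:])]       # 4 valid tokens vs 1 valid token

def grad_of(run_backward):
    """Run one accumulation recipe and return the resulting weight gradient."""
    model.zero_grad(set_to_none=True)
    run_backward()
    return model.weight.grad.detach().clone()

def full_batch():
    # Reference: a single batch, mean over the 5 non-pad tokens.
    F.cross_entropy(model(x), y, ignore_index=PAD).backward()

def mean_of_means():
    # Per-micro-batch mean divided by accum_steps.
    for xb, yb in micro:
        (F.cross_entropy(model(xb), yb, ignore_index=PAD) / accum_steps).backward()

def token_normalized():
    # Sum over tokens, divided by the total non-pad count in the window.
    total_tokens = sum((yb != PAD).sum().item() for _, yb in micro)
    for xb, yb in micro:
        loss = F.cross_entropy(model(xb), yb, ignore_index=PAD, reduction="sum")
        (loss / total_tokens).backward()

g_ref = grad_of(full_batch)
g_mean = grad_of(mean_of_means)
g_tok = grad_of(token_normalized)
print("mean-of-means matches full batch:   ", torch.allclose(g_ref, g_mean, atol=1e-6))
print("token-normalized matches full batch:", torch.allclose(g_ref, g_tok, atol=1e-6))
```

When every micro-batch holds the same number of valid tokens the two recipes coincide, which is why the simpler `/ accum_steps` form is sufficient for fixed-size, unpadded batches.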
([HF "Fixing Gradient Accumulation"](https://huggingface.co/blog/gradient_accumulation), [DDP no_sync](https://docs.pytorch.org/docs/stable/notes/ddp.html))

### O10 — `AdamW(eps=1e-8)` underflows in bf16/fp16 → unbounded updates where `v` is tiny

**Symptom**: a run stable in fp32 develops update spikes/NaNs once optimizer math is half precision; or AdamW behaves as if `eps=0` (huge updates where the second moment `v` is small). Most visible with fp16 optimizer states or foreach/8-bit paths computing `sqrt(v)+eps` in reduced precision.

**Root cause**: the AdamW update is `θ -= lr·m̂/(sqrt(v̂)+eps)`. The default `eps=1e-8` is an fp32 value; in fp16 it lies below the smallest subnormal (~6e-8) and rounds to **0** — *"if you use 1e-8 as default and you use 16 bit, it will round to zero."* In bf16 `1e-8` is representable, so the effect is weaker: the 7-bit mantissa only absorbs it in `sqrt(v̂)+eps` once `sqrt(v̂)` is a few hundred times larger. With `eps≈0`, params whose `v̂≈0` get an unbounded step. (Separate from GradScaler, which protects activations/grads, not this denominator.)

**Fix**: raise eps for half-precision optimizer math — `eps=1e-7` (proposed in pytorch#26218 for fp16) up to `1e-6` for bf16; or keep optimizer states / master weights in **fp32** (FSDP `MixedPrecision`, DeepSpeed bf16 keep an fp32 master) so the default eps stays meaningful. Related: `betas=(0.9,0.999)` averages `v` over ~1000 steps — too slow for short fine-tunes; `0.95` is the common LLM-scale second-moment choice. ([pytorch#26218](https://github.com/pytorch/pytorch/issues/26218), [AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html))

### O11 — `fused=True` AdamW breaks under AMP/FSDP; `foreach` inflates peak memory

**Symptom**: `AdamW(fused=True)` raises (e.g. on `_foreach_sub_` of `device_found_inf`) or mis-steps under GradScaler / bf16-mixed / FSDP; **or** the default `foreach` path OOMs at the optimizer step on a model that fit during forward/backward.

**Root cause**: (1) fused AdamW does unscale + step + the inf/NaN check inside one CUDA kernel via `found_inf`; version-specific bugs (pytorch#140514, Lightning#21435) come from that plumbing / FSDP interaction — fused is still the experimental path. (2) `foreach` (the CUDA default when unset) horizontally fuses by allocating intermediates across **all** params at once, raising peak memory at the step vs the for-loop path.

**Fix**: on a fused error/suspicious step under AMP/FSDP/bf16-mixed, fall back to `fused=False` (lets `foreach` default) or upgrade past the fixed issue — confirm a parity loss-curve before trusting fused for a real datapoint. If the **step** OOMs, set `foreach=False` for the low-peak for-loop path (slower, less memory; see oom-memory). Pick deliberately: fused fastest-when-correct, foreach faster than for-loop but higher peak. ([pytorch#140514](https://github.com/pytorch/pytorch/issues/140514), [Lightning#21435](https://github.com/Lightning-AI/pytorch-lightning/issues/21435), [AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html))

---

## Loss-function footguns

### O12 — `softmax`/`log_softmax` before `nn.CrossEntropyLoss` → double-softmax → diluted gradient, slow/no learning

**Symptom**: a model with a softmax (or log_softmax) final layer trains far slower than expected, plateaus high, or barely learns; loss is sluggish but not NaN. Classic when porting a Keras/TF model (expects probabilities) to PyTorch, or after "adding softmax to get probabilities."

**Root cause**: `nn.CrossEntropyLoss` internally does `LogSoftmax + NLLLoss` and *"expects ... unnormalized logits."* Feeding already-softmaxed values applies softmax twice; `softmax(softmax(z))` flattens toward uniform, shrinking the logit dynamic range, so the CE gradient w.r.t. the pre-softmax activations becomes small and ill-conditioned. It still trains — just with a near-vanishing signal.
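
The flattening described in the O12 root cause can be checked numerically. The plain-Python sketch below uses made-up logits for a confident but wrong prediction and hand-written helpers (all hypothetical names) to compare the gradient reaching the pre-softmax activations with and without the extra softmax.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_grad_wrt_input(z, target):
    """Gradient of cross_entropy(softmax(z), target) w.r.t. z: softmax(z) - onehot."""
    p = softmax(z)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def double_softmax_grad(z, target):
    """Gradient w.r.t. z when the head already applied softmax before the CE loss."""
    p = softmax(z)                         # what the model head outputs
    g = ce_grad_wrt_input(p, target)       # dLoss/dp, because CE applies softmax again
    dot = sum(g_i * p_i for g_i, p_i in zip(g, p))
    # Backprop through the first softmax: dL/dz_i = p_i * (g_i - sum_j g_j * p_j)
    return [p_i * (g_i - dot) for p_i, g_i in zip(p, g)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

z = [4.0, 0.0, -4.0]       # confident but wrong: the true class is index 1
target = 1
once = softmax(z)
twice = softmax(once)
grad_raw = norm(ce_grad_wrt_input(z, target))
grad_double = norm(double_softmax_grad(z, target))

print("softmax once: ", [round(p, 3) for p in once])
print("softmax twice:", [round(p, 3) for p in twice])
print(f"grad norm, raw logits: {grad_raw:.4f}; with double softmax: {grad_double:.4f}")
```

The second softmax pulls the distribution toward uniform, and the gradient that reaches `z` shrinks sharply even though this prediction is badly wrong and should produce a strong correction.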
**Fix**: pass **raw logits** of shape `(N,C)` — remove any `nn.Softmax`/`F.log_softmax`/`nn.LogSoftmax` from the head. Apply softmax only at inference (for probabilities) or argmax (for the class). If you genuinely need log-probs in-graph, use `F.log_softmax` + `nn.NLLLoss` (O16) instead — never both. ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html))

### O13 — `sigmoid` + `nn.BCELoss` → `log(0)=-inf` → NaN; use `nn.BCEWithLogitsLoss` (+`pos_weight`)

**Symptom**: a binary / multi-label head shows NaN or inf loss (often once outputs saturate toward 0/1), or spiky loss; the model has an explicit `torch.sigmoid` before `nn.BCELoss`. Under imbalance it also collapses to always predicting the majority (negative) class.

**Root cause**: `nn.BCELoss` takes probabilities and computes `log(p)`/`log(1-p)` directly; when the preceding sigmoid saturates (`p`→0 or 1) `log(0)=-inf` and its gradient is inf/NaN, poisoning every param. (Current PyTorch clamps `nn.BCELoss`'s log outputs at -100, which bounds the loss value but does not give the two-op path the stabilized formulation.) Two separate ops can't use the stabilized formulation. Plain BCE also weights positives and negatives equally → rare-positive data drives the trivial all-negative solution.

**Fix**: feed **raw logits** to `nn.BCEWithLogitsLoss` — it fuses sigmoid+BCE with the log-sum-exp trick, avoiding `log(0)`. Remove the explicit sigmoid (apply only at inference). For imbalance pass `pos_weight = #neg/#pos` per class (`>1` raises recall, `<1` raises precision). Target must be **float**, same shape as the logits. ([BCEWithLogitsLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html), [numerical-stability thread](https://discuss.pytorch.org/t/numerical-stability-of-bcewithlogitsloss/8246)) (imbalance *strategy* → by-domain V6.)
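
A minimal sketch of the O13 fix, assuming a hypothetical six-example, two-label batch with rare positives and hand-picked logits that include saturated values:

```python
import torch
import torch.nn as nn

# Hypothetical multi-label batch: 6 examples, 2 labels, positives are rare.
# Targets are float and have the same shape as the logits.
targets = torch.tensor(
    [[1, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0]], dtype=torch.float32
)

# pos_weight = #neg / #pos per class (clamped so an all-negative class cannot divide by zero).
n_pos = targets.sum(dim=0)
n_neg = targets.shape[0] - n_pos
pos_weight = n_neg / n_pos.clamp(min=1)

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Raw logits straight from the head, including values large enough to
# saturate an explicit sigmoid.
logits = torch.tensor(
    [[-100.0, 100.0], [3.0, -2.0], [0.5, 0.0], [-1.0, -100.0], [2.0, 1.0], [0.0, -3.0]],
    requires_grad=True,
)
loss = loss_fn(logits, targets)
loss.backward()
print("pos_weight:", pos_weight.tolist())
print("loss finite:", torch.isfinite(loss).item(),
      "| grads finite:", torch.isfinite(logits.grad).all().item())

# Probabilities are only needed at inference time.
with torch.no_grad():
    probs = torch.sigmoid(logits)
```

The sigmoid appears only in the final inference-style block; the training path sees logits end to end.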
### O14 — `CrossEntropyLoss` target form: long `(N,)` indices in `[0,C)` vs float `(N,C)` soft; off-by-one → device-side assert

**Symptom**: any of — `RuntimeError: 0D or 1D target tensor expected, multi-target not supported` (one-hot target); `expected scalar type Long but found Float`; `IndexError: Target N is out of bounds` / CUDA `device-side assert ... t >= 0 && t < n_classes` (a label `== C`, or labels `1..C`, or arbitrary ids); or a plausible-but-non-converging loss.

**Root cause**: `nn.CrossEntropyLoss` has **two** target forms. **Class-index** form: target shape `(N,)` (one fewer dim than the `(N,C,...)` input), dtype `long`, every value in `[0,C)`. A `(N,C)` target is read as multiple targets ("multi-target"); a value `==C` (off-by-one from 1-indexed classes) or non-contiguous ids trips the bounds assert — on CUDA an **async** device-side assert that may surface at a later, unrelated line. **Class-probability** form (soft/smoothed/mixup): target must be float, same shape `(N,C,...)`, summing to 1. Mixing them is the error.

**Fix**: hard labels → `targets.long()` of shape `(N,)`; remap ids to contiguous `0..C-1` (`{orig:i for i,orig in enumerate(sorted(set(labels)))}`; subtract 1 if 1-indexed); `assert targets.min()>=0 and targets.max()<C` […]

[…] > fp16's 65504 → grads overflow → step skipped every step → weights frozen while loss still looks plausible. The config "fp16" tells you nothing; only the live scale reveals it.

**Fix**: add `scaler.get_scale()` and a skipped-step counter to the dashboard. Healthy: a high plateau (`2^13..2^16`) after early calibration. Bad: monotonic decay toward 1, or step-count not advancing with iteration count. Lever when it collapses: switch **fp16 → bf16** (no scaler; fp32 exponent range absorbs the large activations — highest leverage), or keep the overflow-prone block (final logits / attention) in fp32 via a nested `autocast(enabled=False)`, plus z-loss / qk-norm (P15/P16). Don't "fix" it by lowering `init_scale`.
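
One way to wire the scale and skipped-step counter into a loop is sketched below. The `make_scaler` and `amp_step` helpers and the `stats` dict are hypothetical; the skipped-step detection relies on the assumption that the scaler only lowers its scale after an overflow. On a machine without CUDA the scaler is created disabled, so the scale reads a constant 1.0 and nothing is ever skipped.

```python
import contextlib
import torch
import torch.nn as nn

def make_scaler(enabled):
    # Newer releases expose torch.amp.GradScaler; older ones only torch.cuda.amp.GradScaler.
    amp_scaler = getattr(getattr(torch, "amp", None), "GradScaler", None)
    if amp_scaler is not None:
        return amp_scaler("cuda", enabled=enabled)
    return torch.cuda.amp.GradScaler(enabled=enabled)

def amp_step(optimizer, scaler, loss, stats):
    """Scaled backward/step/update that also records the live scale and skipped steps."""
    scale_before = scaler.get_scale()
    scaler.scale(loss).backward()
    scaler.step(optimizer)          # skipped internally when grads contain inf/NaN
    scaler.update()
    scale_after = scaler.get_scale()
    if scale_after < scale_before:  # assumption: the scale shrinks only after an overflow
        stats["skipped"] += 1
    stats["scale"].append(scale_after)

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"
torch.manual_seed(0)
model = nn.Linear(4, 1).to(device)             # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = make_scaler(enabled=use_amp)
stats = {"skipped": 0, "scale": []}

for step in range(5):
    x = torch.randn(16, 4, device=device)
    y = torch.randn(16, 1, device=device)
    optimizer.zero_grad(set_to_none=True)
    ctx = torch.autocast("cuda", dtype=torch.float16) if use_amp else contextlib.nullcontext()
    with ctx:
        loss = nn.functional.mse_loss(model(x), y)
    amp_step(optimizer, scaler, loss, stats)
    print(step, "scale", stats["scale"][-1], "skipped so far", stats["skipped"])
```

In a real run you would send `stats["scale"][-1]` and `stats["skipped"]` to your logger every K steps instead of printing them.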
([torch.amp GradScaler](https://docs.pytorch.org/docs/stable/amp.html)) (mechanism → P5/P10.)

### O25 — Rising dead-ReLU / zero-activation fraction → a slice of the net is permanently off

**Symptom**: capacity quietly vanishes — a layer's outputs are increasingly all-zero, loss plateaus above where it should, and adding width doesn't help. No crash; it just under-fits. Worst case the net degenerates toward a constant function.

**Root cause**: a ReLU whose pre-activation is driven negative for ~all inputs outputs 0 and has **zero** local gradient there, so backprop sends no signal to its incoming weights — the unit is stuck off and unrecoverable. Triggered by too-high LR (a big update pushes weights/bias deep negative) or a large negative bias. Once a large fraction of a layer dies, gradients can't flow through it and that capacity is gone. The same shape (saturation → ~0 gradient → frozen region) applies to sigmoid/tanh tails.

**Fix**: instrument the zero/saturation fraction per activation with a forward hook — `(out==0).float().mean()` for ReLU (or `|out|>0.99` for tanh; for sigmoid count both tails, `out>0.99` or `out<0.01`), logged every K steps per layer. Healthy: a stable modest dead fraction (ReLU is sparse by design). Bad: a fraction climbing over training or a layer pinned near ~100% dead. Levers, in order: (1) lower LR (the primary cause); (2) ReLU → LeakyReLU / GELU / SiLU so the negative region keeps a gradient; (3) fix init / large negative biases. ([CS231n "Neural Networks 1" — dying ReLU](https://cs231n.github.io/neural-networks-1/)) (the *output* being constant is owned by verifying-dl-experiments; this is the internal mechanism.)
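
The forward-hook instrumentation for ReLU layers might look like the sketch below. The `attach_dead_relu_monitors` helper and the toy `nn.Sequential` are hypothetical, and a layer is deliberately "killed" with a large negative bias to show what the bad reading looks like.

```python
import torch
import torch.nn as nn

def attach_dead_relu_monitors(model, stats):
    """Register forward hooks that record the zero-output fraction of every nn.ReLU."""
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            def hook(mod, inputs, out, name=name):
                stats[name] = (out == 0).float().mean().item()
            handles.append(module.register_forward_hook(hook))
    return handles   # call .remove() on each handle to detach the hooks

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 2),
)
stats = {}
handles = attach_dead_relu_monitors(model, stats)

model(torch.randn(64, 8))
healthy = dict(stats)          # the ReLUs are named "1" and "3" in this Sequential

# Simulate a layer knocked dead by a large negative bias.
with torch.no_grad():
    model[2].bias.fill_(-100.0)
model(torch.randn(64, 8))
killed = dict(stats)

print("healthy:", healthy)
print("after forcing a large negative bias:", killed)
for h in handles:
    h.remove()
```

In training you would read `stats` every K steps and log one scalar per layer, watching for a fraction that climbs over time rather than for any single absolute value.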
### O26 — No weight/grad/activation histograms → scalar norms hide bimodal/saturating/collapsing distributions

**Symptom**: scalar dashboards (loss, one grad-norm) look fine yet the model under-performs or destabilizes — a mean/norm hides the shape: activations drifting to a saturated tail, weights collapsing to a spike at 0 (a layer dying, O25), or a gradient distribution growing fat outlier tails all read as an unremarkable scalar.

**Root cause**: norms and means are lossy summaries — a healthy spread and a bimodal/all-saturated/all-zero distribution can share the same L2 norm. The diagnostic signal is the **change in shape over training**, which a scalar can't show.

**Fix**: periodically (every few hundred steps — histograms aren't free) log `SummaryWriter.add_histogram(tag, values, global_step)` for each layer's **weights**, its **gradients** (after `backward`, before `zero_grad`), and key **activations** (forward hook). Read the time-evolution: weights collapsing to a spike = a layer dying; gradient histograms collapsing to ~0 = vanishing (lever: residual/norm/init, P17); fat tails = clip + lower LR (P13/P12); activations wandering into a saturating tail = init/normalization fix (P17). Pair with the scalars above. ([SummaryWriter.add_histogram](https://docs.pytorch.org/docs/stable/tensorboard.html), [Karpathy recipe — visualize weights/activations](https://karpathy.github.io/2019/04/25/recipe/))

---

## Pointers — adjacent mechanics catalogued elsewhere

- **NaN / loss-spike / LR-too-HIGH / grad explosion / z-loss / qk-norm / init & norm placement / determinism** → `references/training/precision-stability.md` (P8–P19). This file is the LR-too-LOW / won't-move side; that one is the blows-up side.
- **OOM from the optimizer step / activation checkpointing / LoRA-QLoRA memory** → `references/training/oom-memory.md` (M5, M12–M13).
- **N-GPU effective batch × LR, DeepSpeed accum double-count, find_unused_parameters** → `references/training/distributed-launch.md` (D11, D18, D8).
- **Dataloader correctness (worker RNG, collate, labels, shuffle) that mimics "won't learn"** → `references/training/data-pipeline.md`.
- **Is the converged number REAL** (collapse, leakage, train-vs-val, metric validity, seed discipline) → **verifying-dl-experiments** (**REQUIRED** — every "is the result real" fork above).