# Convergence & optimization debugging — it runs, doesn't crash, but won't learn (or learns badly)
The other training layers cover the run that **crashes** (`oom-memory.md`), **NaNs**
(`precision-stability.md`), **hangs** (`distributed-launch.md`), or is **slow** (`throughput-profiling.md`).
This file owns the quieter, far more common failure: the job runs cleanly to the end but the **loss is
flat, falls too slowly, or the model underfits** — and the bug is in the optimization wiring, not the
hardware. Each entry is **Symptom → Root cause → Fix** with the exact knob. **Always start with O1
(overfit one batch)** — it separates "the loop is broken" from "the model/data is weak" in five minutes
and tells you which half of this file you need.
Boundary: **verifying-dl-experiments** (**REQUIRED** at every "is the result real" fork) owns collapse,
leakage, metric validity, train-vs-val generalization, and seed interpretation; this file owns the
*mechanism* of why a correct-looking loop doesn't converge. NaN / loss-spike / LR-too-**HIGH** live next
door in `precision-stability.md` (P8-P18) — this file is the LR-too-**LOW** / won't-move / mis-wired side.
To jump: `grep -in '<keyword>' references/training/convergence-debugging.md` (e.g. `overfit`, `requires_grad`,
`no_grad`, `optimizer`, `weight decay`, `adamw`, `lr finder`, `scheduler`, `accum`, `cross entropy`,
`bcewithlogits`, `nllloss`, `freeze`, `batchnorm`, `discriminative`, `lora`, `update ratio`, `dead relu`).
## Table of contents
- **It isn't learning at all (start here)** — O1 overfit-one-batch · O2 params-not-in-optimizer · O3 loss-detached-from-graph · O4 zero_grad/backward/step-order · O5 train()/eval()-mode
- **Optimizer / LR / weight-decay / schedule** — O6 AdamW-vs-Adam+no-decay-group · O7 LR-too-LOW+finder · O8 scheduler-order/cadence · O9 grad-accum-divide · O10 AdamW-eps-in-bf16 · O11 fused/foreach
- **Loss-function footguns** — O12 double-softmax · O13 BCEWithLogits · O14 CE-target-form · O15 padded-loss-reduction · O16 NLLLoss-needs-log_softmax
- **Fine-tuning / transfer** — O17 frozen-but-still-in-optimizer · O18 frozen-BN-running-stats · O19 discriminative-LR/forgetting · O20 strict=False-shape-mismatch · O21 LoRA/PEFT-wiring
- **Training-dynamics dashboard (instrument it)** — O22 update:weight-ratio · O23 actual-LR · O24 GradScaler-scale · O25 dead-ReLU-fraction · O26 weight/grad/act-histograms
- **Pointers** — precision-stability.md, distributed-launch.md, verifying-dl-experiments (skill)
---
## It isn't learning at all — the first-hour triage
### O1 — Run the overfit-one-batch smoke BEFORE tuning anything (the canonical correctness test)
**Symptom**: training "runs" (no error, normal throughput) but loss plateaus near its init value or wanders without trending down, across LRs/optimizers/architectures. You're tuning hyperparameters blind because nothing proves the loop can learn at all.
**Root cause**: the loop is broken somewhere between forward and weight-update (any of O2-O5, or a label/shape bug) and no single test isolates "can this code memorize?" from "is this a modeling/data problem?".
**Fix**: take ONE fixed mini-batch (2 examples is enough) and loop forward/backward/step on **that same batch** for hundreds of iters — a correct loop drives train loss → ~0. Turn **off** augmentation, shuffling, dropout, and weight decay for the test. Also "verify the loss at init" (e.g. softmax CE should start near `-log(1/n_classes)` then fall). If it will not reach ~0, *"there is a bug somewhere and we cannot continue"* — debug the loop (O2-O5) before touching hyperparameters. (Smoke *content/interpretation* → **verifying-dl-experiments**; this is the mechanical gate.) ([Karpathy, "A Recipe for Training Neural Networks"](https://karpathy.github.io/2019/04/25/recipe/))
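
A minimal sketch of the smoke test described above, assuming a toy MLP and a random two-example batch as hypothetical stand-ins for the real model and data. The learning rate and iteration count are illustrative, not recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny MLP (no dropout) and ONE fixed batch of 2 examples.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(2, 8)
y = torch.tensor([1, 3])

loss_fn = nn.CrossEntropyLoss()
# weight_decay=0.0 for the smoke test; no augmentation or shuffling is involved.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=0.0)

model.train()
init_loss = loss_fn(model(x), y).item()  # should sit in the neighbourhood of log(4)

for _ in range(300):
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(x), y)  # same batch every iteration
    loss.backward()
    optimizer.step()

final_loss = loss_fn(model(x), y).item()
print(f"init={init_loss:.3f} final={final_loss:.5f}")
```

If `final_loss` does not head toward zero on the real model, stop tuning and work through O2-O5.
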
### O2 — Loss flat from step 0, weights byte-identical after `step()` → params aren't in the optimizer
**Symptom**: overfit-one-batch fails; a snapshotted param is unchanged before/after `optimizer.step()`; grad-norm may even be nonzero. No error.
**Root cause**: the optimizer updates a **different** set of tensors than the model forwards through. Four causes: (a) the params have `requires_grad=False` so `.grad` stays `None` and `step()` skips them; (b) a submodule/head was never passed into the optimizer's param iterable; (c) the optimizer was built from `model.parameters()` **before** `model.to(device)` on a backend where the move creates new parameter objects (the linked XLA issue is one such case), so it holds stale CPU tensors while the model forwards the device copies; (d) freeze/unfreeze toggled `requires_grad` but left the wrong set in the optimizer.
**Fix**: build the optimizer **after** `model.to(device)`. Assert it sees every trainable param: `opt_ids={id(p) for g in optimizer.param_groups for p in g['params']}; assert all(id(p) in opt_ids for p in model.parameters() if p.requires_grad)`. Log `sum(p.requires_grad for p in model.parameters())` at startup. Probe: `w0=next(model.parameters()).clone(); <one step>; assert not torch.equal(w0, next(model.parameters()))`. ([autograd notes](https://docs.pytorch.org/docs/stable/notes/autograd.html), [torch.optim](https://docs.pytorch.org/docs/stable/optim.html), [stale-optimizer-after-.to bug](https://github.com/pytorch/xla/issues/1623))
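
The two checks above can be wrapped into a small helper. This is a sketch: `assert_optimizer_covers_model` is a hypothetical name, and the two-layer model exists only to trigger the bug on purpose.

```python
import torch
import torch.nn as nn

def assert_optimizer_covers_model(model, optimizer):
    """Raise AssertionError if any trainable parameter is missing from the optimizer."""
    opt_ids = {id(p) for g in optimizer.param_groups for p in g["params"]}
    missing = [n for n, p in model.named_parameters()
               if p.requires_grad and id(p) not in opt_ids]
    assert not missing, f"trainable params not in optimizer: {missing}"

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Deliberate bug: only the first layer is handed to the optimizer.
buggy = torch.optim.SGD(model[0].parameters(), lr=0.1)
try:
    assert_optimizer_covers_model(model, buggy)
    caught = False
except AssertionError as err:
    caught = True
    print(err)

# Correct wiring passes the check, and one step actually moves the weights.
good = torch.optim.SGD(model.parameters(), lr=0.1)
assert_optimizer_covers_model(model, good)

w0 = model[1].weight.detach().clone()
model(torch.randn(3, 4)).pow(2).mean().backward()
good.step()
moved = not torch.equal(w0, model[1].weight)
```
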
### O3 — `backward()` is a no-op / raises "does not require grad" → loss detached from the graph
**Symptom**: overfit fails with every `p.grad is None`; or `loss.backward()` raises *"element 0 of tensors does not require grad and does not have a grad_fn"*.
**Root cause**: the loss tensor was severed from autograd before `backward`. Common severings: (a) the train forward+loss ran inside `with torch.no_grad():` / `@torch.inference_mode()` left over from eval — *"computations in no-grad mode are never recorded in the backward graph"*; (b) `.item()` / `.detach()` / `.cpu().numpy()` / `float(loss)` on the loss path (e.g. back-propping an accumulated `total_loss += loss.item()`); (c) a tensor rebuilt from numpy mid-network; (d) the metric, not the differentiable loss, was passed to `backward()`.
**Fix**: before `backward`, `assert loss.requires_grad and loss.grad_fn is not None`. Keep the differentiable loss tensor distinct from logging scalars (log `loss.item()`, back-prop the raw tensor). Reserve `no_grad`/`inference_mode` for eval only. After `backward`, assert at least one `p.grad is not None`. ([autograd notes](https://docs.pytorch.org/docs/stable/notes/autograd.html))
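
A short illustration of the severings listed above, using a hypothetical one-weight regression. `is_backpropable` is an illustrative helper name, not a PyTorch API.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)
w = torch.randn(3, 1, requires_grad=True)

def is_backpropable(t):
    """True when the tensor is still attached to the autograd graph."""
    return t.requires_grad and t.grad_fn is not None

loss = (x @ w).pow(2).mean()      # differentiable: this is what backward() needs
logged = loss.item()              # plain float: for logging only
rebuilt = torch.tensor(logged)    # a tensor rebuilt from a float carries no graph

with torch.no_grad():             # eval-style context left on by mistake
    no_grad_loss = (x @ w).pow(2).mean()

flags = {
    "loss": is_backpropable(loss),
    "rebuilt": is_backpropable(rebuilt),
    "no_grad_loss": is_backpropable(no_grad_loss),
}
print(flags)

loss.backward()                   # only the attached tensor can populate w.grad
```
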
### O4 — Wrong `zero_grad` / `backward` / `step` order, or a missing `step()`
**Symptom**: overfit fails; weights never move, or training is erratic despite nonzero grads.
**Root cause**: PyTorch's contract is *"gradients by default add up; to prevent double-counting we explicitly zero them each iteration"*, `backward` deposits into `.grad`, `step` reads `.grad`. Failure modes: (a) `optimizer.step()` omitted → grads computed, weights never updated; (b) `zero_grad()` placed **after** `backward()` → wipes the fresh grads; (c) `step()` **before** `backward()` → steps on stale/zero grads; (d) `zero_grad` never called → grads from all iters keep summing → effective LR explodes.
**Fix**: the canonical order, exactly — `optimizer.zero_grad(set_to_none=True)` → forward → `loss=loss_fn(out,y)` → `loss.backward()` → `optimizer.step()` (under AMP: `scaler.scale(loss).backward()` → `scaler.step(optimizer)` → `scaler.update()`). Gradient accumulation is the one exception (O9): `backward` every micro-step, `step`+`zero_grad` only on the boundary. ([optimization tutorial](https://docs.pytorch.org/tutorials/beginner/basics/optimization_tutorial.html), [torch.optim](https://docs.pytorch.org/docs/stable/optim.html))
### O5 — Forgot `model.train()` / left `model.eval()` on → Dropout & BatchNorm in the wrong mode
**Symptom**: two faces — (1) trained under `eval()`: BN uses frozen running stats and never updates them, Dropout is off → underfits / loss barely moves; (2) evaluated under `train()`: BN uses noisy per-batch stats and Dropout fires → val loss flickers run-to-run and looks worse than train.
**Root cause**: `train()`/`eval()` set a per-module flag that *"has an effect only on certain modules ... e.g. Dropout, BatchNorm"* (`eval()` == `train(False)`). In eval mode BN switches to stored `running_mean/var` and **stops** updating them; Dropout becomes identity. A fresh `nn.Module` defaults to `train()`, but any prior `.eval()` (a reused object, an inference helper, a val loop that didn't switch back) persists.
**Fix**: bracket phases explicitly — `model.train()` atop each train epoch; `model.eval()` + `with torch.no_grad():` for every val/test pass; `model.train()` again before resuming. After build/load, `assert model.training` before the train loop. (Frozen-backbone BN is a *different* axis → O18; tiny-batch BN → by-domain V7.) ([nn.Module.train/eval](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html))
---
## Optimizer / learning-rate / weight-decay / schedule
### O6 — Weight decay "does nothing" / Norm gains destabilize → coupled `Adam(weight_decay=)` + decaying bias & Norm
**Symptom**: `weight_decay` on `torch.optim.Adam` barely regularizes (or hurts) vs the literature's AdamW recipe; or a from-scratch transformer/CNN trains worse than a reference at the "same" wd; or small models destabilize when LayerNorm/BN gains and biases get shrunk toward 0.
**Root cause**: (1) `Adam`'s `weight_decay` is classic **L2** — added into the gradient, so it passes through Adam's `1/(sqrt(v)+eps)` preconditioner and params with large historical grads get **less** decay; the effective strength no longer tracks `wd`. **AdamW** applies decoupled decay directly to the weight (`θ ← θ − lr·wd·θ`), outside the moment path — uniform across params and independent of gradient history (in PyTorch it is still scaled by `lr`, as the formula shows). They are **not** interchangeable at the same `wd`. (2) Decaying 1-D params (biases, LayerNorm/BN weight & bias) shrinks Norm gains toward 0 — they have no overfitting capacity and shrinking them degrades training.
**Fix**: use `torch.optim.AdamW`, not `Adam(weight_decay=...)`. Split into two param groups with `weight_decay=0.0` on the no-decay group — nanoGPT's rule: decay `p.dim()>=2` (matmul/embedding weights), no-decay `p.dim()<2` (all biases + all LayerNorm weights); HF/timm exclude by name (`bias`, `LayerNorm.weight`). ([AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html), [Loshchilov & Hutter 2017 "Decoupled Weight Decay"](https://arxiv.org/abs/1711.05101), [nanoGPT configure_optimizers](https://github.com/karpathy/nanoGPT/blob/master/model.py))
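
A sketch of the dimension-based split in the style described above. `build_adamw` is a hypothetical helper, and the `lr` and `weight_decay` defaults are placeholders rather than recommended values.

```python
import torch
import torch.nn as nn

def build_adamw(model, lr=3e-4, weight_decay=0.1):
    """AdamW with two groups: decay tensors with dim >= 2, no decay for dim < 2."""
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]      # matmul / embedding weights
    no_decay = [p for p in params if p.dim() < 2]    # biases and Norm gains/biases
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr)

model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16), nn.Linear(16, 4))
optimizer = build_adamw(model)

n_decay = len(optimizer.param_groups[0]["params"])      # the two Linear weights
n_no_decay = len(optimizer.param_groups[1]["params"])   # two biases + LayerNorm weight and bias
print(n_decay, n_no_decay)
```
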
### O7 — Loss crawls with no NaN → LR is too **LOW**; find the band with an LR range test
**Symptom**: no divergence, no NaN, grads finite — loss just falls glacially or plateaus high; throughput is fine but the model "won't learn." Often after copying an LR from a different-batch/optimizer recipe or defaulting to a tiny "safe" LR. (The mirror of P12's too-HIGH spike.)
**Root cause**: the LR sits far below the productive band, so each update is negligible relative to the scale of the loss landscape and optimization crawls. The usable band for adaptive optimizers is narrow and architecture-dependent, so a guessed LR is often 1-2 orders of magnitude too small. Distinguishable from vanishing grads — the grad-norm is healthy, just under-applied.
**Fix**: run an **LR range test** (Smith) — from a tiny LR, multiply it geometrically each batch over ~100-1000 steps, plot loss vs LR, pick ~1 decade below where loss starts to diverge. Tools: `pytorch-lr-finder` `LRFinder(model,opt,crit).range_test(loader,end_lr=1,num_iter=100)`, fast.ai `learn.lr_find()`, Lightning `Tuner(trainer).lr_find()`. Re-run whenever batch size / optimizer / architecture changes — the band moves; then confirm the LR survives warmup without the P12 spike. ([Smith 2015 "Cyclical Learning Rates"](https://arxiv.org/abs/1506.01186), [pytorch-lr-finder](https://github.com/davidtvs/pytorch-lr-finder), [Smith 2018 disciplined-approach](https://arxiv.org/abs/1803.09820))
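
For intuition, here is a hand-rolled version of the geometric ramp. It is not the API of any of the tools named above. `lr_range_test` is a hypothetical helper, it uses plain SGD for simplicity, and the random data only exercises the function and says nothing about a real LR band.

```python
import copy
import math
import torch
import torch.nn as nn

def lr_range_test(model, loss_fn, batches, start_lr=1e-6, end_lr=1.0):
    """Ramp the LR geometrically over `batches`; return (lrs, losses).

    Trains a deep copy, so the caller's model is left untouched.
    """
    model = copy.deepcopy(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=start_lr)
    gamma = (end_lr / start_lr) ** (1.0 / max(len(batches) - 1, 1))
    lrs, losses, lr = [], [], start_lr
    for x, y in batches:
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        if not math.isfinite(losses[-1]):   # stop once the loss has blown up
            break
        lr *= gamma
    return lrs, losses

torch.manual_seed(0)
model = nn.Linear(8, 4)
batches = [(torch.randn(16, 8), torch.randint(0, 4, (16,))) for _ in range(50)]
lrs, losses = lr_range_test(model, nn.CrossEntropyLoss(), batches)
```

Plot `losses` against `lrs` on a log x-axis and read off the point where the curve turns upward.
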
### O8 — `lr_scheduler.step()` before `optimizer.step()` skips the first LR; per-step vs per-epoch cadence
**Symptom**: PyTorch warns *"Detected call of `lr_scheduler.step()` before `optimizer.step()`"* and the LR curve is off-by-one; OR a cosine/warmup schedule sized in optimizer steps barely moves (stepped per-epoch) or runs through its whole decay N× too early (stepped on every micro-batch under N-step accumulation).
**Root cause**: (1) since PyTorch 1.1 the scheduler must step **after** the optimizer — *"if you ... call scheduler.step() before the optimizer's update ... this will skip the first value of the learning rate schedule."* (2) A scheduler advances one tick per `.step()`; schedulers built around `total_steps`/`num_training_steps` in **optimizer** steps (OneCycleLR, HF `get_cosine_schedule_with_warmup`, Lightning `interval='step'`) must be stepped every optimizer step, and under accumulation an "optimizer step" ≠ a batch.
**Fix**: order it `optimizer.step(); scheduler.step()`. Step at the granularity its `total_steps` was computed in — per optimizer step for warmup/cosine/OneCycle (inside the `if (i+1)%accum==0` block, **not** every micro-batch), per epoch only for epoch schedulers. HF `Trainer` steps it automatically — don't also step it manually. ([torch.optim — scheduler order](https://docs.pytorch.org/docs/stable/optim.html), [OneCycleLR](https://docs.pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html))
### O9 — Gradient accumulation gives effective N×LR → divide the loss by `accum_steps` (and normalize per token)
**Symptom**: switching from batch `B` to (micro-batch `B/N` × N accumulation) "at the same config" trains hotter/diverges — loss/grad magnitude ~N× too big, i.e. you silently get N× the LR. For token tasks the accumulated loss also differs from the un-accumulated run even after `/N` when micro-batches hold unequal #non-pad tokens.
**Root cause**: each micro-batch loss is `reduction='mean'`; `backward` **adds** grads across the N micro-batches, so the accumulated grad = SUM of N mean-grads = N× the full-batch mean grad → stepping on it ≈ N× LR. Subtler: dividing each mean-loss by N still mis-weights tokens when micro-batches have different valid-token counts (average-of-means ≠ total-loss / total-tokens) — HF found and fixed exactly this in `transformers` in 2024.
**Fix**: divide before backward — `loss = loss_fn(out,y) / accum_steps; loss.backward()`, with `step()`/`zero_grad()` only on the boundary. For token-level losses, normalize by the **total** non-pad tokens across the accumulation window (accumulate `reduction='sum'`, divide by total tokens), not the mean-of-means. Under DDP wrap non-boundary micro-steps in `with model.no_sync():` to skip the all-reduce (correctness-neutral, perf win). (DeepSpeed double-counts accum in some configs → D18; world-size×batch → D11.) ([HF "Fixing Gradient Accumulation"](https://huggingface.co/blog/gradient_accumulation), [DDP no_sync](https://docs.pytorch.org/docs/stable/notes/ddp.html))
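
A sketch that checks the `/ accum_steps` rule numerically for equal-sized micro-batches and shows the boundary-only `step()` / `scheduler.step()` ordering from O8. The model, data, and the halving `LambdaLR` schedule are hypothetical. With unequal valid-token counts the per-token normalization above still applies.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(6, 3)
loss_fn = nn.CrossEntropyLoss()            # reduction='mean'
x, y = torch.randn(8, 6), torch.randint(0, 3, (8,))

# Reference: gradient of the mean loss over the full batch of 8.
model.zero_grad(set_to_none=True)
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 equal micro-batches of 2, each loss divided by accum_steps.
accum_steps = 4
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 0.5 ** step)

optimizer.zero_grad(set_to_none=True)
micro_batches = zip(x.chunk(accum_steps), y.chunk(accum_steps))
for i, (xb, yb) in enumerate(micro_batches):
    (loss_fn(model(xb), yb) / accum_steps).backward()   # grads add up across micro-steps
    if (i + 1) % accum_steps == 0:                       # boundary only
        accum_grad = model.weight.grad.clone()
        optimizer.step()
        scheduler.step()                                 # one tick per optimizer step
        optimizer.zero_grad(set_to_none=True)

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
```
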
### O10 — `AdamW(eps=1e-8)` underflows in bf16/fp16 → unbounded updates where `v` is tiny
**Symptom**: a run stable in fp32 develops update spikes/NaNs once optimizer math is half precision; or AdamW behaves as if `eps=0` (huge updates where the second moment `v` is small). Most visible with fp16 optimizer states or foreach/8-bit paths computing `sqrt(v)+eps` in reduced precision.
**Root cause**: the AdamW update is `θ -= lr·m̂/(sqrt(v̂)+eps)`. The default `eps=1e-8` is an fp32 value. In fp16, whose smallest subnormal is ~`6e-8`, `1e-8` rounds to **0**: *"if you use 1e-8 as default and you use 16 bit, it will round to zero."* bf16 keeps fp32's exponent range, so `1e-8` itself is representable, but its 7-bit mantissa absorbs it whenever it is added to a much larger `sqrt(v̂)` (the lesser problem). With `eps≈0`, params whose `v̂≈0` get an unbounded step. (Separate from GradScaler, which protects activations/grads, not this denominator.)
**Fix**: raise eps for half-precision optimizer math — `eps=1e-7` (proposed in pytorch#26218 for fp16) up to `1e-6` for bf16; or keep optimizer states / master weights in **fp32** (FSDP `MixedPrecision`, DeepSpeed bf16 keep an fp32 master) so the default eps stays meaningful. Related: `betas=(0.9,0.999)` averages `v` over ~1000 steps — too slow for short fine-tunes; `0.95` is the common LLM-scale second-moment choice. ([pytorch#26218](https://github.com/pytorch/pytorch/issues/26218), [AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html))
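
The rounding behaviour is easy to confirm directly. This sketch only inspects how the constants are stored in each dtype. It does not run an optimizer, and the `sqrt_v` value is an arbitrary illustration.

```python
import torch

# fp16: 1e-8 is below half the smallest subnormal (~6e-8), so it is stored as 0.
eps_fp16 = torch.tensor(1e-8, dtype=torch.float16)
# A raised eps survives the cast to fp16.
raised_fp16 = torch.tensor(1e-7, dtype=torch.float16)

# bf16: same exponent range as fp32, so 1e-8 is representable on its own...
eps_bf16 = torch.tensor(1e-8, dtype=torch.bfloat16)
# ...but it is absorbed when added to a much larger value (arbitrary example).
sqrt_v = torch.tensor(1e-3, dtype=torch.bfloat16)
absorbed = bool((sqrt_v + eps_bf16) == sqrt_v)

print(eps_fp16.item(), raised_fp16.item(), eps_bf16.item(), absorbed)
```
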
### O11 — `fused=True` AdamW breaks under AMP/FSDP; `foreach` inflates peak memory
**Symptom**: `AdamW(fused=True)` raises (e.g. on `_foreach_sub_` of `device_found_inf`) or mis-steps under GradScaler / bf16-mixed / FSDP; **or** the default `foreach` path OOMs at the optimizer step on a model that fit during forward/backward.
**Root cause**: (1) fused AdamW does unscale + step + the inf/NaN check inside one CUDA kernel via `found_inf`; version-specific bugs (pytorch#140514, Lightning#21435) come from that plumbing / FSDP interaction — fused is still the experimental path. (2) `foreach` (the CUDA default when unset) horizontally fuses by allocating intermediates across **all** params at once, raising peak memory at the step vs the for-loop path.
**Fix**: on a fused error/suspicious step under AMP/FSDP/bf16-mixed, fall back to `fused=False` (lets `foreach` default) or upgrade past the fixed issue — confirm a parity loss-curve before trusting fused for a real datapoint. If the **step** OOMs, set `foreach=False` for the low-peak for-loop path (slower, less memory; see oom-memory). Pick deliberately: fused fastest-when-correct, foreach faster than for-loop but higher peak. ([pytorch#140514](https://github.com/pytorch/pytorch/issues/140514), [Lightning#21435](https://github.com/Lightning-AI/pytorch-lightning/issues/21435), [AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html))
---
## Loss-function footguns
### O12 — `softmax`/`log_softmax` before `nn.CrossEntropyLoss` → double-softmax → diluted gradient, slow/no learning
**Symptom**: a model with a softmax (or log_softmax) final layer trains far slower than expected, plateaus high, or barely learns; loss is sluggish but not NaN. Classic when porting a Keras/TF model (expects probabilities) to PyTorch, or after "adding softmax to get probabilities."
**Root cause**: `nn.CrossEntropyLoss` internally does `LogSoftmax + NLLLoss` and *"expects ... unnormalized logits."* Feeding already-softmaxed values applies softmax twice; `softmax(softmax(z))` flattens toward uniform, shrinking the logit dynamic range, so the CE gradient w.r.t. the pre-softmax activations becomes small and ill-conditioned. It still trains — just with a near-vanishing signal.
**Fix**: pass **raw logits** of shape `(N,C)` — remove any `nn.Softmax`/`F.log_softmax`/`nn.LogSoftmax` from the head. Apply softmax only at inference (for probabilities) or argmax (for the class). If you genuinely need log-probs in-graph, use `F.log_softmax` + `nn.NLLLoss` (O16) instead — never both. ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html))
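
To see the dilution, compare the gradient that reaches the logits with and without the extra softmax. The logits and target below are made up for illustration. The exact ratio depends on them, but the double-softmax gradient comes out far smaller.

```python
import torch
import torch.nn.functional as F

# Hypothetical confident-but-wrong prediction: the target class has the lowest logit.
logits = torch.tensor([[4.0, 0.0, -2.0]], requires_grad=True)
target = torch.tensor([2])

# Correct: raw logits straight into cross-entropy.
loss_ok = F.cross_entropy(logits, target)
(grad_ok,) = torch.autograd.grad(loss_ok, logits)

# Bug: softmax applied first, so cross-entropy softmaxes a second time.
loss_double = F.cross_entropy(F.softmax(logits, dim=1), target)
(grad_double,) = torch.autograd.grad(loss_double, logits)

ratio = (grad_double.norm() / grad_ok.norm()).item()
print(f"loss ok={loss_ok.item():.3f} double={loss_double.item():.3f} grad ratio={ratio:.4f}")
```
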
### O13 — `sigmoid` + `nn.BCELoss` → `log(0)=-inf` → NaN; use `nn.BCEWithLogitsLoss` (+`pos_weight`)
**Symptom**: a binary / multi-label head shows NaN or inf loss (often once outputs saturate toward 0/1), or spiky loss; the model has an explicit `torch.sigmoid` before `nn.BCELoss`. Under imbalance it also collapses to always predicting the majority (negative) class.
**Root cause**: `nn.BCELoss` takes probabilities and computes `log(p)`/`log(1-p)` directly; when the preceding sigmoid saturates (`p`→0 or 1) `log(0)=-inf` and its gradient is inf/NaN, poisoning every param. Two separate ops can't use the stabilized formulation; current PyTorch clamps `BCELoss`'s log outputs at `-100`, which bounds the loss value but is a workaround, not the fused computation. Plain BCE also weights positives and negatives equally → rare-positive data drives the trivial all-negative solution.
**Fix**: feed **raw logits** to `nn.BCEWithLogitsLoss` — it fuses sigmoid+BCE with the log-sum-exp trick, avoiding `log(0)`. Remove the explicit sigmoid (apply only at inference). For imbalance pass `pos_weight = #neg/#pos` per class (`>1` raises recall, `<1` raises precision). Target must be **float**, same shape as the logits. ([BCEWithLogitsLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html), [numerical-stability thread](https://discuss.pytorch.org/t/numerical-stability-of-bcewithlogitsloss/8246)) (imbalance *strategy* → by-domain V6.)
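
A sketch of the logits-in wiring with `pos_weight`. The class counts (`n_neg`, `n_pos`) and the extreme logits are hypothetical and chosen to show that the loss stays finite even where a sigmoid would fully saturate.

```python
import torch
import torch.nn as nn

# Raw head outputs, including values where sigmoid saturates to exactly 0 or 1 in fp32.
logits = torch.tensor([[-200.0], [200.0], [0.5]])
targets = torch.tensor([[1.0], [0.0], [1.0]])     # float, same shape as logits

# Hypothetical imbalance: 900 negatives, 100 positives -> pos_weight = 9.
n_neg, n_pos = 900, 100
pos_weight = torch.tensor([n_neg / n_pos])

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = loss_fn(logits, targets)

# Sigmoid is applied only when probabilities are needed, e.g. at inference.
probs = torch.sigmoid(logits)
print(loss.item(), probs.flatten().tolist())
```
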
### O14 — `CrossEntropyLoss` target form: long `(N,)` indices in `[0,C)` vs float `(N,C)` soft; off-by-one → device-side assert
**Symptom**: any of — `RuntimeError: 0D or 1D target tensor expected, multi-target not supported` (one-hot target); `expected scalar type Long but found Float`; `IndexError: Target N is out of bounds` / CUDA `device-side assert ... t >= 0 && t < n_classes` (a label `== C`, or labels `1..C`, or arbitrary ids); or a plausible-but-non-converging loss.
**Root cause**: `nn.CrossEntropyLoss` has **two** target forms. **Class-index** form: target shape `(N,)` (one fewer dim than the `(N,C,...)` input), dtype `long`, every value in `[0,C)`. A `(N,C)` target is read as multiple targets ("multi-target"); a value `==C` (off-by-one from 1-indexed classes) or non-contiguous ids trips the bounds assert — on CUDA an **async** device-side assert that may surface at a later, unrelated line. **Class-probability** form (soft/smoothed/mixup): target must be float, same shape `(N,C,...)`, summing to 1. Mixing them is the error.
**Fix**: hard labels → `targets.long()` of shape `(N,)`; remap ids to contiguous `0..C-1` (`{orig:i for i,orig in enumerate(sorted(set(labels)))}`; subtract 1 if 1-indexed); `assert targets.min()>=0 and targets.max()<C`. Don't one-hot the standard path. Debug the opaque CUDA assert with `CUDA_LAUNCH_BLOCKING=1` (or rerun on CPU) for the real line. Soft labels → a float `(N,C)` distribution (no manual log_softmax). Use `ignore_index` for pad, not an out-of-range sentinel (O15). ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), ["Target N out of bounds" + remap](https://discuss.huggingface.co/t/indexerror-target-4-is-out-of-bounds/10213))
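
A small sketch of the remap-and-assert step, plus the soft-label form for comparison. The raw label ids are invented, and the one-hot soft targets are used only to show that both target forms agree when the distribution is degenerate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
raw_labels = [3, 7, 7, 12, 3]   # hypothetical non-contiguous class ids

# Remap to contiguous 0..C-1 indices.
id_map = {orig: i for i, orig in enumerate(sorted(set(raw_labels)))}
targets = torch.tensor([id_map[label] for label in raw_labels], dtype=torch.long)
num_classes = len(id_map)
assert targets.min() >= 0 and targets.max() < num_classes

logits = torch.randn(len(raw_labels), num_classes)

# Class-index form: long targets of shape (N,).
hard_loss = F.cross_entropy(logits, targets)

# Class-probability form: float targets of shape (N, C) whose rows sum to 1.
soft_targets = F.one_hot(targets, num_classes).float()
soft_loss = F.cross_entropy(logits, soft_targets)
```
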
### O15 — Padded-token loss: `reduction='mean'` averages over PAD → diluted, length-dependent loss
**Symptom**: a seq/NLP model's loss looks suspiciously small from step 0 and scales with how much padding is in the batch (more pad → lower loss); the model under-learns real tokens; changing batch size or max-length changes the loss magnitude for the same data.
**Root cause**: default `reduction='mean'` divides the summed loss by the **total** element count, **including** padded positions, so the real-token loss is averaged with (near-zero) pad contributions — shrinking reported loss and the effective gradient on real tokens by the pad ratio. Unmasked pad targets also contribute real gradient, teaching the model to predict padding.
**Fix**: skip padding. Easiest: `nn.CrossEntropyLoss(ignore_index=PAD_ID)`, where *"the loss is averaged over non-ignored targets"* (sums valid positions, divides by valid count). Otherwise compute `reduction='none'`, multiply by a 0/1 mask, and divide by `mask.sum()` (valid tokens), **not** `mask.numel()`: `loss=(per_tok*mask).sum()/mask.sum().clamp(min=1)`. Reshape logits→`(N*T,C)`, targets→`(N*T,)` first. (Masking the inputs/attention → by-domain L1/L2; this owns the loss **denominator**.) ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), [ignore_index nuance pytorch#63004](https://github.com/pytorch/pytorch/issues/63004))
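
The two routes above should give the same number, and the wrong denominator shrinks the loss by the pad ratio. A sketch with a hypothetical `PAD_ID = 0` and random logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
PAD_ID = 0                      # hypothetical padding id
N, T, C = 2, 5, 7
logits = torch.randn(N, T, C)
targets = torch.tensor([[3, 1, 4, PAD_ID, PAD_ID],
                        [2, 6, PAD_ID, PAD_ID, PAD_ID]])   # 5 real tokens, 5 pads

flat_logits = logits.reshape(N * T, C)
flat_targets = targets.reshape(N * T)

# Route 1: let the loss skip padding.
loss_ignore = F.cross_entropy(flat_logits, flat_targets, ignore_index=PAD_ID)

# Route 2: per-token loss, 0/1 mask, divide by the number of valid tokens.
per_tok = F.cross_entropy(flat_logits, flat_targets, reduction="none")
mask = (flat_targets != PAD_ID).float()
loss_masked = (per_tok * mask).sum() / mask.sum().clamp(min=1)

# Wrong denominator: dividing by every position dilutes the loss by the pad ratio.
loss_diluted = (per_tok * mask).sum() / mask.numel()
```
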
### O16 — `nn.NLLLoss` fed raw logits (no `log_softmax`) → silently wrong loss
**Symptom**: a model uses `nn.NLLLoss` but has no `LogSoftmax`/`F.log_softmax` before it (or a plain `Softmax`): training "runs" with no error but loss is nonsensical / won't converge, accuracy stuck near chance.
**Root cause**: `nn.NLLLoss` computes **no** softmax — *"the input ... is expected to contain log-probabilities."* It simply gathers `-input[target]`. Raw logits → it negates an arbitrary-scale value; softmax **probabilities** (not log) → it negates a value in `[0,1]` giving a tiny, ill-scaled loss. Either way it isn't cross-entropy and the gradient is wrong, but the shapes are valid so PyTorch can't catch it.
**Fix**: put `F.log_softmax(logits, dim=1)` (or an `nn.LogSoftmax(dim=1)` final layer) immediately before `nn.NLLLoss` (class dim = 1 for `(N,C)`). Simpler and less error-prone: drop NLLLoss+LogSoftmax and use `nn.CrossEntropyLoss` on raw logits (O12), which fuses both. Never pair NLLLoss with a plain (non-log) Softmax. ([NLLLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html))
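
A quick equivalence check, on made-up logits, showing that `log_softmax` + `nll_loss` matches `cross_entropy` while `nll_loss` on raw logits does not:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 5)                 # hypothetical (N, C) raw outputs
targets = torch.randint(0, 5, (6,))

ce = F.cross_entropy(logits, targets)                          # fused, takes raw logits
nll_ok = F.nll_loss(F.log_softmax(logits, dim=1), targets)     # correct pairing
nll_raw = F.nll_loss(logits, targets)                          # bug: no log_softmax

print(ce.item(), nll_ok.item(), nll_raw.item())
```
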
---
## Fine-tuning / transfer
### O17 — A "frozen" layer keeps changing → `requires_grad=False` but still in the optimizer
**Symptom**: you set `requires_grad=False` on the backbone (or set it **after** building the optimizer over `model.parameters()`), yet the frozen weights keep moving every step; pretrained features drift and degrade though no real gradient flows.
**Root cause**: whether an optimizer touches a param is decided by `param.grad is None`, **not** by `param.requires_grad`. If a frozen param is in the optimizer, after `backward()` its `.grad` is often a **zero tensor** (not `None`), and SGD/Adam apply **weight decay** (`+wd·param`) and **momentum/Adam buffers** *before* the update — so the param moves even on a zero gradient. `requires_grad=False` only stops grad *accumulation*; it does not remove the param from the optimizer.
**Fix**: exclude frozen params from the optimizer at construction — `optim.SGD([p for p in model.parameters() if p.requires_grad], lr=...)` (or per-module param groups). If you froze after building the optimizer, rebuild it, or set `param.grad=None` for the frozen params each step. Freezing correctly = `requires_grad=False` **AND** not in any optimizer param group (and for Norm layers, O18). ([forum: WD/momentum on zero grad](https://discuss.pytorch.org/t/parameters-with-requires-grad-false-are-updated-during-training/90096), [pytorch#679](https://github.com/pytorch/pytorch/issues/679))
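
A sketch reproducing the drift with SGD weight decay and then the fix. The two-layer model and hyperparameters are hypothetical, and `zero_grad(set_to_none=False)` is used on purpose to leave zero-tensor grads behind.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone, head = nn.Linear(4, 4), nn.Linear(4, 2)
model = nn.Sequential(backbone, head)
x = torch.randn(3, 4)

# Bug: the optimizer covers everything, and the backbone is frozen only afterwards.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.1)
model(x).sum().backward()                   # populates backbone .grad
optimizer.zero_grad(set_to_none=False)      # .grad is now a zero tensor, not None
for p in backbone.parameters():
    p.requires_grad_(False)

before = backbone.weight.detach().clone()
model(x).sum().backward()
optimizer.step()                            # weight decay still shrinks the "frozen" weights
drifted = not torch.equal(before, backbone.weight)

# Fix: rebuild the optimizer over trainable params only.
fixed = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                        lr=0.1, weight_decay=0.1)
before = backbone.weight.detach().clone()
fixed.zero_grad(set_to_none=True)
model(x).sum().backward()
fixed.step()
stayed = torch.equal(before, backbone.weight)
```
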
### O18 — Frozen backbone left in `.train()` → BatchNorm `running_mean/var` silently drift
**Symptom**: the backbone is "frozen" (`requires_grad=False`) yet val accuracy is unstable / worse than train, or `eval()` vs `train()`-mode inference disagree; small fine-tuning batches make it worse. The frozen features keep shifting batch-to-batch.
**Root cause**: BatchNorm has two kinds of state — learnable affine (`gamma/beta`, gated by `requires_grad`) and **non-learnable** `running_mean/running_var` buffers updated by the **forward pass whenever the module is in training mode** (default `momentum=0.1`), independent of `requires_grad` and the optimizer. A frozen backbone left in `.train()` therefore overwrites the pretrained BN stats with your (often tiny, domain-shifted) batch stats — so the "frozen" extractor isn't frozen.
**Fix**: put the frozen Norm layers in eval mode after `model.train()`: call `m.eval()` on every `m` in `backbone.modules()` that is an `nn.BatchNorm1d`/`nn.BatchNorm2d`/`nn.BatchNorm3d`. Building them with `track_running_stats=False` also stops the drift, but BN then uses batch statistics in eval mode too, so the pretrained running stats go unused. Re-apply every epoch, because a top-level `model.train()` flips children back. ([BatchNorm2d doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)) (general train/eval-mode bug → O5; tiny-batch BN → by-domain V7.)
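
A sketch of the re-apply pattern. `freeze_bn_stats` is a hypothetical helper name and the tiny conv backbone is a stand-in for a real pretrained one.

```python
import torch
import torch.nn as nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def freeze_bn_stats(module):
    """Put every BatchNorm layer under `module` in eval mode so running stats stop updating."""
    for m in module.modules():
        if isinstance(m, BN_TYPES):
            m.eval()

torch.manual_seed(0)
backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad_(False)
model = nn.Sequential(backbone, nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
bn, x = backbone[1], torch.randn(4, 3, 16, 16)

# Without the fix: a forward pass in train mode rewrites the running stats.
model.train()
before = bn.running_mean.clone()
model(x)
drifted = not torch.equal(before, bn.running_mean)

# With the fix: re-apply after every model.train() call.
model.train()
freeze_bn_stats(backbone)
before = bn.running_mean.clone()
model(x)
stats_kept = torch.equal(before, bn.running_mean)
```
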
### O19 — One global LR wrecks pretrained features (catastrophic forgetting) → discriminative LR + gradual unfreezing
**Symptom**: fine-tuning with a single LR either (too high) destroys the pretrained representations on the first updates and accuracy collapses below a frozen-feature baseline, or (too low) the random new head can't move. Both are the same misconfiguration.
**Root cause**: at step 0 the backbone is near a good optimum but the new head is random, so its large initial loss yields large gradients that, under one high LR, propagate into and overwrite the low-level pretrained layers (catastrophic forgetting). A single LR can't be simultaneously small enough to preserve early layers and large enough to fit the head — the fix is per-group LRs, not more data.
**Fix**: discriminative fine-tuning — per-layer param groups with LR decaying toward the input (head highest, stem lowest), e.g. `AdamW([{'params':head,'lr':1e-3},{'params':backbone,'lr':1e-5}])`. Combine with **gradual unfreezing** (train the head with the backbone frozen first, then unfreeze deeper→shallower) and an LR **warmup** so the random head settles before its gradients reach the backbone. ([Howard & Ruder 2018, ULMFiT — discriminative fine-tuning + gradual unfreezing](https://arxiv.org/abs/1801.06146)) (the general too-high-LR spike → P12.)
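
One way to build the per-layer groups is sketched below. `discriminative_groups` is a hypothetical helper, the blocks stand in for backbone stages, and `decay=2.6` is borrowed from ULMFiT's reported per-layer ratio as a starting point only.

```python
import torch
import torch.nn as nn

def discriminative_groups(blocks, head, head_lr=1e-3, decay=2.6):
    """Param groups: head at head_lr, each earlier block at the previous LR / decay.

    `blocks` must be ordered input -> output and each block must own parameters.
    """
    groups = [{"params": list(head.parameters()), "lr": head_lr}]
    lr = head_lr
    for block in reversed(blocks):          # walk from the output side toward the input
        lr /= decay
        groups.append({"params": list(block.parameters()), "lr": lr})
    return groups

blocks = [nn.Linear(16, 16), nn.Linear(16, 16)]   # hypothetical backbone stages
head = nn.Linear(16, 3)
optimizer = torch.optim.AdamW(discriminative_groups(blocks, head))

group_lrs = [g["lr"] for g in optimizer.param_groups]
print(group_lrs)   # head first, then blocks from output side to input side
```

Gradual unfreezing fits the same structure: start with only the head's group trainable, then add block groups one at a time, rebuilding the optimizer as in O17.
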
### O20 — `load_state_dict(strict=False)` still RuntimeErrors on the replaced head → shape ≠ key mismatch
**Symptom**: you replaced the classifier for a new `num_classes` and pass `strict=False` expecting it to skip the head, but loading still crashes: `RuntimeError: ... size mismatch for fc.weight: copying a param with shape [1000,...] ..., the shape in current model is [N,...]`.
**Root cause**: `strict=False` relaxes only the **presence** check — it tolerates `missing_keys`/`unexpected_keys`. It does **not** relax tensor-shape compatibility: any key present in **both** the checkpoint and the model whose shapes differ (exactly your old-vs-new head `fc.weight/bias`) still raises. So `strict=False` is necessary but not sufficient when the head keeps the same name.
**Fix**: drop the incompatible head entries before loading, then load non-strict — `sd={k:v for k,v in ckpt.items() if not k.startswith('fc.')}; missing,unexpected = model.load_state_dict(sd, strict=False)` — and inspect `missing/unexpected` to confirm only the head is missing. Or give the new head a different attribute name so it never collides. (Save/resume of matching architectures → checkpoint-resume C1-C18.) ([load_state_dict doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.load_state_dict), [forum: strict=False ≠ shape](https://discuss.pytorch.org/t/when-load-state-dict-strict-false-do-not-work/82301))
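
A sketch of both the failure and the filtered load. The `Net` class and its sizes are hypothetical, and a freshly built 1000-class instance stands in for a pretrained checkpoint.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Hypothetical model: a shared body plus a classifier head named `fc`."""
    def __init__(self, num_classes):
        super().__init__()
        self.body = nn.Linear(8, 8)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(torch.relu(self.body(x)))

torch.manual_seed(0)
ckpt = Net(num_classes=1000).state_dict()    # stand-in for a pretrained checkpoint
model = Net(num_classes=5)

# strict=False alone still raises: fc.* exists in both but with different shapes.
try:
    model.load_state_dict(ckpt, strict=False)
    raised = False
except RuntimeError:
    raised = True

# Drop the head entries, then load non-strict and inspect what was skipped.
sd = {k: v for k, v in ckpt.items() if not k.startswith("fc.")}
result = model.load_state_dict(sd, strict=False)
print(result.missing_keys, result.unexpected_keys)
```
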
### O21 — LoRA/PEFT "barely trains" or reloads random → `target_modules` don't match, head/Norm not in `modules_to_save`
**Symptom**: after `get_peft_model(...)`, `print_trainable_parameters()` shows ~0% (or far fewer than expected) and loss won't drop; or a PEFT classifier reloads with a **random** head / shifted metrics after `save_pretrained`.
**Root cause**: (a) LoRA only wraps modules whose names match `target_modules`, and names are architecture-specific (`q_proj/v_proj` for Llama vs `query/value` for BERT vs `convolution` for resnet) — a wrong/absent name injects no adapter, PEFT just warns "no modules matched," and you train nothing. (b) A newly-initialized task head (`score`/`classifier`) or a base-model BatchNorm's `running_mean/var` are **not** saved unless listed in `modules_to_save` — reload restores the base's random head / original BN stats → garbage / non-reproducible outputs.
**Fix**: enumerate real names with `[n for n,_ in model.named_modules()]` and set `LoraConfig(target_modules=[...])` (or `'all-linear'`); confirm with `model.print_trainable_parameters()` and that you see `lora.Linear` layers. Add the new head and any base Norm layers to `modules_to_save` (e.g. `modules_to_save=['classifier','normalization']`) — or pass the right `task_type` (PEFT auto-adds the standard head). ([PEFT troubleshooting](https://huggingface.co/docs/peft/developer_guides/troubleshooting))
---
## Training-dynamics dashboard — instrument it so the failure is visible
### O22 — Update-to-weight L2 ratio ≈ 1e-3 (the single highest-signal LR dial)
**Symptom**: loss barely moves (under-stepping) or is jittery (over-stepping), and the bare grad-norm can't tell you which — it isn't scale-relative to the weights.
**Root cause**: what matters is the size of the **actual update** relative to the param's own magnitude — `ratio = ||lr·update|| / ||W||`, measured per tensor **after** `step()` (so it folds in lr, momentum, Adam's preconditioning, weight decay). CS231n's heuristic: this should sit ~`1e-3`. Lower → LR too low (weights barely change); higher (`1e-2..1e-1`) → LR too high. Being per-tensor, it exposes individually mis-scaled layers (an embedding moving 100× faster than the trunk) that a global grad-norm hides.
**Fix**: log it every K steps — snapshot `w0={n:p.detach().clone() for n,p in model.named_parameters()}` before `step()`, then `(p.detach()-w0[n]).norm()/(w0[n].norm()+1e-12)` per name after. Lever: `≪1e-3` → raise that group's LR; `≫1e-3` → lower LR / lengthen warmup. Track per param group, not just globally. ([CS231n "Neural Networks 3"](https://cs231n.github.io/neural-networks-3/)) (complements P12/P18.)
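
The snapshot-and-compare recipe above can be sketched as two small helpers. The toy model, the data, and the helper names (`snapshot`, `update_ratios`, `one_step_ratios`) are assumptions for illustration; plain SGD without momentum is used so the effect of the LR on the ratio is easy to see.

```python
import copy

import torch
from torch import nn


def snapshot(model):
    """Copy every trainable tensor just before optimizer.step()."""
    return {n: p.detach().clone() for n, p in model.named_parameters() if p.requires_grad}


def update_ratios(model, before):
    """Per-tensor ||W_after - W_before|| / ||W_before||, measured after step()."""
    return {
        n: ((p.detach() - before[n]).norm() / (before[n].norm() + 1e-12)).item()
        for n, p in model.named_parameters()
        if n in before
    }


def one_step_ratios(model, lr, x, y):
    """Run a single SGD step and return the update-to-weight ratio per tensor."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    before = snapshot(model)
    opt.step()
    return update_ratios(model, before)


torch.manual_seed(0)
base = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
x, y = torch.randn(32, 4), torch.randn(32, 1)

# Same weights and data, two learning rates: the ratio tracks the LR.
low = one_step_ratios(copy.deepcopy(base), 1e-3, x, y)
high = one_step_ratios(copy.deepcopy(base), 1e-1, x, y)
for name in low:
    print(f"{name:10s} lr=1e-3: {low[name]:.2e}   lr=1e-1: {high[name]:.2e}")
```

In a real loop you would take the snapshot only on logging steps, since cloning every parameter each iteration costs memory and time.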
### O23 — Log the **actual** per-step LR, not the config value
**Symptom**: you log `cfg.lr` (a constant) so the dashboard LR is flat — yet you're on warmup+cosine. You can't see warmup, decay, a restart, or a frozen scheduler; LR-related loss behavior (spike on ramp, stall at the floor) is invisible.
**Root cause**: the effective LR lives in `optimizer.param_groups[i]['lr']` and is rewritten by `scheduler.step()` each step (and per group for differential/no-decay LRs). Failure modes: plotting the config scalar (never changes); or the O8 order bug skipping the first value. Also, `get_lr()` is the internal method `step()` uses to compute the next value; called directly it can return a value "one step ahead" of the LR in effect (PyTorch warns and points to `get_last_lr()`), so logging it records the wrong number.
**Fix**: log `scheduler.get_last_lr()` (a list — one per param group; log them all if you use differential LRs) or read `optimizer.param_groups[0]['lr']` directly, every step. Don't use `get_lr()` for logging. If the logged LR plateaus when it should ramp/decay, your scheduler isn't being stepped (or is stepped at the wrong cadence → O8). ([torch.optim](https://docs.pytorch.org/docs/stable/optim.html), [lr_scheduler source — get_last_lr](https://github.com/pytorch/pytorch/blob/main/torch/optim/lr_scheduler.py))
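
A short sketch of logging the live LR under a linear warmup. The five-step warmup, the toy model, and the base LR of `0.1` (playing the role of `cfg.lr`) are assumptions chosen to keep the numbers readable.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # 0.1 plays the role of cfg.lr
warmup_steps = 5                                   # hypothetical schedule length
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: min(1.0, (step + 1) / warmup_steps)
)

x, y = torch.randn(16, 4), torch.randn(16, 1)
logged = []
for step in range(8):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()
    # Read the LR this optimizer step actually used, before the scheduler advances.
    logged.append(sched.get_last_lr()[0])
    sched.step()

print([round(lr, 3) for lr in logged])
```

The logged values ramp over the first five steps and then hold at the base LR, whereas logging the config scalar would have shown a flat `0.1` throughout.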
### O24 — GradScaler scale drifting toward 0 = silent persistent fp16 overflow
**Symptom**: an fp16-AMP run looks healthy (loss prints, no crash) but isn't learning or silently skips many optimizer steps — because you never plotted `scaler.get_scale()` and the loss-scale has cratered from 65536 toward ~1 (or sawtooths down).
**Root cause**: GradScaler adapts a multiplicative loss-scale: on any inf/NaN grad it multiplies by `backoff_factor=0.5` **and skips** that `step()`; after `growth_interval=2000` clean steps it multiplies by `growth_factor=2.0` (`init_scale=65536`). A few early backoffs are normal calibration (P5/P10), but a scale that keeps halving and stays low means the forward keeps producing values `> fp16's 65504` → grads overflow → step skipped every step → weights frozen while loss still looks plausible. The config "fp16" tells you nothing; only the live scale reveals it.
**Fix**: add `scaler.get_scale()` and a skipped-step counter to the dashboard. Healthy: a high plateau (`2^13..2^16`) after early calibration. Bad: monotonic decay toward 1, or step-count not advancing with iteration count. Lever when it collapses: switch **fp16 → bf16** (no scaler; fp32 exponent range absorbs the large activations — highest leverage), or keep the overflow-prone block (final logits / attention) in fp32 via a nested `autocast(enabled=False)`, plus z-loss / qk-norm (P15/P16). Don't "fix" it by lowering `init_scale`. ([torch.amp GradScaler](https://docs.pytorch.org/docs/stable/amp.html)) (mechanism → P5/P10.)
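
To build intuition for what a healthy versus collapsing scale curve looks like, the backoff/growth rule described above can be replayed in plain Python. This is a model of the documented rule using the default values quoted in this entry, not GradScaler's own code; `simulate_scale` is a hypothetical helper.

```python
def simulate_scale(overflows, init_scale=65536.0, growth_factor=2.0,
                   backoff_factor=0.5, growth_interval=2000):
    """Replay the backoff/growth rule over per-step overflow flags.

    Returns (final_scale, skipped_steps).
    """
    scale, clean_streak, skipped = init_scale, 0, 0
    for overflowed in overflows:
        if overflowed:
            scale *= backoff_factor   # inf/NaN grad: halve the scale ...
            clean_streak = 0
            skipped += 1              # ... and this optimizer step is skipped
        else:
            clean_streak += 1
            if clean_streak == growth_interval:
                scale *= growth_factor
                clean_streak = 0
    return scale, skipped


# Early calibration: a few backoffs, then a long clean stretch lets the scale grow back.
calibrating = simulate_scale([True] * 3 + [False] * 2000)
# Persistent overflow: every step backs off and is skipped, so the weights never move.
collapsing = simulate_scale([True] * 16)
print("calibrating:", calibrating)
print("collapsing: ", collapsing)
```

In a real loop, one common way to count skipped steps is to compare `scaler.get_scale()` before and after `scaler.update()`: a drop means that step was skipped.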
### O25 — Rising dead-ReLU / zero-activation fraction → a slice of the net is permanently off
**Symptom**: capacity quietly vanishes — a layer's outputs are increasingly all-zero, loss plateaus above where it should, and adding width doesn't help. No crash; it just under-fits. Worst case the net degenerates toward a constant function.
**Root cause**: a ReLU whose pre-activation is driven negative for ~all inputs outputs 0 and has **zero** local gradient there, so backprop sends no signal to its incoming weights — the unit is stuck off and unrecoverable. Triggered by too-high LR (a big update pushes weights/bias deep negative) or a large negative bias. Once a large fraction of a layer dies, gradients can't flow through it and that capacity is gone. The same shape (saturation → ~0 gradient → frozen region) applies to sigmoid/tanh tails.
**Fix**: instrument the zero/saturation fraction per activation with a forward hook — `(out==0).float().mean()` for ReLU (or `|out|>0.99` for tanh/sigmoid), logged every K steps per layer. Healthy: a stable modest dead fraction (ReLU is sparse by design). Bad: a fraction climbing over training or a layer pinned near ~100% dead. Levers, in order: (1) lower LR (the primary cause); (2) ReLU → LeakyReLU / GELU / SiLU so the negative region keeps a gradient; (3) fix init / large negative biases. ([CS231n "Neural Networks 1" — dying ReLU](https://cs231n.github.io/neural-networks-1/)) (the *output* being constant is owned by verifying-dl-experiments; this is the internal mechanism.)
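
The forward-hook instrumentation above might look like the following sketch. The helper name `attach_dead_fraction_hooks` and the toy model are illustrative; the two forward passes force the fully dead and fully alive extremes so the readout is unambiguous.

```python
import torch
from torch import nn


def attach_dead_fraction_hooks(model, store):
    """Record the fraction of exactly-zero outputs of every nn.ReLU into `store`."""
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            # Bind `name` as a default so each hook keeps its own layer name.
            def hook(_module, _inputs, output, name=name):
                store[name] = (output == 0).float().mean().item()
            handles.append(module.register_forward_hook(hook))
    return handles


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
stats = {}
handles = attach_dead_fraction_hooks(model, stats)

with torch.no_grad():
    model[0].weight.zero_()
    model[0].bias.fill_(-1.0)  # every pre-activation is negative: the layer is dead
model(torch.randn(16, 4))
dead_all = stats["1"]

with torch.no_grad():
    model[0].bias.fill_(1.0)   # every pre-activation is positive: nothing is dead
model(torch.randn(16, 4))
dead_none = stats["1"]

for h in handles:
    h.remove()                 # detach hooks when you stop logging
print(f"dead fraction: {dead_all:.2f} (forced dead), {dead_none:.2f} (forced alive)")
```

This measures zeros per batch element; to flag units that are dead for every input, you would instead reduce over the batch dimension first and check which units never fire.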
### O26 — No weight/grad/activation histograms → scalar norms hide bimodal/saturating/collapsing distributions
**Symptom**: scalar dashboards (loss, one grad-norm) look fine yet the model under-performs or destabilizes — a mean/norm hides the shape: activations drifting to a saturated tail, weights collapsing to a spike at 0 (a layer dying, O25), or a gradient distribution growing fat outlier tails all read as an unremarkable scalar.
**Root cause**: norms and means are lossy summaries — a healthy spread and a bimodal/all-saturated/all-zero distribution can share the same L2 norm. The diagnostic signal is the **change in shape over training**, which a scalar can't show.
**Fix**: periodically (every few hundred steps — histograms aren't free) log `SummaryWriter.add_histogram(tag, values, global_step)` for each layer's **weights**, its **gradients** (after `backward`, before `zero_grad`), and key **activations** (forward hook). Read the time-evolution: weights collapsing to a spike = a layer dying; gradient histograms collapsing to ~0 = vanishing (lever: residual/norm/init, P17); fat tails = clip + lower LR (P13/P12); activations wandering into a saturating tail = init/normalization fix (P17). Pair with the scalars above. ([SummaryWriter.add_histogram](https://docs.pytorch.org/docs/stable/tensorboard.html), [Karpathy recipe — visualize weights/activations](https://karpathy.github.io/2019/04/25/recipe/))
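
A sketch of the periodic weight/gradient histogram logging described above. `log_histograms` and `RecordingWriter` are hypothetical names; the recorder is a stand-in that exposes only `add_histogram(tag, values, global_step)` so the example runs without TensorBoard installed. In a real run you would pass a `torch.utils.tensorboard.SummaryWriter` instead.

```python
import torch
from torch import nn


def log_histograms(writer, model, step):
    """Log weight and gradient histograms; call after backward(), before zero_grad()."""
    for name, p in model.named_parameters():
        writer.add_histogram(f"weights/{name}", p.detach().cpu(), step)
        if p.grad is not None:
            writer.add_histogram(f"grads/{name}", p.grad.detach().cpu(), step)


class RecordingWriter:
    """Stand-in writer: records each call instead of writing event files."""

    def __init__(self):
        self.calls = []

    def add_histogram(self, tag, values, global_step):
        self.calls.append((tag, values.numel(), global_step))


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
writer = RecordingWriter()
log_every = 200  # hypothetical cadence; histograms are not free

for step in range(401):
    model.zero_grad()
    model(torch.randn(16, 4)).pow(2).mean().backward()
    if step % log_every == 0:
        log_histograms(writer, model, step)  # grads are still populated here
    # optimizer.step() would go here in a real loop

print(len(writer.calls), "histograms logged")
```

Activation histograms follow the same pattern, with a forward hook supplying the tensor to log.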
---
## Pointers — adjacent mechanics catalogued elsewhere
- **NaN / loss-spike / LR-too-HIGH / grad explosion / z-loss / qk-norm / init & norm placement / determinism** → `references/training/precision-stability.md` (P8–P19). This file is the LR-too-LOW / won't-move side; that one is the blows-up side.
- **OOM from the optimizer step / activation checkpointing / LoRA-QLoRA memory** → `references/training/oom-memory.md` (M5, M12–M13).
- **N-GPU effective batch × LR, DeepSpeed accum double-count, find_unused_parameters** → `references/training/distributed-launch.md` (D11, D18, D8).
- **Dataloader correctness (worker RNG, collate, labels, shuffle) that mimics "won't learn"** → `references/training/data-pipeline.md`.
- **Is the converged number REAL** (collapse, leakage, train-vs-val, metric validity, seed discipline) → **verifying-dl-experiments** (**REQUIRED** — every "is the result real" fork above).