# Convergence & optimization debugging — it runs, doesn't crash, but won't learn (or learns badly)

The other training layers cover the run that **crashes** (`oom-memory.md`), **NaNs**
(`precision-stability.md`), **hangs** (`distributed-launch.md`), or is **slow** (`throughput-profiling.md`).
This file owns the quieter, far more common failure: the job runs cleanly to the end but the **loss is
flat, falls too slowly, or the model underfits** — and the bug is in the optimization wiring, not the
hardware. Each entry is **Symptom → Root cause → Fix** with the exact knob. **Always start with O1
(overfit one batch)** — it separates "the loop is broken" from "the model/data is weak" in five minutes
and tells you which half of this file you need.

Boundary: **verifying-dl-experiments** (**REQUIRED** at every "is the result real" fork) owns collapse,
leakage, metric validity, train-vs-val generalization, and seed interpretation; this file owns the
*mechanism* of why a correct-looking loop doesn't converge. NaN / loss-spike / LR-too-**HIGH** live next
door in `precision-stability.md` (P8–P18) — this file is the LR-too-**LOW** / won't-move / mis-wired side.

To jump: `grep -in '<keyword>' references/training/convergence-debugging.md` (e.g. `overfit`, `requires_grad`,
`no_grad`, `optimizer`, `weight decay`, `adamw`, `lr finder`, `scheduler`, `accum`, `cross entropy`,
`bcewithlogits`, `nllloss`, `freeze`, `batchnorm`, `discriminative`, `lora`, `update ratio`, `dead relu`).
## Table of contents

- **It isn't learning at all (start here)** — O1 overfit-one-batch · O2 params-not-in-optimizer · O3 loss-detached-from-graph · O4 zero_grad/backward/step-order · O5 train()/eval()-mode
- **Optimizer / LR / weight-decay / schedule** — O6 AdamW-vs-Adam+no-decay-group · O7 LR-too-LOW+finder · O8 scheduler-order/cadence · O9 grad-accum-divide · O10 AdamW-eps-in-bf16 · O11 fused/foreach
- **Loss-function footguns** — O12 double-softmax · O13 BCEWithLogits · O14 CE-target-form · O15 padded-loss-reduction · O16 NLLLoss-needs-log_softmax
- **Fine-tuning / transfer** — O17 frozen-but-still-in-optimizer · O18 frozen-BN-running-stats · O19 discriminative-LR/forgetting · O20 strict=False-shape-mismatch · O21 LoRA/PEFT-wiring
- **Training-dynamics dashboard (instrument it)** — O22 update:weight-ratio · O23 actual-LR · O24 GradScaler-scale · O25 dead-ReLU-fraction · O26 weight/grad/act-histograms
- **Pointers** — precision-stability.md, distributed-launch.md, verifying-dl-experiments (skill)

---

## It isn't learning at all — the first-hour triage

### O1 — Run the overfit-one-batch smoke BEFORE tuning anything (the canonical correctness test)
**Symptom**: training "runs" (no error, normal throughput) but loss plateaus near its init value or wanders without trending down, across LRs/optimizers/architectures. You're tuning hyperparameters blind because nothing proves the loop can learn at all.
**Root cause**: the loop is broken somewhere between forward and weight-update (any of O2–O5, or a label/shape bug) and no single test isolates "can this code memorize?" from "is this a modeling/data problem?".
**Fix**: take ONE fixed mini-batch (2 examples is enough) and loop forward/backward/step on **that same batch** for hundreds of iters — a correct loop drives train loss → ~0. Turn **off** augmentation, shuffling, dropout, and weight decay for the test. Also "verify the loss at init" (e.g. softmax CE should start near `-log(1/n_classes)` then fall). If it will not reach ~0, *"there is a bug somewhere and we cannot continue"* — debug the loop (O2–O5) before touching hyperparameters. (Smoke *content/interpretation* → **verifying-dl-experiments**; this is the mechanical gate.) ([Karpathy, "A Recipe for Training Neural Networks"](https://karpathy.github.io/2019/04/25/recipe/))
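
A minimal sketch of this gate, assuming a toy MLP and random tensors stand in for the real model and batch (the sizes, seed, step count, and `Adam(lr=1e-2)` are illustrative choices, not part of the cited recipe):

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

n_classes = 3
# Hypothetical stand-ins for the real model and ONE fixed mini-batch.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, n_classes))
x = torch.randn(4, 8)
y = torch.tensor([0, 1, 2, 1])

loss_fn = nn.CrossEntropyLoss()
# Weight decay off for the smoke test; this toy model has no dropout or augmentation.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=0.0)

model.train()
init_loss = loss_fn(model(x), y).item()
print(f"loss at init {init_loss:.3f} vs log(n_classes) {math.log(n_classes):.3f}")

for step in range(500):
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(x), y)  # the same batch on every iteration
    loss.backward()
    optimizer.step()

final_loss = loss_fn(model(x), y).item()
print(f"final loss {final_loss:.4f}")
```

If the real loop cannot drive its own fixed batch toward zero the way this toy does, treat it as a wiring bug (O2–O5) rather than a hyperparameter problem.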

### O2 — Loss flat from step 0, weights byte-identical after `step()` → params aren't in the optimizer
**Symptom**: overfit-one-batch fails; a snapshotted param is unchanged before/after `optimizer.step()`; grad-norm may even be nonzero. No error.
**Root cause**: the optimizer updates a **different** set of tensors than the model forwards through. Four causes: (a) the params have `requires_grad=False` so `.grad` stays `None` and `step()` skips them; (b) a submodule/head was never passed into the optimizer's param iterable; (c) the optimizer was built from `model.parameters()` **before** `model.to(device)`, and on backends where the move creates new Parameter objects (the XLA case linked below) it keeps stepping the stale CPU tensors while the model forwards the device copies; (d) freeze/unfreeze toggled `requires_grad` but left the wrong set in the optimizer.
**Fix**: build the optimizer **after** `model.to(device)`. Assert it sees every trainable param: `opt_ids={id(p) for g in optimizer.param_groups for p in g['params']}; assert all(id(p) in opt_ids for p in model.parameters() if p.requires_grad)`. Log `sum(p.requires_grad for p in model.parameters())` at startup. Probe: `w0=next(model.parameters()).clone(); <one step>; assert not torch.equal(w0, next(model.parameters()))`. ([autograd notes](https://docs.pytorch.org/docs/stable/notes/autograd.html), [torch.optim](https://docs.pytorch.org/docs/stable/optim.html), [stale-optimizer-after-.to bug](https://github.com/pytorch/xla/issues/1623))
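
The coverage assert and the weight-changed probe can be wrapped as below. This is a sketch: `trainable_params_missing_from` is a hypothetical helper name and the two-layer model is a toy stand-in.

```python
import torch
import torch.nn as nn


def trainable_params_missing_from(optimizer, model):
    """Names of trainable params that the optimizer does not hold (hypothetical helper)."""
    opt_ids = {id(p) for g in optimizer.param_groups for p in g["params"]}
    return [n for n, p in model.named_parameters()
            if p.requires_grad and id(p) not in opt_ids]


model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))  # toy backbone + head

# Bug: only the first layer was handed to the optimizer.
bad_opt = torch.optim.SGD(model[0].parameters(), lr=0.1)
print(trainable_params_missing_from(bad_opt, model))  # ['1.weight', '1.bias']

good_opt = torch.optim.SGD(model.parameters(), lr=0.1)
assert not trainable_params_missing_from(good_opt, model)

# Probe: one real step must change a weight.
w0 = model[1].weight.detach().clone()
model(torch.randn(8, 4)).sum().backward()
good_opt.step()
assert not torch.equal(w0, model[1].weight)
```

Running the helper once at startup, right after the optimizer is built, is cheap and catches causes (b) and (d) before any compute is spent.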

### O3 — `backward()` is a no-op / raises "does not require grad" → loss detached from the graph
**Symptom**: overfit fails with every `p.grad is None`; or `loss.backward()` raises *"element 0 of tensors does not require grad and does not have a grad_fn"*.
**Root cause**: the loss tensor was severed from autograd before `backward`. Common severings: (a) the train forward+loss ran inside `with torch.no_grad():` / `@torch.inference_mode()` left over from eval — *"computations in no-grad mode are never recorded in the backward graph"*; (b) `.item()` / `.detach()` / `.cpu().numpy()` / `float(loss)` on the loss path (e.g. back-propping an accumulated `total_loss += loss.item()`); (c) a tensor rebuilt from numpy mid-network; (d) the metric, not the differentiable loss, was passed to `backward()`.
**Fix**: before `backward`, `assert loss.requires_grad and loss.grad_fn is not None`. Keep the differentiable loss tensor distinct from logging scalars (log `loss.item()`, back-prop the raw tensor). Reserve `no_grad`/`inference_mode` for eval only. After `backward`, assert at least one `p.grad is not None`. ([autograd notes](https://docs.pytorch.org/docs/stable/notes/autograd.html))

### O4 — Wrong `zero_grad` / `backward` / `step` order, or a missing `step()`
**Symptom**: overfit fails; weights never move, or training is erratic despite nonzero grads.
**Root cause**: PyTorch's contract is *"gradients by default add up; to prevent double-counting we explicitly zero them each iteration"*, `backward` deposits into `.grad`, `step` reads `.grad`. Failure modes: (a) `optimizer.step()` omitted → grads computed, weights never updated; (b) `zero_grad()` placed **after** `backward()` → wipes the fresh grads; (c) `step()` **before** `backward()` → steps on stale/zero grads; (d) `zero_grad` never called → grads from all iters keep summing → effective LR explodes.
**Fix**: the canonical order, exactly — `optimizer.zero_grad(set_to_none=True)` → forward → `loss=loss_fn(out,y)` → `loss.backward()` → `optimizer.step()` (under AMP: `scaler.scale(loss).backward()` → `scaler.step(optimizer)` → `scaler.update()`). Gradient accumulation is the one exception (O9): `backward` every micro-step, `step`+`zero_grad` only on the boundary. ([optimization tutorial](https://docs.pytorch.org/tutorials/beginner/basics/optimization_tutorial.html), [torch.optim](https://docs.pytorch.org/docs/stable/optim.html))
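
To see why the order matters, the sketch below (a toy two-element parameter with made-up values) lets `.grad` accumulate across three `backward()` calls and then runs one step in the canonical order:

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
x = torch.tensor([3.0, 4.0])


def loss_fn():
    return (w * x).sum()  # d(loss)/dw == x


loss_fn().backward()
one_pass = w.grad.clone()

# Two more backward calls without zero_grad: .grad keeps summing.
loss_fn().backward()
loss_fn().backward()
stacked = w.grad.clone()
print(stacked / one_pass)  # every entry is 3x the single-pass gradient

# Canonical order: zero_grad -> forward/backward -> step.
optimizer = torch.optim.SGD([w], lr=0.1)
optimizer.zero_grad(set_to_none=True)
loss_fn().backward()
optimizer.step()
print(w.data)  # w - 0.1 * x
```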

### O5 — Forgot `model.train()` / left `model.eval()` on → Dropout & BatchNorm in the wrong mode
**Symptom**: two faces — (1) trained under `eval()`: BN uses frozen running stats and never updates them, Dropout is off → underfits / loss barely moves; (2) evaluated under `train()`: BN uses noisy per-batch stats and Dropout fires → val loss flickers run-to-run and looks worse than train.
**Root cause**: `train()`/`eval()` set a per-module flag that *"has an effect only on certain modules ... e.g. Dropout, BatchNorm"* (`eval()` == `train(False)`). In eval mode BN switches to stored `running_mean/var` and **stops** updating them; Dropout becomes identity. A fresh `nn.Module` defaults to `train()`, but any prior `.eval()` (a reused object, an inference helper, a val loop that didn't switch back) persists.
**Fix**: bracket phases explicitly — `model.train()` atop each train epoch; `model.eval()` + `with torch.no_grad():` for every val/test pass; `model.train()` again before resuming. After build/load, `assert model.training` before the train loop. (Frozen-backbone BN is a *different* axis → O18; tiny-batch BN → by-domain V7.) ([nn.Module.train/eval](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html))

---

## Optimizer / learning-rate / weight-decay / schedule

### O6 — Weight decay "does nothing" / Norm gains destabilize → coupled `Adam(weight_decay=)` + decaying bias & Norm
**Symptom**: `weight_decay` on `torch.optim.Adam` barely regularizes (or hurts) vs the literature's AdamW recipe; or a from-scratch transformer/CNN trains worse than a reference at the "same" wd; or small models destabilize when LayerNorm/BN gains and biases get shrunk toward 0.
**Root cause**: (1) `Adam`'s `weight_decay` is classic **L2**: the term `wd·θ` is added into the gradient, so it passes through Adam's `1/(sqrt(v)+eps)` preconditioner and params with large historical grads get **less** decay; the effective strength is no longer set by `wd` alone. **AdamW** applies decoupled decay directly to the weight (`θ ← θ − lr·wd·θ`), outside the moment path, so it is uniform across params and independent of gradient history (in PyTorch it still scales with `lr`, as the formula shows). They are **not** interchangeable at the same `wd`. (2) Decaying 1-D params (biases, LayerNorm/BN weight & bias) shrinks Norm gains toward 0; these params add little overfitting capacity, and shrinking them degrades training.
**Fix**: use `torch.optim.AdamW`, not `Adam(weight_decay=...)`. Split into two param groups with `weight_decay=0.0` on the no-decay group — nanoGPT's rule: decay `p.dim()>=2` (matmul/embedding weights), no-decay `p.dim()<2` (all biases + all LayerNorm weights); HF/timm exclude by name (`bias`, `LayerNorm.weight`). ([AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html), [Loshchilov & Hutter 2017 "Decoupled Weight Decay"](https://arxiv.org/abs/1711.05101), [nanoGPT configure_optimizers](https://github.com/karpathy/nanoGPT/blob/master/model.py))
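
One way to build the two groups with the dimensionality rule described above. This is a sketch in the spirit of nanoGPT's `configure_optimizers`, not a copy of it; `adamw_param_groups` is a hypothetical helper and the tiny model is a stand-in.

```python
import torch
import torch.nn as nn


def adamw_param_groups(model, weight_decay):
    """Split trainable params by dimensionality: >=2-D decays, <2-D does not."""
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (decay if p.dim() >= 2 else no_decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 4))
optimizer = torch.optim.AdamW(adamw_param_groups(model, weight_decay=0.1), lr=3e-4)
for group in optimizer.param_groups:
    print(len(group["params"]), group["weight_decay"])
```

Here the two `Linear` weight matrices land in the decay group, while both biases and the LayerNorm weight and bias land in the no-decay group.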

### O7 — Loss crawls with no NaN → LR is too **LOW**; find the band with an LR range test
**Symptom**: no divergence, no NaN, grads finite — loss just falls glacially or plateaus high; throughput is fine but the model "won't learn." Often after copying an LR from a different-batch/optimizer recipe or defaulting to a tiny "safe" LR. (The mirror of P12's too-HIGH spike.)
**Root cause**: the LR sits far below the productive band, so each update is negligible relative to the distance the weights still have to travel and optimization crawls. The usable band for adaptive optimizers is narrow and architecture-dependent, so a guessed LR is often 1–2 orders of magnitude too small. Distinguishable from vanishing grads: the grad-norm is healthy, just under-applied.
**Fix**: run an **LR range test** (Smith) — from a tiny LR, multiply it geometrically each batch over ~100–1000 steps, plot loss vs LR, pick ~1 decade below where loss starts to diverge. Tools: `pytorch-lr-finder` `LRFinder(model,opt,crit).range_test(loader,end_lr=1,num_iter=100)`, fast.ai `learn.lr_find()`, Lightning `Tuner(trainer).lr_find()`. Re-run whenever batch size / optimizer / architecture changes — the band moves; then confirm the LR survives warmup without the P12 spike. ([Smith 2015 "Cyclical Learning Rates"](https://arxiv.org/abs/1506.01186), [pytorch-lr-finder](https://github.com/davidtvs/pytorch-lr-finder), [Smith 2018 disciplined-approach](https://arxiv.org/abs/1803.09820))
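
For intuition about what the range-test tools do, here is a dependency-free sketch of the geometric sweep and the "one decade below divergence" pick. `lr_sweep`, `suggest_lr`, the `diverge_factor=4.0` threshold, and the loss values are all assumptions made for illustration; the libraries above use their own heuristics.

```python
def lr_sweep(lr_min, lr_max, num_iter):
    """Geometric LR schedule: one value per batch, from lr_min up to lr_max."""
    ratio = (lr_max / lr_min) ** (1.0 / (num_iter - 1))
    return [lr_min * ratio ** i for i in range(num_iter)]


def suggest_lr(lrs, losses, diverge_factor=4.0):
    """LR one decade below the first point where loss exceeds diverge_factor x the best loss so far."""
    best = losses[0]
    for lr, loss in zip(lrs, losses):
        if loss > diverge_factor * best:
            return lr / 10.0
        best = min(best, loss)
    return lrs[-1] / 10.0  # never diverged: the sweep probably ended too early


lrs = lr_sweep(1e-6, 1.0, num_iter=7)                   # roughly 1e-6, 1e-5, ..., 1.0
losses = [2.30, 2.29, 2.20, 1.60, 1.10, 9.00, 40.0]     # made-up loss curve
print(suggest_lr(lrs, losses))                          # close to 1e-2
```

A real sweep uses far more points (the ~100–1000 steps mentioned above) and usually smooths the loss before looking for the divergence point.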

### O8 — `lr_scheduler.step()` before `optimizer.step()` skips the first LR; per-step vs per-epoch cadence
**Symptom**: PyTorch warns *"Detected call of `lr_scheduler.step()` before `optimizer.step()`"* and the LR curve is off-by-one; OR a cosine/warmup schedule sized in optimizer steps barely moves (it is being stepped per-epoch) or hits its floor `accum`× too early (it is being stepped on every micro-batch under gradient accumulation).
**Root cause**: (1) since PyTorch 1.1 the scheduler must step **after** the optimizer — *"if you ... call scheduler.step() before the optimizer's update ... this will skip the first value of the learning rate schedule."* (2) A scheduler advances one tick per `.step()`; schedulers built around `total_steps`/`num_training_steps` in **optimizer** steps (OneCycleLR, HF `get_cosine_schedule_with_warmup`, Lightning `interval='step'`) must be stepped every optimizer step, and under accumulation an "optimizer step" ≠ a batch.
**Fix**: order it `optimizer.step(); scheduler.step()`. Step at the granularity its `total_steps` was computed in — per optimizer step for warmup/cosine/OneCycle (inside the `if (i+1)%accum==0` block, **not** every micro-batch), per epoch only for epoch schedulers. HF `Trainer` steps it automatically — don't also step it manually. ([torch.optim — scheduler order](https://docs.pytorch.org/docs/stable/optim.html), [OneCycleLR](https://docs.pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.OneCycleLR.html))
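
A sketch of the ordering and cadence under accumulation, assuming a toy one-parameter model and an illustrative linear-decay `LambdaLR` sized in optimizer steps:

```python
import torch

accum, micro_batches = 4, 16
total_opt_steps = micro_batches // accum  # the schedule is sized in optimizer steps

w = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([w], lr=0.1)
# Linear decay to zero over total_opt_steps (an illustrative schedule).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: 1.0 - step / total_opt_steps)

lrs_used = []
for i in range(micro_batches):
    loss = (w - 1.0).pow(2).sum() / accum
    loss.backward()
    if (i + 1) % accum == 0:
        lrs_used.append(optimizer.param_groups[0]["lr"])
        optimizer.step()   # optimizer first ...
        scheduler.step()   # ... then the scheduler, once per optimizer step
        optimizer.zero_grad(set_to_none=True)

print(lrs_used)  # four values, decaying linearly from 0.1
```

Moving `scheduler.step()` outside the `if` block would advance the schedule 16 times instead of 4, which is the too-early-floor symptom.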

### O9 — Gradient accumulation gives effective N×LR → divide the loss by `accum_steps` (and normalize per token)
**Symptom**: switching from batch `B` to (micro-batch `B/N` × N accumulation) "at the same config" trains hotter/diverges — loss/grad magnitude ~N× too big, i.e. you silently get N× the LR. For token tasks the accumulated loss also differs from the un-accumulated run even after `/N` when micro-batches hold unequal #non-pad tokens.
**Root cause**: each micro-batch loss is `reduction='mean'`; `backward` **adds** grads across the N micro-batches, so the accumulated grad = SUM of N mean-grads = N× the full-batch mean grad → stepping on it ≈ N× LR. Subtler: dividing each mean-loss by N still mis-weights tokens when micro-batches have different valid-token counts (average-of-means ≠ total-loss / total-tokens) — HF found and fixed exactly this in `transformers` in 2024.
**Fix**: divide before backward — `loss = loss_fn(out,y) / accum_steps; loss.backward()`, with `step()`/`zero_grad()` only on the boundary. For token-level losses, normalize by the **total** non-pad tokens across the accumulation window (accumulate `reduction='sum'`, divide by total tokens), not the mean-of-means. Under DDP wrap non-boundary micro-steps in `with model.no_sync():` to skip the all-reduce (correctness-neutral, perf win). (DeepSpeed double-counts accum in some configs → D18; world-size×batch → D11.) ([HF "Fixing Gradient Accumulation"](https://huggingface.co/blog/gradient_accumulation), [DDP no_sync](https://docs.pytorch.org/docs/stable/notes/ddp.html))
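
The token-weighting gap is plain arithmetic. The numbers below are made up: two micro-batches holding 4 and 1 non-pad tokens.

```python
# Per-token losses for two micro-batches with unequal non-pad token counts.
micro_batches = [[2.0, 2.0, 2.0, 2.0], [8.0]]
accum_steps = len(micro_batches)

# What reduction='mean' followed by `/ accum_steps` computes: an average of means.
mean_of_means = sum(sum(mb) / len(mb) for mb in micro_batches) / accum_steps

# What the un-accumulated batch computes: total loss over total tokens.
total_tokens = sum(len(mb) for mb in micro_batches)
token_weighted = sum(sum(mb) for mb in micro_batches) / total_tokens

# Forgetting the division altogether sums the N means.
undivided = sum(sum(mb) / len(mb) for mb in micro_batches)

print(mean_of_means, token_weighted, undivided)  # 5.0 3.2 10.0
```

The single token in the short micro-batch carries half the weight under mean-of-means, but only one fifth under total-token normalization.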

### O10 — `AdamW(eps=1e-8)` underflows in bf16/fp16 → unbounded updates where `v` is tiny
**Symptom**: a run stable in fp32 develops update spikes/NaNs once optimizer math is half precision; or AdamW behaves as if `eps=0` (huge updates where the second moment `v` is small). Most visible with fp16 optimizer states or foreach/8-bit paths computing `sqrt(v)+eps` in reduced precision.
**Root cause**: the AdamW update is `θ -= lr·m̂/(sqrt(v̂)+eps)`. The default `eps=1e-8` is an fp32 value. fp16's smallest subnormal is about `6e-8`, so `1e-8` rounds to **0**: *"if you use 1e-8 as default and you use 16 bit, it will round to zero."* bf16 keeps fp32's exponent range, so `1e-8` itself is representable, but its 7-bit mantissa absorbs eps inside `sqrt(v̂)+eps` sooner than fp32 does. With `eps≈0`, params whose `v̂≈0` get an unbounded step. (Separate from GradScaler, which protects activations/grads, not this denominator.)
**Fix**: raise eps for half-precision optimizer math — `eps=1e-7` (proposed in pytorch#26218 for fp16) up to `1e-6` for bf16; or keep optimizer states / master weights in **fp32** (FSDP `MixedPrecision`, DeepSpeed bf16 keep an fp32 master) so the default eps stays meaningful. Related: `betas=(0.9,0.999)` averages `v` over ~1000 steps — too slow for short fine-tunes; `0.95` is the common LLM-scale second-moment choice. ([pytorch#26218](https://github.com/pytorch/pytorch/issues/26218), [AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html))
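
The rounding can be checked directly on CPU tensors. This sketch only inspects how the dtypes store and add `eps`; it does not run an optimizer.

```python
import torch

eps = 1e-8
fp16_eps = torch.tensor(eps, dtype=torch.float16).item()
bf16_eps = torch.tensor(eps, dtype=torch.bfloat16).item()
fp16_bigger = torch.tensor(1e-7, dtype=torch.float16).item()

print(fp16_eps)     # 0.0: below fp16's smallest subnormal
print(bf16_eps)     # nonzero: bf16 keeps fp32's exponent range
print(fp16_bigger)  # nonzero: 1e-7 survives as an fp16 subnormal

# bf16 still loses eps inside the sum once sqrt(v) is a few orders larger than eps.
sqrt_v = torch.tensor(1e-3, dtype=torch.bfloat16)
absorbed = bool((sqrt_v + torch.tensor(eps, dtype=torch.bfloat16)) == sqrt_v)
print(absorbed)     # True
```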

### O11 — `fused=True` AdamW breaks under AMP/FSDP; `foreach` inflates peak memory
**Symptom**: `AdamW(fused=True)` raises (e.g. on `_foreach_sub_` of `device_found_inf`) or mis-steps under GradScaler / bf16-mixed / FSDP; **or** the default `foreach` path OOMs at the optimizer step on a model that fit during forward/backward.
**Root cause**: (1) fused AdamW does unscale + step + the inf/NaN check inside one CUDA kernel via `found_inf`; version-specific bugs (pytorch#140514, Lightning#21435) come from that plumbing / FSDP interaction — fused is still the experimental path. (2) `foreach` (the CUDA default when unset) horizontally fuses by allocating intermediates across **all** params at once, raising peak memory at the step vs the for-loop path.
**Fix**: on a fused error/suspicious step under AMP/FSDP/bf16-mixed, fall back to `fused=False` (lets `foreach` default) or upgrade past the fixed issue — confirm a parity loss-curve before trusting fused for a real datapoint. If the **step** OOMs, set `foreach=False` for the low-peak for-loop path (slower, less memory; see oom-memory). Pick deliberately: fused fastest-when-correct, foreach faster than for-loop but higher peak. ([pytorch#140514](https://github.com/pytorch/pytorch/issues/140514), [Lightning#21435](https://github.com/Lightning-AI/pytorch-lightning/issues/21435), [AdamW doc](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html))

---

## Loss-function footguns
### O12 — `softmax` before `nn.CrossEntropyLoss` → double-softmax → diluted gradient, slow/no learning
**Symptom**: a model with a softmax final layer trains far slower than expected, plateaus high, or barely learns; loss is sluggish but not NaN. Classic when porting a Keras/TF model (expects probabilities) to PyTorch, or after "adding softmax to get probabilities."
**Root cause**: `nn.CrossEntropyLoss` internally does `LogSoftmax + NLLLoss` and *"expects ... unnormalized logits."* Feeding already-softmaxed values applies softmax twice; `softmax(softmax(z))` flattens toward uniform because its inputs are confined to `[0,1]`, so the CE gradient w.r.t. the pre-softmax activations becomes small and ill-conditioned. It still trains, just with a near-vanishing signal. An extra `log_softmax` is a different case: log-probabilities pass through CE's internal `LogSoftmax` unchanged, so the loss and gradient match the raw-logit path and the op is redundant rather than harmful.
**Fix**: pass **raw logits** of shape `(N,C)` — remove any `nn.Softmax`/`F.log_softmax`/`nn.LogSoftmax` from the head. Apply softmax only at inference (for probabilities) or argmax (for the class). If you genuinely need log-probs in-graph, use `F.log_softmax` + `nn.NLLLoss` (O16) instead — never both. ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html))
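
A small numeric check of the effect, assuming a single made-up logit vector. It compares the gradient norm with raw logits against the double-softmax path, and confirms that a redundant `log_softmax` leaves the loss unchanged.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[2.0, 0.0, -1.0]], requires_grad=True)  # raw logits
y = torch.tensor([2])                                      # true class


def grad_norm(loss):
    """L2 norm of d(loss)/dz."""
    (g,) = torch.autograd.grad(loss, z)
    return g.norm().item()


raw = F.cross_entropy(z, y)
double = F.cross_entropy(F.softmax(z, dim=1), y)          # softmax applied twice
redundant = F.cross_entropy(F.log_softmax(z, dim=1), y)   # log_softmax applied twice

g_raw, g_double = grad_norm(raw), grad_norm(double)
print(F.softmax(z, dim=1), F.softmax(F.softmax(z, dim=1), dim=1))
print(g_raw, g_double)  # the double-softmax gradient is far smaller
```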

### O13 — `sigmoid` + `nn.BCELoss` → `log(0)=-inf` → NaN; use `nn.BCEWithLogitsLoss` (+`pos_weight`)
**Symptom**: a binary / multi-label head shows NaN or inf loss (often once outputs saturate toward 0/1), or spiky loss; the model has an explicit `torch.sigmoid` before `nn.BCELoss`. Under imbalance it also collapses to always predicting the majority (negative) class.
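**Root cause**: `nn.BCELoss` takes probabilities and computes `log(p)`/`log(1-p)` directly; when the preceding sigmoid saturates (`p`→0 or 1) the exact value is `log(0)=-inf`. PyTorch's `BCELoss` clamps its log outputs at `-100` to keep the loss finite, but by then the saturated sigmoid has already discarded the logit's magnitude and the gradient through the two separate ops is zero or badly scaled; a hand-written `-log(sigmoid(x))` has no clamp and does yield inf/NaN that poisons every param. Two separate ops can't use the stabilized formulation. Plain BCE also weights positives and negatives equally → rare-positive data drives the trivial all-negative solution.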
**Fix**: feed **raw logits** to `nn.BCEWithLogitsLoss` — it fuses sigmoid+BCE with the log-sum-exp trick, avoiding `log(0)`. Remove the explicit sigmoid (apply only at inference). For imbalance pass `pos_weight = #neg/#pos` per class (`>1` raises recall, `<1` raises precision). Target must be **float**, same shape as the logits. ([BCEWithLogitsLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html), [numerical-stability thread](https://discuss.pytorch.org/t/numerical-stability-of-bcewithlogitsloss/8246)) (imbalance *strategy* → by-domain V6.)
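
A sketch comparing the two paths at one saturating logit (the value `40.0` and the class counts are made up). What the two-op path returns at saturation can depend on the PyTorch version, so the comments only state what the fused path gives.

```python
import torch
import torch.nn as nn

target = torch.tensor([0.0])

# Fused path: the raw logit goes straight into BCEWithLogitsLoss.
logit = torch.tensor([40.0], requires_grad=True)
fused = nn.BCEWithLogitsLoss()(logit, target)
fused.backward()
print(fused.item(), logit.grad.item())  # loss ~40, gradient ~1

# Two-op path: sigmoid(40) rounds to exactly 1.0 in fp32 before BCELoss sees it.
logit2 = torch.tensor([40.0], requires_grad=True)
naive = nn.BCELoss()(torch.sigmoid(logit2), target)
naive.backward()
print(naive.item(), logit2.grad.item())  # no longer the true loss or gradient

# Imbalance: pos_weight = #neg / #pos, one entry per class.
n_neg, n_pos = 900.0, 100.0
weighted_loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n_neg / n_pos]))
```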

### O14 — `CrossEntropyLoss` target form: long `(N,)` indices in `[0,C)` vs float `(N,C)` soft; off-by-one → device-side assert
**Symptom**: any of — `RuntimeError: 0D or 1D target tensor expected, multi-target not supported` (one-hot target); `expected scalar type Long but found Float`; `IndexError: Target N is out of bounds` / CUDA `device-side assert ... t >= 0 && t < n_classes` (a label `== C`, or labels `1..C`, or arbitrary ids); or a plausible-but-non-converging loss.
**Root cause**: `nn.CrossEntropyLoss` has **two** target forms. **Class-index** form: target shape `(N,)` (one fewer dim than the `(N,C,...)` input), dtype `long`, every value in `[0,C)`. A `(N,C)` target is read as multiple targets ("multi-target"); a value `==C` (off-by-one from 1-indexed classes) or non-contiguous ids trips the bounds assert — on CUDA an **async** device-side assert that may surface at a later, unrelated line. **Class-probability** form (soft/smoothed/mixup): target must be float, same shape `(N,C,...)`, summing to 1. Mixing them is the error.
**Fix**: hard labels → `targets.long()` of shape `(N,)`; remap ids to contiguous `0..C-1` (`{orig:i for i,orig in enumerate(sorted(set(labels)))}`; subtract 1 if 1-indexed); `assert targets.min()>=0 and targets.max()<C`. Don't one-hot the standard path. Debug the opaque CUDA assert with `CUDA_LAUNCH_BLOCKING=1` (or rerun on CPU) for the real line. Soft labels → a float `(N,C)` distribution (no manual log_softmax). Use `ignore_index` for pad, not an out-of-range sentinel (O15). ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), ["Target N out of bounds" + remap](https://discuss.huggingface.co/t/indexerror-target-4-is-out-of-bounds/10213))
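
The remap and bounds check, written out as a dependency-free sketch (`remap_labels` is a hypothetical helper and the ids are made up):

```python
def remap_labels(labels):
    """Map arbitrary class ids to contiguous 0..C-1; returns (new_labels, mapping)."""
    mapping = {orig: i for i, orig in enumerate(sorted(set(labels)))}
    return [mapping[v] for v in labels], mapping


raw_labels = [3, 7, 7, 10, 3]  # non-contiguous ids
labels, mapping = remap_labels(raw_labels)
num_classes = len(mapping)

assert min(labels) >= 0 and max(labels) < num_classes
print(labels, mapping)  # [0, 1, 1, 2, 0] {3: 0, 7: 1, 10: 2}
```

Persist `mapping` next to the checkpoint so that predictions can be translated back to the original ids; the remapped list then becomes the `long` `(N,)` target.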

### O15 — Padded-token loss: `reduction='mean'` averages over PAD → diluted, length-dependent loss
**Symptom**: a seq/NLP model's loss looks suspiciously small from step 0 and scales with how much padding is in the batch (more pad → lower loss); the model under-learns real tokens; changing batch size or max-length changes the loss magnitude for the same data.
**Root cause**: default `reduction='mean'` divides the summed loss by the **total** element count, **including** padded positions, so the real-token loss is averaged with (near-zero) pad contributions — shrinking reported loss and the effective gradient on real tokens by the pad ratio. Unmasked pad targets also contribute real gradient, teaching the model to predict padding.
**Fix**: skip padding. Easiest: `nn.CrossEntropyLoss(ignore_index=PAD_ID)` — *"the loss is averaged over non-ignored targets"* (sums valid positions, divides by valid count). Otherwise compute `reduction='none'`, multiply by a 0/1 mask, and divide by `mask.sum()` (valid tokens), **not** `mask.numel()`: `loss=(per_tok*mask).sum()/mask.sum().clamp(min=1)`. Reshape logits→`(N*T,C)`, targets→`(N*T,)` first. (Masking the inputs/attention → by-domain L1/L2; this owns the loss **denominator**.) ([CrossEntropyLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), [ignore_index nuance pytorch#63004](https://github.com/pytorch/pytorch/issues/63004))
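
A sketch of the two denominators on toy tensors, assuming `PAD_ID = 0` is never a real target (sizes and seed are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
PAD_ID, N, T, C = 0, 2, 5, 7
logits = torch.randn(N, T, C)
targets = torch.tensor([[3, 4, 1, PAD_ID, PAD_ID],
                        [2, PAD_ID, PAD_ID, PAD_ID, PAD_ID]])

flat_logits, flat_targets = logits.reshape(N * T, C), targets.reshape(N * T)
mask = (flat_targets != PAD_ID).float()
per_tok = F.cross_entropy(flat_logits, flat_targets, reduction="none")

masked_mean = (per_tok * mask).sum() / mask.sum().clamp(min=1)  # valid-token denominator
diluted = (per_tok * mask).sum() / mask.numel()                 # the bug: pads in the denominator
via_ignore = F.cross_entropy(flat_logits, flat_targets, ignore_index=PAD_ID)

print(masked_mean.item(), via_ignore.item(), diluted.item())
```

With 4 valid tokens out of 10 positions, the diluted value is 0.4× the masked mean, and `ignore_index` reproduces the masked mean.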

### O16 — `nn.NLLLoss` fed raw logits (no `log_softmax`) → silently wrong loss
**Symptom**: a model uses `nn.NLLLoss` but has no `LogSoftmax`/`F.log_softmax` before it (or a plain `Softmax`): training "runs" with no error but loss is nonsensical / won't converge, accuracy stuck near chance.
**Root cause**: `nn.NLLLoss` computes **no** softmax — *"the input ... is expected to contain log-probabilities."* It simply gathers `-input[target]`. Raw logits → it negates an arbitrary-scale value; softmax **probabilities** (not log) → it negates a value in `[0,1]` giving a tiny, ill-scaled loss. Either way it isn't cross-entropy and the gradient is wrong, but the shapes are valid so PyTorch can't catch it.
**Fix**: put `F.log_softmax(logits, dim=1)` (or an `nn.LogSoftmax(dim=1)` final layer) immediately before `nn.NLLLoss` (class dim = 1 for `(N,C)`). Simpler and less error-prone: drop NLLLoss+LogSoftmax and use `nn.CrossEntropyLoss` on raw logits (O12), which fuses both. Never pair NLLLoss with a plain (non-log) Softmax. ([NLLLoss doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html))

---

## Fine-tuning / transfer

### O17 — A "frozen" layer keeps changing → `requires_grad=False` but still in the optimizer
**Symptom**: you set `requires_grad=False` on the backbone (or set it **after** building the optimizer over `model.parameters()`), yet the frozen weights keep moving every step; pretrained features drift and degrade though no real gradient flows.
**Root cause**: whether an optimizer touches a param is decided by `param.grad is None`, **not** by `param.requires_grad`. If a frozen param is in the optimizer, after `backward()` its `.grad` is often a **zero tensor** (not `None`), and SGD/Adam apply **weight decay** (`+wd·param`) and **momentum/Adam buffers** *before* the update — so the param moves even on a zero gradient. `requires_grad=False` only stops grad *accumulation*; it does not remove the param from the optimizer.
**Fix**: exclude frozen params from the optimizer at construction — `optim.SGD([p for p in model.parameters() if p.requires_grad], lr=...)` (or per-module param groups). If you froze after building the optimizer, rebuild it, or set `param.grad=None` for the frozen params each step. Freezing correctly = `requires_grad=False` **AND** not in any optimizer param group (and for Norm layers, O18). ([forum: WD/momentum on zero grad](https://discuss.pytorch.org/t/parameters-with-requires-grad-false-are-updated-during-training/90096), [pytorch#679](https://github.com/pytorch/pytorch/issues/679))
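
The mechanism is easy to reproduce on a single toy parameter (values are made up; SGD with weight decay is used here, and per the explanation above the same applies to Adam's buffers):

```python
import torch

p = torch.nn.Parameter(torch.ones(3))
optimizer = torch.optim.SGD([p], lr=0.1, weight_decay=0.5)

p.grad = torch.zeros(3)   # a zero .grad tensor left over from earlier training
p.requires_grad_(False)   # "frozen", but still registered in the optimizer
optimizer.step()
after_decay = p.detach().clone()
print(after_decay)        # 0.95 each: weight decay moved it on a zero gradient

p.grad = None             # with grad=None the optimizer skips the param
optimizer.step()
assert torch.equal(p.detach(), after_decay)
```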

### O18 — Frozen backbone left in `.train()` → BatchNorm `running_mean/var` silently drift
**Symptom**: the backbone is "frozen" (`requires_grad=False`) yet val accuracy is unstable / worse than train, or `eval()` vs `train()`-mode inference disagree; small fine-tuning batches make it worse. The frozen features keep shifting batch-to-batch.
**Root cause**: BatchNorm has two kinds of state — learnable affine (`gamma/beta`, gated by `requires_grad`) and **non-learnable** `running_mean/running_var` buffers updated by the **forward pass whenever the module is in training mode** (default `momentum=0.1`), independent of `requires_grad` and the optimizer. A frozen backbone left in `.train()` therefore overwrites the pretrained BN stats with your (often tiny, domain-shifted) batch stats — so the "frozen" extractor isn't frozen.
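**Fix**: put the frozen Norm layers in eval mode after `model.train()`: loop over `backbone.modules()` and call `m.eval()` on every `nn.BatchNorm1d`/`nn.BatchNorm2d`/`nn.BatchNorm3d` instance. Re-apply after every top-level `model.train()` call (typically each epoch), because it flips children back to train mode. Building the layers with `track_running_stats=False` also stops the drift, but such a layer normalizes with batch statistics in both modes and ignores the pretrained running stats, so it is rarely right for a pretrained backbone. ([BatchNorm2d doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)) (general train/eval-mode bug → O5; tiny-batch BN → by-domain V7.)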
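
A runnable sketch of the eval-mode freeze, assuming a toy backbone; `freeze_batchnorm` is a hypothetical helper name.

```python
import torch
import torch.nn as nn


def freeze_batchnorm(module):
    """Put every BatchNorm layer under `module` in eval mode so running stats stop updating."""
    for m in module.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()


backbone = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))  # toy "pretrained" backbone
model = nn.Sequential(backbone, nn.Linear(4, 2))
for p in backbone.parameters():
    p.requires_grad_(False)

model.train()               # flips the BN layer back to train mode ...
freeze_batchnorm(backbone)  # ... so re-apply the freeze after every model.train()

before = backbone[1].running_mean.clone()
model(torch.randn(8, 4) + 5.0)  # a shifted batch would otherwise move the stats
assert torch.equal(before, backbone[1].running_mean)
```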

### O19 — One global LR wrecks pretrained features (catastrophic forgetting) → discriminative LR + gradual unfreezing
**Symptom**: fine-tuning with a single LR either (too high) destroys the pretrained representations on the first updates and accuracy collapses below a frozen-feature baseline, or (too low) the random new head can't move. Both are the same misconfiguration.
**Root cause**: at step 0 the backbone is near a good optimum but the new head is random, so its large initial loss yields large gradients that, under one high LR, propagate into and overwrite the low-level pretrained layers (catastrophic forgetting). A single LR can't be simultaneously small enough to preserve early layers and large enough to fit the head — the fix is per-group LRs, not more data.
**Fix**: discriminative fine-tuning — per-layer param groups with LR decaying toward the input (head highest, stem lowest), e.g. `AdamW([{'params':head,'lr':1e-3},{'params':backbone,'lr':1e-5}])`. Combine with **gradual unfreezing** (train the head with the backbone frozen first, then unfreeze deeper→shallower) and an LR **warmup** so the random head settles before its gradients reach the backbone. ([Howard & Ruder 2018, ULMFiT — discriminative fine-tuning + gradual unfreezing](https://arxiv.org/abs/1801.06146)) (the general too-high-LR spike → P12.)

### O20 — `load_state_dict(strict=False)` still RuntimeErrors on the replaced head → shape ≠ key mismatch
**Symptom**: you replaced the classifier for a new `num_classes` and pass `strict=False` expecting it to skip the head, but loading still crashes: `RuntimeError: ... size mismatch for fc.weight: copying a param with shape [1000,...] ..., the shape in current model is [N,...]`.
**Root cause**: `strict=False` relaxes only the **presence** check — it tolerates `missing_keys`/`unexpected_keys`. It does **not** relax tensor-shape compatibility: any key present in **both** the checkpoint and the model whose shapes differ (exactly your old-vs-new head `fc.weight/bias`) still raises. So `strict=False` is necessary but not sufficient when the head keeps the same name.
**Fix**: drop the incompatible head entries before loading, then load non-strict — `sd={k:v for k,v in ckpt.items() if not k.startswith('fc.')}; missing,unexpected = model.load_state_dict(sd, strict=False)` — and inspect `missing/unexpected` to confirm only the head is missing. Or give the new head a different attribute name so it never collides. (Save/resume of matching architectures → checkpoint-resume C1–C18.) ([load_state_dict doc](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.load_state_dict), [forum: strict=False ≠ shape](https://discuss.pytorch.org/t/when-load-state-dict-strict-false-do-not-work/82301))
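
End to end on a toy model, where the class `Net`, its layer sizes, and the head name `fc` are stand-ins for a real architecture and checkpoint:

```python
import torch
import torch.nn as nn


class Net(nn.Module):  # toy model; `fc` is the classifier head
    def __init__(self, num_classes):
        super().__init__()
        self.body = nn.Linear(8, 8)
        self.fc = nn.Linear(8, num_classes)


ckpt = Net(num_classes=1000).state_dict()  # stands in for a pretrained checkpoint
model = Net(num_classes=5)

try:
    model.load_state_dict(ckpt, strict=False)
    raised = False
except RuntimeError as err:
    raised = True
    print("strict=False still raises:", str(err).splitlines()[0])

# Drop the head entries, then load non-strict and inspect what was skipped.
sd = {k: v for k, v in ckpt.items() if not k.startswith("fc.")}
missing, unexpected = model.load_state_dict(sd, strict=False)
print(missing, unexpected)  # ['fc.weight', 'fc.bias'] []
```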

### O21 — LoRA/PEFT "barely trains" or reloads random → `target_modules` don't match, head/Norm not in `modules_to_save`
**Symptom**: after `get_peft_model(...)`, `print_trainable_parameters()` shows ~0% (or far fewer than expected) and loss won't drop; or a PEFT classifier reloads with a **random** head / shifted metrics after `save_pretrained`.
**Root cause**: (a) LoRA only wraps modules whose names match `target_modules`, and names are architecture-specific (`q_proj/v_proj` for Llama vs `query/value` for BERT vs `convolution` for resnet) — a wrong/absent name injects no adapter, PEFT just warns "no modules matched," and you train nothing. (b) A newly-initialized task head (`score`/`classifier`) or a base-model BatchNorm's `running_mean/var` are **not** saved unless listed in `modules_to_save` — reload restores the base's random head / original BN stats → garbage / non-reproducible outputs.
**Fix**: enumerate real names with `[n for n,_ in model.named_modules()]` and set `LoraConfig(target_modules=[...])` (or `'all-linear'`); confirm with `model.print_trainable_parameters()` and that you see `lora.Linear` layers. Add the new head and any base Norm layers to `modules_to_save` (e.g. `modules_to_save=['classifier','normalization']`) — or pass the right `task_type` (PEFT auto-adds the standard head). ([PEFT troubleshooting](https://huggingface.co/docs/peft/developer_guides/troubleshooting))
---
## Training-dynamics dashboard — instrument it so the failure is visible
### O22 — Update-to-weight L2 ratio ≈ 1e-3 (the single highest-signal LR dial)
**Symptom**: loss barely moves (under-stepping) or is jittery (over-stepping), and the bare grad-norm can't tell you which — it isn't scale-relative to the weights.
**Root cause**: what matters is the size of the **actual update** relative to the param's own magnitude — `ratio = ||lr·update|| / ||W||`, measured per tensor **after** `step()` (so it folds in lr, momentum, Adam's preconditioning, weight decay). CS231n's heuristic: this should sit ~`1e-3`. Lower → LR too low (weights barely change); higher (`1e-2..1e-1`) → LR too high. Being per-tensor, it exposes individually mis-scaled layers (an embedding moving 100× faster than the trunk) that a global grad-norm hides.
**Fix**: log it every K steps — snapshot `w0={n:p.detach().clone() for n,p in model.named_parameters()}` before `step()`, then `(p.detach()-w0[n]).norm()/(w0[n].norm()+1e-12)` per name after. Lever: `≪1e-3` → raise that group's LR; `≫1e-3` → lower LR / lengthen warmup. Track per param group, not just globally. ([CS231n "Neural Networks 3"](https://cs231n.github.io/neural-networks-3/)) (complements P12/P18.)
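
The snapshot-and-compare recipe above can be sketched as follows. The two-layer model, the AdamW settings, and the helper name `update_ratios` are illustrative assumptions; in a real loop you would call the helper only every K steps, since it clones every parameter.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

def update_ratios(model, optimizer):
    """Run optimizer.step() and return {name: ||W_after - W_before|| / ||W_before||}."""
    before = {n: p.detach().clone() for n, p in model.named_parameters()}
    optimizer.step()
    return {
        n: ((p.detach() - before[n]).norm() / (before[n].norm() + 1e-12)).item()
        for n, p in model.named_parameters()
    }

x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
ratios = update_ratios(model, opt)  # replaces the bare opt.step() on logging steps
for name, r in ratios.items():
    print(f"{name}: {r:.2e}")
```

Because the ratio is taken after `step()`, it already reflects the LR, Adam's preconditioning, and weight decay, so it can be compared across optimizers in a way raw grad-norms cannot.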
### O23 — Log the **actual** per-step LR, not the config value
**Symptom**: you log `cfg.lr` (a constant) so the dashboard LR is flat — yet you're on warmup+cosine. You can't see warmup, decay, a restart, or a frozen scheduler; LR-related loss behavior (spike on ramp, stall at the floor) is invisible.
**Root cause**: the effective LR lives in `optimizer.param_groups[i]['lr']` and is rewritten by `scheduler.step()` each step (and per group for differential/no-decay LRs). Failure modes: plotting the config scalar (never changes); or the O8 order bug skipping the first value. Also, `get_lr()` is meant to be called from inside `scheduler.step()`; called from your logging code it can return a value that is not the LR in use (for recursively-defined schedulers, effectively one step ahead), and PyTorch warns you to use `get_last_lr()` instead.
**Fix**: log `scheduler.get_last_lr()` (a list — one per param group; log them all if you use differential LRs) or read `optimizer.param_groups[0]['lr']` directly, every step. Don't use `get_lr()` for logging. If the logged LR plateaus when it should ramp/decay, your scheduler isn't being stepped (or is stepped at the wrong cadence → O8). ([torch.optim](https://docs.pytorch.org/docs/stable/optim.html), [lr_scheduler source — get_last_lr](https://github.com/pytorch/pytorch/blob/main/torch/optim/lr_scheduler.py))
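
A small sketch of per-step LR logging, assuming a hand-rolled warmup+cosine schedule built with `LambdaLR`. The schedule shape, step counts, and the `logged` list are illustrative; in practice you would send the value to your logger instead of appending to a list.

```python
import math
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

warmup, total = 5, 20

def warmup_cosine(step):
    """Multiplier on the base LR: linear warmup, then cosine decay."""
    if step < warmup:
        return (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = LambdaLR(opt, lr_lambda=warmup_cosine)

logged = []
for step in range(total):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    logged.append(sched.get_last_lr()[0])  # the LR this opt.step() actually used
    sched.step()

print([round(lr, 4) for lr in logged])
```

With several param groups, `get_last_lr()` returns one entry per group, so log the whole list rather than index 0.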
### O24 — GradScaler scale drifting toward 0 = silent persistent fp16 overflow
**Symptom**: an fp16-AMP run looks healthy (loss prints, no crash) but isn't learning or silently skips many optimizer steps — because you never plotted `scaler.get_scale()` and the loss-scale has cratered from 65536 toward ~1 (or sawtooths down).
**Root cause**: GradScaler adapts a multiplicative loss-scale: on any inf/NaN grad it multiplies by `backoff_factor=0.5` **and skips** that `step()`; after `growth_interval=2000` clean steps it multiplies by `growth_factor=2.0` (`init_scale=65536`). A few early backoffs are normal calibration (P5/P10), but a scale that keeps halving and stays low means the forward keeps producing values above fp16's maximum of 65504 → grads overflow → the step is skipped again and again → weights frozen while loss still looks plausible. The config "fp16" tells you nothing; only the live scale reveals it.
**Fix**: add `scaler.get_scale()` and a skipped-step counter to the dashboard. Healthy: a high plateau (`2^13..2^16`) after early calibration. Bad: monotonic decay toward 1, or step-count not advancing with iteration count. Lever when it collapses: switch **fp16 → bf16** (no scaler; fp32 exponent range absorbs the large activations — highest leverage), or keep the overflow-prone block (final logits / attention) in fp32 via a nested `autocast(enabled=False)`, plus z-loss / qk-norm (P15/P16). Don't "fix" it by lowering `init_scale`. ([torch.amp GradScaler](https://docs.pytorch.org/docs/stable/amp.html)) (mechanism → P5/P10.)
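
To see what the healthy and collapsing curves look like, here is a pure-Python replay of the scale rule described above. It is an illustration of the documented rule with the default factors, not the library's implementation; `simulate_scale` and the overflow patterns are hypothetical.

```python
def simulate_scale(overflow_at, steps, init_scale=65536.0,
                   growth_factor=2.0, backoff_factor=0.5, growth_interval=2000):
    """Replay the loss-scale rule: halve and skip on overflow,
    double after growth_interval consecutive clean steps."""
    scale, clean, skipped, history = init_scale, 0, 0, []
    for step in range(steps):
        if overflow_at(step):
            scale *= backoff_factor
            clean = 0
            skipped += 1          # the optimizer step is skipped on this iteration
        else:
            clean += 1
            if clean == growth_interval:
                scale *= growth_factor
                clean = 0
        history.append(scale)
    return history, skipped

# Healthy: a few overflows during early calibration, then clean steps.
healthy, healthy_skips = simulate_scale(lambda s: s < 3, steps=5000)
# Pathological: every step overflows, so the weights never update.
broken, broken_skips = simulate_scale(lambda s: True, steps=16)

print("healthy final scale:", healthy[-1], "skipped:", healthy_skips)
print("broken final scale: ", broken[-1], "skipped:", broken_skips)
```

In a real loop the equivalent signal is `scaler.get_scale()` read after `scaler.update()`; one common way to count skipped steps is to compare the scale before and after `update()` and count every decrease.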
### O25 — Rising dead-ReLU / zero-activation fraction → a slice of the net is permanently off
**Symptom**: capacity quietly vanishes — a layer's outputs are increasingly all-zero, loss plateaus above where it should, and adding width doesn't help. No crash; it just under-fits. Worst case the net degenerates toward a constant function.
**Root cause**: a ReLU whose pre-activation is driven negative for ~all inputs outputs 0 and has **zero** local gradient there, so backprop sends no signal to its incoming weights — the unit is stuck off and unrecoverable. Triggered by too-high LR (a big update pushes weights/bias deep negative) or a large negative bias. Once a large fraction of a layer dies, gradients can't flow through it and that capacity is gone. The same shape (saturation → ~0 gradient → frozen region) applies to sigmoid/tanh tails.
**Fix**: instrument the zero/saturation fraction per activation with a forward hook: `(out==0).float().mean()` for ReLU, `out.abs()>0.99` for tanh, and `(out<0.01)|(out>0.99)` for sigmoid (its range is (0, 1), so an absolute-value test alone misses the lower tail), logged every K steps per layer. Healthy: a stable modest dead fraction (ReLU is sparse by design). Bad: a fraction climbing over training or a layer pinned near ~100% dead. Levers, in order: (1) lower LR (the primary cause); (2) ReLU → LeakyReLU / GELU / SiLU so the negative region keeps a gradient; (3) fix init / large negative biases. ([CS231n "Neural Networks 1": dying ReLU](https://cs231n.github.io/neural-networks-1/)) (the *output* being constant is owned by verifying-dl-experiments; this is the internal mechanism.)
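
A minimal sketch of the forward-hook instrumentation, assuming a toy MLP. The large negative bias is set by hand purely to simulate a dead layer; `dead_fraction` and `make_hook` are illustrative names.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

# Simulate a dying layer: a large negative bias pushes every pre-activation below zero.
with torch.no_grad():
    model[0].bias.fill_(-100.0)

dead_fraction = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Fraction of this batch's outputs that are exactly zero.
        dead_fraction[name] = (output == 0).float().mean().item()
    return hook

handles = [
    m.register_forward_hook(make_hook(name))
    for name, m in model.named_modules() if isinstance(m, nn.ReLU)
]

model(torch.randn(64, 4))
print(dead_fraction)

for h in handles:
    h.remove()  # detach the hooks once you are done logging
```

The number to watch is the trend per layer across training, not a single reading, since a moderate zero fraction is normal for ReLU.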
### O26 — No weight/grad/activation histograms → scalar norms hide bimodal/saturating/collapsing distributions
**Symptom**: scalar dashboards (loss, one grad-norm) look fine yet the model under-performs or destabilizes — a mean/norm hides the shape: activations drifting to a saturated tail, weights collapsing to a spike at 0 (a layer dying, O25), or a gradient distribution growing fat outlier tails all read as an unremarkable scalar.
**Root cause**: norms and means are lossy summaries — a healthy spread and a bimodal/all-saturated/all-zero distribution can share the same L2 norm. The diagnostic signal is the **change in shape over training**, which a scalar can't show.
**Fix**: periodically (every few hundred steps — histograms aren't free) log `SummaryWriter.add_histogram(tag, values, global_step)` for each layer's **weights**, its **gradients** (after `backward`, before `zero_grad`), and key **activations** (forward hook). Read the time-evolution: weights collapsing to a spike = a layer dying; gradient histograms collapsing to ~0 = vanishing (lever: residual/norm/init, P17); fat tails = clip + lower LR (P13/P12); activations wandering into a saturating tail = init/normalization fix (P17). Pair with the scalars above. ([SummaryWriter.add_histogram](https://docs.pytorch.org/docs/stable/tensorboard.html), [Karpathy recipe — visualize weights/activations](https://karpathy.github.io/2019/04/25/recipe/))
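
To make the "same norm, different shape" point concrete, the sketch below builds two tensors with equal L2 norm and bins them with `torch.histc`, which avoids a TensorBoard dependency. The tensors, bin edges, and the `histogram` helper are illustrative assumptions; in a training loop the same tensors would be passed to `add_histogram`.

```python
import torch

torch.manual_seed(0)
n = 10_000
spread = torch.randn(n)                         # healthy bell-shaped spread
collapsed = torch.zeros(n)                      # most entries collapsed to 0 ...
collapsed[:100] = 1.0                           # ... plus a few large outliers
collapsed *= spread.norm() / collapsed.norm()   # rescale to the same L2 norm

def histogram(values, bins=6, lo=-12.0, hi=12.0):
    """Counts per equal-width bin; torch.histc ignores values outside [lo, hi]."""
    return torch.histc(values, bins=bins, min=lo, max=hi).tolist()

same_norm = bool(torch.isclose(spread.norm(), collapsed.norm()))
hist_spread = histogram(spread)
hist_collapsed = histogram(collapsed)

print("norms match:", same_norm)
print("spread   :", hist_spread)
print("collapsed:", hist_collapsed)
```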
---
## Pointers — adjacent mechanics catalogued elsewhere
- **NaN / loss-spike / LR-too-HIGH / grad explosion / z-loss / qk-norm / init & norm placement / determinism** → `references/training/precision-stability.md` (P8–P19). This file is the LR-too-LOW / won't-move side; that one is the blows-up side.
- **OOM from the optimizer step / activation checkpointing / LoRA-QLoRA memory** → `references/training/oom-memory.md` (M5, M12–M13).
- **N-GPU effective batch × LR, DeepSpeed accum double-count, find_unused_parameters** → `references/training/distributed-launch.md` (D11, D18, D8).
- **Dataloader correctness (worker RNG, collate, labels, shuffle) that mimics "won't learn"** → `references/training/data-pipeline.md`.
- **Is the converged number REAL** (collapse, leakage, train-vs-val, metric validity, seed discipline) → **verifying-dl-experiments** (**REQUIRED** — every "is the result real" fork above).