Convergence & optimization debugging — it runs, doesn't crash, but won't learn (or learns badly)

The other training layers cover the run that crashes (oom-memory.md), NaNs (precision-stability.md), hangs (distributed-launch.md), or is slow (throughput-profiling.md). This file owns the quieter, far more common failure: the job runs cleanly to the end but the loss is flat, falls too slowly, or the model underfits — and the bug is in the optimization wiring, not the hardware. Each entry is Symptom → Root cause → Fix with the exact knob. Always start with O1 (overfit one batch) — it separates "the loop is broken" from "the model/data is weak" in five minutes and tells you which half of this file you need.

Boundary: verifying-dl-experiments (REQUIRED at every "is the result real" fork) owns collapse, leakage, metric validity, train-vs-val generalization, and seed interpretation; this file owns the mechanism of why a correct-looking loop doesn't converge. NaN / loss-spike / LR-too-HIGH live next door in precision-stability.md (P8–P18) — this file is the LR-too-LOW / won't-move / mis-wired side.

To jump: grep -in '<keyword>' references/training/convergence-debugging.md (e.g. overfit, requires_grad, no_grad, optimizer, weight decay, adamw, lr finder, scheduler, accum, cross entropy, bcewithlogits, nllloss, freeze, batchnorm, discriminative, lora, update ratio, dead relu).

It isn't learning at all (start here) — O1 overfit-one-batch · O2 params-not-in-optimizer · O3 loss-detached-from-graph · O4 zero_grad/backward/step-order · O5 train()/eval()-mode
Optimizer / LR / weight-decay / schedule — O6 AdamW-vs-Adam+no-decay-group · O7 LR-too-LOW+finder · O8 scheduler-order/cadence · O9 grad-accum-divide · O10 AdamW-eps-in-bf16 · O11 fused/foreach
Loss-function footguns — O12 double-softmax · O13 BCEWithLogits · O14 CE-target-form · O15 padded-loss-reduction · O16 NLLLoss-needs-log_softmax
Fine-tuning / transfer — O17 frozen-but-still-in-optimizer · O18 frozen-BN-running-stats · O19 discriminative-LR/forgetting · O20 strict=False-shape-mismatch · O21 LoRA/PEFT-wiring
Training-dynamics dashboard (instrument it) — O22 update:weight-ratio · O23 actual-LR · O24 GradScaler-scale · O25 dead-ReLU-fraction · O26 weight/grad/act-histograms
Pointers — precision-stability.md, distributed-launch.md, verifying-dl-experiments (skill)

It isn't learning at all — the first-hour triage

O1 — Run the overfit-one-batch smoke BEFORE tuning anything (the canonical correctness test)

Symptom: training "runs" (no error, normal throughput) but loss plateaus near its init value or wanders without trending down, across LRs/optimizers/architectures. You're tuning hyperparameters blind because nothing proves the loop can learn at all. Root cause: the loop is broken somewhere between forward and weight-update (any of O2–O5, or a label/shape bug) and no single test isolates "can this code memorize?" from "is this a modeling/data problem?". Fix: take ONE fixed mini-batch (2 examples is enough) and loop forward/backward/step on that same batch for hundreds of iters — a correct loop drives train loss → ~0. Turn off augmentation, shuffling, dropout, and weight decay for the test. Also "verify the loss at init" (e.g. softmax CE should start near -log(1/n_classes) then fall). If it will not reach ~0, "there is a bug somewhere and we cannot continue" — debug the loop (O2–O5) before touching hyperparameters. (Smoke content/interpretation → verifying-dl-experiments; this is the mechanical gate.) (Karpathy, "A Recipe for Training Neural Networks")
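The "verify the loss at init" check needs no framework. A minimal sketch in plain Python, assuming a classifier whose initial logits are all near zero (so its prediction is uniform); the function names are illustrative and not from any library:

```python
import math

def expected_init_ce(n_classes: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over n_classes."""
    return -math.log(1.0 / n_classes)

def softmax_ce(logits, target):
    """Cross-entropy of one example from raw logits, via log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

# All-zero logits give the uniform distribution, so both numbers must agree.
n_classes = 10
init_loss = softmax_ce([0.0] * n_classes, target=3)
reference = expected_init_ce(n_classes)
```

If the first logged training loss is far from `reference`, that points at the head's init, the label encoding, or the loss wiring rather than at hyperparameters.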

O2 — Loss flat from step 0, weights byte-identical after `step()` → params aren't in the optimizer

Symptom: overfit-one-batch fails; a snapshotted param is unchanged before/after optimizer.step(); grad-norm may even be nonzero. No error. Root cause: the optimizer updates a different set of tensors than the model forwards through. Four causes: (a) the params have requires_grad=False so .grad stays None and step() skips them; (b) a submodule/head was never passed into the optimizer's param iterable; (c) the optimizer was built from model.parameters() before model.to(device), so it holds stale CPU tensors while the model forwards the GPU copies; (d) freeze/unfreeze toggled requires_grad but left the wrong set in the optimizer. Fix: build the optimizer after model.to(device). Assert it sees every trainable param: opt_ids={id(p) for g in optimizer.param_groups for p in g['params']}; assert all(id(p) in opt_ids for p in model.parameters() if p.requires_grad). Log sum(p.requires_grad for p in model.parameters()) at startup. Probe: w0=next(model.parameters()).clone(); <one step>; assert not torch.equal(w0, next(model.parameters())). (autograd notes, torch.optim, stale-optimizer-after-.to bug)

O3 — `backward()` is a no-op / raises "does not require grad" → loss detached from the graph

Symptom: overfit fails with every p.grad is None; or loss.backward() raises "element 0 of tensors does not require grad and does not have a grad_fn". Root cause: the loss tensor was severed from autograd before backward. Common severings: (a) the train forward+loss ran inside with torch.no_grad(): / @torch.inference_mode() left over from eval — "computations in no-grad mode are never recorded in the backward graph"; (b) .item() / .detach() / .cpu().numpy() / float(loss) on the loss path (e.g. back-propping an accumulated total_loss += loss.item()); (c) a tensor rebuilt from numpy mid-network; (d) the metric, not the differentiable loss, was passed to backward(). Fix: before backward, assert loss.requires_grad and loss.grad_fn is not None. Keep the differentiable loss tensor distinct from logging scalars (log loss.item(), back-prop the raw tensor). Reserve no_grad/inference_mode for eval only. After backward, assert at least one p.grad is not None. (autograd notes)

O4 — Wrong `zero_grad` / `backward` / `step` order, or a missing `step()`

Symptom: overfit fails; weights never move, or training is erratic despite nonzero grads. Root cause: PyTorch's contract is "gradients by default add up; to prevent double-counting we explicitly zero them each iteration", backward deposits into .grad, step reads .grad. Failure modes: (a) optimizer.step() omitted → grads computed, weights never updated; (b) zero_grad() placed after backward() → wipes the fresh grads; (c) step() before backward() → steps on stale/zero grads; (d) zero_grad never called → grads from all iters keep summing → effective LR explodes. Fix: the canonical order, exactly — optimizer.zero_grad(set_to_none=True) → forward → loss=loss_fn(out,y) → loss.backward() → optimizer.step() (under AMP: scaler.scale(loss).backward() → scaler.step(optimizer) → scaler.update()). Gradient accumulation is the one exception (O9): backward every micro-step, step+zero_grad only on the boundary. (optimization tutorial, torch.optim)

O5 — Forgot `model.train()` / left `model.eval()` on → Dropout & BatchNorm in the wrong mode

Symptom: two faces — (1) trained under eval(): BN uses frozen running stats and never updates them, Dropout is off → underfits / loss barely moves; (2) evaluated under train(): BN uses noisy per-batch stats and Dropout fires → val loss flickers run-to-run and looks worse than train. Root cause: train()/eval() set a per-module flag that "has an effect only on certain modules ... e.g. Dropout, BatchNorm" (eval() == train(False)). In eval mode BN switches to stored running_mean/var and stops updating them; Dropout becomes identity. A fresh nn.Module defaults to train(), but any prior .eval() (a reused object, an inference helper, a val loop that didn't switch back) persists. Fix: bracket phases explicitly — model.train() atop each train epoch; model.eval() + with torch.no_grad(): for every val/test pass; model.train() again before resuming. After build/load, assert model.training before the train loop. (Frozen-backbone BN is a different axis → O18; tiny-batch BN → by-domain V7.) (nn.Module.train/eval)

Optimizer / learning-rate / weight-decay / schedule

O6 — Weight decay "does nothing" / Norm gains destabilize → coupled `Adam(weight_decay=)` + decaying bias & Norm

Symptom: weight_decay on torch.optim.Adam barely regularizes (or hurts) vs the literature's AdamW recipe; or a from-scratch transformer/CNN trains worse than a reference at the "same" wd; or small models destabilize when LayerNorm/BN gains and biases get shrunk toward 0. Root cause: (1) Adam's weight_decay is classic L2: it is added into the gradient, so it passes through Adam's 1/(sqrt(v)+eps) preconditioner and params with large historical grads get less decay; the effective strength no longer tracks wd. AdamW applies decoupled decay directly to the weight (θ ← θ − lr·wd·θ), outside the moment path, so it is uniform across params and independent of gradient history. It is not lr-independent in PyTorch: the per-step shrink is lr·wd, so an LR change or schedule also rescales the decay. Adam and AdamW are not interchangeable at the same wd. (2) Decaying 1-D params (biases, LayerNorm/BN weight & bias) shrinks Norm gains toward 0; these params add little overfitting capacity, and shrinking them degrades training. Fix: use torch.optim.AdamW, not Adam(weight_decay=...). Split into two param groups with weight_decay=0.0 on the no-decay group. nanoGPT's rule: decay p.dim()>=2 (matmul/embedding weights), no-decay p.dim()<2 (all biases + all LayerNorm weights); HF/timm exclude by name (bias, LayerNorm.weight). (AdamW doc, Loshchilov & Hutter 2017 "Decoupled Weight Decay", nanoGPT configure_optimizers)
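To see why the two forms of weight decay differ, it helps to isolate how far decay alone moves a weight. The sketch below is a hand-written scalar version of the first Adam step (t=1), not PyTorch's implementation; `adam_step` and `decay_effect` are hypothetical helper names and the hyperparameters are arbitrary toy values.

```python
import math

def adam_step(theta, grad, lr, wd, decoupled, eps=1e-8, b1=0.9, b2=0.999):
    """First Adam step (t=1) on a single scalar weight; returns the new weight."""
    if decoupled:
        theta = theta * (1.0 - lr * wd)   # AdamW: shrink the weight directly
    else:
        grad = grad + wd * theta          # Adam L2: fold decay into the gradient
    m = (1 - b1) * grad
    v = (1 - b2) * grad * grad
    m_hat = m / (1 - b1)                  # bias correction at t=1
    v_hat = v / (1 - b2)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

def decay_effect(grad, decoupled, theta=1.0, lr=0.01, wd=0.1):
    """Movement caused by weight decay alone: step without wd minus step with wd."""
    with_wd = adam_step(theta, grad, lr, wd, decoupled)
    without = adam_step(theta, grad, lr, 0.0, decoupled)
    return without - with_wd

l2_effect = decay_effect(grad=1.0, decoupled=False)     # swallowed by the preconditioner
adamw_effect = decay_effect(grad=1.0, decoupled=True)   # lr * wd * theta
```

In this toy step the L2 term is normalized away almost entirely, while the decoupled form shrinks the weight by exactly lr·wd·θ regardless of the gradient.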

O7 — Loss crawls with no NaN → LR is too LOW; find the band with an LR range test

Symptom: no divergence, no NaN, grads finite — loss just falls glacially or plateaus high; throughput is fine but the model "won't learn." Often after copying an LR from a different-batch/optimizer recipe or defaulting to a tiny "safe" LR. (The mirror of P12's too-HIGH spike.) Root cause: the LR sits far below the productive band, so each update is a negligible fraction of the loss-landscape curvature and optimization crawls. The usable band for adaptive optimizers is narrow and architecture-dependent, so a guessed LR is often 1–2 orders of magnitude too small. Distinguishable from vanishing grads — the grad-norm is healthy, just under-applied. Fix: run an LR range test (Smith) — from a tiny LR, multiply it geometrically each batch over ~100–1000 steps, plot loss vs LR, pick ~1 decade below where loss starts to diverge. Tools: pytorch-lr-finder LRFinder(model,opt,crit).range_test(loader,end_lr=1,num_iter=100), fast.ai learn.lr_find(), Lightning Tuner(trainer).lr_find(). Re-run whenever batch size / optimizer / architecture changes — the band moves; then confirm the LR survives warmup without the P12 spike. (Smith 2015 "Cyclical Learning Rates", pytorch-lr-finder, Smith 2018 disciplined-approach)
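The mechanics of a range test can be shown on a toy problem. This is a sketch under strong assumptions: plain SGD on the 1-D quadratic loss = 0.5·c·w², a hypothetical "loss exceeds 4× the best seen" divergence rule, and the "one decade below divergence" pick from the text. Real tools such as pytorch-lr-finder add smoothing and restore model state; none of that is modeled here.

```python
def lr_range_test(curvature=3.0, lr_min=1e-4, lr_max=10.0, num_iter=50, blowup=4.0):
    """Sweep the LR geometrically, one SGD step per LR; stop when the loss blows up."""
    w, best = 1.0, float("inf")
    history = []
    for i in range(num_iter):
        # Geometric schedule: equal multiplicative jump between consecutive steps.
        lr = lr_min * (lr_max / lr_min) ** (i / (num_iter - 1))
        w -= lr * curvature * w               # gradient of 0.5*c*w^2 is c*w
        loss = 0.5 * curvature * w * w
        history.append((lr, loss))
        if loss > blowup * best:
            return history, lr                # LR at which the loss diverged
        best = min(best, loss)
    return history, None

history, diverge_lr = lr_range_test()
suggested_lr = diverge_lr / 10.0              # about one decade below divergence
```

For this quadratic, SGD is stable only below 2/c, so the suggested value lands inside the stable band with margin.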

O8 — `lr_scheduler.step()` before `optimizer.step()` skips the first LR; per-step vs per-epoch cadence

Symptom: PyTorch warns "Detected call of lr_scheduler.step() before optimizer.step()" and the LR curve is off-by-one; OR a cosine/warmup schedule sized in optimizer steps barely moves (stepped per-epoch) or decays to ~0 in one epoch (per-step schedule stepped per-batch under accumulation). Root cause: (1) since PyTorch 1.1 the scheduler must step after the optimizer — "if you ... call scheduler.step() before the optimizer's update ... this will skip the first value of the learning rate schedule." (2) A scheduler advances one tick per .step(); schedulers built around total_steps/num_training_steps in optimizer steps (OneCycleLR, HF get_cosine_schedule_with_warmup, Lightning interval='step') must be stepped every optimizer step, and under accumulation an "optimizer step" ≠ a batch. Fix: order it optimizer.step(); scheduler.step(). Step at the granularity its total_steps was computed in — per optimizer step for warmup/cosine/OneCycle (inside the if (i+1)%accum==0 block, not every micro-batch), per epoch only for epoch schedulers. HF Trainer steps it automatically — don't also step it manually. (torch.optim — scheduler order, OneCycleLR)
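The cadence mismatch is easy to quantify by counting. A small sketch in plain Python, assuming the `(i+1) % accum == 0` boundary rule quoted above and no step on a trailing partial window; `count_ticks` is an illustrative name:

```python
def count_ticks(num_batches, accum_steps, epochs, tick_every_micro_batch):
    """Count optimizer steps and scheduler ticks for one training run."""
    opt_steps = sched_ticks = 0
    for _ in range(epochs):
        for i in range(num_batches):
            boundary = (i + 1) % accum_steps == 0
            if boundary:
                opt_steps += 1                 # optimizer.step() happens here
            if boundary or tick_every_micro_batch:
                sched_ticks += 1               # scheduler.step() happens here
    return opt_steps, sched_ticks

# 100 batches per epoch, 4-way accumulation, 3 epochs.
correct = count_ticks(100, 4, 3, tick_every_micro_batch=False)
wrong = count_ticks(100, 4, 3, tick_every_micro_batch=True)
```

A schedule sized for 75 optimizer steps but ticked 300 times reaches its end a quarter of the way through training, which is the "decays to ~0 too early" symptom.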

O9 — Gradient accumulation gives effective N×LR → divide the loss by `accum_steps` (and normalize per token)

Symptom: switching from batch B to (micro-batch B/N × N accumulation) "at the same config" trains hotter/diverges — loss/grad magnitude ~N× too big, i.e. you silently get N× the LR. For token tasks the accumulated loss also differs from the un-accumulated run even after /N when micro-batches hold unequal #non-pad tokens. Root cause: each micro-batch loss is reduction='mean'; backward adds grads across the N micro-batches, so the accumulated grad = SUM of N mean-grads = N× the full-batch mean grad → stepping on it ≈ N× LR. Subtler: dividing each mean-loss by N still mis-weights tokens when micro-batches have different valid-token counts (average-of-means ≠ total-loss / total-tokens) — HF found and fixed exactly this in transformers in 2024. Fix: divide before backward — loss = loss_fn(out,y) / accum_steps; loss.backward(), with step()/zero_grad() only on the boundary. For token-level losses, normalize by the total non-pad tokens across the accumulation window (accumulate reduction='sum', divide by total tokens), not the mean-of-means. Under DDP wrap non-boundary micro-steps in with model.no_sync(): to skip the all-reduce (correctness-neutral, perf win). (DeepSpeed double-counts accum in some configs → D18; world-size×batch → D11.) (HF "Fixing Gradient Accumulation", DDP no_sync)
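The three normalizations can be compared with simple arithmetic. The per-token losses below are made-up toy numbers chosen so that the two micro-batches hold different token counts:

```python
# Two micro-batches of per-token losses with unequal numbers of valid tokens.
micro_batches = [[2.0, 2.0, 2.0, 2.0, 2.0, 2.0], [8.0, 8.0]]
n = len(micro_batches)

# No division at all: backward() sums N mean-losses, i.e. N x the intended scale.
unscaled_sum = sum(sum(mb) / len(mb) for mb in micro_batches)

# Each mean-loss divided by N: right scale, but short micro-batches are over-weighted.
mean_of_means = unscaled_sum / n

# Sum of all token losses over the total token count: matches the un-accumulated batch.
token_normalized = sum(sum(mb) for mb in micro_batches) / sum(len(mb) for mb in micro_batches)
```

Here the mean-of-means is 5.0 while the token-normalized loss is 3.5, because the two-token micro-batch counts as much as the six-token one.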

O10 — `AdamW(eps=1e-8)` underflows in bf16/fp16 → unbounded updates where `v` is tiny

Symptom: a run stable in fp32 develops update spikes/NaNs once optimizer math is half precision; or AdamW behaves as if eps=0 (huge updates where the second moment v is small). Most visible with fp16 optimizer states or foreach/8-bit paths computing sqrt(v)+eps in reduced precision. Root cause: the AdamW update is θ -= lr·m̂/(sqrt(v̂)+eps). The default eps=1e-8 is an fp32 value. In fp16 it lies below the smallest subnormal (2^-24 ≈ 6e-8) and rounds to 0: "if you use 1e-8 as default and you use 16 bit, it will round to zero." bf16 keeps fp32's exponent range, so 1e-8 itself is representable, but with only 8 bits of precision it is absorbed in sqrt(v̂)+eps unless sqrt(v̂) is within a few hundred times eps. With eps≈0, params whose v̂≈0 get an unbounded step. (Separate from GradScaler, which protects activations/grads, not this denominator.) Fix: raise eps for half-precision optimizer math, from eps=1e-7 (proposed in pytorch#26218 for fp16) up to 1e-6 for bf16; or keep optimizer states / master weights in fp32 (FSDP MixedPrecision, DeepSpeed bf16 keep an fp32 master) so the default eps stays meaningful. Related: betas=(0.9,0.999) averages v over ~1000 steps, which is too slow for short fine-tunes; 0.95 is the common LLM-scale second-moment choice. (pytorch#26218, AdamW doc)
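The rounding behaviour can be checked with the standard library alone. In this sketch `to_fp16` uses `struct`'s IEEE half-precision format, and `to_bf16` is an approximation that truncates a float32 to its top 16 bits (real hardware usually rounds to nearest, so treat the bf16 half as illustrative):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Approximate bf16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

eps_fp16 = to_fp16(1e-8)          # below the smallest fp16 subnormal
eps_fp16_raised = to_fp16(1e-7)   # survives as a subnormal
eps_bf16 = to_bf16(1e-8)          # representable thanks to the fp32 exponent range

# In bf16 the eps term vanishes once sqrt(v) is much larger than eps.
absorbed = to_bf16(1e-3 + 1e-8) == to_bf16(1e-3)
```

The fp16 case is the dangerous one: the denominator really becomes sqrt(v̂)+0. In bf16, eps only disappears where sqrt(v̂) already dominates it.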

O11 — `fused=True` AdamW breaks under AMP/FSDP; `foreach` inflates peak memory

Symptom: AdamW(fused=True) raises (e.g. on _foreach_sub_ of device_found_inf) or mis-steps under GradScaler / bf16-mixed / FSDP; or the default foreach path OOMs at the optimizer step on a model that fit during forward/backward. Root cause: (1) fused AdamW does unscale + step + the inf/NaN check inside one CUDA kernel via found_inf; version-specific bugs (pytorch#140514, Lightning#21435) come from that plumbing / FSDP interaction — fused is still the experimental path. (2) foreach (the CUDA default when unset) horizontally fuses by allocating intermediates across all params at once, raising peak memory at the step vs the for-loop path. Fix: on a fused error/suspicious step under AMP/FSDP/bf16-mixed, fall back to fused=False (lets foreach default) or upgrade past the fixed issue — confirm a parity loss-curve before trusting fused for a real datapoint. If the step OOMs, set foreach=False for the low-peak for-loop path (slower, less memory; see oom-memory). Pick deliberately: fused fastest-when-correct, foreach faster than for-loop but higher peak. (pytorch#140514, Lightning#21435, AdamW doc)

Loss-function footguns

O12 — `softmax` before `nn.CrossEntropyLoss` → double-softmax → diluted gradient, slow/no learning

Symptom: a model with a softmax final layer trains far slower than expected, plateaus high, or barely learns; loss is sluggish but not NaN. Classic when porting a Keras/TF model (expects probabilities) to PyTorch, or after "adding softmax to get probabilities." Root cause: nn.CrossEntropyLoss internally does LogSoftmax + NLLLoss and "expects ... unnormalized logits." Feeding already-softmaxed values applies softmax twice; softmax(softmax(z)) flattens toward uniform, shrinking the logit dynamic range, so the CE gradient w.r.t. the pre-softmax activations becomes small and ill-conditioned. It still trains, just with a near-vanishing signal. An extra F.log_softmax/nn.LogSoftmax is a different case: log_softmax is idempotent (log-probabilities already have log-sum-exp 0), so the loss is unchanged and the layer is only redundant. Fix: pass raw logits of shape (N,C) and remove any nn.Softmax from the head (drop a redundant log_softmax too, so the head emits plain logits). Apply softmax only at inference (for probabilities) or argmax (for the class). If you genuinely need log-probs in-graph, use F.log_softmax + nn.NLLLoss (O16) instead; never feed plain softmax output to either loss. (CrossEntropyLoss doc)
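The flattening effect is visible on a single logit vector. A dependency-free sketch with hand-written `softmax` and `log_softmax` helpers (not the PyTorch functions) and an arbitrary example vector:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

logits = [6.0, 0.0, -2.0]
once = softmax(logits)     # confident: the top class holds almost all the mass
twice = softmax(once)      # what the loss sees if the head already applied softmax

# log_softmax applied twice gives the same values as applying it once.
lp_once = log_softmax(logits)
lp_twice = log_softmax(lp_once)
```

Inputs to the second softmax are confined to [0, 1], so its output can never be sharper than roughly e : 1 between any two classes. The loss therefore stays high even when the model is right.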

O13 — `sigmoid` + `nn.BCELoss` → `log(0)=-inf` → NaN; use `nn.BCEWithLogitsLoss` (+`pos_weight`)

Symptom: a binary / multi-label head shows NaN or inf loss (often once outputs saturate toward 0/1), or spiky loss; the model has an explicit torch.sigmoid before nn.BCELoss. Under imbalance it also collapses to always predicting the majority (negative) class. Root cause: BCE on probabilities computes log(p)/log(1-p) directly; when the preceding sigmoid saturates (p→0 or 1) the log term hits log(0)=-inf. Current nn.BCELoss clamps its log outputs at -100, which keeps the loss finite but leaves the saturated sigmoid with ~0 gradient; a hand-written -(y·log p + (1-y)·log(1-p)) has no clamp and yields inf/NaN that poisons every param. Two separate ops can't use the stabilized formulation. Plain BCE also weights positives and negatives equally, so rare-positive data drives the trivial all-negative solution. Fix: feed raw logits to nn.BCEWithLogitsLoss, which fuses sigmoid+BCE with the log-sum-exp trick and avoids log(0). Remove the explicit sigmoid (apply only at inference). For imbalance pass pos_weight = #neg/#pos per class (>1 raises recall, <1 raises precision). Target must be float, same shape as the logits. (BCEWithLogitsLoss doc, numerical-stability thread) (imbalance strategy → by-domain V6.)
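The difference between the two formulations shows up on one saturated logit. This is a scalar sketch in plain Python, not the PyTorch kernels: `bce_naive` mirrors a hand-written sigmoid-then-log loss, and `bce_with_logits` uses the standard stable rearrangement with an optional `pos_weight`. In plain Python `math.log(0.0)` raises `ValueError` where a tensor library would return -inf.

```python
import math

def bce_naive(logit, target):
    """Sigmoid followed by log: breaks once the sigmoid saturates to exactly 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target, pos_weight=1.0):
    """Stable form: never evaluates log of a saturated probability."""
    # softplus(-x) = log(1 + exp(-x)), computed without overflow.
    softplus_neg = max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    # -[pw*y*log(sig(x)) + (1-y)*log(1-sig(x))] rearranged in terms of softplus(-x).
    return (1 - target) * logit + (1 + (pos_weight - 1) * target) * softplus_neg

try:
    bce_naive(40.0, 0.0)          # sigmoid(40) rounds to 1.0, so log(1 - p) is log(0)
    naive_failed = False
except ValueError:
    naive_failed = True

stable_loss = bce_with_logits(40.0, 0.0)   # finite, about 40
```

With `pos_weight=3.0` the loss on a positive example is three times the unweighted one, which is how the rare class gets more pull.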

O14 — `CrossEntropyLoss` target form: long `(N,)` indices in `[0,C)` vs float `(N,C)` soft; off-by-one → device-side assert

Symptom: any of — RuntimeError: 0D or 1D target tensor expected, multi-target not supported (one-hot target); expected scalar type Long but found Float; IndexError: Target N is out of bounds / CUDA device-side assert ... t >= 0 && t < n_classes (a label == C, or labels 1..C, or arbitrary ids); or a plausible-but-non-converging loss. Root cause: nn.CrossEntropyLoss has two target forms. Class-index form: target shape (N,) (one fewer dim than the (N,C,...) input), dtype long, every value in [0,C). A (N,C) target is read as multiple targets ("multi-target"); a value ==C (off-by-one from 1-indexed classes) or non-contiguous ids trips the bounds assert — on CUDA an async device-side assert that may surface at a later, unrelated line. Class-probability form (soft/smoothed/mixup): target must be float, same shape (N,C,...), summing to 1. Mixing them is the error. Fix: hard labels → targets.long() of shape (N,); remap ids to contiguous 0..C-1 ({orig:i for i,orig in enumerate(sorted(set(labels)))}; subtract 1 if 1-indexed); assert targets.min()>=0 and targets.max()<C. Don't one-hot the standard path. Debug the opaque CUDA assert with CUDA_LAUNCH_BLOCKING=1 (or rerun on CPU) for the real line. Soft labels → a float (N,C) distribution (no manual log_softmax). Use ignore_index for pad, not an out-of-range sentinel (O15). (CrossEntropyLoss doc, "Target N out of bounds" + remap)
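The remap-and-assert step from the fix, written out on a toy label list (the ids 10/20/30 are arbitrary examples of non-contiguous class ids):

```python
# Raw labels with arbitrary, non-contiguous ids.
labels = [10, 30, 30, 20, 10]

# Map each distinct id to a contiguous index 0..C-1, in sorted order.
mapping = {orig: i for i, orig in enumerate(sorted(set(labels)))}
targets = [mapping[y] for y in labels]
n_classes = len(mapping)

# The same bounds check the loss enforces, but run up front with a readable error.
assert min(targets) >= 0 and max(targets) < n_classes, "label outside [0, C)"
```

Keep `mapping` with the checkpoint: predictions come back as contiguous indices and need the inverse map to recover the original ids.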

O15 — Padded-token loss: `reduction='mean'` averages over PAD → diluted, length-dependent loss

Symptom: a seq/NLP model's loss looks suspiciously small from step 0 and scales with how much padding is in the batch (more pad → lower loss); the model under-learns real tokens; changing batch size or max-length changes the loss magnitude for the same data. Root cause: default reduction='mean' divides the summed loss by the total element count, including padded positions, so the real-token loss is averaged with (near-zero) pad contributions — shrinking reported loss and the effective gradient on real tokens by the pad ratio. Unmasked pad targets also contribute real gradient, teaching the model to predict padding. Fix: skip padding. Easiest: nn.CrossEntropyLoss(ignore_index=PAD_ID) — "the loss is averaged over non-ignored targets" (sums valid positions, divides by valid count). Otherwise compute reduction='none', multiply by a 0/1 mask, and divide by mask.sum() (valid tokens), not mask.numel(): loss=(per_tok*mask).sum()/mask.sum().clamp(min=1). Reshape logits→(N*T,C), targets→(N*T,) first. (Masking the inputs/attention → by-domain L1/L2; this owns the loss denominator.) (CrossEntropyLoss doc, ignore_index nuance pytorch#63004)
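The denominator effect in numbers. The per-token losses are hypothetical toy values, with the pad positions given a small loss to mimic a model that predicts padding easily:

```python
PAD_ID = 0
targets = [5, 7, 9, PAD_ID, PAD_ID, PAD_ID]
per_tok = [2.0, 4.0, 6.0, 0.1, 0.1, 0.1]   # toy per-token losses (reduction='none')

mask = [0.0 if t == PAD_ID else 1.0 for t in targets]

# Default mean: divides by every position, padding included.
naive_mean = sum(per_tok) / len(per_tok)

# Masked mean: sums valid positions only and divides by the valid count.
valid = max(sum(mask), 1.0)                 # guard against an all-pad batch
masked_mean = sum(l * m for l, m in zip(per_tok, mask)) / valid
```

Half the positions are padding here, so the naive mean reports roughly half of the real-token loss (about 2.05 against 4.0).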

O16 — `nn.NLLLoss` fed raw logits (no `log_softmax`) → silently wrong loss

Symptom: a model uses nn.NLLLoss but has no LogSoftmax/F.log_softmax before it (or a plain Softmax): training "runs" with no error but loss is nonsensical / won't converge, accuracy stuck near chance. Root cause: nn.NLLLoss computes no softmax — "the input ... is expected to contain log-probabilities." It simply gathers -input[target]. Raw logits → it negates an arbitrary-scale value; softmax probabilities (not log) → it negates a value in [0,1] giving a tiny, ill-scaled loss. Either way it isn't cross-entropy and the gradient is wrong, but the shapes are valid so PyTorch can't catch it. Fix: put F.log_softmax(logits, dim=1) (or an nn.LogSoftmax(dim=1) final layer) immediately before nn.NLLLoss (class dim = 1 for (N,C)). Simpler and less error-prone: drop NLLLoss+LogSoftmax and use nn.CrossEntropyLoss on raw logits (O12), which fuses both. Never pair NLLLoss with a plain (non-log) Softmax. (NLLLoss doc)
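What the gather does to each kind of input, on one example. A plain-Python sketch: `nll` is a one-line stand-in for the gather that NLLLoss performs, and the logits are arbitrary.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def nll(inputs, target):
    """Stand-in for NLLLoss on one example: just -inputs[target]."""
    return -inputs[target]

logits, target = [2.0, -1.0, 0.5], 0
on_logits = nll(logits, target)                 # negative "loss", unbounded below
on_probs = nll(softmax(logits), target)         # stuck in (-1, 0)
on_log_probs = nll(log_softmax(logits), target) # the actual cross-entropy, >= 0
```

A training loss that goes negative, or hovers just below zero, is a quick tell that NLLLoss is receiving something other than log-probabilities.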

Fine-tuning / transfer

O17 — A "frozen" layer keeps changing → `requires_grad=False` but still in the optimizer

Symptom: you set requires_grad=False on the backbone (or set it after building the optimizer over model.parameters()), yet the frozen weights keep moving every step; pretrained features drift and degrade though no real gradient flows. Root cause: whether an optimizer touches a param is decided by param.grad is None, not by param.requires_grad. A param frozen from the start keeps .grad=None and is skipped. But if it trained before being frozen and grads are cleared with zero_grad(set_to_none=False) (the default before PyTorch 2.0), its .grad is a zero tensor rather than None, and SGD/Adam still apply weight decay (+wd·param) and momentum/Adam buffers before the update, so the param moves even on a zero gradient. requires_grad=False only stops grad accumulation; it does not remove the param from the optimizer. Fix: exclude frozen params from the optimizer at construction: optim.SGD([p for p in model.parameters() if p.requires_grad], lr=...) (or per-module param groups). If you froze after building the optimizer, rebuild it, or set param.grad=None for the frozen params each step. Freezing correctly = requires_grad=False AND not in any optimizer param group (and for Norm layers, O18). (forum: WD/momentum on zero grad, pytorch#679)

O18 — Frozen backbone left in `.train()` → BatchNorm `running_mean/var` silently drift

Symptom: the backbone is "frozen" (requires_grad=False) yet val accuracy is unstable / worse than train, or eval() vs train()-mode inference disagree; small fine-tuning batches make it worse. The frozen features keep shifting batch-to-batch. Root cause: BatchNorm has two kinds of state: learnable affine (gamma/beta, gated by requires_grad) and non-learnable running_mean/running_var buffers updated by the forward pass whenever the module is in training mode (default momentum=0.1), independent of requires_grad and the optimizer. A frozen backbone left in .train() therefore overwrites the pretrained BN stats with your (often tiny, domain-shifted) batch stats, so the "frozen" extractor isn't frozen. Fix: put the frozen Norm layers in eval mode after model.train(): loop over backbone.modules() and call m.eval() on every instance of nn.BatchNorm1d / nn.BatchNorm2d / nn.BatchNorm3d. Building them with track_running_stats=False also stops the drift, but it drops the running buffers and normalizes with batch statistics even in eval, so it does not preserve the pretrained stats. Re-apply after every model.train() call (typically every epoch), because a top-level model.train() flips children back. (BatchNorm2d doc) (general train/eval-mode bug → O5; tiny-batch BN → by-domain V7.)

O19 — One global LR wrecks pretrained features (catastrophic forgetting) → discriminative LR + gradual unfreezing

Symptom: fine-tuning with a single LR either (too high) destroys the pretrained representations on the first updates and accuracy collapses below a frozen-feature baseline, or (too low) the random new head can't move. Both are the same misconfiguration. Root cause: at step 0 the backbone is near a good optimum but the new head is random, so its large initial loss yields large gradients that, under one high LR, propagate into and overwrite the low-level pretrained layers (catastrophic forgetting). A single LR can't be simultaneously small enough to preserve early layers and large enough to fit the head — the fix is per-group LRs, not more data. Fix: discriminative fine-tuning — per-layer param groups with LR decaying toward the input (head highest, stem lowest), e.g. AdamW([{'params':head,'lr':1e-3},{'params':backbone,'lr':1e-5}]). Combine with gradual unfreezing (train the head with the backbone frozen first, then unfreeze deeper→shallower) and an LR warmup so the random head settles before its gradients reach the backbone. (Howard & Ruder 2018, ULMFiT — discriminative fine-tuning + gradual unfreezing) (the general too-high-LR spike → P12.)

O20 — `load_state_dict(strict=False)` still RuntimeErrors on the replaced head → shape ≠ key mismatch

Symptom: you replaced the classifier for a new num_classes and pass strict=False expecting it to skip the head, but loading still crashes: RuntimeError: ... size mismatch for fc.weight: copying a param with shape [1000,...] ..., the shape in current model is [N,...]. Root cause: strict=False relaxes only the presence check — it tolerates missing_keys/unexpected_keys. It does not relax tensor-shape compatibility: any key present in both the checkpoint and the model whose shapes differ (exactly your old-vs-new head fc.weight/bias) still raises. So strict=False is necessary but not sufficient when the head keeps the same name. Fix: drop the incompatible head entries before loading, then load non-strict — sd={k:v for k,v in ckpt.items() if not k.startswith('fc.')}; missing,unexpected = model.load_state_dict(sd, strict=False) — and inspect missing/unexpected to confirm only the head is missing. Or give the new head a different attribute name so it never collides. (Save/resume of matching architectures → checkpoint-resume C1–C18.) (load_state_dict doc, forum: strict=False ≠ shape)

O21 — LoRA/PEFT "barely trains" or reloads random → `target_modules` don't match, head/Norm not in `modules_to_save`

Symptom: after get_peft_model(...), print_trainable_parameters() shows ~0% (or far fewer than expected) and loss won't drop; or a PEFT classifier reloads with a random head / shifted metrics after save_pretrained. Root cause: (a) LoRA only wraps modules whose names match target_modules, and names are architecture-specific (q_proj/v_proj for Llama vs query/value for BERT vs convolution for resnet) — a wrong/absent name injects no adapter, PEFT just warns "no modules matched," and you train nothing. (b) A newly-initialized task head (score/classifier) or a base-model BatchNorm's running_mean/var are not saved unless listed in modules_to_save — reload restores the base's random head / original BN stats → garbage / non-reproducible outputs. Fix: enumerate real names with [n for n,_ in model.named_modules()] and set LoraConfig(target_modules=[...]) (or 'all-linear'); confirm with model.print_trainable_parameters() and that you see lora.Linear layers. Add the new head and any base Norm layers to modules_to_save (e.g. modules_to_save=['classifier','normalization']) — or pass the right task_type (PEFT auto-adds the standard head). (PEFT troubleshooting)

Training-dynamics dashboard — instrument it so the failure is visible

O22 — Update-to-weight L2 ratio ≈ 1e-3 (the single highest-signal LR dial)

Symptom: loss barely moves (under-stepping) or is jittery (over-stepping), and the bare grad-norm can't tell you which — it isn't scale-relative to the weights. Root cause: what matters is the size of the actual update relative to the param's own magnitude — ratio = ||lr·update|| / ||W||, measured per tensor after step() (so it folds in lr, momentum, Adam's preconditioning, weight decay). CS231n's heuristic: this should sit ~1e-3. Lower → LR too low (weights barely change); higher (1e-2..1e-1) → LR too high. Being per-tensor, it exposes individually mis-scaled layers (an embedding moving 100× faster than the trunk) that a global grad-norm hides. Fix: log it every K steps — snapshot w0={n:p.detach().clone() for n,p in model.named_parameters()} before step(), then (p.detach()-w0[n]).norm()/(w0[n].norm()+1e-12) per name after. Lever: ≪1e-3 → raise that group's LR; ≫1e-3 → lower LR / lengthen warmup. Track per param group, not just globally. (CS231n "Neural Networks 3") (complements P12/P18.)
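The ratio itself is a two-norm computation. A framework-free sketch on one flattened weight tensor, assuming a plain SGD step with made-up numbers; in a real loop the "before" and "after" lists would be the snapshots taken around step():

```python
import math

def l2(xs):
    return math.sqrt(sum(x * x for x in xs))

def update_ratio(before, after):
    """||after - before|| / ||before|| for one parameter tensor (flattened)."""
    delta = [a - b for a, b in zip(after, before)]
    return l2(delta) / (l2(before) + 1e-12)

# Toy tensor with norm 5, and a gradient that happens to equal the weights.
w_before = [3.0, 4.0]
grad = [3.0, 4.0]
lr = 1e-3
w_after = [w - lr * g for w, g in zip(w_before, grad)]

ratio = update_ratio(w_before, w_after)   # lr * ||grad|| / ||w|| for plain SGD
```

Computing this per named parameter, rather than over the whole model, is what exposes a single layer moving far faster or slower than the rest.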

O23 — Log the actual per-step LR, not the config value

Symptom: you log cfg.lr (a constant) so the dashboard LR is flat — yet you're on warmup+cosine. You can't see warmup, decay, a restart, or a frozen scheduler; LR-related loss behavior (spike on ramp, stall at the floor) is invisible. Root cause: the effective LR lives in optimizer.param_groups[i]['lr'] and is rewritten by scheduler.step() each step (and per group for differential/no-decay LRs). Failure modes: plotting the config scalar (never changes); or the O8 order bug skipping the first value. Also get_lr() returns a value "one step ahead" — reading it instead of get_last_lr() logs the wrong number. Fix: log scheduler.get_last_lr() (a list — one per param group; log them all if you use differential LRs) or read optimizer.param_groups[0]['lr'] directly, every step. Don't use get_lr() for logging. If the logged LR plateaus when it should ramp/decay, your scheduler isn't being stepped (or is stepped at the wrong cadence → O8). (torch.optim, lr_scheduler source — get_last_lr)

O24 — GradScaler scale drifting toward 0 = silent persistent fp16 overflow

Symptom: an fp16-AMP run looks healthy (loss prints, no crash) but isn't learning or silently skips many optimizer steps — because you never plotted scaler.get_scale() and the loss-scale has cratered from 65536 toward ~1 (or sawtooths down). Root cause: GradScaler adapts a multiplicative loss-scale: on any inf/NaN grad it multiplies by backoff_factor=0.5 and skips that step(); after growth_interval=2000 clean steps it multiplies by growth_factor=2.0 (init_scale=65536). A few early backoffs are normal calibration (P5/P10), but a scale that keeps halving and stays low means the forward keeps producing values > fp16's 65504 → grads overflow → step skipped every step → weights frozen while loss still looks plausible. The config "fp16" tells you nothing; only the live scale reveals it. Fix: add scaler.get_scale() and a skipped-step counter to the dashboard. Healthy: a high plateau (2^13..2^16) after early calibration. Bad: monotonic decay toward 1, or step-count not advancing with iteration count. Lever when it collapses: switch fp16 → bf16 (no scaler; fp32 exponent range absorbs the large activations — highest leverage), or keep the overflow-prone block (final logits / attention) in fp32 via a nested autocast(enabled=False), plus z-loss / qk-norm (P15/P16). Don't "fix" it by lowering init_scale. (torch.amp GradScaler) (mechanism → P5/P10.)
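The backoff/growth rule described above can be simulated to see what a healthy and a collapsing scale trace look like. This is a hand-written model of the rule using the default values quoted in the text, not the GradScaler source; `simulate_scale` is an illustrative name.

```python
def simulate_scale(overflow_flags, init_scale=65536.0, growth_factor=2.0,
                   backoff_factor=0.5, growth_interval=2000):
    """Replay the loss-scale update rule over a sequence of per-step overflow flags."""
    scale, clean_steps, skipped = init_scale, 0, 0
    trace = []
    for overflow in overflow_flags:
        if overflow:
            scale *= backoff_factor       # inf/NaN grad: halve the scale, skip the step
            clean_steps = 0
            skipped += 1
        else:
            clean_steps += 1
            if clean_steps == growth_interval:
                scale *= growth_factor    # long clean streak: try a larger scale
                clean_steps = 0
        trace.append(scale)
    return trace, skipped

# Healthy: two early calibration overflows, then clean steps.
healthy_trace, healthy_skipped = simulate_scale([True, True] + [False] * 4000)

# Unhealthy: every step overflows, so every step is skipped.
bad_trace, bad_skipped = simulate_scale([True] * 16)
```

In the unhealthy trace the skipped-step count equals the iteration count, which is why the text recommends logging both the scale and the skip counter.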

O25 — Rising dead-ReLU / zero-activation fraction → a slice of the net is permanently off

Symptom: capacity quietly vanishes: a layer's outputs are increasingly all-zero, loss plateaus above where it should, and adding width doesn't help. No crash; it just under-fits. Worst case the net degenerates toward a constant function. Root cause: a ReLU whose pre-activation is driven negative for ~all inputs outputs 0 and has zero local gradient there, so backprop sends no signal to its incoming weights and the unit stays off, in practice unrecoverably. Triggered by too-high LR (a big update pushes weights/bias deep negative) or a large negative bias. Once a large fraction of a layer dies, gradients can't flow through it and that capacity is gone. The same shape (saturation → ~0 gradient → frozen region) applies to sigmoid/tanh tails. Fix: instrument the zero/saturation fraction per activation with a forward hook: (out==0).float().mean() for ReLU, |out|>0.99 for tanh, out<0.01 or out>0.99 for sigmoid, logged every K steps per layer. Healthy: a stable modest dead fraction (ReLU is sparse by design). Bad: a fraction climbing over training or a layer pinned near ~100% dead. Levers, in order: (1) lower LR (the primary cause); (2) ReLU → LeakyReLU / GELU / SiLU so the negative region keeps a gradient; (3) fix init / large negative biases. (CS231n "Neural Networks 1", dying ReLU) (the output being constant is owned by verifying-dl-experiments; this is the internal mechanism.)

O26 — No weight/grad/activation histograms → scalar norms hide bimodal/saturating/collapsing distributions

Symptom: scalar dashboards (loss, one grad-norm) look fine yet the model under-performs or destabilizes — a mean/norm hides the shape: activations drifting to a saturated tail, weights collapsing to a spike at 0 (a layer dying, O25), or a gradient distribution growing fat outlier tails all read as an unremarkable scalar. Root cause: norms and means are lossy summaries — a healthy spread and a bimodal/all-saturated/all-zero distribution can share the same L2 norm. The diagnostic signal is the change in shape over training, which a scalar can't show. Fix: periodically (every few hundred steps — histograms aren't free) log SummaryWriter.add_histogram(tag, values, global_step) for each layer's weights, its gradients (after backward, before zero_grad), and key activations (forward hook). Read the time-evolution: weights collapsing to a spike = a layer dying; gradient histograms collapsing to ~0 = vanishing (lever: residual/norm/init, P17); fat tails = clip + lower LR (P13/P12); activations wandering into a saturating tail = init/normalization fix (P17). Pair with the scalars above. (SummaryWriter.add_histogram, Karpathy recipe — visualize weights/activations)
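To make the "same norm, different shape" point concrete, here is a small sketch under stated assumptions: the two tensors are synthetic stand-ins for a layer's values, and the bin count and range passed to torch.histc are arbitrary illustrative choices. A unimodal sample and a two-spike sample are built to have matching L2 norms, yet their histograms differ completely.

```python
import torch

torch.manual_seed(0)
n = 10_000

# Unimodal, healthy-looking spread.
healthy = torch.randn(n)

# Two spikes at +c and -c, with c chosen so the L2 norm matches `healthy`.
c = (healthy.norm() / n ** 0.5).item()
bimodal = torch.cat([torch.full((n // 2,), c), torch.full((n // 2,), -c)])

# The scalar summary cannot tell the two apart ...
norm_healthy = healthy.norm().item()
norm_bimodal = bimodal.norm().item()

# ... but a 7-bin histogram over [-3, 3] can (index 3 is the bin around zero).
healthy_hist = torch.histc(healthy, bins=7, min=-3.0, max=3.0)
bimodal_hist = torch.histc(bimodal, bins=7, min=-3.0, max=3.0)

print(norm_healthy, norm_bimodal)
print(healthy_hist.tolist())
print(bimodal_hist.tolist())
```

In a training loop the same tensors would instead go to SummaryWriter.add_histogram(tag, values, global_step), as described above, so the change in shape can be read over time.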

Pointers — adjacent mechanics catalogued elsewhere

NaN / loss-spike / LR-too-HIGH / grad explosion / z-loss / qk-norm / init & norm placement / determinism → references/training/precision-stability.md (P8–P19). This file is the LR-too-LOW / won't-move side; that one is the blows-up side.
OOM from the optimizer step / activation checkpointing / LoRA-QLoRA memory → references/training/oom-memory.md (M5, M12–M13).
N-GPU effective batch × LR, DeepSpeed accum double-count, find_unused_parameters → references/training/distributed-launch.md (D11, D18, D8).
Dataloader correctness (worker RNG, collate, labels, shuffle) that mimics "won't learn" → references/training/data-pipeline.md.
Is the converged number REAL (collapse, leakage, train-vs-val, metric validity, seed discipline) → verifying-dl-experiments (REQUIRED — every "is the result real" fork above).

Table of contents

It isn't learning at all — the first-hour triage

O1 — Run the overfit-one-batch smoke BEFORE tuning anything (the canonical correctness test)

O2 — Loss flat from step 0, weights byte-identical after step() → params aren't in the optimizer

O3 — backward() is a no-op / raises "does not require grad" → loss detached from the graph

O4 — Wrong zero_grad / backward / step order, or a missing step()

O5 — Forgot model.train() / left model.eval() on → Dropout & BatchNorm in the wrong mode

Optimizer / learning-rate / weight-decay / schedule

O6 — Weight decay "does nothing" / Norm gains destabilize → coupled Adam(weight_decay=) + decaying bias & Norm

O7 — Loss crawls with no NaN → LR is too LOW; find the band with an LR range test

O8 — lr_scheduler.step() before optimizer.step() skips the first LR; per-step vs per-epoch cadence

O9 — Gradient accumulation gives effective N×LR → divide the loss by accum_steps (and normalize per token)

O10 — AdamW(eps=1e-8) underflows in bf16/fp16 → unbounded updates where v is tiny

O11 — fused=True AdamW breaks under AMP/FSDP; foreach inflates peak memory