playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/references/training/by-domain.md

231 lines
31 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Per-domain training gotchas — make each domain's run start, not lie, and not silently mistrain
The cross-cutting layers (precision, OOM, throughput, checkpoint, distributed) hold everywhere; this file
is the **domain-shaped** residue — the data-format, masking, normalization, schedule, and freezing traps
that only bite LLM / vision / diffusion / RL / VLM training. Each entry is **Symptom → Root cause → Fix**
with the exact knob. This layer owns *making the domain pipeline RUN and debugging its mechanics*;
**verifying-dl-experiments** owns *is the converged number real* (collapse-vs-real-effect, train/val
leakage, metric validity, constant/degenerate output). Cross-link it (**REQUIRED**) at every "loss is fine
but the output/metric is wrong" fork — the headline domain failures (diffusion samples bad at low loss,
mAP=0, reward collapse, VLM ignores the image) are exactly that shape.
To jump: `grep -in '<keyword>' references/training/by-domain.md` (e.g. `padding`, `packing`, `rope`,
`z-loss`, `dpo`, `mAP=0`, `mIoU`, `ignore_index`, `ema`, `vae`, `cfg`, `kl`, `whiten`, `projector`,
`freeze`, `seed`).
## Table of contents
- **LLM** — L1 pad-side · L2 loss-mask-100 · L3 pad-token-unset · L4 packing-cross-contamination · L5 RoPE-context-extension · L6 grad-explosion+z-loss · L7 eval-perplexity-mask · L8 SFT/DPO/RLHF-data-format · L9 DPO-collapse+KL · L10 gated-token-before-Trainer
- **Vision (cls/det/seg)** — V1 normalization-mismatch · V2 aug-on-eval · V3 mAP=0 · V4 anchor/NMS/conf-thresh · V5 mIoU=0 ignore_index/off-by-one · V6 class-imbalance · V7 BN-tiny-batch
- **Diffusion** — DF1 loss-low-samples-bad (cross-link) · DF2 EMA-weights · DF3 VAE-scaling · DF4 noise-schedule/timestep · DF5 CFG-conditioning-dropout · DF6 sampler-vs-model · DF7 SNR-weighting
- **RL** — R1 reward-collapse · R2 KL-blowup · R3 whitening · R4 replay/obs-normalization · R5 non-stationarity · R6 seed-variance (cross-link)
- **VLM** — X1 stage-freeze · X2 projector-only-stage1 · X3 per-group-LR · X4 image-token-truncation · X5 alignment-collapse (cross-link)
- **Pointers** — precision-stability.md, oom-memory.md, gotchas_universal.md, verifying-dl-experiments (skill)
---
## LLM / transformer
### L1 — Padding side: right for causal-LM SFT, **left** for generation/DPO
**Symptom**: a fine-tuned causal LM produces garbage, or DPO/batched-generation logprobs disagree with single-example decoding.
**Root cause**: causal-LM **training** wants **right-padding** (pad lands after content, attention mask zeroes it). Batched **generation/DPO** want **left-padding** — with right-padding the "last real token" position differs per row, so a shared decode step reads pad. TRL's `DPOTrainer` requires `processing_class` padding side `"left"`.
**Fix**: `tokenizer.padding_side="right"` for SFT collation; `"left"` for generation/eval/DPO — set it per phase, not globally. ([HF causal-LM](https://huggingface.co/docs/transformers/tasks/language_modeling), [TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
### L2 — Loss over prompt + pad tokens dilutes the signal → mask with 100
**Symptom**: SFT "trains" but parrots the prompt / barely follows instructions; loss plausible but flat.
**Root cause**: HF LM loss is `CrossEntropyLoss(ignore_index=-100)` — only `-100` positions are skipped. Leaving the prompt-prefix labels and pad labels as real ids averages the loss over "predict the prompt / predict pad."
**Fix**: set labels to `-100` at **both** the prompt prefix (train only on the completion) and all padding positions. TRL `SFTTrainer` `completion_only_loss` / `DataCollatorForCompletionOnlyLM` does the prefix masking — verify it fired (decode one masked label row). Whether the gradient hits the right tokens is a smoke-target → cross-link **verifying-dl-experiments** (**REQUIRED**). ([gpt2 thread](https://huggingface.co/gpt2/discussions/34))
### L3 — `pad_token` unset → pad error or silent pad-with-token-0
**Symptom**: `ValueError: Asking to pad but the tokenizer does not have a padding token`, or it pads with id 0 (a real token, often `<unk>`/`!`).
**Root cause**: many base LMs (GPT-2, Llama, Mistral) ship no `pad_token`.
**Fix**: `tokenizer.pad_token = tokenizer.eos_token` and `model.config.pad_token_id = tokenizer.pad_token_id`. With right-padding + attention mask, reusing EOS as PAD is safe. If a *new* token is added, `model.resize_token_embeddings(len(tokenizer))` or its id indexes out of range.
### L4 — Sequence packing leaks attention across documents → contaminated training
**Symptom**: throughput jumps after enabling packing, but quality drops vs unpacked; the model "completes" one doc with content from a packed neighbor.
**Root cause**: naive packing concatenates examples into one `max_len` sequence; a vanilla causal mask lets a token in doc 2 attend back into doc 1 — cross-sequence contamination.
**Fix**: **document masking** — emit `position_ids` that reset per sub-sequence + an attention impl that honors boundaries. TRL/HF `DataCollatorWithFlattening` packs into one stream, returns `position_ids`, and sets each example's first label to `-100`; FlashAttention-2 varlen restricts attention within-document. Requires `attn_implementation="flash_attention_2"` — packing without it silently contaminates. ([HF blog](https://huggingface.co/blog/packing-with-FA2), [transformers #31629](https://github.com/huggingface/transformers/pull/31629), [IBM](https://research.ibm.com/blog/hugging-face-training-flash-attention))
### L5 — Fine-tuning past pretrain context without RoPE scaling → garbage past N tokens
**Symptom**: a 4k-context model is incoherent past ~4k at inference even when trained on longer sequences; or long-context finetune won't converge.
**Root cause**: RoPE frequencies are calibrated to the pretrain context; longer positions extrapolate into unseen rotation angles. Linear interp degrades past ~4×; YaRN holds to 1632×.
**Fix**: set `rope_scaling={"type":"linear"|"dynamic"|"yarn","factor":<target/orig>}` in config and finetune **with scaling active** (`"yarn"` for big jumps, `"linear"` only ≤4×). Train-time vs inference-time `rope_scaling` mismatch is a silent regression. ([RoPE deep dive](https://amaarora.github.io/posts/2025-09-21-rope-context-extension.html), [HF guide](https://medium.com/@leannetan/extending-context-length-with-hugging-faces-transformers-6b04db05b39a))
### L6 — Loss spikes / logit drift in long LM training → z-loss + the precision-layer knobs
**Symptom**: pretraining/long-SFT loss is stable then spikes; in bf16/fp16 it can NaN; logits grow unboundedly over training.
**Root cause**: the softmax normalizer `log Z` drifts from 0 as logits grow → low-precision overflow + gradient instability.
**Fix**: add **z-loss** `1e-4 · log²(Z)` (the PaLM/Gopher coefficient) to pull `log Z` toward 0. The general divergence ladder (warmup, grad-clip, skip-the-batch, qk-norm, bf16-over-fp16) is **references/training/precision-stability.md** P12P18 (z-loss is P15) — not restated here; this entry is the LM-specific *why z-loss exists*. ([PaLM](https://arxiv.org/abs/2204.02311), [small-scale proxies](https://arxiv.org/abs/2309.14322))
### L7 — Eval perplexity wrong from a wrong mask/stride, not the model
**Symptom**: reported PPL implausible, or differs from a published number on the same checkpoint+data.
**Root cause**: PPL = `exp(mean NLL over scored tokens)`. Including pad/prompt tokens, or a sliding window that double-counts overlap context as scored tokens, corrupts the denominator.
**Fix**: score only non-`-100` positions; for long docs use the HF strided window where overlap tokens are `-100` (context, not scored). Whether the number is comparable across runs is metric-validity → cross-link **verifying-dl-experiments** (**REQUIRED**). ([HF perplexity](https://huggingface.co/docs/transformers/perplexity))
### L8 — SFT / DPO / RLHF expect different dataset schemas; the wrong one trains on nothing
**Symptom**: TRL trainer runs but learns nothing, or errors on a missing column; preference data in an SFT trainer (or vice versa) silently mistrains.
**Root cause**: **SFT** = prompt+completion (train on completion); **DPO/preference** = `{prompt, chosen, rejected}` or conversational messages; **RLHF/PPO** = prompts only + a separate reward model. Conversational data needs the chat template applied.
**Fix**: match trainer to schema; for conversational data confirm the chat template fired (decode one example — look for role tags). Prefer the **explicit-prompt** form `{prompt, chosen, rejected}` over implicit. Recommended order SFT → DPO (DPO from a non-SFT base often underperforms). ([TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
### L9 — DPO reward margin won't grow / chosen logps crash → beta + ref-model + collapse
**Symptom**: `rewards/margins` ~0 or `rewards/accuracies` ~0.5; or `logps/chosen` and `logps/rejected` both plunge (suppresses everything).
**Root cause**: `beta` controls deviation from the frozen reference — too **small** → policy drifts (implicit KL blows up, degenerate text); too **large** → signal too weak to move the margin. DPO widens the gap mostly by **suppressing the rejected** likelihood, so both logps falling *with a growing margin* is normal; both falling *with a flat margin* is collapse. A lost/absent `ref_model` (some PEFT paths) removes the anchor.
**Fix**: start `beta=0.1`, raise to 0.30.5 if text degrades, lower if the margin won't move. Use `learning_rate≈1e-6` (TRL DPO default; `≈1e-5` for LoRA) — too high is the classic collapse. Health signal: `rewards/margins` ↑, `rewards/accuracies` → ~0.7+. With `ref_model=None` TRL uses the initial policy as the frozen reference — concrete check: a frozen reference must yield **identical** logps for a fixed batch across steps; re-score one batch early and late, and if they drift the anchor is being trained (the trap when `ref_model=None` lacks a real frozen copy). Bug-vs-real-effect on the collapse → cross-link **verifying-dl-experiments** (**REQUIRED**). ([TRL DPO](https://huggingface.co/docs/trl/dpo_trainer))
### L10 — Gated/private model 401s mid-Trainer → authenticate BEFORE construction
**Symptom**: `401`/`GatedRepoError` when the Trainer loads a Llama/Gemma/Mistral base despite granted access; or `push_to_hub` can't write.
**Root cause**: the token must be visible to the process **before** the gated `from_pretrained` / Trainer-internal load; setting it after is too late.
**Fix**: push the token first (env/stdin, **never inline the literal**): set `HF_TOKEN`, or `huggingface_hub.login(token=os.environ["HF_TOKEN"])` at the top before any `from_pretrained`. `push_to_hub` needs a **write**-scope token + `hub_model_id` in `TrainingArguments`. Verify `huggingface-cli whoami` before launch — on a metered box a 401 wastes a full reload. Secrets transport → `references/ssh_transport.md` (U34); offline-without-key → gotchas_universal.md U35. ([HF gated](https://huggingface.co/docs/hub/en/models-gated))
---
## Vision (classification / detection / segmentation)
### V1 — Normalization mismatch train↔eval → near-zero accuracy on a "trained" model
**Symptom**: training loss falls but val/test is near-chance; or a fine-tuned backbone is far worse than its pretrained eval.
**Root cause**: pretrain `(mean,std)` (ImageNet `mean=[.485,.456,.406] std=[.229,.224,.225]`) differs between train/eval paths, or one normalizes to `[0,1]` and the other `[0,255]`; or RGB vs BGR (OpenCV loads BGR). A reported CenterNet case got post-norm mean `-115`, std `8` from wrong channel stats.
**Fix**: use the **exact** pretrain normalization, identically in train and eval; match channel order. Print one input tensor's per-channel mean/std — should be ~`N(0,1)`. Remaining gap = real-effect vs this bug → cross-link **verifying-dl-experiments** (**REQUIRED**; input-normalization is a named check). ([why-normalize](https://inside-machinelearning.com/en/why-and-how-to-normalize-data-object-detection-on-image-in-pytorch-part-1/), [tf/models #10778](https://github.com/tensorflow/models/issues/10778))
### V2 — Train-time augmentation applied at eval → unstable/depressed metrics
**Symptom**: eval numbers flicker run-to-run or sit below the training-curve val.
**Root cause**: the random transform pipeline (`RandomResizedCrop`, flip, jitter) is reused for the eval loader, so each pass sees different inputs; or `model.eval()` never called so Dropout/BN stay in train mode.
**Fix**: separate `train_transform` (random) from `eval_transform` (deterministic resize+center-crop+normalize); call `model.eval()` + `torch.no_grad()`. A flickering eval metric is usually this, not the model.
### V3 — Detection mAP=0 despite a falling loss → box-format / label-id / scale mismatch
**Symptom**: detection loss decreases normally but mAP is exactly 0 (or ~0) at every eval.
**Root cause**: a format mismatch the loss tolerates but eval doesn't — (1) box format `cxcywh`/`xywh` vs evaluator's `xyxy`, or normalized `[0,1]` vs absolute pixels; (2) class id off-by-one (0-indexed model vs 1-indexed COCO, 0=background); (3) boxes in resized space matched against original-res GT; (4) eval score threshold so high everything is filtered.
**Fix**: assert the eval pipeline's box format + class indexing, convert explicitly (`torchvision.ops.box_convert`), and visualize 23 predicted boxes before trusting the metric. mAP=0 with healthy loss is almost never the model — it's eval glue; the all-zero-metric pattern → cross-link **verifying-dl-experiments** (**REQUIRED**). ([tf/models #10778](https://github.com/tensorflow/models/issues/10778), [bbox formats](https://www.learnml.io/posts/a-guide-to-bounding-box-formats/))
### V4 — Detections vanish after NMS / anchor mismatch → no boxes survive
**Symptom**: raw head outputs look reasonable but final detections are empty or absurdly few.
**Root cause**: NMS IoU too aggressive, score threshold too high, or anchor sizes/ratios don't cover the dataset's object scales (regression targets unreachable).
**Fix**: log pre-NMS vs post-NMS counts; loosen `score_thresh` (~0.05 for eval recall), NMS IoU ~0.50.6; for anchor heads run k-means auto-anchor (YOLO `autoanchor`) on GT boxes. Pairs with V3.
### V5 — Segmentation mIoU=0 or NaN → `ignore_index` / label off-by-one
**Symptom**: seg loss trains but mIoU is 0 (or a class NaN); or loss is NaN from step 1.
**Root cause**: label/class-index inconsistency — a void value (commonly `255`) not excluded → treated as a class id ≥ `num_classes` (out-of-range / pollutes IoU); or off-by-one (0=background but labels start at 1, or a `reduce_labels` 0→255 shift applied inconsistently between loss and metric).
**Fix**: set the **same** `ignore_index` in **both** loss and metric — `CrossEntropyLoss(ignore_index=255)` and mIoU mask `(label != 255)`; confirm `label.max() < num_classes` after any shift; apply reduction identically. mIoU=0 with falling loss = all-zero-metric pattern → cross-link **verifying-dl-experiments** (**REQUIRED**). ([torchmetrics #2747](https://github.com/Lightning-AI/torchmetrics/issues/2747), [HF ignore_index](https://discuss.huggingface.co/t/understanding-ignore-index-and-reduce-labels/64587))
### V6 — Severe class imbalance → model predicts only the majority class
**Symptom**: high pixel/sample accuracy but rare classes never predicted; minority recall ~0.
**Root cause**: unweighted cross-entropy is dominated by the majority class; "always predict majority" is the easy degenerate solution.
**Fix**: weight the loss (`CrossEntropyLoss(weight=...)` inverse-frequency) or focal loss (detection); class-balanced sampler. Report **per-class / macro** metrics, never just overall — a high aggregate hiding a collapsed minority is degenerate output → cross-link **verifying-dl-experiments** (**REQUIRED**).
### V7 — Tiny per-GPU batch → BatchNorm stats garbage → unstable/poor training
**Symptom**: detection/seg at batch 12 per GPU is unstable or underperforms a larger-batch run.
**Root cause**: BN estimates mean/var over the batch; at batch 12 those are noisy and running stats drift.
**Fix**: **SyncBatchNorm** across GPUs (`torch.nn.SyncBatchNorm.convert_sync_batchnorm`) under DDP, or **GroupNorm**, or freeze pretrained BN (`FrozenBatchNorm2d`, as detection backbones do). DDP mechanics → `references/training/distributed-launch.md`.
---
## Diffusion / generative
### DF1 — Loss is low but samples are bad → the canonical "loss ≠ quality"
**Symptom**: the noise-prediction MSE converges nicely but samples are blurry, mode-collapsed, or wrong.
**Root cause**: diffusion loss (predict noise at a random timestep) is **weakly correlated with sample quality** — good average noise-prediction still compounds errors over the sampling trajectory. Real culprits are downstream: missing EMA (DF2), wrong VAE scaling (DF3), train/sample schedule mismatch (DF4/DF6), no/over CFG (DF5).
**Fix**: the textbook **is-the-number-real** fork → cross-link **verifying-dl-experiments** (**REQUIRED**; it owns loss-low-output-bad). Mechanically walk DF2→DF6; the single most common miss is evaluating **raw** weights instead of **EMA** (DF2). ([stability techniques](https://apxml.com/courses/advanced-diffusion-architectures/chapter-4-advanced-diffusion-training/training-stability-techniques))
### DF2 — Sampling from raw (non-EMA) weights → worse than the "same" model
**Symptom**: samples from the just-saved checkpoint look worse than expected; quality jumps with an EMA checkpoint.
**Root cause**: diffusion quality depends heavily on EMA — a running average (`decay≈0.999`, ~1000-update window) that denoises the noisy SGD trajectory. Raw weights are the noisy point estimate.
**Fix**: maintain EMA during training and **sample/evaluate from EMA weights** (`diffusers` `EMAModel`); save both (EMA for inference, raw for resume); verify the eval path actually loaded EMA. Resolves a large share of "DF1" reports. ([EMA](https://medium.com/@thibaut.chauffier/training-diffusion-models-from-scratch-21d7a1f18e9e))
### DF3 — VAE latent scaling wrong → latents off-unit-variance → diffusion can't learn
**Symptom**: latent-diffusion (SD-style) training is unstable / produces noise or blank output; or a swapped-in custom VAE degrades everything.
**Root cause**: the model assumes ~unit-variance latents; the VAE output is multiplied by a calibrated `scaling_factor` (SD v1 = `0.18215`). A wrong/missing factor (or a custom VAE with different stats) leaves latents off-scale.
**Fix**: scale by the VAE's `config.scaling_factor` on encode, divide on decode. For a **custom** VAE, **measure** empirical latent std on a sample and set factor `1/std` — don't inherit `0.18215`. Print latent mean/std before training (~0/~1). ([sd-vae](https://huggingface.co/stabilityai/sd-vae-ft-mse))
### DF4 — Noise schedule / timestep / prediction_type mismatch train↔inference
**Symptom**: structured artifacts, wrong contrast/brightness, or failure at low step counts.
**Root cause**: betas/alphas schedule (linear/cosine/scaled-linear), `num_train_timesteps`, or `prediction_type` differs between training and the inference scheduler; e.g. trained on `epsilon`, sampled as `v`/`x0`.
**Fix**: keep the **same** `beta_schedule`, `num_train_timesteps`, `prediction_type` in both — in `diffusers` build the inference scheduler from the training `scheduler.config`. Mismatched `prediction_type` is a silent quality killer.
### DF5 — Conditioning never dropped during training → CFG is a no-op / model ignores prompt
**Symptom**: changing `guidance_scale` at inference barely changes output, or the model ignores conditioning.
**Root cause**: CFG needs a learned **unconditional** path, trained by randomly replacing the condition with a null embedding for a fraction `p_drop` of examples. No dropout ⇒ no usable unconditional estimate ⇒ CFG no-op.
**Fix**: during training drop the condition `p_drop≈0.1` (replace with null embedding). At inference use `guidance_scale` ~515 for T2I (higher = more prompt adherence, lower diversity). Model-ignores-input is a verifying-dl-experiments concern (**REQUIRED**); the training-side root cause is the missing dropout. ([CFG theory](https://apxml.com/courses/advanced-diffusion-architectures/chapter-4-advanced-diffusion-training/classifier-free-guidance-theory))
### DF6 — Sampler ≠ model: a sampler config that doesn't match the trained objective
**Symptom**: switching samplers wildly changes quality; one sampler gives noise.
**Root cause**: samplers assume a `prediction_type` + schedule (DF4); ancestral/SDE vs deterministic ODE interact differently with the trained noise level, and step count below the sampler's stable range degrades.
**Fix**: validate the checkpoint with the **reference** sampler/step-count from its recipe first, then explore; confirm `prediction_type` matches; for few-step use DPM-Solver++ not plain DDPM. "New sampler → bad samples" is a config mismatch, not a model regression.
### DF7 — Uniform-timestep loss weighting → blurry samples → Min-SNR weighting
**Symptom**: samples are systematically blurry / low-detail despite long training.
**Root cause**: uniform loss weight over-weights high-noise (easy) steps relative to low-noise steps that carry fine detail; the gradient is dominated by the easy regime.
**Fix**: apply **Min-SNR-γ** weighting (γ≈5) so low-noise steps get their share; `diffusers` scripts expose `--snr_gamma`. Compounds with DF2/DF3 — fix these details together, not in isolation. ([Min-SNR](https://arxiv.org/abs/2303.09556))
---
## RL
### R1 — Reward collapses / output degenerates mid-training
**Symptom**: average reward suddenly drops, responses get short/repetitive or refuse, length collapses.
**Root cause**: reward hacking / over-optimization — the policy exploits the reward model's blind spots and drifts far, often after the ratio gets clipped much more and approximate KL spikes.
**Fix**: strengthen the KL penalty (raise the KL coefficient), lower LR (`≈1e-6` for LLM PPO), reduce PPO update epochs/batch, add a length penalty if length is gamed. Watch reward **and** KL together — a reward jump with a KL spike is hacking, not progress. Bug-vs-real-effect on the collapse → cross-link **verifying-dl-experiments** (**REQUIRED**; it owns collapse/degenerate-output). ([PPO instability](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-4-rl-ppo-fine-tuning/troubleshooting-ppo-instability), [N-details RLHF](https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo))
### R2 — KL to the reference blows up → policy runs away
**Symptom**: KL grows without bound; generations go incoherent; "diverges" though loss isn't NaN.
**Root cause**: the KL penalty is too weak (or adaptive-KL target too loose); aggressive updates push the policy far, and a huge KL term can dominate the objective so the model optimizes the penalty instead.
**Fix**: use an **adaptive KL controller** with a target (e.g. 6), or a fixed coefficient large enough to hold KL bounded; clip the ratio (`cliprange≈0.2`); cap update epochs. Confirm the frozen reference isn't being updated. Same axis DPO's `beta` controls (L9). ([KL penalty role](https://apxml.com/courses/rlhf-reinforcement-learning-human-feedback/chapter-4-rl-ppo-fine-tuning/kl-divergence-penalty-role))
### R3 — Un-normalized rewards/advantages → unstable gradients → whiten
**Symptom**: high-variance, brittle training; small reward-scale changes destabilize everything.
**Root cause**: raw reward/advantage magnitudes vary wildly across batches; PPO gradients are scale-sensitive.
**Fix**: **whiten** advantages per minibatch (subtract mean, divide by std) — the standard PPO trick, more stabilizing than plain reward normalization (double-normalizing reward **and** advantage is often redundant). For classic-control RL, normalize **observations** with a running mean/std (`VecNormalize`) — un-normalized obs is a top cause of failure to learn. ([impl matters](https://openreview.net/pdf?id=rxEmiOEIFL), [whitening redundancy](https://liujch1998.github.io/2023/04/16/ppo-norm.html))
### R4 — Replay buffer / normalization state not checkpointed → resume behaves like cold start
**Symptom**: an off-policy run (DQN/SAC) resumed from checkpoint acts like a cold start; or a normalized env's stats reset on resume and performance tanks.
**Root cause**: the replay buffer and the running obs/reward normalization stats are part of training **state** but are often omitted — restoring only weights loses them.
**Fix**: checkpoint+restore the replay buffer (or accept warmup) **and** the `VecNormalize`/running-stats alongside weights. General checkpoint-everything-stateful (optimizer/scheduler/RNG/step) → `references/training/checkpoint-resume.md`; on spot boxes losing buffer/normstats every preemption silently degrades learning → `references/spot-resilience.md`.
### R5 — Non-stationarity treated as a bug → chasing a moving target
**Symptom**: value/critic loss won't converge to zero; metrics oscillate even when "working."
**Root cause**: RL targets are **non-stationary** — the policy changes the data distribution and bootstrapped targets move. A value loss that never hits zero is expected.
**Fix**: judge by **return/reward trend**, not critic-loss-to-zero; stabilize with a slow target network (DQN/SAC) and GAE. Don't "fix" non-convergent critic loss by shrinking LR to zero — that just stops learning.
### R6 — A single seed's result is not the result → RL variance is huge
**Symptom**: identical hyperparameters + different seeds give non-overlapping curves; an ablation "win" disappears on re-run.
**Root cause**: extreme seed variance from algorithm, policy sampling, and environment stochasticity (comparing 5-run point estimates yields >50% Type-I error).
**Fix**: report aggregate over **≥5 seeds** (more for noisy envs), use **IQM** (interquartile mean) over mean/median, show CIs. A single-seed delta is not a result — squarely **verifying-dl-experiments** territory (bug-vs-real-effect, seed discipline; **REQUIRED**). ([Henderson](https://arxiv.org/pdf/1708.04133), [how-many-seeds](https://arxiv.org/pdf/1806.08295))
---
## Multimodal / VLM
### X1 — Wrong freeze schedule across stages → alignment never forms or the LLM is wrecked
**Symptom**: a LLaVA-style VLM doesn't ground on images (ignores visual tokens), or text quality collapses after multimodal finetune.
**Root cause**: VLM training is **staged**, each stage freezing different towers. Stage 1 (alignment): freeze vision encoder **and** LLM, train **only the projector**. Stage 2 (instruction tuning): unfreeze LLM **and** projector, keep vision encoder frozen. Training the LLM in stage 1 (before the projector aligns) corrupts it; never unfreezing it means it can't use the visual tokens.
**Fix**: set `requires_grad` per tower per stage per the recipe; print trainable-param counts at each stage start to confirm the freeze took. "Ignores its input image" is model-ignores-input → cross-link **verifying-dl-experiments** (**REQUIRED**). ([LLaVA recipe](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
### X2 — Projector trained from scratch with the LLM hot → unstable stage-1
**Symptom**: stage-1 alignment loss is unstable or the projector output is garbage.
**Root cause**: the projector (linear or 2-layer MLP, often concatenating groups of vision tokens into the LLM embedding space) is **randomly initialized**; flowing gradients into the LLM through an un-aligned projector destabilizes both.
**Fix**: stage 1 trains the projector **alone** against a frozen LLM so it learns the LLM's embedding space first; only then (stage 2) unfreeze the LLM. Confirm projector output dim == LLM hidden size (else a silent shape-broadcast bug). ([projector adapter](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
### X3 — One global LR for all towers → vision encoder drifts or LLM underfits
**Symptom**: a shared LR either over-updates the pretrained vision encoder (forgets visual features) or leaves projector/LLM undertrained.
**Root cause**: towers are at different maturities — a pretrained vision encoder needs a tiny LR (or freeze), a fresh projector a larger one, the LLM in between. LLaVA: ~`1e-3` projector in alignment, `2e-5` LLM in stage 2.
**Fix**: **parameter groups** with per-group LRs (`AdamW([{"params":proj.parameters(),"lr":1e-3},{"params":llm.parameters(),"lr":2e-5}])`), or freeze the vision encoder. Log each group's LR to confirm. ([LLaVA LRs](https://rohitbandaru.github.io/blog/Vision-Language-Models/))
### X4 — Sequence truncation drops image tokens → shape error or silent loss of vision
**Symptom**: VLM training errors on image-token-count mismatch, or intermittently ignores images on long examples.
**Root cause**: image placeholders expand into many tokens (hundreds per image); a `max_length` truncation cuts them, breaking the image-token ↔ feature alignment.
**Fix**: set `max_length=None` (or large enough never to truncate image tokens) for VLM trainers, or verify truncation never removes placeholders across the whole dataset. Count image tokens vs expected per example as a smoke check. ([TRL VLM note](https://huggingface.co/docs/trl/dpo_trainer))
### X5 — Modality alignment collapse: the LLM answers from text priors, not the image
**Symptom**: the VLM gives plausible answers that ignore the actual image content (language-prior shortcut).
**Root cause**: weak visual signal (bad projector, frozen-everything, too little alignment data) lets the LLM fall back on its language prior — a degenerate "ignore the input" solution that still lowers loss on text-predictable answers.
**Fix**: the mechanical fixes are X1X3 (correct freeze/projector/LR so the visual path contributes). Whether it's genuinely grounding vs shortcutting is **verifying-dl-experiments** (model-ignores-input/degenerate-output; **REQUIRED**) — probe with image-perturbation / counterfactual-image tests, which that skill owns.
---
## Pointers — domain-adjacent mechanics catalogued elsewhere
- **Precision / NaN / loss-spike / z-loss / grad-clip** (L6's general ladder) → `references/training/precision-stability.md`.
- **OOM, activation checkpointing, LoRA/QLoRA, FSDP/ZeRO, seq-len memory** → `references/training/oom-memory.md`.
- **Dataloader starvation, GPU-util%, NVMe staging, tar-sharding** → `references/gotchas_universal.md` (U8, U21, U24, U25).
- **DDP/FSDP launch, SyncBatchNorm under DDP, NCCL** → `references/training/distributed-launch.md`, `references/multinode.md`.
- **Checkpoint-everything-stateful + atomic resume** (R4's general form) → `references/training/checkpoint-resume.md`, `references/spot-resilience.md`.
- **"Runs but won't learn": loop wiring, optimizer/LR/weight-decay, loss-function & label form, freezing/BN drift, dataloader correctness** → `references/training/convergence-debugging.md`, `references/training/data-pipeline.md`.
- **Is the metric/model correct** (collapse, leakage, all-zero metrics, model-ignores-input, seed discipline) → **verifying-dl-experiments** (**REQUIRED** — owns every "bug vs real effect" fork above).