339 lines
22 KiB
Markdown
339 lines
22 KiB
Markdown
# OOM & fitting a model that doesn't — VRAM + host-RAM out-of-memory during training
|
||
|
||
How to read a CUDA OOM trace, understand *what* fills the card (params vs optimizer vs gradients vs
|
||
activations vs fragmentation), and apply the fixes **in cost order** — from a free batch-size cut to
|
||
ZeRO-3/QLoRA sharding. This layer owns *making training RUN and fit*; **verifying-dl-experiments** owns
|
||
*is the resulting number correct*. Cross-link it (**REQUIRED**) wherever a "fix" risks changing the
|
||
science (shrinking the one variable under test, swapping precision, changing seq-len).
|
||
|
||
To jump: `grep -in '<keyword>' references/training/oom-memory.md` (e.g. `expandable`, `checkpoint`,
|
||
`zero`, `validation`, `snapshot`, `lora`, `empty_cache`, `longest`, `fragment`).
|
||
|
||
## Table of contents
|
||
|
||
- **Read it first** — M1 anatomy (where VRAM goes) · M2 reading the OOM trace · M3 VRAM-OOM vs host-RAM-OOM
|
||
- **Fixes in order** — M4 the ladder (do these top-down) · M5 batch/grad-accum · M6 bf16 mixed precision · M7 activation/gradient checkpointing · M8 expandable_segments · M9 FSDP / ZeRO sharding · M10 CPU/NVMe offload · M11 seq-len/resolution · M12 8-bit & paged optimizers · M13 LoRA/QLoRA
|
||
- **OOM at a specific step** — M14 first backward · M15 validation/eval · M16 the longest batch · M17 step-2 (optimizer alloc)
|
||
- **Debugging** — M18 memory_summary · M19 the snapshot + visualizer · M20 empty_cache & "leak" myths
|
||
- **Pointers** — host-RAM cgroup-OOM → gotchas_universal.md U9 · VRAM-vs-cgroup → U10 · zombie-VRAM → U11
|
||
|
||
---
|
||
|
||
## Read it first
|
||
|
||
### M1 — Anatomy: where the VRAM actually goes (a model that runs inference OOMs in training)
|
||
|
||
Training memory is **not** just weights. For **Adam mixed-precision**, per parameter:
|
||
|
||
| Bucket | Bytes/param | Note |
|
||
|---|---|---|
|
||
| Weights (fp16/bf16 + fp32 master) | **6** | 2 B working copy + 4 B fp32 master for stable updates |
|
||
| Optimizer states (Adam m, v, fp32) | **8** | momentum 4 B + variance 4 B |
|
||
| Gradients (fp32) | **4** | one per param, backward pass |
|
||
| **Subtotal (fixed, batch-independent)** | **~18 B/param** | a 4 B-param model ≈ 72 GB *before any activation* |
|
||
| Forward **activations** (cached for backward) | **varies** | scales with `batch × seq_len × depth × hidden`; the part that explodes |
|
||
| Temporary buffers (softmax, matmul scratch) | spikes | a single peak op can OOM even when steady-state fits |
|
||
|
||
Source: HF model-memory-anatomy (https://huggingface.co/docs/transformers/en/model_memory_anatomy) gives
|
||
the 6+8+4 split and the "4B params, batch 16 ≈ 85 GB" worked figure. **Why a model that *infers* in 16 GB
|
||
OOMs in training:** inference is just the 2 B/param working copy + small activations; training adds the
|
||
+12 B/param of grads+optimizer **and** keeps the entire forward activation graph alive for backward.
|
||
The fixed 18 B/param is attacked by M9/M12/M13; the activation term by M5/M6/M7/M11.
|
||
|
||
### M2 — Reading the CUDA OOM trace (the numbers tell you which fix)
|
||
|
||
**Symptom**: `torch.OutOfMemoryError: CUDA out of memory. Tried to allocate X MiB (GPU 0; Y GiB total
|
||
capacity; Z GiB already allocated; A GiB free; B GiB reserved in total by PyTorch ...)`.
|
||
|
||
**Root cause — decode the four numbers**:
|
||
- **Tried to allocate X** — the size of the *single* failing request. Large X = a big tensor (long-seq
|
||
attention score matrix, the longest batch M16); tiny X failing with GBs "free" = **fragmentation**.
|
||
- **reserved B vs allocated Z** — `reserved` = total the caching allocator grabbed from the driver;
|
||
`allocated` = live tensors. **`reserved` ≫ `allocated` with a small failing X ⇒ fragmentation** (free
|
||
blocks exist but none is contiguous enough). This is the explicit PyTorch diagnostic: "if reserved but
|
||
unallocated is large, set `expandable_segments:True`" (M8).
|
||
- **free A** — driver-visible free on the card; if A is large but the alloc still fails, suspect another
|
||
process (M3) or a zombie holding VRAM (gotchas_universal.md **U11**).
|
||
|
||
Sources: PyTorch forums thread on the trace fields
|
||
(https://discuss.pytorch.org/t/torch-outofmemoryerror-cuda-out-of-memory/217669); the reserved-vs-allocated
|
||
→ fragmentation rule is from the allocator docs (M8 URL).
|
||
|
||
### M3 — VRAM OOM is not host-RAM OOM (distinct failure, distinct fix)
|
||
|
||
A `torch.OutOfMemoryError: CUDA out of memory` (a Python traceback) is **VRAM** exhaustion. A bare `Killed`
|
||
/ **exit 137** with **no traceback** is the Linux kernel killing the process for **host-RAM**
|
||
(cgroup `memory.max`) exhaustion — almost always `num_workers × a big in-RAM object`. These have opposite
|
||
fixes and live in the universal catalog:
|
||
- host-RAM cgroup-OOM (`Killed`, exit 137, dataloader workers) → **gotchas_universal.md U9**.
|
||
- VRAM-OOM distinct from cgroup, fragmentation, concurrent-job sizing → **gotchas_universal.md U10**.
|
||
- "empty GPU" still OOMs (a zombie holds VRAM nvidia-smi can't attribute) → **gotchas_universal.md U11**.
|
||
|
||
Confirm which one before "fixing": `dmesg | grep -iE 'killed process|out of memory'` non-empty ⇒ host-RAM
|
||
kernel kill (U9), **not** a CUDA OOM. Do not shrink the model to "fix" a host-RAM kill.
|
||
|
||
---
|
||
|
||
## Fixes, in order (cheapest / least-science-disturbing first)
|
||
|
||
### M4 — The ladder: apply top-down, stop when it fits
|
||
|
||
Each rung costs more (speed, complexity, or risk to the result). Climb only as far as needed:
|
||
|
||
1. **Reduce micro-batch + grad-accumulation** (M5) — free, exact same effective batch, zero accuracy change.
|
||
2. **bf16 mixed precision** (M6) — halves activations, usually a speedup; bf16 needs no loss scaling.
|
||
3. **Activation / gradient checkpointing** (M7) — trades ~20–30% compute for a large activation cut.
|
||
4. **`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`** (M8) — free; fixes *fragmentation* OOM only.
|
||
5. **SDPA / FlashAttention** — stop materializing the full O(seq²) attention matrix (M11).
|
||
6. **Shorter seq-len / lower resolution** (M11) — cheap but **changes the science** → verify (REQUIRED).
|
||
7. **8-bit / paged optimizer** (M12) — cuts the 8 B/param optimizer state to ~2 B with ~no accuracy loss.
|
||
8. **FSDP / DeepSpeed ZeRO-1→2→3** (M9) — shard optimizer→grads→params across GPUs (needs ≥2 GPUs).
|
||
9. **CPU / NVMe offload** (M10) — last resort on a single GPU; trades a large speed hit for fit.
|
||
10. **LoRA / QLoRA** (M13) — for *finetuning*: freeze base, train adapters; QLoRA quantizes base to 4-bit.
|
||
|
||
Rungs 1–4 and 7 do **not** alter the model/optimization math; rung 6 does (declare it, re-verify per
|
||
verifying-dl-experiments). Rungs 8–10 change *where* state lives, not the math (LoRA changes capacity).
|
||
|
||
### M5 — Reduce micro-batch + gradient accumulation (the free first move)
|
||
|
||
**Symptom**: OOM scales with batch size; effective batch must stay fixed for the result to hold.
|
||
|
||
**Fix**: drop `per_device_train_batch_size` to what fits, raise `gradient_accumulation_steps` to keep the
|
||
*effective* batch identical (`effective = micro_batch × accum × world_size`). Gradients accumulate over
|
||
sub-batches before one optimizer step — same math, lower peak activation memory. Keep micro-batch as large
|
||
as fits (batch 4 × accum 16 beats batch 1 × accum 64 — better GPU utilization).
|
||
Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one (gradient accumulation).
|
||
Caveat: with token-level loss + a custom loop, naive accumulation can mis-weight the loss across uneven
|
||
sub-batch token counts — a correctness issue owned by **verifying-dl-experiments** (REQUIRED).
|
||
|
||
### M6 — bf16 mixed precision (prefer bf16 over fp16 on Ampere+)
|
||
|
||
**Symptom**: fp32 training; activations dominate; the GPU is Ampere (A100/30xx) or newer.
|
||
|
||
**Fix**: `bf16=True` (HF `TrainingArguments`) or `torch.autocast("cuda", dtype=torch.bfloat16)`. The main
|
||
win is **activations stored in 16-bit**. **bf16 over fp16**: bf16 has fp32's exponent range, so it needs no
|
||
loss-scaling and won't overflow/underflow — fewer NaN failures. Note fp16 can *increase* memory at small
|
||
batch (it keeps both fp16 and fp32 weight copies); bf16 is the safer default where supported.
|
||
Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one (mixed precision; bf16 needs
|
||
Ampere+). NaN/divergence after switching precision = a numerics question → **verifying-dl-experiments**.
|
||
|
||
### M7 — Activation / gradient checkpointing (trade compute for activation memory)
|
||
|
||
**Symptom**: the 18 B/param fixed cost fits but **activations** OOM (deep model, long seq, big batch).
|
||
|
||
**Fix**: `gradient_checkpointing=True` (HF), `model.gradient_checkpointing_enable()`, or
|
||
`torch.utils.checkpoint.checkpoint(...)` manually. Only a subset of activations is stored; the rest are
|
||
**recomputed** during backward — cuts activation memory substantially at **~20–30% slower** training.
|
||
Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one ("~20% slower"). Gotcha: with
|
||
HF generate/caching set `model.config.use_cache=False` when checkpointing, or it warns and ignores; with
|
||
DDP, reentrant checkpointing can break — use `use_reentrant=False`.
|
||
|
||
### M8 — `expandable_segments:True` (the free fragmentation fix)
|
||
|
||
**Symptom**: OOM where the **failing alloc is small** yet `reserved` ≫ `allocated` and GBs look "free"
|
||
(M2); common with **variable shapes** (changing batch/seq-len, dynamic padding).
|
||
|
||
**Fix**: launch with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (env var, set *before* the process
|
||
starts; the modern alias is `PYTORCH_ALLOC_CONF`). It backs segments with CUDA VMM so they grow/shrink
|
||
instead of each `cudaMalloc` being an unmergeable block — which is the root of fragmentation.
|
||
Source: PyTorch CUDA notes (https://docs.pytorch.org/docs/stable/notes/cuda.html) and the allocator devlog
|
||
(https://docs.pytorch.org/devlogs/eager/2026-06-01-cuda-caching-allocator/). Alternative knob if fragmenting
|
||
on *large* blocks: `max_split_size_mb:<N>` (stops the allocator splitting blocks above N MiB). This is the
|
||
same knob referenced in **gotchas_universal.md U10** — set it as a default on the box, it is nearly free.
|
||
Version note: `expandable_segments` is still flagged experimental; it has known interop edges with some VMM
|
||
allocators (e.g. NCCL `ncclMemAlloc`, pytorch/pytorch#165419) — if a custom-allocator stack misbehaves,
|
||
drop it.
|
||
|
||
### M9 — FSDP / DeepSpeed ZeRO sharding (≥2 GPUs: shard the 18 B/param)
|
||
|
||
**Symptom**: the fixed 18 B/param state alone won't fit one card; multiple GPUs available.
|
||
|
||
**Fix — shard training state across the data-parallel group**, escalating by stage:
|
||
- **ZeRO-1 / `optim_state_dict` shard** — partition **optimizer states** (the 8 B/param). Smallest comms
|
||
change; start here if DP OOMs on optimizer state.
|
||
- **ZeRO-2 / FSDP `SHARD_GRAD_OP`** — also partition **gradients** (the 4 B). Good memory/comms balance.
|
||
- **ZeRO-3 / FSDP `FULL_SHARD`** — also partition **parameters** (the 6 B). Maximum savings; needs
|
||
high-bandwidth interconnect (NVLink/NVSwitch) because params are all-gathered per layer.
|
||
Source: DeepSpeed ZeRO tutorial (https://www.deepspeed.ai/tutorials/zero/) and HF DeepSpeed integration
|
||
(https://huggingface.co/docs/transformers/en/deepspeed); FSDP's `ShardingStrategy` maps 1:1 to these stages.
|
||
Multi-GPU launch + NCCL fabric gotchas (wrong NIC, timeout, MTU) → **references/multinode.md**.
|
||
|
||
### M10 — CPU / NVMe offload (single-GPU last resort)
|
||
|
||
**Symptom**: even ZeRO-3 (or single GPU) can't hold params/optimizer; willing to trade large speed for fit.
|
||
|
||
**Fix**: offload optimizer states (and with ZeRO-3, parameters) to **CPU RAM or NVMe**. DeepSpeed
|
||
`offload_optimizer: {device: cpu|nvme}` (valid ZeRO-1/2/3), `offload_param: {device: nvme}` (ZeRO-3 only);
|
||
**ZeRO-Infinity** offloads to both for huge models. QLoRA's **paged optimizers** (M12) are a lighter form —
|
||
optimizer state pages to CPU only on memory spikes. Source: DeepSpeed ZeRO docs
|
||
(https://deepspeed.readthedocs.io/en/stable/zero3.html); ZeRO-Infinity
|
||
(https://www.deepspeed.ai/2021/03/07/zero3-offload.html). Cost: PCIe/NVMe bandwidth becomes the bottleneck —
|
||
expect a multi-× slowdown; on a metered box, weigh it against renting a bigger card (principle #1, #9).
|
||
|
||
### M11 — Reduce sequence length / resolution / attention footprint
|
||
|
||
**Symptom**: activations (esp. the **O(seq²)** attention score matrix) dominate; OOM grows super-linearly
|
||
with seq-len/resolution.
|
||
|
||
**Fix (cheapest variant first)**:
|
||
- **Use SDPA / FlashAttention** to avoid materializing the full seq² attention matrix —
|
||
`attn_implementation="sdpa"` (default in PyTorch 2.1.1+) or `"flash_attention_2"`. No accuracy change.
|
||
- Only then **shorten seq-len / lower image resolution / patchify** — this **changes the task/science**;
|
||
declare it and re-verify (the resolution-change-broke-training failure mode is owned by
|
||
**verifying-dl-experiments**, REQUIRED).
|
||
Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one (SDPA, attention backends) and
|
||
model-memory-anatomy (attention score matrix grows with seq²).
|
||
|
||
### M12 — 8-bit & paged optimizers (cut the 8 B/param optimizer state to ~2 B)
|
||
|
||
**Symptom**: optimizer states are the largest single bucket (M1); want the cut with ~no accuracy loss.
|
||
|
||
**Fix**: swap AdamW for a **quantized** optimizer — HF `optim="adamw_bnb_8bit"` / `"paged_adamw_8bit"`
|
||
(bitsandbytes 8-bit Adam, states held in 8-bit, dequantized per step → ~2 B/param vs 8 B), or
|
||
`optim="adafactor"` (stores row/column moments instead of per-element → much less memory, **slower
|
||
convergence**). **Paged** variants additionally page optimizer state to CPU on spikes to survive transient
|
||
peaks. Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one (optimizers) and
|
||
model-memory-anatomy ("quantized Adam → 2 bytes/param"). Adafactor's convergence change is a science
|
||
question → **verifying-dl-experiments** (REQUIRED) before trusting its ablation deltas.
|
||
|
||
### M13 — LoRA / QLoRA (finetuning only: don't train the full model)
|
||
|
||
**Symptom**: *finetuning* a large pretrained model; full-finetune state won't fit.
|
||
|
||
**Fix**: **LoRA** freezes the base weights and trains small low-rank adapters → grads+optimizer exist only
|
||
for the adapter (a tiny fraction of params), so the 18 B/param cost nearly vanishes for the base.
|
||
**QLoRA** goes further: quantize the **frozen base to 4-bit NF4** (+ double quantization + paged
|
||
optimizers), train fp16/bf16 adapters on top — reported to finetune a **65B model on a single 48 GB GPU**
|
||
with no accuracy degradation vs 16-bit. Source: QLoRA paper
|
||
(https://arxiv.org/abs/2305.14314) and repo (https://github.com/artidoro/qlora). Note: LoRA *changes model
|
||
capacity* — it is a different optimization target, not a free OOM trick. Whether the LoRA result matches
|
||
full-finetune is a science claim → **verifying-dl-experiments** (REQUIRED).
|
||
|
||
---
|
||
|
||
## OOM at a SPECIFIC step (the step number is the diagnosis)
|
||
|
||
### M14 — OOM on the **first backward** (not the forward)
|
||
|
||
**Symptom**: forward pass completes, OOM hits at `.backward()`.
|
||
|
||
**Root cause**: forward only allocates activations; **backward** additionally allocates the full
|
||
**gradient** buffers (4 B/param) and needs every cached activation live simultaneously — peak memory is at
|
||
backward, not forward. A model that forwards fine still OOMs here.
|
||
|
||
**Fix**: M7 (checkpointing — recompute instead of store activations) is the targeted fix; then M5/M6. If the
|
||
peak is a single huge layer, gradient-checkpoint *that block* specifically.
|
||
|
||
### M15 — OOM only during **validation / eval**, training was fine
|
||
|
||
**Symptom**: training epochs run; the first eval pass OOMs — even with `torch.no_grad()` / `model.eval()`,
|
||
sometimes even at eval batch size 1.
|
||
|
||
**Root cause — two distinct ones**:
|
||
1. **Eval batch > train batch**, or no-grad eval lets a *larger* batch be attempted that exceeds the train
|
||
peak. The activation graph isn't kept, but a big single forward + its temporary buffers can still OOM.
|
||
2. **HF Trainer accumulates predictions on the GPU**: by default eval logits/labels are concatenated **on
|
||
the GPU** for the whole eval set before moving to CPU — a large eval set OOMs regardless of batch size
|
||
(huggingface/transformers#7232).
|
||
|
||
**Fix**: set `per_device_eval_batch_size` explicitly (don't inherit a too-large value); set
|
||
**`eval_accumulation_steps=N`** so predictions move to CPU every N steps instead of piling on the GPU
|
||
(https://huggingface.co/docs/transformers/main_classes/trainer). In a custom loop: wrap eval in
|
||
`torch.no_grad()` / `torch.inference_mode()`, and `.cpu()` / `.detach()` outputs before appending to any
|
||
list. Eval-artifact *sizing* (per-sample dumps blowing up) is owned by **verifying-dl-experiments**.
|
||
|
||
### M16 — OOM mid-epoch on the **longest batch** (variable-length / bucketed data)
|
||
|
||
**Symptom**: thousands of steps succeed, then a random step OOMs; restarting from there OOMs again at the
|
||
same data; fixed batches never OOM.
|
||
|
||
**Root cause**: with variable-length inputs (NLP token batches, point clouds, variable-resolution images),
|
||
peak activation memory is set by the **longest sequence in the batch**, not the average. Memory is sized for
|
||
the worst case, which only appears on certain batches.
|
||
|
||
**Fix**: size everything for the **max** length, not the mean: cap `max_length` / use **length bucketing or
|
||
sorted batching** so long samples share small batches; set `group_by_length=True` (HF) and a hard
|
||
`max_length`. A `expandable_segments:True` (M8) also helps because the variable shapes otherwise fragment.
|
||
Don't conclude "data corruption" from a step-N OOM — it's the longest batch.
|
||
|
||
### M17 — OOM on **step 2** (after the first optimizer step), step 1 fine
|
||
|
||
**Symptom**: step 1 trains; OOM on step 2 or at the first `optimizer.step()`.
|
||
|
||
**Root cause**: Adam **lazily allocates** its m/v state (the 8 B/param) on the *first* `optimizer.step()`,
|
||
not at construction. Peak therefore jumps after step 1. Reserved memory also climbs as the allocator caches
|
||
backward buffers. Source: the memory-snapshot timeline shows optimizer state appearing after iter 1
|
||
(https://pytorch.org/blog/understanding-gpu-memory-1/).
|
||
|
||
**Fix**: budget for the **post-step** peak, not step-1 — measure peak with `max_memory_allocated()` *after*
|
||
two full steps, not one. Then apply M12 (8-bit optimizer halves this jump) or M5.
|
||
|
||
---
|
||
|
||
## Debugging tools (measure, don't guess)
|
||
|
||
### M18 — `torch.cuda.memory_summary()` + the stat functions (first look)
|
||
|
||
**Symptom**: need to know *what* is resident before choosing a fix.
|
||
|
||
**Fix**: print `torch.cuda.memory_summary()` at the OOM point (or in an `except torch.cuda.OutOfMemoryError`)
|
||
for a table of allocated/reserved/active by size class. Programmatic: `torch.cuda.memory_allocated()` (live
|
||
tensors) vs `torch.cuda.memory_reserved()` (allocator total) — a big gap = fragmentation/caching (→ M8);
|
||
`torch.cuda.max_memory_allocated()` for the true peak (reset with `reset_peak_memory_stats()` between phases
|
||
to isolate forward vs backward vs optimizer). Source:
|
||
https://docs.pytorch.org/docs/stable/notes/cuda.html (memory management functions).
|
||
|
||
### M19 — The CUDA memory snapshot + visualizer (find the exact culprit allocation)
|
||
|
||
**Symptom**: summary stats aren't enough — need *which line of code* allocated the memory that OOMs.
|
||
|
||
**Fix — record a snapshot around the OOM and view the timeline**:
|
||
```python
|
||
torch.cuda.memory._record_memory_history(max_entries=100000) # start before the step(s)
|
||
try:
|
||
train_a_few_steps()
|
||
finally:
|
||
torch.cuda.memory._dump_snapshot("oom_snapshot.pickle") # write history
|
||
torch.cuda.memory._record_memory_history(enabled=None) # stop
|
||
```
|
||
Drag `oom_snapshot.pickle` onto **https://pytorch.org/memory_viz** (snapshots are not uploaded server-side),
|
||
or `python torch/cuda/_memory_viz.py trace_plot oom_snapshot.pickle -o snapshot.html`. The timeline colors
|
||
**parameters / gradients / optimizer state / activations / temporaries** separately, so the tallest band at
|
||
the OOM moment names the bucket to attack (→ map back to M5–M13). Available PyTorch **2.1+**. Source:
|
||
https://pytorch.org/blog/understanding-gpu-memory-1/. On a remote box: dump the pickle, `scp` it down
|
||
(references/ssh_transport.md), view locally — don't try to run the visualizer over ssh.
|
||
|
||
### M20 — `empty_cache()` and the "memory leak" myths
|
||
|
||
**Symptom**: belief that `torch.cuda.empty_cache()` "frees memory and fixes OOM," or that steadily-rising
|
||
reserved memory is a leak.
|
||
|
||
**Root cause / myth-busting**:
|
||
- `torch.cuda.empty_cache()` returns cached-but-unused blocks **to the driver**; it does **not** free live
|
||
tensors and does **not** make more room for *your own* process (the allocator would reuse that cache
|
||
anyway). It only helps a *second* process on the same GPU, or it reduces fragmentation at a ~10% speed
|
||
cost (HF `torch_empty_cache_steps=N` runs it every N steps —
|
||
https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one). It is **not** a fix for a model that
|
||
genuinely doesn't fit.
|
||
- **Rising `reserved` ≠ leak.** The caching allocator holds freed blocks for reuse; reserved climbing then
|
||
plateauing is normal. A *true* leak is **rising `memory_allocated()`** across steps with constant batch —
|
||
usually accumulating tensors that still require grad (appending `loss` instead of `loss.item()`, keeping
|
||
references in a Python list). Fix the reference, not with `empty_cache()`.
|
||
- Calling `empty_cache()` every step to "stay safe" just slows training and can *increase* fragmentation.
|
||
|
||
Real OOM-mechanics leaks (accumulate-loss-tensor, no `detach`) belong here; whether a *metric* drift is a
|
||
real effect vs a bug belongs to **verifying-dl-experiments** (REQUIRED).
|
||
|
||
---
|
||
|
||
## Pointers — memory gotchas catalogued elsewhere (do NOT restate)
|
||
|
||
- **Host-RAM cgroup-OOM** (bare `Killed` / exit 137, `num_workers × big tensor`) → **gotchas_universal.md U9**.
|
||
- **VRAM-OOM vs cgroup-OOM**, concurrent-job sizing, the `expandable_segments` one-liner → **gotchas_universal.md U10**.
|
||
- **Zombie holds VRAM nvidia-smi can't see** (OOM on an "empty" GPU) → **gotchas_universal.md U11**.
|
||
- **Disk-full crashes `torch.save`** (not memory, but the other "out of space") → **gotchas_universal.md U6**.
|
||
- **Multi-GPU NCCL / fabric** for FSDP/ZeRO launches → **references/multinode.md**.
|
||
- **Is the post-fit number correct** (precision swap, seq-len change, LoRA-vs-full, accumulation loss
|
||
weighting, determinism) → **verifying-dl-experiments** (REQUIRED — this layer makes it *fit and run*; that
|
||
one decides if the *result is true*).
|