{"id": "convergence-frozen-resnet", "prompt": "Fine-tuning a ResNet50 on a rented GPU. Training runs with no errors and normal speed, but loss barely drops and val accuracy is stuck near chance. I froze the backbone with requires_grad=False and use Adam with weight_decay. How do I debug why it isn't learning?", "expect_files": ["references/training/convergence-debugging.md"], "expect_ids": ["O1", "O2", "O17", "O18", "O6"], "expect_grep": [], "must_cover": "overfit-one-batch smoke; frozen-param-still-in-optimizer; frozen-BN running-stats drift; Adam vs AdamW decoupled decay", "agentic": "PASS (1-hop, 2026-06): SKILL.md 'When training breaks' -> convergence-debugging.md O1/O2/O17/O18/O6"} {"id": "data-worker-rng-dup", "prompt": "My image augmentations seem to repeat: different DataLoader workers produce identical random crops, and every epoch looks the same. Linux, num_workers=8, numpy-based augmentation. Real bug? Fix?", "expect_files": ["references/training/data-pipeline.md"], "expect_ids": ["DP1"], "expect_grep": ["worker_init_fn", "torch.initial_seed"], "must_cover": "numpy global RNG inherited via fork, not reseeded per worker; fix via worker_init_fn or route RNG through torch", "agentic": "PASS (1-hop, 2026-06): SKILL.md 'When training breaks' -> data-pipeline.md DP1"} {"id": "oom-on-step-2", "prompt": "CUDA out of memory on step 2, right after the first optimizer step. Step 1 ran fine. Why does it OOM only on the second step?", "expect_files": ["references/training/oom-memory.md"], "expect_ids": ["M17"], "expect_grep": [], "must_cover": "Adam lazily allocates optimizer state (m,v) on the first step()", "agentic": "PASS (workflow w2r1t7mm9): routed to oom-memory.md ladder + step-2 entry"} {"id": "nccl-one-rank-hang", "prompt": "Multi-GPU training hangs partway through an epoch; one rank seems stuck and the others wait forever (NCCL timeout). How do I find which rank and why?", "expect_files": ["references/training/distributed-launch.md"], "expect_ids": ["D19", "D20"], "expect_grep": [], "must_cover": "one rank diverged/OOM'd; survivors hang on the absent collective; desync-debug toolkit", "agentic": "PASS (workflow w2r1t7mm9): routed to distributed-launch.md hang toolkit"} {"id": "diffusion-loss-low-samples-bad", "prompt": "My diffusion model's training loss is low and still decreasing, but the generated samples look bad/blurry. The loss says it's fine. What's wrong?", "expect_files": ["references/training/by-domain.md"], "expect_ids": ["DF1", "DF2"], "expect_grep": ["EMA"], "must_cover": "loss != sample quality; sampling from raw (non-EMA) weights; cross-link verifying-dl-experiments", "agentic": "PASS (workflow w2r1t7mm9): routed to by-domain.md diffusion section"} {"id": "nan-loss-spike-bf16", "prompt": "LLM pretraining in bf16: loss is stable then suddenly spikes to NaN. How do I find where the NaN comes from and stop the spikes?", "expect_files": ["references/training/precision-stability.md"], "expect_ids": ["P8", "P12", "P15"], "expect_grep": ["z-loss"], "must_cover": "NaN arithmetic origins + anomaly detection; LR-too-high/warmup spike; z-loss to bound logits", "agentic": "PASS (workflow w2r1t7mm9): routed to precision-stability.md"} {"id": "resume-epoch-reset", "prompt": "I resume training from a checkpoint but the epoch/step counter restarts from 0 and the LR schedule replays warmup. What did I forget to save/restore?", "expect_files": ["references/training/checkpoint-resume.md"], "expect_ids": ["C1", "C12", "C14"], "expect_grep": [], "must_cover": "save FULL state (epoch/step/scheduler/RNG/scaler), not just weights", "agentic": "PASS (1-hop): SKILL.md -> checkpoint-resume.md"} {"id": "throughput-gpu-starved", "prompt": "GPU utilization is low and training is slow on my rented box. I think the dataloader is starving the GPU. How do I confirm and fix it?", "expect_files": ["references/training/throughput-profiling.md"], "expect_ids": ["T1", "T4"], "expect_grep": ["num_workers"], "must_cover": "GPU-bound vs data-bound vs comms-bound triage; num_workers/prefetch knobs", "agentic": "PASS: SKILL.md -> throughput-profiling.md"} {"id": "runpod-spot-resume-teardown", "prompt": "On RunPod my spot training keeps getting preempted. How do I make it resume instead of restarting, and how do I stop the meter most cheaply afterwards without losing checkpoints?", "expect_files": ["profiles/runpod.md"], "expect_ids": [], "expect_grep": ["terminate", "Network Volume"], "must_cover": "Network Volume is the only durable store; ~5s grace; terminate (not stop) stops billing; verify ckpt before terminate", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/runpod.md SS4/SS5 -> spot-resilience.md -> checkpoint-resume.md C3"} {"id": "vastai-teardown-billing", "prompt": "On vast.ai, what action actually stops billing, and how do I tear down without losing my checkpoints?", "expect_files": ["profiles/vastai.md"], "expect_ids": [], "expect_grep": ["destroy"], "must_cover": "destroy is the only meter-stop; stop still bills disk; copy + load-verify off-box before destroy", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/vastai.md SS5 -> lifecycle_checklist Phase 5"} {"id": "autodl-inode-disk-full", "prompt": "On AutoDL my torch.save fails with a disk/iostream error, but df -h shows plenty of space left. What's going on?", "expect_files": ["references/gotchas_universal.md"], "expect_ids": [], "expect_grep": ["inode", "df -i"], "must_cover": "storage dies on inodes before bytes; monitor df -i not just df -h; millions of small files", "agentic": "PASS (workflow w2r1t7mm9): routed to the inode/disk gotcha (principle #5 / U7)"} {"id": "china-hf-download-stall", "prompt": "Training in mainland China: a huggingface model download stalls and hangs with no error. How do I fix the download?", "expect_files": ["references/china-network.md"], "expect_ids": [], "expect_grep": ["hf-mirror", "HF_ENDPOINT"], "must_cover": "HF_ENDPOINT=hf-mirror.com; keep hf_transfer OFF on flaky CN links; resumable-download ladder", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> references/china-network.md"} {"id": "lambda-stop-vs-terminate", "prompt": "On Lambda Cloud, is there a stop action to pause billing while keeping my instance, or only terminate? How should I tear down?", "expect_files": ["profiles/lambda.md"], "expect_ids": [], "expect_grep": ["terminate"], "must_cover": "no stop state on Lambda on-demand; terminate is irreversible + wipes the instance; persistent FS is the only durable home", "agentic": "PASS (workflow w2r1t7mm9): SKILL.md -> profiles/lambda.md"} {"id": "autodl-first-contact-15day", "prompt": "First time on AutoDL. I'll 关机 (stop) my instance between sessions to save money — is my data safe if it stays stopped for a few weeks? Anything else I should know up front?", "expect_files": ["profiles/autodl.md"], "expect_ids": [], "expect_grep": ["Surface to the user", "免密", "AD-DANGER"], "must_cover": "关机 auto-releases after 15 days -> data disk deleted (not safe to park indefinitely); sync best to /root/autodl-fs for a longer pause; surface conveniences (one-click SSH免密, GPU notify, panels) + danger clocks (principle #10)", "agentic": "PASS (2026-06): principle #10 first-contact surfacing -> profiles/autodl.md Surface block + AD-DANGER 15-day clock"}