playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/examples/autodl_sweep
ci[bot] c4c6a41c21 📦 deps(thirdparty): update snapshots 2026-06-21 16:02:37 +00:00
..
README.md 📦 deps(thirdparty): update snapshots 2026-06-21 16:02:37 +00:00
queue_1.txt 📦 deps(thirdparty): update snapshots 2026-06-21 16:02:37 +00:00

README.md

Worked example — a 3-cell ablation sweep on AutoDL

A complete, end-to-end run of the 6-phase lifecycle (SKILL.md) for the deepest profile (profiles/autodl.md). Substitute your own project name, alias, and configs. Two instances run their own queue file in parallel; this walkthrough ships queue_1.txt and shows one instance. Read profiles/autodl.md first — it owns every path and verb used below.

The AutoDL SCRIPT OVERRIDES (profiles/autodl.md §8) that parameterize the templates:

export PROJECT_REPO_DIR=/root/myproj
export DATA_DIR=/root/autodl-tmp          # fast per-instance scratch (checkpoints)
export DURABLE_DIR=/root/autodl-fs        # region-locked shared FS (survives release)
export PROXY_HOOK='source /etc/network_turbo'
export CRED_FILE=/root/.wandb_key

Phase 0 — Environment audit

ssh autodl-1 'df -h /root/autodl-tmp /root/autodl-fs / && df -i /root/autodl-fs && \
              cat /sys/fs/cgroup/memory.max | numfmt --to=iec && nvidia-smi'
bash scripts/gpu_health.sh 0     # run ON the box: Xid / throttle pre-flight (U22/U23)

Budget the disk: ckpt_size × cells_in_queue + scratch. Verify: nvidia-smi shows the expected GPU; df -i /root/autodl-fs is well under 100% (the inode cap, U7).

Phase 1 — SSH + credentials

# alias already in ~/.ssh/config (references/ssh_transport.md). Push the wandb key via stdin,
# to the per-instance disk — NEVER the shared FS (U34, and AutoDL's classifier blocks it, AD-gotcha):
printf '%s\n' "$WANDB_KEY_FROM_ENV" | ssh autodl-1 'umask 077; cat > /root/.wandb_key && chmod 600 /root/.wandb_key'

Verify: ssh autodl-1 'python -c "import torch;print(torch.cuda.is_available())"' prints True.

Phase 2 — Wrapper + CPU-smoke gate

# Parameterize the templates, drop the .template suffix, smoke locally on CPU BEFORE renting time:
cp scripts/run_one.sh.template run_one.sh && cp scripts/run_queue.sh.template run_queue.sh
python -m src.train -c configs/ablation/baseline.yaml --task reconstruction \
       --limit-batches 2 --epochs 1   # logger off; catches import/shape/scale bugs for free

Verify: the smoke exits 0 on 2 batches. (Smoke contentREQUIRED: verifying-dl-experiments.)

Phase 3 — Detached launch

# Push the parameterized wrappers + queue to the shared FS (ONE copy, all instances read it):
scp run_one.sh run_queue.sh examples/autodl_sweep/queue_1.txt autodl-1:/root/autodl-fs/
ssh autodl-1 "RUN_ONE=/root/autodl-fs/run_one.sh tmux new -d -s q1 \
  'bash /root/autodl-fs/run_queue.sh /root/autodl-fs/queue_1.txt 2>&1 | tee /root/autodl-tmp/runs/logs/q1_master.log'"

Verify within 60 s: ssh autodl-1 'tmux ls && tail -5 /root/autodl-tmp/runs/logs/q1_master.log' shows the session alive and a STARTING baseline line. Never overwrite the FS wrapper mid-run (U2 / principle #6).

Phase 4 — Durable monitoring

ssh autodl-1 'grep -hE "STARTING|FINISHED|QUEUE DONE|ERROR|Traceback" /root/autodl-tmp/runs/logs/q1_master.log | tail -8'

For a multi-hour sweep deploy the four-layer architecture (references/monitoring_patterns.md): a remote self-completion marker + a session patrol loop. Flag a FINISHED at <50% typical duration (probable early-stop) and re-launch the identical config (principle #7), never a patched one. Don't blind-retry.

Phase 5 — Aggregate + verify + teardown

ssh autodl-1 'DATA_DIR=/root/autodl-tmp DURABLE_DIR=/root/autodl-fs bash /root/autodl-fs/aggregate_to_fs.sh'  # gated sync (U33)
LOCAL_TARGET=/path/to/local/final_ckpts REMOTE_ALIAS=autodl-1 \
  REMOTE_PATH=/root/autodl-fs/final_ckpts bash scripts/download_loop.sh        # resumable per-dir pull
python scripts/verify_local.py /path/to/local/final_ckpts/                     # LOAD each best.pth

Verify: verify_local.py reports 100% OK. Iron Law: only AFTER every cell is pulled AND load-verified AND the user approves does teardown run — on AutoDL 关机 stops the meter and keeps the disk (the reversible exception); release frees it irreversibly. Reconcile against the roster, not the log (references/parallel_ablation.md §6). REQUIRED: superpowers:verification-before-completion.