Profile: AutoDL

The deepest, battle-tested profile — a Chinese cgroup-isolated SSH-rental with a 3-tier storage model and the one rental where the meter-stop action is non-destructive. Fills all 8 schema sections (profiles/_schema.md) at full depth. Read this before Phase 0; it owns every path, proxy, billing verb, and TB pin the SKILL.md phases delegate to. Universal gotchas are NOT restated here — see references/gotchas_universal.md.

Surface to the user up front (principle #10): conveniences most users miss — the console has a one-click "设置SSH免密登录" (registers your key so the agent connects non-interactively), GPU-availability notifications ("订阅GPU通知"), and built-in AutoPanel / JupyterLab / TensorBoard tiles. ⚠️ Danger clocks — 关机 (stop) auto-releases the box after 15 days → the data disk is deleted (AD-DANGER, §5); only /root/autodl-fs survives a 释放; low balance / arrears force-stop. And the TB tile is pinned to /root/tf-logs — write your logger there (or symlink) or the panel shows empty (AD7 / U39).

To jump: grep -in '<keyword>' profiles/autodl.md (e.g. grep -in inode profiles/autodl.md).

Table of contents

  1. LAUNCH — entry points + env contract (base miniconda IS the env)
  2. STORAGE MODEL — 3 tiers + survival matrix + inode cap
  3. NETWORK — academic proxy + China mirrors + pinned TB
  4. SPOT / INTERRUPTION + RESUME — effectively on-demand
  5. TEARDOWN / BILLING — 关机 stops the meter AND keeps the disk (the AutoDL exception)
  6. DAEMON TOOL — tmux / nohup
  7. TOP GOTCHAS — AD1..AD9, platform-pinned
  8. SCRIPT OVERRIDES — values to parameterize scripts/

---
platform: autodl
kind: ssh-rental
meter_stop_verb: 关机           # shutdown/power-off STOPS billing AND keeps /root + disks
meter_stop_irreversible: false  # the AutoDL EXCEPTION — 关机 is reversible; only 释放/release deletes
detach_primitive: tmux          # nohup fallback when tmux is not installed (often absent on fresh image)
spot_available: false           # on-demand only; no spot/bid/preemption model
spot_grace: n/a
shared_fs: true                 # /root/autodl-fs — region-locked, cross-instance within one region
inode_cap: ~200K                # hard cap on the shared FS, independent of byte capacity
free_egress: true               # no per-GB egress fee, but cross-GFW pulls need the academic proxy (see china_mirror_needed)
china_mirror_needed: true       # behind the GFW — hf-mirror / ModelScope + /etc/network_turbo
host_driver_cuda_max: image-dependent   # the prebuilt image pins torch+CUDA; do not downgrade (AD9)
local_nvme: true                # /root/autodl-tmp data disk is fast local NVMe, per-instance
---

1. LAUNCH

First time? (rent → reach the box). On the AutoDL console: pick a GPU + region with stock → 创建实例 (choose the PyTorch image — the base env ships prebuilt) → register your key once via 设置SSH免密登录 (so the agent connects non-interactively) → copy the instance's SSH connection string + password from the console → test ssh -p <PORT> root@connect.<region>.seetacloud.com 'nvidia-smi'. That string is your entry to every phase below. (Console-only steps; AutoDL's UI shifts — re-check its docs if a label moved.)

Entry points. Web console (创建实例) for create/release/power; per-instance SSH connection string from the console (ssh -p <PORT> root@connect.<region>.seetacloud.com). No first-class platform CLI/REST for job control — SSH is the orchestration channel. Set a stable alias per instance in ~/.ssh/config (Host autodl-<proj>-<N>, HostName connect.<region>.seetacloud.com, Port <PORT>) so every later command is short; the port is assigned at create-time and changes on re-create (update the alias). SSH/keepalive config → references/ssh_transport.md.

Env contract: the prebuilt base miniconda IS the env (AD6). The image ships the full DL stack into base (/root/miniconda3/bin/python); there is no /root/miniconda3/envs/<name>/. Base is the deliberate single-tenant project env. Never run conda create or conda create --clone base on the rental: a clone duplicates ~16 GB of base packages and eats the disk space you just freed, for zero benefit. Train with the explicit interpreter /root/miniconda3/bin/python; in remote polls use that path or pure shell, never bare python3 (it may be absent → exit 127). When installing project deps, filter framework pins so a requirements.txt does not downgrade the image's torch build (AD9).

The "no DL in conda base" discipline applies to the persistent local machine only — on an ephemeral rental, base IS the expected place to run. A local env-guard hook must exempt remote-ssh + instance base.


2. STORAGE MODEL (survival matrix — principle #4)

Three tiers, each with a different speed / size / inode profile and a different survival behavior:

| Tier | Path | Speed | Size | Inode cap | Scope |
|---|---|---|---|---|---|
| System disk | / | medium | ~30 GB | none | per-instance |
| Data disk | /root/autodl-tmp | fast NVMe | per-plan (e.g. ~50 GB) | none | per-instance |
| Shared FS | /root/autodl-fs | NFS (slow, ~30 s/sync) | ~200 GB | ~200K (hard) | region-locked, all instances in one region |

Survival matrix — the part most platforms get wrong, and where AutoDL is the exception:

| Tier | Survives 关机 (stop)? | Survives 释放 (release/destroy)? | Notes |
|---|---|---|---|
| / system | yes | no | AutoDL persists /root across power-off, UNLIKE RunPod/vast/K8s/Colab |
| /root/autodl-tmp data | yes | no | fast tier; checkpoints written here mid-run |
| /root/autodl-fs shared | yes | yes | the ONLY tier that survives release; region-locked |

Where checkpoints MUST go for the §5 teardown verb: write live checkpoints to the fast data disk (/root/autodl-tmp/checkpoints/<name>, never the 30 GB system disk), then checked-sync best.pth to /root/autodl-fs — the only tier that survives a 释放. If only ever using 关机, the data disk also survives, but syncing the durable copy to FS is the safe default (a later release loses the data disk).

Region/DC-lock (AD3). FS quota is region-scoped; each region has its own physical mount. Files written from a <region-a> instance are invisible to a <region-b> instance even at the identical /root/autodl-fs/ path. Create the FS quota in the same region as the instances; to bridge regions, pick one region as primary and scp between them (slow). Confirm sharing with a write-from-one / read-from-another probe before relying on it.

Inode discipline (AD4). The ~200K cap is independent of bytes: df -h can read 34% while cp fails "No space left" because df -i is at 100%. The inode bomb is per-sample eval visualization (files_per_sample × N_samples × N_conditions → tens of thousands of tiny files); checkpoints (few large files) are inode-cheap. Monitor df -i, not just df -h (Phase 0 + every space check). Eval-artifact sizing policy is owned by REQUIRED: verifying-dl-experiments.
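A minimal Python sketch of that check, assuming a Linux box where os.statvfs reports the same counters df reads. The helper names and the example numbers in the usage note are illustrative and not part of the skill's scripts; the 200,000 cap is the approximate figure quoted above.

```python
import os

INODE_CAP = 200_000  # approximate shared-FS cap quoted in this profile


def fs_usage(path):
    """Return (byte_pct, inode_pct) for the filesystem holding `path`.

    inode_pct is None when the filesystem reports no inode total.
    """
    st = os.statvfs(path)
    byte_pct = 100.0 * (st.f_blocks - st.f_bavail) / st.f_blocks if st.f_blocks else 0.0
    inode_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files if st.f_files else None
    return byte_pct, inode_pct


def eval_file_count(files_per_sample, n_samples, n_conditions):
    """Number of files a per-sample eval visualization would create."""
    return files_per_sample * n_samples * n_conditions


def fits_inode_budget(planned_files, used_inodes, cap=INODE_CAP):
    """True when the planned files still fit under the inode cap."""
    return used_inodes + planned_files <= cap
```

For example, 4 files per sample over 2000 samples and 6 conditions is 48,000 files, roughly a quarter of the cap before a single checkpoint is synced, so the estimate is worth running before the eval rather than after the first failed cp.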

Data-disk hog (AD5). When /root/autodl-tmp hits 100% but runs/ looks small, the real hog is the HF cache symlinked onto the data disk (~/.cache/huggingface → tens of GB of model blobs). Audit du -sh ~/.cache/huggingface/hub/models--* | sort -rh before deleting checkpoints; redirect HF_HOME to the data disk explicitly (see §8). Disk is expandable — prefer expand over silently shrinking the experiment (principle #9). Get explicit user confirmation naming rm -rf targets (the harness classifier blocks agent-inferred irreversible deletes).
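The du audit can also be sketched in Python when the result needs to be presented to the user as a list of exact targets and sizes. This is an illustration under assumptions: the function names are hypothetical, sizes are apparent file sizes rather than the disk blocks du reports, and symlinks are skipped so files reached through links are not counted twice.

```python
import os


def dir_size_bytes(root):
    """Total apparent size of regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def largest_children(parent, prefix="models--"):
    """Return (size_bytes, path) for each matching child dir, largest first."""
    rows = []
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False) and entry.name.startswith(prefix):
            rows.append((dir_size_bytes(entry.path), entry.path))
    return sorted(rows, reverse=True)
```

Calling largest_children(os.path.expanduser("~/.cache/huggingface/hub")) yields the candidate list; nothing here deletes anything, which keeps the deletion decision with the user.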


3. NETWORK

Egress proxy — source /etc/network_turbo is MANDATORY (AD1). Instances start with no proxy; direct egress to api.wandb.ai / huggingface.co / github.com / pypi.org is unreliable (0.5 s … 300 s … blocked). Every shell that calls wandb / HF / pip / git must source /etc/network_turbo first (source /etc/network_turbo 2>/dev/null || true at the top of every wrapper). It exports http_proxy / https_proxy pointing at the in-DC academic proxy (http://<proxy-ip>:<port>), a no_proxy allow-list for domestic endpoints, and the CA bundle. Perf delta: wandb push ~0.8 s with turbo vs >120 s timeout without — no exceptions, even a small wandb.summary write can wedge for minutes.
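Python entry points started without a wrapper shell do not inherit what source /etc/network_turbo exports. A hedged sketch of one way to pull those variables into the current process: it sources the script in a throwaway bash and copies only the three proxy variables named above. The helper name proxy_env_from is hypothetical, bash is assumed to be on PATH, and the CA-bundle variable is left out because its name is not given here.

```python
import subprocess

PROXY_KEYS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_env_from(script="/etc/network_turbo"):
    """Source `script` in a child bash and return the proxy variables it sees.

    A missing script is tolerated (same `|| true` guard as the wrappers), in
    which case only proxy variables already present in the parent environment
    are returned.
    """
    out = subprocess.run(
        ["bash", "-c", 'source "$1" 2>/dev/null || true; env', "_", script],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    found = {}
    for line in out.splitlines():
        key, sep, value = line.partition("=")
        if sep and key in PROXY_KEYS:
            found[key] = value
    return found
```

A typical use would be os.environ.update(proxy_env_from()) at the very top of a script, before wandb or huggingface_hub is imported, so that the first network call already goes through the proxy.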

China mirrors (AD2). HF behind the GFW → HF_ENDPOINT=https://hf-mirror.com or pull from ModelScope. Two compounding traps: (a) HF's Xet CAS backend is NOT mirror-proxied (the mirror covers the API but big .safetensors shards still hit the flaky international endpoint) → export HF_HUB_DISABLE_XET=1 (or pip uninstall -y hf_xet) to force the classic LFS path the mirror does proxy; (b) no_proxy in network_turbo lists modelscope.com but not modelscope.cn — routing a DOMESTIC source through the international-acceleration proxy SLOWS it. Wrap every download in a timeout <s> … && break retry loop (resumes partial files; a stall ≠ permanent failure). Full mirror table + no_proxy ladder → references/china-network.md.
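The timeout-and-retry wrapper can be sketched in Python as follows. It is an illustration, not the skill's own loop: run_with_retries and its defaults are hypothetical, and the command passed in is whatever resumable download command the project already uses.

```python
import subprocess
import time


def run_with_retries(cmd, timeout_s, attempts=5, pause_s=1.0, env=None):
    """Run `cmd` until it exits 0, killing and retrying attempts that stall.

    Returns True on the first successful attempt, False once all attempts
    are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child when the timeout expires.
            if subprocess.run(cmd, timeout=timeout_s, env=env).returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            pass  # a stall is not a permanent failure; the next attempt resumes
        if attempt < attempts:
            time.sleep(pause_s)
    return False
```

The env argument is where HF_ENDPOINT and HF_HUB_DISABLE_XET=1 would be passed (merged into a copy of os.environ); the retry only helps when the underlying downloader resumes partial files, as the text above assumes.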

Port exposure. AutoDL maps a single custom port (6006) for user services; the platform also exposes JupyterLab. SSH port is the per-instance <PORT> and changes on re-create.

Platform TensorBoard is pinned to /root/tf-logs (AD7). The image autostarts tensorboard --logdir /root/tf-logs --port 6007 on boot and the AutoPanel TB tile proxies straight to that pid — the --logdir is hard-pinned and cannot be reconfigured from inside the container. Events written anywhere else are invisible in the web tile no matter how correct the SummaryWriter setup. Fix: write to SummaryWriter(log_dir="/root/tf-logs/<run>"), or ln -sfn <your-tb> /root/tf-logs/<run> (the pinned TB has --reload=5, so the run appears within ~5 s — no restart). Verify with curl -s http://127.0.0.1:6007/data/runs (expect a JSON array with the run), NOT ss (can show nothing inside the container while curl returns 200). Local logs die with the instance — for durable curves use a hosted tracker (REQUIRED: huggingface-skills:huggingface-trackio).
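A sketch of the symlink fix and the /data/runs check in Python, assuming the port and path stated above. The function names are hypothetical. The check bypasses any proxy variables on purpose, since a shell that has sourced /etc/network_turbo could otherwise route the loopback request through the proxy.

```python
import json
import os
import urllib.request


def link_run(tb_dir, run_name, tf_logs="/root/tf-logs"):
    """Point <tf_logs>/<run_name> at tb_dir, replacing an earlier link.

    Mirrors `ln -sfn` for the symlink case; raises OSError if a real
    directory already occupies the target name.
    """
    os.makedirs(tf_logs, exist_ok=True)
    target = os.path.join(tf_logs, run_name)
    tmp = target + ".tmp-link"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(os.path.abspath(tb_dir), tmp)
    os.replace(tmp, target)  # swaps the link itself, never the directory behind it
    return target


def tb_runs(base="http://127.0.0.1:6007", timeout_s=5):
    """Return the run list the pinned TensorBoard serves at /data/runs."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"{base}/data/runs", timeout=timeout_s) as resp:
        return json.load(resp)
```

After link_run("<your-tb>", "<run>"), the run name should show up in tb_runs() within a few seconds; an empty list means the tile will be empty too.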

SSH flavor. Direct-TCP SSH on the per-instance host:port, so scp/rsync work normally (no proxied-SSH restriction). Use a per-dir resumable loop for large transfers (single-connection scp -r resets mid-transfer); rsync -avz --partial is preferred. Transport patterns → references/ssh_transport.md.


4. SPOT / INTERRUPTION + RESUME (principle #7/#8)

No spot/bid/preemption model — AutoDL is on-demand. There is no mid-run eviction, no SIGTERM grace window to handle (spot_grace: n/a). The real loss vectors are: (a) forgot to release/关机 → idle billing (principle #1); (b) an instance reboot that ends a non-detached process (a vanished process is not always OOM — enumerate reboot / OOM / SSH-HUP / manual-kill before concluding, see references/gotchas_universal.md); (c) availability — the GPU plan being sold out at create-time (build retry-until-available, not survive-an-eviction).

Resume hook. The universal spine still applies (principle #8): checkpoint atomically to the data disk + sync best.pth to FS, and resume-from-latest unconditionally on relaunch. The detach primitive (§6) makes the identical launch command survive an SSH drop; checkpoint+resume makes it survive a reboot. Cadence formula → references/spot-resilience.md (the formula generalizes even without spot — it bounds re-compute lost to a reboot).
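The two halves of the resume hook, atomic write and resume-from-latest, can be sketched without any framework dependency. This is a minimal illustration: the epoch_<n>.pth naming is an assumption made for the example, and a real trainer would hand the serialized bytes (or a torch.save call on the temp path) to the same pattern.

```python
import os
import re
import tempfile


def atomic_write(path, data):
    """Write bytes to `path` so a crash never leaves a truncated checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic because tmp lives on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def latest_checkpoint(directory, pattern=r"epoch_(\d+)\.pth$"):
    """Return the checkpoint with the highest epoch number, or None."""
    best = None
    if os.path.isdir(directory):
        for name in os.listdir(directory):
            m = re.match(pattern, name)
            if m and (best is None or int(m.group(1)) > best[0]):
                best = (int(m.group(1)), os.path.join(directory, name))
    return best[1] if best else None
```

Calling latest_checkpoint unconditionally at startup is what lets the identical launch command serve both the first run (it returns None) and every relaunch after a reboot.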


5. TEARDOWN / BILLING (principle #9 + the Iron Law)

关机 (shutdown / power-off) STOPS the meter AND keeps /root + both disks; this is the AutoDL EXCEPTION among rentals. Everywhere else (RunPod wipes the container disk on stop, vast bills the disk forever, K8s wipes the pod FS, Colab loses /content) a "stop" is lossy or still-billing. On AutoDL, 关机 is the safe park: meter off, all three tiers intact, restart later. There is also 无卡模式 (no-GPU mode): a cheap restart for copying files or fixing the env without paying for the GPU.

| Action | Stops meter? | Keeps / + data disk? | Keeps FS? | Reversible? |
|---|---|---|---|---|
| 关机 (shutdown) | yes | yes | yes | yes: restart anytime (the AutoDL exception) |
| 无卡模式 (no-GPU) | mostly (cheap) | yes | yes | yes |
| 释放 (release/destroy) | yes | NO | yes | NO: deletes / + data disk irreversibly |

Cost trap. 关机 still bills the data-disk storage at a small rate while the GPU meter is off: far cheaper than running, but not free. Only 释放 fully ends storage billing, at the cost of the data disk.

⚠️ Auto-release clock (AD-DANGER): a 关机 (stopped) instance is auto-released after 15 days (the console shows "关机 15 天后释放", i.e. "released 15 days after shutdown"), and that release deletes / and the data disk. 关机 is therefore safe parking only within the window; for a longer pause, sync best.pth to /root/autodl-fs (survives 释放) or expect to re-download. Low balance / arrears also force-stop the instance. Surface this to the user up front (principle #10): most users assume 关机 parks the box indefinitely.

Teardown Iron Law (SKILL.md Phase 5): no 释放 / file-delete until best.pth is pulled to local AND verified by load (scripts/verify_local.py) AND the user explicitly approves; "it looked done in the log" is not evidence (principle #3). Because 关机 is non-destructive here, the cheap safe move when unsure is to 关机 and ask, never 释放 on a guess. REQUIRED: superpowers:verification-before-completion is the general form of this gate.
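The 15-day window and the three-part release gate are simple enough to state as code, which can help when an agent has to report a concrete date to the user. A minimal sketch; the function names are hypothetical and the window is the figure quoted in this section, so re-check it against the console.

```python
from datetime import datetime, timedelta

AUTO_RELEASE_AFTER = timedelta(days=15)  # window quoted in this profile


def release_deadline(stopped_at):
    """Moment after which a stopped instance is expected to be auto-released."""
    return stopped_at + AUTO_RELEASE_AFTER


def may_release(pulled_local, verified_by_load, user_approved):
    """All three conditions must hold before a release or a file delete."""
    return bool(pulled_local and verified_by_load and user_approved)
```

An instance stopped on 2025-01-01 12:00 would have a deadline of 2025-01-16 12:00 under this assumption; may_release stays False until the checkpoint is pulled, verified by load, and explicitly approved.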


6. DAEMON TOOL

tmux is the detach primitive when present, but tmux is often NOT installed on a fresh AutoDL image and apt-get install tmux fails when egress is down. Zero-dependency fallback: nohup bash run_queue.sh queue.txt </dev/null >master.log 2>&1 & — survives an SSH drop (SIGHUP), needs no package. Verify either with pgrep -af <script>. The detach survives an SSH drop; it does not survive a 关机/reboot — that is what checkpoint+resume (§4) is for.
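When the launcher is itself a Python script, the same zero-dependency detach can be sketched with the standard library. This is an assumption-labeled illustration, not the skill's launcher: launch_detached is a hypothetical name, and unlike the nohup line above it appends to the log instead of truncating it.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start `cmd` in its own session with output appended to `log_path`.

    start_new_session=True runs setsid() in the child, so it has no
    controlling terminal and is not sent SIGHUP when the SSH session drops.
    Returns the Popen handle; the caller may exit without waiting.
    """
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )
    finally:
        log.close()  # the child holds its own copy of the descriptor
    return proc
```

A call such as launch_detached(["bash", "run_queue.sh", "queue.txt"], "master.log") corresponds to the nohup fallback; as with nohup, it does not survive a 关机 or reboot.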

Native queue: none. AutoDL has no built-in scheduler → use the bundled scripts/run_queue.sh.template (resumable queue iterator, start_index for resume) driving scripts/run_one.sh.template per cell. Never overwrite a script while a running bash is mid-execution of it: bash reads the file by byte offset, so an in-place edit makes it re-execute blocks; write the new version under a new filename instead. This is universal physics, see references/gotchas_universal.md.

Monitoring. A session-bound watcher dies with the session; for multi-hour runs deploy the four-layer durable architecture (references/monitoring_patterns.md). Detect "done" by a log marker (grep -q 'QUEUE DONE' master.log), never by pgrep (the waiter's own cmdline matches the pattern, so the wait loops forever). A cloud scheduler cannot reach the rented box: it has no SSH key, and putting one in a cloud sandbox would be a secret leak. The honest recurring check is the remote self-monitor + a session loop with the local key.
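The log-marker check translates directly into a small poller. A sketch under assumptions: the function names and the default intervals are illustrative, and the marker string is the one used in the grep example above.

```python
import time


def queue_done(log_path, marker="QUEUE DONE"):
    """True once `marker` appears in the log; a missing log counts as not done."""
    try:
        with open(log_path, "r", errors="replace") as fh:
            return any(marker in line for line in fh)
    except FileNotFoundError:
        return False


def wait_until_done(log_path, poll_s=60.0, max_wait_s=6 * 3600.0):
    """Poll the log until the marker shows up or the wait budget is spent."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if queue_done(log_path):
            return True
        time.sleep(poll_s)
    return queue_done(log_path)
```

Because the check reads the log and never inspects the process table, it cannot match its own command line, and the max_wait_s bound keeps a dead queue from being waited on forever.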


7. TOP GOTCHAS (AutoDL-pinned; universal ones → references/gotchas_universal.md)

AD1 — external network call hangs / wandb shows 0 runs. Symptom: wandb.init times out at 90/120/180 s, dashboard reads 0 runs while wandb/run-* exist locally; HF downloads stall; pip/git glacial. Root cause: instances start with no proxy; direct egress to wandb/HF/PyPI/GitHub is unreliable or blocked, and wandb-core's retry logic under a flaky link can roll back already-uploaded runs. Fix: source /etc/network_turbo at the top of every shell/wrapper before any external call; recover an empty cloud project with for d in wandb/run-*; do timeout 120 wandb sync "$d"; done.

AD2 — HF download stalls even with hf-mirror + turbo. Symptom: from_pretrained / snapshot_download hangs or ConnectTimeout on big .safetensors shards. Root cause: (a) HF's Xet CAS backend is not mirror-proxied; (b) no_proxy lists modelscope.com not modelscope.cn (domestic source forced through international proxy = slower); (c) a curl test run without turbo measures the wrong path. Fix: export HF_HUB_DISABLE_XET=1 (or pip uninstall -y hf_xet) with HF_ENDPOINT=https://hf-mirror.com, or pull from ModelScope to a plain dir + load via local-path override; wrap in a timeout … && break resume loop. Detail → references/china-network.md.

AD3 — cross-region instances cannot share FS. Symptom: two instances in different regions see identical /root/autodl-fs/ paths but files written from one are invisible to the other. Root cause: FS quota is region-scoped; each region has its own physical mount. Fix: create the FS quota in the same region as the instances; bridge regions via scp from a chosen primary; verify with a write-one / read-other probe.

AD4 — FS write fails "No space left" while df -h looks fine. Symptom: cp/mkdir to /root/autodl-fs fails though df -h shows ~34%; df -i shows … 0 100%. Root cause: the shared FS enforces a hard ~200K inode cap independent of bytes; per-sample eval visualization (many tiny files) exhausts it. Fix: monitor df -i; cap per-sample eval vis on large test sets (sizing → verifying-dl-experiments); once a results dir is verified locally, prune its per-sample image subdir from FS; recover by find /root/autodl-fs -type d -name '<vis-dir>' -exec rm -rf {} + to free inodes fast.

AD5 — data disk full; HF cache is the hidden hog; agent rm auto-denied. Symptom: /root/autodl-tmp at 100% though runs/ looks small; an agent rm -rf of "obvious junk" is auto-denied. Root cause: ~/.cache/huggingface is symlinked onto the data disk, so the HF model cache (tens of GB) is the real hog; the harness blocks irreversible rm -rf whose targets the agent inferred. Fix: audit du -sh ~/.cache/huggingface/hub/models--* | sort -rh; set HF_HOME to a chosen data-disk dir + keep the metric/eval JSONs (tiny evidence); present exact deletion targets + sizes for explicit user confirmation; offer "clean vs expand the disk".

AD6 — base IS the env; a "never use base" rule blocks every remote command. Symptom: a local "don't run DL in conda base" guard fires on ssh autodl 'python train.py', but conda env list shows nothing and /root/miniconda3/envs/ is empty; poll scripts calling python3 exit 127. Root cause: the image installs the whole DL stack into base — base IS the single-tenant project env (no /envs/), and the image often ships only python (no python3). Fix: train with /root/miniconda3/bin/python; exempt remote-ssh + instance base from the local guard (never conda create --clone base); in remote scripts use the explicit interpreter or pure shell, never bare python3.

AD7 — platform TensorBoard pinned to /root/tf-logs; events elsewhere invisible. Symptom: the events file is non-empty and curl http://127.0.0.1:6007/ returns 200, but the AutoPanel TB tile shows zero runs; /data/runs returns []. Root cause: the image autostarts tensorboard --logdir /root/tf-logs and the tile proxies that pid; --logdir is hard-pinned and not reconfigurable in-container. Fix: write SummaryWriter(log_dir="/root/tf-logs/<run>"), or ln -sfn <your-tb> /root/tf-logs/<run> (the pinned TB's --reload=5 picks it up in ~5 s); verify with curl … /data/runs, not ss. (Also: restart the TB server to evict STALE cached tags after deleting/renaming runs.) The cross-platform "live panel silently empty" class (path/port/process mismatch on any platform) is the general form → references/gotchas_universal.md U39.

AD8 — wandb val-phase CPU memory spike to 30+ GB at epoch 1 end. Symptom: at the end of epoch 1 (validation), cgroup memory jumps from ~8 GB to 30+ GB, sometimes wedging the instance. Root cause: project trainers log per-sample distributions at step==1 (e.g. LPIPS/VGG over ~2000 samples on CPU = ~30 GB activations). Fix: cap the val-time sample accumulator — -o training.val_metric_sample_cap=256 (project-specific knob; check the trainer for the equivalent). Distinct from a DataLoader-worker cgroup OOM (universal gotcha).

AD9 — project torch pin would DOWNGRADE the image's working build. Symptom: the image ships e.g. a new-arch-capable torch (sm_120); the project pins torch<2.9; a naive pip install -r requirements.txt replaces it with a wheel lacking the arch's kernels → no kernel image is available at first forward. Root cause: the image torch/CUDA build is matched to the rented GPU arch; the project pin is stale for it. Fix: filter framework pins out of the remote install and keep the image build: grep -ivE '^(torch|torchvision|torchaudio)([^A-Za-z0-9._-]|$)' requirements.txt > /root/req_remote.txt && pip install -r /root/req_remote.txt (the trailing group anchors the package name, so torchmetrics and similar names are kept); smoke torch.cuda.get_device_capability() + a heavy import before launch; disclose the off-band torch version with results.
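The pin filter in AD9 can also be written in Python, which makes the name matching explicit: a bare prefix match on torch would also drop unrelated packages such as torchmetrics. A minimal sketch; strip_framework_pins is a hypothetical helper, and it only handles plain name-based requirement lines, not direct wheel URLs.

```python
import re

FRAMEWORK = {"torch", "torchvision", "torchaudio"}
_NAME = re.compile(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def strip_framework_pins(lines):
    """Drop requirement lines whose package name is exactly a framework package.

    Comments, options such as `-r other.txt`, and packages that merely start
    with "torch" are kept.
    """
    kept = []
    for line in lines:
        m = _NAME.match(line)
        name = m.group(1).lower().replace("_", "-") if m else ""
        if name not in FRAMEWORK:
            kept.append(line)
    return kept
```

Writing the kept lines to a separate file (the text above uses /root/req_remote.txt) and installing from that file leaves the image's torch build untouched.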


8. SCRIPT OVERRIDES

The exact values to parameterize the scripts/ templates (scripts/run_one.sh.template, scripts/run_queue.sh.template) for AutoDL:

DATA_DIR=/root/autodl-tmp             # fast NVMe data disk — live checkpoints, logs, HF cache
DURABLE_DIR=/root/autodl-fs           # region-locked shared FS — the only tier surviving 释放
PROXY_HOOK='source /etc/network_turbo 2>/dev/null || true'   # MANDATORY before any external call (AD1)
CRED_FILE=/root/.wandb_key            # per-instance ONLY — the FS security classifier blocks wandb keys
SCRATCH='latest.pth'                  # prune on success; keep best.pth (the keepable artifact)
HF_HOME=/root/autodl-tmp/huggingface_cache   # redirect off the symlinked ~/.cache hog (AD5)
HF_ENDPOINT=https://hf-mirror.com     # + HF_HUB_DISABLE_XET=1 (AD2)
DETACH=tmux                           # nohup fallback when tmux is absent (§6)
PY=/root/miniconda3/bin/python        # base IS the env — explicit interpreter, never bare python3 (AD6)
TB_LOGDIR=/root/tf-logs               # platform TB is pinned here (AD7)

Credential push (AD-specific). The FS security classifier blocks files matching wandb-key patterns — put the key at the per-instance /root/.wandb_key, never on /root/autodl-fs. Stream exactly one credential block via stdin so the secret never appears in a command; the wrapper reads it into WANDB_API_KEY before launch. Secrets-via-stdin pattern → references/ssh_transport.md.

Checked-sync (the gated success line). run_one.sh writes live checkpoints to $DATA_DIR/checkpoints/<name>, prunes latest.pth on success, then syncs best.pth to $DURABLE_DIR/final_ckpts/<name> gating the success echo on the actual copy result — an unconditional "synced" lies when the FS inode cap (AD4) silently fails the mkdir/cp (universal silent-sync gotcha). Until a download is verified locally, the data disk copy is source-of-truth.
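The gated success line can be sketched in Python as a copy that reports success only after re-reading what landed. This is an illustration, not the content of run_one.sh: checked_sync and sha256_of are hypothetical names, and a hash re-read over NFS may be served from cache, so it confirms the copy completed rather than proving durability.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk=1 << 20):
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def checked_sync(src, dst_dir):
    """Copy `src` into `dst_dir`; return True only if the copy verifiably landed."""
    try:
        os.makedirs(dst_dir, exist_ok=True)
        dst = shutil.copy2(src, dst_dir)
        return sha256_of(dst) == sha256_of(src)
    except OSError:
        return False  # e.g. "No space left" when the inode cap is hit


def report_sync(src, dst_dir):
    """Print the success line only when the copy really happened."""
    ok = checked_sync(src, dst_dir)
    print(f"synced {src} -> {dst_dir}" if ok else f"SYNC FAILED {src} -> {dst_dir}")
    return ok
```

The point of the design is that the word "synced" is derived from the copy result instead of being printed unconditionally after it.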