Profile: AutoDL
The deepest, battle-tested profile — a Chinese cgroup-isolated SSH-rental with a 3-tier storage model
and the one rental where the meter-stop action is non-destructive. Fills all 8 schema sections
(profiles/_schema.md) at full depth. Read this before Phase 0; it owns every path, proxy, billing
verb, and TB pin the SKILL.md phases delegate to. Universal gotchas are NOT restated here — see
references/gotchas_universal.md.
Surface to the user up front (principle #10): conveniences most users miss. The console has a one-click "设置SSH免密登录" (registers your key so the agent connects non-interactively), GPU-availability notifications ("订阅GPU通知"), and built-in AutoPanel / JupyterLab / TensorBoard tiles. ⚠️ Danger clocks: a 关机 (stopped) instance is auto-released after 15 days → the data disk is deleted (AD-DANGER, §5); only
/root/autodl-fs survives a 释放; low balance / arrears force-stop the instance. And the TB tile is pinned to /root/tf-logs: write your logger there (or symlink) or the panel shows empty (AD7 / U39).
To jump: grep -in '<keyword>' profiles/autodl.md (e.g. grep -in inode profiles/autodl.md).
Table of contents
- LAUNCH — entry points + env contract (base miniconda IS the env)
- STORAGE MODEL — 3 tiers + survival matrix + inode cap
- NETWORK — academic proxy + China mirrors + pinned TB
- SPOT / INTERRUPTION + RESUME — effectively on-demand
- TEARDOWN / BILLING — 关机 stops the meter AND keeps the disk (the AutoDL exception)
- DAEMON TOOL — tmux / nohup
- TOP GOTCHAS — AD1..AD9, platform-pinned
- SCRIPT OVERRIDES — values to parameterize scripts/
---
platform: autodl
kind: ssh-rental
meter_stop_verb: 关机 # shutdown/power-off STOPS billing AND keeps /root + disks
meter_stop_irreversible: false # the AutoDL EXCEPTION — 关机 is reversible; only 释放/release deletes
detach_primitive: tmux # nohup fallback when tmux is not installed (often absent on fresh image)
spot_available: false # on-demand only; no spot/bid/preemption model
spot_grace: n/a
shared_fs: true # /root/autodl-fs — region-locked, cross-instance within one region
inode_cap: ~200K # hard cap on the shared FS, independent of byte capacity
free_egress: true # no per-GB egress fee, but cross-GFW pulls need the academic proxy (see china_mirror_needed)
china_mirror_needed: true # behind the GFW — hf-mirror / ModelScope + /etc/network_turbo
host_driver_cuda_max: image-dependent # the prebuilt image pins torch+CUDA; do not downgrade (AD9)
local_nvme: true # /root/autodl-tmp data disk is fast local NVMe, per-instance
---
1. LAUNCH
First time? (rent → reach the box). On the AutoDL console: pick a GPU + region with stock → 创建实例
(choose the PyTorch image — the base env ships prebuilt) → register your key once via 设置SSH免密登录
(so the agent connects non-interactively) → copy the instance's SSH connection string + password from the
console → test ssh -p <PORT> root@connect.<region>.seetacloud.com 'nvidia-smi'. That string is your entry to
every phase below. (Console-only steps; AutoDL's UI shifts — re-check its docs if a label moved.)
Entry points. Web console (创建实例) for create/release/power; per-instance SSH connection string from
the console (ssh -p <PORT> root@connect.<region>.seetacloud.com). No first-class platform CLI/REST for
job control — SSH is the orchestration channel. Set a stable alias per instance in ~/.ssh/config
(Host autodl-<proj>-<N>, HostName connect.<region>.seetacloud.com, Port <PORT>) so every later
command is short; the port is assigned at create-time and changes on re-create (update the alias).
SSH/keepalive config → references/ssh_transport.md.
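The alias step above can be sketched as a small function that prints the ~/.ssh/config stanza for one instance. The function name is hypothetical; `User root` is taken from the connection string in the text, and project, region, and port are whatever the console shows.

```shell
# Print an ~/.ssh/config stanza for one AutoDL instance.
# Args: <proj> <N> <region> <port> (all values come from the console).
autodl_ssh_stanza() {
  local proj="$1" n="$2" region="$3" port="$4"
  printf 'Host autodl-%s-%s\n' "$proj" "$n"
  printf '    HostName connect.%s.seetacloud.com\n' "$region"
  printf '    Port %s\n' "$port"
  printf '    User root\n'
}

# Usage (appends to your config; edit the Port line after a re-create):
#   autodl_ssh_stanza myproj 1 <region> <PORT> >> ~/.ssh/config
#   ssh autodl-myproj-1 'nvidia-smi'
```

Printing to stdout rather than editing the file in place keeps the function side-effect free; the caller decides whether to append or replace.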
Env contract — the prebuilt base miniconda IS the env (AD6). The image ships the full DL stack into
base (/root/miniconda3/bin/python); there is no /root/miniconda3/envs/<name>/. Base is the
deliberate single-tenant project env. Never conda create a new env or conda create --clone base on the rental:
cloning duplicates ~16 GB of base packages onto the disk you just freed, for zero benefit. Train with the explicit
interpreter /root/miniconda3/bin/python; in remote polls use that path or pure shell, never bare
python3 (it may be absent → exit 127). When installing project deps, filter framework pins so a
requirements.txt does not downgrade the image's torch build (AD9).
The "no DL in conda base" discipline applies to the persistent local machine only — on an ephemeral rental, base IS the expected place to run. A local env-guard hook must exempt remote-ssh + instance base.
2. STORAGE MODEL (survival matrix — principle #4)
Three tiers, each with a different speed / size / inode profile and a different survival behavior:
| Tier | Path | Speed | Size | Inode cap | Scope |
|---|---|---|---|---|---|
| System disk | / | medium | ~30 GB | none | per-instance |
| Data disk | /root/autodl-tmp | fast NVMe | per-plan (e.g. ~50 GB) | none | per-instance |
| Shared FS | /root/autodl-fs | NFS (slow, ~30 s/sync) | ~200 GB | ~200K (hard) | region-locked, all instances in one region |
Survival matrix — the part most platforms get wrong, and where AutoDL is the exception:
| Tier | Survives 关机 (stop)? | Survives 释放 (release/destroy)? | Notes |
|---|---|---|---|
| / system | yes | no | AutoDL persists /root across power-off, UNLIKE RunPod/vast/K8s/Colab |
| /root/autodl-tmp data | yes | no | fast tier; checkpoints written here mid-run |
| /root/autodl-fs shared | yes | yes | the ONLY tier that survives release; region-locked |
Where checkpoints MUST go for the §5 teardown verb: write live checkpoints to the fast data disk
(/root/autodl-tmp/checkpoints/<name>, never the 30 GB system disk), then checked-sync best.pth
to /root/autodl-fs — the only tier that survives a 释放. If only ever using 关机, the data disk also
survives, but syncing the durable copy to FS is the safe default (a later release loses the data disk).
Region/DC-lock (AD3). FS quota is region-scoped; each region has its own physical mount. Files written
from a <region-a> instance are invisible to a <region-b> instance even at the identical
/root/autodl-fs/ path. Create the FS quota in the same region as the instances; to bridge regions,
pick one region as primary and scp between them (slow). Confirm sharing with a write-from-one / read-from-
another probe before relying on it.
Inode discipline (AD4). The ~200K cap is independent of bytes: df -h can read 34% while cp
fails "No space left" because df -i is at 100%. The inode bomb is per-sample eval visualization
(files_per_sample × N_samples × N_conditions → tens of thousands of tiny files); checkpoints (few large
files) are inode-cheap. Monitor df -i, not just df -h (Phase 0 + every space check). Eval-artifact
sizing policy is owned by REQUIRED: verifying-dl-experiments.
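A minimal sketch of the df -i check described above. The function names and the 80% default threshold are assumptions, not part of the bundled scripts; the parser reads `df -P -i` output on stdin so it can be exercised without the real mount.

```shell
# Read `df -P -i <path>` output on stdin and print the inode-use percentage
# as a bare integer. Prints nothing if the filesystem reports no inode data.
inode_use_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); if ($5 ~ /^[0-9]+$/) print $5 }'
}

# Return 0 when inode use on <path> is at or above <threshold> percent.
# The default threshold of 80 is an assumption; tune it per project.
inode_alarm() {
  local path="$1" threshold="${2:-80}" pct
  pct="$(df -P -i "$path" 2>/dev/null | inode_use_pct)"
  [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]
}

# Usage:
#   inode_alarm /root/autodl-fs 80 && echo "WARN: shared FS inodes nearly exhausted" >&2
```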
Data-disk hog (AD5). When /root/autodl-tmp hits 100% but runs/ looks small, the real hog is the
HF cache symlinked onto the data disk (~/.cache/huggingface → tens of GB of model blobs). Audit
du -sh ~/.cache/huggingface/hub/models--* | sort -rh before deleting checkpoints; redirect HF_HOME to
the data disk explicitly (see §8). Disk is expandable — prefer expand over silently shrinking the
experiment (principle #9). Get explicit user confirmation naming rm -rf targets (the harness classifier
blocks agent-inferred irreversible deletes).
3. NETWORK
Egress proxy — source /etc/network_turbo is MANDATORY (AD1). Instances start with no proxy; direct
egress to api.wandb.ai / huggingface.co / github.com / pypi.org is unreliable (0.5 s … 300 s …
blocked). Every shell that calls wandb / HF / pip / git must source /etc/network_turbo first
(source /etc/network_turbo 2>/dev/null || true at the top of every wrapper). It exports
http_proxy / https_proxy pointing at the in-DC academic proxy (http://<proxy-ip>:<port>), a
no_proxy allow-list for domestic endpoints, and the CA bundle. Perf delta: wandb push ~0.8 s with turbo
vs >120 s timeout without — no exceptions, even a small wandb.summary write can wedge for minutes.
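The wrapper preamble above can be written as a function that tolerates a missing hook, which keeps the same script runnable off-platform. The function names are hypothetical; the only platform fact assumed is that /etc/network_turbo exports http_proxy / https_proxy as the text describes.

```shell
# Source the proxy hook if it is readable; succeed silently if it is not.
# Arg: [hook-path], defaulting to the AutoDL location.
load_proxy() {
  local hook="${1:-/etc/network_turbo}"
  if [ -r "$hook" ]; then
    . "$hook" 2>/dev/null
  fi
  return 0
}

# Report whether a proxy is exported in this shell.
proxy_is_set() {
  [ -n "${http_proxy:-}" ] || [ -n "${https_proxy:-}" ]
}

# Usage at the top of a wrapper:
#   load_proxy
#   proxy_is_set || echo "WARN: no proxy exported; external calls may hang" >&2
```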
China mirrors (AD2). HF behind the GFW → HF_ENDPOINT=https://hf-mirror.com or pull from
ModelScope. Two compounding traps: (a) HF's Xet CAS backend is NOT mirror-proxied (the mirror
covers the API but big .safetensors shards still hit the flaky international endpoint) →
export HF_HUB_DISABLE_XET=1 (or pip uninstall -y hf_xet) to force the classic LFS path the mirror does
proxy; (b) no_proxy in network_turbo lists modelscope.com but not modelscope.cn — routing a
DOMESTIC source through the international-acceleration proxy SLOWS it. Wrap every download in a
timeout <s> … && break retry loop (resumes partial files; a stall ≠ permanent failure). Full mirror
table + no_proxy ladder → references/china-network.md.
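One way to write the timeout-and-retry loop mentioned above, as a sketch. The function name, attempt count, and pause are assumptions; `timeout` is the coreutils command, so the wrapped command must be an external program, not a shell function.

```shell
# Run a command under a per-attempt timeout, retrying up to <attempts> times.
# Args: <attempts> <seconds-per-attempt> <pause-seconds> <command...>
retry_with_timeout() {
  local attempts="$1" secs="$2" pause="$3" i=1
  shift 3
  while [ "$i" -le "$attempts" ]; do
    # timeout kills a stalled attempt; a resumable downloader then continues
    # from the partial file on the next pass.
    if timeout "$secs" "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed or stalled" >&2
    if [ "$i" -lt "$attempts" ]; then sleep "$pause"; fi
    i=$((i + 1))
  done
  return 1
}

# Usage (values are illustrative):
#   export HF_ENDPOINT=https://hf-mirror.com HF_HUB_DISABLE_XET=1
#   retry_with_timeout 20 600 10 huggingface-cli download <repo-id>
```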
Port exposure. AutoDL maps a single custom port (6006) for user services; the platform also exposes
JupyterLab. SSH port is the per-instance <PORT> and changes on re-create.
Platform TensorBoard is pinned to /root/tf-logs (AD7). The image autostarts
tensorboard --logdir /root/tf-logs --port 6007 on boot and the AutoPanel TB tile proxies straight to that
pid — the --logdir is hard-pinned and cannot be reconfigured from inside the container. Events written
anywhere else are invisible in the web tile no matter how correct the SummaryWriter setup. Fix: write to
SummaryWriter(log_dir="/root/tf-logs/<run>"), or ln -sfn <your-tb> /root/tf-logs/<run> (the pinned TB
has --reload_interval=5, so the run appears within ~5 s with no restart). Verify with
curl -s http://127.0.0.1:6007/data/runs (expect a JSON array with the run), NOT ss (can show nothing
inside the container while curl returns 200). Local logs die with the instance — for durable curves use a
hosted tracker (REQUIRED: huggingface-skills:huggingface-trackio).
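The symlink fix and the curl check above, sketched as two helpers. The function names are hypothetical; the default logdir and port are the ones stated in the text.

```shell
# Symlink an existing TensorBoard run directory under the pinned logdir.
# Args: <run-dir> <run-name> [logroot]
link_tb_run() {
  local src="$1" name="$2" root="${3:-/root/tf-logs}"
  [ -d "$src" ] || { echo "no such run dir: $src" >&2; return 1; }
  mkdir -p "$root" || return 1
  # -n replaces an existing link instead of descending into its target.
  ln -sfn "$(cd "$src" && pwd)" "$root/$name"
}

# Ask the pinned server which runs it sees. Arg: [port]
tb_runs() {
  curl -s "http://127.0.0.1:${1:-6007}/data/runs"
}

# Usage:
#   link_tb_run /root/autodl-tmp/runs/exp1/tb exp1 && sleep 6 && tb_runs
```

Resolving the source to an absolute path before linking avoids a dangling link when the helper is called with a relative path from another working directory.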
SSH flavor. Direct-TCP SSH on the per-instance host:port — scp/rsync work normally (no proxied-SSH
restriction). Use a per-dir resumable loop for large transfers (single-connection scp -r resets mid-
transfer); rsync -avz --partial is preferred. Transport patterns → references/ssh_transport.md.
4. SPOT / INTERRUPTION + RESUME (principle #7/#8)
No spot/bid/preemption model — AutoDL is on-demand. There is no mid-run eviction, no SIGTERM grace
window to handle (spot_grace: n/a). The real loss vectors are: (a) forgetting to 关机 or 释放 → idle
billing (principle #1); (b) an instance reboot that ends a non-detached process (a vanished process is
not always OOM — enumerate reboot / OOM / SSH-HUP / manual-kill before concluding, see
references/gotchas_universal.md); (c) availability — the GPU plan being sold out at create-time (build
retry-until-available, not survive-an-eviction).
Resume hook. The universal spine still applies (principle #8): checkpoint atomically to the data disk +
sync best.pth to FS, and resume-from-latest unconditionally on relaunch. The detach primitive (§6) makes
the identical launch command survive an SSH drop; checkpoint+resume makes it survive a reboot. Cadence
formula → references/spot-resilience.md (the formula generalizes even without spot — it bounds
re-compute lost to a reboot).
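A shell-side sketch of the two halves of the resume hook: publish a checkpoint by rename, and pick the file to resume from on relaunch. The function names and the `.tmp` suffix are assumptions; in practice the trainer usually does the write itself, and the rename is only atomic when both paths sit on the same filesystem.

```shell
# Publish a fully written temp file under its final name.
# Args: <tmp-file> <final-file> (same filesystem, e.g. both on the data disk)
publish_ckpt() {
  mv -f "$1" "$2"
}

# Print the checkpoint to resume from, or nothing for a fresh start.
# Arg: <checkpoint-dir>
resume_ckpt() {
  local dir="$1"
  # A zero-byte latest.pth is treated as absent.
  if [ -s "$dir/latest.pth" ]; then
    printf '%s\n' "$dir/latest.pth"
  fi
}

# Usage in a launch wrapper (the trainer's resume flag is project-specific):
#   ckpt="$(resume_ckpt /root/autodl-tmp/checkpoints/<name>)"
#   [ -n "$ckpt" ] && echo "resuming from $ckpt"
```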
5. TEARDOWN / BILLING (principle #9 + the Iron Law)
关机 (shutdown / power-off) STOPS the meter AND keeps /root + both disks — this is the AutoDL
EXCEPTION among rentals. Everywhere else (RunPod wipes the container disk on stop, vast bills the disk
forever, K8s wipes the pod FS, Colab loses /content) a "stop" is lossy or still-billing. On AutoDL,
关机 is the safe park: meter off, all three tiers intact, restart later. There is also a no-GPU mode
(无卡模式) for a cheap restart to copy files or fix the env without paying for the GPU.
| Action | Stops meter? | Keeps / + data disk? | Keeps FS? | Reversible? |
|---|---|---|---|---|
| 关机 (shutdown) | yes | yes | yes | yes — restart anytime (the AutoDL exception) |
| 无卡模式 (no-GPU) | mostly (cheap) | yes | yes | yes |
| 释放 (release/destroy) | yes | NO | yes | NO — deletes / + data disk irreversibly |
Cost trap. 关机 still bills the data-disk storage at a small rate while the GPU meter is off — far
cheaper than running, but not free. Only 释放 fully ends storage billing, at the cost of the data disk.
⚠️ Auto-release clock (AD-DANGER): a 关机 (stopped) instance is auto-released after 15 days (the
console shows "关机 15 天后释放") → that release deletes / and the data disk, so 关机 is safe parking
only within the window; for a longer pause, sync best.pth to /root/autodl-fs (survives 释放) or expect to
re-download. Low balance / arrears also force-stop the instance. Surface this to the user up front
(principle #10) — most users assume 关机 parks the box indefinitely.
Teardown Iron Law (SKILL.md Phase 5): no 释放 / file-delete until best.pth is pulled to local AND
verified by load (scripts/verify_local.py) AND the user explicitly approves — "it looked done in the
log" is not evidence (principle #3). Because 关机 is non-destructive here, the cheap safe move when unsure
is to 关机 and ask, never 释放 on a guess. REQUIRED: superpowers:verification-before-completion is
the general form of this gate.
6. DAEMON TOOL
tmux is the detach primitive when present, but tmux is often NOT installed on a fresh AutoDL image
and apt-get install tmux fails when egress is down. Zero-dependency fallback:
nohup bash run_queue.sh queue.txt </dev/null >master.log 2>&1 & — survives an SSH drop (SIGHUP), needs
no package. Verify either with pgrep -af <script>. The detach survives an SSH drop; it does not
survive a 关机/reboot — that is what checkpoint+resume (§4) is for.
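The tmux-or-nohup choice above can be folded into one launcher. This is a sketch: the function name and the DETACH override variable are assumptions (DETACH mirrors the §8 override), and the command is passed as a single shell string so both branches run it the same way.

```shell
# Launch one shell command line detached from the SSH session.
# Args: <session-name> <log-file> <command-string>
# DETACH=tmux|nohup overrides auto-detection. Assumes the log path contains
# no double quotes, dollar signs, or backticks.
detach_launch() {
  local name="$1" log="$2" cmd="$3" mode="${DETACH:-}"
  if [ -z "$mode" ]; then
    if command -v tmux >/dev/null 2>&1; then mode=tmux; else mode=nohup; fi
  fi
  if [ "$mode" = tmux ]; then
    # tmux hands the string to a shell; the braces make the redirect cover
    # the whole command line, not just its last command.
    tmux new-session -d -s "$name" "{ $cmd ; } >\"$log\" 2>&1"
  else
    nohup sh -c "$cmd" </dev/null >"$log" 2>&1 &
    DETACH_PID=$!   # PID of the detached process (nohup mode only)
  fi
}

# Usage:
#   detach_launch queue master.log 'bash run_queue.sh queue.txt'
#   pgrep -af run_queue.sh   # confirm it is running
```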
Native queue: none. AutoDL has no built-in scheduler → use the bundled scripts/run_queue.sh.template
(resumable queue iterator, start_index for resume) driving scripts/run_one.sh.template per cell.
Never overwrite a script while a running bash is executing it (bash reads the file by byte offset, so an in-place edit makes it re-execute
blocks; version the filename instead). This is universal physics, see references/gotchas_universal.md.
Monitoring. A session-bound watcher dies with the session; for multi-hour runs deploy the four-layer
durable architecture (references/monitoring_patterns.md). Detect "done" by a log marker
(grep -q 'QUEUE DONE' master.log), never by pgrep (the waiter's own cmdline matches the pattern and
loops forever). A cloud scheduler cannot reach the rented box: it has no SSH key, and placing one in a cloud sandbox would leak a
secret; the honest recurring check is the remote self-monitor + a session loop with the local key.
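A sketch of the log-marker check for the session-loop layer only; it is not the four-layer architecture itself. The function name and the 30 s default poll interval are assumptions.

```shell
# Block until <marker> appears in <log>, or give up after <max-seconds>.
# Args: <log> <marker> <max-seconds> [poll-seconds]
wait_for_marker() {
  local log="$1" marker="$2" max="$3" poll="${4:-30}" waited=0
  while [ "$waited" -lt "$max" ]; do
    # -F matches the marker literally. The log, not the process table,
    # is the evidence that the queue finished.
    if grep -qF -- "$marker" "$log" 2>/dev/null; then
      return 0
    fi
    sleep "$poll"
    waited=$((waited + poll))
  done
  # One last look, so a marker written during the final sleep still counts.
  grep -qF -- "$marker" "$log" 2>/dev/null
}

# Usage:
#   wait_for_marker master.log 'QUEUE DONE' 14400 60 && echo "queue finished"
```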
7. TOP GOTCHAS (AutoDL-pinned; universal ones → references/gotchas_universal.md)
AD1 — external network call hangs / wandb shows 0 runs. Symptom: wandb.init times out at
90/120/180 s, dashboard reads 0 runs while wandb/run-* exist locally; HF downloads stall; pip/git glacial.
Root cause: instances start with no proxy; direct egress to wandb/HF/PyPI/GitHub is unreliable or
blocked, and wandb-core's retry logic under a flaky link can roll back already-uploaded runs. Fix:
source /etc/network_turbo at the top of every shell/wrapper before any external call; recover an
empty cloud project with for d in wandb/run-*; do timeout 120 wandb sync "$d"; done.
AD2 — HF download stalls even with hf-mirror + turbo. Symptom: from_pretrained /
snapshot_download hangs or ConnectTimeout on big .safetensors shards. Root cause: (a) HF's Xet CAS
backend is not mirror-proxied; (b) no_proxy lists modelscope.com not modelscope.cn (domestic source
forced through international proxy = slower); (c) a curl test run without turbo measures the wrong path.
Fix: export HF_HUB_DISABLE_XET=1 (or pip uninstall -y hf_xet) with HF_ENDPOINT=https://hf-mirror.com,
or pull from ModelScope to a plain dir + load via local-path override; wrap in a timeout … && break
resume loop. Detail → references/china-network.md.
AD3 — cross-region instances cannot share FS. Symptom: two instances in different regions see
identical /root/autodl-fs/ paths but files written from one are invisible to the other. Root cause: FS
quota is region-scoped; each region has its own physical mount. Fix: create the FS quota in the same
region as the instances; bridge regions via scp from a chosen primary; verify with a write-one / read-other
probe.
AD4 — FS write fails "No space left" while df -h looks fine. Symptom: cp/mkdir to
/root/autodl-fs fails though df -h shows ~34%; df -i shows … 0 100%. Root cause: the shared FS
enforces a hard ~200K inode cap independent of bytes; per-sample eval visualization (many tiny files)
exhausts it. Fix: monitor df -i; cap per-sample eval vis on large test sets (sizing → verifying-dl-experiments);
once a results dir is verified locally, prune its per-sample image subdir from FS; recover by
find /root/autodl-fs -type d -name '<vis-dir>' -exec rm -rf {} + to free inodes fast.
AD5 — data disk full; HF cache is the hidden hog; agent rm auto-denied. Symptom:
/root/autodl-tmp at 100% though runs/ looks small; an agent rm -rf of "obvious junk" is auto-denied.
Root cause: ~/.cache/huggingface is symlinked onto the data disk, so the HF model cache (tens of
GB) is the real hog; the harness blocks irreversible rm -rf whose targets the agent inferred. Fix:
audit du -sh ~/.cache/huggingface/hub/models--* | sort -rh; set HF_HOME to a chosen data-disk dir + keep
the metric/eval JSONs (tiny evidence); present exact deletion targets + sizes for explicit user
confirmation; offer "clean vs expand the disk".
AD6 — base IS the env; a "never use base" rule blocks every remote command. Symptom: a local "don't
run DL in conda base" guard fires on ssh autodl 'python train.py', but conda env list shows nothing and
/root/miniconda3/envs/ is empty; poll scripts calling python3 exit 127. Root cause: the image installs
the whole DL stack into base — base IS the single-tenant project env (no /envs/), and the image often
ships only python (no python3). Fix: train with /root/miniconda3/bin/python; exempt remote-ssh +
instance base from the local guard (never conda create --clone base); in remote scripts use the explicit
interpreter or pure shell, never bare python3.
AD7 — platform TensorBoard pinned to /root/tf-logs; events elsewhere invisible. Symptom: the
events file is non-empty and curl http://127.0.0.1:6007/ returns 200, but the AutoPanel TB tile shows
zero runs; /data/runs returns []. Root cause: the image autostarts tensorboard --logdir /root/tf-logs and the tile proxies that pid; --logdir is hard-pinned and not reconfigurable in-container.
Fix: write SummaryWriter(log_dir="/root/tf-logs/<run>"), or ln -sfn <your-tb> /root/tf-logs/<run>
(the pinned TB's --reload_interval=5 picks it up in ~5 s); verify with curl … /data/runs, not ss. (Also:
restart the TB server to evict STALE cached tags after deleting/renaming runs.) The cross-platform "live panel silently empty" class (path/port/process mismatch on any platform) is the general form → references/gotchas_universal.md U39.
AD8 — wandb val-phase CPU memory spike to 30+ GB at epoch 1 end. Symptom: at the end of epoch 1
(validation), cgroup memory jumps from ~8 GB to 30+ GB, sometimes wedging the instance. Root cause:
project trainers log per-sample distributions at step==1 (e.g. LPIPS/VGG over ~2000 samples on CPU =
~30 GB activations). Fix: cap the val-time sample accumulator — -o training.val_metric_sample_cap=256
(project-specific knob; check the trainer for the equivalent). Distinct from a DataLoader-worker cgroup OOM
(universal gotcha).
AD9 — project torch pin would DOWNGRADE the image's working build. Symptom: the image ships e.g. a
new-arch-capable torch (sm_120); the project pins torch<2.9; a naive pip install -r requirements.txt
replaces it with a wheel lacking the arch's kernels → no kernel image is available at first forward.
Root cause: the image torch/CUDA build is matched to the rented GPU arch; the project pin is stale for it.
Fix: filter framework pins out of the remote install:
grep -ivE '^(torch|torchvision|torchaudio)([^a-z0-9_.-]|$)' requirements.txt > /root/req_remote.txt && pip install -r /root/req_remote.txt (the trailing group anchors the package name; a bare ^torch prefix would also drop torchmetrics and similar packages). Keep the image build; smoke torch.cuda.get_device_capability() + a heavy import
before launch; disclose the off-band torch version with results.
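The AD9 filter can be wrapped as a function. This sketch anchors the package name so that packages that merely start with "torch" (torchmetrics, torch-geometric) are kept, and it treats "every line was filtered" as success. The function name is hypothetical.

```shell
# Copy <requirements-in> to <requirements-out> minus the torch,
# torchvision, and torchaudio pins.
filter_framework_pins() {
  local src="$1" dst="$2"
  # The trailing group requires a version specifier, bracket, space, or end
  # of line right after the name, so longer package names do not match.
  grep -ivE '^[[:space:]]*(torch|torchvision|torchaudio)([^a-z0-9_.-]|$)' "$src" > "$dst"
  # grep exits 1 when nothing is left to print; only status > 1 is an error.
  [ "$?" -le 1 ]
}

# Usage on the instance:
#   filter_framework_pins requirements.txt /root/req_remote.txt \
#     && pip install -r /root/req_remote.txt
#   /root/miniconda3/bin/python -c 'import torch; print(torch.cuda.get_device_capability())'
```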
8. SCRIPT OVERRIDES
The exact values to parameterize the scripts/ templates (scripts/run_one.sh.template,
scripts/run_queue.sh.template) for AutoDL:
DATA_DIR=/root/autodl-tmp # fast NVMe data disk — live checkpoints, logs, HF cache
DURABLE_DIR=/root/autodl-fs # region-locked shared FS — the only tier surviving 释放
PROXY_HOOK='source /etc/network_turbo 2>/dev/null || true' # MANDATORY before any external call (AD1)
CRED_FILE=/root/.wandb_key # per-instance ONLY — the FS security classifier blocks wandb keys
SCRATCH='latest.pth' # prune on success; keep best.pth (the keepable artifact)
HF_HOME=/root/autodl-tmp/huggingface_cache # redirect off the symlinked ~/.cache hog (AD5)
HF_ENDPOINT=https://hf-mirror.com # + HF_HUB_DISABLE_XET=1 (AD2)
DETACH=tmux # nohup fallback when tmux is absent (§6)
PY=/root/miniconda3/bin/python # base IS the env — explicit interpreter, never bare python3 (AD6)
TB_LOGDIR=/root/tf-logs # platform TB is pinned here (AD7)
Credential push (AD-specific). The FS security classifier blocks files matching wandb-key patterns —
put the key at the per-instance /root/.wandb_key, never on /root/autodl-fs. Stream exactly one
credential block via stdin so the secret never appears in a command; the wrapper reads it
into WANDB_API_KEY before launch. Secrets-via-stdin pattern → references/ssh_transport.md.
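A minimal sketch of the stdin hand-off, not the pattern from references/ssh_transport.md itself. The function names, the `autodl` host alias, and the local key file path are hypothetical; WANDB_API_KEY is the variable named in the text.

```shell
# Remote side: read one secret from stdin into <path>, owner-only.
store_secret_from_stdin() {
  local path="$1"
  ( umask 077 && cat > "$path" ) && chmod 600 "$path"
}

# Wrapper side: export the key for this launch only. Arg: [path]
load_wandb_key() {
  local path="${1:-/root/.wandb_key}"
  [ -r "$path" ] || return 1
  WANDB_API_KEY="$(cat "$path")"
  export WANDB_API_KEY
}

# Local side: the key travels on stdin, never in argv or shell history.
#   ssh autodl 'umask 077; cat > /root/.wandb_key' < ~/.wandb_key
```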
Checked-sync (the gated success line). run_one.sh writes live checkpoints to
$DATA_DIR/checkpoints/<name>, prunes latest.pth on success, then syncs best.pth to
$DURABLE_DIR/final_ckpts/<name> gating the success echo on the actual copy result — an unconditional
"synced" lies when the FS inode cap (AD4) silently fails the mkdir/cp (universal silent-sync gotcha).
Until a download is verified locally, the data disk copy is source-of-truth.
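The gated success line can be sketched as follows. This is an illustration of the idea, not the body of run_one.sh; the function name and the byte-size comparison are assumptions.

```shell
# Copy <src> into <dst-dir>; print the success line only if the copy
# landed with the same byte size as the source.
checked_sync() {
  local src="$1" dst_dir="$2" dst
  dst="$dst_dir/$(basename "$src")"
  if mkdir -p "$dst_dir" && cp -f "$src" "$dst" \
      && [ "$(wc -c < "$src")" -eq "$(wc -c < "$dst")" ]; then
    echo "synced: $dst"
  else
    echo "SYNC FAILED: $src -> $dst_dir (check df -h and df -i)" >&2
    return 1
  fi
}

# Usage:
#   checked_sync "$DATA_DIR/checkpoints/<name>/best.pth" "$DURABLE_DIR/final_ckpts/<name>"
```

Chaining mkdir, cp, and the size check in one condition means an inode-cap failure at any step lands in the failure branch instead of being masked by a later echo.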