# Profile: AutoDL

The deepest, battle-tested profile — a Chinese cgroup-isolated SSH-rental with a 3-tier storage model and
the *one* rental where the meter-stop action is non-destructive. Fills all 8 schema sections
(`profiles/_schema.md`) at full depth. Read this **before Phase 0**; it owns every path, proxy, billing verb,
and TB pin the SKILL.md phases delegate to. Universal gotchas are NOT restated here — see
`references/gotchas_universal.md`.

> **Surface to the user up front (principle #10):** conveniences most users miss — the console has a
> **one-click "设置SSH免密登录"** (registers your key so the agent connects non-interactively), **GPU-availability
> notifications** ("订阅GPU通知"), and built-in **AutoPanel / JupyterLab / TensorBoard** tiles. ⚠️ Danger clocks
> — **关机 (stop) auto-releases the box after 15 days → the data disk is deleted** (AD-DANGER, §5); only
> `/root/autodl-fs` survives a 释放; low balance / arrears force-stop. And the TB tile is **pinned to
> `/root/tf-logs`** — write your logger there (or symlink) or the panel shows empty (AD7 / U39).

To jump: `grep -in '' profiles/autodl.md` (e.g. `grep -in inode profiles/autodl.md`).

## Table of contents

1. LAUNCH — entry points + env contract (base miniconda IS the env)
2. STORAGE MODEL — 3 tiers + survival matrix + inode cap
3. NETWORK — academic proxy + China mirrors + pinned TB
4. SPOT / INTERRUPTION + RESUME — effectively on-demand
5. TEARDOWN / BILLING — 关机 stops the meter AND keeps the disk (the AutoDL exception)
6. DAEMON TOOL — tmux / nohup
7. TOP GOTCHAS — AD1..AD9, platform-pinned
8. SCRIPT OVERRIDES — values to parameterize `scripts/`

---

```yaml
---
platform: autodl
kind: ssh-rental
meter_stop_verb: 关机                  # shutdown/power-off STOPS billing AND keeps /root + disks
meter_stop_irreversible: false        # the AutoDL EXCEPTION — 关机 is reversible; only 释放/release deletes
detach_primitive: tmux                # nohup fallback when tmux is not installed (often absent on fresh image)
spot_available: false                 # on-demand only; no spot/bid/preemption model
spot_grace: n/a
shared_fs: true                       # /root/autodl-fs — region-locked, cross-instance within one region
inode_cap: ~200K                      # hard cap on the shared FS, independent of byte capacity
free_egress: true                     # no per-GB egress fee, but cross-GFW pulls need the academic proxy (see china_mirror_needed)
china_mirror_needed: true             # behind the GFW — hf-mirror / ModelScope + /etc/network_turbo
host_driver_cuda_max: image-dependent # the prebuilt image pins torch+CUDA; do not downgrade (AD9)
local_nvme: true                      # /root/autodl-tmp data disk is fast local NVMe, per-instance
---
```

---

## 1. LAUNCH

**First time? (rent → reach the box).** On the AutoDL console: pick a GPU + region with stock → **创建实例**
(choose the PyTorch image — the base env ships prebuilt) → register your key once via **设置SSH免密登录** (so
the agent connects non-interactively) → copy the instance's **SSH connection string** + password from the
console → test `ssh -p root@connect..seetacloud.com 'nvidia-smi'`. That string is your entry to every phase
below. (Console-only steps; AutoDL's UI shifts — re-check its docs if a label moved.)

**Entry points.** Web console (创建实例) for create/release/power; per-instance SSH connection string from the
console (`ssh -p root@connect..seetacloud.com`). No first-class platform CLI/REST for job control — SSH is
the orchestration channel.
Set a stable alias per instance in `~/.ssh/config` (`Host autodl--`, `HostName connect..seetacloud.com`,
`Port `) so every later command is short; the port is assigned at create-time and **changes on re-create**
(update the alias). SSH/keepalive config → `references/ssh_transport.md`.

**Env contract — the prebuilt base miniconda IS the env (AD6).** The image ships the full DL stack into
**base** (`/root/miniconda3/bin/python`); there is no `/root/miniconda3/envs//`. Base is the deliberate
single-tenant project env. **Never `conda create` a new env or clone base (`conda create --clone base`)** on
the rental: cloning duplicates ~16 GB of base packages and consumes the disk just freed, for zero benefit.
Train with the explicit interpreter `/root/miniconda3/bin/python`; in remote polls use that path or pure
shell, never bare `python3` (it may be absent → exit 127). When installing project deps, **filter framework
pins** so a `requirements.txt` does not downgrade the image's torch build (AD9).

> The "no DL in conda base" discipline applies to the *persistent local* machine only — on an ephemeral
> rental, base IS the expected place to run. A local env-guard hook must exempt remote-ssh + instance base.

---

## 2. STORAGE MODEL *(survival matrix — principle #4)*

Three tiers, each with a different speed / size / inode profile and a **different survival behavior**:

| Tier | Path | Speed | Size | Inode cap | Scope |
|---|---|---|---|---|---|
| System disk | `/` | medium | ~30 GB | none | per-instance |
| Data disk | `/root/autodl-tmp` | **fast NVMe** | per-plan (e.g. ~50 GB) | none | per-instance |
| Shared FS | `/root/autodl-fs` | NFS (slow, ~30 s/sync) | ~200 GB | **~200K (hard)** | **region-locked**, all instances in one region |

**Survival matrix** — the part most platforms get wrong, and where AutoDL is the **exception**:

| Tier | Survives 关机 (stop)? | Survives 释放 (release/destroy)? | Notes |
|---|---|---|---|
| `/` system | **yes** | no | AutoDL persists `/root` across power-off — UNLIKE RunPod/vast/K8s/Colab |
| `/root/autodl-tmp` data | **yes** | no | fast tier; checkpoints written here mid-run |
| `/root/autodl-fs` shared | **yes** | **yes** | the ONLY tier that survives release; region-locked |

**Where checkpoints MUST go for the §5 teardown verb:** write live checkpoints to the fast data disk
(`/root/autodl-tmp/checkpoints/`, never the 30 GB system disk), then **checked-sync `best.pth` to
`/root/autodl-fs`** — the only tier that survives a 释放. If you only ever use 关机, the data disk also
survives, but syncing the durable copy to FS is the safe default (a later release loses the data disk).

**Region/DC-lock (AD3).** FS quota is region-scoped; each region has its own physical mount. Files written
from an instance in one region are invisible to an instance in another region, even at the identical
`/root/autodl-fs/` path. Create the FS quota in the **same region** as the instances; to bridge regions, pick
one region as primary and scp between them (slow). Confirm sharing with a write-from-one / read-from-another
probe before relying on it.

**Inode discipline (AD4).** The ~200K cap is **independent of bytes**: `df -h` can read 34% while `cp` fails
"No space left" because `df -i` is at 100%. The inode bomb is **per-sample eval visualization**
(`files_per_sample × N_samples × N_conditions` → tens of thousands of tiny files); checkpoints (few large
files) are inode-cheap. Monitor `df -i`, not just `df -h` (Phase 0 + every space check). Eval-artifact sizing
policy is owned by **REQUIRED:** verifying-dl-experiments.

**Data-disk hog (AD5).** When `/root/autodl-tmp` hits 100% but `runs/` looks small, the real hog is the
**HF cache symlinked onto the data disk** (`~/.cache/huggingface` → tens of GB of model blobs). Audit
`du -sh ~/.cache/huggingface/hub/models--* | sort -rh` before deleting checkpoints; redirect `HF_HOME` to the
data disk explicitly (see §8).
Disk is expandable — prefer expanding it over silently shrinking the experiment (principle #9). Get explicit
user confirmation that names the `rm -rf` targets (the harness classifier blocks agent-inferred irreversible
deletes).

---

## 3. NETWORK

**Egress proxy — `source /etc/network_turbo` is MANDATORY (AD1).** Instances start with no proxy; direct
egress to `api.wandb.ai` / `huggingface.co` / `github.com` / `pypi.org` is unreliable (0.5 s … 300 s …
blocked). Every shell that calls wandb / HF / pip / git must `source /etc/network_turbo` first
(`source /etc/network_turbo 2>/dev/null || true` at the top of every wrapper). It exports `http_proxy` /
`https_proxy` pointing at the in-DC academic proxy (`http://:`), a `no_proxy` allow-list for domestic
endpoints, and the CA bundle. Perf delta: wandb push ~0.8 s with turbo vs >120 s timeout without. No
exceptions: even a small `wandb.summary` write can wedge for minutes.

**China mirrors (AD2).** HF behind the GFW → `HF_ENDPOINT=https://hf-mirror.com` or pull from **ModelScope**.
Two compounding traps: (a) HF's **Xet CAS backend** is NOT mirror-proxied (the mirror covers the API but big
`.safetensors` shards still hit the flaky international endpoint) → `export HF_HUB_DISABLE_XET=1` (or
`pip uninstall -y hf_xet`) to force the classic LFS path the mirror does proxy; (b) `no_proxy` in
network_turbo lists `modelscope.com` but **not** `modelscope.cn` — routing a DOMESTIC source through the
international-acceleration proxy SLOWS it. Wrap every download in a `timeout … && break` retry loop (resumes
partial files; a stall ≠ permanent failure). Full mirror table + `no_proxy` ladder →
`references/china-network.md`.

**Port exposure.** AutoDL maps a single custom port (6006) for user services; the platform also exposes
JupyterLab. The SSH port is assigned per instance and changes on re-create.
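
The proxy and retry rules in this section can be sketched as one small wrapper. This is a minimal sketch, not
part of the AutoDL image or of `scripts/`: the function name `retry_with_timeout`, its argument order, and the
`RETRY_SLEEP` variable are hypothetical, and it assumes GNU coreutils `timeout` is on the box.

```shell
# Sketch only: retry_with_timeout and RETRY_SLEEP are hypothetical names.

# Load the academic proxy when the file exists (AD1); a no-op on any other machine.
if [ -f /etc/network_turbo ]; then
  . /etc/network_turbo
fi

# HF mirror + classic LFS path instead of Xet (AD2).
export HF_ENDPOINT="https://hf-mirror.com"
export HF_HUB_DISABLE_XET=1

# retry_with_timeout <attempts> <seconds> <command...>
# Runs an external command under `timeout` and stops at the first success.
# A stalled attempt is killed after <seconds> and simply retried.
# Note: `timeout` can only run executables, not shell functions.
retry_with_timeout() {
  local attempts="$1" limit="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if timeout "$limit" "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed or stalled: $*" >&2
    sleep "${RETRY_SLEEP:-5}"
    i=$((i + 1))
  done
  return 1
}
```

A call might look like `retry_with_timeout 5 600 wget -c "$URL"`, where `$URL` is whatever you are fetching;
the retry only resumes partial files if the download tool itself supports resuming (as `wget -c` does).
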

**Platform TensorBoard is pinned to `/root/tf-logs` (AD7).** The image autostarts
`tensorboard --logdir /root/tf-logs --port 6007` on boot and the AutoPanel TB tile proxies straight to that
process — the `--logdir` is hard-pinned and cannot be reconfigured from inside the container. Events written
anywhere else are invisible in the web tile no matter how correct the `SummaryWriter` setup. Fix: write to
`SummaryWriter(log_dir="/root/tf-logs/")`, or `ln -sfn /root/tf-logs/` (the pinned TB has `--reload=5`, so
the run appears within ~5 s — no restart). Verify with `curl -s http://127.0.0.1:6007/data/runs` (expect a
JSON array with the run), NOT `ss` (can show nothing inside the container while curl returns 200). Local logs
die with the instance — for durable curves use a hosted tracker (**REQUIRED:**
huggingface-skills:huggingface-trackio).

**SSH flavor.** Direct-TCP SSH on the per-instance host:port — `scp`/`rsync` work normally (no proxied-SSH
restriction). Use a per-dir resumable loop for large transfers (single-connection `scp -r` resets
mid-transfer); `rsync -avz --partial` is preferred. Transport patterns → `references/ssh_transport.md`.

---

## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*

**No spot/bid/preemption model — AutoDL is on-demand.** There is no mid-run eviction, no SIGTERM grace
window to handle (`spot_grace: n/a`). The real loss vectors are: (a) **forgot to release/关机** → idle billing
(principle #1); (b) an instance **reboot** that ends a non-detached process (a vanished process is not always
OOM — enumerate reboot / OOM / SSH-HUP / manual-kill before concluding, see
`references/gotchas_universal.md`); (c) availability — the GPU plan being sold out at create-time (build
retry-until-available, not survive-an-eviction).

**Resume hook.** The universal spine still applies (principle #8): checkpoint atomically to the data disk +
sync `best.pth` to FS, and resume-from-latest unconditionally on relaunch.
The detach primitive (§6) makes the *identical launch command* survive an SSH drop; checkpoint+resume makes
it survive a reboot. Cadence formula → `references/spot-resilience.md` (the formula generalizes even without
spot — it bounds re-compute lost to a reboot).

---

## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*

**关机 (shutdown / power-off) STOPS the meter AND keeps `/root` + both disks — this is the AutoDL EXCEPTION
among rentals.** Everywhere else (RunPod wipes the container disk on stop, vast bills the disk forever, K8s
wipes the pod FS, Colab loses `/content`) a "stop" is lossy or still-billing. On AutoDL, 关机 is the **safe
park**: meter off, all three tiers intact, restart later. There is also a **no-GPU mode (无卡模式)** for a
cheap restart to copy files or fix the env without paying for the GPU.

| Action | Stops meter? | Keeps `/` + data disk? | Keeps FS? | Reversible? |
|---|---|---|---|---|
| 关机 (shutdown) | **yes** | **yes** | yes | **yes** — restart anytime (the AutoDL exception) |
| 无卡模式 (no-GPU) | mostly (cheap) | yes | yes | yes |
| 释放 (release/destroy) | yes | **NO** | yes | **NO — deletes `/` + data disk irreversibly** |

**Cost trap.** 关机 still bills the data-disk *storage* at a small rate while the GPU meter is off — far
cheaper than running, but not free. Only 释放 fully ends storage billing, at the cost of the data disk.

**⚠️ Auto-release clock (AD-DANGER):** a 关机 (stopped) instance is **auto-released after 15 days** (the
console shows "关机 15 天后释放") → that release deletes `/` **and the data disk**, so 关机 is safe parking only
*within* the window; for a longer pause, sync `best` to `/root/autodl-fs` (survives 释放) or expect to
re-download. Low balance / arrears also force-stop the instance. **Surface this to the user up front
(principle #10)** — most users assume 关机 parks the box indefinitely.
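
This profile asks for a "checked" sync of `best.pth` to `/root/autodl-fs` without spelling out the check. One
way to sketch it, assuming GNU coreutils (`sha256sum`) and a hypothetical helper named `checked_sync`, is to
copy to a temporary name, compare hashes, and only then rename into place. The real sync logic in `scripts/`
may differ.

```shell
# Sketch only: checked_sync is a hypothetical helper, not one of this repo's scripts.
# checked_sync <src-file> <dest-dir>
checked_sync() {
  local src="$1" dest_dir="$2"
  local name tmp want got
  name="$(basename "$src")"
  tmp="$dest_dir/.$name.partial"

  mkdir -p "$dest_dir" || return 1

  # Copy under a temporary name so a half-written file never looks like a checkpoint.
  cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }

  # Compare content hashes of the source and the copy before publishing the copy.
  want="$(sha256sum "$src" | cut -d' ' -f1)"
  got="$(sha256sum "$tmp" | cut -d' ' -f1)"
  if [ "$want" != "$got" ]; then
    rm -f "$tmp"
    echo "checksum mismatch for $src" >&2
    return 1
  fi

  # Rename within the same directory to put the verified copy in place.
  mv -f "$tmp" "$dest_dir/$name"
}
```

For example, `checked_sync /root/autodl-tmp/checkpoints/best.pth /root/autodl-fs/checkpoints` (the destination
directory name is illustrative). Each synced file costs one inode on the shared FS, so syncing a single
`best.pth` rather than every checkpoint keeps well clear of the inode cap.
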

**Teardown Iron Law (SKILL.md Phase 5):** no 释放 / file-delete until `best.pth` is **pulled to local AND
verified by load** (`scripts/verify_local.py`) AND the user explicitly approves — "it looked done in the log"
is not evidence (principle #3). Because 关机 is non-destructive here, the cheap safe move when unsure is to
**关机 and ask**, never 释放 on a guess. **REQUIRED:** superpowers:verification-before-completion is the
general form of this gate.

---

## 6. DAEMON TOOL

**tmux** is the detach primitive when present, but **tmux is often NOT installed on a fresh AutoDL image**
and `apt-get install tmux` fails when egress is down. Zero-dependency fallback:
`nohup bash run_queue.sh queue.txt master.log 2>&1 &` — survives an SSH drop (SIGHUP), needs no package.
Verify either with `pgrep -af