# Profile: AutoDL

The deepest, battle-tested profile: a Chinese cgroup-isolated SSH-rental with a 3-tier storage model
and the *one* rental where the meter-stop action is non-destructive. Fills all 8 schema sections
(`profiles/_schema.md`) at full depth. Read this **before Phase 0**; it owns every path, proxy, billing
verb, and TB pin that the SKILL.md phases delegate to. Universal gotchas are NOT restated here; see
`references/gotchas_universal.md`.

> **Surface to the user up front (principle #10):** the console has conveniences most users miss: a
> **one-click "设置SSH免密登录"** (set up passwordless SSH login; registers your key so the agent connects
> non-interactively), **GPU-availability notifications** ("订阅GPU通知", subscribe to GPU notifications), and
> built-in **AutoPanel / JupyterLab / TensorBoard** tiles. ⚠️ Danger clocks:
> a box left in **关机 (stop) is auto-released after 15 days → the data disk is deleted** (AD-DANGER, §5); only
> `/root/autodl-fs` survives a 释放 (release); low balance / arrears force-stop. And the TB tile is **pinned to
> `/root/tf-logs`**: write your logger there (or symlink) or the panel shows empty (AD7 / U39).

To jump: `grep -in '<keyword>' profiles/autodl.md` (e.g. `grep -in inode profiles/autodl.md`).
## Table of contents

1. LAUNCH: entry points + env contract (base miniconda IS the env)
2. STORAGE MODEL: 3 tiers + survival matrix + inode cap
3. NETWORK: academic proxy + China mirrors + pinned TB
4. SPOT / INTERRUPTION + RESUME: effectively on-demand
5. TEARDOWN / BILLING: 关机 (shutdown) stops the meter AND keeps the disk (the AutoDL exception)
6. DAEMON TOOL: tmux / nohup
7. TOP GOTCHAS: AD1..AD9, platform-pinned
8. SCRIPT OVERRIDES: values to parameterize `scripts/`

---
```yaml
---
platform: autodl
kind: ssh-rental
meter_stop_verb: 关机 # shutdown/power-off STOPS billing AND keeps /root + disks
meter_stop_irreversible: false # the AutoDL EXCEPTION: 关机 is reversible; only 释放/release deletes
detach_primitive: tmux # nohup fallback when tmux is not installed (often absent on fresh image)
spot_available: false # on-demand only; no spot/bid/preemption model
spot_grace: n/a
shared_fs: true # /root/autodl-fs: region-locked, cross-instance within one region
inode_cap: ~200K # hard cap on the shared FS, independent of byte capacity
free_egress: true # no per-GB egress fee, but cross-GFW pulls need the academic proxy (see china_mirror_needed)
china_mirror_needed: true # behind the GFW: hf-mirror / ModelScope + /etc/network_turbo
host_driver_cuda_max: image-dependent # the prebuilt image pins torch+CUDA; do not downgrade (AD9)
local_nvme: true # /root/autodl-tmp data disk is fast local NVMe, per-instance
---
```

---
## 1. LAUNCH

**First time? (rent → reach the box).** On the AutoDL console: pick a GPU + region with stock → **创建实例**
(create instance; choose the PyTorch image, whose base env ships prebuilt) → register your key once via
**设置SSH免密登录** (so the agent connects non-interactively) → copy the instance's **SSH connection string** +
password from the console → test `ssh -p <PORT> root@connect.<region>.seetacloud.com 'nvidia-smi'`. That string
is your entry to every phase below. (Console-only steps; AutoDL's UI shifts, so re-check its docs if a label
has moved.)

**Entry points.** Web console (创建实例) for create/release/power; per-instance SSH connection string from
the console (`ssh -p <PORT> root@connect.<region>.seetacloud.com`). There is no first-class platform CLI/REST
for job control, so SSH is the orchestration channel. Set a stable alias per instance in `~/.ssh/config`
(`Host autodl-<proj>-<N>`, `HostName connect.<region>.seetacloud.com`, `Port <PORT>`) so every later
command is short; the port is assigned at create-time and **changes on re-create** (update the alias).
SSH/keepalive config → `references/ssh_transport.md`.

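The alias convention above can be sketched as a small helper that prints the stanza. This is a minimal illustration, not part of the bundled scripts: the function name `autodl_ssh_stanza` is hypothetical, and the four arguments are whatever the console shows for your instance.

```sh
# Print an ssh_config stanza for one AutoDL instance.
# Args: <proj> <n> <region> <port>; all values come from the console.
# The alias pattern autodl-<proj>-<N> follows the convention described above.
autodl_ssh_stanza() {
    printf 'Host autodl-%s-%s\n' "$1" "$2"
    printf '    HostName connect.%s.seetacloud.com\n' "$3"
    printf '    Port %s\n' "$4"
    printf '    User root\n'
}
```

Usage: `autodl_ssh_stanza myproj 1 <region> <PORT> >> ~/.ssh/config`. After a re-create, edit the `Port` line of the existing stanza rather than appending a second one, since ssh uses the first matching `Host` block.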

**Env contract: the prebuilt base miniconda IS the env (AD6).** The image ships the full DL stack into
**base** (`/root/miniconda3/bin/python`); there is no `/root/miniconda3/envs/<name>/`. Base is the
deliberate single-tenant project env. **Never `conda create` a new env or `conda create --clone base`** on
the rental: a clone duplicates ~16 GB of base packages and eats the disk you just freed, for zero benefit.
Train with the explicit interpreter `/root/miniconda3/bin/python`; in remote polls use that path or pure
shell, never bare `python3` (it may be absent → exit 127). When installing project deps, **filter framework
pins** so a `requirements.txt` does not downgrade the image's torch build (AD9).

> The "no DL in conda base" discipline applies to the *persistent local* machine only; on an ephemeral
> rental, base IS the expected place to run. A local env-guard hook must exempt remote-ssh + instance base.

---
## 2. STORAGE MODEL *(survival matrix, principle #4)*

Three tiers, each with a different speed / size / inode profile and a **different survival behavior**:

| Tier | Path | Speed | Size | Inode cap | Scope |
|---|---|---|---|---|---|
| System disk | `/` | medium | ~30 GB | none | per-instance |
| Data disk | `/root/autodl-tmp` | **fast NVMe** | per-plan (e.g. ~50 GB) | none | per-instance |
| Shared FS | `/root/autodl-fs` | NFS (slow, ~30 s/sync) | ~200 GB | **~200K (hard)** | **region-locked**, all instances in one region |

**Survival matrix**: the part most platforms get wrong, and where AutoDL is the **exception**:

| Tier | Survives 关机 (stop)? | Survives 释放 (release/destroy)? | Notes |
|---|---|---|---|
| `/` system | **yes** | no | AutoDL persists `/root` across power-off, UNLIKE RunPod/vast/K8s/Colab |
| `/root/autodl-tmp` data | **yes** | no | fast tier; checkpoints written here mid-run |
| `/root/autodl-fs` shared | **yes** | **yes** | the ONLY tier that survives release; region-locked |

**Where checkpoints MUST go for the §5 teardown verb:** write live checkpoints to the fast data disk
(`/root/autodl-tmp/checkpoints/<name>`, never the 30 GB system disk), then **checked-sync `best.pth`
to `/root/autodl-fs`**, the only tier that survives a 释放. If you only ever use 关机, the data disk also
survives, but syncing the durable copy to FS is the safe default (a later release loses the data disk).

**Region/DC-lock (AD3).** FS quota is region-scoped; each region has its own physical mount. Files written
from a `<region-a>` instance are invisible to a `<region-b>` instance even at the identical
`/root/autodl-fs/` path. Create the FS quota in the **same region** as the instances; to bridge regions,
pick one region as primary and scp between them (slow). Confirm sharing with a write-from-one /
read-from-another probe before relying on it.

**Inode discipline (AD4).** The ~200K cap is **independent of bytes**: `df -h` can read 34% while `cp`
fails with "No space left" because `df -i` is at 100%. The inode bomb is **per-sample eval visualization**
(`files_per_sample × N_samples × N_conditions` → tens of thousands of tiny files); checkpoints (few large
files) are inode-cheap. Monitor `df -i`, not just `df -h` (Phase 0 + every space check). Eval-artifact
sizing policy is owned by **REQUIRED:** verifying-dl-experiments.

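A space check that looks at inodes as well as bytes might be sketched like this. The helper names and the `INODE_LIMIT_PCT` knob are hypothetical; the parser assumes the POSIX-format column layout that `df -Pi` prints on GNU/Linux (`IUse%` in the fifth column).

```sh
# Read `df -Pi <path>` output on stdin and print the inode use percentage
# (the IUse% column without the % sign). Prints nothing if it is not numeric.
inode_use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); if ($5 ~ /^[0-9]+$/) print $5 }'
}

# Return 0 when inode usage is below the limit, 1 when at or above it,
# 2 when the value could not be read.
# INODE_LIMIT_PCT is a hypothetical knob for this sketch (default 90).
inode_ok() {
    pct=$(df -Pi "$1" 2>/dev/null | inode_use_pct)
    [ -n "$pct" ] || return 2
    [ "$pct" -lt "${INODE_LIMIT_PCT:-90}" ]
}
```

Usage: `inode_ok /root/autodl-fs || echo "inode budget low on shared FS" >&2` before any sync that creates many files.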

**Data-disk hog (AD5).** When `/root/autodl-tmp` hits 100% but `runs/` looks small, the real hog is the
**HF cache symlinked onto the data disk** (`~/.cache/huggingface` → tens of GB of model blobs). Audit
`du -sh ~/.cache/huggingface/hub/models--* | sort -rh` before deleting checkpoints; redirect `HF_HOME` to
the data disk explicitly (see §8). The disk is expandable: prefer expanding it over silently shrinking the
experiment (principle #9). Get explicit user confirmation naming the `rm -rf` targets (the harness
classifier blocks agent-inferred irreversible deletes).

---
## 3. NETWORK

**Egress proxy: `source /etc/network_turbo` is MANDATORY (AD1).** Instances start with no proxy; direct
egress to `api.wandb.ai` / `huggingface.co` / `github.com` / `pypi.org` is unreliable (latency anywhere from
0.5 s to 300 s, or blocked outright). Every shell that calls wandb / HF / pip / git must
`source /etc/network_turbo` first (put `source /etc/network_turbo 2>/dev/null || true` at the top of every
wrapper). It exports `http_proxy` / `https_proxy` pointing at the in-DC academic proxy
(`http://<proxy-ip>:<port>`), a `no_proxy` allow-list for domestic endpoints, and the CA bundle. Perf delta:
a wandb push takes ~0.8 s with turbo vs a >120 s timeout without. No exceptions: even a small
`wandb.summary` write can wedge for minutes.

**China mirrors (AD2).** HF behind the GFW → `HF_ENDPOINT=https://hf-mirror.com` or pull from
**ModelScope**. Two compounding traps: (a) HF's **Xet CAS backend** is NOT mirror-proxied (the mirror
covers the API but big `.safetensors` shards still hit the flaky international endpoint) →
`export HF_HUB_DISABLE_XET=1` (or `pip uninstall -y hf_xet`) to force the classic LFS path the mirror does
proxy; (b) `no_proxy` in network_turbo lists `modelscope.com` but **not** `modelscope.cn`, and routing a
DOMESTIC source through the international-acceleration proxy SLOWS it. Wrap every download in a
`timeout <s> … && break` retry loop (resumes partial files; a stall ≠ permanent failure). Full mirror
table + `no_proxy` ladder → `references/china-network.md`.

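The proxy hook and the `timeout … && break` loop could be sketched as two small functions for the top of a wrapper. The names `load_proxy` and `retry_timeout` and the `RETRY_SLEEP` knob are hypothetical; the sketch assumes coreutils `timeout` is available and that the wrapped command is an external program (`timeout` cannot run shell functions).

```sh
# Source the AutoDL proxy file if it is readable; never fail the wrapper if not.
# Checking readability first keeps this safe under plain POSIX sh, where
# sourcing a missing file can abort the script.
load_proxy() {
    proxy_file=${1:-/etc/network_turbo}
    if [ -r "$proxy_file" ]; then
        . "$proxy_file"
    fi
    return 0
}

# Run a command under `timeout`, retrying until it succeeds or attempts run out.
# Args: <attempts> <seconds-per-attempt> <command...>
retry_timeout() {
    attempts=$1
    secs=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if timeout "$secs" "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed or stalled; retrying" >&2
        i=$((i + 1))
        sleep "${RETRY_SLEEP:-5}"
    done
    return 1
}
```

Usage: call `load_proxy` first, export `HF_ENDPOINT` and `HF_HUB_DISABLE_XET=1`, then run `retry_timeout 5 600 <download-command>`. Resuming partial files is the job of the download tool itself; the loop only restarts it after a stall.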

**Port exposure.** AutoDL maps a single custom port (6006) for user services; the platform also exposes
JupyterLab. The SSH port is the per-instance `<PORT>` and changes on re-create.

**Platform TensorBoard is pinned to `/root/tf-logs` (AD7).** The image autostarts
`tensorboard --logdir /root/tf-logs --port 6007` on boot and the AutoPanel TB tile proxies straight to that
process; the `--logdir` is hard-pinned and cannot be reconfigured from inside the container. Events written
anywhere else are invisible in the web tile no matter how correct the `SummaryWriter` setup. Fix: write to
`SummaryWriter(log_dir="/root/tf-logs/<run>")`, or `ln -sfn <your-tb> /root/tf-logs/<run>` (the pinned TB
has `--reload=5`, so the run appears within ~5 s with no restart). Verify with
`curl -s http://127.0.0.1:6007/data/runs` (expect a JSON array with the run), NOT `ss` (which can show
nothing inside the container while curl returns 200). Local logs die with the instance; for durable curves
use a hosted tracker (**REQUIRED:** huggingface-skills:huggingface-trackio).

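The symlink fix can be wrapped so a run that already logs elsewhere shows up in the pinned tile. A minimal sketch: `link_tb_run` is a hypothetical helper name, and the optional third argument exists only so the function can be tried outside `/root`.

```sh
# Symlink an existing TensorBoard event directory into the pinned logdir.
# Args: <event-dir (absolute path)> <run-name> [tb-root]
# tb-root defaults to /root/tf-logs, the directory the platform TB watches.
link_tb_run() {
    src=$1
    run=$2
    tb_root=${3:-/root/tf-logs}
    [ -d "$src" ] || { echo "no such event dir: $src" >&2; return 1; }
    mkdir -p "$tb_root" || return 1
    # -n replaces an existing symlink instead of descending into it
    ln -sfn "$src" "$tb_root/$run"
}
```

Usage: `link_tb_run /root/autodl-tmp/runs/<run>/tb <run>`, then confirm with `curl -s http://127.0.0.1:6007/data/runs`.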

**SSH flavor.** Direct-TCP SSH on the per-instance host:port, so `scp`/`rsync` work normally (no
proxied-SSH restriction). Use a per-dir resumable loop for large transfers (a single-connection `scp -r`
can reset mid-transfer); `rsync -avz --partial` is preferred. Transport patterns →
`references/ssh_transport.md`.

---
## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*

**No spot/bid/preemption model: AutoDL is on-demand.** There is no mid-run eviction and no SIGTERM grace
window to handle (`spot_grace: n/a`). The real loss vectors are: (a) **forgot to release/关机** → idle
billing (principle #1); (b) an instance **reboot** that kills the running process, detached or not (a
vanished process is not always OOM: enumerate reboot / OOM / SSH-HUP / manual-kill before concluding, see
`references/gotchas_universal.md`); (c) availability: the GPU plan may be sold out at create-time (build
retry-until-available, not survive-an-eviction).

**Resume hook.** The universal spine still applies (principle #8): checkpoint atomically to the data disk +
sync `best.pth` to FS, and resume-from-latest unconditionally on relaunch. The detach primitive (§6) makes
the *identical launch command* survive an SSH drop; checkpoint+resume makes it survive a reboot. Cadence
formula → `references/spot-resilience.md` (the formula generalizes even without spot: it bounds the
re-compute lost to a reboot).


---

## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*

**关机 (shutdown / power-off) STOPS the meter AND keeps `/root` + both disks: this is the AutoDL
EXCEPTION among rentals.** Everywhere else (RunPod wipes the container disk on stop, vast bills the disk
forever, K8s wipes the pod FS, Colab loses `/content`) a "stop" is lossy or still-billing. On AutoDL,
关机 is the **safe park**: meter off, all three tiers intact, restart later. There is also a **no-GPU mode
(无卡模式)** for a cheap restart to copy files or fix the env without paying for the GPU.

| Action | Stops meter? | Keeps `/` + data disk? | Keeps FS? | Reversible? |
|---|---|---|---|---|
| 关机 (shutdown) | **yes** | **yes** | yes | **yes**: restart anytime within the 15-day auto-release window (the AutoDL exception) |
| 无卡模式 (no-GPU) | mostly (cheap) | yes | yes | yes |
| 释放 (release/destroy) | yes | **NO** | yes | **NO: deletes `/` + data disk irreversibly** |

**Cost trap.** 关机 still bills the data-disk *storage* at a small rate while the GPU meter is off: far
cheaper than running, but not free. Only 释放 fully ends storage billing, at the cost of the data disk.

**⚠️ Auto-release clock (AD-DANGER):** a 关机 (stopped) instance is **auto-released after 15 days** (the
console shows "关机 15 天后释放", released 15 days after shutdown) → that release deletes `/` **and the data
disk**, so 关机 is safe parking only *within* the window; for a longer pause, sync `best` to
`/root/autodl-fs` (survives 释放) or expect to re-download. Low balance / arrears also force-stop the
instance. **Surface this to the user up front (principle #10)**: most users assume 关机 parks the box
indefinitely.

**Teardown Iron Law (SKILL.md Phase 5):** no 释放 / file-delete until `best.pth` is **pulled to local AND
verified by load** (`scripts/verify_local.py`) AND the user explicitly approves. "It looked done in the
log" is not evidence (principle #3). Because 关机 is non-destructive here, the cheap safe move when unsure
is to **关机 and ask**, never 释放 on a guess. **REQUIRED:** superpowers:verification-before-completion is
the general form of this gate.

---
## 6. DAEMON TOOL

**tmux** is the detach primitive when present, but **tmux is often NOT installed on a fresh AutoDL image**
and `apt-get install tmux` fails when egress is down. Zero-dependency fallback:
`nohup bash run_queue.sh queue.txt </dev/null >master.log 2>&1 &`, which survives an SSH drop (SIGHUP) and
needs no package. Verify either with `pgrep -af <script>`. The detach survives an SSH drop; it does **not**
survive a 关机/reboot, which is what checkpoint+resume (§4) is for.

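The tmux-or-nohup choice could be sketched as one launcher so the same call works on any image. The function names and the `DETACH` override are hypothetical; the tmux branch assumes the command's arguments need no extra quoting, because `tmux new-session` takes a single shell-command string.

```sh
# Print which detach tool to use: tmux when installed, otherwise nohup.
pick_detach() {
    if command -v tmux >/dev/null 2>&1; then echo tmux; else echo nohup; fi
}

# Launch a command detached from the SSH session, logging to a file.
# Args: <session-name> <log-file> <command...>
# DETACH (hypothetical knob) overrides the auto-detected tool.
launch_detached() {
    name=$1
    log=$2
    shift 2
    case "${DETACH:-$(pick_detach)}" in
        tmux)
            # the session name is only used on this branch
            tmux new-session -d -s "$name" "$* >\"$log\" 2>&1"
            ;;
        *)
            nohup "$@" </dev/null >"$log" 2>&1 &
            ;;
    esac
}
```

Usage: `launch_detached queue master.log bash run_queue.sh queue.txt`, then check with `pgrep -af run_queue.sh`.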

**Native queue: none.** AutoDL has no built-in scheduler → use the bundled `scripts/run_queue.sh.template`
(resumable queue iterator, `start_index` for resume) driving `scripts/run_one.sh.template` per cell.
**Never overwrite a script that a running bash is executing** (bash reads by byte offset → it re-executes
blocks; version the filename instead). This is universal physics, see `references/gotchas_universal.md`.

**Monitoring.** A session-bound watcher dies with the session; for multi-hour runs deploy the four-layer
durable architecture (`references/monitoring_patterns.md`). Detect "done" by a **log marker**
(`grep -q 'QUEUE DONE' master.log`), never by `pgrep` (the waiter's own cmdline matches the pattern, so it
loops forever). A cloud scheduler cannot reach the rented box (putting the SSH key in a cloud sandbox would
be a secret leak); the honest recurring check is the remote self-monitor + a session loop with the local key.

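Marker-based completion detection can be sketched as a bounded polling loop. `wait_for_marker` is a hypothetical helper; the marker text `QUEUE DONE` is the one used above, and the default 30 s poll interval is an arbitrary choice for illustration.

```sh
# Poll a log file until a marker line appears or the deadline passes.
# Args: <log-file> <marker> <max-seconds> [poll-seconds]
# Returns 0 when the marker is found, 1 on timeout.
wait_for_marker() {
    log=$1
    marker=$2
    max=$3
    poll=${4:-30}
    waited=0
    while [ "$waited" -le "$max" ]; do
        # -F: treat the marker as a fixed string, not a regex
        if [ -f "$log" ] && grep -qF -- "$marker" "$log"; then
            return 0
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done
    return 1
}

# Usage: wait_for_marker master.log 'QUEUE DONE' 3600 60 && echo "queue finished"
```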

---

## 7. TOP GOTCHAS (AutoDL-pinned; universal ones → `references/gotchas_universal.md`)

**AD1: external network call hangs / wandb shows 0 runs.** *Symptom:* `wandb.init` times out at
90/120/180 s, the dashboard reads 0 runs while `wandb/run-*` exist locally; HF downloads stall; pip/git
are glacial. *Root cause:* instances start with **no proxy**; direct egress to wandb/HF/PyPI/GitHub is
unreliable or blocked, and wandb-core's retry logic under a flaky link can roll back already-uploaded runs.
*Fix:* `source /etc/network_turbo` at the top of **every** shell/wrapper before any external call; recover
an empty cloud project with `for d in wandb/run-*; do timeout 120 wandb sync "$d"; done`.

**AD2: HF download stalls even with hf-mirror + turbo.** *Symptom:* `from_pretrained` /
`snapshot_download` hangs or raises `ConnectTimeout` on big `.safetensors` shards. *Root cause:* (a) HF's
Xet CAS backend is not mirror-proxied; (b) `no_proxy` lists `modelscope.com` but not `modelscope.cn`
(domestic source forced through the international proxy = slower); (c) a curl test run without turbo
measures the wrong path. *Fix:* `export HF_HUB_DISABLE_XET=1` (or `pip uninstall -y hf_xet`) with
`HF_ENDPOINT=https://hf-mirror.com`, or pull from ModelScope to a plain dir + load via local-path override;
wrap in a `timeout … && break` resume loop. Detail → `references/china-network.md`.

**AD3: cross-region instances cannot share FS.** *Symptom:* two instances in different regions see
identical `/root/autodl-fs/` paths but files written from one are invisible to the other. *Root cause:* FS
quota is region-scoped; each region has its own physical mount. *Fix:* create the FS quota in the same
region as the instances; bridge regions via scp from a chosen primary; verify with a write-one / read-other
probe.

**AD4: FS write fails with "No space left" while `df -h` looks fine.** *Symptom:* `cp`/`mkdir` to
`/root/autodl-fs` fails though `df -h` shows ~34%; `df -i` shows `… 0 100%`. *Root cause:* the shared FS
enforces a **hard ~200K inode cap independent of bytes**; per-sample eval visualization (many tiny files)
exhausts it. *Fix:* monitor `df -i`; cap per-sample eval vis on large test sets (sizing →
verifying-dl-experiments); once a results dir is verified locally, prune its per-sample image subdir from
FS; recover by `find /root/autodl-fs -type d -name '<vis-dir>' -prune -exec rm -rf {} +` to free inodes
fast (`-prune` stops `find` from descending into directories it is about to delete).

**AD5: data disk full; HF cache is the hidden hog; agent `rm` auto-denied.** *Symptom:*
`/root/autodl-tmp` at 100% though `runs/` looks small; an agent `rm -rf` of "obvious junk" is auto-denied.
*Root cause:* `~/.cache/huggingface` is symlinked onto the data disk, so the **HF model cache** (tens of
GB) is the real hog; the harness blocks irreversible `rm -rf` whose targets the agent inferred. *Fix:*
audit `du -sh ~/.cache/huggingface/hub/models--* | sort -rh`; set `HF_HOME` to a chosen data-disk dir + keep
the metric/eval JSONs (tiny evidence); present exact deletion targets + sizes for explicit user
confirmation; offer "clean vs expand the disk".

**AD6: base IS the env; a "never use base" rule blocks every remote command.** *Symptom:* a local "don't
run DL in conda base" guard fires on `ssh autodl 'python train.py'`, but `conda env list` shows nothing and
`/root/miniconda3/envs/` is empty; poll scripts calling `python3` exit 127. *Root cause:* the image installs
the whole DL stack into **base**, so base IS the single-tenant project env (no `/envs/`), and the image
often ships only `python` (no `python3`). *Fix:* train with `/root/miniconda3/bin/python`; exempt
remote-ssh + instance base from the local guard (never `conda create --clone base`); in remote scripts use
the explicit interpreter or pure shell, never bare `python3`.

**AD7: platform TensorBoard pinned to `/root/tf-logs`; events elsewhere invisible.** *Symptom:* the
events file is non-empty and `curl http://127.0.0.1:6007/` returns 200, but the AutoPanel TB tile shows
zero runs; `/data/runs` returns `[]`. *Root cause:* the image autostarts `tensorboard --logdir
/root/tf-logs` and the tile proxies that process; `--logdir` is hard-pinned and not reconfigurable
in-container. *Fix:* write `SummaryWriter(log_dir="/root/tf-logs/<run>")`, or
`ln -sfn <your-tb> /root/tf-logs/<run>` (the pinned TB's `--reload=5` picks it up in ~5 s); verify with
`curl … /data/runs`, not `ss`. (Also: restart the TB server to evict STALE cached tags after
deleting/renaming runs.) The cross-platform "live panel silently empty" class (path/port/process mismatch
on any platform) is the general form → `references/gotchas_universal.md` U39.

**AD8: wandb val-phase CPU memory spike to 30+ GB at epoch 1 end.** *Symptom:* at the end of epoch 1
(validation), cgroup memory jumps from ~8 GB to 30+ GB, sometimes wedging the instance. *Root cause:*
project trainers log per-sample distributions at `step==1` (e.g. LPIPS/VGG over ~2000 samples on CPU =
~30 GB activations). *Fix:* cap the val-time sample accumulator with
`-o training.val_metric_sample_cap=256` (project-specific knob; check the trainer for the equivalent).
Distinct from a DataLoader-worker cgroup OOM (universal gotcha).

**AD9: project torch pin would DOWNGRADE the image's working build.** *Symptom:* the image ships e.g. a
new-arch-capable torch (sm_120); the project pins `torch<2.9`; a naive `pip install -r requirements.txt`
replaces it with a wheel lacking the arch's kernels → `no kernel image is available` at first forward.
*Root cause:* the image torch/CUDA build is matched to the rented GPU arch; the project pin is stale for it.
*Fix:* filter framework pins out of the remote install with
`grep -ivE '^(torch|torchvision|torchaudio)([^A-Za-z0-9._-]|$)' requirements.txt > /root/req_remote.txt && pip install -r /root/req_remote.txt`
(the trailing group keeps unrelated packages such as `torchmetrics`, which a bare `^torch` prefix would also
drop); keep the image build; smoke `torch.cuda.get_device_capability()` + a heavy import
before launch; disclose the off-band torch version with results.

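The pin filter can be sketched as a reusable function. This variant matches exact package names only, which is stricter than a bare `^torch` prefix match; `filter_framework_pins` is a hypothetical helper name and the output path is whatever you pass in.

```sh
# Copy a requirements file, dropping only the framework pins the image already
# satisfies. Exact names only, so neighbours such as torchmetrics are kept.
# Args: <requirements-in> <requirements-out>
filter_framework_pins() {
    grep -ivE '^[[:space:]]*(torch|torchvision|torchaudio)([^A-Za-z0-9._-]|$)' "$1" > "$2"
    # grep exits 1 when every line was filtered out; only real errors (>1) fail
    [ "$?" -le 1 ]
}

# Usage:
#   filter_framework_pins requirements.txt /root/req_remote.txt \
#     && /root/miniconda3/bin/python -m pip install -r /root/req_remote.txt
```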

---

## 8. SCRIPT OVERRIDES

The exact values to parameterize the `scripts/` templates (`scripts/run_one.sh.template`,
`scripts/run_queue.sh.template`) for AutoDL:

```sh
DATA_DIR=/root/autodl-tmp # fast NVMe data disk: live checkpoints, logs, HF cache
DURABLE_DIR=/root/autodl-fs # region-locked shared FS: the only tier surviving 释放
PROXY_HOOK='source /etc/network_turbo 2>/dev/null || true' # MANDATORY before any external call (AD1)
CRED_FILE=/root/.wandb_key # per-instance ONLY: the FS security classifier blocks wandb keys
SCRATCH='latest.pth' # prune on success; keep best.pth (the keepable artifact)
HF_HOME=/root/autodl-tmp/huggingface_cache # redirect off the symlinked ~/.cache hog (AD5)
HF_ENDPOINT=https://hf-mirror.com # China mirror for the HF API (AD2)
HF_HUB_DISABLE_XET=1 # force the classic LFS path that the mirror proxies (AD2)
DETACH=tmux # nohup fallback when tmux is absent (§6)
PY=/root/miniconda3/bin/python # base IS the env: explicit interpreter, never bare python3 (AD6)
TB_LOGDIR=/root/tf-logs # platform TB is pinned here (AD7)
```

**Credential push (AD-specific).** The FS security classifier blocks files matching wandb-key patterns, so
put the key at the **per-instance** `/root/.wandb_key`, never on `/root/autodl-fs`. Stream exactly one
credential block via stdin so the secret never appears in a command line; the wrapper reads it
into `WANDB_API_KEY` before launch. Secrets-via-stdin pattern → `references/ssh_transport.md`.

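The stdin handoff might be sketched as a writer and a reader. Both function names are hypothetical, and this is only one way to realize the pattern; the referenced transport notes remain the authoritative version.

```sh
# Write a secret read from stdin to a file with owner-only permissions.
# The value never appears in argv, so it stays out of `ps` and shell history.
# Args: [target-file]; defaults to /root/.wandb_key.
store_secret_from_stdin() {
    target=${1:-/root/.wandb_key}
    (umask 077 && cat > "$target" && chmod 600 "$target")
}

# Print the stored secret with whitespace and newlines stripped.
read_secret() {
    tr -d ' \t\r\n' < "${1:-/root/.wandb_key}"
}
```

Usage from the local machine, assuming the key sits in a local file of your choosing: `ssh autodl-<proj>-<N> 'umask 077; cat > /root/.wandb_key' < <local-key-file>`. In the wrapper: `WANDB_API_KEY=$(read_secret); export WANDB_API_KEY`.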

**Checked-sync (the gated success line).** `run_one.sh` writes live checkpoints to
`$DATA_DIR/checkpoints/<name>`, prunes `latest.pth` on success, then syncs `best.pth` to
`$DURABLE_DIR/final_ckpts/<name>`, **gating the success echo on the actual copy result**: an unconditional
"synced" lies when the FS inode cap (AD4) silently fails the `mkdir`/`cp` (universal silent-sync gotcha).
Until a download is verified locally, the **data disk** copy is the source of truth.

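A gated sync could look like the sketch below: the success line is printed only when `mkdir`, `cp`, and a byte-for-byte comparison all pass. `checked_sync` is a hypothetical name, not the code inside the bundled template, and `cmp` rereads both files, which is slow for large checkpoints on NFS.

```sh
# Copy one file into a destination directory and report success only if the
# copy really landed (mkdir, cp and a byte comparison all have to pass).
# Args: <source-file> <destination-dir>
checked_sync() {
    src=$1
    dst_dir=$2
    dst="$dst_dir/$(basename "$src")"
    if mkdir -p "$dst_dir" && cp "$src" "$dst" && cmp -s "$src" "$dst"; then
        echo "SYNCED $dst"
        return 0
    fi
    echo "SYNC FAILED $src -> $dst_dir (check df -i)" >&2
    return 1
}
```

Usage: `checked_sync "$DATA_DIR/checkpoints/<name>/best.pth" "$DURABLE_DIR/final_ckpts/<name>" || exit 1`, so a failed copy stops the wrapper instead of being reported as done.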