---
platform: vastai
kind: ssh-rental
meter_stop_verb: destroy # the action that STOPS billing; stop keeps billing disk forever (compute off, storage on)
meter_stop_irreversible: true # destroy permanently deletes container disk
detach_primitive: tmux # auto-attached on login; dies on container restart → onstart.sh is the durable hook
spot_available: true # interruptible (bid) auction — central to the platform
spot_grace: ~0s # preemption is an abrupt pause, no documented notice / no SIGTERM
shared_fs: false # NO platform-wide FS; Volumes are machine-locked (per-GPU bound on restart)
inode_cap: host-dependent # undocumented; whatever the host's Docker storage driver gives
free_egress: host-dependent # CORRECTED: host-set bandwidth price; billed per byte in AND out, often $0 but not guaranteed
china_mirror_needed: false # no China DCs and no platform proxy; fix HF at workload level
host_driver_cuda_max: image-dependent # CUDA ships in the chosen Docker image; must be ≤ host driver
local_nvme: host-dependent
---
# vast.ai — platform profile
One-line purpose: rent a marketplace GPU as a **Docker image on a third-party host**, run a spot-resumable
job, and **copy results off before `destroy`** — the only verb that stops the full meter.
> **Surface to the user up front (principle #10):** ⚠️ Danger clocks — a **`stop`ped instance bills its disk FOREVER** (only `destroy` stops the full meter, and `destroy` deletes everything); **bandwidth/egress bills continuously**, host-priced. Risk — rent only **verified, high-reliability** hosts with a direct port (an unverified host can vanish mid-run); cloud-sync works even while stopped (§5), the cleanest durable target.
**Table of contents** (`grep -in '^## ' profiles/vastai.md` to jump):
- §1 LAUNCH — offer-driven, Docker-image-is-the-env
- §2 STORAGE MODEL — per-machine-local disk; survival matrix; cloud-sync escape hatch
- §3 NETWORK — proxy vs direct SSH; random ports; host-set bandwidth; no China proxy
- §4 SPOT / INTERRUPTION + RESUME — bid auction, ~0 s pause, GPU-bound resume, status-poll loop
- §5 TEARDOWN / BILLING — `destroy` is the meter-stop; `stop` bills disk forever; bandwidth bills always
- §6 DAEMON TOOL — tmux dies on restart; `onstart.sh` is the durable relaunch
- §7 TOP GOTCHAS — VAST1–VAST13, platform-pinned + Platform-specific debugging
- §8 SCRIPT OVERRIDES — values to parameterize `scripts/`
Universal gotchas are NOT restated here — see `references/gotchas_universal.md`. Spot cadence math and
atomic-resume live in `references/spot-resilience.md`.
**The one fact that reshapes everything:** vast.ai is a **decentralized marketplace of third-party hosts**,
not a uniform first-party cloud. Consequences that diverge from AutoDL: **no platform-wide shared FS**, **no
China-mirror proxy**, **no single prebuilt conda env** (the Docker image IS the env), **storage is locked to
one physical host and even one GPU ID**, **bandwidth is host-priced (not free by fiat)**, and
**interruptible (bid) preemption is a real, central, abrupt model**.
---
## 1. LAUNCH
**Entry points** (all equivalent): web console (`cloud.vast.ai`), the `vastai` CLI / Python SDK, the REST
API (`https://console.vast.ai/api/v1/...`, Bearer token), and SSH into the running container. The CLI is the
orchestration surface: `pip install vastai`, then `vastai set api-key $VAST_API_KEY` (env-var name only —
never inline the key).
**Env contract — the Docker image IS the env.** A bare VM is not offered by default; the create call MUST
specify `--image` (e.g. `pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime`). **CUDA version is whatever the
image ships** — a mismatch with the host driver is a real failure mode (VAST5). The image's default Python
env is the low-friction place to run — do not `conda create` on a rental (the remote-base exception holds).
Note: **Docker-in-Docker is not supported** "due to security constraints" (verified
docs.vast.ai/.../faq/instances 2026-06) — a containerized inner runtime is not an option here.
**Launch is offer-driven and two-step** (search a marketplace offer → create onto it):
```bash
#!/usr/bin/env bash
set -u
# 1) find a verified, rentable offer with at least one direct port; sort by DLPerf per dollar, best value first
vastai search offers 'gpu_name=RTX_4090 num_gpus=1 verified=true rentable=true direct_port_count>=1' -o 'dlperf_usd-'
# 2) create onto the chosen OFFER_ID; --direct enables direct-TCP SSH (see §3)
vastai create instance OFFER_ID --image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime \
--disk 50 --ssh --direct --onstart-cmd 'nvidia-smi && bash /workspace/onstart.sh'
```
`--onstart-cmd` (**max 16 KB**; for a longer script, gzip+base64-encode it) is written to `/root/onstart.sh`
and **re-runs on every container start** — this is the platform-native boot hook and the durable relaunch
path (§6) (verified docs.vast.ai/cli/commands 2026-06). Filter offers hard: an unverified, low-reliability
host can simply vanish (`Offline`) mid-run (VAST7). Boot is not instant: the host must **pull the Docker
image and boot — typically 1–5 min depending on image size** (verified docs.vast.ai CLI Hello World 2026-06);
a fat image stuck in `Loading` is the slow-download symptom (VAST13).
**verify:** `vastai show instance <id>` (the instance ID returned by `create`, not the offer ID) lists the
new instance as `running`, and an in-container `nvidia-smi` (via `--onstart-cmd` or first SSH) shows the
expected GPU and a driver CUDA version at or above the CUDA the image ships.
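
The 16 KB limit on `--onstart-cmd` can be checked before the create call. The sketch below is one possible way to do it: a short script passes through unchanged, and a longer one is gzip+base64-encoded. `onstart_payload` is a hypothetical helper name, and the one-line decoder (`base64 -d | gunzip | bash`) is an assumption about how the encoded script gets unpacked on the box, not documented vast.ai behavior.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fit a boot script into the 16 KB --onstart-cmd limit.
ONSTART_MAX_BYTES=16384   # "max 16 KB", taken here as 16 * 1024 (assumption)

onstart_payload() {
  local script="$1" size encoded payload
  size=$(( $(wc -c < "$script") ))
  if [ "$size" -le "$ONSTART_MAX_BYTES" ]; then
    cat "$script"                       # small enough: pass through unchanged
    return 0
  fi
  # Too long: gzip + base64 it and wrap it in a one-line decoder (assumed form).
  encoded=$(gzip -c "$script" | base64 | tr -d '\n')
  payload="echo '${encoded}' | base64 -d | gunzip | bash"
  if [ "${#payload}" -gt "$ONSTART_MAX_BYTES" ]; then
    echo "onstart script is still over the limit after compression" >&2
    return 1
  fi
  printf '%s\n' "$payload"
}
```

A create call could then pass `--onstart-cmd "$(onstart_payload onstart.sh)"`, with `onstart.sh` standing in for whatever local boot script is in use.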
---
## 2. STORAGE MODEL *(survival matrix — principle #4)*
Three tiers; the persistence + region story is the single biggest divergence from AutoDL — **there is no
region-wide shared FS** to sync to. (verified docs.vast.ai/.../storage/types 2026-06)
| Tier | Path | Speed | Survives STOP? | Survives DESTROY? | Cap |
|---|---|---|---|---|---|
| Container / instance disk (`--disk N`) | `/` + `/workspace` | local | **yes** (bills) | **NO — gone** | fixed at create, **non-resizable**, min **10 GB** (default) |
| Volume (local) | mounted path | local | yes | **yes, until volume deleted** (bills per-GB while it exists) | fixed; **machine-locked**, non-resizable |
| Cloud sync (S3 / GDrive / Backblaze / Dropbox) | off-box bucket | network | yes | **yes — fully off-box** | provider's; **works even while instance is stopped** |
| Network Volume (cross-machine) | — | — | — | — | **not in current storage docs — treat as unavailable** |
**Machine-lock — and per-GPU-lock — is the trap.** A Volume "is tied to the physical machine where created"
and "cannot migrate between different physical machines." Worse, a stopped instance is **bound to a specific
GPU ID**, not just the machine: "When an instance is created, it is bound to a specific GPU ID. If the
instance is stopped, it remains bound to the same GPU ID and waits for that GPU to become available again"
(verified vast-ai.crisp.help scheduling article 2026-06). So a machine can show **available for rent** (other
GPUs free) while the stopped instance is stuck in `Scheduling` waiting for *its* GPU (VAST3).
**Where checkpoints MUST go for the §5 verb:** there is **no durable mount that survives `destroy`** on the
container disk — so the durable target is **off-box**. Two real off-box paths: (a) `vastai copy` the result
to local / another instance / a Volume **before** `destroy`; (b) **Cloud sync** (`vastai cloud copy`) to
S3/GDrive/Backblaze/Dropbox — notably **works even while the instance is stopped** (verified
docs.vast.ai/.../data-movement 2026-06), which makes it the cleanest durable target for a spot job. Always
assume the instance is lost once its lifetime expires. Inode caps and FS type are **undocumented and
host-dependent** (whatever the host's Docker storage driver gives) — `df -i` per host, do not assume an
AutoDL-style platform constant.
**verify:** before any teardown, `vastai copy <id>:/path/to/ckpt local:/path/to/local` exits 0 (or
`vastai cloud copy` completes) AND the local artifact loads (`scripts/verify_local.py`).
---
## 3. NETWORK
**Shared public IP + random external port.** Each instance uses its host's public IP, which is usually shared;
"each open internal port (such as 22 or 8080 etc) is mapped to a *random* external port" read from the
**"IP Port Info" pop-up** (button on the instance) or `vastai show instance` — format
`PUBLIC_IP:33526 -> 8081/tcp` (verified docs.vast.ai/.../connect/networking 2026-06). Ports change per
instance — discover them at runtime, never hard-code. **Hard cap 64 open ports per instance.**
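
Runtime discovery can be as small as a filter over the mapping lines. The sketch below assumes the lines arrive on stdin in exactly the `PUBLIC_IP:33526 -> 8081/tcp` form quoted above; how that text is obtained (pop-up copy or CLI output) is left open, and `external_endpoint_for` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read port-mapping lines on stdin and print the external
# "IP:port" that forwards to the given internal TCP port.
# Assumes the "PUBLIC_IP:EXT -> INT/tcp" line format quoted in this section.
external_endpoint_for() {
  local internal_port="$1"
  awk -v want="${internal_port}/tcp" '
    $2 == "->" && $3 == want { print $1; found = 1; exit }
    END { exit (found ? 0 : 1) }   # non-zero exit when the port is not mapped
  '
}
```

The result splits with plain parameter expansion: `${endpoint%:*}` is the IP and `${endpoint##*:}` is the external port.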
**Two SSH flavors — and the scp size trap:**
- **Proxy SSH** (default, via Vast's proxy): "works on all machines, slower for data transfer." It carries
`scp` but is throttled — vast's own guidance is **scp over proxy only for transfers under ~1 GB**; above
that "using the direct ssh connection is recommended" (verified docs.vast.ai/.../data-movement 2026-06).
- **Direct SSH** (direct-TCP to the host): "requires machines with open ports, faster and more reliable, the
preferred method." This is the one that carries large `scp`/`rsync`/`vastai copy` without stalling. It
**requires the offer to expose open ports** → filter `direct_port_count>=1` and create with `--direct`.
**Rule:** if bulk transfer must work, require **direct-TCP** at create time. `vastai copy` "uses rsync and is
generally fast and efficient, subject to single-link upload/download constraints" — for a multi-GB result,
direct + a resumable loop (`references/gotchas_universal.md` U12). For a big *inbound* dataset, prefer
`wget`/`curl` from a cloud bucket over proxied SSH (much higher throughput). Custom services use Docker `-p`
(e.g. `-p 8081:8081`); Jupyter defaults to internal 8080 gated by `JUPYTER_TOKEN` (override the port via
`JUPYTER_PORT`).
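
The "resumable loop" in the rule above can be sketched as a bounded retry wrapper around a transfer command that keeps partial files. `retry` is a hypothetical helper, not part of the `vastai` CLI; the `rsync` line in the comment assumes a direct-SSH endpoint whose `HOST` and `PORT` were discovered at runtime.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to MAX_ATTEMPTS times, DELAY seconds apart.
# Intended use (HOST and PORT are placeholders for the direct-SSH endpoint):
#   retry 5 30 rsync -a --partial -e "ssh -p $PORT" "root@$HOST:/workspace/ckpt/" ./ckpt/
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0                    # success: stop retrying
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

`rsync --partial` keeps a half-transferred file so the next attempt continues from it instead of starting over; the attempt cap keeps a dead link from looping while the instance keeps billing.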
**Bandwidth is metered and host-priced — NOT free by fiat (corrected).** "You are charged bandwidth prices
for every byte sent or received to or from the instance, regardless of what state it is in," and "pricing is
set by the host and is specific to each offer" (verified docs.vast.ai/.../reference/billing +
.../instances/pricing 2026-06). In practice many hosts price egress at ~$0 (vast is generally a low/zero
egress option), but a given offer **can** charge per-GB in *both* directions — read the per-offer bandwidth
rate (hover the price on the instance card / search page) before a transfer-heavy job. This is why the
frontmatter is `free_egress: host-dependent`, not `true`.
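
A quick estimate makes the per-offer rate concrete before committing to a transfer-heavy job. The sketch below only does the arithmetic; the rates in the example are made-up illustrations, not vast.ai prices, and `transfer_cost_usd` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimated bandwidth cost in USD.
# Args: GB downloaded to the instance, GB uploaded from it,
#       USD per GB inbound, USD per GB outbound (read both from the offer).
transfer_cost_usd() {
  local down_gb="$1" up_gb="$2" down_rate="$3" up_rate="$4"
  awk -v dg="$down_gb" -v ug="$up_gb" -v dr="$down_rate" -v ur="$up_rate" \
    'BEGIN { printf "%.2f\n", dg * dr + ug * ur }'
}

# Example with illustrative rates: 500 GB dataset in at $0.01/GB,
# 50 GB of checkpoints out at $0.02/GB.
transfer_cost_usd 500 50 0.01 0.02   # prints 6.00
```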
**China relevance: none at the platform level.** No China datacenters, no `/etc/network_turbo` equivalent, no
built-in HF mirror. The HF-unreachable problem still exists at the *workload* level from some hosts, but the
fix is the job's **own** `HF_ENDPOINT=https://hf-mirror.com` / `hf_transfer`, not a platform script — see
`references/gotchas_universal.md` (HF download) for the resumable-download ladder.
**verify:** `ssh <alias> 'echo ok'` over the **direct** endpoint, then a 1-file `vastai copy` round-trip
exits 0.
---
## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
vast.ai's **interruptible** rentals are a **live continuous-bid auction** — the cheap-GPU core of the
platform ("can reduce costs by fifty percent or even more"), far more first-class than anything on AutoDL.
(verified vast.ai/article/Rental-Types 2026-06)
- **Bidding:** clients set a bid price; "the current highest bid is the instance that runs, the others are
paused." **On-demand always beats interruptible** regardless of bid amount ("on-demand instances will
always take precedence").
- **The bid is fixed at create.** "The bidding method cannot be changed after an instance is rented"
(verified Rental-Types 2026-06) — so the resume lever is **not** "raise this instance's bid." To recover an
out-priced run, either wait for the higher bid to finish, or **re-launch the identical job on a fresh
offer** (cheaper/on-demand) — which is why off-box checkpoints (§2) matter.
- **Preemption = pause, not destroy.** A preempted instance is paused (disk survives) until its bid regains
top priority or the higher bid finishes. Because storage is machine-/GPU-locked, it can only resume **on
the original host's original GPU** — the resumability cliff (VAST3).
- **Detection signal + grace window:** **little/no advance notice — treat the grace as ~0 s, an abrupt
pause.** No documented termination signal; a SIGTERM-flush handler is **NOT** a safety net. Detect via the
API: `show_instance` returns `actual_status` (current container state), `intended_status` (desired state),
`cur_state` (contract/hardware allocation), and `status_msg` (human string, e.g. "success, running ...")
(verified docs.vast.ai/api-reference/instances/show-instances 2026-06). A preempted instance stops being
`running`; the UI shows **Inactive** (stopped, data preserved) / **Scheduling** (waiting for the GPU to
free) / **Offline** (host gone).
- **Resume hook:** wait for the higher bid to finish or restart the instance; it returns
`Scheduling → running` **only if the same GPU is still free** (else it sticks — VAST3), then
**`/root/onstart.sh` re-runs** and relaunches training (§6). The job itself must be checkpoint-resumable
(`--resume`, load-latest unconditionally) so the identical command resumes idempotently.
**Orchestrator pattern:** poll `actual_status` / `status_msg` on a timer; on preemption, restart (or
re-launch on a new offer) and let `onstart.sh` + checkpoint-resume recover. Cadence formula (Young/Daly) and
atomic temp→fsync→rename resume → `references/spot-resilience.md`.
**verify:** kill-and-resume drill — `vastai stop instance <id>` then `start`; the job resumes from the last
checkpoint step, not epoch 0.
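
A minimal sketch of the status-poll half of the orchestrator pattern, bounded so it cannot loop forever. It assumes `vastai show instance <id> --raw` prints JSON containing an `actual_status` string; the `sed` extraction is deliberately crude (a JSON parser would be more robust), and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Print the instance's actual_status, or nothing if it is null/unavailable.
# Assumes --raw output is JSON with an "actual_status" string field.
instance_status() {
  vastai show instance "$1" --raw 2>/dev/null |
    sed -n 's/.*"actual_status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    head -n 1
}

# Poll until the instance is running. Returns 0 when running, 1 once
# max_attempts polls have passed without reaching that state.
wait_until_running() {
  local id="$1" max_attempts="${2:-60}" interval="${3:-30}" attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(instance_status "$id")
    if [ "$status" = "running" ]; then
      return 0
    fi
    echo "poll $attempt/$max_attempts: actual_status=${status:-unknown}" >&2
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}
```

On a non-zero return the caller decides between restarting the instance and re-launching on a fresh offer from the off-box checkpoint.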
---
## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*
This is the most error-prone section — be precise. (verified docs.vast.ai/.../reference/billing +
.../manage-instances 2026-06)
- **`destroy` is the ONLY thing that stops the full meter** (compute **and** disk). It is **irreversible**:
all container-disk data is permanently deleted. (`vastai destroy instance <id>`)
- **`stop` is a trap:** it detaches the GPU and halts compute billing, but **disk keeps charging
indefinitely** while stopped — "stopping an instance does not avoid storage costs," "you will continue to
be billed for disk storage, even if your balance is negative." The #1 surprise bill on vast.ai.
"Stopped" ≠ "meter off."
- **Bandwidth bills in EVERY state.** Charged "for every byte sent or received... regardless of what state it
is in" — so even a transfer to/from a *stopped* instance (cloud sync) accrues host-set bandwidth cost (§3).
- **A Volume keeps billing after the instance is destroyed** until the volume itself is deleted ("charged per
GB while volume exists," independently from instances).
- **On-demand instances auto-stop when their host-set lifetime expires** — "when the rental end date is
reached, the rental contract expires and the instance is stopped." Data remains until destroyed. An
unattended job can silently end, so checkpoint as if the box disappears at any moment.
- **Zero / negative balance → deletion.** At $0.00 "your instances, storage volumes, and data will be
scheduled for deletion unless you add credits"; without a saved card "your instances and stored data will
be destroyed." There is a "short grace period where your balance may go negative before deletion occurs" —
do not rely on it.
- **Poll-loop cost trap:** a status-poll loop with no timeout/error check will loop forever while the
instance keeps accruing disk + bandwidth charges. Bound every poll loop with `timeout` + an exit check.
**Teardown Iron Law (vast.ai instance):** NO `destroy` until checkpoints are **copied off-box AND verified by
load** — either `vastai copy`-ed to local (`scripts/verify_local.py` reports 100% OK) or `vastai cloud copy`
confirmed — the copy exit status is checked (VAST2), and the user has **explicitly approved** the
cost-affecting action. "It looked done in the log" is not evidence (principle #3). Because `destroy` deletes
the disk and there is **no shared FS to fall back on**, the confirmation gate matters more here, not less.
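
The gate can be sketched as a wrapper that refuses to reach `destroy` unless every earlier step passed. This is an illustration under assumptions: `safe_destroy` and `verify_artifact` are hypothetical names, the `verify_artifact` default only checks for a non-empty file and stands in for the real load test (`scripts/verify_local.py`, whose interface is not shown here), and approval is modeled as typing the instance ID.

```bash
#!/usr/bin/env bash
# Placeholder check: replace with the real load-based verification.
verify_artifact() { [ -s "$1" ]; }

# Copy off-box, verify, ask for approval, and only then destroy.
safe_destroy() {
  local id="$1" remote_path="$2" local_path="$3" answer
  # 1) copy out and check the exit status (VAST2)
  if ! vastai copy "${id}:${remote_path}" "local:${local_path}"; then
    echo "copy failed, NOT destroying instance $id" >&2
    return 1
  fi
  # 2) verify the local artifact, not the log
  if ! verify_artifact "$local_path"; then
    echo "verification failed, NOT destroying instance $id" >&2
    return 1
  fi
  # 3) explicit approval of the cost-affecting, irreversible action
  printf 'Type the instance id (%s) to destroy it: ' "$id" >&2
  read -r answer || answer=""
  if [ "$answer" != "$id" ]; then
    echo "not confirmed, instance $id left untouched" >&2
    return 1
  fi
  vastai destroy instance "$id"
}
```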
---
## 6. DAEMON TOOL
- **Auto-tmux on SSH login** (same as AutoDL): login attaches a tmux session "to keep the session active
even if you disconnect." Disable with `touch ~/.no_auto_tmux` then reconnect (verified docs.vast.ai
jupyter-ssh FAQ 2026-06).
- **tmux survives an SSH disconnect but NOT a container restart/reboot/spot-resume** — a reboot or
spot-resume wipes the tmux session. The **durable relaunch hook is `/root/onstart.sh`** (the
`--onstart-cmd`), which re-runs on every container start. Put the training relaunch there, **not** in
tmux, so a spot-resume actually restarts the job.
- **SSH keys apply only to instances created AFTER the key is added** — existing instances do not get a new
key automatically. Set the account key **before** creating, or inject it via `onstart`. A pasted key missing
its `ssh-rsa`/`ssh-ed25519` prefix or `user@host` suffix fails key auth and falls back to a password prompt;
copy the whole line (verified docs.vast.ai jupyter-ssh FAQ 2026-06).
- **Native queue:** vast.ai has **Serverless / autoscaler** for queue-style workloads, but single-instance
training has no managed scheduler — the orchestrator + `onstart.sh` + checkpoint-resume **is** the queue.
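
A minimal sketch of what the relaunch hook might contain, assuming the job is started by a single resumable command. `TRAIN_CMD`, `train.py`, `train.pid` and `train.log` are hypothetical names; the only parts taken from this profile are that the hook re-runs on every container start and that the job must resume from its latest checkpoint.

```bash
#!/usr/bin/env bash
# Sketch of /workspace/onstart.sh. TRAIN_CMD and the file names are hypothetical.
WORKDIR="${WORKDIR:-/workspace}"
TRAIN_CMD="${TRAIN_CMD:-python train.py --resume}"

relaunch_training() {
  local pidfile="$WORKDIR/train.pid"
  # Guard against a double launch when the script is re-run by hand.
  # Caveat: after a container restart an old PID may belong to another process.
  if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
    echo "already running"
    return 0
  fi
  # Detach without tmux; all output goes to a log on the instance disk.
  ( cd "$WORKDIR" && exec nohup sh -c "$TRAIN_CMD" ) \
    >> "$WORKDIR/train.log" 2>&1 < /dev/null &
  echo "$!" > "$pidfile"
  echo "started"
}

# In the real onstart.sh the final line would be:  relaunch_training
```

Because the command is the same on first boot and on every resume, the checkpoint loader inside the job, not this script, decides where training continues.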
---
## 7. TOP GOTCHAS (platform-pinned; Symptom → Root cause → Fix)
Universal gotchas (CRLF, cgroup OOM, silent-sync, HF stalls, zombie VRAM, GPU-0%-util, scp-resets,
egress-surcharge) live in `references/gotchas_universal.md` — not repeated here.
- **VAST1 — surprise bill on a "stopped" instance.** Symptom: a stopped, idle instance keeps charging for
days, even past a negative balance. → Root cause: `stop` halts compute only; **disk bills forever while
stopped**, and bandwidth bills in every state. → Fix: to stop the meter, **`destroy`** (after copy-out per
§5); never leave an instance merely stopped to "save money."
- **VAST2 — results gone after teardown.** Symptom: `destroy` run, checkpoints irrecoverable. → Root cause:
`destroy` permanently nukes container disk and there's **no platform-wide FS to fall back on**. → Fix:
`vastai copy` out (or `vastai cloud copy` to a bucket) and **check its exit status** BEFORE `destroy`; gate
the success line on the copy result, never on a log claim.
- **VAST3 — paused/stopped instance stuck in `Scheduling` though the machine shows "available."** Symptom:
preempted or stopped run never resumes; the portal still lists the same machine as rentable. → Root cause:
the instance is **bound to a specific GPU ID** (not the machine); if that GPU was re-rented, it waits
indefinitely while *other* GPUs on the host stay free. "If stuck >30 s, GPU likely rented by another user."
→ Fix: stop the scheduling attempt, **create a NEW instance on the same host and re-attach the same Volume**
(works because other GPUs are free), or re-launch on a different offer from an off-box checkpoint; don't
wait for the same GPU to come back (verified vast-ai.crisp.help + manage-instances 2026-06).
- **VAST4 — job dies mid-step with no warning.** Symptom: interruptible run vanishes abruptly. → Root cause:
bid preemption with **~0 s notice and no SIGTERM**; a flush handler never fires. → Fix: periodic checkpoint
to disk on a Young/Daly timer + load-latest-on-resume; poll `actual_status`/`status_msg` and restart (§4,
`references/spot-resilience.md`). The bid can't be raised on a live instance — re-launch elsewhere if the
GPU is gone.
- **VAST5 — CUDA driver mismatch on a fresh box.** Symptom: `torch.cuda.is_available()` is False / driver
mismatch error. → Root cause: **CUDA ships in the Docker image, not the host**; the image's CUDA may be
newer than the host driver supports (image CUDA must be ≤ host driver). → Fix: pick an image whose CUDA ≤
host driver; verify `nvidia-smi`/`nvcc` inside the container in `onstart` before training (general triangle:
`gotchas_universal.md` U28).
- **VAST6 — a service is unreachable on its "own" port.** Symptom: TB/Jupyter/API not reachable at the
internal port. → Root cause: internal ports map to **random external ports** and there's a **64-port cap**
per instance. → Fix: open ports with `-p` at create, **discover the external mapping at runtime**
(`vastai show instance` / IP Port Info pop-up), never hard-code a port.
- **VAST7 — host vanishes mid-run.** Symptom: instance flips to `Offline`, work lost. → Root cause: it's a
**marketplace** — an unverified/low-reliability host can disconnect. → Fix: filter offers on
`verified=true`, high `reliability`, and `direct_port_count>=1`; treat any single host as disposable and
checkpoint off-box accordingly.
- **VAST8 — bulk `scp` over the default SSH stalls / crawls.** Symptom: a multi-GB result copy over the
default endpoint hangs or runs at a trickle. → Root cause: the **default is proxy SSH**, throttled and
recommended only for <1 GB; large transfers need direct-TCP. → Fix: create with `--direct` (offer must have
`direct_port_count>=1`) and use that endpoint for `scp`/`vastai copy`; for big *inbound* data prefer
`wget`/`curl` from a bucket (verified data-movement docs 2026-06).
- **VAST9 — bandwidth shows up on the bill.** Symptom: a transfer-heavy job costs more than the GPU-hours
alone. → Root cause: bandwidth is **host-priced and metered per byte in both directions, in every state**;
some offers are not $0-egress. → Fix: read the per-offer bandwidth rate before committing; pull a dataset
**once** to durable local/Volume, not per-epoch from a remote bucket (general form: `gotchas_universal.md`
U14/U15).
- **VAST10 — disk full, and you can't grow it.** Symptom: `No space left on device` mid-run; `--disk` can't
be raised. → Root cause: container disk is **fixed at create (min 10 GB) and non-resizable**; Docker
layers + HF cache + checkpoints overrun it. → Fix: over-provision `--disk` at create; redirect `HF_HOME`
onto the data disk; prune `latest`/periodic checkpoints, keep only `best` (inode/byte audit:
`gotchas_universal.md` U6/U7).
- **VAST11 — secret baked into the image or onstart-cmd is recoverable.** Symptom: a key embedded at build
time or in `--onstart-cmd` is stored by the platform. → Root cause: image layers and the 16 KB onstart
string are persisted server-side. → Fix: inject `WANDB_API_KEY`/`HF_TOKEN` via **env vars at create**, never
baked into image layers or `--onstart-cmd`; stream creds via stdin at runtime (`gotchas_universal.md` U34).
- **VAST12 — assuming a cross-machine Network Volume exists.** Symptom: a plan relies on a Volume following
the job to a different host. → Root cause: Volumes are **machine-locked**; cross-machine Network Volumes are
**not in the current storage docs**. → Fix: design for off-box durability (`vastai cloud copy` to a bucket),
not a portable volume; only same-machine re-attach is reliable.
- **VAST13 — instance stuck in `Loading`, never reaches `running`.** Symptom: a new instance sits in
`Loading`/`Connecting` for many minutes. → Root cause: the host is **pulling a large Docker image** (boot is
1–5 min, longer for fat images) or the host link is slow. → Fix: wait out the documented window, then read
`vastai show logs <id>` (below) for the pull progress; if still stuck, `destroy` and re-create on a faster
offer with a slimmer image.
### Platform-specific debugging (commands + what to check)
- **Read the boot/container/system logs from off-box:**
`vastai show logs <id> --tail 200 [--filter <grep>] [--daemon-logs]` uploads container logs (and, with
`--daemon-logs`, host/system logs) to a generated URL. This is the first stop for a box that won't connect,
a stuck `Loading`, or a silent `onstart` failure (verified docs.vast.ai/api-reference/instances/show-logs
2026-06). The GUI equivalent is the **"Logs" button** on the instance card.
- **Inspect the live state machine without SSH:** `vastai show instance <id>` (or the API): compare
`actual_status` (where the container *is*), `intended_status` (where it *should* be), `cur_state` (contract/
hardware allocation) and `status_msg`. `intended=running` but `actual≠running` + `Scheduling` → VAST3
(GPU-bound wait); `Offline` → VAST7 (host gone).
- **Confirm the GPU is really attached:** in `onstart` / first SSH run `nvidia-smi` and
`python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"`; `False`/CPU-only → VAST5
(image CUDA > host driver) or no-GPU container (`gotchas_universal.md` U31).
- **Detect a stuck download inside the box:** `du -sh ~/.cache/huggingface/hub` over time (no growth = stalled
HF pull), `df -h /` (filling = active download) and `df -i /` (inodes), then the resumable-download ladder
in `gotchas_universal.md` (HF). A fat-image stall *before* SSH is visible only via `vastai show logs`.
- **Find the real external ports / SSH target:** `vastai show instance <id>` lists the port map and
`vastai ssh-url <id>` prints the connection string — never assume port 22 is reachable (VAST6).
---
## 8. SCRIPT OVERRIDES
Values to parameterize the `scripts/` templates for vast.ai:
```bash
# DATA_DIR — data + (only) checkpoint mount; NOTHING survives destroy, so durable = off-box copy-out/cloud-sync
DATA_DIR=/workspace # container disk; survives stop, bills forever, GONE on destroy
DURABLE_DIR=off-box # no destroy-surviving mount: vastai copy / vastai cloud copy before destroy (§5)
# PROXY_HOOK — none at platform level (no /etc/network_turbo). HF mirror is the JOB's own env if needed:
PROXY_HOOK='' # set HF_ENDPOINT=https://hf-mirror.com in the job env only if a host can't reach HF
# CRED_FILE — empty: vast's key is the VAST_API_KEY env var, not a file. WANDB_API_KEY/HF_TOKEN also arrive via env.
CRED_FILE="" # no cred FILE on disk → run_one's [ -n "$CRED_FILE" ] guard skips the cat; VAST_API_KEY + WANDB_API_KEY/HF_TOKEN injected via env at create, NOT into the image or onstart-cmd
# SCRATCH — what to prune (disk is fixed-size, non-resizable → prune aggressively)
SCRATCH='latest.pth periodic-*.pth *.tmp ~/.cache/huggingface/hub/blobs' # keep only best + tiny eval JSONs
# HF_HOME — redirect cache off the small root onto the data disk
HF_HOME=/workspace/.cache/huggingface
# DETACH — durable relaunch is onstart.sh, NOT tmux (tmux dies on container restart/spot-resume)
DETACH='/root/onstart.sh' # re-runs on every container start; tmux only for an attached SSH session
```
**Secrets note:** inject `WANDB_API_KEY` / `HF_TOKEN` via **env vars at create**, never baked into the Docker
image layers or the 16 KB `--onstart-cmd` (both are stored by the platform — VAST11).