| platform | kind | meter_stop_verb | meter_stop_irreversible | detach_primitive | spot_available | spot_grace | shared_fs | inode_cap | free_egress | china_mirror_needed | host_driver_cuda_max | local_nvme |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vastai | ssh-rental | destroy | true | tmux | true | ~0s | false | host-dependent | host-dependent | false | image-dependent | host-dependent |
# vast.ai: platform profile
One-line purpose: rent a marketplace GPU as a Docker image on a third-party host, run a spot-resumable
job, and copy results off before destroy — the only verb that stops the full meter.
Surface to the user up front (principle #10): ⚠️ Danger clocks: a stopped instance bills its disk FOREVER (only `destroy` stops the full meter, and `destroy` deletes everything); bandwidth/egress bills continuously, host-priced. Risk: rent only verified, high-reliability hosts with a direct port (an unverified host can vanish mid-run); cloud-sync works even while stopped (§5), the cleanest durable target.
Table of contents (grep -in '^## ' profiles/vastai.md to jump):
- §1 LAUNCH: offer-driven, Docker-image-is-the-env
- §2 STORAGE MODEL: per-machine-local disk; survival matrix; cloud-sync escape hatch
- §3 NETWORK: proxy vs direct SSH; random ports; host-set bandwidth; no China proxy
- §4 SPOT / INTERRUPTION + RESUME: bid auction, ~0 s pause, GPU-bound resume, status-poll loop
- §5 TEARDOWN / BILLING: `destroy` is the meter-stop; `stop` bills disk forever; bandwidth bills always
- §6 DAEMON TOOL: tmux dies on restart; `onstart.sh` is the durable relaunch
- §7 TOP GOTCHAS: VAST1–VAST13, platform-pinned + Platform-specific debugging
- §8 SCRIPT OVERRIDES: values to parameterize `scripts/`
Universal gotchas are NOT restated here — see references/gotchas_universal.md. Spot cadence math and
atomic-resume live in references/spot-resilience.md.
The one fact that reshapes everything: vast.ai is a decentralized marketplace of third-party hosts, not a uniform first-party cloud. Consequences that diverge from AutoDL: no platform-wide shared FS, no China-mirror proxy, no single prebuilt conda env (the Docker image IS the env), storage is locked to one physical host and even one GPU ID, bandwidth is host-priced (not free by fiat), and interruptible (bid) preemption is a real, central, abrupt model.
## 1. LAUNCH
Entry points (all equivalent): web console (cloud.vast.ai), the vastai CLI / Python SDK, the REST
API (https://console.vast.ai/api/v1/..., Bearer token), and SSH into the running container. The CLI is the
orchestration surface: pip install vastai, then vastai set api-key $VAST_API_KEY (env-var name only —
never inline the key).
Env contract — the Docker image IS the env. A bare VM is not offered by default; the create call MUST
specify --image (e.g. pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime). CUDA version is whatever the
image ships — a mismatch with the host driver is a real failure mode (VAST5). The image's default Python
env is the low-friction place to run — do not conda create on a rental (the remote-base exception holds).
Note: Docker-in-Docker is not supported "due to security constraints" (verified
docs.vast.ai/.../faq/instances 2026-06) — a containerized inner runtime is not an option here.
Launch is offer-driven and two-step (search a marketplace offer → create onto it):
```bash
#!/usr/bin/env bash
set -u
# 1) find a verified, rentable offer with at least one direct port, best DLPerf per dollar first
vastai search offers 'gpu_name=RTX_4090 num_gpus=1 verified=true rentable=true direct_port_count>=1' -o 'dlperf_usd-'
# 2) create onto the chosen offer; --direct enables direct-TCP SSH (see §3).
#    OFFER_ID must hold an id picked from the search output above.
vastai create instance "${OFFER_ID:?set OFFER_ID to an id from the search output}" \
  --image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime \
  --disk 50 --ssh --direct --onstart-cmd 'nvidia-smi && bash /workspace/onstart.sh'
```
--onstart-cmd (max 16 KB; for a longer script, gzip+base64-encode it) is written to /root/onstart.sh
and re-runs on every container start — this is the platform-native boot hook and the durable relaunch
path (§6) (verified docs.vast.ai/cli/commands 2026-06). Filter offers hard: an unverified, low-reliability
host can simply vanish (Offline) mid-run (VAST7). Boot is not instant: the host must pull the Docker
image and boot — typically 1–5 min depending on image size (verified docs.vast.ai CLI Hello World 2026-06);
a fat image stuck in Loading is the slow-download symptom (VAST13).
→ verify: vastai show instance <id> (the instance ID returned by the create call, not the offer ID) lists the
new instance running, and an in-container nvidia-smi (via --onstart-cmd or first SSH) shows the expected GPU and a driver CUDA version no older than the image's CUDA (VAST5).
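
The 16 KB cap on `--onstart-cmd` and the gzip+base64 escape hatch can be sketched as a small packing helper. This is a minimal illustration, not vast.ai's own tooling: `pack_onstart` and `unpack_blob` are hypothetical names, the cap is assumed to mean 16 × 1024 bytes, and the shell wrapper that decodes the blob on the box is an assumption that requires `base64`, `gunzip` and `bash` inside the image.

```python
import base64
import gzip

ONSTART_MAX_BYTES = 16 * 1024  # the 16 KB cap quoted above (assumed to be bytes)


def pack_onstart(script: str) -> str:
    """Return the script unchanged if it fits, else a gzip+base64 one-liner."""
    raw = script.encode("utf-8")
    if len(raw) <= ONSTART_MAX_BYTES:
        return script
    blob = base64.b64encode(gzip.compress(raw)).decode("ascii")
    # Assumption: the container image ships base64, gunzip and bash.
    wrapped = f"echo {blob} | base64 -d | gunzip | bash"
    if len(wrapped.encode("ascii")) > ONSTART_MAX_BYTES:
        raise ValueError("onstart script exceeds 16 KB even after gzip+base64")
    return wrapped


def unpack_blob(wrapped: str) -> str:
    """Inverse of the wrapper, used here only to check the round trip."""
    blob = wrapped.split()[1]
    return gzip.decompress(base64.b64decode(blob)).decode("utf-8")
```

A short command passes through untouched; only an oversized script is wrapped, and a script that is still too large after compression fails loudly instead of being truncated.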
## 2. STORAGE MODEL (survival matrix, principle #4)
Three tiers; the persistence + region story is the single biggest divergence from AutoDL — there is no region-wide shared FS to sync to. (verified docs.vast.ai/.../storage/types 2026-06)
| Tier | Path | Speed | Survives STOP? | Survives DESTROY? | Cap |
|---|---|---|---|---|---|
| Container / instance disk (`--disk N`) | `/` + `/workspace` | local | yes (bills) | NO (gone) | fixed at create, non-resizable, min 10 GB (default) |
| Volume (local) | mounted path | local | yes | yes, until volume deleted (bills per-GB while it exists) | fixed; machine-locked, non-resizable |
| Cloud sync (S3 / GDrive / Backblaze / Dropbox) | off-box bucket | network | yes | yes — fully off-box | provider's; works even while instance is stopped |
| Network Volume (cross-machine) | — | — | — | — | not in current storage docs — treat as unavailable |
Machine-lock — and per-GPU-lock — is the trap. A Volume "is tied to the physical machine where created"
and "cannot migrate between different physical machines." Worse, a stopped instance is bound to a specific
GPU ID, not just the machine: "When an instance is created, it is bound to a specific GPU ID. If the
instance is stopped, it remains bound to the same GPU ID and waits for that GPU to become available again"
(verified vast-ai.crisp.help scheduling article 2026-06). So a machine can show available for rent (other
GPUs free) while the stopped instance is stuck in Scheduling waiting for its GPU (VAST3).
Where checkpoints MUST go for the §5 verb: there is no durable mount that survives destroy on the
container disk — so the durable target is off-box. Two real off-box paths: (a) vastai copy the result
to local / another instance / a Volume before destroy; (b) Cloud sync (vastai cloud copy) to
S3/GDrive/Backblaze/Dropbox — notably works even while the instance is stopped (verified
docs.vast.ai/.../data-movement 2026-06), which makes it the cleanest durable target for a spot job. Always
assume the instance is lost once its lifetime expires. Inode caps and FS type are undocumented and
host-dependent (whatever the host's Docker storage driver gives) — df -i per host, do not assume an
AutoDL-style platform constant.
→ verify: before any teardown, vastai copy <id>:/path/to/ckpt local:/path/to/local exits 0 (or
vastai cloud copy completes) AND the local artifact loads (scripts/verify_local.py).
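
The copy-out half of that check can be sketched as a thin wrapper that reports success only from the exit status. This is an illustrative sketch: `copy_out` is a hypothetical helper, the argument layout mirrors the `vastai copy <id>:/path local:/path` form shown above, and the `run` parameter exists only so the logic can be exercised without a real instance.

```python
import subprocess
from typing import Callable


def copy_out(instance_id: int, remote_path: str, local_path: str,
             run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> bool:
    """Run `vastai copy <id>:<remote> local:<local>`; True only on exit status 0."""
    argv = ["vastai", "copy", f"{instance_id}:{remote_path}", f"local:{local_path}"]
    result = run(argv)
    return result.returncode == 0
```

A `True` here covers only the transfer; the load check (scripts/verify_local.py) is still a separate gate.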
## 3. NETWORK
Shared public IP + random external port. Each instance sits behind its host's public IP, which is usually shared with other instances;
"each open internal port (such as 22 or 8080 etc) is mapped to a random external port" read from the
"IP Port Info" pop-up (button on the instance) or vastai show instance — format
PUBLIC_IP:33526 -> 8081/tcp (verified docs.vast.ai/.../connect/networking 2026-06). Ports change per
instance — discover them at runtime, never hard-code. Hard cap 64 open ports per instance.
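
Runtime discovery can be sketched as a parser for the `PUBLIC_IP:33526 -> 8081/tcp` format quoted above. This is a minimal sketch that assumes one mapping per line in exactly that shape; `parse_port_map` is a hypothetical helper, and how the text is obtained (pop-up, CLI output) is left to the caller.

```python
import re

_PORT_LINE = re.compile(
    r"^(?P<ip>\S+):(?P<external>\d+)\s*->\s*(?P<internal>\d+)/(?P<proto>\w+)$"
)


def parse_port_map(text: str) -> dict:
    """Map internal port -> (public_ip, external_port) for each well-formed line."""
    mapping = {}
    for line in text.splitlines():
        match = _PORT_LINE.match(line.strip())
        if match:  # silently skip lines that are not port mappings
            mapping[int(match["internal"])] = (match["ip"], int(match["external"]))
    return mapping
```

Looking a service up by its internal port (for example `parse_port_map(text)[8081]`) keeps the random external port out of any config file.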
Two SSH flavors — and the scp size trap:
- Proxy SSH (default, via Vast's proxy): "works on all machines, slower for data transfer." It carries `scp` but is throttled: vast's own guidance is scp over proxy only for transfers under ~1 GB; above that "using the direct ssh connection is recommended" (verified docs.vast.ai/.../data-movement 2026-06).
- Direct SSH (direct-TCP to the host): "requires machines with open ports, faster and more reliable, the preferred method." This is the one that carries large `scp` / `rsync` / `vastai copy` without stalling. It requires the offer to expose open ports → filter `direct_port_count>=1` and create with `--direct`.
Rule: if bulk transfer must work, require direct-TCP at create time. vastai copy "uses rsync and is
generally fast and efficient, subject to single-link upload/download constraints" — for a multi-GB result,
direct + a resumable loop (references/gotchas_universal.md U12). For a big inbound dataset, prefer
wget/curl from a cloud bucket over proxied SSH (much higher throughput). Custom services use Docker -p
(e.g. -p 8081:8081); Jupyter defaults to internal 8080 gated by JUPYTER_TOKEN (override the port via
JUPYTER_PORT).
Bandwidth is metered and host-priced — NOT free by fiat (corrected). "You are charged bandwidth prices
for every byte sent or received to or from the instance, regardless of what state it is in," and "pricing is
set by the host and is specific to each offer" (verified docs.vast.ai/.../reference/billing +
.../instances/pricing 2026-06). In practice many hosts price egress at ~$0 (vast is generally a low/zero
egress option), but a given offer can charge per-GB in both directions — read the per-offer bandwidth
rate (hover the price on the instance card / search page) before a transfer-heavy job. This is why the
frontmatter is free_egress: host-dependent, not true.
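
The per-byte, both-directions billing rule is easy to turn into a pre-flight estimate. The sketch below is plain arithmetic under stated assumptions: the function names are hypothetical, the rates are whatever the chosen offer displays (in USD per GB), and "sent" / "received" are taken from the instance's point of view.

```python
def transfer_cost_usd(sent_gb: float, received_gb: float,
                      sent_usd_per_gb: float, received_usd_per_gb: float) -> float:
    """Bandwidth cost for one job: both directions are metered."""
    return sent_gb * sent_usd_per_gb + received_gb * received_usd_per_gb


def dataset_pull_cost_usd(dataset_gb: float, epochs: int,
                          received_usd_per_gb: float, pull_once: bool) -> float:
    """Compare pulling a dataset once against re-pulling it every epoch."""
    pulls = 1 if pull_once else epochs
    return transfer_cost_usd(0.0, dataset_gb * pulls, 0.0, received_usd_per_gb)
```

With a hypothetical rate of $0.01/GB, a 200 GB dataset re-pulled for 10 epochs costs ten times the single pull, which is the arithmetic behind reading the per-offer rate first.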
China relevance: none at the platform level. No China datacenters, no /etc/network_turbo equivalent, no
built-in HF mirror. The HF-unreachable problem still exists at the workload level from some hosts, but the
fix is the job's own HF_ENDPOINT=https://hf-mirror.com / hf_transfer, not a platform script — see
references/gotchas_universal.md (HF download) for the resumable-download ladder.
→ verify: ssh <alias> 'echo ok' over the direct endpoint, then a 1-file vastai copy round-trip
exits 0.
## 4. SPOT / INTERRUPTION + RESUME (principle #7/#8)
vast.ai's interruptible rentals are a live continuous-bid auction — the cheap-GPU core of the platform ("can reduce costs by fifty percent or even more"), far more first-class than anything on AutoDL. (verified vast.ai/article/Rental-Types 2026-06)
- Bidding: clients set a bid price; "the current highest bid is the instance that runs, the others are paused." On-demand always beats interruptible regardless of bid amount ("on-demand instances will always take precedence").
- The bid is fixed at create. "The bidding method cannot be changed after an instance is rented" (verified Rental-Types 2026-06) — so the resume lever is not "raise this instance's bid." To recover an out-priced run, either wait for the higher bid to finish, or re-launch the identical job on a fresh offer (cheaper/on-demand) — which is why off-box checkpoints (§2) matter.
- Preemption = pause, not destroy. A preempted instance is paused (disk survives) until its bid regains top priority or the higher bid finishes. Because storage is machine-/GPU-locked, it can only resume on the original host's original GPU — the resumability cliff (VAST3).
- Detection signal + grace window: little/no advance notice; treat the grace as ~0 s, an abrupt pause. No documented termination signal; a SIGTERM-flush handler is NOT a safety net. Detect via the API: `show_instance` returns `actual_status` (current container state), `intended_status` (desired state), `cur_state` (contract/hardware allocation), and `status_msg` (human string, e.g. "success, running ...") (verified docs.vast.ai/api-reference/instances/show-instances 2026-06). A preempted instance stops being `running`; the UI shows Inactive (stopped, data preserved) / Scheduling (waiting for the GPU to free) / Offline (host gone).
- Resume hook: wait for the higher bid to finish or restart the instance; it returns `Scheduling → running` only if the same GPU is still free (else it sticks, VAST3), then `/root/onstart.sh` re-runs and relaunches training (§6). The job itself must be checkpoint-resumable (`--resume`, load-latest unconditionally) so the identical command resumes idempotently.
Orchestrator pattern: poll actual_status / status_msg on a timer; on preemption, restart (or
re-launch on a new offer) and let onstart.sh + checkpoint-resume recover. Cadence formula (Young/Daly) and
atomic temp→fsync→rename resume → references/spot-resilience.md.
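
The temp→fsync→rename step named above can be sketched in a few lines. This is a generic illustration of the pattern, not the contents of references/spot-resilience.md; `atomic_write` and `load_latest` are hypothetical names, and the payload is raw bytes so the sketch stays framework-agnostic.

```python
import os
import tempfile
from typing import Optional


def atomic_write(path: str, payload: bytes) -> None:
    """Write so a reader sees the old file or the new one, never a torn one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)     # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_latest(path: str) -> Optional[bytes]:
    """Return the last complete checkpoint, or None on a first run."""
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except FileNotFoundError:
        return None
```

Because the temp file lives in the same directory as the target, a pause that lands mid-write leaves at most a stray `.tmp` file, and the previous checkpoint stays loadable.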
→ verify: kill-and-resume drill — vastai stop instance <id> then start; the job resumes from the last
checkpoint step, not epoch 0.
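
For a rough checkpoint cadence, Young's first-order approximation is interval ≈ sqrt(2 × checkpoint cost × mean time between interruptions). The sketch below shows only that first-order form (Daly's refinement is not included), and both inputs are assumptions the operator has to measure: how long one checkpoint takes to write, and how often this bid actually gets preempted.

```python
import math


def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum: sqrt(2 * C * M), all values in seconds."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must both be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
```

For example, a 50 s checkpoint and an observed 10,000 s between preemptions gives a 1,000 s interval; a shorter interval than this wastes time writing, a longer one wastes time recomputing.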
## 5. TEARDOWN / BILLING (principle #9 + the Iron Law)
This is the most error-prone section — be precise. (verified docs.vast.ai/.../reference/billing + .../manage-instances 2026-06)
- `destroy` is the ONLY thing that stops the full meter (compute and disk). It is irreversible: all container-disk data is permanently deleted. (`vastai destroy instance <id>`)
- `stop` is a trap. It detaches the GPU and halts compute billing, but disk keeps charging indefinitely while stopped: "stopping an instance does not avoid storage costs," "you will continue to be billed for disk storage, even if your balance is negative." The #1 surprise bill on vast.ai. "Stopped" ≠ "meter off."
- Bandwidth bills in EVERY state. Charged "for every byte sent or received... regardless of what state it is in", so even a transfer to/from a stopped instance (cloud sync) accrues host-set bandwidth cost (§3).
- A Volume keeps billing after the instance is destroyed until the volume itself is deleted ("charged per GB while volume exists," independently from instances).
- On-demand instances auto-stop when their host-set lifetime expires — "when the rental end date is reached, the rental contract expires and the instance is stopped." Data remains until destroyed. An unattended job can silently end, so checkpoint as if the box disappears at any moment.
- Zero / negative balance → deletion. At $0.00 "your instances, storage volumes, and data will be scheduled for deletion unless you add credits"; without a saved card "your instances and stored data will be destroyed." There is a "short grace period where your balance may go negative before deletion occurs" — do not rely on it.
- Poll-loop cost trap: a status-poll loop with no timeout/error check will loop forever while the instance keeps accruing disk + bandwidth charges. Bound every poll loop with `timeout` + an exit check.
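
A bounded poll loop, as the last bullet asks for, might look like the following. It is a minimal sketch: `wait_for_status` is a hypothetical helper, `get_status` stands in for whatever reads `actual_status`, and the default timeout and interval are placeholders rather than recommended values. Errors raised by `get_status` are deliberately not swallowed, so a failing API call ends the loop instead of spinning.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str], target: str = "running",
                    timeout_s: float = 600.0, interval_s: float = 15.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until get_status() == target; return False once timeout_s has passed."""
    deadline = clock() + timeout_s
    while True:
        if get_status() == target:
            return True
        if clock() >= deadline:
            return False  # caller decides: alert, destroy, or relaunch
        sleep(interval_s)
```

The injectable `sleep` and `clock` are there only so the loop can be tested without waiting; production calls can leave them at their defaults.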
Teardown Iron Law (vast.ai instance): NO destroy until checkpoints are copied off-box AND verified by
load — either vastai copy-ed to local (scripts/verify_local.py reports 100% OK) or vastai cloud copy
confirmed — the copy exit status is checked (VAST2), and the user has explicitly approved the
cost-affecting action. "It looked done in the log" is not evidence (principle #3). Because destroy deletes
the disk and there is no shared FS to fall back on, the confirmation gate matters more here, not less.
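
The Iron Law reads naturally as a gate that lists every unmet condition. The sketch below is one possible encoding, with hypothetical names (`TeardownEvidence`, `destroy_blockers`); it only evaluates evidence the caller has already gathered and does not itself run `destroy`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TeardownEvidence:
    copy_exit_status: Optional[int]  # None means the copy was never run
    verified_by_load: bool           # e.g. scripts/verify_local.py reported 100% OK
    user_approved: bool              # explicit approval of the cost-affecting action


def destroy_blockers(evidence: TeardownEvidence) -> List[str]:
    """Return every unmet condition; an empty list means destroy may proceed."""
    blockers = []
    if evidence.copy_exit_status is None:
        blockers.append("checkpoints not copied off-box")
    elif evidence.copy_exit_status != 0:
        blockers.append(f"copy failed with exit status {evidence.copy_exit_status}")
    if not evidence.verified_by_load:
        blockers.append("local artifact not verified by load")
    if not evidence.user_approved:
        blockers.append("user has not approved destroy")
    return blockers
```

Returning the full list, rather than a single boolean, gives the user the exact reason a teardown was refused.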
## 6. DAEMON TOOL
- Auto-tmux on SSH login (same as AutoDL): login attaches a tmux session "to keep the session active even if you disconnect." Disable with `touch ~/.no_auto_tmux` then reconnect (verified docs.vast.ai jupyter-ssh FAQ 2026-06).
- tmux survives an SSH disconnect but NOT a container restart/reboot/spot-resume: a reboot or spot-resume wipes the tmux session. The durable relaunch hook is `/root/onstart.sh` (the `--onstart-cmd`), which re-runs on every container start. Put the training relaunch there, not in tmux, so a spot-resume actually restarts the job.
- SSH keys apply only to instances created AFTER the key is added; existing instances do not get a new key automatically. Set the account key before creating, or inject it via `onstart`. A pasted key missing its `ssh-rsa` / `ssh-ed25519` prefix or `user@host` suffix fails key authentication and falls back to a password prompt, so copy the whole line (verified docs.vast.ai jupyter-ssh FAQ 2026-06).
- Native queue: vast.ai has Serverless / autoscaler for queue-style workloads, but single-instance training has no managed scheduler; the orchestrator + `onstart.sh` + checkpoint-resume is the queue.
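
The truncated-key failure can be caught before the key is pasted. The check below is a sketch that follows this profile's rule (type prefix, base64 body, `user@host` suffix all present); `key_line_problems` is a hypothetical helper, and the accepted key types are limited to the two named above.

```python
import base64
import binascii
from typing import List

KEY_TYPES = ("ssh-rsa", "ssh-ed25519")  # the two prefixes named above


def key_line_problems(line: str) -> List[str]:
    """Return what is wrong with a pasted public-key line; empty list means OK."""
    parts = line.strip().split(None, 2)
    if not parts or parts[0] not in KEY_TYPES:
        return ["missing ssh-rsa/ssh-ed25519 prefix"]
    problems = []
    if len(parts) < 2:
        problems.append("missing key body")
    else:
        try:
            base64.b64decode(parts[1], validate=True)
        except (binascii.Error, ValueError):
            problems.append("key body is not valid base64")
    if len(parts) < 3:
        problems.append("missing user@host suffix")
    return problems
```

It validates the shape of the line only; it does not confirm that the key matches any private key.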
## 7. TOP GOTCHAS (platform-pinned; Symptom → Root cause → Fix)
Universal gotchas (CRLF, cgroup OOM, silent-sync, HF stalls, zombie VRAM, GPU-0%-util, scp-resets,
egress-surcharge) live in references/gotchas_universal.md — not repeated here.
- VAST1: surprise bill on a "stopped" instance. Symptom: a stopped, idle instance keeps charging for days, even past a negative balance. → Root cause: `stop` halts compute only; disk bills forever while stopped, and bandwidth bills in every state. → Fix: to stop the meter, `destroy` (after copy-out per §5); never leave an instance merely stopped to "save money."
- VAST2: results gone after teardown. Symptom: `destroy` run, checkpoints irrecoverable. → Root cause: `destroy` permanently nukes container disk and there's no platform-wide FS to fall back on. → Fix: `vastai copy` out (or `vastai cloud copy` to a bucket) and check its exit status BEFORE `destroy`; gate the success line on the copy result, never on a log claim.
- VAST3: paused/stopped instance stuck in `Scheduling` though the machine shows "available." Symptom: preempted or stopped run never resumes; the portal still lists the same machine as rentable. → Root cause: the instance is bound to a specific GPU ID (not the machine); if that GPU was re-rented, it waits indefinitely while other GPUs on the host stay free. "If stuck >30 s, GPU likely rented by another user." → Fix: stop the scheduling attempt, create a NEW instance on the same host and re-attach the same Volume (works because other GPUs are free), or re-launch on a different offer from an off-box checkpoint; don't wait for the same GPU to come back (verified vast-ai.crisp.help + manage-instances 2026-06).
- VAST4: job dies mid-step with no warning. Symptom: interruptible run vanishes abruptly. → Root cause: bid preemption with ~0 s notice and no SIGTERM; a flush handler never fires. → Fix: periodic checkpoint to disk on a Young/Daly timer + load-latest-on-resume; poll `actual_status` / `status_msg` and restart (§4, `references/spot-resilience.md`). The bid can't be raised on a live instance; re-launch elsewhere if the GPU is gone.
- VAST5: CUDA driver mismatch on a fresh box. Symptom: `torch.cuda.is_available()` is False / driver mismatch error. → Root cause: CUDA ships in the Docker image, not the host; the image's CUDA may be newer than the host driver supports (image CUDA must be ≤ host driver). → Fix: pick an image whose CUDA ≤ host driver; verify `nvidia-smi` / `nvcc` inside the container in `onstart` before training (general triangle: `gotchas_universal.md` U28).
- VAST6: a service is unreachable on its "own" port. Symptom: TB/Jupyter/API not reachable at the internal port. → Root cause: internal ports map to random external ports and there's a 64-port cap per instance. → Fix: open ports with `-p` at create, discover the external mapping at runtime (`vastai show instance` / IP Port Info pop-up), never hard-code a port.
- VAST7: host vanishes mid-run. Symptom: instance flips to `Offline`, work lost. → Root cause: it's a marketplace, so an unverified/low-reliability host can disconnect. → Fix: filter offers on `verified=true`, high `reliability`, and `direct_port_count>=1`; treat any single host as disposable and checkpoint off-box accordingly.
- VAST8: bulk `scp` over the default SSH stalls / crawls. Symptom: a multi-GB result copy over the default endpoint hangs or runs at a trickle. → Root cause: the default is proxy SSH, throttled and recommended only for <1 GB; large transfers need direct-TCP. → Fix: create with `--direct` (offer must have `direct_port_count>=1`) and use that endpoint for `scp` / `vastai copy`; for big inbound data prefer `wget` / `curl` from a bucket (verified data-movement docs 2026-06).
- VAST9: bandwidth shows up on the bill. Symptom: a transfer-heavy job costs more than the GPU-hours alone. → Root cause: bandwidth is host-priced and metered per byte in both directions, in every state; some offers are not $0-egress. → Fix: read the per-offer bandwidth rate before committing; pull a dataset once to durable local/Volume, not per-epoch from a remote bucket (general form: `gotchas_universal.md` U14/U15).
- VAST10: disk full, and you can't grow it. Symptom: `No space left on device` mid-run; `--disk` can't be raised. → Root cause: container disk is fixed at create (min 10 GB) and non-resizable; Docker layers + HF cache + checkpoints overrun it. → Fix: over-provision `--disk` at create; redirect `HF_HOME` onto the data disk; prune `latest` / periodic checkpoints, keep only `best` (inode/byte audit: `gotchas_universal.md` U6/U7).
- VAST11: secret baked into the image or onstart-cmd is recoverable. Symptom: a key embedded at build time or in `--onstart-cmd` is stored by the platform. → Root cause: image layers and the 16 KB onstart string are persisted server-side. → Fix: inject `WANDB_API_KEY` / `HF_TOKEN` via env vars at create, never baked into image layers or `--onstart-cmd`; stream creds via stdin at runtime (`gotchas_universal.md` U34).
- VAST12: assuming a cross-machine Network Volume exists. Symptom: a plan relies on a Volume following the job to a different host. → Root cause: Volumes are machine-locked; cross-machine Network Volumes are not in the current storage docs. → Fix: design for off-box durability (`vastai cloud copy` to a bucket), not a portable volume; only same-machine re-attach is reliable.
- VAST13: instance stuck in `Loading`, never reaches `running`. Symptom: a new instance sits in `Loading` / `Connecting` for many minutes. → Root cause: the host is pulling a large Docker image (boot is 1–5 min, longer for fat images) or the host link is slow. → Fix: wait out the documented window, then read `vastai show logs <id>` (below) for the pull progress; if still stuck, `destroy` and re-create on a faster offer with a slimmer image.
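
The VAST5 rule (image CUDA must be ≤ what the host driver supports) can be checked mechanically. The sketch below compares major.minor as integers so that 12.10 sorts above 12.9; `cuda_tuple` and `image_fits_driver` are hypothetical helpers, and the inputs are assumed to come from `torch.version.cuda` inside the image and the `CUDA Version:` field that `nvidia-smi` prints.

```python
import re
from typing import Tuple


def cuda_tuple(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like '12.4' or 'CUDA Version: 12.4'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"no CUDA version found in {version!r}")
    return int(match.group(1)), int(match.group(2))


def image_fits_driver(image_cuda: str, driver_cuda_max: str) -> bool:
    """True when the image's CUDA is no newer than the driver's maximum."""
    return cuda_tuple(image_cuda) <= cuda_tuple(driver_cuda_max)
```

Running this in `onstart` before training turns a late `torch.cuda.is_available()` surprise into an early, explicit failure.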
### Platform-specific debugging (commands + what to check)
- Read the boot/container/system logs from off-box: `vastai show logs <id> --tail 200 [--filter <grep>] [--daemon-logs]` uploads container logs (and, with `--daemon-logs`, host/system logs) to a generated URL. This is the first stop for a box that won't connect, a stuck `Loading`, or a silent `onstart` failure (verified docs.vast.ai/api-reference/instances/show-logs 2026-06). The GUI equivalent is the "Logs" button on the instance card.
- Inspect the live state machine without SSH: `vastai show instance <id>` (or the API); compare `actual_status` (where the container is), `intended_status` (where it should be), `cur_state` (contract/hardware allocation) and `status_msg`. `intended=running` but `actual≠running` + `Scheduling` ⇒ VAST3 (GPU-bound wait); `Offline` ⇒ VAST7 (host gone).
- Confirm the GPU is really attached: in `onstart` / first SSH run `nvidia-smi` and `python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"`; `False` / CPU-only ⇒ VAST5 (image CUDA > host driver) or no-GPU container (`gotchas_universal.md` U31).
- Detect a stuck download inside the box: `du -sh ~/.cache/huggingface/hub` over time (no growth = stalled HF pull), `df -h /` (filling = active download) and `df -i /` (inodes), then the resumable-download ladder in `gotchas_universal.md` (HF). A fat-image stall before SSH is visible only via `vastai show logs`.
- Find the real external ports / SSH target: `vastai show instance <id>` lists the port map and `vastai ssh-url <id>` prints the connection string; never assume port 22 is reachable (VAST6).
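
The state-machine reading above can be sketched as a small classifier over the instance record. The field names (`actual_status`, `intended_status`, `status_msg`) come from the API description in §4; the literal values matched here ("running", "scheduling", "offline") are assumptions about how those states are spelled, and `diagnose` is a hypothetical helper.

```python
def diagnose(instance: dict) -> str:
    """Map an instance record to 'ok', 'stopped', 'not-running', 'VAST3' or 'VAST7'."""
    actual = str(instance.get("actual_status") or "").lower()
    intended = str(instance.get("intended_status") or "").lower()
    # Assumption: the state word may appear in either the status or the message.
    text = f"{actual} {instance.get('status_msg') or ''}".lower()
    if "offline" in text:
        return "VAST7"  # host gone
    if intended == "running" and actual != "running":
        # Wanted to run but is not running: GPU-bound wait if it says scheduling.
        return "VAST3" if "scheduling" in text else "not-running"
    return "ok" if actual == "running" else "stopped"
```

An orchestrator can branch on the returned label: relaunch elsewhere on `VAST3` / `VAST7`, keep waiting (with a bounded timeout) on `not-running`.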
## 8. SCRIPT OVERRIDES
Values to parameterize the scripts/ templates for vast.ai:
```bash
# DATA_DIR: data + (only) checkpoint mount; NOTHING survives destroy, so durable = off-box copy-out/cloud-sync
DATA_DIR=/workspace        # container disk; survives stop, bills forever, GONE on destroy
DURABLE_DIR=off-box        # no destroy-surviving mount: vastai copy / vastai cloud copy before destroy (§5)
# PROXY_HOOK: none at platform level (no /etc/network_turbo). HF mirror is the JOB's own env if needed:
PROXY_HOOK=''              # set HF_ENDPOINT=https://hf-mirror.com in the job env only if a host can't reach HF
# CRED_FILE: empty, because vast's key is the VAST_API_KEY env var, not a file. WANDB_API_KEY/HF_TOKEN also arrive via env.
CRED_FILE=""               # no cred FILE on disk → run_one's [ -n "$CRED_FILE" ] guard skips the cat; VAST_API_KEY + WANDB_API_KEY/HF_TOKEN injected via env at create, NOT into the image or onstart-cmd
# SCRATCH: what to prune (disk is fixed-size, non-resizable → prune aggressively)
SCRATCH='latest.pth periodic-*.pth *.tmp ~/.cache/huggingface/hub/blobs'  # keep only best + tiny eval JSONs
# HF_HOME: redirect cache off the small root onto the data disk
HF_HOME=/workspace/.cache/huggingface
# DETACH: durable relaunch is onstart.sh, NOT tmux (tmux dies on container restart/spot-resume)
DETACH='/root/onstart.sh'  # re-runs on every container start; tmux only for an attached SSH session
```
Secrets note: inject WANDB_API_KEY / HF_TOKEN via env vars at create, never baked into the Docker
image layers or the 16 KB --onstart-cmd (both are stored by the platform — VAST11).
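
How the `SCRATCH` patterns select files can be sketched with shell-style matching. This covers only the three filename patterns from `SCRATCH` (the cache directory entry is left out), and `files_to_prune` is a hypothetical helper that returns names without deleting anything.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence

# Filename patterns taken from the SCRATCH value; the HF blobs path is omitted here.
SCRATCH_PATTERNS = ("latest.pth", "periodic-*.pth", "*.tmp")


def files_to_prune(names: Iterable[str],
                   patterns: Sequence[str] = SCRATCH_PATTERNS) -> List[str]:
    """Return the names matching any scratch pattern, sorted for stable output."""
    return sorted(n for n in names if any(fnmatchcase(n, p) for p in patterns))
```

Anything not matched, such as `best.pth` and the small eval JSONs, is left alone, which is the "keep only best" rule in list form.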