| platform | kind | meter_stop_verb | meter_stop_irreversible | detach_primitive | spot_available | spot_grace | shared_fs | inode_cap | free_egress | china_mirror_needed | host_driver_cuda_max | local_nvme |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| paperspace | cloud-api | shut-down | false | tmux | false | n/a | true | none | true | false | host-dependent | host-dependent |
Paperspace (DigitalOcean) — platform profile
One-line purpose: substrate for running detached GPU jobs on Paperspace Gradient (managed Jupyter
notebooks/deployments) and Paperspace Core (raw Linux VMs, "Machines") — what stops the meter, what
survives a stop vs a destroy, and the auto-shutdown clock that ends every long run. Universal gotchas are
NOT repeated here — see references/gotchas_universal.md.
Surface to the user up front (principle #10): ⚠️ Danger clocks — an auto-shutdown timer ends every Notebook/Core run (set it consciously; Gradient free notebooks hard-cap at 6 h); snapshots / block storage keep billing after a machine is destroyed (orphan bleed). Heads-up — the Gradient CLI/API was deprecated 15 Jul 2024 (pin
gradient<3.0; the three-CLI mess, §1).
To jump: grep -in '<keyword>' profiles/paperspace.md.
Table of contents
- LAUNCH — Gradient vs Core, the env contract, the three-CLI mess
- STORAGE MODEL — survival matrix, the stop-keeps-disk rule, pip-doesn't-persist
- NETWORK — public IP (static vs dynamic), ports, SSH flavor
- SPOT / INTERRUPTION + RESUME — the auto-shutdown clock, not spot
- TEARDOWN / BILLING — what actually stops the meter (the trap)
- DAEMON TOOL — tmux on Core; why Notebooks resist a daemon
- TOP GOTCHAS — PS1–PS13, platform-pinned + platform-specific debugging
- SCRIPT OVERRIDES — values for the scripts/ templates
1. LAUNCH
Two product families, with opposite operating models:
- Gradient — the managed layer. Notebooks are a web Jupyter IDE on a shared persistent store; Deployments serve a container behind a REST endpoint (bring a Docker image `<user>/img:tag`); Workflows run GPU-backed DAG automation. Entry: web console, the CLI/SDK, or REST.
- Core / Machines — raw Linux/Windows VMs with a persistent block disk, full root/SSH. OS templates include ML-in-a-Box (preinstalled CUDA + PyTorch/TensorFlow/RAPIDS/Jupyter; terminal/SSH-only, home `/home/paperspace`, shell `/bin/bash`). Ubuntu 22.04 is required for H100 and recommended for A100; Ubuntu 20.04 is recommended for any other machine type (verified github.com/Paperspace/ml-in-a-box README + DO machines docs 2026-06). This is the family that maps cleanly onto the AutoDL tmux-resilient-training pattern.
Env contract. The chosen image/template IS the Python env — do NOT conda create on a rental
(principle: the prebuilt base is the env). On Core, run inside ML-in-a-Box directly; on Gradient
Deployments, the env is the Docker image specified at create time. Because a destroy wipes the box, the
durable analog of the env is a Docker image plus a requirements.txt/lock file kept off-box, so a recreate
reproduces it. On Notebooks, a plain pip install does NOT survive a restart (writes to
/usr/local/lib, ephemeral) — see §2 / PS3.
The three-CLI mess (gates ALL automation). The tooling fragmented across the DigitalOcean acquisition; the draft's "migrate to the current API/CLI" understates the trap (verified github.com/Paperspace 2026-06):
- The legacy Gradient REST API endpoints were deprecated 15 Jul 2024 — stale calls 404 or no-op.
- `gradient-cli` v2 is deprecated; pin `pip install "gradient<3.0"` only to keep old scripts alive.
- `gradient-python` (github.com/digitalocean/gradient-python) is NOT the orchestration CLI — it is the new DigitalOcean Gradient AI / GenAI inference SDK. Name collision — do not install it expecting notebook/machine control.
- The recommended tool for new work is the streamlined `pspace` CLI (github.com/Paperspace/cli, releases ongoing into 2026; e.g. `pspace public-ip release <ip>`). Pin and verify the CLI binary + version in any automation; do not assume `gradient` ⇒ `pspace` command parity.
→ verify: ssh <core-alias> 'python -c "import torch;print(torch.cuda.is_available())"' on Core, or a
print(torch.cuda.is_available()) cell in a Notebook.
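The "pin and verify the CLI binary + version" step can be automated with a small pre-flight check. The following is a minimal sketch, not a Paperspace-provided tool: `check_cli` is a hypothetical helper, and the version arguments and expected string are passed in by the caller because the exact version flag of any given CLI is an assumption to confirm against that CLI's own help output.

```python
import shutil
import subprocess
import sys


def check_cli(binary: str, version_args: list[str], expected_substring: str) -> str:
    """Return the tool's version output, or raise if it is missing or off-pin.

    `version_args` (for example ["--version"]) is supplied by the caller:
    this sketch does not assume which flag a particular CLI supports.
    """
    path = shutil.which(binary)
    if path is None:
        raise FileNotFoundError(f"{binary!r} is not on PATH")
    result = subprocess.run(
        [path, *version_args], capture_output=True, text=True, timeout=30
    )
    output = (result.stdout + result.stderr).strip()
    if result.returncode != 0:
        raise RuntimeError(f"{binary} exited with {result.returncode}: {output}")
    if expected_substring not in output:
        raise RuntimeError(
            f"{binary} version mismatch: wanted {expected_substring!r}, got {output!r}"
        )
    return output


if __name__ == "__main__":
    # Demonstrated against the running interpreter, which is always present.
    print(check_cli(sys.executable, ["--version"], "Python"))
```

Failing fast here keeps an automation run from silently driving the wrong tool, which is the failure mode the name collision creates.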
2. STORAGE MODEL (survival matrix — principle #4)
The defining fact: a stop/shut-down keeps the disk — Paperspace is one of the few profiles here that behaves like AutoDL's 关机 (shut down) in this respect. Only destroy/delete removes storage.
Gradient Notebooks — /storage and /notebooks are separate branches from /, NOT nested
(verified DO notebooks/details/storage-architecture 2026-06):
- `/storage` — shared persistent, team-wide, scoped to a storage region/cluster. Survives stop. (Team-shared ⇒ never write secrets here — see §7 / `references/gotchas_universal.md`.)
- `/notebooks` — per-notebook persistent, managed via the console File Manager. Survives stop.
- everything else — ephemeral workspace (incl. `/usr/local/lib` where `pip` lands), wiped on stop.
Core machines — block storage 50 GB–2 TB, persists across a stop; expansion is one-way
("increasing block storage expands the filesystem and is not reversible"). Region-locked: storage and
custom templates must be used in the same datacenter. Snapshots are a separate billed resource
($0.29/GB/mo, default policy is "Never" / 0 stored — they bill only if manually enabled, and a
snapshot survives a machine destroy, so an orphaned snapshot keeps charging — see PS9).
| Tier | Path | Survives STOP? | Survives DESTROY/DELETE? | Cap / note |
|---|---|---|---|---|
| Notebook shared persistent | `/storage` | yes | yes (separate resource) | team-shared per region/cluster; billed until deleted |
| Notebook per-notebook | `/notebooks` | yes | no (dies with the notebook) | per-notebook persistent; console File Manager |
| Notebook workspace | everything else (incl. `/usr/local/lib`) | no | no | ephemeral; wiped on stop; pip lands here |
| Core block storage | machine root + block vol | yes | no | 50 GB–2 TB; expansion irreversible; region-locked |
| Core snapshot | (separate resource) | yes | yes (orphan-bills!) | $0.29/GB/mo; default policy Never/0; survives machine destroy |
Mount checkpoints MUST go to (for the §5 teardown verb): on Notebooks, /storage (cross-stop,
cross-delete-of-the-notebook) — /notebooks dies if the notebook itself is deleted. On Core, the block
disk survives a stop, but a destroy wipes it, so the Iron-Law pull-to-local before destroy still applies.
No documented inode cap on either tier; still monitor df -i (universal, U7 / principle #5).
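The Notebook rows of the survival matrix can be encoded as a lookup so a training script refuses to write its only checkpoint copy to an ephemeral path. This is a sketch under stated assumptions: `notebook_tier` and `survives` are hypothetical helper names, the classification is purely lexical (symlinks and bind mounts are not resolved), and it covers Gradient Notebooks only, not Core.

```python
import posixpath

# Survival matrix for Gradient Notebooks, transcribed from the table above.
SURVIVAL = {
    "/storage": {"stop": True, "delete": True},
    "/notebooks": {"stop": True, "delete": False},
    "ephemeral": {"stop": False, "delete": False},
}


def notebook_tier(path: str) -> str:
    """Classify an absolute path as '/storage', '/notebooks', or 'ephemeral'."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    # Collapse '..' and '.' so '/storage/../tmp' is not mistaken for durable.
    normalized = posixpath.normpath(path)
    for root in ("/storage", "/notebooks"):
        if normalized == root or normalized.startswith(root + "/"):
            return root
    return "ephemeral"


def survives(path: str, event: str) -> bool:
    """event is 'stop' (shut-down) or 'delete' (the notebook itself is deleted)."""
    return SURVIVAL[notebook_tier(path)][event]


if __name__ == "__main__":
    for candidate in ("/storage/ckpts/run1", "/notebooks/out", "/usr/local/lib/python3"):
        print(candidate, notebook_tier(candidate), survives(candidate, "delete"))
```

A lexical check is cheap enough to run at startup; confirming the actual mounts with `df -h /storage /notebooks` remains the authoritative test.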
3. NETWORK
- Egress. Direct and unproxied to HF/GitHub/PyPI; no `network_turbo`-style accelerator and no documented egress fee. China-mirror relevance is N/A as a platform feature — relevant only when operating from inside China and supplying a private mirror (then `references/china-network.md`).
- Public IP. Core machines are reached by public IP, of two kinds (verified DO machines/how-to/manage-public-ips 2026-06):
  - Static — "the same IP address every time it powers on … remains in your account until you delete it." Use it to pin stable SSH/endpoint addressing. Billed until deleted — including while the machine is powered off (see §5 / PS6). API/CLI can create/release a static IP but cannot add a dynamic IP to an existing machine — dynamic must be requested at machine-creation time.
  - Dynamic — "assigned automatically when a machine powers on and deleted when it powers off"; a new IP on every start, so a hard-coded SSH alias breaks after a restart. Charged only while the machine runs (auto-released on power-off → no idle IP cost).

  A machine with no public IP is internet-isolated (and avoids the IP charge). Private networks give team-isolated pools.
- Ports / services. Firewall is self-managed — open ports to expose services. Tunnel Jupyter (8888) / TensorBoard (6006) over SSH on Core: `ssh -L 8888:localhost:8888 -L 6006:localhost:6006 paperspace@<machine-ip>` (placeholder host — substitute the machine's real IP/static address). In a Gradient Notebook, launch TensorBoard in-Jupyter and write logs under `/storage` (or they vanish on stop).
- SSH flavor. Core = a standard Linux VM → full `ssh`/`scp`/`rsync` (ML-in-a-Box default user `paperspace`). Gradient Notebooks expose a Jupyter sandbox, not a clean persistent SSH daemon — there is no stable SSH-daemon story for a multi-day unattended run on a Notebook.
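Because a dynamic IP changes on every start, it can help to build the tunnel command from the current address rather than hard-coding it. The sketch below only assembles the argument list; `tunnel_command` is a hypothetical helper, the `-N` flag (no remote command) is an addition not present in the command above, and the address shown is an RFC 5737 documentation placeholder, not a real machine.

```python
import shlex


def tunnel_command(user: str, host: str, ports: list[int]) -> list[str]:
    """Build an ssh argv that forwards each local port to the same remote port."""
    argv = ["ssh", "-N"]  # -N: forward ports only, do not run a remote command
    for port in ports:
        if not 0 < port < 65536:
            raise ValueError(f"invalid port: {port}")
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(f"{user}@{host}")
    return argv


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only address; substitute the machine's IP.
    argv = tunnel_command("paperspace", "203.0.113.10", [8888, 6006])
    print(shlex.join(argv))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting concerns.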
4. SPOT / INTERRUPTION + RESUME (principle #7/#8)
No AWS-style spot/preemptible tier with a 2-minute interruption warning. The two interruption modes are different in kind and BOTH are deterministic, not random eviction:
- Capacity-at-launch. The desired GPU type may be unavailable when launching — a launch-time availability problem, not a runtime eviction. On free notebooks this surfaces as "out of capacity" / the notebook sits "pending" in queue for the next free machine (verified DO notebooks/how-to docs 2026-06). Build retry-launch-until-available logic, not a 2-minute-grace flush handler; for assured access, a paid instance type bypasses the free queue.
- Auto-shutdown clock — the hard ceiling on any long run. The timer is the real killer:
  - Gradient free notebooks hard-stop at a 6-hour maximum auto-shutdown (cannot be raised).
  - Paid notebooks default to 12-hour auto-shutdown; range 1 hour – 1 week.
  - Core machines allow a configurable 1 hour – 1 week auto-shutdown.
  - Trap (Core/Linux): Core Linux auto-shutdown is wall-clock, not idle-based — "Linux machines shut down regardless of whether any users are connected" (only Windows waits for idle). An active SSH/tmux session does not keep a Linux Core machine alive past the timer (verified DO machines/how-to/manage-auto-shutdown 2026-06).
  - Trap (API): auto-shutdown cannot be enabled/disabled via API or CLI on an existing machine — "you can only manage the auto-shutdown feature via the Paperspace console" (same source). Set it deliberately at create time / in the console.

  The window is deterministic, so plan around it: a tmux session inside a Notebook still dies at the timeout (§6). Resume hook: checkpoint full state to `/storage` (Notebooks) or the block disk (Core) before the auto-shutdown window, then restart and load-latest-on-startup unconditionally. Because the clock is known in advance, cadence can be planned rather than guessed — but the load-latest-on-startup spine (principle #8) is what makes the restart idempotent. Young/Daly cadence formula → `references/spot-resilience.md`.
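The load-latest-on-startup spine can be sketched as two functions: an atomic save and an unconditional load of the newest readable checkpoint. This is a minimal illustration, not the profile's actual training code: JSON stands in for `torch.save`/`torch.load`, and the `step_<n>.json` naming, `save_checkpoint`, and `load_latest` are all hypothetical.

```python
import json
import os
import re
import tempfile
from pathlib import Path

CKPT_PATTERN = re.compile(r"^step_(\d+)\.json$")


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write to a temp file, fsync, then rename, so a shutdown mid-write
    never leaves a half-written file under the final name."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final = ckpt_dir / f"step_{step:08d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, final)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final


def load_latest(ckpt_dir: Path) -> tuple[int, dict]:
    """Return (step, state) from the newest readable checkpoint,
    or (0, {}) when there is nothing to resume from."""
    candidates = []
    if ckpt_dir.is_dir():
        for entry in ckpt_dir.iterdir():
            match = CKPT_PATTERN.match(entry.name)
            if match:
                candidates.append((int(match.group(1)), entry))
    for step, path in sorted(candidates, reverse=True):
        try:
            with path.open() as fh:
                return step, json.load(fh)
        except (OSError, json.JSONDecodeError):
            continue  # skip a corrupt file and fall back to the previous one
    return 0, {}
```

Calling `load_latest` on every start, including the very first one, is what makes the restart idempotent: a fresh run and a resumed run take the same code path.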
5. TEARDOWN / BILLING (principle #9 + the Iron Law — the most error-prone section)
Per-hour billing (verified DO products/paperspace/pricing 2026-06). A shut-down/power-off STOPS the compute (GPU) meter while disk persists — this is the AutoDL-like part. But it does NOT stop every meter.
- What a stop still bills (the trap): "When a Paperspace machine is powered off, attached storage, public IP addresses, and other add-ons continue to be billed on an hourly basis until you destroy those resources." Gradient `/storage` over the plan allowance and Core block storage both keep charging while the machine is off.
- The monthly-cap softener (new fact): non-GPU resources (storage, public IP, snapshots) have a maximum monthly charge — "once a non-GPU resource reaches its monthly maximum, it no longer incurs charges for the rest of the billing cycle." Static public IP caps at $3.00/mo ($0.0045/hr). So a forgotten static IP is a bounded ~$3/mo bleed, but a forgotten 2 TB block volume is ~$120/mo until destroyed (verified DO pricing 2026-06).
- What actually stops the full meter: destroy the machine AND release the static IP AND delete the storage (AND delete any snapshot) — separate actions. "To stop all charges for a machine and its add-ons, destroy the machine and any resources you no longer need." A stopped-but-not-destroyed machine with a Static IP, a 2 TB block volume, and a leftover snapshot is still spending money.
- Irreversible: destroy/delete of a machine removes its block storage (no recovery); block-storage expansion is also one-way. A shut-down is reversible (resume later).
Net contrast vs the other profiles: Paperspace gives a real idle-cheap stop (unlike Lambda, which has no stop), but unlike AutoDL's 关机 the storage + IP + snapshots keep billing until each is explicitly destroyed/released. "Stopped" ≠ "free."
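The hourly-rate-plus-monthly-cap rule is easy to misjudge, so a small worked calculation may help. The sketch below uses only the figures quoted in this profile (static IP at $0.0045/hr capped at $3.00/mo, snapshots at $0.29/GB/mo); the function names are hypothetical, and the snapshot helper ignores any per-resource monthly maximum because no cap value is given here.

```python
STATIC_IP_HOURLY_USD = 0.0045      # figure quoted in this profile
STATIC_IP_MONTHLY_CAP_USD = 3.00   # figure quoted in this profile
SNAPSHOT_USD_PER_GB_MONTH = 0.29   # figure quoted in this profile


def capped_charge(hourly_rate: float, hours: float, monthly_cap: float) -> float:
    """Charge for one billing cycle: accrues hourly, stops at the monthly cap."""
    if hourly_rate < 0 or hours < 0 or monthly_cap < 0:
        raise ValueError("rates, hours, and caps must be non-negative")
    return min(hourly_rate * hours, monthly_cap)


def hours_to_cap(hourly_rate: float, monthly_cap: float) -> float:
    """Hours of accrual after which the cap is reached."""
    return monthly_cap / hourly_rate


def snapshot_monthly_usd(size_gb: float) -> float:
    """Uncapped monthly snapshot cost at the quoted per-GB rate."""
    return SNAPSHOT_USD_PER_GB_MONTH * size_gb


if __name__ == "__main__":
    week = capped_charge(STATIC_IP_HOURLY_USD, 7 * 24, STATIC_IP_MONTHLY_CAP_USD)
    print(f"static IP, one idle week: ${week:.2f}")
    print(f"hours until the IP cap: {hours_to_cap(STATIC_IP_HOURLY_USD, STATIC_IP_MONTHLY_CAP_USD):.0f}")
    print(f"100 GB orphaned snapshot: ${snapshot_monthly_usd(100):.2f}/mo")
```

The cap bounds the IP bleed, but storage scales with size, which is why the audit in PS1/PS9 matters more for volumes and snapshots than for the IP.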
Iron Law (teardown gate): NO destroy/delete of the machine, release of the IP, or deletion of `/storage`/block-storage/snapshot until checkpoints are pulled to local AND verified by load, and the user has explicitly approved the specific cost-affecting action. A destroy is irreversible — "it looked done in the log" is not evidence (principle #3). General form → `superpowers:verification-before-completion`.
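The gate can be written as a single predicate that must return true before any destructive call is issued. This is a sketch under assumptions: `teardown_allowed` is a hypothetical name, `remote_sha256` is assumed to have been computed on the machine before the pull (for example with `sha256sum`), and the `load` callable stands in for whatever actually deserializes the checkpoint (such as `torch.load`).

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def teardown_allowed(
    local_copy: Path,
    remote_sha256: str,
    load: Callable[[Path], object],
    user_approved: bool,
) -> bool:
    """True only if the user approved, the local copy matches the remote
    digest, and the local copy actually loads."""
    if not user_approved:
        return False
    if not local_copy.is_file():
        return False
    if sha256_of(local_copy) != remote_sha256.strip().lower():
        return False
    try:
        load(local_copy)  # "verified by load": a matching hash is not enough
    except Exception:
        return False
    return True
```

Checking both the digest and the load catches two different failures: a truncated transfer, and a file that transferred intact but was already corrupt on the machine.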
6. DAEMON TOOL
- Core machines — full VMs ⇒ `tmux`/`screen`/`nohup` all available; SSH is as stable as any cloud VM. This is the closest analog to the AutoDL tmux-resilient pattern. tmux survives an SSH drop; it does NOT survive a machine stop/restart (the process is gone), and — critically on Core/Linux — a live tmux session does not defer the wall-clock auto-shutdown (§4), so durability still rests on checkpoint-to-disk + load-latest (principle #8), not on the detach primitive.
- Gradient Notebooks — a managed Jupyter sandbox: no clean persistent SSH-daemon story, and the auto-shutdown timer is a hard ceiling — a tmux session started inside a Notebook still dies at the timeout. Notebooks are not built for unattended multi-day daemons.
- Platform-native long-job mechanisms — Workflows (DAG automation) and Deployments (always-on serving). For training-as-a-daemon, prefer Core + tmux; treat Notebooks as interactive/short-run only.
If tmux is absent on a minimal image, fall back to nohup <cmd> </dev/null >log 2>&1 &.
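When the launcher is itself a Python script, the `nohup` fallback has a rough standard-library equivalent. The sketch below is an illustration for POSIX systems only (`start_new_session` has no effect on Windows); `launch_detached` is a hypothetical helper, and like `nohup` it protects against an SSH drop only, not a machine stop or the auto-shutdown clock.

```python
import subprocess
import sys
from pathlib import Path


def launch_detached(argv: list[str], log_path: Path) -> subprocess.Popen:
    """Start argv in its own session with stdin closed and output appended
    to log_path, roughly mirroring `nohup <cmd> </dev/null >log 2>&1 &`."""
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,  # merge stderr into the same log
            start_new_session=True,    # setsid(): detach from the SSH terminal
        )
    finally:
        log.close()  # the child holds its own copy of the descriptor
    return proc


if __name__ == "__main__":
    child = launch_detached(
        [sys.executable, "-c", "print('training started')"], Path("train.log")
    )
    print(f"started pid {child.pid}; tail train.log to follow progress")
```

Returning the `Popen` object lets a short-lived caller ignore it while a supervising script can still poll or wait on the child.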
7. TOP GOTCHAS (platform-pinned; universal ones → references/gotchas_universal.md)
- PS1 — "Stopped the machine, still getting billed." Symptom: GPU meter halted but the bill keeps climbing while the box is off. Root cause: shut-down stops only the compute meter; attached storage + public IP + add-ons + snapshots bill hourly until destroyed/released (verified DO pricing 2026-06). Fix: to truly stop the meter, destroy the machine, release the Static IP, delete the storage and any snapshot — separate teardown actions. Audit for orphaned storage/IPs/snapshots after every stop.
- PS2 — A long run dies at a round-number wall-clock with no error. Symptom: training vanishes at exactly 6 h / 12 h (or the configured Core window); no traceback. Root cause: the auto-shutdown clock, not a crash — free notebooks 6 h (hard cap), paid notebooks 12 h default, Core 1 h–1 wk. On Core/Linux the clock is wall-clock, not idle — an active SSH/tmux session does NOT extend it (verified DO manage-auto-shutdown 2026-06). Fix: checkpoint to `/storage` (Notebooks) or the block disk (Core) before the window; for Core, raise the auto-shutdown to the longest needed in the console (API/CLI cannot change it post-create); restart + load-latest to resume.
- PS3 — `pip install` (or any non-`/storage` write) vanishes after a Notebook restart. Symptom: packages installed in-session are gone next session; "saved" files disappear after stop/restart. Root cause: `pip` writes to `/usr/local/lib`, which is ephemeral workspace — only `/storage` and `/notebooks` persist (verified fast.ai forum + DO storage-architecture 2026-06). "Machines are snapshots, not servers," so in-session installs do not persist. Fix: install into a persisted dir — `pip install --user` (lands in the home dir under a persisted tree) or `pip install --target /storage/pyenv && export PYTHONPATH=/storage/pyenv`; write all checkpoints/logs/outputs under `/storage`; verify they landed (`ls`/checksum) before stop.
- PS4 — Automation 404s / silently no-ops / installs the wrong SDK. Symptom: a `gradient`-era create/stop call fails or does nothing; or `pip install gradient` (v3+) imports an inference SDK with no notebook/machine commands. Root cause: legacy Gradient REST endpoints deprecated 15 Jul 2024; `gradient-cli` v2 deprecated; `gradient-python` v3 is the DigitalOcean Gradient AI inference SDK — a name collision, not the orchestration CLI (verified github.com/Paperspace/gradient-cli + digitalocean/gradient-python 2026-06). Fix: for new work use the `pspace` CLI (github.com/Paperspace/cli); to keep old scripts alive pin `pip install "gradient<3.0"`. Pin and verify the CLI binary + version in any automation.
- PS5 — Custom template / storage / volume "not found" in a different datacenter. Symptom: a saved template or block volume is unavailable when launching elsewhere; block-storage resize can't be undone. Root cause: storage and templates are region/DC-locked, and block-storage expansion is irreversible (one-way filesystem grow). Fix: pick the datacenter deliberately and keep storage+compute+template co-located; size block storage with headroom up-front (cannot shrink).
- PS6 — SSH alias breaks after every restart. Symptom: the saved `ssh` host no longer connects after a machine restart. Root cause: a Dynamic public IP is released on power-off and reassigned on start (new IP each time). Fix: attach a Static IP for stable SSH/endpoint addressing (it bills until deleted, capped $3/mo — PS1), or re-resolve the address on each start before scripting. Note: API/CLI can manage a static IP but cannot add a dynamic one to an existing machine (request dynamic at create time).
- PS7 — Free-tier notebook code is PUBLIC by default. Symptom: proprietary/confidential code is world-readable in a Gradient free notebook. Root cause: free Gradient notebooks are public by default; private notebooks require a paid plan (verified Paperspace blog / pricing 2026-06). Fix: never put confidential code or any secret in a free notebook; upgrade to a paid plan for private notebooks. Treat the free tier as a public scratchpad. (Secrets hygiene → `references/gotchas_universal.md`.)
- PS8 — Free notebook won't start / sits "pending." Symptom: a free-GPU notebook stays pending or errors "out of capacity"; only one notebook will run. Root cause: free tier = 1 concurrent running notebook, ≤5 projects, 5 GB `/storage`, and free machines are pooled — a pending notebook is queued for the next free machine (verified Paperspace free-instances docs + blog 2026-06). Fix: expect queueing on free; stop the other free notebook (only one runs); for assured access use a paid instance type, which skips the free queue.
- PS9 — A destroyed machine keeps billing via a leftover snapshot. Symptom: machine destroyed, yet a small monthly charge persists. Root cause: snapshots are a separate resource that survives a machine destroy and bills at $0.29/GB/mo until deleted; auto-snapshot defaults to "Never"/0 but a manually-enabled policy (daily by default, up to 10 stored) silently accrues (verified DO pricing + blog/automated-snapshots 2026-06). Fix: when tearing down, delete the snapshot too (console or CLI); audit the snapshots list after every machine destroy. Capped per-resource by the monthly maximum but still a bleed.
- PS10 — Notebook upload/import fails on the 5 GB free cap. Symptom: uploading a multi-GB dataset to `/storage` fails for an unpaid account. Root cause: free `/storage` allowance is 5 GB; overage is $0.29/GB/mo (paid plans include more: e.g. 200 GB / 1 TB tiers) (verified Paperspace pricing + fast.ai forum 2026-06). Fix: stream/stage the dataset rather than uploading the whole thing, prune aggressively, or upgrade the plan; redirect HF/torch caches off `/storage` if they would push over the allowance.
- PS11 — ML-in-a-Box CUDA/driver too old for current PyTorch on a new-arch GPU. Symptom: `The NVIDIA driver on your system is too old (found version 110xx). Please update your GPU driver`, or `no kernel image is available for execution` on a fresh card. Root cause: the template's host driver/CUDA stack lags newer PyTorch wheels; on a rental the host driver is host-global and a tenant usually cannot upgrade it (verified github.com/Paperspace/ml-in-a-box issue #13 2026-06). This is the platform-pinned face of the universal CUDA-triangle (U28). Fix: install a torch build matching the box's CUDA (do not force-upgrade the host driver on a rental); pick a template whose Ubuntu/driver matches the GPU (22.04 for H100/A100). Full triangle → U28 in `references/gotchas_universal.md`.
- PS12 — Gradient Deployment / custom image won't pull or drifts. Symptom: a Deployment fails to pull `<user>/img:tag`, or "the same image" behaves differently over time. Root cause: a moving tag (`:latest`) resolves to a different layer set; private-registry creds missing. Fix: pin the image by digest (`@sha256:`) and supply registry creds as a Gradient secret, not inline. General form → U30 in `references/gotchas_universal.md`.
- PS13 — Platform-specific debugging. Commands + what to check (Core uses standard Linux tooling; the Notebook-only items are the platform delta):
  - Confirm GPU + driver/torch match: `nvidia-smi` (driver/CUDA version) then `python -c "import torch;print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"` — a mismatch here is PS11/U28, not a code bug.
  - Find what is eating the 5 GB / over-allowance `/storage` (the platform's own recommended cmd): `du -sch .[!.]* * | sort -h` (or `!du -sch …` in a cell); install `ncdu` for an interactive view (verified DO notebooks/how-to/manage-storage 2026-06). Check `df -h` AND `df -i` (inodes, U7).
  - Is a Notebook write durable? `df -h /storage /notebooks` and confirm the target is one of those two mounts — anything else (incl. `/usr/local/lib`) is ephemeral (PS3).
  - Why did the run vanish? Walk the universal ladder (U3): `dmesg | grep -iE 'killed process|out of memory'` (OOM?), `uptime` (recent reboot = auto-shutdown fired, PS2), `nvidia-smi` (GPU idle = died, not hung). A round-number `uptime`-near-window with a clean `dmesg` ⇒ auto-shutdown, not a crash.
  - Detect a stuck/slow download: watch the target file size grow (`watch -n5 'ls -l /storage/<file>'`); a flat size with a live process = stalled wire (U12 resumable loop). Egress is direct/unproxied here, so a stall is route/peer, not a missing proxy hook.
  - Audit orphaned billables before declaring teardown done: in the console (or `pspace`) list machines, public IPs, storage/volumes, and snapshots — PS1/PS9 hide in the last two.
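The stalled-download check in PS13 (a flat file size with a live process) can be scripted instead of eyeballed with `watch`. A minimal sketch follows: `is_stalled` is a hypothetical helper, the sample count and interval are arbitrary defaults, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import os
import time
from typing import Callable


def is_stalled(
    path: str,
    samples: int = 3,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the file size is identical across `samples` readings."""
    if samples < 2:
        raise ValueError("need at least two samples to compare")
    sizes = []
    for i in range(samples):
        # A file that does not exist yet is treated as size 0.
        sizes.append(os.path.getsize(path) if os.path.exists(path) else 0)
        if i < samples - 1:
            sleep(interval_s)
    return len(set(sizes)) == 1
```

A `True` result only says the bytes stopped arriving during the sampling window; whether the process is still alive has to be checked separately before restarting the resumable loop.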
8. SCRIPT OVERRIDES
Values to parameterize the scripts/ templates for Paperspace. Forward-slash paths; placeholders for any
host/IP (never a real address). Core and Gradient differ — both shown.
# --- Gradient Notebook ---
DATA_DIR=/storage # team-shared persistent; survives stop AND notebook delete
DURABLE_DIR=/storage # checkpoints land here (NOT /notebooks — dies with the notebook)
SCRATCH=/tmp # ephemeral workspace; wiped on stop — never the only copy
HF_HOME=/storage/.cache/huggingface # redirect cache off ephemeral workspace (watch the 5 GB free cap, PS10)
PROXY_HOOK= # none — direct egress (no network_turbo)
CRED_FILE="" # Paperspace keys are Gradient secrets / env vars, not files — WANDB_API_KEY/HF_TOKEN arrive via the secret/env (run_one's [ -n "$CRED_FILE" ] guard skips the file read); never write keys to /storage (team-shared)
DETACH= # no clean tmux; Jupyter kernel + hard 6h/12h auto-shutdown ceiling
# NOTE: pip into /storage to persist — pip install --target /storage/pyenv && export PYTHONPATH=/storage/pyenv (PS3)
# --- Core machine (preferred for daemonized training) ---
DATA_DIR=/path/to/blockstore # placeholder — the attached block disk mount
DURABLE_DIR=/path/to/blockstore/ckpts
SCRATCH=/tmp
HF_HOME=/path/to/blockstore/.cache/huggingface
PROXY_HOOK= # none
CRED_FILE="" # Paperspace keys are Gradient secrets / env vars, not files — WANDB_API_KEY/HF_TOKEN arrive via the secret/env (run_one's [ -n "$CRED_FILE" ] guard skips the file read); inject at launch, never inline
DETACH=tmux # survives SSH drop, NOT a machine stop, and NOT the wall-clock auto-shutdown — rely on checkpoint+resume
SSH_HOST=<machine-ip> # placeholder — ML-in-a-Box user is `paperspace`; pin a Static IP for a stable alias (PS6); dynamic IP changes every start
Reminder: secrets referenced by env-var NAME or Gradient secret only — never inline a key, and never write
one onto the team-shared /storage (universal secrets gotcha → references/gotchas_universal.md).
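On the Python side of a job, the "env-var NAME only" rule can be enforced with a small accessor that fails loudly when a secret is missing and never echoes its value. This is an illustrative sketch: `require_secret` and `secret_status` are hypothetical helpers, and the variable names are the ones mentioned in the overrides above.

```python
import os
from typing import Mapping


def require_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the secret stored under `name`; the error names the variable
    but never includes a value."""
    value = env.get(name, "")
    if not value:
        raise RuntimeError(
            f"{name} is not set; inject it as an env var or platform secret"
        )
    return value


def secret_status(names: list[str], env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Report which secrets are present without revealing them (safe to log)."""
    return {name: bool(env.get(name)) for name in names}


if __name__ == "__main__":
    print(secret_status(["WANDB_API_KEY", "HF_TOKEN"]))
```

Logging only presence, not content, keeps the pre-flight output safe to write under the team-shared `/storage`.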