platform: paperspace
kind: cloud-api
meter_stop_verb: shut-down
meter_stop_irreversible: false
detach_primitive: tmux
spot_available: false
spot_grace: n/a
shared_fs: true
inode_cap: none
free_egress: true
china_mirror_needed: false
host_driver_cuda_max: host-dependent
local_nvme: host-dependent

Paperspace (DigitalOcean) — platform profile

One-line purpose: substrate for running detached GPU jobs on Paperspace Gradient (managed Jupyter notebooks/deployments) and Paperspace Core (raw Linux VMs, "Machines") — what stops the meter, what survives a stop vs a destroy, and the auto-shutdown clock that ends every long run. Universal gotchas are NOT repeated here — see references/gotchas_universal.md.

Surface to the user up front (principle #10): ⚠️ Danger clocks — an auto-shutdown timer ends every Notebook/Core run (set it consciously; Gradient free notebooks hard-cap at 6 h); snapshots / block storage keep billing after a machine is destroyed (orphan bleed). Heads-up — the Gradient CLI/API was deprecated 15 Jul 2024 (pin gradient<3.0; the three-CLI mess, §1).

To jump: grep -in '<keyword>' profiles/paperspace.md.

Table of contents

  1. LAUNCH — Gradient vs Core, the env contract, the three-CLI mess
  2. STORAGE MODEL — survival matrix, the stop-keeps-disk rule, pip-doesn't-persist
  3. NETWORK — public IP (static vs dynamic), ports, SSH flavor
  4. SPOT / INTERRUPTION + RESUME — the auto-shutdown clock, not spot
  5. TEARDOWN / BILLING — what actually stops the meter (the trap)
  6. DAEMON TOOL — tmux on Core; why Notebooks resist a daemon
  7. TOP GOTCHAS: PS1 to PS13, platform-pinned + platform-specific debugging
  8. SCRIPT OVERRIDES — values for the scripts/ templates

1. LAUNCH

Two product families, with opposite operating models:

  • Gradient — the managed layer. Notebooks are a web Jupyter IDE on a shared persistent store; Deployments serve a container behind a REST endpoint (bring a Docker image <user>/img:tag); Workflows run GPU-backed DAG automation. Entry: web console, the CLI/SDK, or REST.
  • Core / Machines — raw Linux/Windows VMs with a persistent block disk, full root/SSH. OS templates include ML-in-a-Box (preinstalled CUDA + PyTorch/TensorFlow/RAPIDS/Jupyter; terminal/SSH-only, home /home/paperspace, shell /bin/bash). Ubuntu 22.04 is required for H100 and recommended for A100; Ubuntu 20.04 is recommended for any other machine type (verified github.com/Paperspace/ml-in-a-box README + DO machines docs 2026-06). This is the family that maps cleanly onto the AutoDL tmux-resilient-training pattern.

Env contract. The chosen image/template IS the Python env — do NOT conda create on a rental (principle: the prebuilt base is the env). On Core, run inside ML-in-a-Box directly; on Gradient Deployments, the env is the Docker image specified at create time. Because a destroy wipes the box, the durable analog of the env is a Docker image plus a requirements.txt/lock file kept off-box, so a recreate reproduces it. On Notebooks, a plain pip install does NOT survive a restart (writes to /usr/local/lib, ephemeral) — see §2 / PS3.

The three-CLI mess (gates ALL automation). The tooling fragmented across the DigitalOcean acquisition; the draft's "migrate to the current API/CLI" understates the trap (verified github.com/Paperspace 2026-06):

  • The legacy Gradient REST API endpoints were deprecated 15 Jul 2024 — stale calls 404 or no-op.
  • gradient-cli v2 is deprecated; pin pip install "gradient<3.0" only to keep old scripts alive.
  • gradient-python (github.com/digitalocean/gradient-python) is NOT the orchestration CLI — it is the new DigitalOcean Gradient AI / GenAI inference SDK. Name collision — do not install it expecting notebook/machine control.
  • The recommended tool for new work is the streamlined pspace CLI (github.com/Paperspace/cli, releases ongoing into 2026; e.g. pspace public-ip release <ip>). Pin and verify the CLI binary + version in any automation; do not assume gradient-to-pspace command parity.

verify: ssh <core-alias> 'python -c "import torch;print(torch.cuda.is_available())"' on Core, or a print(torch.cuda.is_available()) cell in a Notebook.


2. STORAGE MODEL (survival matrix — principle #4)

The defining fact: a stop/shut-down keeps the disk — Paperspace is one of the few profiles here that behaves like AutoDL's 关机 in this respect. Only destroy/delete removes storage.

Gradient Notebooks: /storage and /notebooks are separate branches from /, NOT nested (verified DO notebooks/details/storage-architecture 2026-06):

  • /storage: shared persistent, team-wide, scoped to a storage region/cluster. Survives stop. (Team-shared ⇒ never write secrets here; see §7 / references/gotchas_universal.md.)
  • /notebooks: per-notebook persistent, managed via the console File Manager. Survives stop.
  • everything else: ephemeral workspace (incl. /usr/local/lib where pip lands), wiped on stop.

Core machines: block storage of 50 GB to 2 TB persists across a stop; expansion is one-way ("increasing block storage expands the filesystem and is not reversible"). Region-locked: storage and custom templates must be used in the same datacenter. Snapshots are a separate billed resource ($0.29/GB/mo). The default policy is "Never" / 0 stored, so they bill only if manually enabled; a snapshot survives a machine destroy, so an orphaned snapshot keeps charging (see PS9).

| Tier | Path | Survives STOP? | Survives DESTROY/DELETE? | Cap / note |
|---|---|---|---|---|
| Notebook shared persistent | /storage | yes | yes (separate resource) | team-shared per region/cluster; billed until deleted |
| Notebook per-notebook | /notebooks | yes | no (dies with the notebook) | per-notebook persistent; console File Manager |
| Notebook workspace | everything else (incl. /usr/local/lib) | no | no | ephemeral; wiped on stop; pip lands here |
| Core block storage | machine root + block vol | yes | no | 50 GB to 2 TB; expansion irreversible; region-locked |
| Core snapshot | (separate resource) | yes | yes (orphan-bills!) | $0.29/GB/mo; default policy Never/0; survives machine destroy |

Where checkpoints MUST go (for the §5 teardown verb): on Notebooks, /storage (survives a stop and survives deletion of the notebook); /notebooks dies if the notebook itself is deleted. On Core, the block disk survives a stop, but a destroy wipes it, so the Iron-Law pull-to-local before destroy still applies. No documented inode cap on either tier; still monitor df -i (universal, U7 / principle #5).
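A minimal sketch of a pre-write guard for Notebooks, assuming the two persistent mounts named in the matrix above. The function name durable_root is illustrative. The check is purely lexical (it does not resolve symlinks or inspect real mounts), so df -h on the box remains the authoritative test.

```python
import posixpath
from pathlib import PurePosixPath

# Persistent mounts on a Gradient Notebook, taken from the survival matrix.
# Everything outside these two trees is treated as ephemeral workspace.
DURABLE_ROOTS = ("/storage", "/notebooks")


def durable_root(path):
    """Return the persistent mount containing `path`, or None if ephemeral.

    Lexical check only: ".." segments are collapsed, symlinks are not followed.
    """
    if not path.startswith("/"):
        raise ValueError("pass an absolute path")
    clean = PurePosixPath(posixpath.normpath(path))
    for root in DURABLE_ROOTS:
        root_path = PurePosixPath(root)
        if clean == root_path or root_path in clean.parents:
            return root
    return None
```

A checkpoint writer could refuse to start when durable_root(ckpt_dir) is not "/storage", since /notebooks survives a stop but not deletion of the notebook.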


3. NETWORK

  • Egress. Direct and unproxied to HF/GitHub/PyPI; no network_turbo-style accelerator and no documented egress fee. China-mirror relevance is N/A as a platform feature — relevant only when operating from inside China and supplying a private mirror (then references/china-network.md).
  • Public IP. Core machines are reached by public IP, of two kinds (verified DO machines/how-to/manage-public-ips 2026-06):
    • Static: "the same IP address every time it powers on … remains in your account until you delete it." Use it to pin stable SSH/endpoint addressing. Billed until deleted, including while the machine is powered off (see §5 / PS6). API/CLI can create/release a static IP but cannot add a dynamic IP to an existing machine; dynamic must be requested at machine-creation time.
    • Dynamic: "assigned automatically when a machine powers on and deleted when it powers off"; a new IP on every start, so a hard-coded SSH alias breaks after a restart. Charged only while the machine runs (auto-released on power-off → no idle IP cost). A machine with no public IP is internet-isolated (and avoids the IP charge). Private networks give team-isolated pools.
  • Ports / services. Firewall is self-managed — open ports to expose services. Tunnel Jupyter (8888) / TensorBoard (6006) over SSH on Core: ssh -L 8888:localhost:8888 -L 6006:localhost:6006 paperspace@<machine-ip> (placeholder host — substitute the machine's real IP/static address). In a Gradient Notebook, launch TensorBoard in-Jupyter and write logs under /storage (or they vanish on stop).
  • SSH flavor. Core = a standard Linux VM → full ssh/scp/rsync (ML-in-a-Box default user paperspace). Gradient Notebooks expose a Jupyter sandbox, not a clean persistent SSH daemon — there is no stable SSH-daemon story for a multi-day unattended run on a Notebook.
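The Core tunnel command above can be generated rather than hand-typed, which helps when a dynamic IP changes on every start. This is a sketch only: tunnel_command is a hypothetical helper, and the default user paperspace is the ML-in-a-Box default mentioned above.

```python
def tunnel_command(host, ports=(8888, 6006), user="paperspace"):
    """Build the ssh argv that forwards each remote port to the same local port."""
    if not host or host.startswith("-"):
        # Guard against an empty host or one that ssh would parse as an option.
        raise ValueError("host must be a real address")
    argv = ["ssh"]
    for port in ports:
        port = int(port)
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(f"{user}@{host}")
    return argv
```

The result is an argv list suitable for subprocess.run; pass the machine's current address each time rather than caching it when the IP is dynamic.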

4. SPOT / INTERRUPTION + RESUME (principle #7/#8)

No AWS-style spot/preemptible tier with a 2-minute interruption warning. The two interruption modes are different in kind and BOTH are deterministic, not random eviction:

  1. Capacity-at-launch. The desired GPU type may be unavailable when launching — a launch-time availability problem, not a runtime eviction. On free notebooks this surfaces as "out of capacity" / the notebook sits "pending" in queue for the next free machine (verified DO notebooks/how-to docs 2026-06). Build retry-launch-until-available logic, not a 2-minute-grace flush handler; for assured access, a paid instance type bypasses the free queue.

  2. Auto-shutdown clock — the hard ceiling on any long run. The timer is the real killer:

    • Gradient free notebooks hard-stop at a 6-hour maximum auto-shutdown (cannot be raised).
    • Paid notebooks default to 12-hour auto-shutdown; range 1 hour to 1 week.
    • Core machines allow a configurable 1 hour to 1 week auto-shutdown.
    • Trap (Core/Linux): Core Linux auto-shutdown is wall-clock, not idle-based — "Linux machines shut down regardless of whether any users are connected" (only Windows waits for idle). An active SSH/tmux session does not keep a Linux Core machine alive past the timer (verified DO machines/how-to/manage-auto-shutdown 2026-06).
    • Trap (API): auto-shutdown cannot be enabled/disabled via API or CLI on an existing machine — "you can only manage the auto-shutdown feature via the Paperspace console" (same source). Set it deliberately at create time / in the console.

    The window is deterministic, so plan around it: a tmux session inside a Notebook still dies at the timeout (§6). Resume hook: checkpoint full state to /storage (Notebooks) or the block disk (Core) before the auto-shutdown window, then restart and load-latest-on-startup unconditionally. Because the clock is known in advance, cadence can be planned rather than guessed — but the load-latest-on-startup spine (principle #8) is what makes the restart idempotent. Young/Daly cadence formula → references/spot-resilience.md.
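A minimal sketch of the load-latest-on-startup spine, assuming training state that fits in a dict. JSON stands in for the real serializer (torch.save or similar). The names save_checkpoint, load_latest, and train, and the step_NNNNNNNN.json layout, are illustrative and not part of any Paperspace or PyTorch API.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir, step, state):
    """Write the checkpoint atomically: temp file first, then rename into place."""
    d = Path(ckpt_dir)
    d.mkdir(parents=True, exist_ok=True)
    final = d / f"step_{step:08d}.json"
    fd, tmp = tempfile.mkstemp(dir=str(d), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())  # make sure bytes hit the disk before the rename
    os.replace(tmp, final)    # a shutdown mid-write never leaves a half file
    return final


def load_latest(ckpt_dir):
    """Return the newest readable checkpoint, skipping corrupt ones, else None."""
    d = Path(ckpt_dir)
    if not d.is_dir():
        return None
    for path in sorted(d.glob("step_*.json"), reverse=True):
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            continue
    return None


def train(ckpt_dir, total_steps, every):
    """Toy loop: always resume from the latest checkpoint, unconditionally."""
    resumed = load_latest(ckpt_dir)
    step = resumed["step"] if resumed else 0
    while step < total_steps:
        step += 1
        state = {"loss": 1.0 / step}  # placeholder for a real training step
        if step % every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return step
```

Because train calls load_latest on every start, rerunning the same command after an auto-shutdown continues from the last saved step instead of step 0. Point ckpt_dir at /storage on Notebooks or the block disk on Core.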


5. TEARDOWN / BILLING (principle #9 + the Iron Law — the most error-prone section)

Per-hour billing (verified DO products/paperspace/pricing 2026-06). A shut-down/power-off STOPS the compute (GPU) meter while disk persists — this is the AutoDL-like part. But it does NOT stop every meter.

  • What a stop still bills (the trap): "When a Paperspace machine is powered off, attached storage, public IP addresses, and other add-ons continue to be billed on an hourly basis until you destroy those resources." Gradient /storage over the plan allowance and Core block storage both keep charging while the machine is off.
  • The monthly-cap softener (new fact): non-GPU resources (storage, public IP, snapshots) have a maximum monthly charge — "once a non-GPU resource reaches its monthly maximum, it no longer incurs charges for the rest of the billing cycle." Static public IP caps at $3.00/mo ($0.0045/hr). So a forgotten static IP is a bounded ~$3/mo bleed, but a forgotten 2 TB block volume is ~$120/mo until destroyed (verified DO pricing 2026-06).
  • What actually stops the full meter: destroy the machine AND release the static IP AND delete the storage (AND delete any snapshot); these are separate actions. "To stop all charges for a machine and its add-ons, destroy the machine and any resources you no longer need." A stopped-but-not-destroyed machine with a Static IP, a 2 TB block volume, and a leftover snapshot is still spending money.
  • Irreversible: destroy/delete of a machine removes its block storage (no recovery); block-storage expansion is also one-way. A shut-down is reversible (resume later).
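The capped hourly billing described above can be estimated with a few lines of arithmetic. This is a sketch under stated assumptions: the static-IP rate and cap are the figures quoted above, while the block-volume hourly rate is a PLACEHOLDER obtained by spreading the ~$120/mo figure over an assumed 730-hour month. Check the live pricing page before relying on any number.

```python
HOURS_PER_MONTH = 730  # assumption: average month length for the estimate


def capped_charge(rate_per_hour, hours, monthly_cap):
    """Charge for one non-GPU resource within a single billing cycle."""
    if rate_per_hour < 0 or hours < 0 or monthly_cap < 0:
        raise ValueError("rates, hours and caps must be non-negative")
    return min(rate_per_hour * hours, monthly_cap)


def idle_bill(resources, hours):
    """Per-resource charge for leftovers on a powered-off machine."""
    return {
        name: round(capped_charge(rate, hours, cap), 2)
        for name, (rate, cap) in resources.items()
    }


# name -> (hourly rate in USD, monthly cap in USD)
LEFTOVERS = {
    "static_ip": (0.0045, 3.00),                    # figures quoted above
    "block_2tb": (120.0 / HOURS_PER_MONTH, 120.0),  # PLACEHOLDER hourly rate
}
```

For example, idle_bill(LEFTOVERS, 168) estimates one week of a stopped machine. The function covers a single billing cycle only and does not model the cap resetting in the next month.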

Net contrast vs the other profiles: Paperspace gives a real idle-cheap stop (unlike Lambda, which has no stop), but unlike AutoDL's 关机 the storage + IP + snapshots keep billing until each is explicitly destroyed/released. "Stopped" ≠ "free."

Iron Law (teardown gate): NO destroy/delete of the machine, release of the IP, or deletion of /storage/block-storage/snapshot until checkpoints are pulled to local AND verified by load, and the user has explicitly approved the specific cost-affecting action. A destroy is irreversible — "it looked done in the log" is not evidence (principle #3). General form → superpowers:verification-before-completion.
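One piece of the "pulled to local AND verified" gate can be sketched as a checksum comparison. This assumes a SHA-256 digest was computed on the box first (for example with sha256sum) and copied alongside the file; sha256_of and pull_is_intact are illustrative names. A matching digest shows the bytes arrived intact. It does not replace actually loading the checkpoint.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so multi-GB checkpoints do not fill RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pull_is_intact(local_path, remote_hexdigest):
    """True only if the local copy hashes to the digest taken on the remote box."""
    return sha256_of(local_path) == remote_hexdigest.strip().lower()
```

Only after pull_is_intact returns True and the checkpoint has been loaded successfully should the destroy be put to the user for approval.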


6. DAEMON TOOL

  • Core machines — full VMs ⇒ tmux/screen/nohup all available; SSH is as stable as any cloud VM. This is the closest analog to the AutoDL tmux-resilient pattern. tmux survives an SSH drop; it does NOT survive a machine stop/restart (the process is gone), and — critically on Core/Linux — a live tmux session does not defer the wall-clock auto-shutdown (§4), so durability still rests on checkpoint-to-disk + load-latest (principle #8), not on the detach primitive.
  • Gradient Notebooks — a managed Jupyter sandbox: no clean persistent SSH-daemon story, and the auto-shutdown timer is a hard ceiling — a tmux session started inside a Notebook still dies at the timeout. Notebooks are not built for unattended multi-day daemons.
  • Platform-native long-job mechanisms: Workflows (DAG automation) and Deployments (always-on serving). For training-as-a-daemon, prefer Core + tmux; treat Notebooks as interactive/short-run only.

If tmux is absent on a minimal image, fall back to nohup <cmd> </dev/null >log 2>&1 &.
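The tmux-or-nohup choice can be made programmatically. A sketch, assuming the job is given as a single shell command string: detach_argv is a hypothetical helper, and the session name and log path defaults are placeholders.

```python
import shlex
import shutil


def detach_argv(command, session="train", log_path="train.log", have_tmux=None):
    """Return an argv that starts `command` detached from the SSH session.

    Prefers a detached tmux session; falls back to nohup when tmux is missing.
    `command` is passed to a shell as-is, so it must come from a trusted source.
    """
    if have_tmux is None:
        have_tmux = shutil.which("tmux") is not None
    if have_tmux:
        return ["tmux", "new-session", "-d", "-s", session, command]
    fallback = f"nohup {command} </dev/null >{shlex.quote(log_path)} 2>&1 &"
    return ["sh", "-c", fallback]
```

Either form survives an SSH drop only. Neither survives a machine stop or the wall-clock auto-shutdown, so checkpoint + load-latest is still required.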


7. TOP GOTCHAS (platform-pinned; universal ones → references/gotchas_universal.md)

  • PS1 — "Stopped the machine, still getting billed." Symptom: GPU meter halted but the bill keeps climbing while the box is off. Root cause: shut-down stops only the compute meter; attached storage + public IP + add-ons + snapshots bill hourly until destroyed/released (verified DO pricing 2026-06). Fix: to truly stop the meter, destroy the machine, release the Static IP, delete the storage and any snapshot — separate teardown actions. Audit for orphaned storage/IPs/snapshots after every stop.

  • PS2: A long run dies at a round-number wall-clock time with no error. Symptom: training vanishes at exactly 6 h / 12 h (or the configured Core window); no traceback. Root cause: the auto-shutdown clock, not a crash. Free notebooks stop at 6 h (hard cap), paid notebooks at 12 h by default, Core at a configured 1 h to 1 wk. On Core/Linux the clock is wall-clock, not idle-based, so an active SSH/tmux session does NOT extend it (verified DO manage-auto-shutdown 2026-06). Fix: checkpoint to /storage (Notebooks) or the block disk (Core) before the window; for Core, raise the auto-shutdown to the longest needed in the console (API/CLI cannot change it post-create); restart + load-latest to resume.

  • PS3: pip install (or any non-/storage write) vanishes after a Notebook restart. Symptom: packages installed in-session are gone next session; "saved" files disappear after stop/restart. Root cause: pip writes to /usr/local/lib, which is ephemeral workspace; only /storage and /notebooks persist (verified fast.ai forum + DO storage-architecture 2026-06). "Machines are snapshots, not servers," so in-session installs do not persist. Fix: install into a persisted dir, either pip install --user <pkg> (lands in the home dir under a persisted tree) or pip install --target /storage/pyenv <pkg> && export PYTHONPATH=/storage/pyenv; write all checkpoints/logs/outputs under /storage; verify they landed (ls/checksum) before stop.

  • PS4 — Automation 404s / silently no-ops / installs the wrong SDK. Symptom: a gradient-era create/stop call fails or does nothing; or pip install gradient (v3+) imports an inference SDK with no notebook/machine commands. Root cause: legacy Gradient REST endpoints deprecated 15 Jul 2024; gradient-cli v2 deprecated; gradient-python v3 is the DigitalOcean Gradient AI inference SDK — a name collision, not the orchestration CLI (verified github.com/Paperspace/gradient-cli + digitalocean/gradient-python 2026-06). Fix: for new work use the pspace CLI (github.com/Paperspace/cli); to keep old scripts alive pin pip install "gradient<3.0". Pin and verify the CLI binary + version in any automation.

  • PS5 — Custom template / storage / volume "not found" in a different datacenter. Symptom: a saved template or block volume is unavailable when launching elsewhere; block-storage resize can't be undone. Root cause: storage and templates are region/DC-locked, and block-storage expansion is irreversible (one-way filesystem grow). Fix: pick the datacenter deliberately and keep storage+compute+template co-located; size block storage with headroom up-front (cannot shrink).

  • PS6 — SSH alias breaks after every restart. Symptom: the saved ssh host no longer connects after a machine restart. Root cause: a Dynamic public IP is released on power-off and reassigned on start (new IP each time). Fix: attach a Static IP for stable SSH/endpoint addressing (it bills until deleted, capped $3/mo — PS1), or re-resolve the address on each start before scripting. Note: API/CLI can manage a static IP but cannot add a dynamic one to an existing machine (request dynamic at create time).

  • PS7 — Free-tier notebook code is PUBLIC by default. Symptom: proprietary/confidential code is world-readable in a Gradient free notebook. Root cause: free Gradient notebooks are public by default; private notebooks require a paid plan (verified Paperspace blog / pricing 2026-06). Fix: never put confidential code or any secret in a free notebook; upgrade to a paid plan for private notebooks. Treat the free tier as a public scratchpad. (Secrets hygiene → references/gotchas_universal.md.)

  • PS8 — Free notebook won't start / sits "pending." Symptom: a free-GPU notebook stays pending or errors "out of capacity"; only one notebook will run. Root cause: free tier = 1 concurrent running notebook, ≤5 projects, 5 GB /storage, and free machines are pooled — a pending notebook is queued for the next free machine (verified Paperspace free-instances docs + blog 2026-06). Fix: expect queueing on free; stop the other free notebook (only one runs); for assured access use a paid instance type, which skips the free queue.

  • PS9 — A destroyed machine keeps billing via a leftover snapshot. Symptom: machine destroyed, yet a small monthly charge persists. Root cause: snapshots are a separate resource that survives a machine destroy and bills at $0.29/GB/mo until deleted; auto-snapshot defaults to "Never"/0 but a manually-enabled policy (daily by default, up to 10 stored) silently accrues (verified DO pricing + blog/automated-snapshots 2026-06). Fix: when tearing down, delete the snapshot too (console or CLI); audit the snapshots list after every machine destroy. Capped per-resource by the monthly maximum but still a bleed.

  • PS10 — Notebook upload/import fails on the 5 GB free cap. Symptom: uploading a multi-GB dataset to /storage fails for an unpaid account. Root cause: free /storage allowance is 5 GB; overage is $0.29/GB/mo (paid plans include more: e.g. 200 GB / 1 TB tiers) (verified Paperspace pricing + fast.ai forum 2026-06). Fix: stream/stage the dataset rather than uploading the whole thing, prune aggressively, or upgrade the plan; redirect HF/torch caches off /storage if they would push over the allowance.

  • PS11 — ML-in-a-Box CUDA/driver too old for current PyTorch on a new-arch GPU. Symptom: The NVIDIA driver on your system is too old (found version 110xx). Please update your GPU driver, or no kernel image is available for execution on a fresh card. Root cause: the template's host driver/CUDA stack lags newer PyTorch wheels; on a rental the host driver is host-global and a tenant usually cannot upgrade it (verified github.com/Paperspace/ml-in-a-box issue #13 2026-06). This is the platform-pinned face of the universal CUDA-triangle (U28). Fix: install a torch build matching the box's CUDA (do not force-upgrade the host driver on a rental); pick a template whose Ubuntu/driver matches the GPU (22.04 for H100/A100). Full triangle → U28 in references/gotchas_universal.md.

  • PS12 — Gradient Deployment / custom image won't pull or drifts. Symptom: a Deployment fails to pull <user>/img:tag, or "the same image" behaves differently over time. Root cause: a moving tag (:latest) resolves to a different layer set; private-registry creds missing. Fix: pin the image by digest (@sha256:) and supply registry creds as a Gradient secret, not inline. General form → U30 in references/gotchas_universal.md.

  • PS13 — Platform-specific debugging. Commands + what to check (Core uses standard Linux tooling; the Notebook-only items are the platform delta):

    • Confirm GPU + driver/torch match: nvidia-smi (driver/CUDA version) then python -c "import torch;print(torch.__version__, torch.version.cuda, torch.cuda.is_available())" — a mismatch here is PS11/U28, not a code bug.
    • Find what is eating the 5 GB / over-allowance /storage (the platform's own recommended cmd): du -sch .[!.]* * | sort -h (or !du -sch … in a cell); install ncdu for an interactive view (verified DO notebooks/how-to/manage-storage 2026-06). Check df -h AND df -i (inodes, U7).
    • Is a Notebook write durable? df -h /storage /notebooks and confirm the target is one of those two mounts — anything else (incl. /usr/local/lib) is ephemeral (PS3).
    • Why did the run vanish? Walk the universal ladder (U3): dmesg | grep -iE 'killed process|out of memory' (OOM?), uptime (recent reboot = auto-shutdown fired, PS2), nvidia-smi (GPU idle = died, not hung). A round-number uptime-near-window with a clean dmesg ⇒ auto-shutdown, not a crash.
    • Detect a stuck/slow download: watch the target file size grow (watch -n5 'ls -l /storage/<file>'); a flat size with a live process = stalled wire (U12 resumable loop). Egress is direct/unproxied here, so a stall is route/peer, not a missing proxy hook.
    • Audit orphaned billables before declaring teardown done: in the console (or pspace) list machines, public IPs, storage/volumes, and snapshots; PS1/PS9 hide in the last two.
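The stalled-download check above (flat file size with a live process) can be scripted instead of watched by eye. A sketch with illustrative names: sample_sizes polls the file, and is_stalled applies the "flat for N samples" rule. The default of three samples is an assumption to tune.

```python
import os
import time


def sample_sizes(path, count, interval_s=5.0):
    """Record the file size `count` times, `interval_s` seconds apart."""
    sizes = []
    for i in range(count):
        sizes.append(os.path.getsize(path) if os.path.exists(path) else 0)
        if i < count - 1:
            time.sleep(interval_s)
    return sizes


def is_stalled(sizes, flat_samples=3):
    """True when the last `flat_samples` readings are identical."""
    if flat_samples < 2:
        raise ValueError("need at least two samples to call a stall")
    if len(sizes) < flat_samples:
        return False
    tail = sizes[-flat_samples:]
    return all(size == tail[0] for size in tail)
```

A flat tail only indicates a stall while the downloader process is still alive; a finished download also stops growing, so check the process before restarting the transfer.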

8. SCRIPT OVERRIDES

Values to parameterize the scripts/ templates for Paperspace. Forward-slash paths; placeholders for any host/IP (never a real address). Core and Gradient differ — both shown.

# --- Gradient Notebook ---
DATA_DIR=/storage                # team-shared persistent; survives stop AND notebook delete
DURABLE_DIR=/storage             # checkpoints land here (NOT /notebooks — dies with the notebook)
SCRATCH=/tmp                     # ephemeral workspace; wiped on stop — never the only copy
HF_HOME=/storage/.cache/huggingface     # redirect cache off ephemeral workspace (watch the 5 GB free cap, PS10)
PROXY_HOOK=                      # none — direct egress (no network_turbo)
CRED_FILE=""                     # Paperspace keys are Gradient secrets / env vars, not files — WANDB_API_KEY/HF_TOKEN arrive via the secret/env (run_one's [ -n "$CRED_FILE" ] guard skips the file read); never write keys to /storage (team-shared)
DETACH=                          # no clean tmux; Jupyter kernel + hard 6h/12h auto-shutdown ceiling
# NOTE: pip into /storage to persist: pip install --target /storage/pyenv <pkg> && export PYTHONPATH=/storage/pyenv (PS3)

# --- Core machine (preferred for daemonized training) ---
DATA_DIR=/path/to/blockstore     # placeholder — the attached block disk mount
DURABLE_DIR=/path/to/blockstore/ckpts
SCRATCH=/tmp
HF_HOME=/path/to/blockstore/.cache/huggingface
PROXY_HOOK=                      # none
CRED_FILE=""                     # Paperspace keys are Gradient secrets / env vars, not files — WANDB_API_KEY/HF_TOKEN arrive via the secret/env (run_one's [ -n "$CRED_FILE" ] guard skips the file read); inject at launch, never inline
DETACH=tmux                      # survives SSH drop, NOT a machine stop, and NOT the wall-clock auto-shutdown — rely on checkpoint+resume
SSH_HOST=<machine-ip>            # placeholder — ML-in-a-Box user is `paperspace`; pin a Static IP for a stable alias (PS6); dynamic IP changes every start

Reminder: secrets referenced by env-var NAME or Gradient secret only — never inline a key, and never write one onto the team-shared /storage (universal secrets gotcha → references/gotchas_universal.md).
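A minimal sketch of the by-NAME rule on the Python side, assuming the key arrives as an environment variable. require_secret and redact are hypothetical helpers. The variable names WANDB_API_KEY and HF_TOKEN are the ones used in the overrides above.

```python
import os


def require_secret(name, environ=None):
    """Fetch a secret by env-var NAME; fail loudly without echoing any value."""
    env = os.environ if environ is None else environ
    value = env.get(name, "")
    if not value:
        raise RuntimeError(f"missing secret: set the {name} environment variable")
    return value


def redact(value, keep=4):
    """Mask a secret for logs, showing at most the last `keep` characters."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]
```

Calling require_secret("HF_TOKEN") at startup surfaces a missing key before the GPU meter has run for long, and redact keeps the full value out of logs that may land on the team-shared /storage.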