---
platform: paperspace # Paperspace (now under DigitalOcean): Gradient Notebooks + Core/Machines
kind: cloud-api # web console + pspace/gradient CLI/SDK + REST; Core machines also reachable by SSH
meter_stop_verb: shut-down # shut-down/power-off stops COMPUTE; only destroy/delete stops storage + IP
meter_stop_irreversible: false # a stop is reversible; destroy/delete IS irreversible (loses block storage)
detach_primitive: tmux # on Core VMs; Notebooks have no clean SSH-daemon story (Jupyter kernel + hard auto-shutdown ceiling)
spot_available: false # no AWS-style spot/preemptible with a 2-min warning
spot_grace: n/a # interruption is capacity-at-launch + a deterministic auto-shutdown clock, not eviction
shared_fs: true # Gradient /storage is team-shared per storage region/cluster
inode_cap: none # no documented inode cap on either /storage or Core block storage
free_egress: true # no documented ingress/egress fee
china_mirror_needed: false # US/global cloud, direct egress; no platform-provided proxy
host_driver_cuda_max: "host-dependent" # ML-in-a-Box / template ships the CUDA+driver stack (often lagging)
local_nvme: host-dependent # ephemeral workspace on Notebooks; block storage on Core
---
# Paperspace (DigitalOcean) — platform profile
One-line purpose: substrate for running detached GPU jobs on Paperspace Gradient (managed Jupyter
notebooks/deployments) and Paperspace Core (raw Linux VMs, "Machines") — what stops the meter, what
survives a stop vs a destroy, and the auto-shutdown clock that ends every long run. Universal gotchas are
NOT repeated here — see `references/gotchas_universal.md`.
> **Surface to the user up front (principle #10):** ⚠️ Danger clocks — an **auto-shutdown timer ends every Notebook/Core run** (set it consciously; Gradient free notebooks hard-cap at 6 h); **snapshots / block storage keep billing after a machine is destroyed** (orphan bleed). Heads-up — the **Gradient CLI/API was deprecated 15 Jul 2024** (pin `gradient<3.0`; the three-CLI mess, §1).
To jump: `grep -in '<keyword>' profiles/paperspace.md`.
## Table of contents
1. LAUNCH — Gradient vs Core, the env contract, the three-CLI mess
2. STORAGE MODEL — survival matrix, the stop-keeps-disk rule, pip-doesn't-persist
3. NETWORK — public IP (static vs dynamic), ports, SSH flavor
4. SPOT / INTERRUPTION + RESUME — the auto-shutdown clock, not spot
5. TEARDOWN / BILLING — what actually stops the meter (the trap)
6. DAEMON TOOL — tmux on Core; why Notebooks resist a daemon
7. TOP GOTCHAS — `PS1`–`PS13`, platform-pinned + platform-specific debugging
8. SCRIPT OVERRIDES — values for the `scripts/` templates
---
## 1. LAUNCH
Two product families, with opposite operating models:
- **Gradient** — the managed layer. **Notebooks** are a web Jupyter IDE on a shared persistent store;
**Deployments** serve a container behind a REST endpoint (bring a Docker image `<user>/img:tag`);
**Workflows** run GPU-backed DAG automation. Entry: web console, the CLI/SDK, or REST.
- **Core / Machines** — raw Linux/Windows VMs with a persistent block disk, full root/SSH. OS templates
include **ML-in-a-Box** (preinstalled CUDA + PyTorch/TensorFlow/RAPIDS/Jupyter; **terminal/SSH-only**,
home `/home/paperspace`, shell `/bin/bash`). **Ubuntu 22.04 is required for H100 and recommended for
A100; Ubuntu 20.04 is recommended for any other machine type** (verified github.com/Paperspace/ml-in-a-box
README + DO machines docs 2026-06). This is the family that maps cleanly onto the AutoDL
tmux-resilient-training pattern.
**Env contract.** The chosen image/template IS the Python env — do NOT `conda create` on a rental
(principle: the prebuilt base is the env). On Core, run inside **ML-in-a-Box** directly; on Gradient
Deployments, the env is the Docker image specified at create time. Because a *destroy* wipes the box, the
durable analog of the env is a Docker image plus a `requirements.txt`/lock file kept off-box, so a recreate
reproduces it. **On Notebooks, a plain `pip install` does NOT survive a restart** (writes to
`/usr/local/lib`, ephemeral) — see §2 / `PS3`.
**The three-CLI mess (gates ALL automation).** The tooling fragmented across the DigitalOcean acquisition;
a bare "migrate to the current API/CLI" understates the trap (verified github.com/Paperspace 2026-06):
- The **legacy Gradient REST API endpoints were deprecated 15 Jul 2024** — stale calls 404 or no-op.
- **`gradient-cli` v2 is deprecated**; pin `pip install "gradient<3.0"` only to keep *old* scripts alive.
- **`gradient-python` (github.com/digitalocean/gradient-python) is NOT the orchestration CLI** — it is the
new DigitalOcean *Gradient AI / GenAI inference* SDK. **Name collision** — do not install it expecting
notebook/machine control.
- The **recommended tool for new work is the streamlined `pspace` CLI** (github.com/Paperspace/cli,
releases ongoing into 2026; e.g. `pspace public-ip release <ip>`). Pin and verify the CLI binary +
version in any automation; do not assume `gradient`-to-`pspace` command parity.
**verify:** `ssh <core-alias> 'python -c "import torch;print(torch.cuda.is_available())"'` on Core, or a
`print(torch.cuda.is_available())` cell in a Notebook.
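The pin-and-verify advice for the CLI can be sketched as a guard at the top of an automation script. This is a minimal sketch, not a documented interface: `require_cli_version` is a hypothetical helper, and the version command is supplied by the caller because this profile does not state which flag the `pspace` binary uses to print its version (check it on the installed binary first). The guard compares the full output against the pinned string and fails closed.

```sh
# Hypothetical guard: refuse to continue unless the caller-supplied version
# command prints exactly the pinned string.
# Usage: require_cli_version <pinned-output> <command that prints the version...>
require_cli_version() (
    pinned=$1
    shift
    # Fail closed if the version command itself errors (binary missing, bad flag).
    actual=$("$@" 2>/dev/null) || { echo "version command failed: $*" >&2; return 1; }
    if [ "$actual" != "$pinned" ]; then
        echo "expected '$pinned', got '$actual'" >&2
        return 1
    fi
)
```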
---
## 2. STORAGE MODEL *(survival matrix — principle #4)*
The defining fact: a **stop/shut-down keeps the disk** — Paperspace is one of the few profiles here that
behaves like AutoDL's 关机 in this respect. Only **destroy/delete** removes storage.
**Gradient Notebooks**: `/storage` and `/notebooks` are **separate branches from `/`, NOT nested**
(verified DO notebooks/details/storage-architecture 2026-06):
- `/storage`: **shared persistent**, team-wide, scoped to a **storage region/cluster**. Survives stop.
(Team-shared ⇒ never write secrets here — see §7 / `references/gotchas_universal.md`.)
- `/notebooks`: **per-notebook persistent**, managed via the console File Manager. Survives stop.
- everything else — **ephemeral workspace** (incl. `/usr/local/lib` where `pip` lands), wiped on stop.
**Core machines**: block storage **50 GB–2 TB**, persists across a stop; **expansion is one-way**
("increasing block storage expands the filesystem and is not reversible"). Region-locked: storage and
custom templates must be used in the **same datacenter**. **Snapshots** are a separate billed resource
(`$0.29/GB/mo`, default policy is **"Never" / 0 stored** — they bill only if manually enabled, and a
snapshot **survives a machine destroy**, so an orphaned snapshot keeps charging — see `PS9`).
| Tier | Path | Survives STOP? | Survives DESTROY/DELETE? | Cap / note |
|---|---|---|---|---|
| Notebook shared persistent | `/storage` | yes | yes (separate resource) | team-shared per region/cluster; billed until deleted |
| Notebook per-notebook | `/notebooks` | yes | no (dies with the notebook) | per-notebook persistent; console File Manager |
| Notebook workspace | everything else (incl. `/usr/local/lib`) | **no** | no | ephemeral; wiped on stop; `pip` lands here |
| Core block storage | machine root + block vol | yes | **no** | 50 GB–2 TB; expansion irreversible; region-locked |
| Core snapshot | (separate resource) | yes | **yes** (orphan-bills!) | `$0.29/GB/mo`; default policy Never/0; survives machine destroy |
**Where checkpoints MUST go (for the §5 teardown verb):** on Notebooks, `/storage` (cross-stop,
cross-delete-of-the-notebook) — `/notebooks` dies if the notebook itself is deleted. On Core, the block
disk survives a stop, but a *destroy* wipes it, so the Iron-Law pull-to-local before destroy still applies.
No documented inode cap on either tier; still monitor `df -i` (universal, U7 / principle #5).
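The Notebook rows of the survival matrix can be turned into a pre-flight check on the checkpoint directory. A minimal sketch with hypothetical helper names (`notebook_tier`, `require_durable_ckpt_dir`); the match is purely lexical on the path prefix, so it assumes the path is absolute and does not resolve symlinks or `..` components.

```sh
# Classify a Notebook path by the survival matrix: "shared" for /storage,
# "per-notebook" for /notebooks, "ephemeral" for everything else.
notebook_tier() {
    case $1 in
        /storage|/storage/*)     echo shared ;;
        /notebooks|/notebooks/*) echo per-notebook ;;
        *)                       echo ephemeral ;;
    esac
}

# Refuse a checkpoint directory that would not survive a notebook delete.
require_durable_ckpt_dir() {
    [ "$(notebook_tier "$1")" = shared ] || {
        echo "refusing checkpoint dir $1: not under /storage" >&2
        return 1
    }
}
```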
---
## 3. NETWORK
- **Egress.** Direct and unproxied to HF/GitHub/PyPI; no `network_turbo`-style accelerator and no
documented egress fee. China-mirror relevance is **N/A as a platform feature** — relevant only when
operating from inside China and supplying a private mirror (then `references/china-network.md`).
- **Public IP.** Core machines are reached by **public IP**, of two kinds (verified DO
machines/how-to/manage-public-ips 2026-06):
- **Static** — "the same IP address every time it powers on … remains in your account until you delete
it." Use it to pin stable SSH/endpoint addressing. **Billed until deleted** — *including while the
machine is powered off* (see §5 / `PS6`). API/CLI can create/release a **static** IP but **cannot add a
dynamic IP to an existing machine** — dynamic must be requested at machine-creation time.
- **Dynamic** — "assigned automatically when a machine powers on and deleted when it powers off"; a **new
IP on every start**, so a hard-coded SSH alias breaks after a restart. **Charged only while the machine
runs** (auto-released on power-off → no idle IP cost).
A machine with **no public IP** is internet-isolated (and avoids the IP charge). **Private networks**
give team-isolated pools.
- **Ports / services.** Firewall is self-managed — open ports to expose services. Tunnel Jupyter (8888) /
TensorBoard (6006) over SSH on Core:
`ssh -L 8888:localhost:8888 -L 6006:localhost:6006 paperspace@<machine-ip>`
(placeholder host — substitute the machine's real IP/static address). In a Gradient Notebook, launch
TensorBoard in-Jupyter and write logs under `/storage` (or they vanish on stop).
- **SSH flavor.** Core = a standard Linux VM → full `ssh`/`scp`/`rsync` (ML-in-a-Box default user
`paperspace`). Gradient Notebooks expose a **Jupyter sandbox**, not a clean persistent SSH daemon —
there is no stable SSH-daemon story for a multi-day unattended run on a Notebook.
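Because a dynamic IP changes on every start, one option is to regenerate the SSH alias from the current address instead of hard-coding it. The sketch below prints an `ssh_config` stanza carrying the same two forwards as the tunnel command above. The helper name `core_ssh_stanza`, the alias, and the idea of writing the output to a file pulled in by an `Include` line are assumptions for illustration; `203.0.113.7` in the usage comment is a documentation-range placeholder, not a real machine.

```sh
# Print an ssh_config stanza for a Core machine with the Jupyter (8888) and
# TensorBoard (6006) local forwards.
# Usage: core_ssh_stanza <alias> <current-ip>
#   e.g. core_ssh_stanza core-gpu 203.0.113.7 > ~/.ssh/config.d/core-gpu
#   (assumes ~/.ssh/config contains "Include config.d/*")
core_ssh_stanza() {
    [ -n "$1" ] && [ -n "$2" ] || { echo "usage: core_ssh_stanza <alias> <ip>" >&2; return 1; }
    printf 'Host %s\n' "$1"
    printf '    HostName %s\n' "$2"
    printf '    User paperspace\n'                       # ML-in-a-Box default user
    printf '    LocalForward 8888 localhost:8888\n'      # Jupyter
    printf '    LocalForward 6006 localhost:6006\n'      # TensorBoard
}
```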
---
## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
**No AWS-style spot/preemptible tier** with a 2-minute interruption warning. The two interruption modes are
different in kind and BOTH are deterministic, not random eviction:
1. **Capacity-at-launch.** The desired GPU type may be unavailable when launching — a *launch-time*
availability problem, not a runtime eviction. On free notebooks this surfaces as **"out of capacity" /
the notebook sits "pending" in queue for the next free machine** (verified DO notebooks/how-to docs
2026-06). Build **retry-launch-until-available** logic, not a 2-minute-grace flush handler; for assured
access, a paid instance type bypasses the free queue.
2. **Auto-shutdown clock — the hard ceiling on any long run.** The timer is the real killer:
- **Gradient free** notebooks hard-stop at a **6-hour** maximum auto-shutdown (cannot be raised).
- **Paid notebooks** default to **12-hour** auto-shutdown; range **1 hour to 1 week**.
- **Core** machines allow a configurable **1 hour to 1 week** auto-shutdown.
- **Trap (Core/Linux):** Core Linux auto-shutdown is **wall-clock, not idle-based** — "Linux machines
shut down regardless of whether any users are connected" (only Windows waits for idle). An active
SSH/tmux session does **not** keep a Linux Core machine alive past the timer (verified DO
machines/how-to/manage-auto-shutdown 2026-06).
- **Trap (API):** auto-shutdown **cannot be enabled/disabled via API or CLI on an existing machine**:
"you can only manage the auto-shutdown feature via the Paperspace console" (same source). Set it
deliberately at create time / in the console.
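Since the clock is wall-clock, the time left can be computed rather than guessed. A minimal sketch, assuming the timer starts at power-on (this profile does not state the exact starting point, so treat the start epoch as an assumption and leave a generous margin). Both helper names are hypothetical.

```sh
# Seconds left before the wall-clock auto-shutdown fires.
# Usage: seconds_until_shutdown <start-epoch> <window-hours> [now-epoch]
seconds_until_shutdown() (
    start=$1
    window_h=$2
    now=${3:-$(date +%s)}          # default to the current time
    echo $(( start + window_h * 3600 - now ))
)

# Succeed when the remaining time is at or below the safety margin.
# Usage: should_checkpoint_now <start-epoch> <window-hours> <margin-seconds> [now-epoch]
should_checkpoint_now() (
    remaining=$(seconds_until_shutdown "$1" "$2" "${4:-}")
    [ "$remaining" -le "$3" ]
)
```

On a GNU/Linux Core machine the start epoch could be taken from `date -d "$(uptime -s)" +%s`, again under the power-on assumption.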
The window is deterministic, so plan around it: a tmux session inside a Notebook **still dies at the
timeout** (§6). **Resume hook:** checkpoint full state to `/storage` (Notebooks) or the block disk
(Core) *before* the auto-shutdown window, then restart and load-latest-on-startup unconditionally.
Because the clock is known in advance, cadence can be planned rather than guessed — but the
load-latest-on-startup spine (principle #8) is what makes the restart idempotent. Young/Daly cadence
formula → `references/spot-resilience.md`.
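The load-latest-on-startup spine needs one deterministic way to pick the checkpoint to resume from. The sketch below assumes checkpoints are named `step_<N>.pt` in the durable directory; that naming scheme, the helper name `latest_checkpoint`, and the `train.py --resume` call in the usage comment are all hypothetical. It selects by step number rather than modification time, so a copy or restore that resets timestamps does not change the result.

```sh
# Print the checkpoint with the highest step number, or fail if there is none.
# Usage: latest_checkpoint <durable-dir>
#   e.g. CKPT=$(latest_checkpoint "$DURABLE_DIR") && python train.py --resume "$CKPT"
latest_checkpoint() (
    dir=$1
    best_step=-1
    best_file=
    for f in "$dir"/step_*.pt; do
        [ -f "$f" ] || continue                       # unmatched glob stays literal
        step=${f##*/step_}                            # drop directory and prefix
        step=${step%.pt}                              # drop suffix
        case $step in ''|*[!0-9]*) continue ;; esac   # skip non-numeric names
        if [ "$step" -gt "$best_step" ]; then
            best_step=$step
            best_file=$f
        fi
    done
    [ -n "$best_file" ] && printf '%s\n' "$best_file"
)
```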
---
## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law — the most error-prone section)*
Per-hour billing (verified DO products/paperspace/pricing 2026-06). **A shut-down/power-off STOPS the
compute (GPU) meter** while disk persists — this is the AutoDL-like part. **But it does NOT stop every
meter.**
- **What a stop still bills (the trap):** "When a Paperspace machine is powered off, attached **storage**,
**public IP addresses**, and other **add-ons** continue to be billed on an hourly basis until you destroy
those resources." Gradient `/storage` over the plan allowance and Core block storage both keep charging
while the machine is off.
- **The monthly-cap softener:** non-GPU resources (storage, public IP, snapshots) have a
**maximum monthly charge** — "once a non-GPU resource reaches its monthly maximum, it no longer incurs
charges for the rest of the billing cycle." Static public IP caps at **$3.00/mo** ($0.0045/hr). So a
forgotten static IP is a bounded ~$3/mo bleed, but a forgotten 2 TB block volume is **~$120/mo** until
destroyed (verified DO pricing 2026-06).
- **What actually stops the full meter:** **destroy the machine** AND **release the static IP** AND
**delete the storage** (AND delete any **snapshot**) — separate actions. "To stop all charges for a
machine and its add-ons, destroy the machine and any resources you no longer need." A stopped-but-not-
destroyed machine with a Static IP, a 2 TB block volume, and a leftover snapshot is still spending money.
- **Irreversible:** **destroy/delete** of a machine removes its block storage (no recovery); block-storage
**expansion** is also one-way. A **shut-down is reversible** (resume later).
**Net contrast vs the other profiles:** Paperspace gives a real idle-cheap *stop* (unlike Lambda, which has
no stop), but unlike AutoDL's 关机 the **storage + IP + snapshots keep billing** until each is explicitly
destroyed/released. "Stopped" ≠ "free."
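The monthly-cap rule is simple arithmetic: hourly rate times hours billed, clamped at the resource's maximum. A small sketch using the static-IP figures quoted above ($0.0045/hr, $3.00/mo cap); the helper name `capped_monthly_cost` is hypothetical and other resources would need their own rate and cap from the pricing page.

```sh
# Monthly charge for a non-GPU add-on billed hourly up to a monthly maximum.
# Usage: capped_monthly_cost <hourly-rate> <monthly-cap> <hours-billed>
capped_monthly_cost() {
    awk -v rate="$1" -v cap="$2" -v hours="$3" 'BEGIN {
        cost = rate * hours
        if (cost > cap) cost = cap      # charges stop accruing at the cap
        printf "%.2f\n", cost
    }'
}
```

With these figures a static IP reaches its cap after about 667 billed hours, so a full month of a forgotten IP costs the $3.00 maximum rather than the uncapped hourly total.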
> **Iron Law (teardown gate):** NO destroy/delete of the machine, release of the IP, or deletion of
> `/storage`/block-storage/snapshot until checkpoints are **pulled to local AND verified by load**, and the
> user has **explicitly approved** the specific cost-affecting action. A destroy is irreversible — "it
> looked done in the log" is not evidence (principle #3). General form →
> `superpowers:verification-before-completion`.
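Part of the teardown gate can be mechanized. The sketch below covers only two of the conditions: the local copy matches a checksum recorded on the remote, and the user typed an explicit approval word. It does not prove the checkpoint loads, so the verified-by-load step still has to be run separately. The helper name `teardown_gate` and the approval word `DESTROY` are hypothetical; `sha256sum` is the GNU coreutils tool (other systems may need `shasum -a 256`).

```sh
# Hypothetical teardown gate: succeed only if the local checkpoint exists,
# its SHA-256 matches the value recorded on the remote, and approval is explicit.
# Usage: teardown_gate <local-file> <expected-sha256> <approval-word>
teardown_gate() (
    file=$1
    expected=$2
    approval=$3
    [ -f "$file" ] || { echo "missing local copy: $file" >&2; return 1; }
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ] || { echo "checksum mismatch for $file" >&2; return 1; }
    [ "$approval" = "DESTROY" ] || { echo "no explicit approval given" >&2; return 1; }
)
```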
---
## 6. DAEMON TOOL
- **Core machines** — full VMs ⇒ `tmux`/`screen`/`nohup` all available; SSH is as stable as any cloud VM.
This is the closest analog to the AutoDL tmux-resilient pattern. tmux survives an SSH drop; it does NOT
survive a machine **stop/restart** (the process is gone), and — critically on Core/Linux — a live tmux
session does **not** defer the wall-clock auto-shutdown (§4), so durability still rests on
checkpoint-to-disk + load-latest (principle #8), not on the detach primitive.
- **Gradient Notebooks** — a managed Jupyter sandbox: **no clean persistent SSH-daemon story**, and the
**auto-shutdown timer is a hard ceiling** — a tmux session started inside a Notebook **still dies at the
timeout**. Notebooks are not built for unattended multi-day daemons.
- **Platform-native long-job mechanisms** — **Workflows** (DAG automation) and **Deployments** (always-on
serving). For training-as-a-daemon, prefer **Core + tmux**; treat Notebooks as interactive/short-run only.
If `tmux` is absent on a minimal image, fall back to `nohup <cmd> </dev/null >log 2>&1 &`.
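The tmux-first, nohup-fallback choice can be wrapped in one launcher. A minimal sketch: `run_detached` is a hypothetical helper, it reads the `DETACH` value from the script overrides (unset means tmux, anything else forces the nohup path), and it assumes the log path contains no single quotes because the tmux branch embeds it in a command string. Neither branch survives a machine stop or the auto-shutdown clock.

```sh
# Start a command detached from the SSH session: tmux when requested and
# installed, otherwise the nohup fallback.
# Usage: run_detached <session-name> <log-file> <command-string>
run_detached() {
    if [ "${DETACH-tmux}" = tmux ] && command -v tmux >/dev/null 2>&1; then
        # -d: create the session without attaching; it outlives an SSH drop.
        tmux new-session -d -s "$1" "( $3 ) >>'$2' 2>&1"
    else
        # Same shape as the fallback above: no stdin, all output to the log.
        nohup sh -c "$3" </dev/null >>"$2" 2>&1 &
    fi
}
```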
---
## 7. TOP GOTCHAS (platform-pinned; universal ones → `references/gotchas_universal.md`)
- **PS1 — "Stopped the machine, still getting billed."**
Symptom: GPU meter halted but the bill keeps climbing while the box is off.
Root cause: shut-down stops only the **compute** meter; attached **storage** + **public IP** + add-ons +
snapshots bill hourly until destroyed/released (verified DO pricing 2026-06).
Fix: to truly stop the meter, **destroy the machine, release the Static IP, delete the storage and any
snapshot** — separate teardown actions. Audit for orphaned storage/IPs/snapshots after every stop.
- **PS2 — A long run dies at a round-number wall-clock with no error.**
Symptom: training vanishes at exactly 6 h / 12 h (or the configured Core window); no traceback.
Root cause: the **auto-shutdown clock**, not a crash — free notebooks 6 h (hard cap), paid notebooks 12 h
default, Core 1 h–1 wk. On Core/Linux the clock is **wall-clock, not idle**: an active SSH/tmux session
does NOT extend it (verified DO manage-auto-shutdown 2026-06).
Fix: checkpoint to `/storage` (Notebooks) or the block disk (Core) **before** the window; for Core, raise
the auto-shutdown to the longest needed **in the console** (API/CLI cannot change it post-create);
restart + load-latest to resume.
- **PS3 — `pip install` (or any non-`/storage` write) vanishes after a Notebook restart.**
Symptom: packages installed in-session are gone next session; "saved" files disappear after stop/restart.
Root cause: `pip` writes to `/usr/local/lib`, which is **ephemeral workspace** — only `/storage` and
`/notebooks` persist (verified fast.ai forum + DO storage-architecture 2026-06). "Machines are snapshots,
not servers," so in-session installs do not persist.
Fix: install into a persisted dir: `pip install --target /storage/pyenv <pkg> && export PYTHONPATH=/storage/pyenv`,
or `pip install --user <pkg>` only if the user site (`python -m site --user-site`) resolves under `/storage` or
`/notebooks`; write all checkpoints/logs/outputs under `/storage`; verify they landed (`ls`/checksum) before stop.
- **PS4 — Automation 404s / silently no-ops / installs the wrong SDK.**
Symptom: a `gradient`-era create/stop call fails or does nothing; or `pip install gradient` (v3+) imports
an inference SDK with no notebook/machine commands.
Root cause: **legacy Gradient REST endpoints deprecated 15 Jul 2024**; **`gradient-cli` v2 deprecated**;
**`gradient-python` v3 is the DigitalOcean Gradient AI inference SDK — a name collision**, not the
orchestration CLI (verified github.com/Paperspace/gradient-cli + digitalocean/gradient-python 2026-06).
Fix: for new work use the **`pspace` CLI** (github.com/Paperspace/cli); to keep old scripts alive pin
`pip install "gradient<3.0"`. Pin and verify the CLI binary + version in any automation.
- **PS5 — Custom template / storage / volume "not found" in a different datacenter.**
Symptom: a saved template or block volume is unavailable when launching elsewhere; block-storage resize
can't be undone.
Root cause: storage and templates are **region/DC-locked**, and **block-storage expansion is
irreversible** (one-way filesystem grow).
Fix: pick the datacenter deliberately and keep storage+compute+template co-located; size block storage
with headroom up-front (cannot shrink).
- **PS6 — SSH alias breaks after every restart.**
Symptom: the saved `ssh` host no longer connects after a machine restart.
Root cause: a **Dynamic public IP** is released on power-off and reassigned on start (new IP each time).
Fix: attach a **Static IP** for stable SSH/endpoint addressing (it bills until deleted, capped $3/mo —
`PS1`), or re-resolve the address on each start before scripting. Note: API/CLI can manage a *static* IP
but cannot add a *dynamic* one to an existing machine (request dynamic at create time).
- **PS7 — Free-tier notebook code is PUBLIC by default.**
Symptom: proprietary/confidential code is world-readable in a Gradient free notebook.
Root cause: free Gradient notebooks are **public by default; private notebooks require a paid plan**
(verified Paperspace blog / pricing 2026-06).
Fix: never put confidential code or any secret in a free notebook; upgrade to a paid plan for private
notebooks. Treat the free tier as a public scratchpad. (Secrets hygiene → `references/gotchas_universal.md`.)
- **PS8 — Free notebook won't start / sits "pending."**
Symptom: a free-GPU notebook stays pending or errors "out of capacity"; only one notebook will run.
Root cause: free tier = **1 concurrent running notebook, ≤5 projects, 5 GB `/storage`**, and free machines
are pooled — a pending notebook is queued for the next free machine (verified Paperspace free-instances
docs + blog 2026-06).
Fix: expect queueing on free; stop the other free notebook (only one runs); for assured access use a paid
instance type, which skips the free queue.
- **PS9 — A destroyed machine keeps billing via a leftover snapshot.**
Symptom: machine destroyed, yet a small monthly charge persists.
Root cause: **snapshots are a separate resource that survives a machine destroy** and bills at
`$0.29/GB/mo` until deleted; auto-snapshot defaults to "Never"/0 but a manually-enabled policy (daily by
default, up to 10 stored) silently accrues (verified DO pricing + blog/automated-snapshots 2026-06).
Fix: when tearing down, delete the snapshot too (console or CLI); audit the snapshots list after every
machine destroy. Capped per-resource by the monthly maximum but still a bleed.
- **PS10 — Notebook upload/import fails on the 5 GB free cap.**
Symptom: uploading a multi-GB dataset to `/storage` fails for an unpaid account.
Root cause: free `/storage` allowance is **5 GB**; overage is **$0.29/GB/mo** (paid plans include more:
e.g. 200 GB / 1 TB tiers) (verified Paperspace pricing + fast.ai forum 2026-06).
Fix: stream/stage the dataset rather than uploading the whole thing, prune aggressively, or upgrade the
plan; redirect HF/torch caches off `/storage` if they would push over the allowance.
- **PS11 — ML-in-a-Box CUDA/driver too old for current PyTorch on a new-arch GPU.**
Symptom: `The NVIDIA driver on your system is too old (found version 110xx). Please update your GPU
driver`, or `no kernel image is available for execution` on a fresh card.
Root cause: the template's **host driver/CUDA stack lags newer PyTorch wheels**; on a rental the host
driver is host-global and a tenant usually cannot upgrade it (verified github.com/Paperspace/ml-in-a-box
issue #13 2026-06). This is the platform-pinned face of the universal CUDA-triangle (U28).
Fix: install a torch build matching the box's CUDA (do not force-upgrade the host driver on a rental);
pick a template whose Ubuntu/driver matches the GPU (22.04 for H100/A100). Full triangle → U28 in
`references/gotchas_universal.md`.
- **PS12 — Gradient Deployment / custom image won't pull or drifts.**
Symptom: a Deployment fails to pull `<user>/img:tag`, or "the same image" behaves differently over time.
Root cause: a moving tag (`:latest`) resolves to a different layer set; private-registry creds missing.
Fix: pin the image by digest (`@sha256:`) and supply registry creds as a Gradient **secret**, not inline.
General form → U30 in `references/gotchas_universal.md`.
- **PS13 — Platform-specific debugging.** Commands + what to check (Core uses standard Linux tooling; the
Notebook-only items are the platform delta):
- **Confirm GPU + driver/torch match:** `nvidia-smi` (driver/CUDA version) then
`python -c "import torch;print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"`;
a mismatch here is `PS11`/U28, not a code bug.
- **Find what is eating the 5 GB / over-allowance `/storage` (the platform's own recommended cmd):**
`du -sch .[!.]* * | sort -h` (or `!du -sch …` in a cell); install `ncdu` for an interactive view
(verified DO notebooks/how-to/manage-storage 2026-06). Check `df -h` AND `df -i` (inodes, U7).
- **Is a Notebook write durable?** `df -h /storage /notebooks` and confirm the target is one of those two
mounts — anything else (incl. `/usr/local/lib`) is ephemeral (`PS3`).
- **Why did the run vanish?** Walk the universal ladder (U3): `dmesg | grep -iE 'killed process|out of
memory'` (OOM?), `uptime` (recent reboot = auto-shutdown fired, `PS2`), `nvidia-smi` (GPU idle = died,
not hung). A restart at a round-number interval matching the configured window, with a clean `dmesg`, ⇒ auto-shutdown, not a crash.
- **Detect a stuck/slow download:** watch the target file size grow
(`watch -n5 'ls -l /storage/<file>'`); a flat size with a live process = stalled wire (U12 resumable
loop). Egress is direct/unproxied here, so a stall is route/peer, not a missing proxy hook.
- **Audit orphaned billables before declaring teardown done:** in the console (or `pspace`) list
machines, **public IPs**, **storage/volumes**, and **snapshots**; `PS1`/`PS9` hide in the last two.
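The digest-pinning fix from `PS12` can be enforced before a Deployment is created. A minimal sketch with a hypothetical helper name, `is_digest_pinned`: it only checks the shape of the reference (`name@sha256:` followed by 64 lowercase hex characters) and does not contact a registry, so a well-formed but nonexistent digest still passes.

```sh
# Succeed only if an image reference is pinned by digest: name@sha256:<64 hex>.
# Usage: is_digest_pinned <image-reference>
is_digest_pinned() (
    case $1 in
        *@sha256:*) digest=${1##*@sha256:} ;;
        *) return 1 ;;                      # tag-only references are rejected
    esac
    [ "${#digest}" -eq 64 ] || return 1
    case $digest in
        *[!0-9a-f]*) return 1 ;;            # anything outside lowercase hex
    esac
)
```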
---
## 8. SCRIPT OVERRIDES
Values to parameterize the `scripts/` templates for Paperspace. Forward-slash paths; placeholders for any
host/IP (never a real address). Core and Gradient differ — both shown.
```sh
# --- Gradient Notebook ---
DATA_DIR=/storage # team-shared persistent; survives stop AND notebook delete
DURABLE_DIR=/storage # checkpoints land here (NOT /notebooks — dies with the notebook)
SCRATCH=/tmp # ephemeral workspace; wiped on stop — never the only copy
HF_HOME=/storage/.cache/huggingface # redirect cache off ephemeral workspace (watch the 5 GB free cap, PS10)
PROXY_HOOK= # none — direct egress (no network_turbo)
CRED_FILE="" # Paperspace keys are Gradient secrets / env vars, not files — WANDB_API_KEY/HF_TOKEN arrive via the secret/env (run_one's [ -n "$CRED_FILE" ] guard skips the file read); never write keys to /storage (team-shared)
DETACH= # no clean tmux; Jupyter kernel + hard 6h/12h auto-shutdown ceiling
# NOTE: pip into /storage to persist: pip install --target /storage/pyenv <pkg> && export PYTHONPATH=/storage/pyenv (PS3)
# --- Core machine (preferred for daemonized training) ---
DATA_DIR=/path/to/blockstore # placeholder — the attached block disk mount
DURABLE_DIR=/path/to/blockstore/ckpts
SCRATCH=/tmp
HF_HOME=/path/to/blockstore/.cache/huggingface
PROXY_HOOK= # none
CRED_FILE="" # Paperspace keys are Gradient secrets / env vars, not files — WANDB_API_KEY/HF_TOKEN arrive via the secret/env (run_one's [ -n "$CRED_FILE" ] guard skips the file read); inject at launch, never inline
DETACH=tmux # survives SSH drop, NOT a machine stop, and NOT the wall-clock auto-shutdown — rely on checkpoint+resume
SSH_HOST=<machine-ip> # placeholder — ML-in-a-Box user is `paperspace`; pin a Static IP for a stable alias (PS6); dynamic IP changes every start
```
Reminder: secrets referenced by env-var NAME or Gradient secret only — never inline a key, and never write
one onto the team-shared `/storage` (universal secrets gotcha → `references/gotchas_universal.md`).
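Because the Core values above are placeholders, a script that sources them can fail fast when one was never substituted. A minimal sketch; `check_override` is a hypothetical helper and should be applied only to values that are required, since `PROXY_HOOK`, `CRED_FILE`, and the Notebook `DETACH` are intentionally empty.

```sh
# Fail if a required template value is empty or still a placeholder from this
# profile (an <angle-bracket> token or a /path/to/... path).
# Usage: check_override <name> <value>
check_override() {
    case $2 in
        ''|*'<'*'>'*|/path/to/*)
            echo "$1 is empty or still a placeholder: '$2'" >&2
            return 1 ;;
    esac
}
```

A launch script might call `check_override SSH_HOST "$SSH_HOST"` and `check_override DURABLE_DIR "$DURABLE_DIR"` before opening any connection.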