---
platform: lambda
kind: cloud-api               # REST API / web console / SSH to a normal Ubuntu VM
meter_stop_verb: terminate    # the ONLY action that stops billing; sudo shutdown does NOT
meter_stop_irreversible: true # terminate wipes local NVMe — there is no stop/suspend state
detach_primitive: tmux        # plain Ubuntu; tmux/screen/nohup, install if absent
spot_available: false         # no spot/preemptible tier; interruption is capacity-at-launch
spot_grace: n/a               # no mid-run eviction → no grace window
shared_fs: true               # region-locked NFS filesystem, attach-at-launch only
inode_cap: none               # no documented inode cap; GiB quota only
free_egress: true             # no ingress/egress fees on instances or filesystems
china_mirror_needed: false    # US/global cloud, direct egress; no platform proxy
host_driver_cuda_max: lambda-stack-dependent  # Lambda Stack bundles driver+CUDA+PyTorch; version moves per release — read nvidia-smi on the box, do NOT assume a number
local_nvme: true              # ephemeral root/local NVMe, gone on terminate
---

# Lambda Cloud — Profile

Lambda Cloud is a **cattle-not-pets** GPU cloud: on-demand + reserved instances, a prebuilt **Lambda
Stack** image, and **no stop/suspend state** — an instance can only be **launched, restarted, or
terminated**, and terminate destroys the local NVMe. Nothing on the box survives a teardown except what was
pushed off or written to an attached **region-locked NFS filesystem**. This inverts the AutoDL "shut down and keep the data"
instinct: here, durable design (checkpoint-to-NFS + idempotent resume) is **mandatory, not optional**.

> **Surface to the user up front (principle #10):** ⚠️ Danger clocks — there is **no stop/suspend**: an instance can only be launched / restarted / **terminated, and terminate wipes the local NVMe** — only the attached **NFS filesystem** survives, and **it keeps billing until you delete it manually** (LAM6). Conveniences — one-click **JupyterLab** per instance, free egress both directions. A terminate→relaunch yields a **new IP**.

> Docs/console domain moved from `lambdalabs.com` to `lambda.ai` (docs at `docs.lambda.ai`, console at
> `cloud.lambda.ai`); the **REST API base is still `cloud.lambdalabs.com/api/v1`** and `cloud.lambda.ai`
> also resolves (verified docs.lambda.ai + cloud-api 2026-06). Treat both hosts as live.

To jump: `grep -in <keyword> profiles/lambda.md`.

**Table of contents** — 1. LAUNCH · 2. STORAGE MODEL (survival matrix) · 3. NETWORK ·
4. SPOT / INTERRUPTION + RESUME · 5. TEARDOWN / BILLING · 6. DAEMON TOOL · 7. TOP GOTCHAS (LAM1–LAM13) +
Platform-specific debugging · 8. SCRIPT OVERRIDES.

Universal gotchas (CRLF, inode/`df -i`, silent sync, cgroup OOM, spot grace) are NOT repeated here —
see `references/gotchas_universal.md`. Universal invariants → `references/principles.md`.

---

## 1. LAUNCH

Entry points:
- **Web console** at `cloud.lambda.ai` → Instances → Launch (pick GPU type + region, attach a filesystem
  here if one is needed — see §2; attach any per-instance firewall ruleset here too — see §3/LAM4).
- **REST API** — `https://cloud.lambdalabs.com/api/v1`, auth `curl -u $LAMBDA_API_KEY:` (basic-auth,
  password empty). Canonical automation surface (verified docs.lambda.ai/api/cloud 2026-06):
  - `GET  /instance-types` — lists every GPU type **and** `regions_with_capacity_available[]` per type.
    This field IS the capacity signal — poll it to know where a type can launch right now (drives LAM5
    retry-until-available).
  - `POST /instance-operations/launch` · `.../terminate` · `.../restart` — create / stop-meter / reboot.
- **SSH** — standard connection to a normal Ubuntu VM; **default user is `ubuntu`** (not `root`); use
  `sudo` for root. One-click **JupyterLab** is offered per instance.
- **SkyPilot** — de-facto orchestration layer: `pip install "skypilot[lambda]"`, key file at
  `~/.lambda_cloud/lambda_keys` containing a line `api_key = <KEY>` (verified docs.skypilot.co 2026-06).
  Use it for retry-until-capacity + autostop (§4, §6).
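
The REST entry point above can be exercised from Python with only the standard library. This is a minimal sketch, assuming the key lives in the `LAMBDA_API_KEY` environment variable; the helper names (`basic_auth_header`, `build_request`, `api_call`) are hypothetical, and any request body fields should be checked against the API reference before use.

```python
import base64
import json
import os
import urllib.request
from typing import Optional

API_BASE = "https://cloud.lambdalabs.com/api/v1"  # base URL quoted in this profile


def basic_auth_header(api_key: str) -> str:
    """Equivalent of `curl -u $LAMBDA_API_KEY:` (key as user, empty password)."""
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return f"Basic {token}"


def build_request(path: str, api_key: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET (no body) or POST (JSON body) request without sending it."""
    data = json.dumps(body).encode() if body is not None else None
    method = "POST" if body is not None else "GET"
    req = urllib.request.Request(API_BASE + path, data=data, method=method)
    req.add_header("Authorization", basic_auth_header(api_key))
    if body is not None:
        req.add_header("Content-Type", "application/json")
    return req


def api_call(path: str, body: Optional[dict] = None, api_key: Optional[str] = None, timeout: int = 30):
    """Send the request and return the parsed JSON response."""
    api_key = api_key or os.environ["LAMBDA_API_KEY"]
    with urllib.request.urlopen(build_request(path, api_key, body), timeout=timeout) as resp:
        return json.load(resp)
```

For example, `api_call("/instance-types")` would return the capacity listing. Request construction is kept separate from sending so the auth and body handling can be checked offline.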

**Env contract — the image/base IS the env.** Instances ship **Lambda Stack** (NVIDIA driver + CUDA +
cuDNN + PyTorch/TensorFlow, all upgraded together as one apt metapackage). Run in it directly on the
throwaway box — do **not** `conda create` on a rental (`references/principles.md` §2), and do not `pip
install torch` over the top (LAM7/LAM8). Lambda Stack's exact CUDA/driver/PyTorch **moves per release**;
read it off the box (`nvidia-smi`, `python -c "import torch;print(torch.__version__,torch.version.cuda)"`)
rather than assuming a number. The **durable** form of the env is a Docker image (Lambda recommends running
Docker inside the instance) or a setup script replayed on each launch — because terminate destroys the box.
Reserved / 1-Click Clusters provide flat-rate multi-node (own billing model — LAM12).

> **verify:** `ssh ubuntu@<IP> 'python -c "import torch;print(torch.cuda.is_available())"'` → `True`.

---

## 2. STORAGE MODEL  *(survival matrix — principle #4)*

Two tiers, and the trap is that the default working location is the **volatile** one.

- **Local / root NVMe** — fast, per-instance, **ephemeral**. Docs: *"Data not stored in the mount location
  is erased once you terminate your instance and cannot be recovered"* (verified docs.lambda.ai
  creating-managing-instances 2026-06). This is where work lands by default.
- **NFS filesystem** — a regional network filesystem mounted at `/lambda/nfs/<name>` (docs example mount:
  `/lambda/nfs/persistent-storage`). **The only durable home.** Three hard constraints (verified
  docs.lambda.ai/public-cloud/filesystems 2026-06):
  - **Region-locked** — *"The filesystem must reside in the same region as the instance or cluster"* and
    *"Filesystems cannot currently be transferred between regions."* Pick the region deliberately at create.
  - **Attach-at-launch only** — *"You must attach the filesystem … at the time that the instance … is
    launched"* and *"You can't attach a filesystem after you've created an instance."*
  - Billed **$0.20/GiB/month in 1-hour increments**, **free ingress/egress**; **up to 24 filesystems per
    account**; most regions allow up to 8 EB/filesystem but **us-south-1 (Texas) caps at 10 TB**.
- **No documented inode cap** — GiB quota only; no `df -i` ceiling surfaced (still audit `df -i` per the
  universal storage gotcha).

| Tier | Path | Survives RESTART? | Survives TERMINATE? | Cap |
|---|---|---|---|---|
| Local / root NVMe | `/`, `/home/ubuntu` | yes (data persists; **but cold reboot wipes RAM** — LAM9) | **NO** (erased, unrecoverable) | instance root volume |
| NFS filesystem | `/lambda/nfs/<name>` | yes | **yes** (separate lifecycle; keeps billing, see LAM6) | GiB quota; capped at 10 TB in us-south-1, up to 8 EB in most other regions |

**Checkpoints MUST go to** `/lambda/nfs/<name>` (the durable tier) so that they survive the §5 `terminate` verb. A
checkpoint left on local NVMe dies with the box. If no filesystem was attached at launch, the only durable
path is to `pull` the result off-box (free egress) before terminating.
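
A wrapper can make the durable-versus-volatile choice explicit at startup instead of trusting a default path. The sketch below is one possible approach, not part of the skill's scripts: `pick_checkpoint_dir` is a hypothetical helper, and the `checkpoints` subdirectory name is an assumption.

```python
import os
import warnings


def pick_checkpoint_dir(nfs_mount: str, local_dir: str, is_mount=os.path.ismount) -> tuple:
    """Return (directory, durable) for checkpoints.

    durable is True only when nfs_mount is a real mount point, i.e. a
    filesystem was attached at launch. Otherwise fall back to local NVMe
    and warn that the data must be pulled off-box before terminate.
    """
    if is_mount(nfs_mount):
        ckpt_dir, durable = os.path.join(nfs_mount, "checkpoints"), True
    else:
        warnings.warn(
            f"{nfs_mount} is not mounted; checkpoints in {local_dir} "
            "are erased on terminate. Pull them off-box first."
        )
        ckpt_dir, durable = os.path.join(local_dir, "checkpoints"), False
    os.makedirs(ckpt_dir, exist_ok=True)
    return ckpt_dir, durable
```

Called as `pick_checkpoint_dir("/lambda/nfs/<name>", "/home/ubuntu")` with the real filesystem name filled in, the returned `durable` flag tells the wrapper whether a periodic `pull` is required.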

---

## 3. NETWORK

- **Direct, unproxied egress.** US/global cloud — egress to HF / GitHub / PyPI is direct; **no
  `network_turbo`-style accelerator exists**, and none is needed. A China mirror is **N/A as a platform
  feature**: it matters only when the user operates from inside China, in which case
  `references/china-network.md` applies to the user's own setup; Lambda provides nothing here.
- **Free egress both directions** — *"Transparent pricing with no egress fees"* (verified lambda.ai
  pricing 2026-06). Re-pulling a large model or pushing results off-box costs nothing, making
  "pull-before-terminate" the cheap, safe default when no NFS is attached.
- **Firewall** — default allows *"only incoming ICMP traffic or TCP traffic on port 22 (SSH)"*. Open more
  via **global rules** (apply workspace-wide) or **per-instance rulesets** (region-scoped). Per-instance
  rulesets: *"You must attach rulesets during the instance launch process. You can't attach them after the
  instance has been launched"* and *"You can't remove rulesets from an instance after the instance has been
  launched"* (verified docs.lambda.ai/public-cloud/firewalls 2026-06) → plan port exposure before launch
  (gotcha LAM4). Global rules can still be edited on the workspace afterward.
- **Exposing TB / Jupyter** — instances get a public IP; tunnel over SSH rather than opening ports:
  `ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>`. No platform-pinned TensorBoard dir —
  run TB on `:6006` against the logdir under the NFS mount.
- **SSH flavor** — direct TCP to a normal VM (`ubuntu@<IP>`); full `scp`/`rsync` work, no proxy-jump quirk.
  **No static IP feature** — *"On-Demand Cloud doesn't support static IP addresses"* (verified DeepTalk
  staff 2026-06). The IP is fixed for an instance's life, but **terminate→relaunch yields a NEW IP**
  (LAM10) — re-read it from the console/API every launch; never hard-code it in automation.
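
Because the IP changes on every relaunch, the tunnel command is best generated from a fresh API response rather than stored. A minimal sketch, assuming the `GET /instances` response is a `data` list whose entries carry `id` and `ip` fields (these field names are assumptions, as are the helper names):

```python
import shlex


def instance_ip(instances_payload: dict, instance_id: str) -> str:
    """Look up the current public IP; the `id` and `ip` field names are assumed."""
    for inst in instances_payload.get("data", []):
        if inst.get("id") == instance_id:
            ip = inst.get("ip")
            if not ip:
                raise LookupError(f"instance {instance_id} has no IP yet")
            return ip
    raise LookupError(f"instance {instance_id} not found")


def tunnel_command(ip: str, ports=(8888, 6006), user: str = "ubuntu") -> str:
    """Build the SSH local-forward command for the given ports."""
    forwards = " ".join(f"-L {p}:localhost:{p}" for p in ports)
    return f"ssh {forwards} {shlex.quote(user + '@' + ip)}"
```

With the default ports this reproduces the Jupyter and TensorBoard forwards shown in the list above.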

---

## 4. SPOT / INTERRUPTION + RESUME  *(principle #7/#8)*

**No spot / preemptible tier — and no mid-run eviction.** This is the key divergence from vast.ai/RunPod:
there is **no SIGTERM→SIGKILL grace window to survive**, because a running instance is never evicted
mid-epoch. The interruption model is different in kind:

- **Capacity-at-launch is the real failure.** The desired GPU type may be **unavailable when launch is
  attempted** — Lambda has **no spot tier to fall back to**, and real-world on-demand fill rates are
  spiky (one published 6-month log: ~64% same-day A100 success — i.e. ~1 in 3 attempts blocked; a 26 h
  "temporarily unavailable" stall scaling 2→4 H100; verified medium.com/@velinxs 2026-06). H100/B200
  capacity is the tightest. The resilience pattern is **retry-until-available**, not survive-eviction:
  poll `GET /instance-types` for `regions_with_capacity_available` and `POST .../launch` the moment a
  region appears (or let SkyPilot's provisioner retry across regions/types).
- **Self-inflicted termination only.** Once running, the only destructive events are an operator
  `terminate`, an **improper `sudo shutdown`** that pushes the box to **Alert** while still billing
  (LAM3 / §5), and a **cold reboot** that wipes RAM (LAM9).
- **Resume hook** — checkpoint full state to the NFS filesystem on a periodic timer, load-latest
  unconditionally on startup, so a fresh post-capacity launch resumes instead of restarting. Because the
  box is cattle, the resume path is exercised on *every* relaunch, not just after a rare preemption.
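
The retry-until-available pattern can be sketched as a small polling loop. This assumes the `GET /instance-types` response is a `data` object keyed by type name, as the `jq` probe in §7 implies; the shape of each region entry (an object with a `name`) is an assumption, and `fetch` is any caller-supplied function that returns the parsed response.

```python
import time


def regions_with_capacity(payload: dict, gpu_type: str) -> list:
    """Return region names where gpu_type can launch right now."""
    entry = payload.get("data", {}).get(gpu_type, {})
    regions = entry.get("regions_with_capacity_available", [])
    # Entries are assumed to be objects with a `name`; plain strings also work.
    return [r["name"] if isinstance(r, dict) else r for r in regions]


def wait_for_capacity(fetch, gpu_type: str, poll_seconds: float = 60,
                      max_attempts=None, sleep=time.sleep):
    """Poll fetch() until gpu_type has capacity; return the first region name.

    Always polls at least once. Returns None once max_attempts polls have
    all come back empty (max_attempts=None polls forever).
    """
    attempt = 0
    while True:
        regions = regions_with_capacity(fetch(), gpu_type)
        if regions:
            return regions[0]
        attempt += 1
        if max_attempts is not None and attempt >= max_attempts:
            return None
        sleep(poll_seconds)
```

The caller would issue the launch request as soon as a region is returned. Injecting `fetch` and `sleep` keeps the loop testable without network access or real delays.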

Cadence formula (Young/Daly) + atomic-write resume → `references/spot-resilience.md`. Here the formula's
μ is effectively "time between voluntary relaunches," not a preemption rate.
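
As a rough illustration of the cadence arithmetic, Young's first-order approximation gives the checkpoint interval as the square root of twice the checkpoint cost times the mean time between failures (the μ above). The sketch shows only that first-order form; Daly's refinement and the full treatment remain in `references/spot-resilience.md`, and the function name is hypothetical.

```python
import math


def young_interval_seconds(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum: sqrt(2 * checkpoint_cost * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
```

For instance, a 60-second checkpoint write and an expected 24 hours between voluntary relaunches gives `young_interval_seconds(60, 86400)`, roughly 3220 seconds, or a checkpoint about every 54 minutes.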

---

## 5. TEARDOWN / BILLING  *(principle #9 + the Iron Law)*

**TERMINATE is the meter-stop verb — and it is irreversible.** *"Billing begins the moment you launch an
instance and the instance passes health checks, and ends the moment you terminate the instance"*, billed
in **one-minute increments**, *"regardless if they're actively being used"* (verified
docs.lambda.ai/public-cloud/billing 2026-06).

> **The shutdown trap (most error-prone fact on this platform):** *"Do not use commands such as `sudo
> shutdown -h now` or `sudo systemctl poweroff` … These commands will not work as expected and will cause
> your instances to go into Alert status, and billing will continue"* (verified docs.lambda.ai 2026-06).
> Also `halt` / `shutdown -P 0` only stop the OS, not the meter (DeepTalk staff). Stop the meter **only**
> via `terminate` from the console or `POST /instance-operations/terminate` — which works even from inside
> the instance itself.

What each action preserves:
- **terminate** — stops the instance meter; **erases the local NVMe** (unrecoverable). The NFS filesystem
  has a **separate lifecycle** and survives — but it **keeps billing $0.20/GiB/month until explicitly
  deleted** (*"Billing continues as long as a filesystem exists, even if it's not mounted to an instance"*),
  so a filesystem forgotten after its instance was terminated is a silent ongoing charge (LAM6).
- **There is no stop/suspend state** — *"It currently isn't possible to pause (suspend) your instance …
  Your only options are to launch, restart, or terminate"* (verified docs.lambda.ai 2026-06). Idle-cheap
  pause is impossible; the only way to stop paying for compute is to destroy the box and rebuild later.
- **restart / cold reboot** — does **not** stop the meter and does **not** wipe disk, but a **cold reboot
  erases RAM and bypasses safe shutdown** — reserve it for a frozen box only (LAM9).

**Iron Law (SKILL.md Phase 5):** NO `terminate` until checkpoints are **pulled to local OR confirmed on
NFS by load-test** AND the user approves the cost-affecting action. Because terminate is destructive and
irreversible, an unverified `cp`/`rsync` to NFS means **permanent loss** — verify the sync (checksum /
`ls -l` / a load) before terminating, not after. Egress is free, so a belt-and-suspenders `pull` to local
is cheap. Cross-link: `superpowers:verification-before-completion` (REQUIRED) for the general gate.
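
The checksum half of that gate could look like the following sketch. It compares every file under a source directory against its copy in the destination (NFS mount or pulled local copy) and only reports safe when nothing differs and the user has approved. The function names are hypothetical, and a checksum match does not replace an actual load test of the checkpoint.

```python
import hashlib
import os


def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def unsynced_files(src_dir: str, dst_dir: str) -> list:
    """Return relative paths under src_dir that are missing or differ in dst_dir."""
    bad = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, src_dir)
            dst = os.path.join(dst_dir, rel)
            if not os.path.isfile(dst) or sha256_of(src) != sha256_of(dst):
                bad.append(rel)
    return sorted(bad)


def safe_to_terminate(src_dir: str, dst_dir: str, user_approved: bool) -> bool:
    """True only when the user approved AND every file is verified in dst_dir."""
    return user_approved and not unsynced_files(src_dir, dst_dir)
```

A teardown script would refuse to call terminate while `unsynced_files` returns anything, and would print the offending paths instead.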

---

## 6. DAEMON TOOL

- **Detach primitive: `tmux`** (or `screen` / `nohup`) on a standard Ubuntu VM — same playbook as the
  AutoDL tmux pattern. Install if absent (`sudo apt install -y tmux`); fall back to
  `nohup … </dev/null >log 2>&1 &`.
- **Survives an SSH drop, NOT a terminate.** tmux keeps the job alive across a dropped connection, but
  with no stop state the detach primitive can't survive a teardown — only the **checkpoint-to-NFS +
  idempotent resume** spine does (principle #8). tmux is the SSH-resilience layer; the checkpoint is the
  instance-resilience layer. (tmux also won't survive a cold reboot — LAM9.)
- **Native orchestration: SkyPilot** (managed jobs, autostop, retry-until-capacity) + **1-Click
  Clusters** for multi-node; no platform job-queue otherwise. SkyPilot moves the box on capacity loss but
  **restarts the process from scratch — the checkpoint-load restores progress** (don't assume the
  framework resumes training state).
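
The checkpoint-to-NFS plus idempotent-resume spine can be sketched as an atomic write paired with an unconditional load-latest. JSON stands in here for whatever serializer the training framework uses, and the `ckpt_<step>.json` naming and both function names are assumptions for illustration. Rename atomicity on a network filesystem is also assumed rather than verified for this platform.

```python
import json
import os
import re
import tempfile

_CKPT_RE = re.compile(r"^ckpt_(\d+)\.json$")


def save_checkpoint(state: dict, ckpt_dir: str, step: int) -> str:
    """Write to a temp file in the same directory, fsync, then rename into place."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final = os.path.join(ckpt_dir, f"ckpt_{step:08d}.json")
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, final)  # readers see the old file or the new one, never a partial
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return final


def load_latest(ckpt_dir: str):
    """Return (step, state) for the highest-numbered checkpoint, or (0, None)."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    steps = sorted(int(m.group(1)) for m in map(_CKPT_RE.match, os.listdir(ckpt_dir)) if m)
    if not steps:
        return 0, None
    with open(os.path.join(ckpt_dir, f"ckpt_{steps[-1]:08d}.json")) as f:
        return steps[-1], json.load(f)
```

A training script would call `load_latest` on every start and continue from the returned step, so a relaunch on a new box and a first launch follow the same code path.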

---

## 7. TOP GOTCHAS  (Lambda-pinned — universal ones live in `references/gotchas_universal.md`)

- **LAM1 — Terminate erases the local NVMe; there is no stop/suspend.**
  Symptom: relaunched instance is blank, yesterday's run gone. → Root cause: local storage is ephemeral
  (*"Data not stored in the mount location is erased … and cannot be recovered"*) and no stop state
  preserves it; the AutoDL "shutdown keeps my data" assumption is false. → Fix: design every workflow around
  destroy/recreate — checkpoint to `/lambda/nfs/<name>` or `pull` off-box before any terminate; never keep
  the only copy on local NVMe. (docs.lambda.ai 2026-06)

- **LAM2 — Filesystem is attach-at-launch only and region-locked.**
  Symptom: a running instance has no durable storage and one can't be added; or a us-east filesystem won't
  mount on a us-west instance. → Root cause: filesystems attach only at create time and can't move between
  regions. → Fix: decide the region and attach the filesystem **at launch**; co-locate instance +
  filesystem in the same region. (filesystems doc 2026-06)

- **LAM3 — `sudo shutdown` / `poweroff` keeps the meter running (Alert state).**
  Symptom: instance "powered off" but the bill keeps climbing. → Root cause: an in-OS shutdown sends the
  instance to **Alert** without stopping billing; `halt`/`shutdown -P 0` only stop the OS, not the meter.
  → Fix: stop the meter only via **terminate** (console or `POST /instance-operations/terminate`); never
  rely on an in-box poweroff. (billing doc + DeepTalk staff 2026-06)

- **LAM4 — Per-instance firewall rulesets are immutable post-launch.**
  Symptom: a needed inbound port can't be opened (or a wrong one removed) on a live instance. → Root cause:
  per-instance rulesets *"must [be attached] during the instance launch process"* and *"can't [be removed]
  after the instance has been launched."* → Fix: plan port exposure before launch, use an editable
  **global** rule, or tunnel over SSH (`-L`, §3) instead of opening a port. (firewalls doc 2026-06)

- **LAM5 — Capacity, not eviction, is the bottleneck (no spot fallback).**
  Symptom: launch fails / dashboard shows the desired GPU type unavailable; long stalls scaling up. → Root
  cause: on-demand supply for a specific GPU/region is exhausted (worst for H100/B200), and there is no
  spot tier to fall back to. → Fix: poll `GET /instance-types` for `regions_with_capacity_available` and
  launch the instant a region appears (or use SkyPilot's cross-region/type provisioner); resume from the
  NFS checkpoint once granted (§4). (cloud-api doc + medium.com/@velinxs 2026-06)

- **LAM6 — The NFS filesystem keeps billing after the instance is gone.**
  Symptom: all instances terminated, but storage charges continue. → Root cause: *"Billing continues as
  long as a filesystem exists, even if it's not mounted to an instance"* — $0.20/GiB/month until deleted.
  → Fix: after the final `pull` + verify, **delete the filesystem** (console Storage → Delete; requires
  terminating attached instances first) — a distinct teardown step. (billing + filesystems docs 2026-06)

- **LAM7 — `pip install torch` over Lambda Stack silently shadows or mismatches it.**
  Symptom: a `pip install` in `base` reports *"Defaulting to user installation because normal site-packages
  is not writeable"* and lands in `~/.local`, or a `torch==X` pin drags in a CUDA/torchvision combo that
  conflicts with the system build → import/CUDA errors. → Root cause: Lambda Stack PyTorch lives in
  system `/usr/lib/python3/dist-packages` (not pip-writable as `ubuntu`); pip's user install or a hard
  version pin diverges from it. → Fix: use the Stack's PyTorch as-is (don't reinstall), loosen pins
  (`torch>=2.x` not `==`), or fully isolate in a fresh venv/conda env and install torch there cleanly —
  don't half-mix pip-over-system. (DeepTalk threads 2026-06)

- **LAM8 — conda/venv that "borrows" Stack PyTorch via system-site-packages then breaks on pip.**
  Symptom: created a conda env to use the Stack's torch, then a later `pip install` pulls a second,
  conflicting torch or can't write site-packages. → Root cause: mixing `--system-site-packages` (to see
  the system torch) with pip installs into the same env creates two torch copies. → Fix: pick ONE model —
  either run in the bare Stack base (preferred on a rental), or build a fully self-contained env with
  `conda install pytorch torchvision` (no system-site-packages borrowing). (DeepTalk
  bypassing-lambda-stack thread 2026-06)

- **LAM9 — Cold reboot wipes RAM and tmux; warm restart still bills.**
  Symptom: after a "reboot" the detached training job is gone and the box came back clean-ish. → Root
  cause: a **cold reboot** *"erases all data currently in the instance's memory and bypasses the operating
  system's safe-shutdown mechanisms"* — kills tmux sessions and any in-RAM state; neither reboot stops the
  meter. → Fix: only cold-reboot a frozen box; rely on checkpoint-to-NFS, not on process survival across a
  reboot; expect to re-`ssh` and re-`tmux attach` (session may be gone). (console doc 2026-06)

- **LAM10 — No static IP; the public IP changes on terminate→relaunch.**
  Symptom: automation/SSH config hard-coded to yesterday's IP fails after a relaunch. → Root cause:
  *"On-Demand Cloud doesn't support static IP addresses"* — a fresh launch gets a fresh IP. → Fix: read
  the IP from the console / `GET /instances` on every launch; template SSH config dynamically; never
  hard-code it. (DeepTalk staff 2026-06)

- **LAM11 — `apt full-upgrade` on Lambda Stack images can break cuDNN/DOCA.**
  Symptom: after a recommended `apt-get update && apt-get upgrade` (or `full-upgrade` on 24.04 images), PyTorch/TF
  fails to find cuDNN, or full-upgrade itself fails on a DOCA package. → Root cause: a system cuDNN bump
  or DOCA repo state diverges from the Stack-bundled libs. → Fix: avoid blanket `full-upgrade` on a
  rental; if cuDNN is missing, symlink the Stack copies —
  `for so in /usr/lib/python3/dist-packages/tensorflow/libcudnn*; do sudo ln -s "$so" /usr/lib/x86_64-linux-gnu/; done`
  (note: Stack cuDNN is usable *only* by the Stack-installed PyTorch/TF). (troubleshooting doc 2026-06)

- **LAM12 — 1-Click Clusters / reserved bill differently than on-demand (commitment traps).**
  Symptom: expected per-minute pricing, got a 2-week minimum / weekly invoice / a reservation that expired.
  → Root cause: **1-Click Clusters** carry a **minimum 2-week commitment with weekly billing** (not
  per-minute); **reserved** capacity requires Lambda approval and the **invoice must be paid within ~10
  days or the reservation is forfeited**, on non-cancelable terms. → Fix: use plain on-demand single
  instances for per-minute experiments; only enter a cluster/reservation with confirmed sustained need and
  budget approval. (1-click-clusters docs + nOps/CheckThat 2026-06)

- **LAM13 — GH200 (ARM/aarch64) breaks `pip install torch` — needs the ARM build.**
  Symptom: on a 1× GH200 box, `pip install torch` installs a **CPU-only** wheel (no CUDA), or a pinned
  `torch==2.2.0` fails to resolve. → Root cause: GH200 is aarch64; the default PyPI torch wheel for
  aarch64 is CPU-only. → Fix: use Lambda Stack's pre-compiled ARM PyTorch (e.g. 2.4.1) as-is, or install
  from the CUDA index `pip install torch --index-url https://download.pytorch.org/whl/cu128` (aarch64 GPU
  wheels live there), or compile from source for newer versions; relax exact pins. (DeepTalk GH200 thread
  + pytorch.org 2026-06)
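
For LAM7, LAM8 and LAM13, a quick way to see which torch is actually being imported is to classify `torch.__file__` and check `torch.version.cuda`. The sketch below is a heuristic only: the path patterns are assumptions based on the locations named above and on common pip and conda layouts, and the function names are hypothetical.

```python
def classify_torch_origin(torch_file: str) -> str:
    """Classify where an imported torch came from, given torch.__file__."""
    path = torch_file.replace("\\", "/")
    if "/.local/" in path:
        return "pip-user"      # pip user install shadowing the Stack (LAM7)
    if "/usr/lib/python3/dist-packages/" in path:
        return "lambda-stack"  # system build, path as quoted in LAM7
    if "/envs/" in path or "/site-packages/" in path:
        return "isolated-env"  # venv or conda copy (LAM8)
    return "unknown"


def is_cpu_only_build(torch_version_cuda) -> bool:
    """torch.version.cuda is None on CPU-only wheels (the LAM13 symptom)."""
    return torch_version_cuda is None
```

On the box this would be used as `import torch; classify_torch_origin(torch.__file__)` together with `is_cpu_only_build(torch.version.cuda)`.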

### Platform-specific debugging
- **Confirm billing actually stopped:** after a teardown, check the instance is **gone** (not in *Alert*)
  via the console or `curl -u $LAMBDA_API_KEY: https://cloud.lambdalabs.com/api/v1/instances`; an
  Alert-state box (from an in-OS shutdown) is still charging (LAM3).
- **Capacity probe before launch:** `curl -u $LAMBDA_API_KEY: .../instance-types | jq '.data | to_entries[]
  | {type:.key, regions:.value.regions_with_capacity_available}'` — empty `regions` ⇒ that GPU type can't
  launch anywhere right now (LAM5); this is the loop condition for retry-until-available.
- **GPU sanity on the box:** `nvidia-smi` (driver/CUDA + util) and `python -c "import torch;
  print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"` — mismatch between
  `torch.version.cuda` and `nvidia-smi` CUDA usually means a pip-shadowed torch (LAM7/8/13), not a Stack
  problem.
- **Read the real Stack version, never assume:** `apt list --installed 2>/dev/null | grep -i lambda-stack`
  and `dpkg -l | grep -i cudnn` — confirm before debugging a "version mismatch."
- **Disk pressure on the ephemeral root:** `df -h /` and `df -h /lambda/nfs/<name>`; remember `/home/ubuntu`
  is volatile: large datasets/checkpoints filling the root volume are also *lost* on terminate, so moving
  them to NFS protects the data as well as freeing space.
- **Detect a stalled download:** background the pull (`nohup … &`) and watch growth —
  `watch -n5 'du -sh <target>; ls -l <target>'` (flat size for minutes ⇒ stalled; re-pull, egress is free).
- **Stuck/unreachable after reboot:** if SSH dies post-reboot, the box may be in *Alert* or networking
  failed to come up — check the console state and prefer a fresh **terminate→relaunch** (resume from NFS)
  over fighting a cold-reboot that already wiped RAM (LAM9).
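
The billing-stopped check can be automated against the parsed `GET /instances` response. This is a sketch under assumptions: the `data` list with `id` and `status` fields, and the `"terminated"` status string, are not confirmed by this profile, so anything still listed in another state is treated conservatively as still billing.

```python
def instance_status(instances_payload: dict, instance_id: str):
    """Return the listed status for instance_id, or None if it is gone."""
    for inst in instances_payload.get("data", []):
        if inst.get("id") == instance_id:
            return inst.get("status", "unknown")
    return None


def meter_stopped(instances_payload: dict, instance_id: str,
                  stopped_statuses=("terminated",)) -> bool:
    """True only if the instance is absent or in an explicitly stopped status.

    The status strings are assumptions; any other listed state, including
    an Alert-like one, counts as still billing.
    """
    status = instance_status(instances_payload, instance_id)
    return status is None or status in stopped_statuses
```

A teardown script would loop on `meter_stopped` after the terminate call and raise an alert if it stays `False`.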

---

## 8. SCRIPT OVERRIDES

Values to parameterize the `scripts/` templates for Lambda:

```
DATA_DIR=       /home/ubuntu (ephemeral NVMe — lost on terminate)
DURABLE_DIR=    /lambda/nfs/<name>
PROXY_HOOK=     (none — direct egress; no network_turbo)
CRED_FILE=      ""  (Lambda key is the $LAMBDA_API_KEY env var, not a file on disk — run_one's [ -n "$CRED_FILE" ] guard skips the file read and the env var passes through; SkyPilot key file at ~/.lambda_cloud/lambda_keys, format `api_key = <KEY>`)
SCRATCH=        prune periodic ckpts on local NVMe; keep only `best` on /lambda/nfs/<name>
HF_HOME=        /lambda/nfs/<name>/.cache/huggingface   (durable; survives terminate, free egress on re-pull)
DETACH=         tmux  (apt install if absent; nohup fallback)
SSH_USER=       ubuntu   (NOT root)
```

Notes for the wrapper:
- Default checkpoint dir → the NFS mount, not `/home/ubuntu` — the latter is erased on terminate.
- If no NFS filesystem is attached, set the wrapper to `pull` checkpoints to local on the periodic timer
  (free egress) instead of relying on durable on-box storage.
- Re-read the instance IP from the console/API on every launch (LAM10) — never persist it in SSH config.
- Do not `pip install torch` / blanket `apt full-upgrade` on the rental — use the Stack as-is (LAM7/8/11);
  on GH200 use the ARM build (LAM13).
- The teardown step is **terminate via API**, gated by the Iron Law; verify billing stopped (no *Alert*
  state) and add an explicit reminder to **delete the NFS filesystem** (LAM6) when the project is done.
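
The delete-the-filesystem reminder can carry a cost estimate based on the $0.20/GiB/month rate quoted in §2. The sketch assumes the charge scales with bytes stored and that a filesystem listing exposes `name` and `bytes_used` fields. Both are assumptions to confirm against the billing and API docs, and the helper names are hypothetical.

```python
GIB = 1024 ** 3
USD_PER_GIB_MONTH = 0.20  # rate quoted in this profile


def monthly_cost_usd(bytes_used: int) -> float:
    """Estimated monthly storage charge, assuming billing follows bytes stored."""
    return round(bytes_used / GIB * USD_PER_GIB_MONTH, 2)


def filesystem_reminders(filesystems_payload: dict) -> list:
    """One reminder line per filesystem; `name` and `bytes_used` are assumed fields."""
    lines = []
    for fs in filesystems_payload.get("data", []):
        cost = monthly_cost_usd(fs.get("bytes_used", 0))
        lines.append(f"{fs.get('name', '?')}: ~${cost:.2f}/month until deleted")
    return lines
```

For example, 500 GiB left behind works out to about $100 per month, which is the figure the reminder would print.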