343 lines
25 KiB
Markdown
343 lines
25 KiB
Markdown
---
|
||
platform: lambda
|
||
kind: cloud-api # REST API / web console / SSH to a normal Ubuntu VM
|
||
meter_stop_verb: terminate # the ONLY action that stops billing; sudo shutdown does NOT
|
||
meter_stop_irreversible: true # terminate wipes local NVMe — there is no stop/suspend state
|
||
detach_primitive: tmux # plain Ubuntu; tmux/screen/nohup, install if absent
|
||
spot_available: false # no spot/preemptible tier; interruption is capacity-at-launch
|
||
spot_grace: n/a # no mid-run eviction → no grace window
|
||
shared_fs: true # region-locked NFS filesystem, attach-at-launch only
|
||
inode_cap: none # no documented inode cap; GiB quota only
|
||
free_egress: true # no ingress/egress fees on instances or filesystems
|
||
china_mirror_needed: false # US/global cloud, direct egress; no platform proxy
|
||
host_driver_cuda_max: lambda-stack-dependent # Lambda Stack bundles driver+CUDA+PyTorch; version moves per release — read nvidia-smi on the box, do NOT assume a number
|
||
local_nvme: true # ephemeral root/local NVMe, gone on terminate
|
||
---
|
||
|
||
# Lambda Cloud — Profile
|
||
|
||
Lambda Cloud is a **cattle-not-pets** GPU cloud: on-demand + reserved instances, a prebuilt **Lambda
|
||
Stack** image, and **no stop/suspend state** — an instance can only be **launched, restarted, or
|
||
terminated**, and terminate destroys the local NVMe. Nothing on the box survives a teardown except what was
|
||
pushed off or written to an attached **region-locked NFS filesystem**. This inverts the AutoDL "关机保留数据"
|
||
instinct: here, durable design (checkpoint-to-NFS + idempotent resume) is **mandatory, not optional**.
|
||
|
||
> **Surface to the user up front (principle #10):** ⚠️ Danger clocks — there is **no stop/suspend**: an instance can only be launched / restarted / **terminated, and terminate wipes the local NVMe** — only the attached **NFS filesystem** survives, and **it keeps billing until you delete it manually** (LAM6). Conveniences — one-click **JupyterLab** per instance, free egress both directions. A terminate→relaunch yields a **new IP**.
|
||
|
||
> Docs/console domain moved from `lambdalabs.com` to `lambda.ai` (docs at `docs.lambda.ai`, console at
|
||
> `cloud.lambda.ai`); the **REST API base is still `cloud.lambdalabs.com/api/v1`** and `cloud.lambda.ai`
|
||
> also resolves (verified docs.lambda.ai + cloud-api 2026-06). Treat both hosts as live.
|
||
|
||
To jump: `grep -in <keyword> profiles/lambda.md`.
|
||
|
||
**Table of contents** — 1. LAUNCH · 2. STORAGE MODEL (survival matrix) · 3. NETWORK ·
|
||
4. SPOT / INTERRUPTION + RESUME · 5. TEARDOWN / BILLING · 6. DAEMON TOOL · 7. TOP GOTCHAS (LAM1–LAM13) +
|
||
Platform-specific debugging · 8. SCRIPT OVERRIDES.
|
||
|
||
Universal gotchas (CRLF, inode/`df -i`, silent sync, cgroup OOM, spot grace) are NOT repeated here —
|
||
see `references/gotchas_universal.md`. Universal invariants → `references/principles.md`.
|
||
|
||
---
|
||
|
||
## 1. LAUNCH
|
||
|
||
Entry points:
|
||
- **Web console** at `cloud.lambda.ai` → Instances → Launch (pick GPU type + region, attach a filesystem
|
||
here if one is needed — see §2; attach any per-instance firewall ruleset here too — see §3/LAM4).
|
||
- **REST API** — `https://cloud.lambdalabs.com/api/v1`, auth `curl -u $LAMBDA_API_KEY:` (basic-auth,
|
||
password empty). Canonical automation surface (verified docs.lambda.ai/api/cloud 2026-06):
|
||
- `GET /instance-types` — lists every GPU type **and** `regions_with_capacity_available[]` per type.
|
||
This field IS the capacity signal — poll it to know where a type can launch right now (drives LAM5
|
||
retry-until-available).
|
||
- `POST /instance-operations/launch` · `.../terminate` · `.../restart` — create / stop-meter / reboot.
|
||
- **SSH** — standard connection to a normal Ubuntu VM; **default user is `ubuntu`** (not `root`); use
|
||
`sudo` for root. One-click **JupyterLab** is offered per instance.
|
||
- **SkyPilot** — de-facto orchestration layer: `pip install "skypilot[lambda]"`, key file at
|
||
`~/.lambda_cloud/lambda_keys` containing a line `api_key = <KEY>` (verified docs.skypilot.co 2026-06).
|
||
Use it for retry-until-capacity + autostop (§4, §6).
|
||
|
||
**Env contract — the image/base IS the env.** Instances ship **Lambda Stack** (NVIDIA driver + CUDA +
|
||
cuDNN + PyTorch/TensorFlow, all upgraded together as one apt metapackage). Run in it directly on the
|
||
throwaway box — do **not** `conda create` on a rental (`references/principles.md` §2), and do not `pip
|
||
install torch` over the top (LAM7/LAM8). Lambda Stack's exact CUDA/driver/PyTorch **moves per release**;
|
||
read it off the box (`nvidia-smi`, `python -c "import torch;print(torch.__version__,torch.version.cuda)"`)
|
||
rather than assuming a number. The **durable** form of the env is a Docker image (Lambda recommends running
|
||
Docker inside the instance) or a setup script replayed on each launch — because terminate destroys the box.
|
||
Reserved / 1-Click Clusters provide flat-rate multi-node (own billing model — LAM12).
|
||
|
||
> **verify:** `ssh ubuntu@<IP> 'python -c "import torch;print(torch.cuda.is_available())"'` → `True`.
|
||
|
||
---
|
||
|
||
## 2. STORAGE MODEL *(survival matrix — principle #4)*
|
||
|
||
Two tiers, and the trap is that the default working location is the **volatile** one.
|
||
|
||
- **Local / root NVMe** — fast, per-instance, **ephemeral**. Docs: *"Data not stored in the mount location
|
||
is erased once you terminate your instance and cannot be recovered"* (verified docs.lambda.ai
|
||
creating-managing-instances 2026-06). This is where work lands by default.
|
||
- **NFS filesystem** — a regional network filesystem mounted at `/lambda/nfs/<name>` (docs example mount:
|
||
`/lambda/nfs/persistent-storage`). **The only durable home.** Three hard constraints (verified
|
||
docs.lambda.ai/public-cloud/filesystems 2026-06):
|
||
- **Region-locked** — *"The filesystem must reside in the same region as the instance or cluster"* and
|
||
*"Filesystems cannot currently be transferred between regions."* Pick the region deliberately at create.
|
||
- **Attach-at-launch only** — *"You must attach the filesystem … at the time that the instance … is
|
||
launched"* and *"You can't attach a filesystem after you've created an instance."*
|
||
- Billed **$0.20/GiB/month in 1-hour increments**, **free ingress/egress**; **up to 24 filesystems per
|
||
account**; most regions allow up to 8 EB/filesystem but **us-south-1 (Texas) caps at 10 TB**.
|
||
- **No documented inode cap** — GiB quota only; no `df -i` ceiling surfaced (still audit `df -i` per the
|
||
universal storage gotcha).
|
||
|
||
| Tier | Path | Survives RESTART? | Survives TERMINATE? | Cap |
|
||
|---|---|---|---|---|
|
||
| Local / root NVMe | `/`, `/home/ubuntu` | yes (data persists; **but cold reboot wipes RAM** — LAM9) | **NO** (erased, unrecoverable) | instance root volume |
|
||
| NFS filesystem | `/lambda/nfs/<name>` | yes | **yes** (separate lifecycle; keeps billing — LAM6) | GiB quota; ~10 TB in us-south-1, 8 EB elsewhere |
|
||
|
||
**Checkpoints MUST go to** `/lambda/nfs/<name>` (the durable tier) for the §5 `terminate` verb. A
|
||
checkpoint left on local NVMe dies with the box. If no filesystem was attached at launch, the only durable
|
||
path is to `pull` the result off-box (free egress) before terminating.
|
||
|
||
---
|
||
|
||
## 3. NETWORK
|
||
|
||
- **Direct, unproxied egress.** US/global cloud — egress to HF / GitHub / PyPI is direct; **no
|
||
`network_turbo`-style accelerator exists**, and none is needed. China-mirror relevance is **N/A as a
|
||
platform feature** (relevant only when operating from inside China; then `references/china-network.md`
|
||
applies to the user's own setup, nothing platform-provided).
|
||
- **Free egress both directions** — *"Transparent pricing with no egress fees"* (verified lambda.ai
|
||
pricing 2026-06). Re-pulling a large model or pushing results off-box costs nothing, making
|
||
"pull-before-terminate" the cheap, safe default when no NFS is attached.
|
||
- **Firewall** — default allows *"only incoming ICMP traffic or TCP traffic on port 22 (SSH)"*. Open more
|
||
via **global rules** (apply workspace-wide) or **per-instance rulesets** (region-scoped). Per-instance
|
||
rulesets: *"You must attach rulesets during the instance launch process. You can't attach them after the
|
||
instance has been launched"* and *"You can't remove rulesets from an instance after the instance has been
|
||
launched"* (verified docs.lambda.ai/public-cloud/firewalls 2026-06) → plan port exposure before launch
|
||
(gotcha LAM4). Global rules can still be edited on the workspace afterward.
|
||
- **Exposing TB / Jupyter** — instances get a public IP; tunnel over SSH rather than opening ports:
|
||
`ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>`. No platform-pinned TensorBoard dir —
|
||
run TB on `:6006` against the logdir under the NFS mount.
|
||
- **SSH flavor** — direct TCP to a normal VM (`ubuntu@<IP>`); full `scp`/`rsync` work, no proxy-jump quirk.
|
||
**No static IP feature** — *"On-Demand Cloud doesn't support static IP addresses"* (verified DeepTalk
|
||
staff 2026-06). The IP is fixed for an instance's life, but **terminate→relaunch yields a NEW IP**
|
||
(LAM10) — re-read it from the console/API every launch; never hard-code it in automation.
|
||
|
||
---
|
||
|
||
## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)*
|
||
|
||
**No spot / preemptible tier — and no mid-run eviction.** This is the key divergence from vast.ai/RunPod:
|
||
there is **no SIGTERM→SIGKILL grace window to survive**, because a running instance is never evicted
|
||
mid-epoch. The interruption model is different in kind:
|
||
|
||
- **Capacity-at-launch is the real failure.** The desired GPU type may be **unavailable when launch is
|
||
attempted** — Lambda has **no spot tier to fall back to**, and real-world on-demand fill rates are
|
||
spiky (one published 6-month log: ~64% same-day A100 success — i.e. ~1 in 3 attempts blocked; a 26 h
|
||
"temporarily unavailable" stall scaling 2→4 H100; verified medium.com/@velinxs 2026-06). H100/B200
|
||
capacity is the tightest. The resilience pattern is **retry-until-available**, not survive-eviction:
|
||
poll `GET /instance-types` for `regions_with_capacity_available` and `POST .../launch` the moment a
|
||
region appears (or let SkyPilot's provisioner retry across regions/types).
|
||
- **Self-inflicted termination only.** Once running, the only destructive events are an operator
|
||
`terminate`, or an **improper `sudo shutdown`** that pushes the box to **Alert** while still billing
|
||
(LAM3 / §5), or a **cold reboot** that wipes RAM (LAM9).
|
||
- **Resume hook** — checkpoint full state to the NFS filesystem on a periodic timer, load-latest
|
||
unconditionally on startup, so a fresh post-capacity launch resumes instead of restarting. Because the
|
||
box is cattle, the resume path is exercised on *every* relaunch, not just after a rare preemption.
|
||
|
||
Cadence formula (Young/Daly) + atomic-write resume → `references/spot-resilience.md`. Here the formula's
|
||
μ is effectively "time between voluntary relaunches," not a preemption rate.
|
||
|
||
---
|
||
|
||
## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)*
|
||
|
||
**TERMINATE is the meter-stop verb — and it is irreversible.** *"Billing begins the moment you launch an
|
||
instance and the instance passes health checks, and ends the moment you terminate the instance"*, billed
|
||
in **one-minute increments**, *"regardless if they're actively being used"* (verified
|
||
docs.lambda.ai/public-cloud/billing 2026-06).
|
||
|
||
> **The shutdown trap (most error-prone fact on this platform):** *"Do not use commands such as `sudo
|
||
> shutdown -h now` or `sudo systemctl poweroff` … These commands will not work as expected and will cause
|
||
> your instances to go into Alert status, and billing will continue"* (verified docs.lambda.ai 2026-06).
|
||
> Also `halt` / `shutdown -P 0` only stop the OS, not the meter (DeepTalk staff). Stop the meter **only**
|
||
> via `terminate` from the console or `POST /instance-operations/terminate` — which works even from inside
|
||
> the instance itself.
|
||
|
||
What each action preserves:
|
||
- **terminate** — stops the instance meter; **erases the local NVMe** (unrecoverable). The NFS filesystem
|
||
has a **separate lifecycle** and survives — but it **keeps billing $0.20/GiB/month until explicitly
|
||
deleted** (*"Billing continues as long as a filesystem exists, even if it's not mounted to an instance"*),
|
||
so a terminated-but-forgotten filesystem is a silent ongoing charge (LAM6).
|
||
- **There is no stop/suspend state** — *"It currently isn't possible to pause (suspend) your instance …
|
||
Your only options are to launch, restart, or terminate"* (verified docs.lambda.ai 2026-06). Idle-cheap
|
||
pause is impossible; the only way to stop paying for compute is to destroy the box and rebuild later.
|
||
- **restart / cold reboot** — does **not** stop the meter and does **not** wipe disk, but a **cold reboot
|
||
erases RAM and bypasses safe shutdown** — reserve it for a frozen box only (LAM9).
|
||
|
||
**Iron Law (SKILL.md Phase 5):** NO `terminate` until checkpoints are **pulled to local OR confirmed on
|
||
NFS by load-test** AND the user approves the cost-affecting action. Because terminate is destructive and
|
||
irreversible, an unverified `cp`/`rsync` to NFS means **permanent loss** — verify the sync (checksum /
|
||
`ls -l` / a load) before terminating, not after. Egress is free, so a belt-and-suspenders `pull` to local
|
||
is cheap. Cross-link: `superpowers:verification-before-completion` (REQUIRED) for the general gate.
|
||
|
||
---
|
||
|
||
## 6. DAEMON TOOL
|
||
|
||
- **Detach primitive: `tmux`** (or `screen` / `nohup`) on a standard Ubuntu VM — same playbook as the
|
||
AutoDL tmux pattern. Install if absent (`sudo apt install -y tmux`); fall back to
|
||
`nohup … </dev/null >log 2>&1 &`.
|
||
- **Survives an SSH drop, NOT a terminate.** tmux keeps the job alive across a dropped connection, but
|
||
with no stop state the detach primitive can't survive a teardown — only the **checkpoint-to-NFS +
|
||
idempotent resume** spine does (principle #8). tmux is the SSH-resilience layer; the checkpoint is the
|
||
instance-resilience layer. (tmux also won't survive a cold reboot — LAM9.)
|
||
- **Native orchestration: SkyPilot** (managed jobs, autostop, retry-until-capacity) + **1-Click
|
||
Clusters** for multi-node; no platform job-queue otherwise. SkyPilot moves the box on capacity loss but
|
||
**restarts the process from scratch — the checkpoint-load restores progress** (don't assume the
|
||
framework resumes training state).
|
||
|
||
---
|
||
|
||
## 7. TOP GOTCHAS (Lambda-pinned — universal ones live in `references/gotchas_universal.md`)
|
||
|
||
- **LAM1 — Terminate erases the local NVMe; there is no stop/suspend.**
|
||
Symptom: relaunched instance is blank, yesterday's run gone. → Root cause: local storage is ephemeral
|
||
(*"Data not stored in the mount location is erased … and cannot be recovered"*) and no stop state
|
||
preserves it; the AutoDL "关机 keeps my data" assumption is false. → Fix: design every workflow around
|
||
destroy/recreate — checkpoint to `/lambda/nfs/<name>` or `pull` off-box before any terminate; never keep
|
||
the only copy on local NVMe. (docs.lambda.ai 2026-06)
|
||
|
||
- **LAM2 — Filesystem is attach-at-launch only and region-locked.**
|
||
Symptom: a running instance has no durable storage and one can't be added; or a us-east filesystem won't
|
||
mount on a us-west instance. → Root cause: filesystems attach only at create time and can't move between
|
||
regions. → Fix: decide the region and attach the filesystem **at launch**; co-locate instance +
|
||
filesystem in the same region. (filesystems doc 2026-06)
|
||
|
||
- **LAM3 — `sudo shutdown` / `poweroff` keeps the meter running (Alert state).**
|
||
Symptom: instance "powered off" but the bill keeps climbing. → Root cause: an in-OS shutdown sends the
|
||
instance to **Alert** without stopping billing; `halt`/`shutdown -P 0` only stop the OS, not the meter.
|
||
→ Fix: stop the meter only via **terminate** (console or `POST /instance-operations/terminate`); never
|
||
rely on an in-box poweroff. (billing doc + DeepTalk staff 2026-06)
|
||
|
||
- **LAM4 — Per-instance firewall rulesets are immutable post-launch.**
|
||
Symptom: a needed inbound port can't be opened (or a wrong one removed) on a live instance. → Root cause:
|
||
per-instance rulesets *"must [be attached] during the instance launch process"* and *"can't [be removed]
|
||
after the instance has been launched."* → Fix: plan port exposure before launch, use an editable
|
||
**global** rule, or tunnel over SSH (`-L`, §3) instead of opening a port. (firewalls doc 2026-06)
|
||
|
||
- **LAM5 — Capacity, not eviction, is the bottleneck (no spot fallback).**
|
||
Symptom: launch fails / dashboard shows the desired GPU type unavailable; long stalls scaling up. → Root
|
||
cause: on-demand supply for a specific GPU/region is exhausted (worst for H100/B200), and there is no
|
||
spot tier to fall back to. → Fix: poll `GET /instance-types` for `regions_with_capacity_available` and
|
||
launch the instant a region appears (or use SkyPilot's cross-region/type provisioner); resume from the
|
||
NFS checkpoint once granted (§4). (cloud-api doc + medium.com/@velinxs 2026-06)
|
||
|
||
- **LAM6 — The NFS filesystem keeps billing after the instance is gone.**
|
||
Symptom: all instances terminated, but storage charges continue. → Root cause: *"Billing continues as
|
||
long as a filesystem exists, even if it's not mounted to an instance"* — $0.20/GiB/month until deleted.
|
||
→ Fix: after the final `pull` + verify, **delete the filesystem** (console Storage → Delete; requires
|
||
terminating attached instances first) — a distinct teardown step. (billing + filesystems docs 2026-06)
|
||
|
||
- **LAM7 — `pip install torch` over Lambda Stack silently shadows or mismatches it.**
|
||
Symptom: a `pip install` in `base` reports *"Defaulting to user installation because normal site-packages
|
||
is not writeable"* and lands in `~/.local`, or a `torch==X` pin drags in a CUDA/torchvision combo that
|
||
conflicts with the system build → import/CUDA errors. → Root cause: Lambda Stack PyTorch lives in
|
||
system `/usr/lib/python3/dist-packages` (not pip-writable as `ubuntu`); pip's user install or a hard
|
||
version pin diverges from it. → Fix: use the Stack's PyTorch as-is (don't reinstall), loosen pins
|
||
(`torch>=2.x` not `==`), or fully isolate in a fresh venv/conda env and install torch there cleanly —
|
||
don't half-mix pip-over-system. (DeepTalk threads 2026-06)
|
||
|
||
- **LAM8 — conda/venv that "borrows" Stack PyTorch via system-site-packages then breaks on pip.**
|
||
Symptom: created a conda env to use the Stack's torch, then a later `pip install` pulls a second,
|
||
conflicting torch or can't write site-packages. → Root cause: mixing `--system-site-packages` (to see
|
||
the system torch) with pip installs into the same env creates two torch copies. → Fix: pick ONE model —
|
||
either run in the bare Stack base (preferred on a rental), or build a fully self-contained env with
|
||
`conda install pytorch torchvision` (no system-site-packages borrowing). (DeepTalk
|
||
bypassing-lambda-stack thread 2026-06)
|
||
|
||
- **LAM9 — Cold reboot wipes RAM and tmux; warm restart still bills.**
|
||
Symptom: after a "reboot" the detached training job is gone and the box came back clean-ish. → Root
|
||
cause: a **cold reboot** *"erases all data currently in the instance's memory and bypasses the operating
|
||
system's safe-shutdown mechanisms"* — kills tmux sessions and any in-RAM state; neither reboot stops the
|
||
meter. → Fix: only cold-reboot a frozen box; rely on checkpoint-to-NFS, not on process survival across a
|
||
reboot; expect to re-`ssh` and re-`tmux attach` (session may be gone). (console doc 2026-06)
|
||
|
||
- **LAM10 — No static IP; the public IP changes on terminate→relaunch.**
|
||
Symptom: automation/SSH config hard-coded to yesterday's IP fails after a relaunch. → Root cause:
|
||
*"On-Demand Cloud doesn't support static IP addresses"* — a fresh launch gets a fresh IP. → Fix: read
|
||
the IP from the console / `GET /instances` on every launch; template SSH config dynamically; never
|
||
hard-code it. (DeepTalk staff 2026-06)
|
||
|
||
- **LAM11 — `apt full-upgrade` on Lambda Stack images can break cuDNN/DOCA.**
|
||
Symptom: after a recommended `apt-get update && upgrade` (or `full-upgrade` on 24.04 images), PyTorch/TF
|
||
fails to find cuDNN, or full-upgrade itself fails on a DOCA package. → Root cause: a system cuDNN bump
|
||
or DOCA repo state diverges from the Stack-bundled libs. → Fix: avoid blanket `full-upgrade` on a
|
||
rental; if cuDNN is missing, symlink the Stack copies —
|
||
`for so in /usr/lib/python3/dist-packages/tensorflow/libcudnn*; do sudo ln -s "$so" /usr/lib/x86_64-linux-gnu/; done`
|
||
(note: Stack cuDNN is usable *only* by the Stack-installed PyTorch/TF). (troubleshooting doc 2026-06)
|
||
|
||
- **LAM12 — 1-Click Clusters / reserved bill differently than on-demand (commitment traps).**
|
||
Symptom: expected per-minute pricing, got a 2-week minimum / weekly invoice / a reservation that expired.
|
||
→ Root cause: **1-Click Clusters** carry a **minimum 2-week commitment with weekly billing** (not
|
||
per-minute); **reserved** capacity requires Lambda approval and the **invoice must be paid within ~10
|
||
days or the reservation is forfeited**, on non-cancelable terms. → Fix: use plain on-demand single
|
||
instances for per-minute experiments; only enter a cluster/reservation with confirmed sustained need and
|
||
budget approval. (1-click-clusters docs + nOps/CheckThat 2026-06)
|
||
|
||
- **LAM13 — GH200 (ARM/aarch64) breaks `pip install torch` — needs the ARM build.**
|
||
Symptom: on a 1× GH200 box, `pip install torch` installs a **CPU-only** wheel (no CUDA), or a pinned
|
||
`torch==2.2.0` fails to resolve. → Root cause: GH200 is aarch64; the default PyPI torch wheel for
|
||
aarch64 is CPU-only. → Fix: use Lambda Stack's pre-compiled ARM PyTorch (e.g. 2.4.1) as-is, or install
|
||
from the CUDA index `pip install torch --index-url https://download.pytorch.org/whl/cu128` (aarch64 GPU
|
||
wheels live there), or compile from source for newer versions; relax exact pins. (DeepTalk GH200 thread
|
||
+ pytorch.org 2026-06)
|
||
|
||
### Platform-specific debugging
|
||
- **Confirm billing actually stopped:** after a teardown, check the instance is **gone** (not in *Alert*)
|
||
via the console or `curl -u $LAMBDA_API_KEY: https://cloud.lambdalabs.com/api/v1/instances` — an Alert-
|
||
state box (from an in-OS shutdown) is still charging (LAM3).
|
||
- **Capacity probe before launch:** `curl -u $LAMBDA_API_KEY: .../instance-types | jq '.data | to_entries[]
|
||
| {type:.key, regions:.value.regions_with_capacity_available}'` — empty `regions` ⇒ that GPU type can't
|
||
launch anywhere right now (LAM5); this is the loop condition for retry-until-available.
|
||
- **GPU sanity on the box:** `nvidia-smi` (driver/CUDA + util) and `python -c "import torch;
|
||
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"` — mismatch between
|
||
`torch.version.cuda` and `nvidia-smi` CUDA usually means a pip-shadowed torch (LAM7/8/13), not a Stack
|
||
problem.
|
||
- **Read the real Stack version, never assume:** `apt list --installed 2>/dev/null | grep -i lambda-stack`
|
||
and `dpkg -l | grep -i cudnn` — confirm before debugging a "version mismatch."
|
||
- **Disk pressure on the ephemeral root:** `df -h /` and `df -h /lambda/nfs/<name>`; remember `/home/ubuntu`
|
||
is volatile — large datasets/checkpoints filling the root volume are also *lost* on terminate, so move
|
||
them to NFS, not just to clear space.
|
||
- **Detect a stalled download:** background the pull (`nohup … &`) and watch growth —
|
||
`watch -n5 'du -sh <target>; ls -l <target>'` (flat size for minutes ⇒ stalled; re-pull, egress is free).
|
||
- **Stuck/unreachable after reboot:** if SSH dies post-reboot, the box may be in *Alert* or networking
|
||
failed to come up — check the console state and prefer a fresh **terminate→relaunch** (resume from NFS)
|
||
over fighting a cold-reboot that already wiped RAM (LAM9).
|
||
|
||
---
|
||
|
||
## 8. SCRIPT OVERRIDES
|
||
|
||
Values to parameterize the `scripts/` templates for Lambda:
|
||
|
||
```
|
||
DATA_DIR= /home/ubuntu (ephemeral NVMe — lost on terminate)
|
||
DURABLE_DIR= /lambda/nfs/<name>
|
||
PROXY_HOOK= (none — direct egress; no network_turbo)
|
||
CRED_FILE= "" (Lambda key is the $LAMBDA_API_KEY env var, not a file on disk — run_one's [ -n "$CRED_FILE" ] guard skips the file read and the env var passes through; SkyPilot key file at ~/.lambda_cloud/lambda_keys, format `api_key = <KEY>`)
|
||
SCRATCH= prune periodic ckpts on local NVMe; keep only `best` on /lambda/nfs/<name>
|
||
HF_HOME= /lambda/nfs/<name>/.cache/huggingface (durable; survives terminate, free egress on re-pull)
|
||
DETACH= tmux (apt install if absent; nohup fallback)
|
||
SSH_USER= ubuntu (NOT root)
|
||
```
|
||
|
||
Notes for the wrapper:
|
||
- Default checkpoint dir → the NFS mount, not `/home/ubuntu` — the latter is erased on terminate.
|
||
- If no NFS filesystem is attached, set the wrapper to `pull` checkpoints to local on the periodic timer
|
||
(free egress) instead of relying on durable on-box storage.
|
||
- Re-read the instance IP from the console/API on every launch (LAM10) — never persist it in SSH config.
|
||
- Do not `pip install torch` / blanket `apt full-upgrade` on the rental — use the Stack as-is (LAM7/8/11);
|
||
on GH200 use the ARM build (LAM13).
|
||
- The teardown step is **terminate via API**, gated by the Iron Law; verify billing stopped (no *Alert*
|
||
state) and add an explicit reminder to **delete the NFS filesystem** (LAM6) when the project is done.
|