playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/profiles/lambda.md

25 KiB
Raw Permalink Blame History

platform kind meter_stop_verb meter_stop_irreversible detach_primitive spot_available spot_grace shared_fs inode_cap free_egress china_mirror_needed host_driver_cuda_max local_nvme
lambda cloud-api terminate true tmux false n/a true none true false lambda-stack-dependent true

Lambda Cloud — Profile

Lambda Cloud is a cattle-not-pets GPU cloud: on-demand + reserved instances, a prebuilt Lambda Stack image, and no stop/suspend state — an instance can only be launched, restarted, or terminated, and terminate destroys the local NVMe. Nothing on the box survives a teardown except what was pushed off or written to an attached region-locked NFS filesystem. This inverts the AutoDL "关机保留数据" instinct: here, durable design (checkpoint-to-NFS + idempotent resume) is mandatory, not optional.

Surface to the user up front (principle #10): ⚠️ Danger clocks — there is no stop/suspend: an instance can only be launched / restarted / terminated, and terminate wipes the local NVMe — only the attached NFS filesystem survives, and it keeps billing until you delete it manually (LAM6). Conveniences — one-click JupyterLab per instance, free egress both directions. A terminate→relaunch yields a new IP.

Docs/console domain moved from lambdalabs.com to lambda.ai (docs at docs.lambda.ai, console at cloud.lambda.ai); the REST API base is still cloud.lambdalabs.com/api/v1 and cloud.lambda.ai also resolves (verified docs.lambda.ai + cloud-api 2026-06). Treat both hosts as live.

To jump: grep -in <keyword> profiles/lambda.md.

Table of contents — 1. LAUNCH · 2. STORAGE MODEL (survival matrix) · 3. NETWORK · 4. SPOT / INTERRUPTION + RESUME · 5. TEARDOWN / BILLING · 6. DAEMON TOOL · 7. TOP GOTCHAS (LAM1LAM13) + Platform-specific debugging · 8. SCRIPT OVERRIDES.

Universal gotchas (CRLF, inode/df -i, silent sync, cgroup OOM, spot grace) are NOT repeated here — see references/gotchas_universal.md. Universal invariants → references/principles.md.


1. LAUNCH

Entry points:

  • Web console at cloud.lambda.ai → Instances → Launch (pick GPU type + region, attach a filesystem here if one is needed — see §2; attach any per-instance firewall ruleset here too — see §3/LAM4).
  • REST APIhttps://cloud.lambdalabs.com/api/v1, auth curl -u $LAMBDA_API_KEY: (basic-auth, password empty). Canonical automation surface (verified docs.lambda.ai/api/cloud 2026-06):
    • GET /instance-types — lists every GPU type and regions_with_capacity_available[] per type. This field IS the capacity signal — poll it to know where a type can launch right now (drives LAM5 retry-until-available).
    • POST /instance-operations/launch · .../terminate · .../restart — create / stop-meter / reboot.
  • SSH — standard connection to a normal Ubuntu VM; default user is ubuntu (not root); use sudo for root. One-click JupyterLab is offered per instance.
  • SkyPilot — de-facto orchestration layer: pip install "skypilot[lambda]", key file at ~/.lambda_cloud/lambda_keys containing a line api_key = <KEY> (verified docs.skypilot.co 2026-06). Use it for retry-until-capacity + autostop (§4, §6).

Env contract — the image/base IS the env. Instances ship Lambda Stack (NVIDIA driver + CUDA + cuDNN + PyTorch/TensorFlow, all upgraded together as one apt metapackage). Run in it directly on the throwaway box — do not conda create on a rental (references/principles.md §2), and do not pip install torch over the top (LAM7/LAM8). Lambda Stack's exact CUDA/driver/PyTorch moves per release; read it off the box (nvidia-smi, python -c "import torch;print(torch.__version__,torch.version.cuda)") rather than assuming a number. The durable form of the env is a Docker image (Lambda recommends running Docker inside the instance) or a setup script replayed on each launch — because terminate destroys the box. Reserved / 1-Click Clusters provide flat-rate multi-node (own billing model — LAM12).

verify: ssh ubuntu@<IP> 'python -c "import torch;print(torch.cuda.is_available())"'True.


2. STORAGE MODEL (survival matrix — principle #4)

Two tiers, and the trap is that the default working location is the volatile one.

  • Local / root NVMe — fast, per-instance, ephemeral. Docs: "Data not stored in the mount location is erased once you terminate your instance and cannot be recovered" (verified docs.lambda.ai creating-managing-instances 2026-06). This is where work lands by default.
  • NFS filesystem — a regional network filesystem mounted at /lambda/nfs/<name> (docs example mount: /lambda/nfs/persistent-storage). The only durable home. Three hard constraints (verified docs.lambda.ai/public-cloud/filesystems 2026-06):
    • Region-locked"The filesystem must reside in the same region as the instance or cluster" and "Filesystems cannot currently be transferred between regions." Pick the region deliberately at create.
    • Attach-at-launch only"You must attach the filesystem … at the time that the instance … is launched" and "You can't attach a filesystem after you've created an instance."
    • Billed $0.20/GiB/month in 1-hour increments, free ingress/egress; up to 24 filesystems per account; most regions allow up to 8 EB/filesystem but us-south-1 (Texas) caps at 10 TB.
  • No documented inode cap — GiB quota only; no df -i ceiling surfaced (still audit df -i per the universal storage gotcha).
Tier Path Survives RESTART? Survives TERMINATE? Cap
Local / root NVMe /, /home/ubuntu yes (data persists; but cold reboot wipes RAM — LAM9) NO (erased, unrecoverable) instance root volume
NFS filesystem /lambda/nfs/<name> yes yes (separate lifecycle; keeps billing — LAM6) GiB quota; ~10 TB in us-south-1, 8 EB elsewhere

Checkpoints MUST go to /lambda/nfs/<name> (the durable tier) for the §5 terminate verb. A checkpoint left on local NVMe dies with the box. If no filesystem was attached at launch, the only durable path is to pull the result off-box (free egress) before terminating.


3. NETWORK

  • Direct, unproxied egress. US/global cloud — egress to HF / GitHub / PyPI is direct; no network_turbo-style accelerator exists, and none is needed. China-mirror relevance is N/A as a platform feature (relevant only when operating from inside China; then references/china-network.md applies to the user's own setup, nothing platform-provided).
  • Free egress both directions"Transparent pricing with no egress fees" (verified lambda.ai pricing 2026-06). Re-pulling a large model or pushing results off-box costs nothing, making "pull-before-terminate" the cheap, safe default when no NFS is attached.
  • Firewall — default allows "only incoming ICMP traffic or TCP traffic on port 22 (SSH)". Open more via global rules (apply workspace-wide) or per-instance rulesets (region-scoped). Per-instance rulesets: "You must attach rulesets during the instance launch process. You can't attach them after the instance has been launched" and "You can't remove rulesets from an instance after the instance has been launched" (verified docs.lambda.ai/public-cloud/firewalls 2026-06) → plan port exposure before launch (gotcha LAM4). Global rules can still be edited on the workspace afterward.
  • Exposing TB / Jupyter — instances get a public IP; tunnel over SSH rather than opening ports: ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>. No platform-pinned TensorBoard dir — run TB on :6006 against the logdir under the NFS mount.
  • SSH flavor — direct TCP to a normal VM (ubuntu@<IP>); full scp/rsync work, no proxy-jump quirk. No static IP feature"On-Demand Cloud doesn't support static IP addresses" (verified DeepTalk staff 2026-06). The IP is fixed for an instance's life, but terminate→relaunch yields a NEW IP (LAM10) — re-read it from the console/API every launch; never hard-code it in automation.

4. SPOT / INTERRUPTION + RESUME (principle #7/#8)

No spot / preemptible tier — and no mid-run eviction. This is the key divergence from vast.ai/RunPod: there is no SIGTERM→SIGKILL grace window to survive, because a running instance is never evicted mid-epoch. The interruption model is different in kind:

  • Capacity-at-launch is the real failure. The desired GPU type may be unavailable when launch is attempted — Lambda has no spot tier to fall back to, and real-world on-demand fill rates are spiky (one published 6-month log: ~64% same-day A100 success — i.e. ~1 in 3 attempts blocked; a 26 h "temporarily unavailable" stall scaling 2→4 H100; verified medium.com/@velinxs 2026-06). H100/B200 capacity is the tightest. The resilience pattern is retry-until-available, not survive-eviction: poll GET /instance-types for regions_with_capacity_available and POST .../launch the moment a region appears (or let SkyPilot's provisioner retry across regions/types).
  • Self-inflicted termination only. Once running, the only destructive events are an operator terminate, or an improper sudo shutdown that pushes the box to Alert while still billing (LAM3 / §5), or a cold reboot that wipes RAM (LAM9).
  • Resume hook — checkpoint full state to the NFS filesystem on a periodic timer, load-latest unconditionally on startup, so a fresh post-capacity launch resumes instead of restarting. Because the box is cattle, the resume path is exercised on every relaunch, not just after a rare preemption.

Cadence formula (Young/Daly) + atomic-write resume → references/spot-resilience.md. Here the formula's μ is effectively "time between voluntary relaunches," not a preemption rate.


5. TEARDOWN / BILLING (principle #9 + the Iron Law)

TERMINATE is the meter-stop verb — and it is irreversible. "Billing begins the moment you launch an instance and the instance passes health checks, and ends the moment you terminate the instance", billed in one-minute increments, "regardless if they're actively being used" (verified docs.lambda.ai/public-cloud/billing 2026-06).

The shutdown trap (most error-prone fact on this platform): "Do not use commands such as sudo shutdown -h now or sudo systemctl poweroff … These commands will not work as expected and will cause your instances to go into Alert status, and billing will continue" (verified docs.lambda.ai 2026-06). Also halt / shutdown -P 0 only stop the OS, not the meter (DeepTalk staff). Stop the meter only via terminate from the console or POST /instance-operations/terminate — which works even from inside the instance itself.

What each action preserves:

  • terminate — stops the instance meter; erases the local NVMe (unrecoverable). The NFS filesystem has a separate lifecycle and survives — but it keeps billing $0.20/GiB/month until explicitly deleted ("Billing continues as long as a filesystem exists, even if it's not mounted to an instance"), so a terminated-but-forgotten filesystem is a silent ongoing charge (LAM6).
  • There is no stop/suspend state"It currently isn't possible to pause (suspend) your instance … Your only options are to launch, restart, or terminate" (verified docs.lambda.ai 2026-06). Idle-cheap pause is impossible; the only way to stop paying for compute is to destroy the box and rebuild later.
  • restart / cold reboot — does not stop the meter and does not wipe disk, but a cold reboot erases RAM and bypasses safe shutdown — reserve it for a frozen box only (LAM9).

Iron Law (SKILL.md Phase 5): NO terminate until checkpoints are pulled to local OR confirmed on NFS by load-test AND the user approves the cost-affecting action. Because terminate is destructive and irreversible, an unverified cp/rsync to NFS means permanent loss — verify the sync (checksum / ls -l / a load) before terminating, not after. Egress is free, so a belt-and-suspenders pull to local is cheap. Cross-link: superpowers:verification-before-completion (REQUIRED) for the general gate.


6. DAEMON TOOL

  • Detach primitive: tmux (or screen / nohup) on a standard Ubuntu VM — same playbook as the AutoDL tmux pattern. Install if absent (sudo apt install -y tmux); fall back to nohup … </dev/null >log 2>&1 &.
  • Survives an SSH drop, NOT a terminate. tmux keeps the job alive across a dropped connection, but with no stop state the detach primitive can't survive a teardown — only the checkpoint-to-NFS + idempotent resume spine does (principle #8). tmux is the SSH-resilience layer; the checkpoint is the instance-resilience layer. (tmux also won't survive a cold reboot — LAM9.)
  • Native orchestration: SkyPilot (managed jobs, autostop, retry-until-capacity) + 1-Click Clusters for multi-node; no platform job-queue otherwise. SkyPilot moves the box on capacity loss but restarts the process from scratch — the checkpoint-load restores progress (don't assume the framework resumes training state).

7. TOP GOTCHAS (Lambda-pinned — universal ones live in references/gotchas_universal.md)

  • LAM1 — Terminate erases the local NVMe; there is no stop/suspend. Symptom: relaunched instance is blank, yesterday's run gone. → Root cause: local storage is ephemeral ("Data not stored in the mount location is erased … and cannot be recovered") and no stop state preserves it; the AutoDL "关机 keeps my data" assumption is false. → Fix: design every workflow around destroy/recreate — checkpoint to /lambda/nfs/<name> or pull off-box before any terminate; never keep the only copy on local NVMe. (docs.lambda.ai 2026-06)

  • LAM2 — Filesystem is attach-at-launch only and region-locked. Symptom: a running instance has no durable storage and one can't be added; or a us-east filesystem won't mount on a us-west instance. → Root cause: filesystems attach only at create time and can't move between regions. → Fix: decide the region and attach the filesystem at launch; co-locate instance + filesystem in the same region. (filesystems doc 2026-06)

  • LAM3 — sudo shutdown / poweroff keeps the meter running (Alert state). Symptom: instance "powered off" but the bill keeps climbing. → Root cause: an in-OS shutdown sends the instance to Alert without stopping billing; halt/shutdown -P 0 only stop the OS, not the meter. → Fix: stop the meter only via terminate (console or POST /instance-operations/terminate); never rely on an in-box poweroff. (billing doc + DeepTalk staff 2026-06)

  • LAM4 — Per-instance firewall rulesets are immutable post-launch. Symptom: a needed inbound port can't be opened (or a wrong one removed) on a live instance. → Root cause: per-instance rulesets "must [be attached] during the instance launch process" and "can't [be removed] after the instance has been launched." → Fix: plan port exposure before launch, use an editable global rule, or tunnel over SSH (-L, §3) instead of opening a port. (firewalls doc 2026-06)

  • LAM5 — Capacity, not eviction, is the bottleneck (no spot fallback). Symptom: launch fails / dashboard shows the desired GPU type unavailable; long stalls scaling up. → Root cause: on-demand supply for a specific GPU/region is exhausted (worst for H100/B200), and there is no spot tier to fall back to. → Fix: poll GET /instance-types for regions_with_capacity_available and launch the instant a region appears (or use SkyPilot's cross-region/type provisioner); resume from the NFS checkpoint once granted (§4). (cloud-api doc + medium.com/@velinxs 2026-06)

  • LAM6 — The NFS filesystem keeps billing after the instance is gone. Symptom: all instances terminated, but storage charges continue. → Root cause: "Billing continues as long as a filesystem exists, even if it's not mounted to an instance" — $0.20/GiB/month until deleted. → Fix: after the final pull + verify, delete the filesystem (console Storage → Delete; requires terminating attached instances first) — a distinct teardown step. (billing + filesystems docs 2026-06)

  • LAM7 — pip install torch over Lambda Stack silently shadows or mismatches it. Symptom: a pip install in base reports "Defaulting to user installation because normal site-packages is not writeable" and lands in ~/.local, or a torch==X pin drags in a CUDA/torchvision combo that conflicts with the system build → import/CUDA errors. → Root cause: Lambda Stack PyTorch lives in system /usr/lib/python3/dist-packages (not pip-writable as ubuntu); pip's user install or a hard version pin diverges from it. → Fix: use the Stack's PyTorch as-is (don't reinstall), loosen pins (torch>=2.x not ==), or fully isolate in a fresh venv/conda env and install torch there cleanly — don't half-mix pip-over-system. (DeepTalk threads 2026-06)

  • LAM8 — conda/venv that "borrows" Stack PyTorch via system-site-packages then breaks on pip. Symptom: created a conda env to use the Stack's torch, then a later pip install pulls a second, conflicting torch or can't write site-packages. → Root cause: mixing --system-site-packages (to see the system torch) with pip installs into the same env creates two torch copies. → Fix: pick ONE model — either run in the bare Stack base (preferred on a rental), or build a fully self-contained env with conda install pytorch torchvision (no system-site-packages borrowing). (DeepTalk bypassing-lambda-stack thread 2026-06)

  • LAM9 — Cold reboot wipes RAM and tmux; warm restart still bills. Symptom: after a "reboot" the detached training job is gone and the box came back clean-ish. → Root cause: a cold reboot "erases all data currently in the instance's memory and bypasses the operating system's safe-shutdown mechanisms" — kills tmux sessions and any in-RAM state; neither reboot stops the meter. → Fix: only cold-reboot a frozen box; rely on checkpoint-to-NFS, not on process survival across a reboot; expect to re-ssh and re-tmux attach (session may be gone). (console doc 2026-06)

  • LAM10 — No static IP; the public IP changes on terminate→relaunch. Symptom: automation/SSH config hard-coded to yesterday's IP fails after a relaunch. → Root cause: "On-Demand Cloud doesn't support static IP addresses" — a fresh launch gets a fresh IP. → Fix: read the IP from the console / GET /instances on every launch; template SSH config dynamically; never hard-code it. (DeepTalk staff 2026-06)

  • LAM11 — apt full-upgrade on Lambda Stack images can break cuDNN/DOCA. Symptom: after a recommended apt-get update && upgrade (or full-upgrade on 24.04 images), PyTorch/TF fails to find cuDNN, or full-upgrade itself fails on a DOCA package. → Root cause: a system cuDNN bump or DOCA repo state diverges from the Stack-bundled libs. → Fix: avoid blanket full-upgrade on a rental; if cuDNN is missing, symlink the Stack copies — for so in /usr/lib/python3/dist-packages/tensorflow/libcudnn*; do sudo ln -s "$so" /usr/lib/x86_64-linux-gnu/; done (note: Stack cuDNN is usable only by the Stack-installed PyTorch/TF). (troubleshooting doc 2026-06)

  • LAM12 — 1-Click Clusters / reserved bill differently than on-demand (commitment traps). Symptom: expected per-minute pricing, got a 2-week minimum / weekly invoice / a reservation that expired. → Root cause: 1-Click Clusters carry a minimum 2-week commitment with weekly billing (not per-minute); reserved capacity requires Lambda approval and the invoice must be paid within ~10 days or the reservation is forfeited, on non-cancelable terms. → Fix: use plain on-demand single instances for per-minute experiments; only enter a cluster/reservation with confirmed sustained need and budget approval. (1-click-clusters docs + nOps/CheckThat 2026-06)

  • LAM13 — GH200 (ARM/aarch64) breaks pip install torch — needs the ARM build. Symptom: on a 1× GH200 box, pip install torch installs a CPU-only wheel (no CUDA), or a pinned torch==2.2.0 fails to resolve. → Root cause: GH200 is aarch64; the default PyPI torch wheel for aarch64 is CPU-only. → Fix: use Lambda Stack's pre-compiled ARM PyTorch (e.g. 2.4.1) as-is, or install from the CUDA index pip install torch --index-url https://download.pytorch.org/whl/cu128 (aarch64 GPU wheels live there), or compile from source for newer versions; relax exact pins. (DeepTalk GH200 thread

    • pytorch.org 2026-06)

Platform-specific debugging

  • Confirm billing actually stopped: after a teardown, check the instance is gone (not in Alert) via the console or curl -u $LAMBDA_API_KEY: https://cloud.lambdalabs.com/api/v1/instances — an Alert- state box (from an in-OS shutdown) is still charging (LAM3).
  • Capacity probe before launch: curl -u $LAMBDA_API_KEY: .../instance-types | jq '.data | to_entries[] | {type:.key, regions:.value.regions_with_capacity_available}' — empty regions ⇒ that GPU type can't launch anywhere right now (LAM5); this is the loop condition for retry-until-available.
  • GPU sanity on the box: nvidia-smi (driver/CUDA + util) and python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())" — mismatch between torch.version.cuda and nvidia-smi CUDA usually means a pip-shadowed torch (LAM7/8/13), not a Stack problem.
  • Read the real Stack version, never assume: apt list --installed 2>/dev/null | grep -i lambda-stack and dpkg -l | grep -i cudnn — confirm before debugging a "version mismatch."
  • Disk pressure on the ephemeral root: df -h / and df -h /lambda/nfs/<name>; remember /home/ubuntu is volatile — large datasets/checkpoints filling the root volume are also lost on terminate, so move them to NFS, not just to clear space.
  • Detect a stalled download: background the pull (nohup … &) and watch growth — watch -n5 'du -sh <target>; ls -l <target>' (flat size for minutes ⇒ stalled; re-pull, egress is free).
  • Stuck/unreachable after reboot: if SSH dies post-reboot, the box may be in Alert or networking failed to come up — check the console state and prefer a fresh terminate→relaunch (resume from NFS) over fighting a cold-reboot that already wiped RAM (LAM9).

8. SCRIPT OVERRIDES

Values to parameterize the scripts/ templates for Lambda:

DATA_DIR=       /home/ubuntu (ephemeral NVMe — lost on terminate)
DURABLE_DIR=    /lambda/nfs/<name>
PROXY_HOOK=     (none — direct egress; no network_turbo)
CRED_FILE=      ""  (Lambda key is the $LAMBDA_API_KEY env var, not a file on disk — run_one's [ -n "$CRED_FILE" ] guard skips the file read and the env var passes through; SkyPilot key file at ~/.lambda_cloud/lambda_keys, format `api_key = <KEY>`)
SCRATCH=        prune periodic ckpts on local NVMe; keep only `best` on /lambda/nfs/<name>
HF_HOME=        /lambda/nfs/<name>/.cache/huggingface   (durable; survives terminate, free egress on re-pull)
DETACH=         tmux  (apt install if absent; nohup fallback)
SSH_USER=       ubuntu   (NOT root)

Notes for the wrapper:

  • Default checkpoint dir → the NFS mount, not /home/ubuntu — the latter is erased on terminate.
  • If no NFS filesystem is attached, set the wrapper to pull checkpoints to local on the periodic timer (free egress) instead of relying on durable on-box storage.
  • Re-read the instance IP from the console/API on every launch (LAM10) — never persist it in SSH config.
  • Do not pip install torch / blanket apt full-upgrade on the rental — use the Stack as-is (LAM7/8/11); on GH200 use the ARM build (LAM13).
  • The teardown step is terminate via API, gated by the Iron Law; verify billing stopped (no Alert state) and add an explicit reminder to delete the NFS filesystem (LAM6) when the project is done.