
# Spot / Preemption Resilience

Make a job survive being killed at a random instant: that is the price of riding the 50–90 % cheaper
spot/preemptible/interruptible tier. The whole layer reduces to **principle #8**
(`references/principles.md`): checkpoint full state to durable storage on a Young/Daly timer, load-latest
unconditionally on startup, write atomically, treat the preemption signal only as an opportunistic
last-flush. This file is the deep form: per-platform grace windows, the cadence formula with a worked
number, the atomic-write resume recipe, and a commented Python skeleton.

To jump: `grep -in '<keyword>' references/spot-resilience.md` — keywords: `grace`, `signal`, `young`,
`daly`, `cadence`, `atomic`, `rename`, `resume`, `skeleton`, `managed`, `skypilot`, `sagemaker`, `slurm`.

## Table of contents

1. [Preemption signals + grace windows (per platform)](#1-preemption-signals--grace-windows-per-platform)
2. [Checkpoint cadence — the Young/Daly formula](#2-checkpoint-cadence--the-youngdaly-formula)
3. [The atomic-write resume recipe](#3-the-atomic-write-resume-recipe)
4. [Managed-spot frameworks move the box; the checkpoint-load restores the state](#4-managed-spot-frameworks-move-the-box-the-checkpoint-load-restores-the-state)
5. [Python checkpoint/resume skeleton](#5-python-checkpointresume-skeleton)

---

## 1. Preemption signals + grace windows (per platform)

The grace window dictates the design: it decides whether checkpoint-on-signal is even possible, or
whether the timer is the *only* durability. **The window is NOT the safety net** — see the design-breaking
gotcha below the table. Concrete per-platform reach/billing detail lives in each `profiles/<platform>.md`
§4; this is the cross-platform signal map.

| Platform | Detection signal | Grace window | Implication |
|---|---|---|---|
| **AWS EC2 Spot** | IMDS `http://169.254.169.254/latest/meta-data/spot/instance-action` (404 = none, 200 = pending); rebalance-recommendation fires ~10–20 min earlier | **~120 s** | On-signal flush of a *small* checkpoint is viable; still timer-checkpoint for the big one |
| **GCP Spot** | metadata preemption flag + ACPI G2 Soft Off → shutdown script | **~30 s** default (configurable up to 120 s, Preview) | Timer-primary; on-signal flush only if checkpoint write < window |
| **GCP Preemptible (legacy)** | same signal, **plus a hard 24 h cap** regardless of capacity | ~30 s **+ guillotined at 24 h** | Prefer Spot for long runs; Preemptible dies at 24 h even idle |
| **Azure Spot** | IMDS Scheduled Events `/metadata/scheduledevents`, event type `Preempt` | **≥30 s** (Preempt is the short event; others give ≥5 min) | Timer-primary |
| **Slurm preemption / walltime** | `SIGTERM` (then `SIGKILL`); with `#SBATCH --signal=B:SIGTERM@360` the batch step gets SIGTERM ~360 s before the kill | **SIGTERM → ~30 s** default; widen via `--signal` lead time | `--requeue` + an in-script SIGTERM trap to checkpoint, then resume on requeue |
| **RunPod Spot** | OS **SIGTERM → SIGKILL** (also "interruptible without notice") | **~5 s** | Far too short to flush a large checkpoint — timer is the only real durability |
| **vast.ai Interruptible** | **no signal** — bid-based; instance is *paused* (processes killed) the instant it is outbid | **~0 s (abrupt)** | Pure timer; assume cold restart + reload every time |

URLs: AWS [spot-instance-termination-notices](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html),
[rebalance-recommendations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/rebalance-recommendations.html);
GCP [preemptible](https://docs.cloud.google.com/compute/docs/instances/preemptible),
[spot](https://docs.cloud.google.com/compute/docs/instances/spot);
Azure [scheduled-events](https://learn.microsoft.com/en-us/azure/virtual-machines/windows/scheduled-events);
Slurm [sbatch `--signal`](https://slurm.schedmd.com/sbatch.html);
RunPod [spot-vs-on-demand](https://www.runpod.io/blog/spot-vs-on-demand-instances-runpod);
vast.ai [Rental-Types](https://vast.ai/article/Rental-Types).
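
As an illustration of the AWS row, a minimal polling sketch using only the standard library. It assumes IMDSv2 (session token via `PUT /latest/api/token`) and that the 200 body is the JSON notice with `action` and `time` fields; the function names `parse_instance_action` and `fetch_instance_action` are hypothetical, not part of any profile. Treat the result as a hint to attempt an opportunistic flush, never as the durability mechanism.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def parse_instance_action(status, body):
    """Map an IMDS response to the pending-interruption notice, or None if none is posted."""
    if status == 404:                      # 404 = no interruption scheduled
        return None
    if status == 200:                      # 200 = notice pending; body is a small JSON document
        return json.loads(body)
    raise RuntimeError(f"unexpected IMDS status {status}")

def fetch_instance_action(timeout=1.0):
    """Query the endpoint once. Only reachable from inside an EC2 instance."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:  # urlopen raises on 404 instead of returning it
        return parse_instance_action(err.code, "")
```

One way to use it: call `fetch_instance_action()` every few seconds from a background thread and set the same flag a SIGTERM handler would set, so the training loop flushes at its next safe boundary.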

**Gotcha — the design-breaking one.**
Symptom: a "catch SIGTERM, flush the 40 GB checkpoint to durable storage" handler works in testing on AWS
(120 s) but the job dies before the flush completes on RunPod (5 s) / vast.ai (0 s).
Root cause: treating the grace window as the *primary* durability mechanism — it spans 2 min down to ~0
across platforms, so any handler that needs more than a few seconds is a coin flip.
Fix: checkpoint on a **periodic timer to durable storage** (§2); use the signal trap **only** as an
opportunistic "save a final partial checkpoint if there is time" bonus, never as the safety net.

**Gotcha — GCP Preemptible 24 h guillotine.**
Symptom: a multi-day run on a Preemptible VM stops dead at 24 h even though nothing reclaimed it.
Root cause: legacy Preemptible has a hard 24 h max runtime; Spot VMs have no cap.
Fix: use **Spot, not Preemptible** for any run that may last longer than a day.

---

## 2. Checkpoint cadence — the Young/Daly formula

Cadence is a **formula, not a guess.** The optimal checkpoint interval that minimizes total wasted
wall-clock (rollback re-compute after a kill **plus** checkpoint-write overhead) is the Young/Daly result:

```
W = sqrt(2 * mu * C)
```

- `mu` = mean time between preemptions (MTBF), in seconds.
- `C`  = time to write one checkpoint to durable storage, in seconds.
- `W`  = checkpoint interval (write a checkpoint every `W` seconds).

**Worked example.** A checkpoint takes `C = 30 s` to write; the instance is preempted on average every
`mu = 3 h = 10800 s`. Then:

```
W = sqrt(2 * 10800 * 30) = sqrt(648000) ≈ 805 s ≈ 13.4 min  →  checkpoint every ~13 min.
```

Higher preemption rate (smaller `mu`) → shorter interval. Slower checkpoint (larger `C`) → longer interval
(each save costs more, so amortize it over more progress).

**Round W DOWN to an iteration/epoch boundary.** Young/Daly assumes a checkpoint can be taken at *any*
instant, but real iterative training can only snapshot at a step or epoch boundary. So convert `W` to an
integer number of iterations and round *down*: at ~2 s/iteration, `805 s → 402 iters → checkpoint every
400 iters`. Rounding down checkpoints slightly more often than optimal, which is the safe direction.
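
The formula and the round-down step can be sketched in a few lines. The helper names (`young_daly_interval`, `interval_in_iters`) and the `multiple` parameter are illustrative assumptions; the numbers are the ones from the worked example above.

```python
import math

def young_daly_interval(mu_s, ckpt_write_s):
    """W = sqrt(2 * mu * C), all quantities in seconds."""
    return math.sqrt(2.0 * mu_s * ckpt_write_s)

def interval_in_iters(w_s, sec_per_iter, multiple=1):
    """Convert W to whole iterations, rounding DOWN (optionally to a tidy multiple)."""
    iters = max(1, int(w_s // sec_per_iter))
    rounded = iters - iters % multiple
    return rounded if rounded > 0 else iters   # never return 0 iterations

w = young_daly_interval(mu_s=10800, ckpt_write_s=30)   # about 805 s
exact_iters = interval_in_iters(w, sec_per_iter=2.0)    # 402
tidy_iters = interval_in_iters(w, sec_per_iter=2.0, multiple=100)  # 400
```

Measure `C` and seconds-per-iteration on the real job before plugging them in; both vary with model size and storage backend.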

**Distributed multiplier.** With `N` workers, one preemption wastes `N×` the compute (the whole group rolls
back), so distributed jobs should checkpoint *more* frequently than the single-GPU `W` suggests.
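
One way to put a number on "more frequently", under assumptions that go beyond the text above: each of the `N` nodes is preempted independently at the same rate, and losing any one node rolls back the whole group. The group's mean time between preemptions is then roughly `mu / N`, and the interval shrinks by a factor of `sqrt(N)`. The function name is hypothetical.

```python
import math

def group_checkpoint_interval(mu_node_s, ckpt_write_s, n_nodes):
    """Young/Daly interval for a group that fails whenever ANY node is preempted.

    Assumes independent, identically distributed preemptions per node,
    so the group-level MTBF is mu_node / n_nodes.
    """
    mu_group_s = mu_node_s / n_nodes
    return math.sqrt(2.0 * mu_group_s * ckpt_write_s)

single = group_checkpoint_interval(10800, 30, n_nodes=1)   # about 805 s
eight = group_checkpoint_interval(10800, 30, n_nodes=8)    # about 285 s
```

If preemptions are correlated (a whole zone reclaimed at once), the real group MTBF is longer than `mu / N` and this estimate is conservative.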

URLs: Young/Daly [robustness paper, INRIA](https://people.bordeaux.inria.fr/gaupy/ressources/pub/confs/icpp20_robustness.pdf),
[Optimal Checkpointing Period, LAWN 281](https://www.netlib.org/lapack/lawnspdf/lawn281.pdf),
[Optimal Checkpointing for Iterative Applications, IEEE](https://ieeexplore.ieee.org/document/9495174/).

---

## 3. The atomic-write resume recipe

Two failure modes turn "I have checkpoints" into "my resume is broken": a **partial weight save** and a
**corrupt-on-kill checkpoint**. The recipe fixes both.

**Save FULL training state, not just model weights.** A resume that restores only weights silently
restarts the epoch, reshuffles data, and degrades accuracy. The checkpoint must include:

- model `state_dict`
- optimizer `state_dict`
- LR-scheduler `state_dict`
- epoch **and** global step/iteration counter
- RNG state (Python `random`, NumPy, `torch`, and CUDA)
- dataloader position (sampler epoch / resumable-sampler offset)

**Write atomically: tmp → fsync → os.replace.** A preemption mid-write corrupts the file, and a naive
overwrite can leave zero good checkpoints. On POSIX, `os.replace` maps to `rename(2)`, which is atomic
when source and destination are on the same filesystem (and, unlike `os.rename`, it also overwrites an
existing destination on Windows), so:

1. Write the whole state to `latest.pt.tmp`.
2. `fsync` the file (and the directory) so bytes hit disk before the rename.
3. `os.replace("latest.pt.tmp", "latest.pt")` — the swap is all-or-nothing.
4. Keep the previous `latest.pt` until the new one is committed; a kill at any point leaves one intact file.
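
The four steps can be exercised without a training framework. The sketch below uses only the standard library and plain bytes; `atomic_write_bytes` is a hypothetical helper, and the directory fsync is attempted only on POSIX (some network or FUSE mounts may reject it, which the sketch tolerates). The tail of the block simulates a kill mid-write by leaving a partial temp file behind.

```python
import os
import tempfile

def atomic_write_bytes(path, payload):
    """Write payload to path via tmp -> fsync -> os.replace."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())               # file bytes reach disk before the rename
    os.replace(tmp, path)                  # all-or-nothing swap on the same filesystem
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)               # persist the directory entry (the rename itself)
        except OSError:
            pass                           # assumption: tolerate mounts without directory fsync
        finally:
            os.close(dir_fd)

workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "latest.pt")
atomic_write_bytes(ckpt, b"step-400")

# Simulate a kill mid-write: a partial temp file exists, the rename never ran.
with open(ckpt + ".tmp", "wb") as f:
    f.write(b"step-8")
with open(ckpt, "rb") as f:
    survivor = f.read()                    # still the last committed checkpoint
```

The next successful save simply overwrites the stale `.tmp`, so no cleanup pass is needed on startup.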

**Checkpoint to the platform's DURABLE location, not local scratch** (principle #4). A managed replacement
node is *fresh* — anything not on a cloud bucket / network volume / shared FS is gone. On a marketplace box
where local disk persists across a pause, still mirror to durable storage at intervals.

**Load-latest UNCONDITIONALLY on startup.** Use the *same code path* for first launch (no checkpoint →
start fresh) and every restart-after-preemption (checkpoint exists → resume). This is what makes the job
idempotent: the **identical launch command** run any number of times converges to the same end state, which
is exactly what makes principle #7's "retry the identical config" actually resume progress instead of
restarting from zero.

URLs: [Check-N-Run, arXiv](https://arxiv.org/pdf/2010.08679),
[SkyPilot training-guide](https://docs.skypilot.co/en/latest/reference/training-guide.html),
[SageMaker resume-from-checkpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints-resume.html).

---

## 4. Managed-spot frameworks move the box; the checkpoint-load restores the state

Managed frameworks **auto-provision a replacement** on preemption — but they restart the **process from
scratch**. The framework moves the box; the checkpoint-load written in §3/§5 is what restores progress.
This is the single most-misunderstood point: the framework does **not** resume training on its own.

- **SkyPilot Managed Jobs** — strongest cross-cloud recommendation (re-provisions in a different
  region/cloud to chase capacity, then re-runs the task). Caveat: it auto-recovers **only**
  preemption/hardware failures — a user-code non-zero exit is **not** auto-recovered.
  [managed-jobs](https://docs.skypilot.co/en/latest/examples/managed-jobs.html).
- **AWS SageMaker Managed Spot** — set `use_spot_instances=True` + `checkpoint_s3_uri`; SageMaker syncs the
  checkpoint dir to S3 during training and copies it back on restart (up to ~90 % savings). Gotcha:
  **`max_wait` must be greater than `max_run`** — `max_wait` covers wait-for-capacity *plus* run time
  *plus* interruption gaps; set it too tight and the job is killed mid-resume.
  [managed-spot docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html).
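
A small pre-flight check can catch the `max_wait` gotcha before a job is submitted. This is a sketch over a plain dict, not the SageMaker SDK: `validate_spot_settings` is a hypothetical helper, the bucket name is a placeholder, and the keys are the parameter names mentioned above.

```python
def validate_spot_settings(settings):
    """Return a list of problems with managed-spot settings (empty list = looks sane)."""
    problems = []
    if settings.get("use_spot_instances") is not True:
        problems.append("use_spot_instances is not True")
    if not str(settings.get("checkpoint_s3_uri", "")).startswith("s3://"):
        problems.append("checkpoint_s3_uri is missing or not an s3:// URI")
    max_run, max_wait = settings.get("max_run"), settings.get("max_wait")
    if max_run is None or max_wait is None:
        problems.append("max_run and max_wait must both be set")
    elif max_wait <= max_run:
        # Equal values leave no headroom for capacity waits or interruption gaps.
        problems.append("max_wait must be greater than max_run")
    return problems

good = {
    "use_spot_instances": True,
    "checkpoint_s3_uri": "s3://example-bucket/checkpoints/",  # placeholder bucket
    "max_run": 24 * 3600,
    "max_wait": 48 * 3600,
}
too_tight = dict(good, max_wait=good["max_run"])
```

Settings that pass can then be handed to the estimator as keyword arguments; how much headroom `max_wait` needs depends on how scarce the instance type is.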

Universal multi-cloud auto-failover is **out of scope for this skill** — use SkyPilot/dstack for that, then
return here to make the *code* resume-correct so their recovery actually lands on progress
(`superpowers:verification-before-completion` gates the "it resumed" claim against a loaded checkpoint, not
a log line). For the elastic / multi-node tier (torchrun `--max-restarts`, Elastic Horovod) see
`references/multinode.md`; the same invariant holds — the framework restarts processes, the per-epoch
snapshot restores state.

---

## 5. Python checkpoint/resume skeleton

Read this for the algorithm; adapt into the training script. The shape is platform-agnostic — only
`DURABLE_DIR` changes per profile (§8 SCRIPT OVERRIDES).

```python
import os, random, signal
import numpy as np
import torch

DURABLE_DIR = os.environ["DURABLE_DIR"]   # profile-supplied bucket/FS/volume mount, NOT local scratch
CKPT = os.path.join(DURABLE_DIR, "latest.pt")
CKPT_EVERY_ITERS = 400                     # = round_down(Young/Daly W / sec_per_iter); see section 2

def save_full_state(model, opt, sched, epoch, step):
    """Atomic write: tmp -> fsync -> os.replace. A kill at any point leaves one intact file."""
    state = {
        "model": model.state_dict(),
        "opt": opt.state_dict(),
        "sched": sched.state_dict(),
        "epoch": epoch, "step": step,        # resume the exact position, not the epoch start
        "rng_python": random.getstate(),
        "rng_numpy": np.random.get_state(),
        "rng_torch": torch.get_rng_state(),
        "rng_cuda": torch.cuda.get_rng_state_all(),
    }
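    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        torch.save(state, f)
        f.flush()
        os.fsync(f.fileno())                 # bytes hit disk BEFORE the rename
    os.replace(tmp, CKPT)                     # POSIX-atomic swap; prev file valid until this returns
    if os.name == "posix":                    # fsync the directory too, so the rename itself is durable
        dir_fd = os.open(DURABLE_DIR, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        except OSError:
            pass                             # some network/FUSE mounts reject directory fsync
        finally:
            os.close(dir_fd)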

def load_latest_if_any(model, opt, sched):
    """Unconditional load-latest: identical command resumes OR starts fresh. Returns (epoch, step)."""
    if not os.path.exists(CKPT):
        return 0, 0                          # first run, no checkpoint -> start from scratch
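    # weights_only=False: the RNG states are pickled tuples/arrays, which the safe loader
    # (the default in recent PyTorch) rejects. Only load checkpoints this job wrote itself.
    s = torch.load(CKPT, map_location="cpu", weights_only=False)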
    model.load_state_dict(s["model"])
    opt.load_state_dict(s["opt"])
    sched.load_state_dict(s["sched"])
    random.setstate(s["rng_python"])
    np.random.set_state(s["rng_numpy"])
    torch.set_rng_state(s["rng_torch"])
    torch.cuda.set_rng_state_all(s["rng_cuda"])
    return s["epoch"], s["step"]             # caller skips dataloader to this position

# --- opportunistic last-flush only; NOT the safety net (section 1) ---
_preempted = {"flag": False}
def _on_sigterm(signum, frame):
    _preempted["flag"] = True                # set a flag; flush at the next safe boundary, do not block here
signal.signal(signal.SIGTERM, _on_sigterm)

def train(model, opt, sched, dataloader, total_epochs):
    start_epoch, start_step = load_latest_if_any(model, opt, sched)
    step = start_step
    for epoch in range(start_epoch, total_epochs):
        for batch in dataloader:             # a resumable sampler should fast-forward to start_step
            # ... forward / backward / opt.step() / sched.step() ...
            step += 1
            if step % CKPT_EVERY_ITERS == 0 or _preempted["flag"]:
                save_full_state(model, opt, sched, epoch, step)
                if _preempted["flag"]:
                    return                   # grace window may be ~0s; exit cleanly after the flush
```

Verify the resume path before trusting it: kill the process mid-epoch, relaunch the *identical* command,
and confirm step/epoch/loss continue rather than reset (this is the `verifying-dl-experiments`
reproducibility check, applied to preemption). Trust the **loaded** checkpoint, not the "resumed" log line
(principle #3).
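
The skeleton leaves the dataloader fast-forward abstract. One simple way to do it, sketched below with a plain list standing in for the dataloader, is to derive the resume position from the global step and skip the already-consumed batches with `itertools.islice`. The names `resume_position`, `batches_from`, and `run` are hypothetical, and the sketch assumes a fixed number of batches per epoch and a batch order that is reproducible per epoch (for example, a shuffle seeded by the epoch number).

```python
import itertools

def resume_position(global_step, steps_per_epoch):
    """Split a global step into (epoch to resume, batches already consumed in it)."""
    return divmod(global_step, steps_per_epoch)

def batches_from(dataloader, skip):
    """Iterate one epoch, skipping the first `skip` batches."""
    return itertools.islice(dataloader, skip, None)

def run(dataloader, total_epochs, start_step=0, stop_after=None):
    """Toy loop: records (epoch, batch) pairs; `stop_after` simulates a preemption."""
    start_epoch, skip = resume_position(start_step, len(dataloader))
    step, seen = start_step, []
    for epoch in range(start_epoch, total_epochs):
        for batch in batches_from(dataloader, skip):
            seen.append((epoch, batch))
            step += 1
            if stop_after is not None and step == stop_after:
                return step, seen
        skip = 0                       # only the first resumed epoch is partial
    return step, seen

data = ["a", "b", "c", "d"]
full_step, full_seen = run(data, total_epochs=2)
killed_step, first_seen = run(data, total_epochs=2, stop_after=5)
resumed_step, resumed_seen = run(data, total_epochs=2, start_step=killed_step)
```

Deriving the epoch from the step also sidesteps the edge case where a checkpoint lands exactly on an epoch boundary. Note that `islice` still loads the skipped batches; for large datasets a sampler that starts at an offset avoids that cost.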