playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/references/parallel_ablation.md

# Parallel Ablation Fan-out — FS-shared deployment, isolated write paths, reconciliation

Run N ablation cells in parallel across instances/queues without corrupting shared state, then
reconcile and re-verify every cell before any teardown. The mechanism is **one job per cell with an
isolated write path**; the discipline is **`superpowers:dispatching-parallel-agents`'s independence
predicate + reconciliation**. **REQUIRED:** `superpowers:dispatching-parallel-agents` and
**REQUIRED:** `superpowers:verification-before-completion`.

To jump: `grep -in <keyword> references/parallel_ablation.md`.

## Table of contents

1. The fan-out model (one job per cell)
2. FS-shared wrapper deployment (place once, never mutate mid-run)
3. The independence predicate (isolated write path = the analogue of a git worktree)
4. The portable job request (describe once, run on any profile)
5. Queue-file format + resume via `start_index`
6. Mandatory post-fan-out reconciliation + full re-verify
7. Gotchas

---

## 1. The fan-out model

Parallelism comes from running **multiple queues on multiple instances simultaneously** — never from
parallel jobs inside one instance (sequential per instance keeps memory predictable and prevents disk
contention). The unit of work is the **ablation cell**: one
`(cfg, task, epochs)` row → one `run_one` invocation → one isolated output directory.

```
shared FS: /path/to/shared/run_one.sh, run_queue.sh   (ONE version, all instances read it)
instance A  tmux ──> run_queue.sh queueA.txt ──> cell a1 ──> cell a2 ──> ...
instance B  tmux ──> run_queue.sh queueB.txt ──> cell b1 ──> cell b2 ──> ...
instance C  tmux ──> run_queue.sh queueC.txt ──> cell c1 ──> ...
                                       each cell writes ONLY to its own /ckpt/<name>/ + FS/<name>/
```

Split the N cells across queue files (one per instance) by cost, not count — route the long cells
(detection at 50 epochs) onto faster/idle instances so the queues finish near-simultaneously.

---

## 2. FS-shared wrapper deployment

Place a **single copy** of `run_one`/`run_queue` on the cross-instance shared filesystem
(`profiles/<platform>.md` STORAGE names the mount; on AutoDL it is the FS tier, on RunPod a Network
Volume, on a bare box a synced NFS/`rsync` target). Every instance reads the **same version** — no
per-instance drift, no "fixed it on A but not B."

**Recall principle #6 — never mutate inputs under a live run.** A running queue holds
`run_queue.sh`/`run_one.sh` in memory by byte-offset; overwriting either mid-run lands bash in the
middle of a *different* file and re-executes blocks (duplicate runs, stalled queues). Therefore:

- **Deploy the wrapper before launching any queue.** Treat the FS copy as immutable for the fan-out's
  lifetime. Edit only when nothing reads it (`pgrep -af run_queue.sh` empty on every instance).
- **Appending to a queue *file* mid-flight is safe** (streaming read re-reads on each iteration);
  editing the *script* is not. New cells → append a line, or start a fresh queue file.
- A fix that must reach in-flight jobs → **version the filename** (`run_one.v2.sh`), drain the old
  queues, point new queues at the new file. Never `scp` over the path a live queue is reading.

The FS copy is also the durable safety net: `run_one`'s post-success step syncs `best.pth` +
metrics + log to `FS/<name>/`, so a released/dead instance still leaves its cell's result on the FS.

---

## 3. The independence predicate

**REQUIRED:** `superpowers:dispatching-parallel-agents` — fan out only over work whose units share no
mutable state. Here the predicate is concrete: **each cell writes to its own output directory and
nothing else.** The per-job output dir is the platform analogue of a **git worktree** — an isolated
workspace where one agent's writes can never collide with another's.

Hold the predicate by routing every per-cell write to a name-scoped path:

| Write target | Isolation key | Set via |
|---|---|---|
| checkpoints | `<ckpt_root>/<name>/` | `training.checkpoint_dir` override (per `<name>`) |
| FS final copy | `FS/final_ckpts/<name>/` | `run_one` post-success sync |
| tracker run | `group=<task>_<cat>`, unique run name | `wandb.group` / `wandb.tags` overrides |
| per-cell log | `<name>.log` | `run_queue` per-line logging |

**Never fan out onto shared mutable output.** Two cells writing `latest.pth`, the same
`checkpoint_dir`, or one tracker run id = the exact shared-state violation the predicate forbids —
it produces silently interleaved checkpoints and unattributable metrics, which no amount of
post-hoc reconciliation can untangle. The `<name>` derives 1:1 from the cfg, so distinct cfgs →
distinct paths automatically; **two queue lines must never share a `<name>`.**

What is read-shared (the immutable wrappers, the dataset, the base image) is fine — the predicate
only forbids shared **mutable** state.

---

## 4. The portable job request

Describe a sweep once so the *same* fan-out runs against any profile (the launcher resolves it
against `profiles/<platform>.md`; the profile supplies paths/verbs, the job supplies the work — see
`profiles/_schema.md`):

```yaml
resources:
  gpu: {name: A100, count: 1, memory: 24GB+}    # a CONSTRAINT, never a platform SKU
  disk: 100GB                                    # ckpt_size × cells_per_instance + scratch
candidates: [autodl, china, runpod]              # ordered fallback → describe once, run anywhere
run: "bash run_queue.sh queue.txt"               # the per-instance entry point
```

Per-instance disk budget = `ckpt_size × cells_in_this_queue + scratch` (principle #5). Pre-compute it
in Phase 0; a fan-out that under-budgets disk fails the *last* cells of each queue, not the first.

---

## 5. Queue-file format + resume

One ablation cell per line, whitespace-separated (`while IFS=' ' read -r cfg task epochs`):

```
<cfg_path> <task> [epochs]
```

- `cfg_path` — yaml file relative to repo root; its basename is the cell `<name>` (the isolation key).
- `task` — reconstruction / segmentation / detection (or other supported task) — sets tracker group/tags.
- `epochs` — optional integer; omitted → wrapper default (e.g. `20`). The optional 3rd field lets one
  queue mix per-task budgets (detection 50, recon/seg 20).

```
configs/experiments/ablation/recon/baseline.yaml      reconstruction 20
configs/experiments/ablation/det/baseline.yaml        detection      50
configs/experiments/ablation/seg/no_aug.yaml          segmentation
```

**Resume via `start_index`.** A queue killed at cell k (SSH drop, preemption, OOM) resumes with
`bash run_queue.sh queue.txt <k>` — it skips the first k lines and continues. This is the queue-level
form of principle #8 (idempotent resume); combined with per-cell checkpoint-load-on-startup, a
half-finished cell resumes mid-cell, not from scratch. Keep `start_index` aligned to the queue file:
appending lines is safe, **reordering or deleting earlier lines shifts every index** — append only.

---

## 6. Mandatory post-fan-out reconciliation + full re-verify

**REQUIRED:** `superpowers:dispatching-parallel-agents` (reconcile) and
**REQUIRED:** `superpowers:verification-before-completion` (evidence before any success claim). When
queues report done, the watcher's "done" is a **claim** (principle #3), not ground truth — a cell can
report success on a silently-failed sync, OOM mid-write, or never have run because its instance died.

Reconcile and re-verify **every cell before any teardown** — this is a hard gate, not a spot check:

1. **Roster.** Enumerate the expected cell `<name>` set from all queue files (the ground-truth roster).
2. **Reconcile.** For each `<name>`, confirm `FS/final_ckpts/<name>/` exists and holds `best.pth` +
   metrics + log. List the delta: missing, zero-byte, or duplicate-`<name>` collisions (a predicate
   violation that slipped through — see Gotchas).
3. **Re-verify by load.** Run `scripts/verify_local.py` over the durable copies — *load* each
   checkpoint and metrics file. "The file exists" / "the log said synced" is not evidence; a load
   that succeeds is (principle #3, the `verifying-dl-experiments` boundary owns whether the *number*
   is real — **REQUIRED:** `verifying-dl-experiments`).
4. **Remediate, never blind-retry.** Each missing/failed cell → classify the cause, then re-launch the
   **identical config** (principle #7) on a live instance via `start_index`, or append its line to a
   fresh queue. Do not patch one cell's config to make it pass — that destroys comparability.

Only after the roster is 100% reconciled AND every cell loads does the teardown Iron Law unlock
(SKILL.md Phase 5): no `release`/`terminate`/`destroy` until results are pulled to local AND verified
by load AND the user approves the cost-affecting action.

---

## 7. Gotchas

**Two cells share a `<name>` → interleaved checkpoints, unattributable metrics.**
Symptom: one cell's `best.pth` overwritten, a tracker run with mixed curves, reconciliation finds N-1
output dirs for N cells. → Root cause: independence-predicate violation — two queue lines mapped to
the same isolation key (same cfg basename / hand-set identical `checkpoint_dir`). → Fix: enforce
distinct `<name>` per line *before* launch (the cfg→`<name>` map must be injective); on collision,
rename one cfg and rerun both — interleaved output cannot be un-mixed after the fact.

**Editing the FS wrapper mid-fan-out → duplicate / stalled cells across instances.**
Symptom: cells re-run or queues hang after a "quick fix" to `run_queue.sh`/`run_one.sh` on the FS. →
Root cause: principle #6 — live bash holds the script by byte-offset; overwriting the shared copy
corrupts every reader at once. → Fix: treat the FS wrapper as immutable for the fan-out's lifetime;
version the filename and repoint new queues; edit only when `pgrep -af run_queue.sh` is empty
everywhere.

**Queue reports "all done" but a cell never ran.**
Symptom: roster has N cells, FS has fewer; no error in the surviving logs. → Root cause: the instance
died (released, preempted, host fault) and its queue's "done" was never emitted — absence of failure
is not presence of success (principle #3). → Fix: reconcile against the **roster**, not against the
watcher's last status; re-launch missing cells with `start_index` on a live instance.

**`start_index` resumes the wrong cell after a queue edit.**
Symptom: resume skips or re-runs the wrong rows. → Root cause: a line was inserted/deleted/reordered,
shifting every subsequent index. → Fix: append-only to in-flight queue files; to drop a cell, comment
it (don't delete) so indices stay stable, or start a fresh queue file for the remainder.

> Universal gotchas (SSH drop on `pkill`, CRLF, cgroup OOM, silent sync, inode exhaustion on
> many-small-files eval output across a shared FS) are **not** restated here — see
> `references/gotchas_universal.md`. Shared-FS inode pressure (principle #5) bites hardest exactly
> during fan-out, when N cells write eval artifacts to one FS at once.