# Parallel Ablation Fan-out — FS-shared deployment, isolated write paths, reconciliation Run N ablation cells in parallel across instances/queues without corrupting shared state, then reconcile and re-verify every cell before any teardown. The mechanism is **one job per cell with an isolated write path**; the discipline is **`superpowers:dispatching-parallel-agents`'s independence predicate + reconciliation**. **REQUIRED:** `superpowers:dispatching-parallel-agents` and **REQUIRED:** `superpowers:verification-before-completion`. To jump: `grep -in references/parallel_ablation.md`. ## Table of contents 1. The fan-out model (one job per cell) 2. FS-shared wrapper deployment (place once, never mutate mid-run) 3. The independence predicate (isolated write path = the analogue of a git worktree) 4. The portable job request (describe once, run on any profile) 5. Queue-file format + resume via `start_index` 6. Mandatory post-fan-out reconciliation + full re-verify 7. Gotchas --- ## 1. The fan-out model Parallelism comes from running **multiple queues on multiple instances simultaneously** — never from parallel jobs inside one instance (sequential per instance keeps memory predictable and prevents disk contention). The unit of work is the **ablation cell**: one `(cfg, task, epochs)` row → one `run_one` invocation → one isolated output directory. ``` shared FS: /path/to/shared/run_one.sh, run_queue.sh (ONE version, all instances read it) instance A tmux ──> run_queue.sh queueA.txt ──> cell a1 ──> cell a2 ──> ... instance B tmux ──> run_queue.sh queueB.txt ──> cell b1 ──> cell b2 ──> ... instance C tmux ──> run_queue.sh queueC.txt ──> cell c1 ──> ... each cell writes ONLY to its own /ckpt// + FS// ``` Split the N cells across queue files (one per instance) by cost, not count — route the long cells (detection at 50 epochs) onto faster/idle instances so the queues finish near-simultaneously. --- ## 2. FS-shared wrapper deployment Place a **single copy** of `run_one`/`run_queue` on the cross-instance shared filesystem (`profiles/.md` STORAGE names the mount; on AutoDL it is the FS tier, on RunPod a Network Volume, on a bare box a synced NFS/`rsync` target). Every instance reads the **same version** — no per-instance drift, no "fixed it on A but not B." **Recall principle #6 — never mutate inputs under a live run.** A running queue holds `run_queue.sh`/`run_one.sh` in memory by byte-offset; overwriting either mid-run lands bash in the middle of a *different* file and re-executes blocks (duplicate runs, stalled queues). Therefore: - **Deploy the wrapper before launching any queue.** Treat the FS copy as immutable for the fan-out's lifetime. Edit only when nothing reads it (`pgrep -af run_queue.sh` empty on every instance). - **Appending to a queue *file* mid-flight is safe** (streaming read re-reads on each iteration); editing the *script* is not. New cells → append a line, or start a fresh queue file. - A fix that must reach in-flight jobs → **version the filename** (`run_one.v2.sh`), drain the old queues, point new queues at the new file. Never `scp` over the path a live queue is reading. The FS copy is also the durable safety net: `run_one`'s post-success step syncs `best.pth` + metrics + log to `FS//`, so a released/dead instance still leaves its cell's result on the FS. --- ## 3. The independence predicate **REQUIRED:** `superpowers:dispatching-parallel-agents` — fan out only over work whose units share no mutable state. Here the predicate is concrete: **each cell writes to its own output directory and nothing else.** The per-job output dir is the platform analogue of a **git worktree** — an isolated workspace where one agent's writes can never collide with another's. Hold the predicate by routing every per-cell write to a name-scoped path: | Write target | Isolation key | Set via | |---|---|---| | checkpoints | `//` | `training.checkpoint_dir` override (per ``) | | FS final copy | `FS/final_ckpts//` | `run_one` post-success sync | | tracker run | `group=_`, unique run name | `wandb.group` / `wandb.tags` overrides | | per-cell log | `.log` | `run_queue` per-line logging | **Never fan out onto shared mutable output.** Two cells writing `latest.pth`, the same `checkpoint_dir`, or one tracker run id = the exact shared-state violation the predicate forbids — it produces silently interleaved checkpoints and unattributable metrics, which no amount of post-hoc reconciliation can untangle. The `` derives 1:1 from the cfg, so distinct cfgs → distinct paths automatically; **two queue lines must never share a ``.** What is read-shared (the immutable wrappers, the dataset, the base image) is fine — the predicate only forbids shared **mutable** state. --- ## 4. The portable job request Describe a sweep once so the *same* fan-out runs against any profile (the launcher resolves it against `profiles/.md`; the profile supplies paths/verbs, the job supplies the work — see `profiles/_schema.md`): ```yaml resources: gpu: {name: A100, count: 1, memory: 24GB+} # a CONSTRAINT, never a platform SKU disk: 100GB # ckpt_size × cells_per_instance + scratch candidates: [autodl, china, runpod] # ordered fallback → describe once, run anywhere run: "bash run_queue.sh queue.txt" # the per-instance entry point ``` Per-instance disk budget = `ckpt_size × cells_in_this_queue + scratch` (principle #5). Pre-compute it in Phase 0; a fan-out that under-budgets disk fails the *last* cells of each queue, not the first. --- ## 5. Queue-file format + resume One ablation cell per line, whitespace-separated (`while IFS=' ' read -r cfg task epochs`): ``` [epochs] ``` - `cfg_path` — yaml file relative to repo root; its basename is the cell `` (the isolation key). - `task` — reconstruction / segmentation / detection (or other supported task) — sets tracker group/tags. - `epochs` — optional integer; omitted → wrapper default (e.g. `20`). The optional 3rd field lets one queue mix per-task budgets (detection 50, recon/seg 20). ``` configs/experiments/ablation/recon/baseline.yaml reconstruction 20 configs/experiments/ablation/det/baseline.yaml detection 50 configs/experiments/ablation/seg/no_aug.yaml segmentation ``` **Resume via `start_index`.** A queue killed at cell k (SSH drop, preemption, OOM) resumes with `bash run_queue.sh queue.txt ` — it skips the first k lines and continues. This is the queue-level form of principle #8 (idempotent resume); combined with per-cell checkpoint-load-on-startup, a half-finished cell resumes mid-cell, not from scratch. Keep `start_index` aligned to the queue file: appending lines is safe, **reordering or deleting earlier lines shifts every index** — append only. --- ## 6. Mandatory post-fan-out reconciliation + full re-verify **REQUIRED:** `superpowers:dispatching-parallel-agents` (reconcile) and **REQUIRED:** `superpowers:verification-before-completion` (evidence before any success claim). When queues report done, the watcher's "done" is a **claim** (principle #3), not ground truth — a cell can report success on a silently-failed sync, OOM mid-write, or never have run because its instance died. Reconcile and re-verify **every cell before any teardown** — this is a hard gate, not a spot check: 1. **Roster.** Enumerate the expected cell `` set from all queue files (the ground-truth roster). 2. **Reconcile.** For each ``, confirm `FS/final_ckpts//` exists and holds `best.pth` + metrics + log. List the delta: missing, zero-byte, or duplicate-`` collisions (a predicate violation that slipped through — see Gotchas). 3. **Re-verify by load.** Run `scripts/verify_local.py` over the durable copies — *load* each checkpoint and metrics file. "The file exists" / "the log said synced" is not evidence; a load that succeeds is (principle #3, the `verifying-dl-experiments` boundary owns whether the *number* is real — **REQUIRED:** `verifying-dl-experiments`). 4. **Remediate, never blind-retry.** Each missing/failed cell → classify the cause, then re-launch the **identical config** (principle #7) on a live instance via `start_index`, or append its line to a fresh queue. Do not patch one cell's config to make it pass — that destroys comparability. Only after the roster is 100% reconciled AND every cell loads does the teardown Iron Law unlock (SKILL.md Phase 5): no `release`/`terminate`/`destroy` until results are pulled to local AND verified by load AND the user approves the cost-affecting action. --- ## 7. Gotchas **Two cells share a `` → interleaved checkpoints, unattributable metrics.** Symptom: one cell's `best.pth` overwritten, a tracker run with mixed curves, reconciliation finds N-1 output dirs for N cells. → Root cause: independence-predicate violation — two queue lines mapped to the same isolation key (same cfg basename / hand-set identical `checkpoint_dir`). → Fix: enforce distinct `` per line *before* launch (the cfg→`` map must be injective); on collision, rename one cfg and rerun both — interleaved output cannot be un-mixed after the fact. **Editing the FS wrapper mid-fan-out → duplicate / stalled cells across instances.** Symptom: cells re-run or queues hang after a "quick fix" to `run_queue.sh`/`run_one.sh` on the FS. → Root cause: principle #6 — live bash holds the script by byte-offset; overwriting the shared copy corrupts every reader at once. → Fix: treat the FS wrapper as immutable for the fan-out's lifetime; version the filename and repoint new queues; edit only when `pgrep -af run_queue.sh` is empty everywhere. **Queue reports "all done" but a cell never ran.** Symptom: roster has N cells, FS has fewer; no error in the surviving logs. → Root cause: the instance died (released, preempted, host fault) and its queue's "done" was never emitted — absence of failure is not presence of success (principle #3). → Fix: reconcile against the **roster**, not against the watcher's last status; re-launch missing cells with `start_index` on a live instance. **`start_index` resumes the wrong cell after a queue edit.** Symptom: resume skips or re-runs the wrong rows. → Root cause: a line was inserted/deleted/reordered, shifting every subsequent index. → Fix: append-only to in-flight queue files; to drop a cell, comment it (don't delete) so indices stay stable, or start a fresh queue file for the remainder. > Universal gotchas (SSH drop on `pkill`, CRLF, cgroup OOM, silent sync, inode exhaustion on > many-small-files eval output across a shared FS) are **not** restated here — see > `references/gotchas_universal.md`. Shared-FS inode pressure (principle #5) bites hardest exactly > during fan-out, when N cells write eval artifacts to one FS at once.