playbook/antigravity-awesome-skills/skills/remote-gpu-trainer/references/monitoring_patterns.md

328 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Monitoring Patterns — durable watching of a remote GPU job
Platform-agnostic recipes for babysitting a long-running detached job on a rented box. The crown jewel
is the **four-layer durable-monitoring architecture** (§3): a session-bound watcher alone dies with the
session, so layer it. Every recipe uses portable primitives — `tmux` OR `squeue` OR `pgrep`, a log
marker OR an artifact `mtime` — never one platform's paths. Bind the concrete paths/aliases from
`profiles/<platform>.md`.
To jump: `grep -in '<keyword>' references/monitoring_patterns.md`.
## Table of contents
- §0 Monitoring physics — the four facts every recipe rests on
- §1 The robust short-connection ssh-poll template (the safe poll primitive)
- §2 Quick health probes (one round-trip each)
- §3 Durable monitoring architecture — the four layers (L1 self-completion · L2 patrol · L3 sentinels · L4 handbook)
- §4 Stale-waiter hygiene — one waiter per live run, right lifetime
- §5 Two-leg self-completion — guaranteed results + best-effort cadence
- §6 Failure triage on the log
- §7 Monitoring across agent hosts — per-host background/loop/cron primitives + the 2 portability rules (Claude Code · Codex · Cursor · Trae · generic)
---
## §0 Monitoring physics — the four facts every recipe rests on
Verified in-session, not assumed. The whole architecture is engineered around these:
> **Tool-portability note:** `run_in_background`, the ~600 s foreground cap, and `/schedule` (below and
> §5/§3) are the **Claude Code** harness's primitives. On another Agent-Skills host (Codex / Cursor /
> Trae / …) map them to that agent's equivalents — its background-task or async runner, its own
> foreground/turn limit, its scheduler. The four-layer architecture itself is host-agnostic — the full
> per-host mapping (Codex / Cursor / Trae / generic) and the two portability rules are in **§7**.
1. **Foreground Bash hard-caps at 600 s (10 min).** A long foreground wait/monitor is *killed* at the cap
— so never foreground-poll a multi-hour run.
2. **`run_in_background` has NO duration cap and notifies on EXIT.** A 781 s background task ran to
completion and notified (verified). Long task that finishes → background it.
3. **A never-*exiting* watcher never notifies.** No exit event = no notification, ever. A persistent
`while true` / a stray `grep` reading stdin hangs silently forever and the user reads silence as "dead
monitor". Every watcher must have a bounded exit.
4. **An unquoted `|` inside a poll regex hangs forever.** The shell splits `grep -hE a|b|c log` into three
piped commands; the first (`grep -hE a`, no filename) reads **stdin** → blocks → the pipeline never
returns → the ssh never returns → the background process never exits → fact 3 fires. ALWAYS quote the
regex AND give grep a filename.
Corollary — **trust the artifact, not the silence.** When a job "looks done," Read its output file and
re-check ground truth (`grep DONE log; tmux ls / squeue; nvidia-smi`) before claiming success. Do not
wait blindly for a notification that may never fire. This is the `verifying-dl-experiments` (REQUIRED)
Iron Law applied to monitoring.
---
## §1 The robust short-connection ssh-poll template (the safe poll primitive)
The single most important pattern: a poll that cannot hang (fact 4) and cannot strand a half-open
connection. **Never hold one long ssh open for the whole wait** — loop locally, reconnecting each tick.
```bash
#!/usr/bin/env bash
set -u
# Short-connection poll: ssh in → check → disconnect; bounded local loop.
HOST="<alias>" # from profiles/<platform>.md
LOG="/path/to/run.log" # remote log path (profile-bound)
PATTERN='QUEUE DONE|Training completed' # QUOTED → '|' is alternation, never a pipe (fact 4)
MAX=120 # bounded: 120 ticks × 90 s ≈ 3 h, then give up cleanly
i=0
while [ "$i" -lt "$MAX" ]; do
# ConnectTimeout + ServerAlive bound a network blip to ~30 s instead of a multi-minute half-open hang.
if ssh -o ConnectTimeout=15 -o ServerAliveInterval=10 -o ServerAliveCountMax=3 "$HOST" \
"grep -qE '$PATTERN' '$LOG'"; then # quoted regex + a FILENAME → grep reads the file, never stdin
echo "DONE marker found"; exit 0
fi
i=$((i+1)); sleep 90
done
echo "poll gave up after $MAX ticks — check ground truth manually"; exit 1
```
Non-negotiables baked in above:
- **Quoted regex + a filename** on every remote `grep` — the two independent guards against fact 4.
- **`ConnectTimeout` / `ServerAliveInterval` / `ServerAliveCountMax`** — a dropped link self-kills fast.
- **Short connection per tick, bounded local loop** — one ssh per check, a hard tick ceiling so the
waiter always EXITS (fact 3) and therefore always notifies when backgrounded.
- **Detect "done" by a log MARKER, never by `pgrep`** of the waiter's own pattern — `pgrep -f` matches
the waiter's own command line and the loop never ends. On a queue scheduler, `squeue -j <id>` going
empty is the equivalent done-signal.
Run this via `run_in_background` (fact 2: no cap, notifies on exit), or as a single foreground tick under
the 600 s cap (fact 1). On a session scheduler use it as the L2 patrol body (§3). **Never foreground-poll
the full wait.**
---
## §2 Quick health probes (one round-trip each)
Each is a single short ssh. Combine several into ONE round-trip for a patrol tick (§3). Detach-primitive
and paths come from the profile; the structure is identical everywhere.
> A blank live **TensorBoard tile / web panel** while these probes show a healthy run is **not** a dead
> run — it is `references/gotchas_universal.md` **U39**: the panel reads a fixed logdir/port your logger
> didn't write to, or the TB/watcher process died (ran foreground, not under the detach primitive), or the
> port isn't exposed. Fix per the platform profile; never restart a healthy run over an empty panel.
**Is the job alive? (tmux OR squeue OR pgrep — pick the profile's primitive)**
```bash
ssh "$HOST" "tmux ls 2>/dev/null || true; squeue -u \$USER 2>/dev/null || true; pgrep -af 'train' | grep -v grep | head -3"
```
**Progress since last check** — grep the run's OWN log, not a shared master:
```bash
ssh "$HOST" "grep -nE 'Epoch [0-9]+|Training completed|Early stopping|FINISHED|QUEUE DONE' '$RUN_LOG' | tail -6"
```
> **Gotcha — crash-detect on the per-run log, never the shared master.** Symptom: a poll reports "run D
> crashed" while D trains fine. → Root cause: a `tee`'d master log concatenates every run, so grepping it
> for `Traceback|OutOfMemory` matches an EARLIER run's crash text and false-positives on a healthy later
> run. → Fix: scope crash detection to the per-run log (`<name>.log`); reserve the master-log grep for
> `DONE`/`FINISHED`/`QUEUE DONE` and progress markers. A waiter that crash-checks the wrong log spins to
> its full timeout on a phantom failure.
**Resource pressure** (cgroup mem, GPU) — thresholds are rough, profile-tunable:
```bash
ssh "$HOST" "nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader; \
[ -f /sys/fs/cgroup/memory.current ] && numfmt --to=iec \$(cat /sys/fs/cgroup/memory.current)/\$(cat /sys/fs/cgroup/memory.max) 2>/dev/null"
```
- cgroup mem > 90% of max → OOM risk; GPU util > 60% → healthy, not data-bottlenecked.
- GPU at 0% but the step log advances ≠ idle — it is CPU-data-bound; sample util over several seconds,
never one snapshot. (Diagnosis → `verifying-dl-experiments`, REQUIRED.)
**Disk — the silent killer** — watch `df -i` (inodes) AND `df -h` (bytes); inodes die first on
many-small-files eval output (→ `references/gotchas_universal.md`):
```bash
ssh "$HOST" "df -h '$DATA_MOUNT'; df -i '$DATA_MOUNT'"
```
---
## §3 Durable monitoring architecture — the four layers (earned by three live failures)
Session-bound watchers die with the session; the instance itself can die under the watcher; and a
monitor that only speaks on terminal events reads as "nobody is watching." One layer cannot fix all
three. Run four — **correctness in L1, liveness in L2, latency in L3, continuity in L4**:
| Layer | Lives where | Job | Survives |
|---|---|---|---|
| **L1 self-completion chain** | ON the box (tmux / nohup / sbatch dependency) | the work sequences itself: `until grep -q 'Training completed' log; do sleep 150; done && <next stage>`; stages hand off via `touch /path/STAGE_DONE` markers | session death, network loss |
| **L2 patrol loop** | session scheduler (cron `/loop`) | every ~30 min fire a SELF-CONTAINED patrol: one combined ssh probe + a decision table + "report EVEN IF nothing changed" | idle gaps (NOT session death — see L4) |
| **L3 event sentinels** | session background (`run_in_background`) | the §1 short-poll `until ssh test -f MARKER; …` for minute-level reaction between patrol ticks | nothing — acceptable; L1/L2 carry correctness |
| **L4 recovery handbook** | persistent notes/memory | exact resume commands, chain definitions, marker paths, "first command on reconnect" — a BRAND-NEW session takes over from one word | everything |
### L1 — on-box self-completion chain (correctness)
The box finishes its own pipeline regardless of any watcher. Chain stages under one detach primitive and
**join them with `&&`, never `;`** so a marker only lands on success:
```bash
# tmux / nohup variant — the detach primitive is the swappable plug (sbatch dependency on Slurm)
nohup bash -c '
set -u
until grep -q "Training completed" /path/to/train.log; do sleep 150; done \
&& python -m eval ... \
&& touch /path/to/STAGE_DONE # marker ONLY on a clean &&-chain
' </dev/null >/path/to/chain.log 2>&1 &
```
> **Gotcha — success-gate the chain markers.** Symptom: the downstream chain fires on a phantom
> completion. → Root cause: joining stages with `;` (or a bare `touch` after a crashing stage) stamps the
> marker even when a stage died — a live disk-full `torch.save` killed stage 3, the `;`-marker still
> landed, the next stage ran on nothing. → Fix: `&&` between every stage and the final `touch`; detect
> done by the marker, never by `pgrep` of the waiter's own pattern (fact 4 / §1).
### L2 — patrol loop (liveness): the design checklist (what made it actually work)
- **ONE combined ssh probe per tick** — alive-check (tmux ls / squeue / pgrep) + `*_DONE` markers + last
epoch line + artifact `ls` + dataset file COUNTS, in a single round-trip.
- **An explicit decision table**, e.g.: ssh down → tell the user to check the console (only they see
balance/power state); detach session missing AND no completion marker → resume from `latest` + rebuild
the L1 chain; result CSV exists → `cat` it and report the numbers verbatim; remote file count below the
local source → resume the transfer; everything done → delete the patrol job itself.
- **Report a one-line status EVEN WHEN nothing changed** — silence between events is exactly what users
read as a dead monitor. ("你有定时看吗??" twice in one campaign is the failure signature of L3-only.)
- **Completeness = file COUNT against the local source** (bytes/hash when names collide), NEVER `test -d`
— a dir created by a killed transfer passes existence checks forever.
- **Never blind-restart** — probe session/log/markers first so a patrol firing mid-run cannot
double-launch (idempotence). Classify each outcome → a fixed remediation; never blind-retry.
> **Ready-made tick:** `scripts/health_patrol.sh.template` is this checklist as one runnable,
> read-only ssh round-trip — alive + done-count + last epoch + crash-scan + `df -h`/`df -i`, an
> escalation predicate, and a one-line report even when nothing changed — parameterized from the
> profile's §8. Fire it from the host's recurring runner (§7: `/loop`, cron `3,33 * * * *`, …).
### L3 — event sentinels (latency)
The §1 short-poll loop for minute-level reaction between patrol ticks. Survives nothing — it is the
disposable fast-reaction layer; L1/L2 carry correctness. Re-arm exactly ONE after any session resume.
> **`run_in_background` is NOT a substitute for `/loop` on an unattended wait.** A one-shot
> `run_in_background` sentinel notifies on EXIT — fine while you keep working in an ACTIVE session, but if
> the session goes idle for hours its exit-notification lands on a closed/reset session and you hear
> nothing (the silent-monitor-for-hours failure). Any UNATTENDED wait over ~1 h → bind the **L2 `/loop`
> patrol** (a recurring agent re-wake), never a lone one-shot sentinel.
### L4 — recovery handbook (continuity)
Persistent notes a brand-new session inherits from one word ("继续"): exact resume commands, the L1 chain
definition, every marker path, the "first command on reconnect." Two durable hardenings:
- **Externalize transfer/monitor state to a stable OS path** + a DONE marker file *outside* the session
dir, so any future session resumes by reading files instead of re-uploading.
- **True restart-immunity means an OS-owned process** (Task Scheduler / cron) — but creating one is
unauthorized persistence to the permission classifier: **get the user's explicit one-line approval
first**, or hand them the launch command to run themselves.
> **Gotcha — after a context compaction, reconcile UI task chips against the OS process table.**
> Symptom: 5 chips show "Running" for 26 h while zero ssh/scp processes exist and a "running" upload
> actually died at 2/10 checkpoints, silently gating the downstream eval all evening. → Root cause:
> background shells die with the old session, but their chips keep showing "Running"; the new session's
> task list is empty, so the only ground truth is a process scan. → Fix: **first action after any
> compaction is a process-scan** (e.g. `Get-CimInstance Win32_Process` matched on the remote host string,
> or `pgrep`/`ps` for ssh/scp), relaunch dead transfers with a byte-size verify, re-arm ONE fresh
> sentinel, and tell the user to clear the husks.
---
## §4 Stale-waiter hygiene — one waiter per live run, right lifetime
> **Gotcha — stale background waiters pile up.** Symptom: the Background-tasks panel shows 8+ "Running"
> wait-loops at 500740 min elapsed, ssh-polling every ~20 s, while the GPU is idle and the experiment
> finished hours ago. → Root cause: every kill+restart of a flaky-network saga armed a NEW
> `until ssh grep MARKER; do sleep 20; done` waiter but never stopped the OLD one — its marker (in a
> superseded log) never appears, so it loops forever (fact 3). → Fix below.
- **One waiter per live run.** Superseding a run → STOP its old waiter *first* (TaskStop, or dismiss a
cross-session chip from the UI — resumed-session IDs aren't stoppable programmatically).
- **Match watcher lifetime to the wait.** Multi-hour wait → a persistent Monitor (no 10-min cap) plus a
stall-detector so a hung run still notifies. A persistent monitor still dies on session resume → after
any resume, **check remote ground truth directly** (tmux ls / squeue, `grep DONE log`, `nvidia-smi`);
do not trust a watcher that may be gone (fact 3 + §0 corollary).
- **A dropped poll connection ≠ the job dying.** A long background ssh poll gets killed by the remote's
idle-SSH timeout while the detached training runs on independently. Re-ssh and verify the process/
artifacts directly before concluding anything died.
---
## §5 Two-leg self-completion — guaranteed results + best-effort cadence
"I'll check periodically" is a lie unless a trigger is ARMED — between turns the assistant does not run.
Two legs, never conflated:
- **Leg 1 — remote self-completion (guaranteed, survives session/SSH death):** the L1 chain
(`train → eval → touch marker` under one detach primitive). Detect done by a log/marker, never by
`pgrep` of the waiter's own pattern. This guarantees RESULTS but gives no reporting cadence.
- **Leg 2 — live progress (best-effort):** a session-bound patrol loop (L2, e.g. `/loop 30m` or cron `3,33 * * * *`)
polling with the LOCAL ssh key. Be honest it dies when the session closes — the remote still finishes;
the user re-pings to pull.
> **A cloud scheduler cannot reach a rented box.** A cloud schedule (`/schedule` / RemoteTrigger) runs in
> an isolated sandbox with its own checkout and **no access to the local SSH key or network** → it cannot
> ssh the box, and the SSH private key must **never** be placed in a cloud agent (secret-leak). The honest recurring check is the remote self-monitor + a session loop, not a cloud robot pinging
> the box. Don't promise autonomous cross-session polling that can't be delivered.
For a hosted tracker whose metrics survive teardown and can be polled as a structured monitor instead of
brittle ssh-tail, use `huggingface-skills:huggingface-trackio` (REQUIRED for that path) — poll its alerts
rather than grepping a remote log.
---
## §6 Failure triage on the log
When a probe shows trouble, pull the full traceback from the per-run log (§2) and classify — each
outcome maps to a FIXED remediation; never blind-retry:
```bash
ssh "$HOST" "grep -B2 -A20 'Traceback' '$RUN_LOG' | head -50"
```
- `basic_ios::clear: iostream error` + `unexpected pos N vs M`**disk full during checkpoint save**;
check `df -h`/`df -i`, prune `latest`/periodic snapshots to recover (→ `references/gotchas_universal.md`).
- bare `Killed` / exit 137, no traceback → **cgroup OOM** (workers × big in-RAM tensor); size workers
vs `memory.max`, not CPU count.
- `CUDA out of memory` → VRAM, usually consistent across runs (batch too big / concurrent job), rarely
transient.
- `KeyError` / `AttributeError` → config/code mismatch; investigate code, do not retry.
- Early-stop far below baseline with a grad_norm P99 spike in epoch 12 → likely **probabilistic
divergence**; whether it's a bug or a real effect, and the retry-the-identical-config rule, belong to
`verifying-dl-experiments` (REQUIRED) — this skill owns *running* the retry, not judging the number.
- log frozen (no new lines) but checkpoint `mtime` advances → **block-buffered stdout**, not a hang
(`references/gotchas_universal.md` U43; run `python -u`/`PYTHONUNBUFFERED=1`).
- `uptime`/`free` on the box look maxed but your cgroup is roomy → **noisy neighbor** on the shared host,
not your job (`references/gotchas_universal.md` U41; the authoritative OOM check is the `oom_kill` counter
in `/sys/fs/cgroup/memory.events`).
- GPU SM% pinned low while a python thread-storm pegs the cores → **intra-op thread oversubscription** on a
vCPU slice (`references/gotchas_universal.md` U40; cap `OMP_NUM_THREADS` to the cgroup quota).
Universal gotchas (silent sync, CRLF, mid-run script overwrite, inode caps) are NOT restated here —
see `references/gotchas_universal.md` (`grep -in '<keyword>' references/gotchas_universal.md` to jump).
---
## §7 Monitoring across agent hosts (portability mapping)
The four layers are host-agnostic; only **which primitive runs L2/L3** changes per host. Two rules port
the whole architecture to Codex / Cursor / Trae / any Agent-Skills host:
**Rule 1 — the durable layer needs no agent.** L1 (the box self-completes + `touch`es a marker) plus the
box **pushing its own notification** at the end of the `&&`-chain — a `curl` webhook / email / a
`huggingface-skills:huggingface-trackio` alert — works on EVERY host, because it runs entirely on the
rented box. On a host with no background/scheduler primitive, this IS the monitor; the agent just pulls
results on its next turn.
**Rule 2 — a CLOUD scheduler cannot reach a rented box (§5), on ANY host.** Every host's hosted
automation runs in an isolated sandbox with no local SSH key or network, so it cannot ssh your box (and
the key must never be placed in one — secret-leak). Use cloud cron only to **re-wake the agent** or
**poll a hosted tracker**, never to probe the box. The box-reaching poll must use the host's
**local/session** runner (which holds your SSH key), or be the L1 on-box loop.
| Agent host | Local runner — reaches the box (L3) | Recurring / loop (L2) | Cloud cron/automation — re-wake / tracker only | Foreground/turn limit |
|---|---|---|---|---|
| **Claude Code** | `run_in_background` (detach + notify-on-exit); the `Monitor` tool | `/loop` + `ScheduleWakeup` (interval or self-paced) | `/schedule` (cron cloud agent) | ~600 s foreground |
| **OpenAI Codex** | Codex Cloud background tasks (async, parallel) | a thread that schedules its own wake-up | **Automations** — cron syntax, results → review queue | per cloud task |
| **Cursor** | Background Agents (async) | — | **Automations** — cron (hourly/daily/weekly) + event triggers | per agent |
| **Trae** (ByteDance) | Agent / `trae-agent` CLI unattended runs; CI/CD | via a CI/CD pipeline | **no native cron found** → external cron / CI-CD, or rely on Rule 1 | per run |
| **Generic / none** | any local background-equivalent (else none) | a shell `while`-loop under the turn limit | none | host turn limit |
> **Hosts not in the table** (Gemini CLI, VS Code / Copilot, Goose, Kiro, …) take the **Generic** row until they expose a local recurring runner that holds your SSH key — until then, wire **Rule 1** (the on-box self-push) and let the agent pull on its next turn.
**Binding the layers:** L1 is unchanged everywhere (on-box). Bind **L2** to the host's local recurring
runner *if* it reaches the box, else to the box's own `cron`/`at` + a push (Rule 1). Bind **L3** to the
host's local background runner, re-armed once per resume. When a host offers only cloud automation (or
nothing), **do not promise agent-side polling of the box** — wire Rule 1 and let the agent pull on its
next turn (§0 corollary: trust the artifact, not the silence).
Host capabilities verified 2026-06: Codex Automations (cron) + Cloud background tasks —
`developers.openai.com/codex/app/automations` + `/codex/cloud`; Cursor Automations (cron + event
triggers) + Background Agents — `cursor.com/docs/cloud-agent/automations`; Trae Agent / `trae-agent` CLI
+ CI/CD, no native cron surfaced — `docs.trae.ai/ide` + `github.com/bytedance/trae-agent`.