
# SSH Transport — keys, keepalive, resumable copy, secrets-via-stdin

Platform-agnostic SSH + file-transfer substrate for every `ssh-rental` profile (AutoDL, RunPod,
vast.ai, Lambda, Paperspace, China, bare SSH). One-time config so subsequent commands are short and
password-less, plus the copy/secret patterns that survive flaky networks and short rentals. Concrete
hosts, ports, and credential locations are **profile facts** — this file owns the *mechanism*, the
profile (`profiles/<platform>.md` §1/§3/§8) owns the *values*.

To jump: `grep -in '<keyword>' references/ssh_transport.md` (e.g. `keepalive`, `rsync`, `stdin`, `crlf`).

## Table of contents

1. Key generation
2. Push the public key to an instance
3. `~/.ssh/config` alias + keepalive tuning
4. Verify the alias
5. Resumable copy — rsync vs scp, and WHY rsync
6. Bulk per-dir download loop
7. Move secrets via stdin — never inline a key, never on a durable FS
8. CRLF — `.sh` authored on Windows breaks on Linux
9. Two SSH flavors — proxied/basic SSH cannot `scp`
10. Transport gotchas (Symptom → Root cause → Fix)

---

## 1. Key generation

Skip if `~/.ssh/id_ed25519` already exists.

```bash
ssh-keygen -t ed25519 -C "<label>"
# Save path: Enter for the default ~/.ssh/id_ed25519
# Passphrase: optional (Enter for none, or set one + use ssh-agent)
```

`ed25519` keys are far shorter than RSA keys at comparable strength (roughly RSA-3072), and every
rental platform accepts both. One local key is
reused across all instances — generate once, push the **public** half (§2) to each box. The private
half (`~/.ssh/id_ed25519`, no `.pub`) never leaves the local machine and **never** goes onto a rental,
a shared FS, or a cloud agent (a cloud scheduler runs in an isolated sandbox with no access to it — and
putting a private key there is a secret leak; see `references/monitoring_patterns.md`).
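
The "skip if it already exists" rule can be wrapped in a small guard. This is a minimal sketch, not part of the skill's scripts: `ensure_key`, the default label, and the `KEYGEN` override are hypothetical names chosen for illustration, and `-N ''` assumes a passphrase-less key is acceptable.

```shell
# Hypothetical helper: generate the key only when it is missing.
# KEYGEN is an illustrative override so the function can be dry-run.
ensure_key() {   # ensure_key [key-path] [label]
    key_path=${1:-"$HOME/.ssh/id_ed25519"}
    label=${2:-"rental-gpu"}
    if [ -f "$key_path" ]; then
        echo "exists: $key_path"
        return 0
    fi
    mkdir -p "$(dirname "$key_path")" && chmod 700 "$(dirname "$key_path")"
    # -N '' sets an empty passphrase; drop -N to be prompted for one instead.
    "${KEYGEN:-ssh-keygen}" -t ed25519 -C "$label" -f "$key_path" -N ''
}
```

Calling `ensure_key` with no arguments targets the default `~/.ssh/id_ed25519` and is safe to repeat.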

## 2. Push the public key to an instance

Copy the connection string from the platform's web console / API; it has the shape
`ssh -p <PORT> root@connect.<region>.<provider>.com`. Push the public key once:

```bash
ssh-copy-id -p <PORT> root@connect.<region>.<provider>.com
# enter the platform-provided password ONCE
```

If `ssh-copy-id` is absent (common on Windows-native shells), append the key manually:

```bash
cat ~/.ssh/id_ed25519.pub          # copy the entire line
ssh -p <PORT> root@connect.<region>.<provider>.com
# on the remote:
mkdir -p ~/.ssh && chmod 700 ~/.ssh
echo "<paste the public key line>" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
exit
```

Test: re-running the `ssh …` line should connect **without** a password prompt.
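
When `ssh-copy-id` is absent, the manual paste can also be collapsed into one call that sends the public key over stdin. The sketch below is an illustration under assumptions: `push_pubkey` is a hypothetical name, and the remote side is assumed to have a POSIX shell with `grep`. The password prompt still appears once, because ssh reads it from the terminal rather than from stdin.

```shell
# Hypothetical one-shot alternative to the manual paste: the public key
# travels over stdin and is appended only if it is not already present.
push_pubkey() {   # push_pubkey <port> <user@host> [pubkey-file]
    port=$1; target=$2; pub=${3:-"$HOME/.ssh/id_ed25519.pub"}
    [ -r "$pub" ] || { echo "missing public key: $pub" >&2; return 1; }
    ssh -p "$port" "$target" '
        umask 077
        mkdir -p ~/.ssh && chmod 700 ~/.ssh
        touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
        key=$(cat)
        grep -qxF "$key" ~/.ssh/authorized_keys || printf "%s\n" "$key" >> ~/.ssh/authorized_keys
    ' < "$pub"
}
```

Because the append is guarded by `grep -qxF`, running it twice against the same box does not duplicate the key.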

## 3. `~/.ssh/config` alias + keepalive tuning

One block per instance turns `ssh -p <PORT> root@connect.<region>.<provider>.com` into `ssh <alias>`,
and folds in the keepalive options that keep long monitoring/transfer connections from dropping.

```ssh-config
Host proj-1
    HostName connect.<region>.<provider>.com
    Port <PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60
    ServerAliveCountMax 120
    TCPKeepAlive yes
    # LogLevel VERBOSE   # uncomment to debug a refused/hung connection

Host proj-2
    HostName connect.<region>.<provider>.com
    Port <PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60
    ServerAliveCountMax 120
    TCPKeepAlive yes
```

**Naming**: `<project>-<index>` (e.g. `proj-1`, `proj-2`) reads cleanly in a fan-out loop; avoid bare
`gpu1`. **Why the three keepalive options**:

- `ServerAliveInterval 60` — send an application-layer heartbeat every 60 s, so a NAT/idle timeout on
  the path does not silently drop a parked connection (mid-`scp`, or an open monitor).
- `ServerAliveCountMax 120` — tolerate up to 120 missed heartbeats before declaring the link dead (≈2 h
  of network instability survived). Lower it (e.g. 3) for a *bounded* monitor that should self-kill on a
  blip rather than hang — see the short-connection poll in `references/monitoring_patterns.md`.
- `TCPKeepAlive yes` — let the OS also emit TCP-layer keepalives, catching a peer that vanishes
  ungracefully.
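
The tolerance figure is simply the product of the two `ServerAlive*` values. The small sketch below makes that arithmetic explicit; `keepalive_budget` is a hypothetical helper name. For a bounded monitor, the config value can be overridden per command with `-o`, since command-line options take precedence over `~/.ssh/config`.

```shell
# Seconds of silence tolerated = ServerAliveInterval * ServerAliveCountMax.
keepalive_budget() {   # keepalive_budget <interval-seconds> <count-max>
    echo $(( $1 * $2 ))
}

keepalive_budget 60 120   # prints 7200 (the ~2 h quoted above)
keepalive_budget 60 3     # prints 180 (a bounded monitor gives up after 3 min)

# Per-command override for a short poll, leaving the config untouched:
#   ssh -o ServerAliveCountMax=3 <alias> 'tail -n 5 train.log'
```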

Ports change when a profile re-issues an instance (`ssh-rental` boxes assign a new port on
re-creation) — update the `Port` line after each create/recreate, then re-run §4.
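
Editing the `Port` line by hand after every recreate is easy to get wrong with several aliases. A minimal sketch of an in-place update follows. `set_alias_port` is a hypothetical helper, and it assumes each `Host` line carries exactly one alias, as in the blocks above.

```shell
# Hypothetical helper: rewrite the Port of one Host block, leave the rest alone.
set_alias_port() {   # set_alias_port <alias> <new-port> [config-file]
    alias_name=$1; new_port=$2; cfg=${3:-"$HOME/.ssh/config"}
    tmp_cfg=$(mktemp) || return 1
    awk -v a="$alias_name" -v p="$new_port" '
        $1 == "Host" { in_block = ($2 == a) }   # entering a new Host block
        in_block && $1 == "Port" {              # Port line of the target block
            sub(/Port[ \t]+[^ \t]+/, "Port " p)
        }
        { print }
    ' "$cfg" > "$tmp_cfg" && cat "$tmp_cfg" > "$cfg"   # cat keeps the file mode and owner
    status=$?
    rm -f "$tmp_cfg"
    return "$status"
}
```

Usage would look like `set_alias_port proj-2 40123`, followed by the §4 verification.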

## 4. Verify the alias

```bash
for a in proj-1 proj-2; do   # list every alias defined in §3
    echo "=== $a ==="
    ssh -o ConnectTimeout=10 "$a" "hostname; date"
done
```

Each should print a distinct hostname. Then the env probe (SKILL.md Phase 1):
`ssh <alias> 'python -c "import torch;print(torch.cuda.is_available())"'`.
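
The "distinct hostname" check can be automated so a fan-out script stops early on a dead or duplicated alias. This is a sketch under assumptions: `verify_aliases` is a hypothetical name, and `BatchMode=yes` is used so that a box still asking for a password fails instead of prompting.

```shell
# Hypothetical check: every alias must answer, and hostnames must be distinct.
verify_aliases() {   # verify_aliases <alias>...
    seen=""; failed=0
    for a in "$@"; do
        if h=$(ssh -o BatchMode=yes -o ConnectTimeout=10 "$a" hostname 2>/dev/null); then
            case " $seen " in
                *" $h "*) echo "DUPLICATE $a -> $h"; failed=1 ;;
                *)        echo "OK        $a -> $h" ;;
            esac
            seen="$seen $h"
        else
            echo "FAILED    $a"; failed=1
        fi
    done
    return "$failed"
}
```

A non-zero return means at least one alias failed or two aliases point at the same box, which usually indicates a stale `Port` line.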

## 5. Resumable copy — rsync vs scp, and WHY rsync

`scp` opens **one** SSH stream for the whole transfer and **cannot resume**: any blip mid-copy aborts
the entire run and a re-run starts from zero. `rsync` compares source/dest and ships only the delta, so
a re-run after a drop **continues** instead of restarting — the single most important property on a
metered box where a 130 GB pull can blip at minute 45.

**Prefer `rsync` for anything large or multi-file:**

```bash
rsync -avz --partial --inplace --progress \
    -e ssh \
    <alias>:/root/autodl-tmp/checkpoints/ /path/to/local/checkpoints/
```

- `-a` archive (recurse + preserve perms/times/symlinks), `-v` verbose, `-z` compress on the wire.
- `--partial` keeps a partially-transferred file on interruption so the next run resumes mid-file
  (without it, rsync deletes the partial and re-sends from the start).
- `--inplace` writes directly into the destination file (resume-friendly; avoids a full temp copy on a
  tight local disk). Drop it if atomic-replace of an existing dest matters more than resumability.
- Re-run the **identical** command after any failure — that *is* the resume (principle #7).

Use plain `scp` only for a **single small** file (a config, one checkpoint < ~1 GB) where resume is
moot. For a large *tree*, even `scp` users should fall back to the **per-dir loop** (§6) so one dir's
failure doesn't lose the rest. `rsync` must be installed on **both** ends; if it is missing on the remote
image, run `apt-get install -y rsync` (when online) or use the §6 loop.
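
"Re-run the identical command" can be automated with a bounded retry loop. The following is a minimal sketch, not one of the skill's scripts: `rsync_resume`, the default of 20 tries, and the 15 s pause are hypothetical choices.

```shell
# Hypothetical wrapper: re-run the identical rsync until it succeeds or
# max_tries is reached. Each retry resumes thanks to --partial/--inplace.
rsync_resume() {   # rsync_resume <src> <dest> [max-tries] [pause-seconds]
    src=$1; dest=$2; max_tries=${3:-20}; pause=${4:-15}
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if rsync -avz --partial --inplace -e ssh "$src" "$dest"; then
            echo "done after $try attempt(s)"
            return 0
        fi
        echo "attempt $try failed; retrying in ${pause}s" >&2
        try=$((try + 1))
        sleep "$pause"
    done
    return 1
}
```

The bound matters on a metered box: an unbounded loop against an instance that has been released would retry forever.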

> The bulk-download stall-retry ladder (HF/ModelScope mirror swaps, `timeout … && break` loops) is a
> *download-from-the-internet* concern, not host↔host copy — that lives in `references/china-network.md`.

## 6. Bulk per-dir download loop

For a large directory tree (many run/checkpoint dirs), wrap each dir in its **own** SSH session so a
single drop loses only that dir, and a re-run **skips already-complete dirs**:

→ `scripts/download_loop.sh` (parameterize `LOCAL_TARGET`, `REMOTE_ALIAS`, `REMOTE_PATH`).

Its shape, and why each piece matters:

- **List once, copy per-dir** — each `scp -r <alias>:<remote>/$d ./` is an independent session; one
  failure ≠ whole-transfer loss (the `scp` single-stream trap, §5).
- **Size-threshold skip** — a dir already ≥ threshold counts as complete and is skipped; a partial dir
  is removed and re-pulled. Re-running the whole script is therefore idempotent and resumable.
- **Per-dir `ConnectTimeout` + the §3 keepalive flags** on every `scp` so a hung session self-kills
  instead of blocking the loop.
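
To make the three properties concrete, here is a compact sketch of such a loop. It is **not** the contents of `scripts/download_loop.sh`; the function names and the KiB threshold argument are hypothetical, and it assumes the remote path contains only directories whose names have no whitespace.

```shell
# Sketch only; NOT the contents of scripts/download_loop.sh.
dir_complete() {   # dir_complete <dir> <min-kb>
    [ -d "$1" ] || return 1
    kb=$(du -sk "$1" | awk '{print $1}')
    [ "$kb" -ge "$2" ]
}

download_dirs() {  # download_dirs <alias> <remote-path> <local-target> <min-kb>
    alias_name=$1; remote=$2; target=$3; min_kb=$4
    mkdir -p "$target" || return 1
    # List once: one directory name per line.
    dirs=$(ssh -o ConnectTimeout=10 "$alias_name" "ls -1 '$remote'") || return 1
    rc=0
    for d in $dirs; do
        if dir_complete "$target/$d" "$min_kb"; then
            echo "skip $d"
            continue
        fi
        rm -rf "${target:?}/$d"   # drop a partial copy before re-pulling
        # One independent session per directory.
        scp -r -o ConnectTimeout=10 -o ServerAliveInterval=60 \
            "$alias_name:$remote/$d" "$target/" || { echo "FAILED $d" >&2; rc=1; }
    done
    return "$rc"
}
```

A size threshold is a coarse completeness test; where exactness matters, `rsync` (§5) compares the trees properly.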

## 7. Move secrets via stdin — never inline a key, never on a durable FS

Putting a credential **in a command** (`ssh host "echo 'KEY' > …"`, or `scp key.txt host:…`) leaks the
value into shell history, agent transcripts, and hook logs. Putting it on a **shared /
durable FS** is worse: the value persists for every co-tenant, and some platforms' upload classifiers
*block or corrupt* a file matching a known key pattern — so a credential written to the cross-instance
FS may silently never arrive. **Push credentials to each box's per-instance system disk, via stdin**, so
the value flows file → pipe → file and appears in no command text or output:

```bash
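# stream exactly one credential block; the value never appears on a command line
# assumes a 3-line stanza (machine / login / password); -F matches the dots literally
# note: this overwrites any existing /root/.netrc on the box
grep -F -A 2 "machine api.<provider>.com" ~/.netrc \
  | ssh <alias> 'umask 077; cat > /root/.netrc && chmod 600 /root/.netrc'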
```

```bash
# or a single token, same principle (stdin in, file out, chmod 600)
printf '%s\n' "$TOKEN_FROM_ENV" \
  | ssh <alias> 'umask 077; cat > /root/.<service>_key && chmod 600 /root/.<service>_key'
```

Rules that make this safe:

- **One block, not the whole file.** Stream a single `machine …` stanza, never the entire `~/.netrc` —
  it carries unrelated machines' credentials, and security hooks (rightly) block copying the whole file.
- **Reference, never echo.** Source the token from an env var (`$TOKEN_FROM_ENV`) or a keyring; never
  paste the literal value into the command.
- **Per-instance system disk, not the shared FS.** Write to `/root/.<service>_key` (volatile but
  private), not the cross-instance durable mount. The wrapper reads it and exports the env var before
  launch (e.g. `export WANDB_API_KEY=$(cat /root/.wandb_key)`).
- **Verify by capability, not by echoing the value:**
  `ssh <alias> 'python -c "import wandb; print(wandb.Api(timeout=20).default_entity)"'`.
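
The "one block, not the whole file" rule depends on cutting the stanza cleanly. A line-count cut (`grep -A 2`) works only for three-line stanzas, so the sketch below selects by stanza boundary instead. `netrc_block` and `push_netrc_block` are hypothetical names, and the sketch assumes each stanza starts on its own line with the `machine` keyword.

```shell
# Hypothetical helper: print exactly one netrc stanza, whatever its line count.
netrc_block() {   # netrc_block <machine> [netrc-file]
    awk -v m="$1" '
        $1 == "machine" || $1 == "default" {    # a new stanza starts here
            keep = ($1 == "machine" && $2 == m)
        }
        keep && NF                              # print non-empty lines of the match
    ' "${2:-$HOME/.netrc}"
}

# Hypothetical helper: stream that stanza to a private file on the box.
push_netrc_block() {   # push_netrc_block <alias> <machine>
    block=$(netrc_block "$2") || return 1
    [ -n "$block" ] || { echo "no stanza for $2" >&2; return 1; }
    # printf is a shell builtin, so the value never shows up in a process list.
    printf '%s\n' "$block" \
        | ssh "$1" 'umask 077; cat > /root/.netrc && chmod 600 /root/.netrc'
}
```

Only the machine name appears in the command text; the login and password flow through the pipe.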

## 8. CRLF — `.sh` authored on Windows breaks on Linux

Symptom → Root cause → Fix:

- **Symptom**: a synced launcher does nothing (empty log); run by hand it errors `set: -: invalid
  option`, `cd: /path\r: No such file or directory`, or `syntax error near unexpected token $'do\r'` —
  every line "ends in `\r`".
- **Root cause**: Windows `core.autocrlf=true` (or `git archive` exporting with the working-tree EOL)
  writes `.sh` with CRLF; Linux `bash` treats the trailing `\r` as part of each token. (`.py` is
  unaffected — Python's universal newlines tolerate CRLF; specifically `bash`/`.sh` breaks.)
- **Fix**: add `.gitattributes` with `*.sh text eol=lf` so `git archive`/checkout always emits LF; as an
  immediate on-box unblock, `sed -i 's/\r$//' scripts/*.sh`.

Every shell script in `scripts/` ships LF and starts `#!/usr/bin/env bash` + `set -u`; keep that
contract when authoring new ones. **Never** put an unquoted `|` inside a `grep` regex in a transport or
poll script — the shell splits it into piped commands and the first reads stdin → hangs forever
(`references/monitoring_patterns.md`).
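
A pre-flight check before syncing catches CRLF without waiting for an empty log on the box. This is a minimal sketch with hypothetical helper names; unlike the `sed` one-liner above, `tr -d '\r'` removes **every** carriage return in the file, not only the ones at line ends, which is normally equivalent for shell scripts.

```shell
# Hypothetical pre-flight: list scripts that still contain CRLF, then strip them.
crlf_files() {   # crlf_files <file>...
    cr=$(printf '\r')
    for f in "$@"; do
        if grep -q "$cr" "$f"; then echo "$f"; fi
    done
    return 0
}

strip_crlf() {   # strip_crlf <file>...   (rewrites via a temp file)
    for f in "$@"; do
        tmp_f=$(mktemp) || return 1
        tr -d '\r' < "$f" > "$tmp_f" && cat "$tmp_f" > "$f"
        rm -f "$tmp_f"
    done
}
```

Typical use would be `crlf_files scripts/*.sh` locally before a sync, with `strip_crlf` reserved for the on-box unblock.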

## 9. Two SSH flavors — proxied/basic SSH cannot `scp`

Some `ssh-rental` platforms expose **two** SSH endpoints, and the difference dictates whether file
transfer works at all:

- **Direct TCP SSH** — a real TCP port to the container (the `connect.<region>.<provider>.com:<PORT>`
  shape above). Full `scp`/`rsync`/`sftp` work. This is what every transfer in this file assumes.
- **Proxied / "basic" SSH** — a relayed or web-terminal SSH (common on RunPod and vast.ai for the
  default exposed endpoint). It carries an **interactive shell only**: `scp`/`rsync`/`sftp` fail (often
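  with `subsystem request failed` / a hung handshake) because the proxy forwards neither the SFTP
  subsystem (used by `sftp` and by recent OpenSSH `scp`) nor remote command execution (used by `rsync`
  and legacy `scp`).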

**Fix**: for any code/data/checkpoint transfer, use the **direct-TCP** endpoint — on RunPod expose a
TCP port (the `ssh root@<ip> -p <PORT>` form, not the proxied `ssh <pod>@ssh.runpod.io` one); on vast.ai
use the instance's direct SSH port. Each profile's §3 NETWORK names which endpoint is which and whether
ports change on restart. If only proxied SSH is available, transfer out-of-band instead (push results to
object storage / HF Hub from on-box and pull from there).
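
Before starting a large transfer it can help to probe which flavor an alias points at. The sketch below rests on an assumption that should be checked against the platform's profile: that a shell-only proxy rejects a non-interactive command, so `ssh <alias> true` fails there. `probe_transfer` and its three output labels are hypothetical.

```shell
# Hypothetical probe: can this endpoint run a non-interactive command, and
# does the box have rsync? Assumes a shell-only proxy fails the first check.
probe_transfer() {   # probe_transfer <alias>
    if ! ssh -T -o BatchMode=yes -o ConnectTimeout=10 "$1" true 2>/dev/null; then
        echo "no-exec"     # shell-only endpoint, auth failure, or unreachable
        return 1
    fi
    if ssh -T -o BatchMode=yes -o ConnectTimeout=10 "$1" 'command -v rsync' >/dev/null 2>&1; then
        echo "rsync"       # command execution works and rsync is installed
    else
        echo "scp-only"    # command execution works, rsync missing on the box
    fi
}
```

A `no-exec` result is ambiguous by design; `ssh -v` against the same alias distinguishes a proxy from a plain authentication failure.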

## 10. Transport gotchas (Symptom → Root cause → Fix)

Universal gotchas (disk-full, inode, OOM, silent sync) are **not** repeated here — see
`references/gotchas_universal.md`. These are transport-specific.

**T1 — SSH exits 255 / "Connection reset" right after a `pkill`/`kill`.**
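Symptom: `ssh <alias> 'pkill -9 -f src.train'` returns `Connection reset by peer`, exit 255. → Root
cause: killing the process tree disrupts the PTY chain; the SSH client receives EOF and exits, and
anything *after* the kill in that same one-liner never runs. In a compound one-liner, `pkill -f` can
also match the remote shell itself, because that shell's command line contains the pattern. → Fix:
this is **normal**, not a failure. Re-ssh to verify (`ssh <alias> "pgrep -af '[s]rc.train' || echo CLEAN"`;
the `[s]` bracket keeps `pgrep -f` from matching the shell that runs the check, and without a trailing
`| head -1` the `|| echo CLEAN` branch actually sees pgrep's exit status). Split kill and relaunch into
**two** ssh calls; never `pkill X; relaunch X` in one command, because the relaunch is dropped with the
session.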
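
The two-call discipline can be captured in a helper. This is a sketch with hypothetical names (`bracket_pattern`, `kill_and_verify`); it assumes the pattern contains no single quote and does not start with a regex metacharacter such as `^`.

```shell
# The [x]yz bracket form stops pgrep/pkill -f from matching a remote shell
# whose own command line contains the pattern.
bracket_pattern() {   # src.train -> [s]rc.train
    rest=${1#?}
    first=${1%"$rest"}
    printf '[%s]%s\n' "$first" "$rest"
}

# Hypothetical helper: kill in one session, verify in a second one.
kill_and_verify() {   # kill_and_verify <alias> <pattern>
    pat=$(bracket_pattern "$2")
    # Call 1: the kill. A non-zero or 255 exit here is expected, so ignore it.
    ssh "$1" "pkill -9 -f '$pat'" || true
    # Call 2: a fresh session lists survivors, or prints CLEAN when none remain.
    ssh "$1" "pgrep -af '$pat' || echo CLEAN"
}
```

The relaunch then goes in a third, separate `ssh` call once the output reads `CLEAN`.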

**T2 — large `scp -r` drops with "Read from remote host … reset by peer" 30–60 min in.**
Symptom: a 130 GB `scp -r` aborts mid-transfer; the local tree has only the first few dirs, the rest
gone. → Root cause: one SSH stream for the whole transfer; any blip kills it and `scp` does not resume.
→ Fix: use `rsync --partial` (§5) or the per-dir loop (§6) — each dir an independent session, re-run
skips completed dirs.

**T3 — `.sh` "ends in `\r`" after a Windows→Linux sync.**
See §8 (`.gitattributes` `*.sh text eol=lf`; on-box `sed -i 's/\r$//'`).

**T4 — a credential leaks into history / a shared FS, or its FS upload silently fails.**
Symptom: a key pasted into an `ssh`/`scp` command lands in transcripts and hook logs; an scp of the key
to the shared FS "succeeds" but the file is missing or corrupt. → Root cause: the value appeared in a
command line; and some platforms' FS classifiers block/corrupt credential-shaped uploads. → Fix: §7 —
stream one block via stdin to the per-instance disk, verify by capability not by echo.

**T5 — `scp dest open "/root/x/": Failure` instantly.**
Symptom: a (often parallel/background) `scp big.tar <alias>:/root/x/` fails at once because the
destination dir doesn't exist — a sibling command meant to `mkdir` it ran later, or was blocked. → Root
cause: the transfer assumed a directory a *different* command was supposed to create (a parallel-setup
race). → Fix: make every transfer self-sufficient — create the dest in the same command:
`ssh <alias> 'mkdir -p /root/x' && scp … || retry`. Never assume a sibling created the destination.
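
A self-sufficient transfer can be sketched as a function that always creates its own destination and retries a bounded number of times. `put_file` and the default of three tries are hypothetical; the remote directory is assumed to contain no single quote.

```shell
# Hypothetical helper: create the destination and copy in the same call,
# with a bounded retry. It never relies on another command's mkdir.
put_file() {   # put_file <local-file> <alias> <remote-dir> [tries]
    src=$1; alias_name=$2; dest_dir=$3; tries=${4:-3}
    n=1
    while [ "$n" -le "$tries" ]; do
        if ssh -o ConnectTimeout=10 "$alias_name" "mkdir -p '$dest_dir'" \
            && scp -o ConnectTimeout=10 "$src" "$alias_name:$dest_dir/"; then
            return 0
        fi
        echo "put_file: attempt $n failed" >&2
        n=$((n + 1))
    done
    return 1
}
```

Because `mkdir -p` is idempotent, repeating it on every attempt costs one short session and removes the race entirely.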

**T6 — `Host key verification failed` after an instance is recreated.**
Symptom: same `connect.<region>.<provider>.com` host, new host key, so SSH refuses. → Root cause: the
recreated container presents a different host key on the reused hostname/port. → Fix:
`ssh-keygen -R '[connect.<region>.<provider>.com]:<PORT>'`, then reconnect (re-accepts the new key).
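
The bracketed form is needed only for non-default ports; on port 22 the entry in `known_hosts` is the bare hostname. A small sketch that builds the right entry either way follows (`known_hosts_entry` and `forget_host` are hypothetical names).

```shell
# known_hosts stores non-default ports as [host]:port and port 22 as host.
known_hosts_entry() {   # known_hosts_entry <host> [port]
    if [ "${2:-22}" = "22" ]; then
        printf '%s\n' "$1"
    else
        printf '[%s]:%s\n' "$1" "$2"
    fi
}

# Hypothetical helper: drop the stale key for one host/port pair.
forget_host() {   # forget_host <host> [port]
    ssh-keygen -R "$(known_hosts_entry "$1" "${2:-22}")"
}
```

After `forget_host`, the next connection prompts to accept the new host key as usual.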