# SSH Transport: keys, keepalive, resumable copy, secrets-via-stdin

Platform-agnostic SSH + file-transfer substrate for every `ssh-rental` profile (AutoDL, RunPod,
vast.ai, Lambda, Paperspace, China, bare SSH). One-time config so subsequent commands are short and
password-less, plus the copy/secret patterns that survive flaky networks and short rentals. Concrete
hosts, ports, and credential locations are **profile facts**: this file owns the *mechanism*, the
profile (`profiles/<platform>.md` §1/§3/§8) owns the *values*.

To jump: `grep -in '<keyword>' references/ssh_transport.md` (e.g. `keepalive`, `rsync`, `stdin`, `crlf`).

## Table of contents

1. Key generation
2. Push the public key to an instance
3. `~/.ssh/config` alias + keepalive tuning
4. Verify the alias
5. Resumable copy: rsync vs scp, and WHY rsync
6. Bulk per-dir download loop
7. Move secrets via stdin: never inline a key, never on a durable FS
8. CRLF: `.sh` authored on Windows breaks on Linux
9. Two SSH flavors: proxied/basic SSH cannot `scp`
10. Transport gotchas (Symptom → Root cause → Fix)

---

## 1. Key generation

Skip if `~/.ssh/id_ed25519` already exists.

```bash
ssh-keygen -t ed25519 -C "<label>"
# Save path: Enter for the default ~/.ssh/id_ed25519
# Passphrase: optional (Enter for none, or set one + use ssh-agent)
```

`ed25519` keys are shorter and faster to use than RSA keys, and at least as strong as the RSA-2048
keys still in common use; every rental platform accepts both. One local key is reused across all
instances: generate once, push the **public** half (§2) to each box. The private half
(`~/.ssh/id_ed25519`, no `.pub`) never leaves the local machine and **never** goes onto a rental,
a shared FS, or a cloud agent. A cloud scheduler runs in an isolated sandbox with no access to the
key, and putting a private key there is a secret leak; see `references/monitoring_patterns.md`.

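The "skip if it already exists" rule can be made mechanical. The following is a minimal sketch, not part of the shipped scripts: `ensure_key` is a hypothetical helper name, and passing `-N ""` (no passphrase) is an assumption made here so the call never prompts; drop `-N` to be asked for a passphrase instead.

```bash
# Hypothetical helper: create the key pair only when the private key is missing.
ensure_key() {
    key_path=$1   # e.g. "$HOME/.ssh/id_ed25519"
    label=$2      # free-form comment stored in the public key
    if [ -f "$key_path" ]; then
        echo "key exists: $key_path"
        return 0
    fi
    # -m 700 applies only when the directory is newly created.
    mkdir -p -m 700 "$(dirname "$key_path")"
    # Assumption: -N "" means an empty passphrase, so the call is non-interactive.
    ssh-keygen -t ed25519 -C "$label" -f "$key_path" -N ""
}
```

Usage: `ensure_key "$HOME/.ssh/id_ed25519" "<label>"`. Running it a second time is a no-op, so it is safe at the top of a setup script.
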
## 2. Push the public key to an instance

Copy the connection string from the platform's web console / API; it has the shape
`ssh -p <PORT> root@connect.<region>.<provider>.com`. Push the public key once:

```bash
ssh-copy-id -p <PORT> root@connect.<region>.<provider>.com
# enter the platform-provided password ONCE
```

If `ssh-copy-id` is absent (common on Windows-native shells), append the key manually:

```bash
# local: print the public key, then copy the entire line
cat ~/.ssh/id_ed25519.pub
ssh -p <PORT> root@connect.<region>.<provider>.com
# on the remote:
mkdir -p ~/.ssh && chmod 700 ~/.ssh
echo "<paste the public key line>" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
exit
```

Test: re-running the `ssh …` line should connect **without** a password prompt.

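Repeating the manual append adds a duplicate line each time. One way to make it idempotent is sketched below; `add_key_once` is a hypothetical helper (not something the platforms or OpenSSH provide) that reads the public key from stdin and appends it only if that exact line is absent.

```bash
# Hypothetical helper: append the key read from stdin unless it is already listed.
add_key_once() {
    auth_file=$1                      # e.g. "$HOME/.ssh/authorized_keys"
    IFS= read -r pubkey || true       # tolerate a key file without a trailing newline
    [ -n "$pubkey" ] || return 1      # nothing on stdin: refuse to touch the file
    mkdir -p -m 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -x whole-line match, -F fixed string: '+' and '/' in the key are not regex.
    grep -qxF -- "$pubkey" "$auth_file" || printf '%s\n' "$pubkey" >> "$auth_file"
}
```

On the remote box, with the function defined and the public key saved to a file, this would be `add_key_once ~/.ssh/authorized_keys < id_ed25519.pub`.
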
## 3. `~/.ssh/config` alias + keepalive tuning

One block per instance turns `ssh -p <PORT> root@connect.<region>.<provider>.com` into `ssh <alias>`,
and folds in the keepalive options that keep long monitoring/transfer connections from dropping.

```ssh-config
Host proj-1
    HostName connect.<region>.<provider>.com
    Port <PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60
    ServerAliveCountMax 120
    TCPKeepAlive yes
    # LogLevel VERBOSE   # uncomment to debug a refused/hung connection

Host proj-2
    HostName connect.<region>.<provider>.com
    Port <PORT>
    User root
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 60
    ServerAliveCountMax 120
    TCPKeepAlive yes
```

**Naming**: `<project>-<index>` (e.g. `proj-1`, `proj-2`) reads cleanly in a fan-out loop; avoid bare
`gpu1`.

**Why the three keepalive options**:

- `ServerAliveInterval 60`: send an application-layer heartbeat every 60 s, so a NAT/idle timeout on
  the path does not silently drop a parked connection (mid-`scp`, or an open monitor).
- `ServerAliveCountMax 120`: tolerate up to 120 missed heartbeats before declaring the link dead
  (60 s × 120 = 7200 s, about 2 h of network instability survived). Lower it (e.g. 3) for a *bounded*
  monitor that should self-kill on a blip rather than hang; see the short-connection poll in
  `references/monitoring_patterns.md`.
- `TCPKeepAlive yes`: let the OS also emit TCP-layer keepalives, catching a peer that vanishes
  ungracefully. This is already the OpenSSH default; stating it keeps the block explicit.

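The tolerance window is just the product of the two settings. A small sketch of that arithmetic follows; `keepalive_tolerance_s` is a hypothetical helper name used only for illustration.

```bash
# Hypothetical helper: seconds of unanswered heartbeats before ssh gives up.
keepalive_tolerance_s() {
    interval=$1    # ServerAliveInterval, in seconds
    count_max=$2   # ServerAliveCountMax, missed heartbeats allowed
    echo $(( interval * count_max ))
}

keepalive_tolerance_s 60 120   # prints 7200 (about 2 h, the long-transfer setting)
keepalive_tolerance_s 60 3     # prints 180 (a bounded monitor that gives up quickly)
```

For a one-off bounded session the config value can be overridden per call, for example `ssh -o ServerAliveCountMax=3 <alias>`, without editing `~/.ssh/config`.
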
Ports change when a profile re-issues an instance: `ssh-rental` boxes are assigned a new port on
re-creation. Update the `Port` line after each create/recreate, then re-run §4.

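Editing the `Port` line by hand is easy to get wrong when several blocks look alike. The sketch below rewrites the port for one alias only. `set_port` is a hypothetical helper, and it assumes each `Host` line names exactly one alias, spelled with the same capitalisation as in the block above.

```bash
# Hypothetical helper: set_port <alias> <new_port> <config_file>
set_port() {
    alias_name=$1
    new_port=$2
    config=$3
    tmp=$(mktemp)
    awk -v host="$alias_name" -v port="$new_port" '
        # Track whether the current line belongs to the requested Host block.
        $1 == "Host" { in_block = ($2 == host) }
        # Rewrite the value only; leading indentation is left untouched.
        in_block && $1 == "Port" { sub(/Port[ \t]+[^ \t]+/, "Port " port) }
        { print }
    ' "$config" > "$tmp" && cat "$tmp" > "$config"   # cat keeps the file mode
    rm -f "$tmp"
}
```

Usage: `set_port proj-2 <PORT> ~/.ssh/config`, followed by the §4 check.
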
## 4. Verify the alias

```bash
for a in proj-1 proj-2 proj-3 proj-4; do
    echo "=== $a ==="
    ssh -o ConnectTimeout=10 "$a" "hostname; date"
done
```

Each should print a distinct hostname. Then run the env probe (SKILL.md Phase 1):
`ssh <alias> 'python -c "import torch;print(torch.cuda.is_available())"'`.

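For a fan-out of many boxes, a pass/fail summary can be easier to scan than raw output. The following is a sketch under assumptions: `check_aliases` is a hypothetical helper, and `BatchMode=yes` is added here so that a box whose key was never pushed fails immediately instead of waiting at a password prompt.

```bash
# Hypothetical helper: report each alias as OK/FAIL; exit status = number of failures.
check_aliases() {
    failed=0
    for a in "$@"; do
        # BatchMode=yes: never prompt, so missing key auth shows up as a failure.
        if out=$(ssh -o BatchMode=yes -o ConnectTimeout=10 "$a" hostname 2>/dev/null); then
            echo "OK   $a -> $out"
        else
            echo "FAIL $a"
            failed=$(( failed + 1 ))
        fi
    done
    return "$failed"
}
```

Usage: `check_aliases proj-1 proj-2 proj-3 proj-4 || echo "some boxes unreachable"`.
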
## 5. Resumable copy: rsync vs scp, and WHY rsync

`scp` opens **one** SSH stream for the whole transfer and **cannot resume**: any blip mid-copy aborts
the entire run and a re-run starts from zero. `rsync` compares source/dest and ships only the delta, so
a re-run after a drop **continues** instead of restarting. That is the single most important property
on a metered box where a 130 GB pull can blip at minute 45.

**Prefer `rsync` for anything large or multi-file:**

```bash
rsync -avz --partial --inplace --progress \
  -e ssh \
  <alias>:/root/autodl-tmp/checkpoints/ /path/to/local/checkpoints/
```

- `-a` archive (recurse + preserve perms/times/symlinks), `-v` verbose, `-z` compress on the wire.
- `--partial` keeps a partially-transferred file on interruption so the next run resumes mid-file
  (without it, rsync deletes the partial and re-sends from the start).
- `--inplace` writes directly into the destination file (resume-friendly; avoids a full temp copy on a
  tight local disk). Drop it if atomic-replace of an existing dest matters more than resumability.
- Re-run the **identical** command after any failure; that *is* the resume (principle #7).

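"Re-run the identical command" can be automated with a bounded retry wrapper. This is a minimal sketch, not a script shipped in `scripts/`; `retry` is a hypothetical helper, and the attempt count and delay are illustrative values.

```bash
# Hypothetical helper: retry <max_attempts> <delay_s> <command...>
retry() {
    max_attempts=$1
    delay_s=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0                                # success: stop retrying
        [ "$attempt" -ge "$max_attempts" ] && return 1  # out of attempts: give up
        echo "attempt $attempt failed; retrying in ${delay_s}s" >&2
        sleep "$delay_s"
        attempt=$(( attempt + 1 ))
    done
}

# Example call (placeholders as in the command above):
#   retry 5 30 rsync -avz --partial --inplace -e ssh <alias>:/remote/dir/ /local/dir/
```

Because each attempt is the same `rsync --partial` invocation, every retry picks up where the previous one stopped rather than starting over.
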
Use plain `scp` only for a **single small** file (a config, one checkpoint < ~1 GB) where resume is
moot. For a large *tree*, even `scp` users should fall back to the **per-dir loop** (§6) so one dir's
failure doesn't lose the rest. If `rsync` is missing on the remote image, install it with
`apt-get install rsync` (when the box is online) or use the §6 loop.

> The bulk-download stall-retry ladder (HF/ModelScope mirror swaps, `timeout … && break` loops) is a
> *download-from-the-internet* concern, not host↔host copy; that lives in `references/china-network.md`.

## 6. Bulk per-dir download loop

For a large directory tree (many run/checkpoint dirs), wrap each dir in its **own** SSH session so a
single drop loses only that dir, and a re-run **skips already-complete dirs**:

→ `scripts/download_loop.sh` (parameterize `LOCAL_TARGET`, `REMOTE_ALIAS`, `REMOTE_PATH`).

Its shape, and why each piece matters:

- **List once, copy per-dir**: each `scp -r <alias>:<remote>/$d ./` is an independent session; one
  failure ≠ whole-transfer loss (the `scp` single-stream trap, §5).
- **Size-threshold skip**: a dir already ≥ threshold counts as complete and is skipped; a partial dir
  is removed and re-pulled. Re-running the whole script is therefore idempotent and resumable.
- **Per-dir `ConnectTimeout` + the §3 keepalive flags** on every `scp` so a hung session self-kills
  instead of blocking the loop.

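The three pieces above can be sketched as follows. This is an illustration of the shape, not the contents of `scripts/download_loop.sh`: the function names, the KiB-based threshold, and the exact `-o` values are assumptions.

```bash
# Hypothetical helper: a dir counts as complete once it holds at least min_kb on disk.
dir_is_complete() {
    dir=$1
    min_kb=$2
    [ -d "$dir" ] || return 1
    size_kb=$(du -sk "$dir" | awk '{print $1}')
    [ "$size_kb" -ge "$min_kb" ]
}

# Hypothetical driver: pull_dirs <alias> <remote_path> <local_target> <min_kb>
pull_dirs() {
    remote_alias=$1; remote_path=$2; local_target=$3; min_kb=$4
    mkdir -p "$local_target"
    # List once; every entry below then gets its own scp session.
    ssh -o ConnectTimeout=10 "$remote_alias" "ls -1 '$remote_path'" |
    while IFS= read -r d; do
        [ -n "$d" ] || continue
        if dir_is_complete "$local_target/$d" "$min_kb"; then
            echo "skip  $d"
            continue
        fi
        rm -rf "${local_target:?}/$d"    # remove a partial dir before re-pulling
        # </dev/null keeps scp from consuming the rest of the listing on stdin.
        scp -r -o ConnectTimeout=10 -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
            "$remote_alias:$remote_path/$d" "$local_target/" </dev/null \
            || echo "fail  $d (re-run to retry)" >&2
    done
}
```

A size threshold is a coarse completeness test: it fits runs whose dirs all end up at roughly the same size, and a checksum or file-count check would be a stricter alternative.
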
## 7. Move secrets via stdin: never inline a key, never on a durable FS

Putting a credential **in a command** (`ssh host "echo 'KEY' > …"`, or `scp key.txt host:…`) leaks the
value into shell history, agent transcripts, and hook logs. Putting it on a **shared /
durable FS** is worse: the value persists for every co-tenant, and some platforms' upload classifiers
*block or corrupt* a file matching a known key pattern, so a credential written to the cross-instance
FS may silently never arrive. **Push credentials to each box's per-instance system disk, via stdin**, so
the value flows file → pipe → file and appears in no command text or output:

```bash
# stream exactly one credential block; the value never appears on a command line
# (-A 2 assumes the stanza spans three lines: machine / login / password)
grep -A 2 "machine api.<provider>.com" ~/.netrc \
  | ssh <alias> 'umask 077; cat > /root/.netrc && chmod 600 /root/.netrc'
```

```bash
# or a single token, same principle (stdin in, file out, chmod 600)
printf '%s\n' "$TOKEN_FROM_ENV" \
  | ssh <alias> 'umask 077; cat > /root/.<service>_key && chmod 600 /root/.<service>_key'
```

Rules that make this safe:

- **One block, not the whole file.** Stream a single `machine …` stanza, never the entire `~/.netrc`:
  it carries unrelated machines' credentials, and security hooks (rightly) block copying the whole file.
- **Reference, never echo.** Source the token from an env var (`$TOKEN_FROM_ENV`) or a keyring; never
  paste the literal value into the command.
- **Per-instance system disk, not the shared FS.** Write to `/root/.<service>_key` (volatile but
  private), not the cross-instance durable mount. The wrapper reads it and exports the env var before
  launch (e.g. `export WANDB_API_KEY=$(cat /root/.wandb_key)`).
- **Verify by capability, not by echoing the value:**
  `ssh <alias> 'python -c "import wandb; print(wandb.Api(timeout=20).default_entity)"'`.

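`grep -A 2` only captures the right stanza when it is laid out on exactly three lines. A layout-tolerant alternative is sketched below; `netrc_stanza` is a hypothetical helper, and it assumes every `machine` (or `default`) entry starts at the beginning of its own line.

```bash
# Hypothetical helper: print only the stanza for one machine from a netrc file.
netrc_stanza() {
    machine=$1
    file=$2
    awk -v m="$machine" '
        # A new "machine" or "default" token opens a new stanza; keep only ours.
        $1 == "machine" || $1 == "default" { keep = ($1 == "machine" && $2 == m) }
        keep { print }
    ' "$file"
}
```

It slots into the same pipe as before: `netrc_stanza api.<provider>.com ~/.netrc | ssh <alias> 'umask 077; cat > /root/.netrc && chmod 600 /root/.netrc'`. The value still travels only over stdin.
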
## 8. CRLF: `.sh` authored on Windows breaks on Linux

Symptom → Root cause → Fix:

- **Symptom**: a synced launcher does nothing (empty log); run by hand it errors `set: -: invalid
  option`, `cd: /path\r: No such file or directory`, or `syntax error near unexpected token $'do\r'`.
  Every line "ends in `\r`".
- **Root cause**: Windows `core.autocrlf=true` (or `git archive` exporting with the working-tree EOL)
  writes `.sh` with CRLF; Linux `bash` treats the trailing `\r` as part of each token. (`.py` is
  unaffected because Python's universal newlines tolerate CRLF; specifically `bash`/`.sh` breaks.)
- **Fix**: add `.gitattributes` with `*.sh text eol=lf` so `git archive`/checkout always emits LF; as an
  immediate on-box unblock, `sed -i 's/\r$//' scripts/*.sh`.

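A pre-flight check can catch CRLF before a launcher silently does nothing. The sketch below uses two hypothetical helpers, `has_crlf` and `strip_crlf`; it avoids `sed -i` and `\r` escapes so it also runs where `sed` is not GNU.

```bash
# Hypothetical helper: succeed if the file contains any carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Hypothetical helper: drop the trailing CR from every line, keeping the file mode.
strip_crlf() {
    cr=$(printf '\r')
    tmp=$(mktemp)
    sed "s/$cr\$//" "$1" > "$tmp" && cat "$tmp" > "$1"   # cat preserves the exec bit
    rm -f "$tmp"
}
```

Usage: `for f in scripts/*.sh; do has_crlf "$f" && strip_crlf "$f"; done` before launching anything synced from Windows.
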
Every shell script in `scripts/` ships LF and starts with `#!/usr/bin/env bash` + `set -u`; keep that
contract when authoring new ones. **Never** put an unquoted `|` inside a `grep` regex in a transport or
poll script: the shell splits the line into piped commands, and the first `grep`, left with no file
argument, reads stdin and hangs forever (`references/monitoring_patterns.md`).

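A short illustration of the quoting rule, with made-up log lines and patterns (both are assumptions chosen for the example, not strings taken from the poll scripts):

```bash
# Quoted alternation: the shell hands the whole pattern to grep as one argument.
count_errors() {
    grep -cE 'Traceback|CUDA out of memory'
}

printf 'step 10 loss 0.5\nTraceback (most recent call last)\n' | count_errors   # prints 1

# Unquoted, the same text would parse as two commands joined by a pipe:
#   grep -cE Traceback | CUDA out of memory logfile
# The first grep has no file argument, so it waits on stdin.
```
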
## 9. Two SSH flavors: proxied/basic SSH cannot `scp`

Some `ssh-rental` platforms expose **two** SSH endpoints, and the difference dictates whether file
transfer works at all:

- **Direct TCP SSH**: a real TCP port to the container (the `connect.<region>.<provider>.com:<PORT>`
  shape above). Full `scp`/`rsync`/`sftp` work. This is what every transfer in this file assumes.
- **Proxied / "basic" SSH**: a relayed or web-terminal SSH (common on RunPod and vast.ai for the
  default exposed endpoint). It carries an **interactive shell only**: `scp`/`rsync`/`sftp` fail (often
  with `subsystem request failed` / a hung handshake) because the proxy forwards only the interactive
  shell, not the SFTP subsystem or the remote-command sessions these tools rely on.

**Fix**: for any code/data/checkpoint transfer, use the **direct-TCP** endpoint. On RunPod expose a
TCP port (the `ssh root@<ip> -p <PORT>` form, not the proxied `ssh <pod>@ssh.runpod.io` one); on vast.ai
use the instance's direct SSH port. Each profile's §3 NETWORK names which endpoint is which and whether
ports change on restart. If only proxied SSH is available, transfer out-of-band instead (push results to
object storage / HF Hub from on-box and pull from there).

## 10. Transport gotchas (Symptom → Root cause → Fix)

Universal gotchas (disk-full, inode, OOM, silent sync) are **not** repeated here; see
`references/gotchas_universal.md`. These are transport-specific.

**T1: SSH exits 255 / "Connection reset" right after a `pkill`/`kill`.**
Symptom: `ssh <alias> 'pkill -9 -f src.train'` returns `Connection reset by peer`, exit 255. → Root
cause: killing the process tree disrupts the PTY chain; the SSH client receives EOF and exits, and
anything *after* the kill in that same one-liner never runs. → Fix: this is **normal**, not a failure.
Re-ssh to verify (`ssh <alias> "pgrep -af '[s]rc.train' || echo CLEAN"`). The `[s]` stops `pgrep -f`
from matching the remote shell that carries the pattern in its own command line, and with no
`| head -1` in the way, `|| echo CLEAN` reacts to `pgrep`'s own exit status. Split kill and relaunch
into **two** ssh calls. Never run `pkill X; relaunch X` in one command: the relaunch is dropped with
the session.

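The two-call rule can be wrapped so it is hard to get wrong. A minimal sketch follows; `restart_job` is a hypothetical helper and the launch command is whatever the profile uses (it is not specified here).

```bash
# Hypothetical helper: restart_job <alias> <process_pattern> <launch_command>
restart_job() {
    alias_name=$1
    pattern=$2
    launch_cmd=$3
    # Call 1: the kill. A non-zero exit (often 255) is expected, so it is ignored.
    ssh "$alias_name" "pkill -9 -f '$pattern'" || true
    # Call 2: a fresh session, so the relaunch cannot be dropped along with call 1.
    ssh "$alias_name" "$launch_cmd"
}
```

Usage: `restart_job <alias> src.train '<launch command>'`. The function's exit status is that of the relaunch call only.
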
**T2: large `scp -r` drops with "Read from remote host … reset by peer" 30–60 min in.**
Symptom: a 130 GB `scp -r` aborts mid-transfer; the local tree has only the first few dirs, the rest
gone. → Root cause: one SSH stream for the whole transfer; any blip kills it and `scp` does not resume.
→ Fix: use `rsync --partial` (§5) or the per-dir loop (§6), where each dir is an independent session
and a re-run skips completed dirs.

**T3: `.sh` "ends in `\r`" after a Windows→Linux sync.**
See §8 (`.gitattributes` `*.sh text eol=lf`; on-box `sed -i 's/\r$//'`).

**T4: a credential leaks into history / a shared FS, or its FS upload silently fails.**
Symptom: a key pasted into an `ssh`/`scp` command lands in transcripts and hook logs; an scp of the key
to the shared FS "succeeds" but the file is missing or corrupt. → Root cause: the value appeared in a
command line; and some platforms' FS classifiers block/corrupt credential-shaped uploads. → Fix: §7.
Stream one block via stdin to the per-instance disk, and verify by capability, not by echo.

**T5: `scp dest open "/root/x/": Failure` instantly.**
Symptom: a (often parallel/background) `scp big.tar <alias>:/root/x/` fails at once because the
destination dir doesn't exist; a sibling command meant to `mkdir` it ran later, or was blocked. → Root
cause: the transfer assumed a directory a *different* command was supposed to create (a parallel-setup
race). → Fix: make every transfer self-sufficient by creating the dest in the same command:
`ssh <alias> 'mkdir -p /root/x' && scp … || retry`. Never assume a sibling created the destination.

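The T5 fix fits in a two-line function. This is a sketch only: `push_file` is a hypothetical helper, and retry handling is left to the caller.

```bash
# Hypothetical helper: push_file <local_file> <alias> <remote_dir>
push_file() {
    src=$1
    alias_name=$2
    dest_dir=$3
    # Create the destination in the same step, so no sibling command is relied on.
    ssh "$alias_name" "mkdir -p '$dest_dir'" && scp "$src" "$alias_name:$dest_dir/"
}
```

Usage: `push_file big.tar <alias> /root/x`. `mkdir -p` is a no-op when the directory already exists, so the call is safe to repeat.
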
**T6: `Host key verification failed` after an instance is recreated.**
Symptom: same `connect.<region>.<provider>.com` host, new host key, so SSH refuses. → Root cause: the
recreated container presents a different host key on the reused hostname/port. → Fix:
`ssh-keygen -R '[connect.<region>.<provider>.com]:<PORT>'`, then reconnect (re-accepts the new key).
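
The bracketed form matters because OpenSSH records a host on a non-default port as `[host]:port` in `known_hosts`, while port 22 entries use the bare hostname. A small sketch that builds the right name (both function names are hypothetical):

```bash
# Hypothetical helper: the name under which known_hosts stores host + port.
known_hosts_name() {
    host=$1
    port=${2:-22}
    if [ "$port" = "22" ]; then
        echo "$host"
    else
        echo "[$host]:$port"
    fi
}

# Hypothetical helper: drop the stale entry so the next connect re-accepts the key.
forget_host() {
    ssh-keygen -R "$(known_hosts_name "$1" "$2")"
}
```

Usage: `forget_host connect.<region>.<provider>.com <PORT>`, then reconnect with `ssh <alias>`.
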