--- platform: china-family # 矩池云 Matpool · 恒源云 Gpushare · Featurize · 揽睿星舟 LanRui kind: ssh-rental # all four: SSH + Jupyter + tmux, cgroup-isolated, prebuilt conda base meter_stop_verb: per-platform # 停止并释放 (Matpool) | 关机→释放 (Gpushare) | 实例归还 (Featurize) | 停止+销毁数据盘 (LanRui) meter_stop_irreversible: mixed # releasing the instance is; the persistent vol survives — EXCEPT LanRui 数据盘 bills while stopped detach_primitive: tmux # preinstalled on most images; backgrounded python survives tab-close too spot_available: false # on-demand only — NO mid-run spot reclaim (see §4) spot_grace: n/a # the involuntary-loss vector is auto-release of STOPPED instances, not preemption shared_fs: per-platform # /mnt | /hy-netdisk(+/hy-nas) | work+/cloud | /home/user/netdisk/data — region/machine-scoped, see §2 inode_cap: undocumented # size caps documented (5/20/30/10 GB free); inode caps NOT — measure df -i live free_egress: true # intra-China; cross-GFW pulls need a mirror (see references/china-network.md) china_mirror_needed: true # all four sit behind the GFW — mirror/proxy story is shared, not per-platform host_driver_cuda_max: image-dependent local_nvme: per-platform # Gpushare /hy-tmp, LanRui /home/user/datadisk, Featurize local scratch --- # Profile: Chinese GPU-rental family (Matpool · Gpushare · Featurize · LanRui) One-line purpose: the AutoDL-shaped Chinese rentals — near-clones that share AutoDL's SSH+tmux+prebuilt-base spine but diverge on **what survives a stop**, **whether a stopped data disk still bills**, and **which (if any) academic proxy** ships. Treat AutoDL (`profiles/autodl.md`) as the reference implementation; this profile records only the deltas, at the FAMILY level first, then a per-platform comparison table. > **Surface to the user up front (principle #10):** ⚠️ Danger clocks (per platform, §5) — a **stopped instance is auto-released** (Gpushare ~10 days, others vary) → data gone; **LanRui's 数据盘 bills while stopped**; Gpushare's **`/hy-tmp` is wiped 24 h after stop** and `/root` resets to the image. Conveniences — built-in **JupyterLab / TensorBoard** quick-tools (all four); **declare any custom port at rent time** ("高级选项") — it can't be opened later. **To jump:** `grep -in '' profiles/china.md` (e.g. `proxy`, `ephemeral`, `bills`, `inode`, `LanRui`). ## Table of contents 1. LAUNCH · 2. STORAGE MODEL (survival matrix + `/root`-ephemeral trap) · 3. NETWORK (→ `references/china-network.md`) · 4. SPOT/INTERRUPTION · 5. TEARDOWN/BILLING · 6. DAEMON TOOL · 7. TOP GOTCHAS (universal → `references/gotchas_universal.md`) + Platform-specific debugging · 8. SCRIPT OVERRIDES · 9. Per-platform comparison table > Universal gotchas (CRLF, cgroup OOM, silent sync, tmux-holds-script, disk-budget, secrets-off-shared-FS) > are NOT restated here — see `references/gotchas_universal.md`. The mirror/proxy/download story is NOT > restated either — it is shared across all CN platforms and lives in `references/china-network.md`. --- ## 1. LAUNCH All four: web console rents a marketplace machine → pick GPU count + a **prebuilt image** (PyTorch/TF + CUDA) that *is* the env → connect via SSH (auto-generated password or pushed public key) + JupyterLab; VS Code Remote-SSH works on all of them. **Env contract — the image/base IS the env; do not `conda create` on a rental.** Same rule as AutoDL, with per-platform base-activation wrinkles (verified per-platform docs 2026-06): - **Featurize** — base is fully provisioned and used directly; `pip`/`conda install` into base persists on the `work` workspace. Activate and run. - **Matpool** — ships a **`myconda` env that auto-activates on startup** (interpreter at `/root/miniconda3/envs/myconda/bin/python`). Run directly; no re-enable needed (verified matpool conda docs 2026-06 — this corrects the earlier "auto-activate off" note, which was true only for Gpushare). - **Gpushare** — ships miniconda but **base auto-activate is *disabled*** (`登陆终端默认取消了自动进入 base 环境`). Re-enable (`conda config --set auto_activate_base true`) or activate the named env per session (verified gpushare.com/docs/best_practices/conda 2026-06). - **LanRui** — image-provisioned (PyTorch images are a purchasable image option); base used directly. An unavoidable custom env goes **on the persistent disk** (`--prefix //myenv`), never the small system disk (§2) — a system-disk env is wiped wherever `/root` is ephemeral. On Gpushare specifically, docs recommend `conda create -p /hy-netdisk/myenv` (NOT `/hy-tmp` — that auto-clears 24 h after shutdown, GS5). → **verify:** `ssh 'python -c "import torch;print(torch.cuda.is_available())"'` returns `True` against the *prebuilt* interpreter, before any install. --- ## 2. STORAGE MODEL *(survival matrix — principle #4)* **The family-level trap that breaks ported AutoDL habits: `/root` (the system disk) is NOT durable on every platform.** AutoDL persists `/root` across a power-off; here it ranges from "resets to image state on every restart" (Gpushare) to "wiped the instant the instance is returned" (Featurize). Checkpoints MUST go to the platform's persistent mount, not `/root`. Each platform pairs a **small reset-prone system disk** with a **persistent network/data disk**: | Platform | System disk (`/root` etc.) | Persistent mount | Local fast scratch | Free quota | |---|---|---|---|---| | Matpool | `/root` (instance-local; snapshot-captured) | **`/mnt`** netdisk (survives release, expandable, region-scoped) | `/root` (local) | 5 GB netdisk | | Gpushare | `/` incl. `/root` — **resets to image on stop/restart** | **`/hy-netdisk`** (only on *marked* machines) | **`/hy-tmp`** (local SSD; auto-cleared 24 h after stop) | 20 GB sys disk | | Featurize | wiped on return | **`work` = `/home/featurize`** (persists) + **`/cloud`** sync drive | local (non-persistent) | **30 GB** free cloud | | LanRui | system disk lost on stop | **数据盘 = `/home/user/datadisk`** + shared **`/home/user/netdisk/data`** | 数据盘 (block-storage, ≈ sys-disk speed) | 网盘 10 GB free | **Survival matrix (family):** | Tier | Survives STOP? | Survives RELEASE/RETURN? | Notes | |---|---|---|---| | System disk / `/root` | varies (Gpushare: **NO, resets to image**; Featurize: wiped on return) | NO | never the checkpoint target | | Persistent netdisk / 数据盘 | YES | YES (except LanRui 数据盘 still **bills** — §5) | the only safe checkpoint target | | Cross-instance shared folder | YES | YES | region/zone/machine-scoped; **clobber risk** (see gotchas) | | Gpushare `/hy-tmp` (local SSD) | **NO — auto-cleared 24 h after shutdown** | NO | fast scratch only; copy results to `/hy-netdisk` before stop (GS5) | **Region / scope locks** (analog of AutoDL's region-scoped FS): - **Matpool** — `/mnt` netdisk is **region-scoped**: different regions have separate netdisks that don't interconnect; pick the region before expanding storage (verified matpool FAQ 2026-06). - **Featurize** — code in `work`/`/cloud` persists; per common usage reports the cloud sync drive does not share across different regions, while *datasets* are reusable — confirm all instances of a sweep are same-region before fan-out. *(med confidence — official wording not re-verified 2026-06.)* - **Gpushare** — `/hy-netdisk` exists **only on machines marked as supporting 内网存储**; an unmarked machine has no shared mount. A separate **`/hy-nas`** shared storage (0.0007 元/GB·h) exists on specific instances (verified gpushare.com/docs/data 2026-06). - **LanRui** — `/home/user/netdisk/data` is a **per-availability-zone shared folder auto-mounted into *every* workspace in that zone** (`data 文件夹下的任何数据,都可以在该可用区下的所有工作空间中使用`) — convenient, but a parallel-ablation clobber hazard (gotcha LR2). The 网盘 is poor at many-small-file *writes*; use the 数据盘 for that (verified docs.lanrui.co storage 2026-06). **Inode caps:** size caps are documented (5 / 20 / 30 / 10 GB free across the four); **explicit inode caps are NOT documented by any of them**. The many-small-files metadata-exhaustion risk still transfers to any shared FS — measure `df -i ` on a live instance in Phase 0 rather than assuming a number. Redirect HF/ModelScope caches off the small system disk → see `references/china-network.md` §2. State the checkpoint mount for §5's teardown verb: write to the **persistent netdisk/数据盘**, never `/root`. On Gpushare, also stage hot datasets to `/hy-tmp` (local SSD) for IO, but copy results back to `/hy-netdisk` before stopping — `/hy-tmp` is local AND auto-wiped 24 h after shutdown (GS5). --- ## 3. NETWORK **The entire mirror / proxy / resumable-download story is shared across all CN platforms and lives in `references/china-network.md` — do NOT duplicate it here.** That reference owns the mirrors table (PyPI/conda/HF), `HF_ENDPOINT=https://hf-mirror.com`, the ModelScope fallback, the resumable-download retry ladder, the `hf_transfer` hang caution, and the `no_proxy` trap. Only the per-platform **egress accelerator** differs and is recorded here (verified per-platform docs 2026-06): - **Gpushare — has a real academic proxy** (the closest analog to AutoDL's `/etc/network_turbo`): `export https_proxy=http://turbo.gpushare.com: http_proxy=http://turbo.gpushare.com:` (a `turbo2.gpushare.com:` backup host also exists). Two critical differences from AutoDL: (a) it is **per-session export**, NOT auto-sourced — re-run it in every new terminal/tmux pane; (b) it **whitelists only `*.github.com`, `*.github.io`, `*.githubusercontent.com`, `*.githubassets.com`, `*.huggingface.co`, `*.pytorch.org`, `*.kaggle.com` and *restricts every other host*** — so `unset http_proxy https_proxy` (or `unset http_proxy && unset https_proxy`) the moment the accelerated pull finishes, or `pip`/`apt`/domestic mirrors mystery-fail (gotcha GS2). This is exactly the `no_proxy`/route-specific trap in principle #7 — validate the speed test on the same route the real transfer uses (verified gpushare.com/docs/instance/network_turbo 2026-06). - **Matpool** — no one-command egress proxy; ships source-switch scripts under `/public/script/` (`switch_conda_source.sh`, `switch_pip_source.sh`, `switch_apt_source.sh`). Fall back to mirrors (`references/china-network.md`). - **Featurize / LanRui** — no documented one-command academic proxy surfaced; mirrors only. **Port exposure:** JupyterLab/TensorBoard are built-in quick-tools (all four). **Custom ports must be declared at rent time** ("高级选项" on Matpool, e.g. HTTP-6006 TensorBoard / HTTP-8888) — they cannot be opened post-launch. Ports may change on restart — re-read the console, don't hard-code a port in an alias. SSH is standard OpenSSH (scp/rsync work directly; no proxied-SSH `scp` limitation). Sanitized shapes: `ssh -p root@.matpool.com` (Matpool, e.g. `hz.matpool.com` / `hz-t2.matpool.com`), `ssh -p root@.gpushare.com` (Gpushare), `ssh user@ssh..lanrui-ai.com -p -i ~/.ssh/id_rsa` (LanRui — public-key must be uploaded to the console first). --- ## 4. SPOT / INTERRUPTION + RESUME *(principle #7/#8)* **These are on-demand-only platforms — there is NO spot bid and NO documented mid-run reclaim.** Do not build SIGTERM-grace preemption handling here; aggressive retry-on-preemption is over-engineering on this family. The real involuntary-loss vectors are: 1. **Auto-release of *stopped* instances.** Gpushare auto-releases (deletes, unrecoverable) a stopped pay-as-you-go instance **10 days after stop** (`实例停止 10 天后,会自动释放` — verified gpushare.com/docs/instance/manage 2026-06). On arrears, **at noon on the 15th day** Gpushare deletes personal data + the `/hy-nas` shared storage + custom images. A stopped box is not a parked box — pull anything needed off it before that window. 2. **`/hy-tmp` 24-hour auto-clear (Gpushare).** Distinct from instance release: even on a *running* server, `/hy-tmp` data is deleted **24 h after the instance is shut down**, and is also wiped on instance migration (GS5). 3. **GPU-idle auto-shutdown.** Most platforms offer an opt-in "idle → auto-stop" policy to prevent waste; if enabled it can stop a job that merely went quiet (e.g. between epochs with no GPU util) — keep it off for long single-GPU jobs unless heartbeat is guaranteed. 4. **Platform churn (LanRui).** LanRui migrated domain `lanrui-ai.com` → **`lanrui.co`** (old-domain data not retained after **2024-11-01**) and retired its **T1/T2 zones on 2025-06-30**, moving users to a new "Cova" platform — **re-verify current console paths/domain before scripting against any cached LanRui path**. **Resume hook:** checkpoint-to-durable + load-latest-on-startup (principle #8) is still the right spine — here it guards against a forgotten stop, a 10-day auto-release, and a `/hy-tmp` 24 h wipe, not a spot kill. The cadence formula in `references/spot-resilience.md` still applies if a job is long enough to span a forced stop. --- ## 5. TEARDOWN / BILLING *(principle #9 + the Iron Law)* **The meter-stop verb is per-platform — bind it from the table below before clicking anything.** The Iron Law (SKILL.md Phase 5) holds unchanged: NO release/return/destroy until checkpoints are **pulled to local AND verified by load**, and the user has approved the cost-affecting action. | Platform | Meter-stop verb | What it preserves | Cost trap | |---|---|---|---| | Matpool | **停止并释放** (stop+release) | `/mnt` netdisk persists (region-scoped) | `.snap` snapshots silently eat the 5 GB netdisk (MP1) | | Gpushare | **关机** stops compute → **释放** deletes | `/hy-netdisk` persists; `/hy-tmp` cleared 24 h post-stop; `/root` **resets to image** | stopped instance **auto-released at 10 days** (GS4); arrears purge day-15 noon | | Featurize | **实例归还** (return) | only `work` (`/home/featurize`) + `/cloud` persist | everything else **wiped immediately on return** (FZ1) | | LanRui | **停止** stops compute; **must *销毁数据盘*** (destroy the 数据盘) to stop disk billing | 网盘 + 数据盘 persist | **数据盘 bills hourly while the workspace is merely STOPPED** (LR1) | **The single most dangerous divergence: on LanRui, "stop to save money" is wrong.** The 数据盘 (`/home/user/datadisk`, block storage, bought in 200 G / 500 G specs) bills hourly from *creation* until *destroyed*, even while the workspace is stopped — `工作空间停止运行,未销毁的数据盘也将持续计费` (verified docs.lanrui.co storage + lanrui.co/pricing 2026-06). So a stopped LanRui workspace keeps a meter running. To actually stop all billing: stop the workspace AND destroy the 数据盘 (after the Iron-Law pull+verify). The 网盘 (10 GB free, 0.15 元/GB·月 overage) persists separately. Contrast: on Matpool/Gpushare/Featurize, release/return/归还 ends compute billing and the persistent volume simply survives (Gpushare /hy-netdisk and /hy-nas bill per-GB but are not destroyed by stopping). **Cost-pause analogs (cheaper than full release, data kept):** Gpushare **无卡模式 / 无卡启动** (low-core CPU-only restart, no GPU) is the analog of AutoDL's no-GPU restart — keeps `/hy-netdisk` data while paused at a fraction of the GPU rate, ideal for env-config + dataset download (verified gpushare 无卡启动 announcement 2026-06). LanRui supports an **auto-stop timer** (set a stop time at workspace start) and per-hour billing. --- ## 6. DAEMON TOOL **tmux** is the family detach primitive — preinstalled on most images. Caveat (from Matpool docs, true family-wide): **run tmux from a local SSH session, NOT the Jupyter web terminal** — keybindings collide with tmux's prefix. A backgrounded `nohup python … log 2>&1 &` also survives a tab-close / page refresh on Featurize (process not killed; only notebook cell state lost) — but tmux is preferred for a named, re-attachable session. tmux survives an **SSH drop** but **NOT** an instance **stop/restart** on any platform (on Gpushare the restart resets `/root`, taking the tmux server and any `/root` logs with it) — so the durable spine is checkpoint-to-persistent-disk (§2, principle #8), not the tmux session. LanRui additionally supports **multi-machine multi-GPU distributed training** — if used, see `references/multinode.md`. --- ## 7. TOP GOTCHAS *(platform-pinned; universal ones → `references/gotchas_universal.md`)* ### Family-wide (China-specific, not in the universal catalog) **CN1 — `/root` ephemerality silently loses work.** Symptom: code/checkpoints written to `/root` vanish after a stop/restart (Gpushare) or instance return (Featurize). → Root cause: the system disk resets to image state / is wiped on return — unlike AutoDL, which persists `/root` across power-off. → Fix: write *everything* to the persistent mount (§2); treat `/root` as RAM. Audit with `ls ` after a test stop before trusting it for a real run. **CN2 — GPU-idle auto-stop kills a quiet job.** Symptom: a long job dies mid-run with no error; console shows "auto-stopped (idle)". → Root cause: an opt-in idle-shutdown policy stopped the instance during a low-GPU-util phase (data loading, eval, between epochs). → Fix: disable idle-auto-stop for long jobs, or emit a periodic GPU-touching heartbeat; confirm the policy state in Phase 0. ### Matpool (matpool.com) **MP1 — `.snap` snapshots silently consume the 5 GB netdisk.** Symptom: "保存环境" / snapshot saves fail or the netdisk fills with no obvious culprit. → Root cause: snapshots are written as `.snap` files **into the netdisk** and count against its tiny 5 GB quota (verified matpool snapshot docs 2026-06). → Fix: prune old `.snap` files (deleting one frees the quota); keep only the latest needed env snapshot. **MP2 — `/mnt` is excluded from snapshots, and the machine is locked while saving.** Symptom: "保存环境" doesn't capture code under `/mnt`; the instance is unusable during the save. → Root cause: a snapshot captures **everything *except* `/mnt`** (the netdisk mount), and the machine cannot be used while the snapshot writes. → Fix: to *shrink* a snapshot move code/data to `/mnt` first (it won't be captured); to *preserve* code via snapshot keep it OFF `/mnt`. Ensure no running process before triggering a save. **MP3 — region-scoped netdisk strands data on a sweep across regions.** Symptom: a second instance in another region can't see files written by the first; expanded storage "missing". → Root cause: `/mnt` netdisks are separate per region and do not interconnect. → Fix: keep all instances of a sweep in one region; pick region before expanding (verified matpool FAQ 2026-06). ### Gpushare (gpushare.com) **GS1 — `/root` resets to image state on every shutdown/restart.** (The instance of CN1 to remember by name.) Symptom: installed packages / code / logs under `/root` gone after restart. → Root cause: only `/hy-tmp` and `/hy-netdisk` persist; `/` reverts to the image. → Fix: env on `/hy-netdisk`, hot data on `/hy-tmp`, results synced to `/hy-netdisk` before stop. **GS2 — turbo proxy left on blocks non-whitelisted hosts.** Symptom: after `export …turbo.gpushare.com…`, `pip install` / `apt` / domestic mirrors hang or `ProxyError`. → Root cause: the academic proxy whitelists only GitHub/HF/PyTorch/Kaggle and **restricts everything else** (verified network_turbo docs 2026-06). → Fix: `unset http_proxy https_proxy` the moment the accelerated pull finishes (§3). Same shape as the `no_proxy` trap in `references/china-network.md`. **GS3 — `/hy-netdisk` absent on unmarked machines.** Symptom: scripts referencing `/hy-netdisk` fail on some rentals. → Root cause: the shared netdisk exists only on machines marked as supporting 内网存储. → Fix: check `mount | grep hy-netdisk` in Phase 0; fall back to personal cloud storage via `oss cp` (OSS tool, ~300 Mbps, compressed archives only) if absent. **GS4 — stopped instance auto-released at 10 days; arrears purge at day 15.** Symptom: a parked stopped instance disappears, or shared/personal data is gone after non-payment. → Root cause: pay-as-you-go auto- release 10 days after stop (`实例停止 10 天后自动释放`); on arrears, day-15-noon deletes personal data + `/hy-nas` + custom images (verified gpushare docs 2026-06). → Fix: pull results off a stopped box promptly; don't treat "stopped" as durable parking; keep the balance positive. **GS5 — `/hy-tmp` auto-cleared 24 h after shutdown (and on migration).** *(NEW — corrects the prior "/hy-tmp persists" assumption.)* Symptom: training data/scratch under `/hy-tmp` gone the day after a stop, even though the instance still exists. → Root cause: `/hy-tmp` is per-server local scratch, auto-deleted 24 h after shutdown and wiped on instance migration (verified gpushare.com/docs/data/storage 2026-06). → Fix: treat `/hy-tmp` as IO scratch only; sync anything durable to `/hy-netdisk` before stopping; do NOT `conda create -p /hy-tmp/...` for a persistent env (use `/hy-netdisk`). ### Featurize (featurize.cn) **FZ1 — anything outside `work`/`/cloud` is wiped the instant the instance is returned.** (The strictest "what survives" rule of the four.) Symptom: results outside `/home/featurize` or `/cloud` gone after 归还. → Root cause: only `work` (per-user cloud storage, `工作区可以一直保存项目文件`) and the `/cloud` sync drive persist; everything else is destroyed on return (verified Featurize tutorials 2026-06). → Fix: write all durable output under `work`/`/cloud`; verify before returning. **FZ2 — `/cloud` sync drive lag makes edits *look* saved but not land.** Symptom: VS Code edits / files appear saved locally but are missing after reconnect or return (the "工作区中修改代码后无法保存" complaint). → Root cause: the Remote-SSH sync to the cloud drive is not always real-time, especially on slow links or large files. → Fix: explicit `Ctrl+S`, then verify on the server (`ls -la` / `cat` the file) before trusting it; on a flaky connection, close and re-open the Remote-SSH session (transient failures are expected). **FZ3 — 30 GB free cloud quota silently breaks large writes / `conda create`.** *(corrects the prior "~20 GB" figure.)* Symptom: env creation or large copies into `work`/`/cloud` fail or truncate. → Root cause: the free cloud storage is **30 GB** (verified featurize.cn 2026-06); over it, writes fail. → Fix: `du -sh ~/work /cloud` to watch headroom; keep only the active env there; large reproducible scratch belongs on local non-persistent disk, not the cloud drive. ### LanRui (lanrui.co / lanrui-ai.com) **LR1 — 数据盘 keeps billing while the workspace is merely *stopped*.** (The most expensive divergence — see §5.) Symptom: a stopped LanRui workspace still accrues cost. → Root cause: the 数据盘 (`/home/user/datadisk`) bills hourly from creation until *destroyed*, independent of workspace run-state (`工作空间停止运行,未销毁的数据盘也将持续计费` — verified docs.lanrui.co storage 2026-06). → Fix: to stop all billing, stop the workspace **and** 销毁 the 数据盘 — only after the Iron-Law pull+verify; the 网盘 keeps the data. **LR2 — shared `netdisk/data` folder mounted into every same-zone workspace → cross-run clobber.** Symptom: a parallel ablation overwrites another run's outputs. → Root cause: `/home/user/netdisk/data` is auto-mounted and shared across *all* workspaces in the same availability zone. → Fix: per-job isolated write paths (`references/parallel_ablation.md`); never share a mutable output dir under `netdisk/data`. Also: the 网盘 is poor at many-small-file *writes* — route those to the 数据盘. **LR3 — platform/domain churn invalidates cached paths.** Symptom: scripted paths/domain fail post-migration. → Root cause: domain `lanrui-ai.com` → `lanrui.co` (old data dropped after 2024-11-01); T1/T2 zones retired 2025-06-30 → "Cova" platform. → Fix: re-verify console domain + paths in-session before scripting against any cached LanRui path. ### Platform-specific debugging Before trusting a run, in Phase 0 (per platform): - **Confirm persistence path is real, not `/root`.** `mount | grep -E 'mnt|hy-netdisk|cloud|datadisk|netdisk'` then `touch /.probe && ls -l /.probe`. On Gpushare also confirm `/hy-netdisk` is present (GS3) — `mount | grep hy-netdisk` (absent on unmarked machines). - **GPU + driver sanity.** `nvidia-smi` (GPU visible, mem free, driver/CUDA), then `python -c "import torch;print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"` against the prebuilt interpreter. Mismatched local-vs-server PyTorch silently breaks checkpoint loads on Featurize — match versions. - **Detect a stuck / throttled download.** `du -sh ` twice ~30 s apart — flat size = stalled (often the GFW or a left-on Gpushare turbo proxy restricting a non-whitelisted host, GS2). Cross-check with `curl -sI -x "$https_proxy" https://hf-mirror.com` / `env | grep -i proxy`; `unset http_proxy https_proxy` and retry on the mirror. - **Disk / inode pressure (the silent §2 risk).** `df -h ` AND `df -i ` — a full inode table fails writes while `df -h` still shows free GB. On Matpool, a filling 5 GB netdisk is usually stale `.snap` files (`ls -la /mnt/*.snap`, MP1). - **Verify the meter-stop did what was intended.** After "stop", re-check the console billing line — on LanRui a stopped workspace whose 数据盘 was NOT destroyed is still metering (LR1); on Gpushare a stopped box still counts toward the 10-day auto-release clock (GS4). - **Read the running job's log, don't infer from silence.** Job is in tmux/nohup → `tmux capture-pane -pt ` or `tail -f /run.log`. A vanished tmux server after a "restart" means `/root` reset (GS1) — the log must live on the persistent mount to survive. --- ## 8. SCRIPT OVERRIDES Parameterize the `scripts/` templates per platform. `PROXY_HOOK`, `HF_HOME`, and the mirror env all defer to `references/china-network.md`; only the **mounts** truly differ. | Var | Matpool | Gpushare | Featurize | LanRui | |---|---|---|---|---| | `DURABLE_DIR=` (durable) | `/mnt` | `/hy-netdisk` | `/home/featurize` (+`/cloud`) | `/home/user/datadisk` (or `/home/user/netdisk/data`) | | `DATA_DIR=` (fast/ephemeral) | `/root` | `/hy-tmp` (24 h post-stop wipe) | local tmp | `/home/user/datadisk` scratch | | `SCRATCH=` (local, prune) | `/root` | `/hy-tmp` | local tmp | 数据盘 scratch | | `HF_HOME=` | `/mnt/.cache/hf` | `/hy-netdisk/.cache/hf` | `/cloud/.cache/hf` | `/home/user/datadisk/.cache/hf` | | `PROXY_HOOK=` | (mirrors only) | `export …turbo.gpushare.com:…` then `unset` | (mirrors only) | (mirrors only) | | `CRED_FILE=""` (no file — env var) | `$WANDB_API_KEY` / `$HF_TOKEN` on **ephemeral** disk, never the shared netdisk | same | same | same | | `DETACH=` | tmux | tmux | tmux | tmux | `CRED_FILE=""` because on these CN platforms the credential is an **env var** (or `.netrc`) on the ephemeral disk, not a file on the netdisk — leave it empty so run_one's `[ -n "$CRED_FILE" ]` guard skips the file read and `$WANDB_API_KEY` / `$HF_TOKEN` pass through from the platform env. Common to all: the credential lives in an env var or `.netrc` on the **ephemeral system disk**, never on the shared/persistent netdisk (a shared `data` folder mounted into every same-zone workspace, like LanRui's, is especially leaky — universal secrets-off-shared-FS gotcha in `references/gotchas_universal.md`). --- ## 9. Per-platform comparison — the load-bearing differences at a glance The six questions the schema asks, answered per platform. This is the table to read first when picking which delta applies. | Question | Matpool | Gpushare | Featurize | LanRui | |---|---|---|---|---| | Prebuilt base-conda env? | yes (**`myconda`, auto-activated**) | yes (miniconda, base auto-activate **off**) | yes (full PyTorch/TF base, pip persists on `work`) | yes (image-provisioned; PyTorch images purchasable) | | Academic-acceleration proxy? | no (source-switch scripts only) | **yes** `turbo.gpushare.com:` (per-session, 7-host whitelist) | no (mirrors only) | no (mirrors only) | | Shared / region FS? | `/mnt` netdisk (**region-scoped**, expandable) | `/hy-netdisk` (only on *marked* machines) + `/hy-nas` | `work`+`/cloud` (cloud sync; not cross-region, med-conf) | `/home/user/netdisk/data` (shared into *every same-zone* workspace) | | Inode cap? | undocumented — measure `df -i` | undocumented — measure `df -i` | undocumented — measure `df -i` | undocumented — measure `df -i` | | Data disk bills while **stopped**? | no (release ends billing) | no (but stopped box auto-released at 10 d; `/hy-tmp` cleared 24 h) | no (return ends billing) | **YES — 数据盘 bills until destroyed** | | Meter-stop verb | 停止并释放 | 关机 → 释放 (+ 无卡模式 pause) | 实例归还 | **停止 + 销毁数据盘** | | `/root` survives a stop? | local, lost on release | **NO — resets to image** | **NO — wiped on return** | system disk lost; use 数据盘 | **Bottom line for porting an AutoDL workflow:** the SSH/tmux/smoke/checkpoint spine transfers verbatim; the three things to re-bind per platform are (1) the **persistent mount** (never `/root`; on Gpushare never `/hy-tmp` either), (2) the **meter-stop verb** — and on LanRui, that stopping is not enough, the 数据盘 must be destroyed — and (3) the **proxy hook** (real proxy only on Gpushare, with a strict whitelist; mirrors-only elsewhere → `references/china-network.md`).