| platform | kind | meter_stop_verb | meter_stop_irreversible | detach_primitive | spot_available | spot_grace | shared_fs | inode_cap | free_egress | china_mirror_needed | host_driver_cuda_max | local_nvme |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| china-family | ssh-rental | per-platform | mixed | tmux | false | n/a | per-platform | undocumented | true | true | image-dependent | per-platform |

Profile: Chinese GPU-rental family (Matpool · Gpushare · Featurize · LanRui)

One-line purpose: the AutoDL-shaped Chinese rentals — near-clones that share AutoDL's SSH+tmux+prebuilt-base spine but diverge on what survives a stop, whether a stopped data disk still bills, and which (if any) academic proxy ships. Treat AutoDL (profiles/autodl.md) as the reference implementation; this profile records only the deltas, at the FAMILY level first, then a per-platform comparison table.

Surface to the user up front (principle #10): ⚠️ Danger clocks (per platform, §5): a stopped instance is auto-released (Gpushare: 10 days after stop; others vary) → data gone; LanRui's 数据盘 (data disk) bills while stopped; Gpushare's /hy-tmp is wiped 24 h after stop and /root resets to the image. Conveniences: built-in JupyterLab / TensorBoard quick-tools (all four). Declare any custom port at rent time ("高级选项", advanced options); it can't be opened later.

To jump: grep -in '<keyword>' profiles/china.md (e.g. proxy, ephemeral, bills, inode, LanRui).

Table of contents

  1. LAUNCH
  2. STORAGE MODEL (survival matrix + /root-ephemeral trap)
  3. NETWORK (→ references/china-network.md)
  4. SPOT/INTERRUPTION
  5. TEARDOWN/BILLING
  6. DAEMON TOOL
  7. TOP GOTCHAS (universal → references/gotchas_universal.md)
     • Platform-specific debugging
  8. SCRIPT OVERRIDES
  9. Per-platform comparison table

Universal gotchas (CRLF, cgroup OOM, silent sync, tmux-holds-script, disk-budget, secrets-off-shared-FS) are NOT restated here — see references/gotchas_universal.md. The mirror/proxy/download story is NOT restated either — it is shared across all CN platforms and lives in references/china-network.md.


1. LAUNCH

All four: web console rents a marketplace machine → pick GPU count + a prebuilt image (PyTorch/TF + CUDA) that is the env → connect via SSH (auto-generated password or pushed public key) + JupyterLab; VS Code Remote-SSH works on all of them.

Env contract — the image/base IS the env; do not conda create on a rental. Same rule as AutoDL, with per-platform base-activation wrinkles (verified per-platform docs 2026-06):

  • Featurize: base is fully provisioned and used directly; pip/conda install into base persists on the work workspace. Activate and run.
  • Matpool: ships a myconda env that auto-activates on startup (interpreter at /root/miniconda3/envs/myconda/bin/python). Run directly; no re-enable needed (verified matpool conda docs 2026-06; the earlier "auto-activate off" note was true only for Gpushare).
  • Gpushare: ships miniconda but base auto-activate is disabled (登陆终端默认取消了自动进入 base 环境, "the login terminal no longer enters the base environment by default"). Re-enable it (conda config --set auto_activate_base true) or activate the named env per session (verified gpushare.com/docs/best_practices/conda 2026-06).
  • LanRui: image-provisioned (PyTorch images are a purchasable image option); base used directly.

An unavoidable custom env goes on the persistent disk (--prefix /<persistent-mount>/myenv), never the small system disk (§2) — a system-disk env is wiped wherever /root is ephemeral. On Gpushare specifically, docs recommend conda create -p /hy-netdisk/myenv (NOT /hy-tmp — that auto-clears 24 h after shutdown, GS5).
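The placement rule above can be sketched as a small guard. This is a minimal bash sketch, not platform tooling: `safe_env_prefix` is a hypothetical helper, and its refused-path list simply restates this profile's warnings about /root and /hy-tmp (plus /tmp as an assumption).

```shell
# Hypothetical helper: build a conda --prefix path, refusing locations this
# profile flags as reset-prone (/root, /hy-tmp) plus /tmp (an assumption).
safe_env_prefix() {
  local mount_dir=${1%/}   # strip one trailing slash
  local env_name=$2
  case "$mount_dir/" in
    /root/*|/hy-tmp/*|/tmp/*)
      echo "refusing reset-prone location: $mount_dir" >&2
      return 1 ;;
  esac
  printf '%s/%s\n' "$mount_dir" "$env_name"
}

# Usage (Gpushare, following the docs' /hy-netdisk recommendation):
#   prefix=$(safe_env_prefix /hy-netdisk myenv) && conda create -y -p "$prefix" python
```

Failing early here is cheaper than discovering a vanished env after the first stop.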

verify: ssh <alias> 'python -c "import torch;print(torch.cuda.is_available())"' returns True against the prebuilt interpreter, before any install.


2. STORAGE MODEL (survival matrix — principle #4)

The family-level trap that breaks ported AutoDL habits: /root (the system disk) is NOT durable on every platform. AutoDL persists /root across a power-off; here it ranges from "resets to image state on every restart" (Gpushare) to "wiped the instant the instance is returned" (Featurize). Checkpoints MUST go to the platform's persistent mount, not /root.

Each platform pairs a small reset-prone system disk with a persistent network/data disk:

| Platform | System disk (/root etc.) | Persistent mount | Local fast scratch | Free quota |
| --- | --- | --- | --- | --- |
| Matpool | /root (instance-local; snapshot-captured) | /mnt netdisk (survives release, expandable, region-scoped) | /root (local) | 5 GB netdisk |
| Gpushare | / incl. /root: resets to image on stop/restart | /hy-netdisk (only on marked machines) | /hy-tmp (local SSD; auto-cleared 24 h after stop) | 20 GB sys disk |
| Featurize | wiped on return | work = /home/featurize (persists) + /cloud sync drive | local (non-persistent) | 30 GB free cloud |
| LanRui | system disk lost on stop | 数据盘 = /home/user/datadisk + shared /home/user/netdisk/data | 数据盘 (block storage, ≈ sys-disk speed) | 网盘 10 GB free |

Survival matrix (family):

| Tier | Survives STOP? | Survives RELEASE/RETURN? | Notes |
| --- | --- | --- | --- |
| System disk / /root | varies (Gpushare: NO, resets to image; Featurize: wiped on return) | NO | never the checkpoint target |
| Persistent netdisk / 数据盘 | YES | YES (but the LanRui 数据盘 keeps billing until destroyed, §5) | the only safe checkpoint target |
| Cross-instance shared folder | YES | YES | region/zone/machine-scoped; clobber risk (see gotchas) |
| Gpushare /hy-tmp (local SSD) | NO (auto-cleared 24 h after shutdown) | NO | fast scratch only; copy results to /hy-netdisk before stop (GS5) |

Region / scope locks (analog of AutoDL's region-scoped FS):

  • Matpool: the /mnt netdisk is region-scoped. Different regions have separate netdisks that don't interconnect; pick the region before expanding storage (verified matpool FAQ 2026-06).
  • Featurize: code in work and /cloud persists; per common usage reports the cloud sync drive does not share across different regions, while datasets are reusable. Confirm all instances of a sweep are same-region before fan-out. (med confidence: official wording not re-verified 2026-06.)
  • Gpushare: /hy-netdisk exists only on machines marked as supporting 内网存储 (intranet storage); an unmarked machine has no shared mount. A separate /hy-nas shared storage (0.0007 元/GB·h) exists on specific instances (verified gpushare.com/docs/data 2026-06).
  • LanRui: /home/user/netdisk/data is a per-availability-zone shared folder auto-mounted into every workspace in that zone (data 文件夹下的任何数据,都可以在该可用区下的所有工作空间中使用, "any data under the data folder can be used in every workspace in that availability zone"). Convenient, but a parallel-ablation clobber hazard (gotcha LR2). The 网盘 (netdisk) is poor at many-small-file writes; use the 数据盘 (data disk) for that (verified docs.lanrui.co storage 2026-06).
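As a worked example of the per-GB·h rate quoted for Gpushare's /hy-nas, the sketch below multiplies size, hours, and rate. The 100 GB volume and the 30-day holding period are illustrative assumptions, and `storage_cost_yuan` is a hypothetical helper name.

```shell
# Storage cost = GB x hours x rate (元 per GB-hour).
storage_cost_yuan() {
  awk -v gb="$1" -v hours="$2" -v rate="$3" \
    'BEGIN { printf "%.2f\n", gb * hours * rate }'
}

# 100 GB held for 30 days (720 h) at the quoted /hy-nas rate:
storage_cost_yuan 100 720 0.0007   # prints 50.40
```

Small per-hour rates add up over a month of parked data, which is worth surfacing before a user leaves a volume in place.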

Inode caps: size caps are documented (5 / 20 / 30 / 10 GB free across the four); explicit inode caps are NOT documented by any of them. The many-small-files metadata-exhaustion risk still transfers to any shared FS — measure df -i <persistent-mount> on a live instance in Phase 0 rather than assuming a number. Redirect HF/ModelScope caches off the small system disk → see references/china-network.md §2.
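The Phase 0 inode measurement can be wrapped as follows. This is a sketch assuming GNU coreutils `df` (where `-P` and `-i` combine); the helper names and the 90% default threshold are arbitrary choices, not documented platform limits.

```shell
# Parse `df -Pi` output from stdin and print the IUse% of the first
# filesystem line as a bare integer (column 5 in POSIX-format output).
inode_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when inode use crosses a threshold (90 is an arbitrary default).
check_inodes() {
  local mount_dir=$1 limit=${2:-90} pct
  pct=$(df -Pi "$mount_dir" | inode_use_pct)
  case "$pct" in
    ''|*[!0-9]*)   # some filesystems report "-" instead of a percentage
      echo "inode usage unavailable for $mount_dir" >&2
      return 2 ;;
  esac
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $mount_dir inodes at ${pct}%" >&2
    return 1
  fi
  echo "$mount_dir inodes at ${pct}%"
}

# Usage: check_inodes /mnt 80
```

The distinct return code for "unavailable" matters on network filesystems that do not expose inode counts at all.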

State the checkpoint mount for §5's teardown verb: write to the persistent netdisk/数据盘, never /root. On Gpushare, also stage hot datasets to /hy-tmp (local SSD) for IO, but copy results back to /hy-netdisk before stopping — /hy-tmp is local AND auto-wiped 24 h after shutdown (GS5).
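The copy-back step can be sketched as below. `sync_results` is a hypothetical helper, the directory names in the usage line are placeholders, and the file-count comparison is only a coarse sanity check, not a substitute for verifying checkpoints by load.

```shell
# Copy a results directory from fast scratch to the durable mount, then
# confirm the destination holds at least as many files as the source.
sync_results() {
  local src=${1%/} dst=${2%/} n_src n_dst
  mkdir -p "$dst" || return 1
  if command -v rsync >/dev/null 2>&1; then
    rsync -a "$src"/ "$dst"/ || return 1
  else
    cp -a "$src"/. "$dst"/ || return 1   # fallback when rsync is absent
  fi
  n_src=$(find "$src" -type f | wc -l)
  n_dst=$(find "$dst" -type f | wc -l)
  [ "$n_dst" -ge "$n_src" ]
}

# Usage on Gpushare, before pressing stop:
#   sync_results /hy-tmp/run1/out /hy-netdisk/run1/out
```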


3. NETWORK

The entire mirror / proxy / resumable-download story is shared across all CN platforms and lives in references/china-network.md — do NOT duplicate it here. That reference owns the mirrors table (PyPI/conda/HF), HF_ENDPOINT=https://hf-mirror.com, the ModelScope fallback, the resumable-download retry ladder, the hf_transfer hang caution, and the no_proxy trap. Only the per-platform egress accelerator differs and is recorded here (verified per-platform docs 2026-06):

  • Gpushare — has a real academic proxy (the closest analog to AutoDL's /etc/network_turbo): export https_proxy=http://turbo.gpushare.com:<PORT> http_proxy=http://turbo.gpushare.com:<PORT> (a turbo2.gpushare.com:<PORT> backup host also exists). Two critical differences from AutoDL: (a) it is per-session export, NOT auto-sourced — re-run it in every new terminal/tmux pane; (b) it whitelists only *.github.com, *.github.io, *.githubusercontent.com, *.githubassets.com, *.huggingface.co, *.pytorch.org, *.kaggle.com and restricts every other host — so unset http_proxy https_proxy (or unset http_proxy && unset https_proxy) the moment the accelerated pull finishes, or pip/apt/domestic mirrors mystery-fail (gotcha GS2). This is exactly the no_proxy/route-specific trap in principle #7 — validate the speed test on the same route the real transfer uses (verified gpushare.com/docs/instance/network_turbo 2026-06).
  • Matpool — no one-command egress proxy; ships source-switch scripts under /public/script/ (switch_conda_source.sh, switch_pip_source.sh, switch_apt_source.sh). Fall back to mirrors (references/china-network.md).
  • Featurize / LanRui — no documented one-command academic proxy surfaced; mirrors only.
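One way to avoid a left-on Gpushare proxy is to scope it to a single command. This is a minimal sketch: `with_proxy` is a hypothetical wrapper, and the proxy URL (including the port, which this profile leaves as <PORT>) is passed in by the caller rather than assumed.

```shell
# Run one command with the proxy variables set only inside a subshell, so
# nothing leaks into later pip/apt calls in the same terminal.
with_proxy() {
  local proxy_url=$1
  shift
  (
    export http_proxy="$proxy_url" https_proxy="$proxy_url"
    "$@"
  )
}

# Usage (fill PORT from the current Gpushare docs; repo is a placeholder):
#   with_proxy "http://turbo.gpushare.com:$PORT" git clone https://github.com/<org>/<repo>.git
```

Because the export happens in a subshell, there is no unset step to forget.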

Port exposure: JupyterLab/TensorBoard are built-in quick-tools (all four). Custom ports must be declared at rent time ("高级选项" on Matpool, e.g. HTTP-6006 TensorBoard / HTTP-8888) — they cannot be opened post-launch. Ports may change on restart — re-read the console, don't hard-code a port in an alias. SSH is standard OpenSSH (scp/rsync work directly; no proxied-SSH scp limitation). Sanitized shapes: ssh -p <PORT> root@<region>.matpool.com (Matpool, e.g. hz.matpool.com / hz-t2.matpool.com), ssh -p <PORT> root@<host>.gpushare.com (Gpushare), ssh user@ssh.<region>.lanrui-ai.com -p <PORT> -i ~/.ssh/id_rsa (LanRui — public-key must be uploaded to the console first).


4. SPOT / INTERRUPTION + RESUME (principle #7/#8)

These are on-demand-only platforms — there is NO spot bid and NO documented mid-run reclaim. Do not build SIGTERM-grace preemption handling here; aggressive retry-on-preemption is over-engineering on this family. The real involuntary-loss vectors are:

  1. Auto-release of stopped instances. Gpushare auto-releases (deletes, unrecoverable) a stopped pay-as-you-go instance 10 days after stop (实例停止 10 天后,会自动释放, "an instance is automatically released 10 days after it stops"; verified gpushare.com/docs/instance/manage 2026-06). On arrears, at noon on the 15th day Gpushare deletes personal data + the /hy-nas shared storage + custom images. A stopped box is not a parked box: pull anything needed off it before that window.
  2. /hy-tmp 24-hour auto-clear (Gpushare). Distinct from instance release: even while the stopped instance still exists, /hy-tmp data is deleted 24 h after shutdown, and it is also wiped on instance migration (GS5).
  3. GPU-idle auto-shutdown. Most platforms offer an opt-in "idle → auto-stop" policy to prevent waste; if enabled it can stop a job that merely went quiet (e.g. between epochs with no GPU util). Keep it off for long single-GPU jobs unless a heartbeat is guaranteed.
  4. Platform churn (LanRui). LanRui migrated its domain lanrui-ai.com → lanrui.co (old-domain data not retained after 2024-11-01) and retired its T1/T2 zones on 2025-06-30, moving users to a new "Cova" platform. Re-verify current console paths/domain before scripting against any cached LanRui path.
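For the first vector, the deadline is simple date arithmetic. The sketch below assumes GNU `date` (the `-d` option) and takes the 10-day Gpushare window as its default; `release_deadline` is a hypothetical helper.

```shell
# Date on which a stopped instance reaches the auto-release window.
# Requires GNU date; the window defaults to Gpushare's documented 10 days.
release_deadline() {
  local stop_date=$1 window_days=${2:-10}
  date -u -d "$stop_date +$window_days days" +%F
}

release_deadline 2026-06-01   # prints 2026-06-11
```

Recording this date when the instance is stopped turns a silent clock into something the user can see.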

Resume hook: checkpoint-to-durable + load-latest-on-startup (principle #8) is still the right spine — here it guards against a forgotten stop, a 10-day auto-release, and a /hy-tmp 24 h wipe, not a spot kill. The cadence formula in references/spot-resilience.md still applies if a job is long enough to span a forced stop.
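The load-latest-on-startup half of that hook might look like the sketch below. The `*.pt` suffix, the `ckpt` directory, `train.py`, and its `--resume` flag are all assumptions for illustration; `latest_ckpt` is a hypothetical helper.

```shell
# Print the most recently modified checkpoint in a directory, or return
# non-zero when none exists. The *.pt suffix is an assumed convention.
latest_ckpt() {
  local dir=$1 newest
  newest=$(ls -1t "$dir"/*.pt 2>/dev/null | head -n 1)
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Usage: resume when a checkpoint exists, start fresh otherwise.
#   if ckpt=$(latest_ckpt "$DURABLE_DIR/ckpt"); then
#     python train.py --resume "$ckpt"   # --resume is a hypothetical flag
#   else
#     python train.py
#   fi
```

Selecting by modification time keeps the helper independent of any step-numbering scheme in the filenames.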


5. TEARDOWN / BILLING (principle #9 + the Iron Law)

The meter-stop verb is per-platform — bind it from the table below before clicking anything. The Iron Law (SKILL.md Phase 5) holds unchanged: NO release/return/destroy until checkpoints are pulled to local AND verified by load, and the user has approved the cost-affecting action.

| Platform | Meter-stop verb | What it preserves | Cost trap |
| --- | --- | --- | --- |
| Matpool | 停止并释放 (stop+release) | /mnt netdisk persists (region-scoped) | .snap snapshots silently eat the 5 GB netdisk (MP1) |
| Gpushare | 关机 (shut down) stops compute → 释放 (release) deletes | /hy-netdisk persists; /hy-tmp cleared 24 h post-stop; /root resets to image | stopped instance auto-released at 10 days (GS4); arrears purge day-15 noon |
| Featurize | 实例归还 (return) | only work (/home/featurize) + /cloud persist | everything else wiped immediately on return (FZ1) |
| LanRui | 停止 (stop) stops compute; must 销毁数据盘 (destroy the 数据盘) to stop disk billing | 网盘 + 数据盘 persist | 数据盘 bills hourly while the workspace is merely STOPPED (LR1) |

The single most dangerous divergence: on LanRui, "stop to save money" is wrong. The 数据盘 (/home/user/datadisk, block storage, bought in 200 G / 500 G specs) bills hourly from creation until destroyed, even while the workspace is stopped: 工作空间停止运行,未销毁的数据盘也将持续计费 ("when the workspace stops running, an undestroyed data disk continues to be billed"; verified docs.lanrui.co storage + lanrui.co/pricing 2026-06). So a stopped LanRui workspace keeps a meter running. To actually stop all billing: stop the workspace AND destroy the 数据盘 (after the Iron-Law pull+verify). The 网盘 (10 GB free, 0.15 元/GB·month overage) persists separately. Contrast: on Matpool/Gpushare/Featurize, release / return (归还) ends compute billing and the persistent volume simply survives (Gpushare /hy-netdisk and /hy-nas bill per-GB but are not destroyed by stopping).
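Before any destroy step, the pull half of the Iron Law can be checked mechanically. The sketch below assumes `sha256sum` (GNU coreutils) on both ends; `pull_verified`, the `<alias>` host, and the checkpoint paths are placeholders. A matching checksum confirms transfer integrity only; loading the checkpoint remains a separate step.

```shell
# Compare a pulled file against the checksum recorded on the server.
# Succeeds only when the local hash is non-empty and equals the expected one.
pull_verified() {
  local local_file=$1 expected_sha=$2 actual_sha
  actual_sha=$(sha256sum "$local_file" 2>/dev/null | awk '{ print $1 }')
  [ -n "$actual_sha" ] && [ "$actual_sha" = "$expected_sha" ]
}

# Usage (<alias> and paths are placeholders):
#   want=$(ssh <alias> 'sha256sum /home/user/datadisk/ckpt/last.pt' | awk '{ print $1 }')
#   scp <alias>:/home/user/datadisk/ckpt/last.pt ./last.pt
#   pull_verified ./last.pt "$want" && echo "pull intact; verify by load next"
```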

Cost-pause analogs (cheaper than full release, data kept): Gpushare 无卡模式 / 无卡启动 (low-core CPU-only restart, no GPU) is the analog of AutoDL's no-GPU restart — keeps /hy-netdisk data while paused at a fraction of the GPU rate, ideal for env-config + dataset download (verified gpushare 无卡启动 announcement 2026-06). LanRui supports an auto-stop timer (set a stop time at workspace start) and per-hour billing.


6. DAEMON TOOL

tmux is the family detach primitive — preinstalled on most images. Caveat (from Matpool docs, true family-wide): run tmux from a local SSH session, NOT the Jupyter web terminal — keybindings collide with tmux's prefix. A backgrounded nohup python … </dev/null >log 2>&1 & also survives a tab-close / page refresh on Featurize (process not killed; only notebook cell state lost) — but tmux is preferred for a named, re-attachable session.

tmux survives an SSH drop but NOT an instance stop/restart on any platform (on Gpushare the restart resets /root, taking the tmux server and any /root logs with it) — so the durable spine is checkpoint-to-persistent-disk (§2, principle #8), not the tmux session. LanRui additionally supports multi-machine multi-GPU distributed training — if used, see references/multinode.md.
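Whichever detach primitive is used, the log should land on the persistent mount. The sketch below uses the nohup form so it runs without tmux; `run_detached` is a hypothetical helper and the paths in the usage lines are placeholders. The commented tmux line shows the equivalent named session.

```shell
# Start a command detached from the terminal, appending its output to a log
# file that should live on the persistent mount so it outlives a /root reset.
run_detached() {
  local log_file=$1
  shift
  mkdir -p "$(dirname "$log_file")" || return 1
  nohup "$@" </dev/null >>"$log_file" 2>&1 &
  echo "$!"   # PID of the background job
}

# Usage:
#   run_detached /hy-netdisk/run1/run.log python train.py
# tmux equivalent for a named, re-attachable session:
#   tmux new-session -d -s run1 'python train.py 2>&1 | tee -a /hy-netdisk/run1/run.log'
```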


7. TOP GOTCHAS (platform-pinned; universal ones → references/gotchas_universal.md)

Family-wide (China-specific, not in the universal catalog)

CN1 — /root ephemerality silently loses work. Symptom: code/checkpoints written to /root vanish after a stop/restart (Gpushare) or instance return (Featurize). → Root cause: the system disk resets to image state / is wiped on return — unlike AutoDL, which persists /root across power-off. → Fix: write everything to the persistent mount (§2); treat /root as RAM. Audit with ls <persistent-mount> after a test stop before trusting it for a real run.

CN2 — GPU-idle auto-stop kills a quiet job. Symptom: a long job dies mid-run with no error; console shows "auto-stopped (idle)". → Root cause: an opt-in idle-shutdown policy stopped the instance during a low-GPU-util phase (data loading, eval, between epochs). → Fix: disable idle-auto-stop for long jobs, or emit a periodic GPU-touching heartbeat; confirm the policy state in Phase 0.

Matpool (matpool.com)

MP1 — .snap snapshots silently consume the 5 GB netdisk. Symptom: "保存环境" / snapshot saves fail or the netdisk fills with no obvious culprit. → Root cause: snapshots are written as .snap files into the netdisk and count against its tiny 5 GB quota (verified matpool snapshot docs 2026-06). → Fix: prune old .snap files (deleting one frees the quota); keep only the latest needed env snapshot.
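The MP1 fix can be sketched as a keep-newest-N prune. `prune_snaps` is a hypothetical helper; it assumes the snapshots sit directly in the given directory and that their names contain no newlines. Deletion is irreversible, so list the files first.

```shell
# Keep the newest N .snap files in a directory and delete the rest.
prune_snaps() {
  local dir=$1 keep=${2:-1}
  ls -1t "$dir"/*.snap 2>/dev/null | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      echo "removing $old"
      rm -f -- "$old"
    done
}

# Usage: review with `ls -lat /mnt/*.snap`, then
#   prune_snaps /mnt 1
```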

MP2 — /mnt is excluded from snapshots, and the machine is locked while saving. Symptom: "保存环境" doesn't capture code under /mnt; the instance is unusable during the save. → Root cause: a snapshot captures everything except /mnt (the netdisk mount), and the machine cannot be used while the snapshot writes. → Fix: to shrink a snapshot move code/data to /mnt first (it won't be captured); to preserve code via snapshot keep it OFF /mnt. Ensure no running process before triggering a save.

MP3 — region-scoped netdisk strands data on a sweep across regions. Symptom: a second instance in another region can't see files written by the first; expanded storage "missing". → Root cause: /mnt netdisks are separate per region and do not interconnect. → Fix: keep all instances of a sweep in one region; pick region before expanding (verified matpool FAQ 2026-06).

Gpushare (gpushare.com)

GS1: /root resets to image state on every shutdown/restart. (The instance of CN1 to remember by name.) Symptom: installed packages / code / logs under /root gone after restart. → Root cause: / reverts to the image on every stop; only /hy-netdisk is durable, while /hy-tmp outlives a restart but is cleared 24 h after shutdown (GS5). → Fix: env on /hy-netdisk, hot data on /hy-tmp, results synced to /hy-netdisk before stop.

GS2 — turbo proxy left on blocks non-whitelisted hosts. Symptom: after export …turbo.gpushare.com…, pip install / apt / domestic mirrors hang or ProxyError. → Root cause: the academic proxy whitelists only GitHub/HF/PyTorch/Kaggle and restricts everything else (verified network_turbo docs 2026-06). → Fix: unset http_proxy https_proxy the moment the accelerated pull finishes (§3). Same shape as the no_proxy trap in references/china-network.md.

GS3 — /hy-netdisk absent on unmarked machines. Symptom: scripts referencing /hy-netdisk fail on some rentals. → Root cause: the shared netdisk exists only on machines marked as supporting 内网存储. → Fix: check mount | grep hy-netdisk in Phase 0; fall back to personal cloud storage via oss cp (OSS tool, ~300 Mbps, compressed archives only) if absent.
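The Phase 0 mount check can be made scriptable as below. This sketch reads /proc/mounts (Linux); `has_mount` is a hypothetical helper, and its optional second argument exists only so the parser can be exercised against a sample file.

```shell
# Succeed if a path appears as a mount point in the mounts table.
has_mount() {
  local mount_point=$1 mounts_file=${2:-/proc/mounts}
  awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}

# Usage in Phase 0:
#   has_mount /hy-netdisk || echo "no /hy-netdisk here; plan another durable target"
```

Matching the second field exactly avoids false positives from paths that merely contain the string.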

GS4: stopped instance auto-released at 10 days; arrears purge at day 15. Symptom: a parked stopped instance disappears, or shared/personal data is gone after non-payment. → Root cause: pay-as-you-go auto-release 10 days after stop (实例停止 10 天后自动释放, "automatically released 10 days after the instance stops"); on arrears, day-15-noon deletes personal data + /hy-nas + custom images (verified gpushare docs 2026-06). → Fix: pull results off a stopped box promptly; don't treat "stopped" as durable parking; keep the balance positive.

GS5 — /hy-tmp auto-cleared 24 h after shutdown (and on migration). (NEW — corrects the prior "/hy-tmp persists" assumption.) Symptom: training data/scratch under /hy-tmp gone the day after a stop, even though the instance still exists. → Root cause: /hy-tmp is per-server local scratch, auto-deleted 24 h after shutdown and wiped on instance migration (verified gpushare.com/docs/data/storage 2026-06). → Fix: treat /hy-tmp as IO scratch only; sync anything durable to /hy-netdisk before stopping; do NOT conda create -p /hy-tmp/... for a persistent env (use /hy-netdisk).

Featurize (featurize.cn)

FZ1: anything outside work and /cloud is wiped the instant the instance is returned. (The strictest "what survives" rule of the four.) Symptom: results outside /home/featurize or /cloud gone after 归还 (return). → Root cause: only work (per-user cloud storage; 工作区可以一直保存项目文件, "the workspace can keep project files indefinitely") and the /cloud sync drive persist; everything else is destroyed on return (verified Featurize tutorials 2026-06). → Fix: write all durable output under work or /cloud; verify before returning.

FZ2 — /cloud sync drive lag makes edits look saved but not land. Symptom: VS Code edits / files appear saved locally but are missing after reconnect or return (the "工作区中修改代码后无法保存" complaint). → Root cause: the Remote-SSH sync to the cloud drive is not always real-time, especially on slow links or large files. → Fix: explicit Ctrl+S, then verify on the server (ls -la / cat the file) before trusting it; on a flaky connection, close and re-open the Remote-SSH session (transient failures are expected).

FZ3: 30 GB free cloud quota silently breaks large writes / conda create. (Corrects the prior "~20 GB" figure.) Symptom: env creation or large copies into work or /cloud fail or truncate. → Root cause: the free cloud storage is 30 GB (verified featurize.cn 2026-06); over it, writes fail. → Fix: du -sh ~/work /cloud to watch headroom; keep only the active env there; large reproducible scratch belongs on local non-persistent disk, not the cloud drive.

LanRui (lanrui.co / lanrui-ai.com)

LR1 — 数据盘 keeps billing while the workspace is merely stopped. (The most expensive divergence — see §5.) Symptom: a stopped LanRui workspace still accrues cost. → Root cause: the 数据盘 (/home/user/datadisk) bills hourly from creation until destroyed, independent of workspace run-state (工作空间停止运行,未销毁的数据盘也将持续计费 — verified docs.lanrui.co storage 2026-06). → Fix: to stop all billing, stop the workspace and 销毁 the 数据盘 — only after the Iron-Law pull+verify; the 网盘 keeps the data.

LR2 — shared netdisk/data folder mounted into every same-zone workspace → cross-run clobber. Symptom: a parallel ablation overwrites another run's outputs. → Root cause: /home/user/netdisk/data is auto-mounted and shared across all workspaces in the same availability zone. → Fix: per-job isolated write paths (references/parallel_ablation.md); never share a mutable output dir under netdisk/data. Also: the 网盘 is poor at many-small-file writes — route those to the 数据盘.

LR3: platform/domain churn invalidates cached paths. Symptom: scripted paths/domain fail post-migration. → Root cause: domain lanrui-ai.com → lanrui.co (old data dropped after 2024-11-01); T1/T2 zones retired 2025-06-30 → "Cova" platform. → Fix: re-verify console domain + paths in-session before scripting against any cached LanRui path.

Platform-specific debugging

Before trusting a run, in Phase 0 (per platform):

  • Confirm persistence path is real, not /root. mount | grep -E 'mnt|hy-netdisk|cloud|datadisk|netdisk' then touch <persistent-mount>/.probe && ls -l <persistent-mount>/.probe. On Gpushare also confirm /hy-netdisk is present (GS3) — mount | grep hy-netdisk (absent on unmarked machines).
  • GPU + driver sanity. nvidia-smi (GPU visible, mem free, driver/CUDA), then python -c "import torch;print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))" against the prebuilt interpreter. Mismatched local-vs-server PyTorch silently breaks checkpoint loads on Featurize — match versions.
  • Detect a stuck / throttled download. Run du -sh <cache-dir> twice ~30 s apart; a flat size means the transfer has stalled (often the GFW, or a left-on Gpushare turbo proxy restricting a non-whitelisted host, GS2). Cross-check with curl -sI -x "$https_proxy" https://hf-mirror.com and env | grep -i proxy, then unset http_proxy https_proxy and retry on the mirror.
  • Disk / inode pressure (the silent §2 risk). df -h <persistent-mount> AND df -i <persistent-mount> — a full inode table fails writes while df -h still shows free GB. On Matpool, a filling 5 GB netdisk is usually stale .snap files (ls -la /mnt/*.snap, MP1).
  • Verify the meter-stop did what was intended. After "stop", re-check the console billing line — on LanRui a stopped workspace whose 数据盘 was NOT destroyed is still metering (LR1); on Gpushare a stopped box still counts toward the 10-day auto-release clock (GS4).
  • Read the running job's log, don't infer from silence. Job is in tmux/nohup → tmux capture-pane -pt <session> or tail -f <persistent-mount>/run.log. A vanished tmux server after a "restart" means /root reset (GS1) — the log must live on the persistent mount to survive.
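The two-sample du check from the list above can be wrapped as follows. `is_stalled` is a hypothetical helper; the 30 s default mirrors the interval suggested there, and a directory that is simply complete will also report as stalled, so read the result alongside the job log.

```shell
# Report a stall when a directory's size does not grow across an interval.
is_stalled() {
  local dir=$1 interval=${2:-30} before after
  before=$(du -sk "$dir" | awk '{ print $1 }')
  sleep "$interval"
  after=$(du -sk "$dir" | awk '{ print $1 }')
  [ "$after" -le "$before" ]
}

# Usage:
#   is_stalled "$HF_HOME" && env | grep -i proxy
```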

8. SCRIPT OVERRIDES

Parameterize the scripts/ templates per platform. PROXY_HOOK, HF_HOME, and the mirror env all defer to references/china-network.md; only the mounts truly differ.

| Var | Matpool | Gpushare | Featurize | LanRui |
| --- | --- | --- | --- | --- |
| DURABLE_DIR= (durable) | /mnt | /hy-netdisk | /home/featurize (+/cloud) | /home/user/datadisk (or /home/user/netdisk/data) |
| DATA_DIR= (fast/ephemeral) | /root | /hy-tmp (24 h post-stop wipe) | local tmp | /home/user/datadisk scratch |
| SCRATCH= (local, prune) | /root | /hy-tmp | local tmp | 数据盘 scratch |
| HF_HOME= | /mnt/.cache/hf | /hy-netdisk/.cache/hf | /cloud/.cache/hf | /home/user/datadisk/.cache/hf |
| PROXY_HOOK= | (mirrors only) | export …turbo.gpushare.com:<PORT>… then unset | (mirrors only) | (mirrors only) |
| CRED_FILE="" (no file; env var) | $WANDB_API_KEY / $HF_TOKEN on ephemeral disk, never the shared netdisk | same | same | same |
| DETACH= | tmux | tmux | tmux | tmux |

CRED_FILE="" because on these CN platforms the credential is an env var (or .netrc) on the ephemeral disk, not a file on the netdisk — leave it empty so run_one's [ -n "$CRED_FILE" ] guard skips the file read and $WANDB_API_KEY / $HF_TOKEN pass through from the platform env.

Common to all: the credential lives in an env var or .netrc on the ephemeral system disk, never on the shared/persistent netdisk (a shared data folder mounted into every same-zone workspace, like LanRui's, is especially leaky — universal secrets-off-shared-FS gotcha in references/gotchas_universal.md).
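The DURABLE_DIR row of the override table can be bound with a small lookup. This is a sketch only: `durable_dir` and the lowercase platform keys are hypothetical, and it is not claimed to match how the scripts/ templates actually take their parameters.

```shell
# Map a platform key to its durable checkpoint mount, per the table above.
durable_dir() {
  case "$1" in
    matpool)   echo /mnt ;;
    gpushare)  echo /hy-netdisk ;;
    featurize) echo /home/featurize ;;
    lanrui)    echo /home/user/datadisk ;;
    *) echo "unknown platform: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   DURABLE_DIR=$(durable_dir gpushare) || exit 1
#   export DURABLE_DIR
```

Failing on an unknown key is deliberate: a silent default would most likely be /root, which is the one location this profile rules out.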


9. Per-platform comparison — the load-bearing differences at a glance

The questions the schema asks, answered per platform. This is the table to read first when picking which delta applies.

| Question | Matpool | Gpushare | Featurize | LanRui |
| --- | --- | --- | --- | --- |
| Prebuilt base-conda env? | yes (myconda, auto-activated) | yes (miniconda, base auto-activate off) | yes (full PyTorch/TF base, pip persists on work) | yes (image-provisioned; PyTorch images purchasable) |
| Academic-acceleration proxy? | no (source-switch scripts only) | yes, turbo.gpushare.com:<PORT> (per-session, 7-host whitelist) | no (mirrors only) | no (mirrors only) |
| Shared / region FS? | /mnt netdisk (region-scoped, expandable) | /hy-netdisk (only on marked machines) + /hy-nas | work + /cloud (cloud sync; not cross-region, med-conf) | /home/user/netdisk/data (shared into every same-zone workspace) |
| Inode cap? | undocumented; measure df -i | undocumented; measure df -i | undocumented; measure df -i | undocumented; measure df -i |
| Data disk bills while stopped? | no (release ends billing) | no (but stopped box auto-released at 10 d; /hy-tmp cleared 24 h) | no (return ends billing) | YES: 数据盘 bills until destroyed |
| Meter-stop verb | 停止并释放 | 关机 → 释放 (+ 无卡模式 pause) | 实例归还 | 停止 + 销毁数据盘 |
| /root survives a stop? | local, lost on release | NO (resets to image) | NO (wiped on return) | system disk lost; use 数据盘 |

Bottom line for porting an AutoDL workflow: the SSH/tmux/smoke/checkpoint spine transfers verbatim; the three things to re-bind per platform are (1) the persistent mount (never /root; on Gpushare never /hy-tmp either), (2) the meter-stop verb — and on LanRui, that stopping is not enough, the 数据盘 must be destroyed — and (3) the proxy hook (real proxy only on Gpushare, with a strict whitelist; mirrors-only elsewhere → references/china-network.md).