| platform | kind | meter_stop_verb | meter_stop_irreversible | detach_primitive | spot_available | spot_grace | shared_fs | inode_cap | free_egress | china_mirror_needed | host_driver_cuda_max | local_nvme |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| china-family | ssh-rental | per-platform | mixed | tmux | false | n/a | per-platform | undocumented | true | true | image-dependent | per-platform |
Profile: Chinese GPU-rental family (Matpool · Gpushare · Featurize · LanRui)
One-line purpose: the AutoDL-shaped Chinese rentals — near-clones that share AutoDL's SSH+tmux+prebuilt-base
spine but diverge on what survives a stop, whether a stopped data disk still bills, and which (if
any) academic proxy ships. Treat AutoDL (profiles/autodl.md) as the reference implementation; this profile
records only the deltas, at the FAMILY level first, then a per-platform comparison table.
Surface to the user up front (principle #10): ⚠️ Danger clocks (per platform, §5): a stopped instance is auto-released (Gpushare ~10 days, others vary) → data gone; LanRui's 数据盘 (data disk) bills while stopped; Gpushare's
/hy-tmp is wiped 24 h after stop and /root resets to the image. Conveniences: built-in JupyterLab / TensorBoard quick-tools (all four); declare any custom port at rent time ("高级选项", advanced options), because it can't be opened later.
To jump: grep -in '<keyword>' profiles/china.md (e.g. proxy, ephemeral, bills, inode, LanRui).
Table of contents
1. LAUNCH · 2. STORAGE MODEL (survival matrix + /root-ephemeral trap) · 3. NETWORK (→ references/china-network.md) · 4. SPOT/INTERRUPTION · 5. TEARDOWN/BILLING · 6. DAEMON TOOL · 7. TOP GOTCHAS (universal → references/gotchas_universal.md) · Platform-specific debugging · 8. SCRIPT OVERRIDES · 9. Per-platform comparison table
Universal gotchas (CRLF, cgroup OOM, silent sync, tmux-holds-script, disk-budget, secrets-off-shared-FS) are NOT restated here; see
references/gotchas_universal.md. The mirror/proxy/download story is NOT restated either: it is shared across all CN platforms and lives in references/china-network.md.
1. LAUNCH
All four: web console rents a marketplace machine → pick GPU count + a prebuilt image (PyTorch/TF + CUDA) that is the env → connect via SSH (auto-generated password or pushed public key) + JupyterLab; VS Code Remote-SSH works on all of them.
Env contract — the image/base IS the env; do not conda create on a rental. Same rule as AutoDL, with
per-platform base-activation wrinkles (verified per-platform docs 2026-06):
- Featurize: base is fully provisioned and used directly; pip/conda install into base persists on the work workspace. Activate and run.
- Matpool: ships a myconda env that auto-activates on startup (interpreter at /root/miniconda3/envs/myconda/bin/python). Run directly; no re-enable needed (verified matpool conda docs 2026-06; this corrects the earlier "auto-activate off" note, which was true only for Gpushare).
- Gpushare: ships miniconda but base auto-activate is disabled (登陆终端默认取消了自动进入 base 环境, "login terminals no longer enter the base env automatically by default"). Re-enable (conda config --set auto_activate_base true) or activate the named env per session (verified gpushare.com/docs/best_practices/conda 2026-06).
- LanRui: image-provisioned (PyTorch images are a purchasable image option); base used directly.
An unavoidable custom env goes on the persistent disk (--prefix /<persistent-mount>/myenv), never the
small system disk (§2) — a system-disk env is wiped wherever /root is ephemeral. On Gpushare specifically,
docs recommend conda create -p /hy-netdisk/myenv (NOT /hy-tmp — that auto-clears 24 h after shutdown, GS5).
→ verify: ssh <alias> 'python -c "import torch;print(torch.cuda.is_available())"' returns True
against the prebuilt interpreter, before any install.
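The verify step above can be wrapped so that a launch script refuses to continue when the prebuilt interpreter cannot see the GPU. This is a minimal sketch, assuming an SSH alias already exists in the local SSH config; the helper names (torch_smoke_argv, smoke_passed, run_smoke) are hypothetical and not part of any platform tooling.

```python
import subprocess

# Remote one-liner copied from the verify step above.
REMOTE_CHECK = 'python -c "import torch;print(torch.cuda.is_available())"'


def torch_smoke_argv(alias: str) -> list:
    """Build the ssh argv; `alias` is whatever Host entry the user configured."""
    return ["ssh", alias, REMOTE_CHECK]


def smoke_passed(stdout: str) -> bool:
    """Pass only if the last printed line is exactly 'True' (login banners are ignored)."""
    lines = stdout.strip().splitlines()
    return bool(lines) and lines[-1].strip() == "True"


def run_smoke(alias: str, timeout_s: float = 60.0) -> bool:
    """Run the check over SSH; a timeout, SSH error, or import error counts as a fail."""
    try:
        proc = subprocess.run(
            torch_smoke_argv(alias),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and smoke_passed(proc.stdout)
```

Running run_smoke("<alias>") before any install keeps the failure attributable to the image rather than to a later pip change.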
2. STORAGE MODEL (survival matrix — principle #4)
The family-level trap that breaks ported AutoDL habits: /root (the system disk) is NOT durable on every
platform. AutoDL persists /root across a power-off; here it ranges from "resets to image state on every
restart" (Gpushare) to "wiped the instant the instance is returned" (Featurize). Checkpoints MUST go to the
platform's persistent mount, not /root.
Each platform pairs a small reset-prone system disk with a persistent network/data disk:
| Platform | System disk (/root etc.) | Persistent mount | Local fast scratch | Free quota |
|---|---|---|---|---|
| Matpool | /root (instance-local; snapshot-captured) | /mnt netdisk (survives release, expandable, region-scoped) | /root (local) | 5 GB netdisk |
| Gpushare | / incl. /root, resets to image on stop/restart | /hy-netdisk (only on marked machines) | /hy-tmp (local SSD; auto-cleared 24 h after stop) | 20 GB sys disk |
| Featurize | wiped on return | work = /home/featurize (persists) + /cloud sync drive | local (non-persistent) | 30 GB free cloud |
| LanRui | system disk lost on stop | 数据盘 (data disk) = /home/user/datadisk + shared /home/user/netdisk/data | 数据盘 (block-storage, ≈ sys-disk speed) | 网盘 (netdisk) 10 GB free |
Survival matrix (family):
| Tier | Survives STOP? | Survives RELEASE/RETURN? | Notes |
|---|---|---|---|
| System disk / /root | varies (Gpushare: NO, resets to image; Featurize: wiped on return) | NO | never the checkpoint target |
| Persistent netdisk / 数据盘 | YES | YES (but the LanRui 数据盘 keeps billing until destroyed, §5) | the only safe checkpoint target |
| Cross-instance shared folder | YES | YES | region/zone/machine-scoped; clobber risk (see gotchas) |
| Gpushare /hy-tmp (local SSD) | NO (auto-cleared 24 h after shutdown) | NO | fast scratch only; copy results to /hy-netdisk before stop (GS5) |
Region / scope locks (analog of AutoDL's region-scoped FS):
- Matpool: the /mnt netdisk is region-scoped. Different regions have separate netdisks that don't interconnect; pick the region before expanding storage (verified matpool FAQ 2026-06).
- Featurize: code in work and /cloud persists; per common usage reports the cloud sync drive does not share across different regions, while datasets are reusable. Confirm all instances of a sweep are same-region before fan-out. (med confidence; official wording not re-verified 2026-06.)
- Gpushare: /hy-netdisk exists only on machines marked as supporting 内网存储 (intranet storage); an unmarked machine has no shared mount. A separate /hy-nas shared storage (0.0007 元/GB·h) exists on specific instances (verified gpushare.com/docs/data 2026-06).
- LanRui: /home/user/netdisk/data is a per-availability-zone shared folder auto-mounted into every workspace in that zone (data 文件夹下的任何数据,都可以在该可用区下的所有工作空间中使用, "any data under the data folder can be used from every workspace in that availability zone"). Convenient, but a parallel-ablation clobber hazard (gotcha LR2). The 网盘 is poor at many-small-file writes; use the 数据盘 for that (verified docs.lanrui.co storage 2026-06).
Inode caps: size caps are documented (5 / 20 / 30 / 10 GB free across the four); explicit inode caps are
NOT documented by any of them. The many-small-files metadata-exhaustion risk still transfers to any shared
FS — measure df -i <persistent-mount> on a live instance in Phase 0 rather than assuming a number. Redirect
HF/ModelScope caches off the small system disk → see references/china-network.md §2.
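Because none of the four platforms documents an inode cap, the Phase 0 measurement can be scripted rather than eyeballed. The following is a minimal sketch of the same information df -h and df -i report, read through os.statvfs; the FsHeadroom type, the function names, and the 0.9 threshold are illustrative assumptions, not platform-defined values.

```python
import os
from typing import NamedTuple


class FsHeadroom(NamedTuple):
    bytes_free: int
    bytes_total: int
    inodes_free: int
    inodes_total: int


def fs_headroom(mount: str) -> FsHeadroom:
    """Read byte and inode headroom for the filesystem that holds `mount`."""
    st = os.statvfs(mount)
    return FsHeadroom(
        bytes_free=st.f_bavail * st.f_frsize,   # space available to non-root users
        bytes_total=st.f_blocks * st.f_frsize,
        inodes_free=st.f_favail,                # inodes available to non-root users
        inodes_total=st.f_files,
    )


def inode_pressure(h: FsHeadroom, threshold: float = 0.9) -> bool:
    """True when more than `threshold` of the inode table is in use."""
    if h.inodes_total == 0:
        # Some filesystems report no inode limit at all; nothing to measure.
        return False
    used_fraction = 1 - h.inodes_free / h.inodes_total
    return used_fraction > threshold
```

Calling inode_pressure(fs_headroom("<persistent-mount>")) before a many-small-files job gives an early warning even while free GB still looks healthy.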
State the checkpoint mount for §5's teardown verb: write to the persistent netdisk/数据盘, never /root.
On Gpushare, also stage hot datasets to /hy-tmp (local SSD) for IO, but copy results back to /hy-netdisk
before stopping — /hy-tmp is local AND auto-wiped 24 h after shutdown (GS5).
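The "never /root, and on Gpushare never /hy-tmp" rule can be enforced in code before a run starts. Below is a minimal sketch of such a guard, assuming the durable mounts listed in the storage table above; the DURABLE_PREFIXES mapping and is_durable helper are hypothetical names, and /hy-nas is deliberately left out because it exists only on specific instances.

```python
from pathlib import PurePosixPath

# Durable mount points per platform, as listed in the storage table above.
DURABLE_PREFIXES = {
    "matpool": ["/mnt"],
    "gpushare": ["/hy-netdisk"],
    "featurize": ["/home/featurize", "/cloud"],
    "lanrui": ["/home/user/datadisk", "/home/user/netdisk"],
}


def is_durable(platform: str, path: str) -> bool:
    """True if `path` sits on (or under) a durable mount for `platform`."""
    target = PurePosixPath(path)
    for prefix in DURABLE_PREFIXES[platform]:
        root = PurePosixPath(prefix)
        # Compare whole path components, so /mnt2 does not match /mnt.
        if target == root or root in target.parents:
            return True
    return False


def require_durable(platform: str, path: str) -> str:
    """Return `path` unchanged, or raise before any checkpoint is written."""
    if not is_durable(platform, path):
        raise ValueError(f"{path} is not on a durable mount for {platform}")
    return path
```

A training script could call require_durable(platform, checkpoint_dir) once at startup so a ported AutoDL path under /root fails loudly instead of silently losing work.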
3. NETWORK
The entire mirror / proxy / resumable-download story is shared across all CN platforms and lives in
references/china-network.md — do NOT duplicate it here. That reference owns the mirrors table
(PyPI/conda/HF), HF_ENDPOINT=https://hf-mirror.com, the ModelScope fallback, the resumable-download retry
ladder, the hf_transfer hang caution, and the no_proxy trap. Only the per-platform egress accelerator
differs and is recorded here (verified per-platform docs 2026-06):
- Gpushare: has a real academic proxy (the closest analog to AutoDL's /etc/network_turbo): export https_proxy=http://turbo.gpushare.com:<PORT> http_proxy=http://turbo.gpushare.com:<PORT> (a turbo2.gpushare.com:<PORT> backup host also exists). Two critical differences from AutoDL: (a) it is a per-session export, NOT auto-sourced, so re-run it in every new terminal/tmux pane; (b) it whitelists only *.github.com, *.github.io, *.githubusercontent.com, *.githubassets.com, *.huggingface.co, *.pytorch.org, *.kaggle.com and restricts every other host, so unset http_proxy https_proxy (or unset http_proxy && unset https_proxy) the moment the accelerated pull finishes, or pip/apt/domestic mirrors mystery-fail (gotcha GS2). This is exactly the no_proxy/route-specific trap in principle #7: validate the speed test on the same route the real transfer uses (verified gpushare.com/docs/instance/network_turbo 2026-06).
- Matpool: no one-command egress proxy; ships source-switch scripts under /public/script/ (switch_conda_source.sh, switch_pip_source.sh, switch_apt_source.sh). Fall back to mirrors (references/china-network.md).
- Featurize / LanRui: no documented one-command academic proxy surfaced; mirrors only.
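The "export, pull, then unset" discipline for the Gpushare proxy is easy to forget in a long session. One way to make the unset automatic is to scope the proxy to a block. This is a minimal sketch, assuming the pull is driven from Python; scoped_proxy is a hypothetical helper, and the proxy URL (including its port) must be supplied by the caller from the console or docs.

```python
import os
from contextlib import contextmanager

PROXY_VARS = ("http_proxy", "https_proxy")


@contextmanager
def scoped_proxy(proxy_url: str):
    """Set both proxy variables for the duration of the block, then restore them.

    Only this process and the children it spawns inside the block are affected;
    other terminals and tmux panes are untouched.
    """
    saved = {name: os.environ.get(name) for name in PROXY_VARS}
    for name in PROXY_VARS:
        os.environ[name] = proxy_url
    try:
        yield
    finally:
        for name, previous in saved.items():
            if previous is None:
                os.environ.pop(name, None)   # was unset before: unset again
            else:
                os.environ[name] = previous  # was set before: put it back
```

A whitelisted pull (for example a subprocess running git clone against GitHub) goes inside the with block; pip, apt, and domestic mirrors run outside it, where the variables are already restored.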
Port exposure: JupyterLab/TensorBoard are built-in quick-tools (all four). Custom ports must be declared
at rent time ("高级选项" on Matpool, e.g. HTTP-6006 TensorBoard / HTTP-8888) — they cannot be opened
post-launch. Ports may change on restart — re-read the console, don't hard-code a port in an alias. SSH is
standard OpenSSH (scp/rsync work directly; no proxied-SSH scp limitation). Sanitized shapes:
ssh -p <PORT> root@<region>.matpool.com (Matpool, e.g. hz.matpool.com / hz-t2.matpool.com),
ssh -p <PORT> root@<host>.gpushare.com (Gpushare),
ssh user@ssh.<region>.lanrui-ai.com -p <PORT> -i ~/.ssh/id_rsa (LanRui — public-key must be uploaded to the
console first).
4. SPOT / INTERRUPTION + RESUME (principle #7/#8)
These are on-demand-only platforms — there is NO spot bid and NO documented mid-run reclaim. Do not build SIGTERM-grace preemption handling here; aggressive retry-on-preemption is over-engineering on this family. The real involuntary-loss vectors are:
- Auto-release of stopped instances. Gpushare auto-releases (deletes, unrecoverable) a stopped pay-as-you-go instance 10 days after stop (实例停止 10 天后,会自动释放, "an instance is released automatically 10 days after it is stopped"; verified gpushare.com/docs/instance/manage 2026-06). On arrears, at noon on the 15th day Gpushare deletes personal data + the /hy-nas shared storage + custom images. A stopped box is not a parked box: pull anything needed off it before that window.
- /hy-tmp 24-hour auto-clear (Gpushare). Distinct from instance release: even while the instance still exists, /hy-tmp data is deleted 24 h after the instance is shut down, and is also wiped on instance migration (GS5).
- GPU-idle auto-shutdown. Most platforms offer an opt-in "idle → auto-stop" policy to prevent waste; if enabled it can stop a job that merely went quiet (e.g. between epochs with no GPU util). Keep it off for long single-GPU jobs unless a heartbeat is guaranteed.
- Platform churn (LanRui). LanRui migrated domain lanrui-ai.com → lanrui.co (old-domain data not retained after 2024-11-01) and retired its T1/T2 zones on 2025-06-30, moving users to a new "Cova" platform. Re-verify current console paths/domain before scripting against any cached LanRui path.
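The two Gpushare clocks in the list above start at the same moment (the stop) but expire at very different times. The arithmetic can be sketched as follows, assuming the 24 h and 10 day figures quoted above; the function and key names are illustrative only.

```python
from datetime import datetime, timedelta

# Windows quoted in the list above (Gpushare).
HY_TMP_WIPE_AFTER = timedelta(hours=24)
AUTO_RELEASE_AFTER = timedelta(days=10)


def gpushare_deadlines(stopped_at: datetime) -> dict:
    """Return when /hy-tmp is cleared and when the stopped instance is released."""
    return {
        "hy_tmp_wiped": stopped_at + HY_TMP_WIPE_AFTER,
        "instance_released": stopped_at + AUTO_RELEASE_AFTER,
    }


def time_left(deadline: datetime, now: datetime) -> timedelta:
    """Remaining time before `deadline`, clamped at zero once it has passed."""
    return max(deadline - now, timedelta(0))
```

For an instance stopped at 12:00 on 1 June, the sketch gives a /hy-tmp wipe at 12:00 on 2 June and an auto-release at 12:00 on 11 June, which is why scratch results have to leave /hy-tmp long before the instance itself is at risk.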
Resume hook: checkpoint-to-durable + load-latest-on-startup (principle #8) is still the right spine — here
it guards against a forgotten stop, a 10-day auto-release, and a /hy-tmp 24 h wipe, not a spot kill. The
cadence formula in references/spot-resilience.md still applies if a job is long enough to span a forced stop.
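The resume spine described above (checkpoint to the durable mount, load the latest one on startup) can be sketched independently of any training framework. This is a minimal illustration under stated assumptions: checkpoints are opaque bytes, the step_<N>.ckpt naming scheme is invented for the example, and the write-then-rename pattern is a common way to avoid half-written files, though its atomicity on a given network filesystem is not something the platform docs confirm.

```python
import os
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme for this sketch: step_00001234.ckpt
CKPT_RE = re.compile(r"^step_(\d+)\.ckpt$")


def save_checkpoint(durable_dir: str, step: int, payload: bytes) -> Path:
    """Write to a temp file on the same mount, then rename into place."""
    directory = Path(durable_dir)
    directory.mkdir(parents=True, exist_ok=True)
    final = directory / f"step_{step:08d}.ckpt"
    tmp = final.with_suffix(".tmp")
    with open(tmp, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # push the bytes out before the rename
    os.replace(tmp, final)
    return final


def latest_checkpoint(durable_dir: str) -> Optional[Path]:
    """Return the highest-step checkpoint, or None on a fresh start."""
    directory = Path(durable_dir)
    if not directory.is_dir():
        return None
    found = []
    for entry in directory.iterdir():
        match = CKPT_RE.match(entry.name)
        if match:  # leftover .tmp files never match, so they are skipped
            found.append((int(match.group(1)), entry))
    if not found:
        return None
    return max(found, key=lambda pair: pair[0])[1]
```

On startup a job calls latest_checkpoint(<persistent-mount>/run) and resumes from it if one exists; the same call works unchanged after a forgotten stop or a restart that reset /root.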
5. TEARDOWN / BILLING (principle #9 + the Iron Law)
The meter-stop verb is per-platform — bind it from the table below before clicking anything. The Iron Law (SKILL.md Phase 5) holds unchanged: NO release/return/destroy until checkpoints are pulled to local AND verified by load, and the user has approved the cost-affecting action.
| Platform | Meter-stop verb | What it preserves | Cost trap |
|---|---|---|---|
| Matpool | 停止并释放 (stop+release) | /mnt netdisk persists (region-scoped) | .snap snapshots silently eat the 5 GB netdisk (MP1) |
| Gpushare | 关机 (shut down) stops compute → 释放 (release) deletes | /hy-netdisk persists; /hy-tmp cleared 24 h post-stop; /root resets to image | stopped instance auto-released at 10 days (GS4); arrears purge day-15 noon |
| Featurize | 实例归还 (return) | only work (/home/featurize) + /cloud persist | everything else wiped immediately on return (FZ1) |
| LanRui | 停止 (stop) stops compute; must 销毁数据盘 (destroy the 数据盘) to stop disk billing | 网盘 + 数据盘 persist | 数据盘 bills hourly while the workspace is merely STOPPED (LR1) |
The single most dangerous divergence: on LanRui, "stop to save money" is wrong. The 数据盘
(/home/user/datadisk, block storage, bought in 200 G / 500 G specs) bills hourly from creation until
destroyed, even while the workspace is stopped — 工作空间停止运行,未销毁的数据盘也将持续计费 (verified
docs.lanrui.co storage + lanrui.co/pricing 2026-06). So a stopped LanRui workspace keeps a meter running. To
actually stop all billing: stop the workspace AND destroy the 数据盘 (after the Iron-Law pull+verify). The 网盘
(10 GB free, 0.15 元/GB·月 overage) persists separately. Contrast: on Matpool/Gpushare/Featurize,
release/return/归还 ends compute billing and the persistent volume simply survives (Gpushare /hy-netdisk and
/hy-nas bill per-GB but are not destroyed by stopping).
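The storage meters above are simple products, and working one through shows how a "stopped" box still costs money. The sketch below uses only the rates quoted in this profile (/hy-nas at 0.0007 元/GB·h; LanRui 网盘 at 10 GB free plus 0.15 元/GB·月 overage). The LanRui 数据盘 hourly rate is not stated here, so it is a caller-supplied parameter, and the assumption that 网盘 overage is charged only on the GB above the free quota is mine.

```python
HY_NAS_YUAN_PER_GB_HOUR = 0.0007          # Gpushare /hy-nas, quoted above
LANRUI_NETDISK_FREE_GB = 10               # LanRui netdisk free quota
LANRUI_NETDISK_YUAN_PER_GB_MONTH = 0.15   # LanRui netdisk overage rate


def hy_nas_cost(gb: float, hours: float) -> float:
    """Gpushare /hy-nas cost in yuan for `gb` stored over `hours`."""
    return HY_NAS_YUAN_PER_GB_HOUR * gb * hours


def lanrui_netdisk_monthly(gb_used: float) -> float:
    """Monthly netdisk overage in yuan, assuming only GB above the quota bill."""
    overage_gb = max(0.0, gb_used - LANRUI_NETDISK_FREE_GB)
    return overage_gb * LANRUI_NETDISK_YUAN_PER_GB_MONTH


def stopped_datadisk_cost(hourly_rate_yuan: float, hours_stopped: float) -> float:
    """Cost of a LanRui data disk left undestroyed while the workspace is stopped.

    `hourly_rate_yuan` must come from the current pricing page; it is not
    given in this profile.
    """
    return hourly_rate_yuan * hours_stopped
```

For example, 100 GB on /hy-nas for 24 h comes to about 1.68 元, and 30 GB on the LanRui 网盘 comes to 3 元 per month of overage; the 数据盘 figure depends entirely on the rate read from the pricing page.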
Cost-pause analogs (cheaper than full release, data kept): Gpushare 无卡模式 / 无卡启动 (low-core
CPU-only restart, no GPU) is the analog of AutoDL's no-GPU restart — keeps /hy-netdisk data while paused at a
fraction of the GPU rate, ideal for env-config + dataset download (verified gpushare 无卡启动 announcement
2026-06). LanRui supports an auto-stop timer (set a stop time at workspace start) and per-hour billing.
6. DAEMON TOOL
tmux is the family detach primitive — preinstalled on most images. Caveat (from Matpool docs, true
family-wide): run tmux from a local SSH session, NOT the Jupyter web terminal — keybindings collide with
tmux's prefix. A backgrounded nohup python … </dev/null >log 2>&1 & also survives a tab-close / page refresh
on Featurize (process not killed; only notebook cell state lost) — but tmux is preferred for a named,
re-attachable session.
tmux survives an SSH drop but NOT an instance stop/restart on any platform (on Gpushare the restart
resets /root, taking the tmux server and any /root logs with it) — so the durable spine is
checkpoint-to-persistent-disk (§2, principle #8), not the tmux session. LanRui additionally supports
multi-machine multi-GPU distributed training — if used, see references/multinode.md.
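Since the tmux server does not survive a stop or restart, the one thing worth getting right at launch is that the log goes to the persistent mount. A minimal sketch of building such a launch command follows; tmux_launch_argv is a hypothetical helper, and the session name, command, and log path are all caller-supplied.

```python
import shlex


def tmux_launch_argv(session: str, command: str, log_path: str) -> list:
    """Build argv for a detached, named tmux session whose output is teed to a log.

    `command` is passed through as shell text, so the caller is responsible
    for its quoting; only the log path is quoted here. The pipeline's exit
    status is tee's, so success should be judged from the log or checkpoints.
    """
    wrapped = f"{command} 2>&1 | tee -a {shlex.quote(log_path)}"
    return ["tmux", "new-session", "-d", "-s", session, wrapped]
```

The resulting list can be handed to subprocess.run from a local SSH session; re-attaching later is tmux attach -t <session>, and the log remains readable from the persistent mount even if the session is gone.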
7. TOP GOTCHAS (platform-pinned; universal ones → references/gotchas_universal.md)
Family-wide (China-specific, not in the universal catalog)
CN1 — /root ephemerality silently loses work.
Symptom: code/checkpoints written to /root vanish after a stop/restart (Gpushare) or instance return
(Featurize). → Root cause: the system disk resets to image state / is wiped on return — unlike AutoDL, which
persists /root across power-off. → Fix: write everything to the persistent mount (§2); treat /root as
RAM. Audit with ls <persistent-mount> after a test stop before trusting it for a real run.
CN2 — GPU-idle auto-stop kills a quiet job. Symptom: a long job dies mid-run with no error; console shows "auto-stopped (idle)". → Root cause: an opt-in idle-shutdown policy stopped the instance during a low-GPU-util phase (data loading, eval, between epochs). → Fix: disable idle-auto-stop for long jobs, or emit a periodic GPU-touching heartbeat; confirm the policy state in Phase 0.
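The heartbeat option in CN2 can be sketched as a background thread that performs a tiny GPU operation at a fixed interval. This is an illustration only: the Heartbeat class is hypothetical, gpu_touch assumes the image's prebuilt PyTorch with CUDA, and whether such a small operation registers with a given platform's idle detector is not documented, so disabling the policy remains the safer fix.

```python
import threading


class Heartbeat:
    """Call `touch` every `interval_s` seconds on a daemon thread until stopped."""

    def __init__(self, touch, interval_s: float = 60.0):
        self._touch = touch
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout (keep beating) and True once
        # stop() has been called (exit the loop).
        while not self._stop.wait(self._interval):
            self._touch()

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()


def gpu_touch():
    """Run a trivial CUDA op; assumes PyTorch with CUDA is installed."""
    import torch  # imported lazily so this module loads without torch

    x = torch.ones(1, device="cuda")
    (x + x).sum().item()  # .item() waits for the kernel to finish
```

A job would start Heartbeat(gpu_touch).start() before a long data-loading or evaluation phase and call stop() afterwards.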
Matpool (matpool.com)
MP1 — .snap snapshots silently consume the 5 GB netdisk.
Symptom: "保存环境" / snapshot saves fail or the netdisk fills with no obvious culprit. → Root cause: snapshots
are written as .snap files into the netdisk and count against its tiny 5 GB quota (verified matpool
snapshot docs 2026-06). → Fix: prune old .snap files (deleting one frees the quota); keep only the latest
needed env snapshot.
MP2 — /mnt is excluded from snapshots, and the machine is locked while saving.
Symptom: "保存环境" doesn't capture code under /mnt; the instance is unusable during the save. → Root cause:
a snapshot captures everything except /mnt (the netdisk mount), and the machine cannot be used while
the snapshot writes. → Fix: to shrink a snapshot move code/data to /mnt first (it won't be captured); to
preserve code via snapshot keep it OFF /mnt. Ensure no running process before triggering a save.
MP3 — region-scoped netdisk strands data on a sweep across regions.
Symptom: a second instance in another region can't see files written by the first; expanded storage "missing".
→ Root cause: /mnt netdisks are separate per region and do not interconnect. → Fix: keep all instances of a
sweep in one region; pick region before expanding (verified matpool FAQ 2026-06).
Gpushare (gpushare.com)
GS1 — /root resets to image state on every shutdown/restart. (The instance of CN1 to remember by name.)
Symptom: installed packages / code / logs under /root gone after restart. → Root cause: / reverts to the image on
every shutdown/restart; only /hy-netdisk is durable, and /hy-tmp survives a restart but is cleared 24 h after shutdown (GS5). → Fix: env on /hy-netdisk, hot data on /hy-tmp, results
synced to /hy-netdisk before stop.
GS2 — turbo proxy left on blocks non-whitelisted hosts.
Symptom: after export …turbo.gpushare.com…, pip install / apt / domestic mirrors hang or ProxyError.
→ Root cause: the academic proxy whitelists only GitHub/HF/PyTorch/Kaggle and restricts everything else
(verified network_turbo docs 2026-06). → Fix: unset http_proxy https_proxy the moment the accelerated pull
finishes (§3). Same shape as the no_proxy trap in references/china-network.md.
GS3 — /hy-netdisk absent on unmarked machines.
Symptom: scripts referencing /hy-netdisk fail on some rentals. → Root cause: the shared netdisk exists only
on machines marked as supporting 内网存储. → Fix: check mount | grep hy-netdisk in Phase 0; fall back to
personal cloud storage via oss cp (OSS tool, ~300 Mbps, compressed archives only) if absent.
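The Phase 0 check in GS3 amounts to asking whether a given mount point is present. A scripted form might look like the sketch below, which assumes a Linux host exposing /proc/mounts; is_mounted and read_proc_mounts are hypothetical helpers, and unlike grep the comparison is an exact match on the mount-point field.

```python
def is_mounted(mount_point: str, mounts_text: str) -> bool:
    """True if `mount_point` appears as the second field of any mounts line."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts layout: device, mount point, fs type, options, dump, pass
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def read_proc_mounts() -> str:
    """Read the kernel's mount table (Linux only)."""
    with open("/proc/mounts", encoding="utf-8") as handle:
        return handle.read()
```

A launch script could call is_mounted("/hy-netdisk", read_proc_mounts()) and switch to the fallback path when it returns False, rather than failing later on the first write.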
GS4 — stopped instance auto-released at 10 days; arrears purge at day 15. Symptom: a parked stopped
instance disappears, or shared/personal data is gone after non-payment. → Root cause: pay-as-you-go auto-
release 10 days after stop (实例停止 10 天后自动释放); on arrears, day-15-noon deletes personal data +
/hy-nas + custom images (verified gpushare docs 2026-06). → Fix: pull results off a stopped box promptly;
don't treat "stopped" as durable parking; keep the balance positive.
GS5 — /hy-tmp auto-cleared 24 h after shutdown (and on migration). (NEW — corrects the prior "/hy-tmp
persists" assumption.) Symptom: training data/scratch under /hy-tmp gone the day after a stop, even though
the instance still exists. → Root cause: /hy-tmp is per-server local scratch, auto-deleted 24 h after
shutdown and wiped on instance migration (verified gpushare.com/docs/data/storage 2026-06). → Fix: treat
/hy-tmp as IO scratch only; sync anything durable to /hy-netdisk before stopping; do NOT
conda create -p /hy-tmp/... for a persistent env (use /hy-netdisk).
Featurize (featurize.cn)
FZ1: anything outside work and /cloud is wiped the instant the instance is returned. (The strictest
"what survives" rule of the four.) Symptom: results outside /home/featurize or /cloud gone after 归还 (return).
→ Root cause: only work (per-user cloud storage, 工作区可以一直保存项目文件, "the workspace keeps project files indefinitely") and the /cloud sync drive
persist; everything else is destroyed on return (verified Featurize tutorials 2026-06). → Fix: write all
durable output under work or /cloud; verify before returning.
FZ2 — /cloud sync drive lag makes edits look saved but not land. Symptom: VS Code edits / files appear
saved locally but are missing after reconnect or return (the "工作区中修改代码后无法保存" complaint). → Root
cause: the Remote-SSH sync to the cloud drive is not always real-time, especially on slow links or large files.
→ Fix: explicit Ctrl+S, then verify on the server (ls -la / cat the file) before trusting it; on a
flaky connection, close and re-open the Remote-SSH session (transient failures are expected).
FZ3 — 30 GB free cloud quota silently breaks large writes / conda create. (corrects the prior "~20 GB"
figure.) Symptom: env creation or large copies into work or /cloud fail or truncate. → Root cause: the free
cloud storage is 30 GB (verified featurize.cn 2026-06); over it, writes fail. → Fix: du -sh ~/work /cloud
to watch headroom; keep only the active env there; large reproducible scratch belongs on local
non-persistent disk, not the cloud drive.
LanRui (lanrui.co / lanrui-ai.com)
LR1 — 数据盘 keeps billing while the workspace is merely stopped. (The most expensive divergence — see
§5.) Symptom: a stopped LanRui workspace still accrues cost. → Root cause: the 数据盘
(/home/user/datadisk) bills hourly from creation until destroyed, independent of workspace run-state
(工作空间停止运行,未销毁的数据盘也将持续计费 — verified docs.lanrui.co storage 2026-06). → Fix: to stop
all billing, stop the workspace and 销毁 the 数据盘 — only after the Iron-Law pull+verify; the 网盘 keeps
the data.
LR2 — shared netdisk/data folder mounted into every same-zone workspace → cross-run clobber. Symptom:
a parallel ablation overwrites another run's outputs. → Root cause: /home/user/netdisk/data is auto-mounted
and shared across all workspaces in the same availability zone. → Fix: per-job isolated write paths
(references/parallel_ablation.md); never share a mutable output dir under netdisk/data. Also: the 网盘 is
poor at many-small-file writes — route those to the 数据盘.
LR3 — platform/domain churn invalidates cached paths. Symptom: scripted paths/domain fail post-migration.
→ Root cause: domain lanrui-ai.com → lanrui.co (old data dropped after 2024-11-01); T1/T2 zones retired
2025-06-30 → "Cova" platform. → Fix: re-verify console domain + paths in-session before scripting against any
cached LanRui path.
Platform-specific debugging
Before trusting a run, in Phase 0 (per platform):
- Confirm persistence path is real, not /root. mount | grep -E 'mnt|hy-netdisk|cloud|datadisk|netdisk' then touch <persistent-mount>/.probe && ls -l <persistent-mount>/.probe. On Gpushare also confirm /hy-netdisk is present (GS3): mount | grep hy-netdisk (absent on unmarked machines).
- GPU + driver sanity. nvidia-smi (GPU visible, mem free, driver/CUDA), then python -c "import torch;print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))" against the prebuilt interpreter. Mismatched local-vs-server PyTorch silently breaks checkpoint loads on Featurize, so match versions.
- Detect a stuck / throttled download. du -sh <cache-dir> twice ~30 s apart: flat size = stalled (often the GFW or a left-on Gpushare turbo proxy restricting a non-whitelisted host, GS2). Cross-check with curl -sI -x "$https_proxy" https://hf-mirror.com and env | grep -i proxy; then unset http_proxy https_proxy and retry on the mirror.
- Disk / inode pressure (the silent §2 risk). df -h <persistent-mount> AND df -i <persistent-mount>: a full inode table fails writes while df -h still shows free GB. On Matpool, a filling 5 GB netdisk is usually stale .snap files (ls -la /mnt/*.snap, MP1).
- Verify the meter-stop did what was intended. After "stop", re-check the console billing line: on LanRui a stopped workspace whose 数据盘 was NOT destroyed is still metering (LR1); on Gpushare a stopped box still counts toward the 10-day auto-release clock (GS4).
- Read the running job's log, don't infer from silence. Job is in tmux/nohup → tmux capture-pane -pt <session> or tail -f <persistent-mount>/run.log. A vanished tmux server after a "restart" means /root reset (GS1), so the log must live on the persistent mount to survive.
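The "measure twice, compare" stall check from the list above can be automated. The sketch below is one possible form, with hypothetical function names; it sums apparent file sizes rather than disk blocks, so the absolute number can differ from du -sh, but a flat reading across the interval carries the same meaning.

```python
import os
import time


def dir_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all regular entries under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # A temp file can vanish between listing and stat; skip it.
                pass
    return total


def is_stalled(cache_dir: str, interval_s: float = 30.0) -> bool:
    """True if the directory did not change size across `interval_s` seconds."""
    before = dir_size_bytes(cache_dir)
    time.sleep(interval_s)
    return dir_size_bytes(cache_dir) == before
```

A True result is a prompt to inspect the proxy variables and retry on the mirror, not proof of failure: a download that has already finished is also flat.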
8. SCRIPT OVERRIDES
Parameterize the scripts/ templates per platform. PROXY_HOOK, HF_HOME, and the mirror env all defer to
references/china-network.md; only the mounts truly differ.
| Var | Matpool | Gpushare | Featurize | LanRui |
|---|---|---|---|---|
| DURABLE_DIR= (durable) | /mnt | /hy-netdisk | /home/featurize (+/cloud) | /home/user/datadisk (or /home/user/netdisk/data) |
| DATA_DIR= (fast/ephemeral) | /root | /hy-tmp (24 h post-stop wipe) | local tmp | /home/user/datadisk scratch |
| SCRATCH= (local, prune) | /root | /hy-tmp | local tmp | 数据盘 scratch |
| HF_HOME= | /mnt/.cache/hf | /hy-netdisk/.cache/hf | /cloud/.cache/hf | /home/user/datadisk/.cache/hf |
| PROXY_HOOK= | (mirrors only) | export …turbo.gpushare.com:<PORT>… then unset | (mirrors only) | (mirrors only) |
| CRED_FILE="" (no file; env var) | $WANDB_API_KEY / $HF_TOKEN on ephemeral disk, never the shared netdisk | same | same | same |
| DETACH= | tmux | tmux | tmux | tmux |
CRED_FILE="" because on these CN platforms the credential is an env var (or .netrc) on the ephemeral
disk, not a file on the netdisk — leave it empty so run_one's [ -n "$CRED_FILE" ] guard skips the file read
and $WANDB_API_KEY / $HF_TOKEN pass through from the platform env.
Common to all: the credential lives in an env var or .netrc on the ephemeral system disk, never on the
shared/persistent netdisk (a shared data folder mounted into every same-zone workspace, like LanRui's, is
especially leaky — universal secrets-off-shared-FS gotcha in references/gotchas_universal.md).
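One way to apply the override table is to keep it as data and render the export lines per platform. The sketch below is illustrative: the OVERRIDES mapping copies only the cells that name a concrete path, so DATA_DIR and SCRATCH are filled in for Matpool and Gpushare only ("local tmp" and "数据盘 scratch" do not specify a path and are left for the caller), and export_lines is a hypothetical helper rather than part of the scripts/ templates.

```python
import shlex

# Concrete cells from the override table; unspecified paths are omitted.
OVERRIDES = {
    "matpool": {
        "DURABLE_DIR": "/mnt",
        "DATA_DIR": "/root",
        "SCRATCH": "/root",
        "HF_HOME": "/mnt/.cache/hf",
    },
    "gpushare": {
        "DURABLE_DIR": "/hy-netdisk",
        "DATA_DIR": "/hy-tmp",
        "SCRATCH": "/hy-tmp",
        "HF_HOME": "/hy-netdisk/.cache/hf",
    },
    "featurize": {
        "DURABLE_DIR": "/home/featurize",
        "HF_HOME": "/cloud/.cache/hf",
    },
    "lanrui": {
        "DURABLE_DIR": "/home/user/datadisk",
        "HF_HOME": "/home/user/datadisk/.cache/hf",
    },
}

# Same value on all four platforms.
COMMON = {"CRED_FILE": "", "DETACH": "tmux"}


def export_lines(platform: str, extra: dict = None) -> list:
    """Render shell export lines; `extra` supplies or overrides any variable."""
    merged = {**OVERRIDES[platform], **COMMON, **(extra or {})}
    return [f"export {name}={shlex.quote(value)}" for name, value in merged.items()]
```

An empty CRED_FILE renders as export CRED_FILE='' so that the [ -n "$CRED_FILE" ] guard described above still skips the file read.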
9. Per-platform comparison — the load-bearing differences at a glance
The questions the schema asks, answered per platform in the seven rows below. This is the table to read first when picking which delta applies.
| Question | Matpool | Gpushare | Featurize | LanRui |
|---|---|---|---|---|
| Prebuilt base-conda env? | yes (myconda, auto-activated) | yes (miniconda, base auto-activate off) | yes (full PyTorch/TF base, pip persists on work) | yes (image-provisioned; PyTorch images purchasable) |
| Academic-acceleration proxy? | no (source-switch scripts only) | yes turbo.gpushare.com:<PORT> (per-session, 7-host whitelist) | no (mirrors only) | no (mirrors only) |
| Shared / region FS? | /mnt netdisk (region-scoped, expandable) | /hy-netdisk (only on marked machines) + /hy-nas | work+/cloud (cloud sync; not cross-region, med-conf) | /home/user/netdisk/data (shared into every same-zone workspace) |
| Inode cap? | undocumented, measure df -i | undocumented, measure df -i | undocumented, measure df -i | undocumented, measure df -i |
| Data disk bills while stopped? | no (release ends billing) | no (but stopped box auto-released at 10 d; /hy-tmp cleared 24 h) | no (return ends billing) | YES, 数据盘 bills until destroyed |
| Meter-stop verb | 停止并释放 | 关机 → 释放 (+ 无卡模式 pause) | 实例归还 | 停止 + 销毁数据盘 |
| /root survives a stop? | local, lost on release | NO, resets to image | NO, wiped on return | system disk lost; use 数据盘 |
Bottom line for porting an AutoDL workflow: the SSH/tmux/smoke/checkpoint spine transfers verbatim; the
three things to re-bind per platform are (1) the persistent mount (never /root; on Gpushare never
/hy-tmp either), (2) the meter-stop verb — and on LanRui, that stopping is not enough, the 数据盘 must be
destroyed — and (3) the proxy hook (real proxy only on Gpushare, with a strict whitelist; mirrors-only
elsewhere → references/china-network.md).