China network + model-download reference

Universal recipe for pulling code, packages, and model weights onto any GPU box behind the GFW — AutoDL, 矩池云, 恒源云, Featurize, 揽睿星舟, or a bare CN SSH instance. The whole problem reduces to four orthogonal env-var switches (mirror, cache location, resume tier, proxy scope); none requires editing training code. This file owns the CN-specific transport swap and stall-retry; REQUIRED: huggingface-skills:hf-cli owns the generic hf download / hf upload verbs underneath it.

Universal gotchas (inode caps, silent sync, symlinked caches) are not restated here — see references/gotchas_universal.md. The AutoDL-pinned form lives in profiles/autodl.md.

To jump: grep -in '<keyword>' references/china-network.md (try mirror, HF_ENDPOINT, hfd, no_proxy, hf_transfer, decision).

Table of contents

  1. Mirrors table — PyPI / conda / HuggingFace / alt hub
  2. Env switchboard — the four switches + the import-time trap + cache redirect
  3. Resumable-download ladder — three tiers + the hf_transfer caution
  4. The no_proxy trap — a proxy that fixes one domain breaks all the others
  5. Decision rule + scripts/setup-china-mirrors.sh

1. Mirrors table

Swap the source, not the workflow. Same package names, same repo IDs — only the endpoint changes. Ship this verbatim; it is identical across every CN platform.

| Channel | Set | Endpoint(s) |
|---|---|---|
| PyPI | pip config set global.index-url <url> or pip install -i <url> pkg | Tsinghua TUNA https://pypi.tuna.tsinghua.edu.cn/simple · Aliyun https://mirrors.aliyun.com/pypi/simple · USTC https://pypi.mirrors.ustc.edu.cn/simple |
| conda | channels in ~/.condarc (TUNA Anaconda) | https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main + .../free + the cloud/ channels (pytorch, conda-forge) |
| HuggingFace | export HF_ENDPOINT=https://hf-mirror.com | drop-in reverse proxy; identical repo IDs, identical hf download / from_pretrained calls |
| Alt model hub | ModelScope CLI / SDK | pip install modelscope; modelscope download <id> or snapshot_download(id, ...); often hosts the same Qwen / GLM / Llama weights domestically |

conda trap — NEVER mirror pytorch-nightly. TUNA (and every CN Anaconda mirror) syncs the stable pytorch channel but does not carry pytorch-nightly — pointing the nightly channel at a mirror silently resolves to a stale or absent build. Install nightly only from the official channel (over a real proxy if the box is offline), and mirror just the stable channels.
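A minimal sketch of the conda rule above, assuming a hypothetical helper `render_condarc` (not part of conda or TUNA tooling). It emits a ~/.condarc body that mirrors only stable channels and refuses `pytorch-nightly`. The `default_channels` / `custom_channels` layout is an assumption modelled on the TUNA Anaconda help page; compare it with that page before shipping.

```python
TUNA = "https://mirrors.tuna.tsinghua.edu.cn/anaconda"

# Channels the text says must stay on the official source.
UNMIRRORED = {"pytorch-nightly"}


def render_condarc(cloud_channels=("pytorch", "conda-forge"), pkgs=("main",)):
    """Return ~/.condarc text that routes only stable channels through TUNA.

    `pkgs` names the pkgs/ sub-channels to mirror; extend it with the
    other channels listed on the TUNA help page as needed.
    """
    blocked = UNMIRRORED.intersection(cloud_channels)
    if blocked:
        raise ValueError(f"do not mirror: {sorted(blocked)}")
    lines = ["show_channel_urls: true", "default_channels:"]
    lines += [f"  - {TUNA}/pkgs/{name}" for name in pkgs]
    lines.append("custom_channels:")
    lines += [f"  {name}: {TUNA}/cloud" for name in cloud_channels]
    return "\n".join(lines) + "\n"
```

Writing the returned text to ~/.condarc would apply it; a call such as `render_condarc(("pytorch-nightly",))` raises instead of producing a config that silently resolves to a stale build.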

Source: HF-Mirror https://hf-mirror.com/; TUNA PyPI https://mirrors.tuna.tsinghua.edu.cn/help/pypi/; TUNA Anaconda https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/; ModelScope client https://github.com/modelscope/modelscope_hub.


2. Env switchboard + the import-time trap

Everything below is environment variables only — no code edits. Export them once per shell (or bake them into scripts/setup-china-mirrors.sh, §5) before anything that touches the wire.

# --- mirror routing ---
export HF_ENDPOINT=https://hf-mirror.com           # MUST precede any HF import (see trap below)
# --- caches OFF the small reset-on-release system disk, ONTO the data disk ---
export HF_HOME=/path/to/datadisk/hf                 # parent for hub/, datasets/, etc.
export HF_HUB_CACHE=/path/to/datadisk/hf/hub        # the model-blob cache specifically
export MODELSCOPE_CACHE=/path/to/datadisk/modelscope
# --- keep hf_transfer OFF on flaky CN links (see §3) ---
export HF_HUB_ENABLE_HF_TRANSFER=0

The import-time trap — HF_ENDPOINT is read once, at import. huggingface_hub / transformers / datasets snapshot HF_ENDPOINT the moment they are imported. Setting it after the import (or in a notebook cell run after the first import transformers) is a no-op — the library already cached the international endpoint and every download hits the slow path. Two safe forms:

# Inline on the command — the env is set before the interpreter starts:
HF_ENDPOINT=https://hf-mirror.com python train.py
# Or export in the wrapper, ABOVE any python invocation:
export HF_ENDPOINT=https://hf-mirror.com   # then later: python -m src.train ...
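For entry points where the shell cannot be controlled (a notebook, an IDE runner), the same ordering can be enforced from Python. The sketch below uses a hypothetical guard, `set_hf_endpoint`, that refuses to set the variable once any of the three libraries named above is already loaded. The `modules` and `environ` parameters exist only to make the guard testable.

```python
import os
import sys

HF_MIRROR = "https://hf-mirror.com"
# Libraries the text names as snapshotting HF_ENDPOINT at import time.
HF_LIBS = ("huggingface_hub", "transformers", "datasets")


def set_hf_endpoint(url=HF_MIRROR, modules=None, environ=None):
    """Set HF_ENDPOINT, or raise if an HF library was imported first."""
    modules = sys.modules if modules is None else modules
    environ = os.environ if environ is None else environ
    loaded = [name for name in HF_LIBS if name in modules]
    if loaded:
        raise RuntimeError(
            f"{loaded} already imported; HF_ENDPOINT would be ignored. "
            "Restart the interpreter and set it first."
        )
    environ["HF_ENDPOINT"] = url
    return url
```

Call `set_hf_endpoint()` as the first statement of the entry script, above `import transformers`; a RuntimeError in a notebook means the kernel needs a restart, not a re-run of the cell.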

Cache redirect — why it matters. Most CN images pair a tiny reset-on-release system disk with a larger persistent data disk. Left at defaults, ~/.cache/huggingface lands on the system disk and either fills it (crashing downloads) or is wiped on restart on platforms where /root is ephemeral. Redirecting HF_HOME / HF_HUB_CACHE / MODELSCOPE_CACHE onto the data disk ties model storage to the same disk-budget discipline as checkpoints (principle #5; survival matrix in each profile).
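A small sketch of the redirect, assuming hypothetical helpers `cache_env` and `on_same_device`. The first derives the three cache variables from one data-disk root, using the same layout as the exports above. The second compares filesystem device IDs, which may help spot a "data disk" path that actually sits on the system disk.

```python
import os
from pathlib import Path


def cache_env(data_root):
    """Map the three cache variables from the text onto one data-disk root."""
    root = Path(data_root)
    return {
        "HF_HOME": str(root / "hf"),
        "HF_HUB_CACHE": str(root / "hf" / "hub"),
        "MODELSCOPE_CACHE": str(root / "modelscope"),
    }


def on_same_device(path_a, path_b):
    """True if both existing paths live on the same filesystem (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

If `on_same_device(data_root, os.path.expanduser("~"))` returns True, the redirect probably buys nothing on that box, although overlay and bind mounts can make this check inconclusive.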

Source: HF-Mirror https://hf-mirror.com/; ModelScope client https://github.com/modelscope/modelscope_hub.


3. Resumable-download ladder

Bulk weight pulls are the prototypically flaky step on a CN link — a stall is not a permanent failure, and every tier below accumulates progress across kills. Escalate by file size and instability.

Tier 1 — hf download <repo> --resume-download (default). Writes partial blobs as *.incomplete; re-running the identical command resumes from the byte offset. Best for single repos under ~10 GB. Wrap in a timeout … && break retry loop so a stall self-recovers:

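#!/usr/bin/env bash
set -u
# Fail fast if the caller forgot to set the inputs.
: "${REPO:?set REPO to the repo ID}" "${DIR:?set DIR to the target directory}"
ok=0
for _ in $(seq 1 20); do
  # timeout kills a wedged transfer after 10 min; partial blobs survive, so the next pass resumes.
  timeout 600 hf download "$REPO" --local-dir "$DIR" --resume-download && { ok=1; break; }
  echo "stall, retrying (progress is saved)"; sleep 5
done
[ "$ok" -eq 1 ] || { echo "gave up after 20 attempts" >&2; exit 1; }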

(Underlying verbs — hf download --resume-download, hf cache verify — belong to REQUIRED: huggingface-skills:hf-cli; this ladder only wraps them with CN-mirror routing + stall-retry.)
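The same stall-retry pattern can be driven from Python when the orchestrator is not a shell script. This is a sketch with a hypothetical helper, `run_with_retries`; it wraps any command and knows nothing about hf itself.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=20, timeout_s=600, pause_s=5):
    """Run cmd until it exits 0; kill and retry on timeout or failure.

    Returns the 1-based attempt number that succeeded, or 0 if none did.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the direct child when the timeout expires.
            if subprocess.run(cmd, timeout=timeout_s).returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            pass
        print(f"attempt {attempt} failed or stalled", file=sys.stderr)
        if attempt < attempts:
            time.sleep(pause_s)
    return 0
```

Pass it the same argument list as the bash loop, for example `["hf", "download", repo, "--local-dir", target]` plus whatever flags that loop uses. A return value of 0 means every attempt failed and the caller should escalate to the next tier.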

Tier 2 — hfd.sh (aria2 multi-connection) for any single file > 10 GB. hfd.sh (the HF-Mirror companion script) drives aria2c with many parallel connections per file — markedly faster and more stall-resistant than the single-stream CLI on large .safetensors shards over a congested evening link. Reach for it whenever one file exceeds ~10 GB:

./hfd.sh "$REPO" --tool aria2c -x 8     # 8 connections per file, resumes on re-run

Tier 3 — ModelScope snapshot_download (HTTP-Range resume). When a model exists on ModelScope (most CN-origin models do), pull it domestically — snapshot_download does per-file HTTP-Range resume, per-file retry with backoff, and SHA256 verification, all over a domestic route that never touches the GFW:

from modelscope import snapshot_download
snapshot_download("Org/Model", local_dir="/path/to/datadisk/model")

Note: ModelScope writes a plain directory and does not populate the HF cache, so from_pretrained("Org/Model") won't find it — point the load at the local dir.
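One way to keep training code unchanged is to resolve the load target once, up front. The sketch below assumes a hypothetical helper, `resolve_model_source`, and assumes that a populated checkpoint directory contains a config.json, which fits typical transformers checkpoints but is not guaranteed for every model.

```python
from pathlib import Path


def resolve_model_source(repo_id, local_dir):
    """Prefer a populated local directory over the hub repo ID.

    "Populated" means a config.json is present (an assumption, see above).
    """
    candidate = Path(local_dir)
    if (candidate / "config.json").is_file():
        return str(candidate)
    return repo_id
```

The returned string can be passed straight to `from_pretrained(...)`: it is the ModelScope directory when the domestic pull has finished, and the repo ID (served via HF_ENDPOINT) otherwise.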

hf_transfer caution — keep HF_HUB_ENABLE_HF_TRANSFER=0 on flaky CN networks. hf_transfer is a Rust accelerator that helps on fast, stable links, but it has a documented hang-with-no-error in exactly the unstable-bandwidth conditions CN ops hit — the download wedges with no progress and no exception, defeating every retry loop above. Leave it off by default on any CN box; only enable it once a route is verified fast and stable.
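Because a wedged download raises nothing, the only observable signal is that bytes on disk stop growing. The sketch below shows one way to detect that, using hypothetical helpers `bytes_on_disk` and `is_stalled`; the polling interval and window are choices for the caller, not values from any library.

```python
from pathlib import Path


def bytes_on_disk(root):
    """Total size of all files under root, including *.incomplete blobs."""
    total = 0
    for path in Path(root).rglob("*"):
        try:
            if path.is_file():
                total += path.stat().st_size
        except FileNotFoundError:
            # A blob was renamed or removed between listing and stat.
            continue
    return total


def is_stalled(samples, window=3):
    """True if the last `window` size samples are identical."""
    recent = samples[-window:]
    return len(recent) == window and len(set(recent)) == 1
```

A watchdog would append `bytes_on_disk(target)` to a list every minute or so and kill the download process once `is_stalled(samples)` turns True.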

Source: hf CLI resume https://github.com/huggingface/huggingface_hub/issues/3580; hf_transfer hang https://github.com/huggingface/hf_transfer/issues/30; ModelScope download https://deepwiki.com/modelscope/modelscope/3.1-model-download-and-caching.


4. The no_proxy trap

The highest-value gotcha in this file. A Clash / VPN proxy added to reach huggingface.co simultaneously breaks every domestic mirror: pip, the TUNA index, ModelScope, and intra-cloud OSS all get routed out through an overseas exit node, producing ProxyError or multi-minute stalls (principle #7: a proxy speeds ONE route and slows the others).

Symptom → after exporting http_proxy/https_proxy to fix HF, pip install and ModelScope downloads hang or raise ProxyError, while huggingface.co now works. Root cause → the proxy is global; domestic mirrors that were fast on the direct route are now hauled overseas and back. Fix → exempt every domestic host from the proxy with a no_proxy allowlist, minding these library quirks:

  • Leading-dot domains, no * wildcards. requests honors no_proxy but does not expand * — use .modelscope.cn (leading dot matches the domain and all subdomains), never *.modelscope.cn.
  • Set BOTH no_proxy and NO_PROXY. Different libraries read different casings; set both to the same value.
  • List 127.0.0.1 AND localhost. They are distinct entries; omitting either lets a loopback call (TensorBoard, a local API) get proxied.
  • pip ignores no_proxy for its own connections: pass pip install --proxy "" to force pip onto the direct route regardless of an inherited proxy env.

# Only export this WHEN a proxy is present (see below):
DOMESTIC=".tuna.tsinghua.edu.cn,.aliyun.com,.aliyuncs.com,.ustc.edu.cn,.modelscope.cn,.tencentyun.com"
export no_proxy="127.0.0.1,localhost,${DOMESTIC}"
export NO_PROXY="$no_proxy"
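The leading-dot rule is easier to trust once it is seen as plain suffix matching. The sketch below is a simplified model, not the code of requests or any other library: `bypasses_proxy` and `proxy_env` are hypothetical helpers, and real clients differ in details such as ports and IP ranges.

```python
def bypasses_proxy(host, no_proxy):
    """Simplified no_proxy check: exact or dot-suffix match, no wildcards."""
    host = host.lower()
    for entry in no_proxy.split(","):
        # A leading dot is dropped; a '*' is kept literally and never matches.
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if host == entry or host.endswith("." + entry):
            return True
    return False


def proxy_env(proxy_url, domestic):
    """Build proxy variables plus a matching no_proxy in both casings."""
    no_proxy = ",".join(["127.0.0.1", "localhost", *domestic])
    return {
        "http_proxy": proxy_url,
        "https_proxy": proxy_url,
        "no_proxy": no_proxy,
        "NO_PROXY": no_proxy,
    }
```

Under this model `.modelscope.cn` exempts `www.modelscope.cn`, while `*.modelscope.cn` exempts nothing, which is the failure mode described in the first bullet. Building the proxy variables and the allowlist in one call keeps them from drifting apart.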

A clean box with no proxy needs no no_proxy at all. no_proxy only un-routes a proxy that is already set. On a freshly rented box with no http_proxy/https_proxy exported, adding no_proxy does nothing — add it only in the same breath as exporting a proxy (§5's "real overseas proxy" branch), and clear it when the proxy is unset.

Source: requests no_proxy https://github.com/psf/requests/issues/4871; no_proxy guide https://www.browserstack.com/guide/no_proxy-environment-variable; Clash pip ProxyError https://github.com/clash-verge-rev/clash-verge-rev/issues/2607.


5. Decision rule + delivery

Pick the cheapest route that reaches the weights, in order:

  1. hf-mirror first: HF_ENDPOINT=https://hf-mirror.com. Drop-in, same repo IDs, no proxy, no no_proxy to manage. Default for everything.
  2. ModelScope if the model is absent on the mirror or the mirror route is flaky — same Qwen / GLM / Llama weights domestically, Tier-3 resume, no GFW crossing.
  3. hfd.sh for any single file > 10 GB on a stable-but-slow link — aria2 multi-connection.
  4. A real overseas proxy ONLY when a model exists only on huggingface.co and neither mirror nor ModelScope carries it. The moment a proxy goes on, immediately apply the §4 no_proxy block so the domestic mirrors keep working — and unset both when the pull is done.

Never reach for a proxy by reflex: it is the slowest, most fragile option and the one that breaks everything else. Mirror → alt hub → multi-connection → proxy, in that order of preference.
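The ordering above can be written down as a small function. This is one reading of the rule, with a hypothetical name and inputs; in particular, falling back to hfd.sh when the mirror is flaky and ModelScope has no copy is an assumption, chosen because it still avoids the proxy.

```python
def pick_route(on_mirror, on_modelscope, mirror_flaky=False, largest_file_gb=0.0):
    """Return the cheapest route that reaches the weights.

    One of "hf-mirror", "modelscope", "hfd.sh", "proxy".
    """
    if on_mirror and not mirror_flaky:
        # Healthy mirror: go multi-connection only for very large files.
        return "hfd.sh" if largest_file_gb > 10 else "hf-mirror"
    if on_modelscope:
        # Absent on the mirror, or the mirror route is flaky.
        return "modelscope"
    if on_mirror:
        # Assumption: flaky mirror with no domestic copy still beats a proxy.
        return "hfd.sh"
    return "proxy"
```

Only the `"proxy"` result should trigger the §4 no_proxy block; every other branch runs on a clean, proxy-free environment.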

Ship scripts/setup-china-mirrors.sh: the orchestrator scps it onto the box and sources it on first connect. It bakes §1 (PyPI + conda mirrors), §2 (the four env switches + cache redirect off the system disk), and the §3 default (HF_HUB_ENABLE_HF_TRANSFER=0) into one idempotent step, leaving the §4 proxy block commented out (added only on the rare proxy branch). Author it with #!/usr/bin/env bash + set -u, forward-slash paths, and no unquoted | inside any grep pattern (the shell treats an unquoted | as a pipeline operator, so grep loses its file argument, reads stdin, and hangs the setup forever).

Source: HF-Mirror https://hf-mirror.com/; ModelScope https://github.com/modelscope/modelscope_hub.