Reliability Principles for Training Jobs
Contents
- Principle 1: Always Verify Before Use
- Principle 2: Prioritize Reliability Over Performance
- Principle 3: Create Atomic, Self-Contained Scripts
- Principle 4: Provide Clear Error Context
- Principle 5: Test the Happy Path on Known-Good Inputs
- Summary: The Reliability Checklist (pre-flight, script quality, job config)
- When Principles Conflict
These principles are derived from real production failures and the fixes that resolved them. Following them prevents common failure modes and makes job execution more reliable.
Principle 1: Always Verify Before Use
Rule: Never assume repos, datasets, or resources exist. Verify with tools first.
What It Prevents
- Non-existent datasets - Jobs fail immediately when dataset doesn't exist
- Wrong or misremembered names - Simple mistakes like using "argilla-dpo-mix-7k" when the dataset is actually "ultrafeedback_binarized"
- Incorrect paths - Old or moved repos, renamed files
- Missing dependencies - Undocumented requirements
How to Apply
Before submitting ANY job:
# Verify dataset exists
dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
hub_repo_details(["author/dataset-name"], repo_type="dataset")
# Verify model exists
hub_repo_details(["org/model-name"], repo_type="model")
# Check script/file paths (for URL-based scripts)
# Verify before using: https://github.com/user/repo/blob/main/script.py
Examples that would have caught errors:
# ❌ WRONG: Assumed dataset exists
hf_jobs("uv", {
    "script": """...""",
    "env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"}  # Doesn't exist!
})
# ✅ CORRECT: Verify first
dataset_search({"query": "argilla dpo", "author": "trl-lib"})
# Would show: "trl-lib/ultrafeedback_binarized" is the correct name
hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
# Confirms it exists before using
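The same check can be scripted so that it runs automatically before every submission. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that its `repo_exists` helper is a sufficient existence check; `verify_resources` and its `exists` parameter are hypothetical names introduced here for illustration.

```python
from typing import Callable, Iterable, Optional, Tuple


def _hub_exists(repo_id: str, repo_type: str) -> bool:
    # Imported lazily so the helper can be exercised offline with a stub.
    from huggingface_hub import repo_exists
    return repo_exists(repo_id, repo_type=repo_type)


def verify_resources(
    resources: Iterable[Tuple[str, str]],
    exists: Optional[Callable[[str, str], bool]] = None,
) -> None:
    """Raise ValueError naming every (repo_id, repo_type) that cannot be found."""
    check = exists or _hub_exists
    missing = [(rid, rtype) for rid, rtype in resources if not check(rid, rtype)]
    if missing:
        details = ", ".join(f"{rtype} '{rid}'" for rid, rtype in missing)
        raise ValueError(f"Not found on the Hub: {details}")
```

Calling `verify_resources([("trl-lib/ultrafeedback_binarized", "dataset")])` before `hf_jobs(...)` would stop the submission early if the name is wrong. The optional `exists` argument lets the check be tested without network access.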
Implementation Checklist
- Check dataset exists before training
- Test script URLs are valid before submitting
- Check for recent updates/renames of resources
- Check that the dataset format matches what the training script expects
Time cost: 5-10 seconds
Time saved: Hours of failed job time + debugging
Principle 2: Prioritize Reliability Over Performance
Rule: Default to what is most likely to succeed, not what is theoretically fastest.
What It Prevents
- Hardware incompatibilities - Features that fail on certain GPUs
- Unstable optimizations - Speed-ups that cause crashes
- Complex configurations - More failure points
- Build system issues - Unreliable compilation methods
How to Apply
Choose reliability:
# ❌ RISKY: Aggressive optimization that may fail
TrainingArguments(
    torch_compile=True,  # Can fail on T4, A10G GPUs
    optim="adamw_bnb_8bit",  # Requires specific setup
    dataloader_num_workers=8,  # May cause OOM on small instances
    ...
)
# ✅ SAFE: Proven defaults
TrainingArguments(
    # torch_compile=True,  # Commented with note: "Enable on H100 for 20% speedup"
    optim="adamw_torch",  # Standard, always works
    fp16=True,  # Stable and fast on T4/A10G
    dataloader_num_workers=4,  # Conservative, reliable
    ...
)
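Instead of commenting optimizations in and out by hand, the "safe by default, fast only where proven" choice can be encoded once. This is a minimal sketch, not a compatibility matrix: `training_kwargs` and `COMPILE_VALIDATED_GPUS` are hypothetical names, and the only validated hardware listed is the H100 mentioned in the comment above.

```python
SAFE_DEFAULTS = {
    "torch_compile": False,
    "optim": "adamw_torch",
    "fp16": True,
    "dataloader_num_workers": 4,
}

# Assumption: list only hardware where the optimization has actually been validated.
COMPILE_VALIDATED_GPUS = {"h100"}


def training_kwargs(gpu: str, profile: str = "safe") -> dict:
    """Return keyword arguments for TrainingArguments for the given profile."""
    if profile not in ("safe", "fast"):
        raise ValueError(f"Unknown profile: {profile!r}")
    kwargs = dict(SAFE_DEFAULTS)
    # The fast profile only changes anything on validated hardware.
    if profile == "fast" and gpu.lower() in COMPILE_VALIDATED_GPUS:
        kwargs["torch_compile"] = True
    return kwargs
```

The result can be unpacked into the constructor, for example `TrainingArguments(output_dir=..., **training_kwargs("h100", "fast"))`. Requesting the fast profile on unlisted hardware silently falls back to the safe defaults, which matches the rule of defaulting to what is most likely to succeed.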
Real-World Example
The torch.compile failure:
- Added for "20% speedup" on H100
- Failed fatally on T4-medium with cryptic error
- Misdiagnosed as dataset issue (cost hours)
- Fix: Disable by default, add as optional comment
Result: Reliability > 20% performance gain
Implementation Checklist
- Use proven, standard configurations by default
- Comment out performance optimizations with hardware notes
- Use stable build systems (CMake > make)
- Test on target hardware before production
- Document known incompatibilities
- Provide "safe" and "fast" variants when needed
Performance cost: 10-20% at most, and only on hardware where the optimization would have worked
Reliability gain: 95%+ success rate vs 60-70%
Principle 3: Create Atomic, Self-Contained Scripts
Rule: Scripts should work as complete, independent units. Don't remove parts to "simplify."
What It Prevents
- Missing dependencies - Removed "unnecessary" packages that are actually required
- Incomplete processes - Skipped steps that seem redundant
- Environment assumptions - Scripts that need pre-setup
- Partial failures - Some parts work, others fail silently
How to Apply
Complete dependency specifications:
# ❌ INCOMPLETE: "Simplified" by removing dependencies
# /// script
# dependencies = [
# "transformers",
# "torch",
# "datasets",
# ]
# ///
# ✅ COMPLETE: All dependencies explicit
# /// script
# dependencies = [
# "transformers>=5.2.0",
# "accelerate>=1.1.0",
# "albumentations>=1.4.16", # Required for augmentation + bbox handling
# "timm", # Required for vision backbones
# "datasets>=4.0",
# "torchmetrics", # Required for mAP/mAR computation
# "pycocotools", # Required for COCO evaluation
# "trackio", # Required for metrics monitoring
# "huggingface_hub",
# ]
# ///
Real-World Example
The albumentations failure:
- Original script had it: augmentations and bbox clipping worked fine
- "Simplified" version removed it: "not strictly needed for training"
- Training crashed on bbox augmentation — no fallback for COCO-format bbox handling
- Hard to debug: error appeared in data loading, not in augmentation setup
- Fix: Restore all original dependencies
Result: Don't remove dependencies without thorough testing
Implementation Checklist
- All dependencies in PEP 723 header with version pins
- All system packages installed by script
- No assumptions about pre-existing environment
- No "optional" steps that are actually required
- Test scripts in clean environment
- Document why each dependency is needed
Complexity: Slightly longer scripts
Reliability: Scripts "just work" every time
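A missing dependency is easier to diagnose when the script checks for it at startup rather than failing deep inside data loading. The sketch below uses only the standard library; `missing_modules` and `require_modules` are hypothetical helper names. It takes top-level import names, which can differ from the package names in the PEP 723 header.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(modules: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def require_modules(modules: Iterable[str]) -> None:
    """Fail fast, naming every missing module, before any training work starts."""
    missing = missing_modules(modules)
    if missing:
        raise ImportError(
            "Missing modules: " + ", ".join(missing)
            + ". Add them to the PEP 723 dependencies header."
        )
```

A script could call `require_modules(["albumentations", "pycocotools", "torchmetrics"])` as its first statement, so a "simplified" header fails immediately with a message that names the removed package.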
Principle 4: Provide Clear Error Context
Rule: When things fail, make it obvious what went wrong and how to fix it.
How to Apply
Wrap subprocess calls:
# ❌ UNCLEAR: Silent failure
subprocess.run([...], check=True, capture_output=True)
# ✅ CLEAR: Shows what failed
try:
    result = subprocess.run(
        [...],
        check=True,
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    if result.stderr:
        print("Warnings:", result.stderr)
except subprocess.CalledProcessError as e:
    # Surface the exit code and both output streams before re-raising
    print(f"❌ Command failed with exit code {e.returncode}!")
    print("STDOUT:", e.stdout)
    print("STDERR:", e.stderr)
    raise
Validate inputs:
# ❌ UNCLEAR: Fails later with cryptic error
model = load_model(MODEL_NAME)
# ✅ CLEAR: Fails fast with clear message
if not MODEL_NAME:
    raise ValueError("MODEL_NAME environment variable not set!")
print(f"Loading model: {MODEL_NAME}")
try:
    model = load_model(MODEL_NAME)
    print("✅ Model loaded successfully")
except Exception as e:
    print(f"❌ Failed to load model: {MODEL_NAME}")
    print(f"Error: {e}")
    print("Hint: Check that model exists on Hub")
    raise
Implementation Checklist
- Wrap external calls with try/except
- Print stdout/stderr on failure
- Validate environment variables early
- Add progress indicators (✅, ❌, 🔄)
- Include hints for common failures
- Log configuration at start
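Two of the checklist items, validating environment variables early and logging configuration at start, fit naturally into one helper. This is a minimal sketch: `load_config` is a hypothetical name, and masking any variable whose name contains "TOKEN" is an assumption about what counts as a secret.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def load_config(
    required: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Validate required environment variables, log them, and return them."""
    names = list(required)
    source = os.environ if env is None else env
    # Treat unset and empty variables the same way: both are missing.
    missing = [name for name in names if not source.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    config = {name: source[name] for name in names}
    for name, value in config.items():
        # Never print secrets: mask anything that looks like a token.
        shown = "***" if "TOKEN" in name.upper() else value
        print(f"config {name}={shown}")
    return config
```

Calling `load_config(["MODEL_NAME", "DATASET", "HF_TOKEN"])` at the top of a script reports every missing variable in a single error instead of one per failed run.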
Principle 5: Test the Happy Path on Known-Good Inputs
Rule: Before using new code in production, test with inputs you know work.
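One way to apply this rule is a small smoke test that runs new code on inputs whose correct output is already known, before the code is used in a real job. The sketch below is illustrative only: `smoke_test` and the `to_prompt` preprocessing step are hypothetical and do not come from any particular training script.

```python
from typing import Any, Callable, Iterable, Tuple


def smoke_test(fn: Callable[[Any], Any], cases: Iterable[Tuple[Any, Any]]) -> None:
    """Run fn on known-good inputs and raise if any output differs from the expected one."""
    for known_input, expected in cases:
        actual = fn(known_input)
        if actual != expected:
            raise AssertionError(
                f"Smoke test failed for {known_input!r}: "
                f"expected {expected!r}, got {actual!r}"
            )


# Hypothetical preprocessing step used only to illustrate the pattern.
def to_prompt(example: dict) -> str:
    return f"{example['question'].strip()}\n"


smoke_test(to_prompt, [({"question": "  What is 2+2? "}, "What is 2+2?\n")])
```

If the smoke test passes and the production job later fails, the new code is a less likely culprit than the production inputs or configuration, which narrows the search.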
Summary: The Reliability Checklist
Before submitting ANY job:
Pre-Flight Checks
- Verified all repos/datasets exist (hub_repo_details)
- Tested with known-good inputs if new code
- Using proven hardware/configuration
- Included all dependencies in PEP 723 header
- Installed system requirements (build tools, etc.)
- Set appropriate timeout (not default 30m)
- Configured Hub push with HF_TOKEN (login() + hub_token)
- Added clear error handling
Script Quality
- Self-contained (no external setup needed)
- Complete dependencies listed
- Build tools installed by script
- Progress indicators included
- Error messages are clear
- Configuration logged at start
Job Configuration
- Timeout > expected runtime + 30% buffer
- Hardware appropriate for model size
- Secrets include HF_TOKEN (see SKILL.md directive #2 for syntax)
- Script calls login(token=hf_token) and sets training_args.hub_token = hf_token BEFORE Trainer() init
- Environment variables set correctly
- Cost estimated and acceptable
Following these principles raises the job success rate from ~60-70% to ~95%+.
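The timeout rule in the job configuration checklist is simple arithmetic, sketched below. `timeout_minutes` is a hypothetical helper, and the result is a minimum: round it up further to whatever unit the job system expects.

```python
import math


def timeout_minutes(expected_minutes: int, buffer_percent: int = 30) -> int:
    """Return the expected runtime plus a percentage buffer, rounded up to whole minutes."""
    if expected_minutes <= 0:
        raise ValueError("expected_minutes must be positive")
    if buffer_percent < 0:
        raise ValueError("buffer_percent must not be negative")
    # Integer arithmetic before the division avoids floating-point drift.
    return math.ceil(expected_minutes * (100 + buffer_percent) / 100)
```

For example, a job expected to run for 2 hours (120 minutes) should be given a timeout of at least `timeout_minutes(120)`, which is 156 minutes.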
When Principles Conflict
Sometimes reliability and performance conflict. Here's how to choose:
| Scenario | Choose | Rationale |
|---|---|---|
| Demo/test | Reliability | Fast failure is worse than slow success |
| Production (first run) | Reliability | Prove it works before optimizing |
| Production (proven) | Performance | Safe to optimize after validation |
| Time-critical | Reliability | Failures cause more delay than slow runs |
| Cost-critical | Balanced | Test with small model, then optimize |
General rule: Reliability first, optimize second.