311 lines
9.3 KiB
Markdown
311 lines
9.3 KiB
Markdown
# Reliability Principles for Training Jobs
|
|
|
|
## Contents
|
|
- Principle 1: Always Verify Before Use
|
|
- Principle 2: Prioritize Reliability Over Performance
|
|
- Principle 3: Create Atomic, Self-Contained Scripts
|
|
- Principle 4: Provide Clear Error Context
|
|
- Principle 5: Test the Happy Path on Known-Good Inputs
|
|
- Summary: The Reliability Checklist (pre-flight, script quality, job config)
|
|
- When Principles Conflict
|
|
|
|
---
|
|
|
|
These principles are derived from real production failures and successful fixes. Following them prevents common failure modes and ensures reliable job execution.
|
|
|
|
## Principle 1: Always Verify Before Use
|
|
|
|
**Rule:** Never assume repos, datasets, or resources exist. Verify with tools first.
|
|
|
|
### What It Prevents
|
|
|
|
- **Non-existent datasets** - Jobs fail immediately when dataset doesn't exist
|
|
- **Typos in names** - Simple mistakes like "argilla-dpo-mix-7k" vs "ultrafeedback_binarized"
|
|
- **Incorrect paths** - Old or moved repos, renamed files
|
|
- **Missing dependencies** - Undocumented requirements
|
|
|
|
### How to Apply
|
|
|
|
**Before submitting ANY job:**
|
|
|
|
```python
|
|
# Verify dataset exists
|
|
dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
|
|
hub_repo_details(["author/dataset-name"], repo_type="dataset")
|
|
|
|
# Verify model exists
|
|
hub_repo_details(["org/model-name"], repo_type="model")
|
|
|
|
# Check script/file paths (for URL-based scripts)
|
|
# Verify before using: https://github.com/user/repo/blob/main/script.py
|
|
```
|
|
|
|
**Examples that would have caught errors:**
|
|
|
|
```python
|
|
# ❌ WRONG: Assumed dataset exists
|
|
hf_jobs("uv", {
|
|
"script": """...""",
|
|
"env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"} # Doesn't exist!
|
|
})
|
|
|
|
# ✅ CORRECT: Verify first
|
|
dataset_search({"query": "argilla dpo", "author": "trl-lib"})
|
|
# Would show: "trl-lib/ultrafeedback_binarized" is the correct name
|
|
|
|
hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
|
|
# Confirms it exists before using
|
|
```
|
|
|
|
### Implementation Checklist
|
|
|
|
- [ ] Check dataset exists before training
|
|
- [ ] Test script URLs are valid before submitting
|
|
- [ ] Check for recent updates/renames of resources
|
|
- [ ] Check for dataset format
|
|
|
|
**Time cost:** 5-10 seconds
|
|
**Time saved:** Hours of failed job time + debugging
|
|
|
|
---
|
|
|
|
## Principle 2: Prioritize Reliability Over Performance
|
|
|
|
**Rule:** Default to what is most likely to succeed, not what is theoretically fastest.
|
|
|
|
### What It Prevents
|
|
|
|
- **Hardware incompatibilities** - Features that fail on certain GPUs
|
|
- **Unstable optimizations** - Speed-ups that cause crashes
|
|
- **Complex configurations** - More failure points
|
|
- **Build system issues** - Unreliable compilation methods
|
|
|
|
### How to Apply
|
|
|
|
**Choose reliability:**
|
|
|
|
```python
|
|
# ❌ RISKY: Aggressive optimization that may fail
|
|
TrainingArguments(
|
|
torch_compile=True, # Can fail on T4, A10G GPUs
|
|
optim="adamw_bnb_8bit", # Requires specific setup
|
|
dataloader_num_workers=8, # May cause OOM on small instances
|
|
...
|
|
)
|
|
|
|
# ✅ SAFE: Proven defaults
|
|
TrainingArguments(
|
|
# torch_compile=True, # Commented with note: "Enable on H100 for 20% speedup"
|
|
optim="adamw_torch", # Standard, always works
|
|
fp16=True, # Stable and fast on T4/A10G
|
|
dataloader_num_workers=4, # Conservative, reliable
|
|
...
|
|
)
|
|
```
|
|
|
|
### Real-World Example
|
|
|
|
**The `torch.compile` failure:**
|
|
- Added for "20% speedup" on H100
|
|
- **Failed fatally on T4-medium** with cryptic error
|
|
- Misdiagnosed as dataset issue (cost hours)
|
|
- **Fix:** Disable by default, add as optional comment
|
|
|
|
**Result:** Reliability > 20% performance gain
|
|
|
|
### Implementation Checklist
|
|
|
|
- [ ] Use proven, standard configurations by default
|
|
- [ ] Comment out performance optimizations with hardware notes
|
|
- [ ] Use stable build systems (CMake > make)
|
|
- [ ] Test on target hardware before production
|
|
- [ ] Document known incompatibilities
|
|
- [ ] Provide "safe" and "fast" variants when needed
|
|
|
|
**Performance loss:** 10-20% in best case
|
|
**Reliability gain:** 95%+ success rate vs 60-70%
|
|
|
|
---
|
|
|
|
## Principle 3: Create Atomic, Self-Contained Scripts
|
|
|
|
**Rule:** Scripts should work as complete, independent units. Don't remove parts to "simplify."
|
|
|
|
### What It Prevents
|
|
|
|
- **Missing dependencies** - Removed "unnecessary" packages that are actually required
|
|
- **Incomplete processes** - Skipped steps that seem redundant
|
|
- **Environment assumptions** - Scripts that need pre-setup
|
|
- **Partial failures** - Some parts work, others fail silently
|
|
|
|
### How to Apply
|
|
|
|
**Complete dependency specifications:**
|
|
|
|
```python
|
|
# ❌ INCOMPLETE: "Simplified" by removing dependencies
|
|
# /// script
|
|
# dependencies = [
|
|
# "transformers",
|
|
# "torch",
|
|
# "datasets",
|
|
# ]
|
|
# ///
|
|
|
|
# ✅ COMPLETE: All dependencies explicit
|
|
# /// script
|
|
# dependencies = [
|
|
# "transformers>=5.2.0",
|
|
# "accelerate>=1.1.0",
|
|
# "albumentations>=1.4.16", # Required for augmentation + bbox handling
|
|
# "timm", # Required for vision backbones
|
|
# "datasets>=4.0",
|
|
# "torchmetrics", # Required for mAP/mAR computation
|
|
# "pycocotools", # Required for COCO evaluation
|
|
# "trackio", # Required for metrics monitoring
|
|
# "huggingface_hub",
|
|
# ]
|
|
# ///
|
|
```
|
|
|
|
### Real-World Example
|
|
|
|
**The `albumentations` failure:**
|
|
- Original script had it: augmentations and bbox clipping worked fine
|
|
- "Simplified" version removed it: "not strictly needed for training"
|
|
- **Training crashed on bbox augmentation** — no fallback for COCO-format bbox handling
|
|
- Hard to debug: error appeared in data loading, not in augmentation setup
|
|
- **Fix:** Restore all original dependencies
|
|
|
|
**Result:** Don't remove dependencies without thorough testing
|
|
|
|
### Implementation Checklist
|
|
|
|
- [ ] All dependencies in PEP 723 header with version pins
|
|
- [ ] All system packages installed by script
|
|
- [ ] No assumptions about pre-existing environment
|
|
- [ ] No "optional" steps that are actually required
|
|
- [ ] Test scripts in clean environment
|
|
- [ ] Document why each dependency is needed
|
|
|
|
**Complexity:** Slightly longer scripts
|
|
**Reliability:** Scripts "just work" every time
|
|
|
|
---
|
|
|
|
## Principle 4: Provide Clear Error Context
|
|
|
|
**Rule:** When things fail, make it obvious what went wrong and how to fix it.
|
|
|
|
### How to Apply
|
|
|
|
**Wrap subprocess calls:**
|
|
|
|
```python
|
|
# ❌ UNCLEAR: Silent failure
|
|
subprocess.run([...], check=True, capture_output=True)
|
|
|
|
# ✅ CLEAR: Shows what failed
|
|
try:
|
|
result = subprocess.run(
|
|
[...],
|
|
check=True,
|
|
capture_output=True,
|
|
text=True
|
|
)
|
|
print(result.stdout)
|
|
if result.stderr:
|
|
print("Warnings:", result.stderr)
|
|
except subprocess.CalledProcessError as e:
|
|
print(f"❌ Command failed!")
|
|
print("STDOUT:", e.stdout)
|
|
print("STDERR:", e.stderr)
|
|
raise
|
|
```
|
|
|
|
**Validate inputs:**
|
|
|
|
```python
|
|
# ❌ UNCLEAR: Fails later with cryptic error
|
|
model = load_model(MODEL_NAME)
|
|
|
|
# ✅ CLEAR: Fails fast with clear message
|
|
if not MODEL_NAME:
|
|
raise ValueError("MODEL_NAME environment variable not set!")
|
|
|
|
print(f"Loading model: {MODEL_NAME}")
|
|
try:
|
|
model = load_model(MODEL_NAME)
|
|
print(f"✅ Model loaded successfully")
|
|
except Exception as e:
|
|
print(f"❌ Failed to load model: {MODEL_NAME}")
|
|
print(f"Error: {e}")
|
|
print("Hint: Check that model exists on Hub")
|
|
raise
|
|
```
|
|
|
|
### Implementation Checklist
|
|
|
|
- [ ] Wrap external calls with try/except
|
|
- [ ] Print stdout/stderr on failure
|
|
- [ ] Validate environment variables early
|
|
- [ ] Add progress indicators (✅, ❌, 🔄)
|
|
- [ ] Include hints for common failures
|
|
- [ ] Log configuration at start
|
|
|
|
---
|
|
|
|
## Principle 5: Test the Happy Path on Known-Good Inputs
|
|
|
|
**Rule:** Before using new code in production, test with inputs you know work.
|
|
|
|
## Summary: The Reliability Checklist
|
|
|
|
Before submitting ANY job:
|
|
|
|
### Pre-Flight Checks
|
|
- [ ] **Verified** all repos/datasets exist (hub_repo_details)
|
|
- [ ] **Tested** with known-good inputs if new code
|
|
- [ ] **Using** proven hardware/configuration
|
|
- [ ] **Included** all dependencies in PEP 723 header
|
|
- [ ] **Installed** system requirements (build tools, etc.)
|
|
- [ ] **Set** appropriate timeout (not default 30m)
|
|
- [ ] **Configured** Hub push with HF_TOKEN (login() + hub_token)
|
|
- [ ] **Added** clear error handling
|
|
|
|
### Script Quality
|
|
- [ ] Self-contained (no external setup needed)
|
|
- [ ] Complete dependencies listed
|
|
- [ ] Build tools installed by script
|
|
- [ ] Progress indicators included
|
|
- [ ] Error messages are clear
|
|
- [ ] Configuration logged at start
|
|
|
|
### Job Configuration
|
|
- [ ] Timeout > expected runtime + 30% buffer
|
|
- [ ] Hardware appropriate for model size
|
|
- [ ] Secrets include HF_TOKEN (see SKILL.md directive #2 for syntax)
|
|
- [ ] Script calls `login(token=hf_token)` and sets `training_args.hub_token = hf_token` BEFORE `Trainer()` init
|
|
- [ ] Environment variables set correctly
|
|
- [ ] Cost estimated and acceptable
|
|
|
|
**Following these principles transforms job success rate from ~60-70% to ~95%+**
|
|
|
|
---
|
|
|
|
## When Principles Conflict
|
|
|
|
Sometimes reliability and performance conflict. Here's how to choose:
|
|
|
|
| Scenario | Choose | Rationale |
|
|
|----------|--------|-----------|
|
|
| Demo/test | Reliability | Fast failure is worse than slow success |
|
|
| Production (first run) | Reliability | Prove it works before optimizing |
|
|
| Production (proven) | Performance | Safe to optimize after validation |
|
|
| Time-critical | Reliability | Failures cause more delay than slow runs |
|
|
| Cost-critical | Balanced | Test with small model, then optimize |
|
|
|
|
**General rule:** Reliability first, optimize second.
|
|
|
|
---
|