# Reliability Principles for Training Jobs ## Contents - Principle 1: Always Verify Before Use - Principle 2: Prioritize Reliability Over Performance - Principle 3: Create Atomic, Self-Contained Scripts - Principle 4: Provide Clear Error Context - Principle 5: Test the Happy Path on Known-Good Inputs - Summary: The Reliability Checklist (pre-flight, script quality, job config) - When Principles Conflict --- These principles are derived from real production failures and successful fixes. Following them prevents common failure modes and ensures reliable job execution. ## Principle 1: Always Verify Before Use **Rule:** Never assume repos, datasets, or resources exist. Verify with tools first. ### What It Prevents - **Non-existent datasets** - Jobs fail immediately when dataset doesn't exist - **Typos in names** - Simple mistakes like "argilla-dpo-mix-7k" vs "ultrafeedback_binarized" - **Incorrect paths** - Old or moved repos, renamed files - **Missing dependencies** - Undocumented requirements ### How to Apply **Before submitting ANY job:** ```python # Verify dataset exists dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5}) hub_repo_details(["author/dataset-name"], repo_type="dataset") # Verify model exists hub_repo_details(["org/model-name"], repo_type="model") # Check script/file paths (for URL-based scripts) # Verify before using: https://github.com/user/repo/blob/main/script.py ``` **Examples that would have caught errors:** ```python # ❌ WRONG: Assumed dataset exists hf_jobs("uv", { "script": """...""", "env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"} # Doesn't exist! }) # ✅ CORRECT: Verify first dataset_search({"query": "argilla dpo", "author": "trl-lib"}) # Would show: "trl-lib/ultrafeedback_binarized" is the correct name hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset") # Confirms it exists before using ``` ### Implementation Checklist - [ ] Check dataset exists before training - [ ] Test script URLs are valid before submitting - [ ] Check for recent updates/renames of resources - [ ] Check for dataset format **Time cost:** 5-10 seconds **Time saved:** Hours of failed job time + debugging --- ## Principle 2: Prioritize Reliability Over Performance **Rule:** Default to what is most likely to succeed, not what is theoretically fastest. ### What It Prevents - **Hardware incompatibilities** - Features that fail on certain GPUs - **Unstable optimizations** - Speed-ups that cause crashes - **Complex configurations** - More failure points - **Build system issues** - Unreliable compilation methods ### How to Apply **Choose reliability:** ```python # ❌ RISKY: Aggressive optimization that may fail TrainingArguments( torch_compile=True, # Can fail on T4, A10G GPUs optim="adamw_bnb_8bit", # Requires specific setup dataloader_num_workers=8, # May cause OOM on small instances ... ) # ✅ SAFE: Proven defaults TrainingArguments( # torch_compile=True, # Commented with note: "Enable on H100 for 20% speedup" optim="adamw_torch", # Standard, always works fp16=True, # Stable and fast on T4/A10G dataloader_num_workers=4, # Conservative, reliable ... ) ``` ### Real-World Example **The `torch.compile` failure:** - Added for "20% speedup" on H100 - **Failed fatally on T4-medium** with cryptic error - Misdiagnosed as dataset issue (cost hours) - **Fix:** Disable by default, add as optional comment **Result:** Reliability > 20% performance gain ### Implementation Checklist - [ ] Use proven, standard configurations by default - [ ] Comment out performance optimizations with hardware notes - [ ] Use stable build systems (CMake > make) - [ ] Test on target hardware before production - [ ] Document known incompatibilities - [ ] Provide "safe" and "fast" variants when needed **Performance loss:** 10-20% in best case **Reliability gain:** 95%+ success rate vs 60-70% --- ## Principle 3: Create Atomic, Self-Contained Scripts **Rule:** Scripts should work as complete, independent units. Don't remove parts to "simplify." ### What It Prevents - **Missing dependencies** - Removed "unnecessary" packages that are actually required - **Incomplete processes** - Skipped steps that seem redundant - **Environment assumptions** - Scripts that need pre-setup - **Partial failures** - Some parts work, others fail silently ### How to Apply **Complete dependency specifications:** ```python # ❌ INCOMPLETE: "Simplified" by removing dependencies # /// script # dependencies = [ # "transformers", # "torch", # "datasets", # ] # /// # ✅ COMPLETE: All dependencies explicit # /// script # dependencies = [ # "transformers>=5.2.0", # "accelerate>=1.1.0", # "albumentations>=1.4.16", # Required for augmentation + bbox handling # "timm", # Required for vision backbones # "datasets>=4.0", # "torchmetrics", # Required for mAP/mAR computation # "pycocotools", # Required for COCO evaluation # "trackio", # Required for metrics monitoring # "huggingface_hub", # ] # /// ``` ### Real-World Example **The `albumentations` failure:** - Original script had it: augmentations and bbox clipping worked fine - "Simplified" version removed it: "not strictly needed for training" - **Training crashed on bbox augmentation** — no fallback for COCO-format bbox handling - Hard to debug: error appeared in data loading, not in augmentation setup - **Fix:** Restore all original dependencies **Result:** Don't remove dependencies without thorough testing ### Implementation Checklist - [ ] All dependencies in PEP 723 header with version pins - [ ] All system packages installed by script - [ ] No assumptions about pre-existing environment - [ ] No "optional" steps that are actually required - [ ] Test scripts in clean environment - [ ] Document why each dependency is needed **Complexity:** Slightly longer scripts **Reliability:** Scripts "just work" every time --- ## Principle 4: Provide Clear Error Context **Rule:** When things fail, make it obvious what went wrong and how to fix it. ### How to Apply **Wrap subprocess calls:** ```python # ❌ UNCLEAR: Silent failure subprocess.run([...], check=True, capture_output=True) # ✅ CLEAR: Shows what failed try: result = subprocess.run( [...], check=True, capture_output=True, text=True ) print(result.stdout) if result.stderr: print("Warnings:", result.stderr) except subprocess.CalledProcessError as e: print(f"❌ Command failed!") print("STDOUT:", e.stdout) print("STDERR:", e.stderr) raise ``` **Validate inputs:** ```python # ❌ UNCLEAR: Fails later with cryptic error model = load_model(MODEL_NAME) # ✅ CLEAR: Fails fast with clear message if not MODEL_NAME: raise ValueError("MODEL_NAME environment variable not set!") print(f"Loading model: {MODEL_NAME}") try: model = load_model(MODEL_NAME) print(f"✅ Model loaded successfully") except Exception as e: print(f"❌ Failed to load model: {MODEL_NAME}") print(f"Error: {e}") print("Hint: Check that model exists on Hub") raise ``` ### Implementation Checklist - [ ] Wrap external calls with try/except - [ ] Print stdout/stderr on failure - [ ] Validate environment variables early - [ ] Add progress indicators (✅, ❌, 🔄) - [ ] Include hints for common failures - [ ] Log configuration at start --- ## Principle 5: Test the Happy Path on Known-Good Inputs **Rule:** Before using new code in production, test with inputs you know work. ## Summary: The Reliability Checklist Before submitting ANY job: ### Pre-Flight Checks - [ ] **Verified** all repos/datasets exist (hub_repo_details) - [ ] **Tested** with known-good inputs if new code - [ ] **Using** proven hardware/configuration - [ ] **Included** all dependencies in PEP 723 header - [ ] **Installed** system requirements (build tools, etc.) - [ ] **Set** appropriate timeout (not default 30m) - [ ] **Configured** Hub push with HF_TOKEN (login() + hub_token) - [ ] **Added** clear error handling ### Script Quality - [ ] Self-contained (no external setup needed) - [ ] Complete dependencies listed - [ ] Build tools installed by script - [ ] Progress indicators included - [ ] Error messages are clear - [ ] Configuration logged at start ### Job Configuration - [ ] Timeout > expected runtime + 30% buffer - [ ] Hardware appropriate for model size - [ ] Secrets include HF_TOKEN (see SKILL.md directive #2 for syntax) - [ ] Script calls `login(token=hf_token)` and sets `training_args.hub_token = hf_token` BEFORE `Trainer()` init - [ ] Environment variables set correctly - [ ] Cost estimated and acceptable **Following these principles transforms job success rate from ~60-70% to ~95%+** --- ## When Principles Conflict Sometimes reliability and performance conflict. Here's how to choose: | Scenario | Choose | Rationale | |----------|--------|-----------| | Demo/test | Reliability | Fast failure is worse than slow success | | Production (first run) | Reliability | Prove it works before optimizing | | Production (proven) | Performance | Safe to optimize after validation | | Time-critical | Reliability | Failures cause more delay than slow runs | | Cost-critical | Balanced | Test with small model, then optimize | **General rule:** Reliability first, optimize second. ---