Reliability Principles for Training Jobs

These principles are derived from real production failures and the fixes that resolved them. Following them prevents the most common failure modes and makes job execution far more reliable.

Principle 1: Always Verify Before Use

Rule: Never assume repos, datasets, or resources exist. Verify with tools first.

What It Prevents

Non-existent datasets - Jobs fail immediately when the dataset doesn't exist
Wrong or mistyped names - Simple mistakes like guessing "argilla-dpo-mix-7k" when the dataset is actually "ultrafeedback_binarized"
Incorrect paths - Old or moved repos, renamed files
Missing dependencies - Undocumented requirements

How to Apply

Before submitting ANY job:

# Verify dataset exists
dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
hub_repo_details(["author/dataset-name"], repo_type="dataset")

# Verify model exists
hub_repo_details(["org/model-name"], repo_type="model")

# Check script/file paths (for URL-based scripts)
# Verify before using: https://github.com/user/repo/blob/main/script.py

Examples that would have caught errors:

# ❌ WRONG: Assumed dataset exists
hf_jobs("uv", {
    "script": """...""",
    "env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"}  # Doesn't exist!
})

# ✅ CORRECT: Verify first
dataset_search({"query": "argilla dpo", "author": "trl-lib"})
# Would show: "trl-lib/ultrafeedback_binarized" is the correct name

hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
# Confirms it exists before using

Implementation Checklist

Check dataset exists before training
Verify base model exists before fine-tuning
Confirm adapter model exists before GGUF conversion
Test script URLs are valid before submitting
Validate file paths in repositories
Check for recent updates/renames of resources

Time cost: 5-10 seconds
Time saved: Hours of failed job time + debugging
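
The same check can also run inside the job script itself, before any GPU time is spent. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and using its `HfApi.repo_exists` method. The helper names (`hub_exists`, `find_missing_repos`, `require_repos`) and the injectable `exists` parameter are hypothetical, added here so the logic can be exercised without network access.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Repo = Tuple[str, str]  # (repo_id, repo_type), e.g. ("org/name", "dataset")


def hub_exists(repo_id: str, repo_type: str) -> bool:
    """Ask the Hub whether a repo exists (needs network access)."""
    # Imported lazily so the rest of the module works without the package.
    from huggingface_hub import HfApi

    return HfApi().repo_exists(repo_id, repo_type=repo_type)


def find_missing_repos(
    repos: Iterable[Repo],
    exists: Optional[Callable[[str, str], bool]] = None,
) -> List[Repo]:
    """Return every (repo_id, repo_type) pair that the checker cannot find."""
    check = exists or hub_exists
    return [(rid, rtype) for rid, rtype in repos if not check(rid, rtype)]


def require_repos(
    repos: Iterable[Repo],
    exists: Optional[Callable[[str, str], bool]] = None,
) -> None:
    """Fail fast with a clear message if any required repo is missing."""
    missing = find_missing_repos(repos, exists)
    if missing:
        names = ", ".join(f"{rtype}:{rid}" for rid, rtype in missing)
        raise ValueError(f"Repos not found on the Hub: {names}")
```

Calling `require_repos([("org/model-name", "model"), ("author/dataset-name", "dataset")])` at the top of a script would stop the job in its first seconds rather than after setup has completed.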

Principle 2: Prioritize Reliability Over Performance

Rule: Default to what is most likely to succeed, not what is theoretically fastest.

What It Prevents

Hardware incompatibilities - Features that fail on certain GPUs
Unstable optimizations - Speed-ups that cause crashes
Complex configurations - More failure points
Build system issues - Unreliable compilation methods

How to Apply

Choose reliability:

# ❌ RISKY: Aggressive optimization that may fail
SFTConfig(
    torch_compile=True,  # Can fail on T4, A10G GPUs
    optim="adamw_bnb_8bit",  # Requires specific setup
    bf16=True,  # Not supported on pre-Ampere GPUs such as T4
    ...
)

# ✅ SAFE: Proven defaults
SFTConfig(
    # torch_compile=True,  # Commented with note: "Enable on H100 for 20% speedup"
    optim="adamw_torch",  # Standard, always works
    fp16=True,  # Stable and fast
    ...
)

For build processes:

# ❌ UNRELIABLE: Uses make (platform-dependent)
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"], check=True)

# ✅ RELIABLE: Uses CMake (consistent, documented)
subprocess.run([
    "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
    "-DGGML_CUDA=OFF"  # Disable CUDA for faster, more reliable build
], check=True)

subprocess.run([
    "cmake", "--build", "/tmp/llama.cpp/build",
    "--target", "llama-quantize", "-j", "4"
], check=True)

Real-World Example

The torch.compile failure:

Added for "20% speedup" on H100
Failed fatally on T4-medium with cryptic error
Misdiagnosed as dataset issue (cost hours)
Fix: Disable by default, add as optional comment

Result: Reliability > 20% performance gain

Implementation Checklist

Use proven, standard configurations by default
Comment out performance optimizations with hardware notes
Use stable build systems (CMake > make)
Test on target hardware before production
Document known incompatibilities
Provide "safe" and "fast" variants when needed

Performance loss: roughly 10-20% compared with a fully optimized run
Reliability gain: 95%+ success rate vs 60-70%
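
One way to provide "safe" and "fast" variants is to make the fast path an explicit opt-in that only takes effect on hardware where it has been validated. This is a minimal sketch under stated assumptions: the `FAST_PATH_GPUS` allow-list and the `training_overrides` helper are hypothetical, and the returned keys mirror the `optim` and `torch_compile` arguments shown in the safe example.

```python
from typing import Any, Dict

# Hypothetical allow-list: GPUs on which the optional speed-ups were tested.
FAST_PATH_GPUS = ("H100",)


def training_overrides(gpu_name: str, allow_fast: bool = False) -> Dict[str, Any]:
    """Return config keyword arguments, enabling speed-ups only when opted in."""
    config: Dict[str, Any] = {
        "optim": "adamw_torch",  # standard optimizer from the safe example
        "torch_compile": False,  # off unless explicitly validated
    }
    validated = any(tag in gpu_name.upper() for tag in FAST_PATH_GPUS)
    if allow_fast and validated:
        config["torch_compile"] = True
    return config
```

The result can be unpacked into the config, for example `SFTConfig(**training_overrides(gpu_name), ...)`. Requesting the fast path on a T4 silently falls back to the safe settings instead of crashing mid-job.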

Principle 3: Create Atomic, Self-Contained Scripts

Rule: Scripts should work as complete, independent units. Don't remove parts to "simplify."

What It Prevents

Missing dependencies - Removed "unnecessary" packages that are actually required
Incomplete processes - Skipped steps that seem redundant
Environment assumptions - Scripts that need pre-setup
Partial failures - Some parts work, others fail silently

How to Apply

Complete dependency specifications:

# ❌ INCOMPLETE: "Simplified" by removing dependencies
# /// script
# dependencies = [
#     "transformers",
#     "peft",
#     "torch",
# ]
# ///

# ✅ COMPLETE: All dependencies explicit
# /// script
# dependencies = [
#     "transformers>=4.36.0",
#     "peft>=0.7.0",
#     "torch>=2.0.0",
#     "accelerate>=0.24.0",
#     "huggingface_hub>=0.20.0",
#     "sentencepiece>=0.1.99",  # Required for tokenizers
#     "protobuf>=3.20.0",        # Required for tokenizers
#     "numpy",
#     "gguf",
# ]
# ///

Complete build processes:

# ❌ INCOMPLETE: Assumes build tools exist
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"])  # FAILS: no gcc/make

# ✅ COMPLETE: Installs all requirements
subprocess.run(["apt-get", "update", "-qq"], check=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True)
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"], check=True)
# ... then build

Real-World Example

The sentencepiece failure:

Original script had it: worked fine
"Simplified" version removed it: "doesn't look necessary"
GGUF conversion failed silently - tokenizer couldn't convert
Hard to debug: no obvious error message
Fix: Restore all original dependencies

Result: Don't remove dependencies without thorough testing

Implementation Checklist

All dependencies in PEP 723 header with version constraints
All system packages installed by script
No assumptions about pre-existing environment
No "optional" steps that are actually required
Test scripts in clean environment
Document why each dependency is needed

Complexity: Slightly longer scripts
Reliability: Scripts "just work" every time
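
After the install step, a script can confirm that the tools it is about to call are actually on PATH, so a failed or skipped install surfaces immediately rather than as a confusing build error. A minimal sketch using the standard library's `shutil.which`; the helper names `missing_tools` and `require_tools` are hypothetical.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def require_tools(tools: Iterable[str]) -> None:
    """Raise with an actionable message if any build tool is absent."""
    missing = missing_tools(tools)
    if missing:
        raise RuntimeError(
            "Missing build tools: " + ", ".join(missing)
            + ". Install them in the script (e.g. via apt-get) before building."
        )
```

For the llama.cpp build shown earlier, a call such as `require_tools(["git", "cmake"])` just before cloning would cover the commands that the script invokes directly.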

Principle 4: Provide Clear Error Context

Rule: When things fail, make it obvious what went wrong and how to fix it.

How to Apply

Wrap subprocess calls:

# ❌ UNCLEAR: Output is captured but never shown when the command fails
subprocess.run([...], check=True, capture_output=True)

# ✅ CLEAR: Shows what failed
try:
    result = subprocess.run(
        [...],
        check=True,
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.stderr:
        print("Warnings:", result.stderr)
except subprocess.CalledProcessError as e:
    print(f"❌ Command failed with exit code {e.returncode}")
    print("STDOUT:", e.stdout)
    print("STDERR:", e.stderr)
    raise

Validate inputs:

# ❌ UNCLEAR: Fails later with cryptic error
model = load_model(MODEL_NAME)

# ✅ CLEAR: Fails fast with clear message
if not MODEL_NAME:
    raise ValueError("MODEL_NAME environment variable not set!")

print(f"Loading model: {MODEL_NAME}")
try:
    model = load_model(MODEL_NAME)
    print(f"✅ Model loaded successfully: {MODEL_NAME}")
except Exception as e:
    print(f"❌ Failed to load model: {MODEL_NAME}")
    print(f"Error: {e}")
    print("Hint: Check that model exists on Hub")
    raise

Implementation Checklist

Wrap external calls with try/except
Print stdout/stderr on failure
Validate environment variables early
Add progress indicators (✅, ❌, 🔄)
Include hints for common failures
Log configuration at start
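
The "validate early" and "log configuration" items can be combined into a single startup step. The sketch below is one possible shape, not a prescribed one: `load_config` is a hypothetical helper that reads required variables, reports every missing one in a single error, and prints the resolved values.

```python
import os
from typing import Dict, Mapping, Optional, Sequence


def load_config(
    required: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Read required variables, reporting all missing ones in one error."""
    source = os.environ if env is None else env
    # Treat unset and empty values the same way: both are unusable.
    missing = [name for name in required if not source.get(name)]
    if missing:
        raise ValueError(
            "Missing environment variables: " + ", ".join(missing)
        )
    config = {name: source[name] for name in required}
    # Log the resolved configuration up front so later failures have context.
    # Do not list secrets (such as tokens) here, because values are printed.
    for name, value in config.items():
        print(f"{name}={value}")
    return config
```

Reporting all missing variables at once avoids the fix-one-resubmit-fail-again loop that a check-per-variable approach produces.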

Principle 5: Test the Happy Path on Known-Good Inputs

Rule: Before using new code in production, test with inputs you know work.

How to Apply

Known-good test inputs:

# For training
TEST_DATASET = "trl-lib/Capybara"  # Small, well-formatted, widely used
TEST_MODEL = "Qwen/Qwen2.5-0.5B"  # Small, fast, reliable

# For GGUF conversion
TEST_ADAPTER = "evalstate/qwen-capybara-medium"  # Known working model
TEST_BASE = "Qwen/Qwen2.5-0.5B"  # Compatible base

Testing workflow:

Test with known-good inputs first
If that works, try production inputs
If production fails, you know it's the inputs (not code)
Isolate the difference

Implementation Checklist

Maintain list of known-good test models/datasets
Test new scripts with test inputs first
Document what makes inputs "good"
Keep test jobs cheap (small models, short timeouts)
Only move to production after test succeeds

Time cost: 5-10 minutes for test run
Debugging time saved: Hours
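
The testing workflow above can be sketched as a small wrapper that runs the known-good inputs first and labels which stage failed. Everything here is illustrative: `run_staged` and the `job` callable are hypothetical stand-ins for whatever function actually submits or runs the work.

```python
from typing import Any, Callable, Mapping


def run_staged(
    job: Callable[[Mapping[str, Any]], Any],
    test_inputs: Mapping[str, Any],
    prod_inputs: Mapping[str, Any],
) -> Any:
    """Run `job` on known-good inputs first, then on production inputs."""
    try:
        job(test_inputs)
    except Exception as exc:
        # Known-good inputs failed, so the code or environment is at fault.
        raise RuntimeError(
            "Failed on known-good inputs: suspect the code"
        ) from exc
    try:
        return job(prod_inputs)
    except Exception as exc:
        # The code just worked on test inputs, so the inputs are the suspect.
        raise RuntimeError(
            "Failed on production inputs: suspect the inputs"
        ) from exc
```

The original exception stays attached through `raise ... from exc`, so the full traceback is still available while the top-level message already narrows down where to look.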

Summary: The Reliability Checklist

Before submitting ANY job:

Pre-Flight Checks

Verified all repos/datasets exist (hub_repo_details)
Tested with known-good inputs if new code
Using proven hardware/configuration
Included all dependencies in PEP 723 header
Installed system requirements (build tools, etc.)
Set appropriate timeout (not default 30m)
Configured Hub push with HF_TOKEN
Added clear error handling

Script Quality

Self-contained (no external setup needed)
Complete dependencies listed
Build tools installed by script
Progress indicators included
Error messages are clear
Configuration logged at start

Job Configuration

Timeout > expected runtime + 30% buffer
Hardware appropriate for model size
Secrets include HF_TOKEN
Environment variables set correctly
Cost estimated and acceptable
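
The timeout rule is simple arithmetic, sketched here for concreteness. The `timeout_minutes` helper is hypothetical, and it returns a plain number of minutes because the exact timeout format expected by the job runner is not specified in this document.

```python
def timeout_minutes(expected_minutes: int, buffer_percent: int = 30) -> int:
    """Smallest whole-minute timeout strictly above expected runtime plus buffer."""
    if expected_minutes <= 0:
        raise ValueError("expected_minutes must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    padded = expected_minutes * (100 + buffer_percent)
    return padded // 100 + 1
```

For a job expected to take 60 minutes, this gives 79 minutes: 60 plus a 30% buffer is 78, and the timeout must be strictly greater than that.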

Following these principles raises the job success rate from roughly 60-70% to 95% or higher.

When Principles Conflict

Sometimes reliability and performance conflict. Here's how to choose:

Scenario	Choose	Rationale
Demo/test	Reliability	Fast failure is worse than slow success
Production (first run)	Reliability	Prove it works before optimizing
Production (proven)	Performance	Safe to optimize after validation
Time-critical	Reliability	Failures cause more delay than slow runs
Cost-critical	Balanced	Test with small model, then optimize

General rule: Reliability first, optimize second.
