Reliability Principles for Training Jobs

Contents

  • Principle 1: Always Verify Before Use
  • Principle 2: Prioritize Reliability Over Performance
  • Principle 3: Create Atomic, Self-Contained Scripts
  • Principle 4: Provide Clear Error Context
  • Principle 5: Test the Happy Path on Known-Good Inputs
  • Summary: The Reliability Checklist (pre-flight, script quality, job config)
  • When Principles Conflict

These principles are derived from real production failures and successful fixes. Following them prevents common failure modes and ensures reliable job execution.

Principle 1: Always Verify Before Use

Rule: Never assume repos, datasets, or resources exist. Verify with tools first.

What It Prevents

  • Non-existent datasets - Jobs fail immediately when dataset doesn't exist
  • Typos and wrong names - Simple mistakes like assuming "argilla-dpo-mix-7k" when the dataset is actually "ultrafeedback_binarized"
  • Incorrect paths - Old or moved repos, renamed files
  • Missing dependencies - Undocumented requirements

How to Apply

Before submitting ANY job:

# Verify dataset exists
dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
hub_repo_details(["author/dataset-name"], repo_type="dataset")

# Verify model exists
hub_repo_details(["org/model-name"], repo_type="model")

# Check script/file paths (for URL-based scripts)
# Verify before using: https://github.com/user/repo/blob/main/script.py
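
For URL-based scripts, the path check can be done with the standard library alone. The sketch below assumes the script is hosted on GitHub and that the job should fetch the raw file rather than the HTML `blob` page; `to_raw_url` and `url_is_reachable` are hypothetical helper names, not part of any Hugging Face tool.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def to_raw_url(url: str) -> str:
    """Convert a GitHub 'blob' page URL to its raw-file URL; pass others through."""
    parts = urlparse(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<ref>/<path...>
    if parts.netloc == "github.com" and len(segments) >= 5 and segments[2] == "blob":
        user, repo, _, *rest = segments
        return "https://raw.githubusercontent.com/" + "/".join([user, repo, *rest])
    return url


def url_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to the (raw) URL succeeds."""
    request = urllib.request.Request(to_raw_url(url), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError):
        # Covers HTTP errors such as 404 as well as DNS and connection failures.
        return False
```

Calling `url_is_reachable(...)` on the script URL before submitting catches moved or renamed files without spending any job time.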

Examples that would have caught errors:

# ❌ WRONG: Assumed dataset exists
hf_jobs("uv", {
    "script": """...""",
    "env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"}  # Doesn't exist!
})

# ✅ CORRECT: Verify first
dataset_search({"query": "argilla dpo", "author": "trl-lib"})
# Would show: "trl-lib/ultrafeedback_binarized" is the correct name

hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
# Confirms it exists before using
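
The same verification can also run at the top of the training script, so that a bad name fails in the first seconds of the job. This is a minimal sketch, assuming `huggingface_hub` is installed and using its `HfApi.repo_exists` method; `find_missing_repos` is a hypothetical helper, and the checker is injectable so the logic can be tested without network access.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Repo = Tuple[str, str]  # (repo_id, repo_type), e.g. ("org/model-name", "model")


def _hub_repo_exists(repo_id: str, repo_type: str) -> bool:
    # Imported lazily so the helper below can be tested offline.
    from huggingface_hub import HfApi

    return HfApi().repo_exists(repo_id, repo_type=repo_type)


def find_missing_repos(
    repos: Iterable[Repo],
    exists: Optional[Callable[[str, str], bool]] = None,
) -> List[Repo]:
    """Return every (repo_id, repo_type) pair that the checker cannot find."""
    check = exists or _hub_repo_exists
    return [
        (repo_id, repo_type)
        for repo_id, repo_type in repos
        if not check(repo_id, repo_type)
    ]
```

A script could call `find_missing_repos([("trl-lib/ultrafeedback_binarized", "dataset")])` and exit with a clear message if the returned list is not empty.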

Implementation Checklist

  • Check dataset exists before training
  • Test script URLs are valid before submitting
  • Check for recent updates/renames of resources
  • Check that the dataset format matches what the script expects

Time cost: 5-10 seconds
Time saved: Hours of failed job time + debugging


Principle 2: Prioritize Reliability Over Performance

Rule: Default to what is most likely to succeed, not what is theoretically fastest.

What It Prevents

  • Hardware incompatibilities - Features that fail on certain GPUs
  • Unstable optimizations - Speed-ups that cause crashes
  • Complex configurations - More failure points
  • Build system issues - Unreliable compilation methods

How to Apply

Choose reliability:

# ❌ RISKY: Aggressive optimization that may fail
TrainingArguments(
    torch_compile=True,  # Can fail on T4, A10G GPUs
    optim="adamw_bnb_8bit",  # Requires specific setup
    dataloader_num_workers=8,  # May cause OOM on small instances
    ...
)

# ✅ SAFE: Proven defaults
TrainingArguments(
    # torch_compile=True,  # Commented with note: "Enable on H100 for 20% speedup"
    optim="adamw_torch",  # Standard, always works
    fp16=True,  # Stable and fast on T4/A10G
    dataloader_num_workers=4,  # Conservative, reliable
    ...
)

Real-World Example

The torch.compile failure:

  • Added for "20% speedup" on H100
  • Failed fatally on T4-medium with cryptic error
  • Misdiagnosed as dataset issue (cost hours)
  • Fix: Disable by default, add as optional comment

Result: Reliability > 20% performance gain
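
The "disable by default" fix can also be encoded as a small allowlist instead of a comment. The sketch below is illustrative only: `training_kwargs_for` and `COMPILE_VALIDATED_GPUS` are hypothetical names, and the allowlist holds just the H100 because that is the only GPU the example above reports as benefiting from `torch_compile`.

```python
from typing import Any, Dict

# Assumption for illustration: only GPUs listed here have been validated
# with torch.compile for this workload. Extend after testing on real hardware.
COMPILE_VALIDATED_GPUS = ("H100",)


def training_kwargs_for(gpu_name: str) -> Dict[str, Any]:
    """Return conservative TrainingArguments kwargs.

    torch_compile is switched on only when the GPU name matches a validated entry.
    """
    kwargs: Dict[str, Any] = {
        "optim": "adamw_torch",
        "fp16": True,
        "dataloader_num_workers": 4,
        "torch_compile": False,
    }
    if any(tag in gpu_name.upper() for tag in COMPILE_VALIDATED_GPUS):
        kwargs["torch_compile"] = True
    return kwargs
```

The GPU name could come from `torch.cuda.get_device_name(0)`, and the result would be unpacked into `TrainingArguments(**kwargs, ...)` alongside the job-specific arguments.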

Implementation Checklist

  • Use proven, standard configurations by default
  • Comment out performance optimizations with hardware notes
  • Use stable build systems (CMake > make)
  • Test on target hardware before production
  • Document known incompatibilities
  • Provide "safe" and "fast" variants when needed

Performance loss: 10-20% relative to the best-case optimized run
Reliability gain: 95%+ success rate vs 60-70%


Principle 3: Create Atomic, Self-Contained Scripts

Rule: Scripts should work as complete, independent units. Don't remove parts to "simplify."

What It Prevents

  • Missing dependencies - Removed "unnecessary" packages that are actually required
  • Incomplete processes - Skipped steps that seem redundant
  • Environment assumptions - Scripts that need pre-setup
  • Partial failures - Some parts work, others fail silently

How to Apply

Complete dependency specifications:

# ❌ INCOMPLETE: "Simplified" by removing dependencies
# /// script
# dependencies = [
#     "transformers",
#     "torch",
#     "datasets",
# ]
# ///

# ✅ COMPLETE: All dependencies explicit
# /// script
# dependencies = [
#     "transformers>=5.2.0",
#     "accelerate>=1.1.0",
#     "albumentations>=1.4.16",  # Required for augmentation + bbox handling
#     "timm",                     # Required for vision backbones
#     "datasets>=4.0",
#     "torchmetrics",             # Required for mAP/mAR computation
#     "pycocotools",              # Required for COCO evaluation
#     "trackio",                  # Required for metrics monitoring
#     "huggingface_hub",
# ]
# ///

Real-World Example

The albumentations failure:

  • Original script had it: augmentations and bbox clipping worked fine
  • "Simplified" version removed it: "not strictly needed for training"
  • Training crashed on bbox augmentation — no fallback for COCO-format bbox handling
  • Hard to debug: error appeared in data loading, not in augmentation setup
  • Fix: Restore all original dependencies

Result: Don't remove dependencies without thorough testing
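
A failure like this surfaces earlier if the script checks its imports before loading any data. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the names passed to it are import names, which can differ from the package names in the PEP 723 header.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

At the top of a training script, `missing_modules(["albumentations", "timm", "pycocotools"])` returning a non-empty list is a reason to stop with a message that names the missing packages, rather than waiting for a crash inside data loading.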

Implementation Checklist

  • All dependencies in PEP 723 header, with minimum-version constraints where known
  • All system packages installed by script
  • No assumptions about pre-existing environment
  • No "optional" steps that are actually required
  • Test scripts in clean environment
  • Document why each dependency is needed

Complexity: Slightly longer scripts
Reliability: Scripts "just work" every time


Principle 4: Provide Clear Error Context

Rule: When things fail, make it obvious what went wrong and how to fix it.

How to Apply

Wrap subprocess calls:

# ❌ UNCLEAR: Output is captured but never shown when the command fails
subprocess.run([...], check=True, capture_output=True)

# ✅ CLEAR: Shows what failed
try:
    result = subprocess.run(
        [...],
        check=True,
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.stderr:
        print("Warnings:", result.stderr)
except subprocess.CalledProcessError as e:
    print(f"❌ Command failed with exit code {e.returncode}!")
    print("STDOUT:", e.stdout)
    print("STDERR:", e.stderr)
    raise

Validate inputs:

# ❌ UNCLEAR: Fails later with cryptic error
model = load_model(MODEL_NAME)

# ✅ CLEAR: Fails fast with clear message
if not MODEL_NAME:
    raise ValueError("MODEL_NAME environment variable not set!")

print(f"Loading model: {MODEL_NAME}")
try:
    model = load_model(MODEL_NAME)
    print("✅ Model loaded successfully")
except Exception as e:
    print(f"❌ Failed to load model: {MODEL_NAME}")
    print(f"Error: {e}")
    print("Hint: Check that model exists on Hub")
    raise

Implementation Checklist

  • Wrap external calls with try/except
  • Print stdout/stderr on failure
  • Validate environment variables early
  • Add progress indicators (✅, ❌, 🔄)
  • Include hints for common failures
  • Log configuration at start
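
Two of the checklist items, validating environment variables early and logging configuration at start, fit in a pair of small helpers. This is a sketch under assumptions: `require_env` and `log_config` are hypothetical names, and treating `HF_TOKEN` as the only secret to mask is an illustrative default.

```python
import os
from typing import Dict, Iterable, Mapping


def require_env(
    names: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Return the requested variables, or raise one error naming all that are missing."""
    values = {name: environ.get(name, "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError(
            "Missing required environment variables: "
            + ", ".join(missing)
            + ". Hint: pass them in the job's env or secrets."
        )
    return values


def log_config(
    config: Mapping[str, str],
    secret_names: Iterable[str] = ("HF_TOKEN",),
) -> None:
    """Print the configuration at startup, masking secret values."""
    hidden = set(secret_names)
    for name in sorted(config):
        shown = "***" if name in hidden else config[name]
        print(f"{name} = {shown}")
```

Reporting every missing variable in one error avoids the cycle of fixing one name, resubmitting, and hitting the next.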

Principle 5: Test the Happy Path on Known-Good Inputs

Rule: Before using new code in production, test with inputs you know work.
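
One way to apply this rule is a short smoke test that runs the new code on a handful of inputs already known to work before the full job starts. The sketch below is an assumption about how such a check might look; `smoke_test` is a hypothetical helper, and `step` stands in for whatever function is being introduced (a preprocessing transform, a collate function, a single training step).

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def smoke_test(
    step: Callable[[T], object],
    known_good: Sequence[T],
    limit: int = 4,
) -> int:
    """Run `step` on a few known-good inputs and return how many passed.

    Raises RuntimeError with context on the first failure.
    """
    checked = 0
    for index, sample in enumerate(known_good[:limit]):
        try:
            step(sample)
        except Exception as error:
            raise RuntimeError(
                f"Smoke test failed on known-good input #{index}: {error!r}. "
                "The new code is the likely cause, not the data."
            ) from error
        checked += 1
    return checked
```

Because the inputs are known to be good, a failure here points at the new code, which avoids the kind of misdiagnosis described under Principle 2.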

Summary: The Reliability Checklist

Before submitting ANY job:

Pre-Flight Checks

  • Verified all repos/datasets exist (hub_repo_details)
  • Tested with known-good inputs if new code
  • Using proven hardware/configuration
  • Included all dependencies in PEP 723 header
  • Installed system requirements (build tools, etc.)
  • Set appropriate timeout (not default 30m)
  • Configured Hub push with HF_TOKEN (login() + hub_token)
  • Added clear error handling

Script Quality

  • Self-contained (no external setup needed)
  • Complete dependencies listed
  • Build tools installed by script
  • Progress indicators included
  • Error messages are clear
  • Configuration logged at start

Job Configuration

  • Timeout > expected runtime + 30% buffer
  • Hardware appropriate for model size
  • Secrets include HF_TOKEN (see SKILL.md directive #2 for syntax)
  • Script calls login(token=hf_token) and sets training_args.hub_token = hf_token BEFORE Trainer() init
  • Environment variables set correctly
  • Cost estimated and acceptable
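
The timeout rule is simple arithmetic, shown here as a worked example. `timeout_minutes` is a hypothetical helper; it uses integer math so the result always rounds up to a whole minute.

```python
def timeout_minutes(expected_minutes: int, buffer_percent: int = 30) -> int:
    """Expected runtime plus a percentage buffer, rounded up to whole minutes."""
    if expected_minutes <= 0:
        raise ValueError("expected_minutes must be positive")
    total_hundredths = expected_minutes * (100 + buffer_percent)
    return -(-total_hundredths // 100)  # ceiling division on integers


print(timeout_minutes(100))  # 130
print(timeout_minutes(45))   # 59 (58.5 rounded up)
```

Treat the result as a lower bound and round up further if the runtime estimate itself is uncertain.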

Following these principles raises the job success rate from ~60-70% to ~95%+.


When Principles Conflict

Sometimes reliability and performance conflict. Here's how to choose:

| Scenario | Choose | Rationale |
|----------|--------|-----------|
| Demo/test | Reliability | Fast failure is worse than slow success |
| Production (first run) | Reliability | Prove it works before optimizing |
| Production (proven) | Performance | Safe to optimize after validation |
| Time-critical | Reliability | Failures cause more delay than slow runs |
| Cost-critical | Balanced | Test with small model, then optimize |

General rule: Reliability first, optimize second.