Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective workloads.

Reference: HF Jobs Hardware Documentation (updated 07/2025)

Available Hardware

CPU Flavors

| Flavor | Description | Use Case |
|---|---|---|
| cpu-basic | Basic CPU instance | Testing, lightweight scripts |
| cpu-upgrade | Enhanced CPU instance | Data processing, parallel workloads |

Use cases: Data processing, testing scripts, lightweight workloads

Not recommended for: Model training, GPU-accelerated workloads

GPU Flavors

| Flavor | GPU | VRAM | Use Case |
|---|---|---|---|
| t4-small | NVIDIA T4 | 16GB | <1B models, demos, quick tests |
| t4-medium | NVIDIA T4 | 16GB | 1-3B models, development |
| l4x1 | NVIDIA L4 | 24GB | 3-7B models, efficient workloads |
| l4x4 | 4x NVIDIA L4 | 96GB | Multi-GPU, parallel workloads |
| a10g-small | NVIDIA A10G | 24GB | 3-7B models, production |
| a10g-large | NVIDIA A10G | 24GB | 7-13B models, batch inference |
| a10g-largex2 | 2x NVIDIA A10G | 48GB | Multi-GPU, large models |
| a10g-largex4 | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models |
| a100-large | NVIDIA A100 | 40GB | 13B+ models, fastest GPU option |

TPU Flavors

| Flavor | Configuration | Use Case |
|---|---|---|
| v5e-1x1 | TPU v5e (1x1) | Small TPU workloads |
| v5e-2x2 | TPU v5e (2x2) | Medium TPU workloads |
| v5e-2x4 | TPU v5e (2x4) | Large TPU workloads |

TPU Use Cases:

  • JAX/Flax model training
  • Large-scale inference
  • TPU-optimized workloads

Selection Guidelines

By Workload Type

Data Processing

  • Recommended: cpu-upgrade or l4x1
  • Use case: Transform, filter, analyze datasets
  • Batch size: Depends on data size
  • Time: Varies by dataset size

Batch Inference

  • Recommended: a10g-large or a100-large
  • Use case: Run inference on thousands of samples
  • Batch size: 8-32 depending on model
  • Time: Depends on number of samples

Experiments & Benchmarks

  • Recommended: a10g-small or a10g-large
  • Use case: Reproducible ML experiments
  • Batch size: Varies
  • Time: Depends on experiment complexity

Model Training (see model-trainer skill for details)

  • Recommended: See model-trainer skill
  • Use case: Fine-tuning models
  • Batch size: Depends on model size
  • Time: Hours to days

Synthetic Data Generation

  • Recommended: a10g-large or a100-large
  • Use case: Generate datasets using LLMs
  • Batch size: Depends on generation method
  • Time: Hours for large datasets

By Budget

Minimal Budget (<$5 total)

  • Use cpu-basic or t4-small
  • Process small datasets
  • Quick tests and demos

Small Budget ($5-20)

  • Use t4-medium or a10g-small
  • Process medium datasets
  • Run experiments

Medium Budget ($20-50)

  • Use a10g-small or a10g-large
  • Process large datasets
  • Production workloads

Large Budget ($50-200)

  • Use a10g-large or a100-large
  • Large-scale processing
  • Multiple experiments

By Model Size (for inference/processing)

Tiny Models (<1B parameters)

  • Recommended: t4-small
  • Example: Qwen2.5-0.5B, TinyLlama
  • Batch size: 8-16

Small Models (1-3B parameters)

  • Recommended: t4-medium or a10g-small
  • Example: Qwen2.5-1.5B, Phi-2
  • Batch size: 4-8

Medium Models (3-7B parameters)

  • Recommended: a10g-small or a10g-large
  • Example: Qwen2.5-7B, Mistral-7B
  • Batch size: 2-4

Large Models (7-13B parameters)

  • Recommended: a10g-large or a100-large
  • Example: Llama-3-8B
  • Batch size: 1-2

Very Large Models (13B+ parameters)

  • Recommended: a100-large
  • Example: Llama-2-13B, Llama-3-70B (requires quantization to fit)
  • Batch size: 1
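
The size tiers above can be read as a simple lookup. The following is a minimal sketch, not part of HF Jobs itself: the name `recommend_flavor` is hypothetical, each upper bound is assumed to be exclusive (so a boundary size falls into the larger tier), and only the first flavor listed for each tier is returned.

```python
# (exclusive upper bound in billions of parameters, first flavor listed for the tier)
SIZE_TIERS = [
    (1, "t4-small"),     # Tiny models
    (3, "t4-medium"),    # Small models
    (7, "a10g-small"),   # Medium models
    (13, "a10g-large"),  # Large models
]


def recommend_flavor(params_billions: float) -> str:
    """Return the first recommended flavor for a model of the given size."""
    for upper_bound, flavor in SIZE_TIERS:
        if params_billions < upper_bound:
            return flavor
    # Anything at or above 13B falls into the "very large" tier.
    return "a100-large"


for size in (0.5, 1.5, 5, 8, 70):
    print(f"{size}B -> {recommend_flavor(size)}")
```

Treat the result as a starting point; the memory estimate for the specific model and precision should still be checked against the flavor's VRAM.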

Memory Considerations

Estimating Memory Requirements

For inference:

Memory (GB) ≈ (Model params in billions) × 2-4

For training:

Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)

Examples:

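  • Qwen2.5-0.5B inference: ~1-2GB, fits t4-small
  • Qwen2.5-7B inference: ~14-28GB; the low end (half precision, 2 bytes per parameter) fits a10g-large (24GB), the high end does not
  • Qwen2.5-7B training: ~140GB for full fine-tuning, not feasible on a single GPU; ~28GB with LoRA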
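
The two rules of thumb above translate directly into code. This is a minimal sketch using only the multipliers given in this guide (2-4 for inference, 20 for full training, 4 for LoRA); the function names are hypothetical, and real usage also depends on sequence length, batch size, and precision.

```python
def estimate_inference_memory_gb(params_billions: float) -> tuple[float, float]:
    """Return the (low, high) inference estimate: 2 to 4 GB per billion parameters."""
    return params_billions * 2, params_billions * 4


def estimate_training_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Return the training estimate: 20 GB per billion parameters, or 4 with LoRA."""
    factor = 4 if lora else 20
    return params_billions * factor


low, high = estimate_inference_memory_gb(7)
print(f"7B inference: {low:.0f}-{high:.0f} GB")
print(f"7B full training: {estimate_training_memory_gb(7):.0f} GB")
print(f"7B LoRA training: {estimate_training_memory_gb(7, lora=True):.0f} GB")
```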

Memory Optimization

If hitting memory limits:

  1. Reduce batch size

    batch_size = 1
    
  2. Process in chunks

    for chunk in chunks:
        process(chunk)
    
  3. Use smaller models

    • Use quantized models
    • Use LoRA adapters
  4. Upgrade hardware

    • cpu → t4 → a10g → a100
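
Step 2 above ("process in chunks") leaves `chunks` and `process` undefined. One way to fill them in is sketched below, assuming the input is any iterable; `iter_chunks` and the doubling `process` function are hypothetical stand-ins for real batching and model code.

```python
from itertools import islice


def iter_chunks(items, chunk_size: int):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process(chunk):
    """Hypothetical per-chunk work; replace with real inference or transformation."""
    return [x * 2 for x in chunk]


results = []
for chunk in iter_chunks(range(10), chunk_size=4):
    # Only one chunk is held in memory at a time.
    results.extend(process(chunk))

print(results)
```

Because the generator pulls items lazily, peak memory is bounded by `chunk_size` rather than by the size of the whole dataset.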

Cost Estimation

Formula

Total Cost = (Hours of runtime) × (Cost per hour)

Example Calculations

Data processing:

  • Hardware: cpu-upgrade ($0.50/hour)
  • Time: 1 hour
  • Cost: $0.50

Batch inference:

  • Hardware: a10g-large ($5/hour)
  • Time: 2 hours
  • Cost: $10.00

Experiments:

  • Hardware: a10g-small ($3.50/hour)
  • Time: 4 hours
  • Cost: $14.00
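
The formula and the three examples above can be checked with a few lines of Python. This is a minimal sketch: the hourly rates are the example figures used in this section, not a price list, and `job_cost` and `max_hours` are hypothetical helper names.

```python
def job_cost(hours: float, rate_per_hour: float) -> float:
    """Total cost = hours of runtime x cost per hour."""
    return hours * rate_per_hour


def max_hours(budget: float, rate_per_hour: float) -> float:
    """Invert the formula: how long a job can run within a budget."""
    return budget / rate_per_hour


# Example rates taken from the calculations in this section.
examples = [
    ("data processing", "cpu-upgrade", 1, 0.50),
    ("batch inference", "a10g-large", 2, 5.00),
    ("experiments", "a10g-small", 4, 3.50),
]

for name, flavor, hours, rate in examples:
    print(f"{name} on {flavor}: ${job_cost(hours, rate):.2f}")

print(f"$20 at $5.00/hour lasts {max_hours(20, 5.00):.1f} hours")
```

The inverse form is useful for choosing a job timeout that keeps a run within budget.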

Cost Optimization Tips

  1. Start small: Test on cpu-basic or t4-small
  2. Monitor runtime: Set appropriate timeouts
  3. Optimize code: Reduce unnecessary compute
  4. Choose right hardware: Don't over-provision
  5. Use checkpoints: Resume if job fails
  6. Monitor costs: Check running jobs regularly
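
Tip 5 (use checkpoints) can be as simple as recording which items are finished and skipping them on restart. The sketch below is one possible approach, not a prescribed one: all function names are hypothetical, and it writes to a local file for illustration, on the assumption that a real job would store the checkpoint somewhere that outlives the job, such as a Hub repository.

```python
import json
import tempfile
from pathlib import Path


def load_done(path: Path) -> set:
    """Return the set of item IDs recorded in the checkpoint, if any."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()


def save_done(path: Path, done: set) -> None:
    """Write via a temporary file so an interrupted write cannot truncate the checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    tmp.replace(path)


def run_with_checkpoint(items, path: Path, work) -> set:
    """Call work(item) for every item not already recorded as done."""
    done = load_done(path)
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        work(item)
        done.add(item)
        save_done(path, done)
    return done


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        checkpoint = Path(tmpdir) / "done.json"
        run_with_checkpoint([1, 2, 3], checkpoint, work=print)
        # A second run finds every item in the checkpoint and does no work.
        run_with_checkpoint([1, 2, 3], checkpoint, work=print)
```

Saving after every item is the simplest policy; for many small items, saving every N items trades a little repeated work for fewer writes.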

Multi-GPU Workloads

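Multi-GPU flavors attach several GPUs to a single job. The job's code still has to distribute the work across them, for example with data parallelism or tensor parallelism.

Multi-GPU flavors: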

  • l4x4 - 4x L4 GPUs (96GB total VRAM)
  • a10g-largex2 - 2x A10G GPUs (48GB total VRAM)
  • a10g-largex4 - 4x A10G GPUs (96GB total VRAM)

When to use:

  • Large models (>13B parameters)
  • Need faster processing (near-linear speedup at best, for workloads that parallelize cleanly)
  • Large datasets (>100K samples)
  • Parallel workloads
  • Tensor parallelism for inference

MCP Tool Example:

hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})

CLI Equivalent:

hf jobs uv run --flavor a10g-largex2 --timeout 4h --secrets HF_TOKEN process.py
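
Inside a script such as `process.py`, the simplest way to use two GPUs for a large dataset is data parallelism: each worker handles its own slice of the samples. The sketch below shows only the slicing step and makes several assumptions: `shard` is a hypothetical helper, and how each worker learns its `rank` and the `world_size` (for example from a launcher such as `torchrun` or `accelerate`) is outside its scope.

```python
def shard(samples: list, rank: int, world_size: int) -> list:
    """Return every world_size-th sample, starting at index rank."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is out of range for world_size {world_size}")
    return samples[rank::world_size]


samples = list(range(10))
world_size = 2  # e.g. the two GPUs of a10g-largex2

for rank in range(world_size):
    print(f"worker {rank}: {shard(samples, rank, world_size)}")
```

Strided slicing keeps shard sizes within one item of each other, so no worker is left with a noticeably larger share.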

Choosing Between Options

CPU vs GPU

Choose CPU when:

  • No GPU acceleration needed
  • Data processing only
  • Budget constrained
  • Simple workloads

Choose GPU when:

  • Model inference/training
  • GPU-accelerated libraries
  • Need faster processing
  • Large models

a10g vs a100

Choose a10g when:

  • Model <13B parameters
  • Budget conscious
  • Processing time not critical

Choose a100 when:

  • Model 13B+ parameters
  • Need fastest processing
  • Memory requirements high
  • Budget allows

Single vs Multi-GPU

Choose single GPU when:

  • Model <7B parameters
  • Budget constrained
  • Simpler debugging

Choose multi-GPU when:

  • Model >13B parameters
  • Need faster processing
  • Large batch sizes required
  • Cost-effective for large jobs

Quick Reference

All Available Flavors

# Official flavor list (updated 07/2025)
FLAVORS = {
    # CPU
    "cpu-basic",      # Testing, lightweight
    "cpu-upgrade",    # Data processing
    
    # GPU - Single
    "t4-small",       # 16GB - <1B models
    "t4-medium",      # 16GB - 1-3B models
    "l4x1",           # 24GB - 3-7B models
    "a10g-small",     # 24GB - 3-7B production
    "a10g-large",     # 24GB - 7-13B models
    "a100-large",     # 40GB - 13B+ models
    
    # GPU - Multi
    "l4x4",           # 4x L4 (96GB total)
    "a10g-largex2",   # 2x A10G (48GB total)
    "a10g-largex4",   # 4x A10G (96GB total)
    
    # TPU
    "v5e-1x1",        # TPU v5e 1x1
    "v5e-2x2",        # TPU v5e 2x2
    "v5e-2x4",        # TPU v5e 2x4
}

Workload → Hardware Mapping

HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
    "training": "see model-trainer skill"
}
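
The two tables above can be combined into a small lookup that rejects anything that is not an actual flavor. This is a minimal sketch: `flavor_for` is a hypothetical helper, and both tables are repeated here only so the example runs on its own.

```python
FLAVORS = {
    "cpu-basic", "cpu-upgrade",
    "t4-small", "t4-medium", "l4x1", "a10g-small", "a10g-large", "a100-large",
    "l4x4", "a10g-largex2", "a10g-largex4",
    "v5e-1x1", "v5e-2x2", "v5e-2x4",
}

HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
    "training": "see model-trainer skill",
}


def flavor_for(workload: str) -> str:
    """Return the mapped flavor, or raise ValueError if there is no usable one."""
    if workload not in HARDWARE_MAP:
        raise ValueError(f"unknown workload: {workload!r}")
    flavor = HARDWARE_MAP[workload]
    if flavor not in FLAVORS:
        # "training" maps to a pointer to another skill, not to a flavor name.
        raise ValueError(f"no fixed flavor for {workload!r}: {flavor}")
    return flavor


print(flavor_for("experiments"))
print(flavor_for("batch_inference_large"))
```

Validating against `FLAVORS` catches typos before a job is submitted, and it surfaces the `training` entry as a case that needs a separate decision.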

CLI Examples

# CPU job
hf jobs run python:3.12 python script.py

# GPU job
hf jobs run --flavor a10g-large pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel python script.py

# TPU job
hf jobs run --flavor v5e-1x1 your-tpu-image python script.py

# UV script with GPU
hf jobs uv run --flavor a10g-small my_script.py