8.1 KiB

Raw Blame History

Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective workloads.

Reference: HF Jobs Hardware Documentation (updated 07/2025)

Available Hardware

CPU Flavors

Flavor	Description	Use Case
`cpu-basic`	Basic CPU instance	Testing, lightweight scripts
`cpu-upgrade`	Enhanced CPU instance	Data processing, parallel workloads

Use cases: Data processing, testing scripts, lightweight workloads Not recommended for: Model training, GPU-accelerated workloads

GPU Flavors

Flavor	GPU	VRAM	Use Case
`t4-small`	NVIDIA T4	16GB	<1B models, demos, quick tests
`t4-medium`	NVIDIA T4	16GB	1-3B models, development
`l4x1`	NVIDIA L4	24GB	3-7B models, efficient workloads
`l4x4`	4x NVIDIA L4	96GB	Multi-GPU, parallel workloads
`a10g-small`	NVIDIA A10G	24GB	3-7B models, production
`a10g-large`	NVIDIA A10G	24GB	7-13B models, batch inference
`a10g-largex2`	2x NVIDIA A10G	48GB	Multi-GPU, large models
`a10g-largex4`	4x NVIDIA A10G	96GB	Multi-GPU, very large models
`a100-large`	NVIDIA A100	40GB	13B+ models, fastest GPU option

TPU Flavors

Flavor	Configuration	Use Case
`v5e-1x1`	TPU v5e (1x1)	Small TPU workloads
`v5e-2x2`	TPU v5e (2x2)	Medium TPU workloads
`v5e-2x4`	TPU v5e (2x4)	Large TPU workloads

TPU Use Cases:

JAX/Flax model training
Large-scale inference
TPU-optimized workloads

Selection Guidelines

By Workload Type

Data Processing

Recommended: cpu-upgrade or l4x1
Use case: Transform, filter, analyze datasets
Batch size: Depends on data size
Time: Varies by dataset size

Batch Inference

Recommended: a10g-large or a100-large
Use case: Run inference on thousands of samples
Batch size: 8-32 depending on model
Time: Depends on number of samples

Experiments & Benchmarks

Recommended: a10g-small or a10g-large
Use case: Reproducible ML experiments
Batch size: Varies
Time: Depends on experiment complexity

Model Training (see model-trainer skill for details)

Recommended: See model-trainer skill
Use case: Fine-tuning models
Batch size: Depends on model size
Time: Hours to days

Synthetic Data Generation

Recommended: a10g-large or a100-large
Use case: Generate datasets using LLMs
Batch size: Depends on generation method
Time: Hours for large datasets

By Budget

Minimal Budget (<$5 total)

Use cpu-basic or t4-small
Process small datasets
Quick tests and demos

Small Budget ($5-20)

Use t4-medium or a10g-small
Process medium datasets
Run experiments

Medium Budget ($20-50)

Use a10g-small or a10g-large
Process large datasets
Production workloads

Large Budget ($50-200)

Use a10g-large or a100-large
Large-scale processing
Multiple experiments

By Model Size (for inference/processing)

Tiny Models (<1B parameters)

Recommended: t4-small
Example: Qwen2.5-0.5B, TinyLlama
Batch size: 8-16

Small Models (1-3B parameters)

Recommended: t4-medium or a10g-small
Example: Qwen2.5-1.5B, Phi-2
Batch size: 4-8

Medium Models (3-7B parameters)

Recommended: a10g-small or a10g-large
Example: Qwen2.5-7B, Mistral-7B
Batch size: 2-4

Large Models (7-13B parameters)

Recommended: a10g-large or a100-large
Example: Llama-3-8B
Batch size: 1-2

Very Large Models (13B+ parameters)

Recommended: a100-large
Example: Llama-3-13B, Llama-3-70B
Batch size: 1

Memory Considerations

Estimating Memory Requirements

For inference:

Memory (GB) ≈ (Model params in billions) × 2-4

For training:

Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)

Examples:

Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
Qwen2.5-7B inference: ~14-28GB ✅ fits a10g-large
Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA

Memory Optimization

If hitting memory limits:

Reduce batch size
```
batch_size = 1
```

Process in chunks

for chunk in chunks:
    process(chunk)

Use smaller models
- Use quantized models
- Use LoRA adapters
Upgrade hardware
- cpu → t4 → a10g → a100

Cost Estimation

Formula

Total Cost = (Hours of runtime) × (Cost per hour)

Example Calculations

Data processing:

Hardware: cpu-upgrade ($0.50/hour)
Time: 1 hour
Cost: $0.50

Batch inference:

Hardware: a10g-large ($5/hour)
Time: 2 hours
Cost: $10.00

Experiments:

Hardware: a10g-small ($3.50/hour)
Time: 4 hours
Cost: $14.00

Cost Optimization Tips

Start small: Test on cpu-basic or t4-small
Monitor runtime: Set appropriate timeouts
Optimize code: Reduce unnecessary compute
Choose right hardware: Don't over-provision
Use checkpoints: Resume if job fails
Monitor costs: Check running jobs regularly

Multi-GPU Workloads

Multi-GPU flavors automatically distribute workloads:

Multi-GPU flavors:

l4x4 - 4x L4 GPUs (96GB total VRAM)
a10g-largex2 - 2x A10G GPUs (48GB total VRAM)
a10g-largex4 - 4x A10G GPUs (96GB total VRAM)

When to use:

Large models (>13B parameters)
Need faster processing (linear speedup)
Large datasets (>100K samples)
Parallel workloads
Tensor parallelism for inference

MCP Tool Example:

hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})

CLI Equivalent:

hf jobs uv run process.py --flavor a10g-largex2 --timeout 4h

Choosing Between Options

CPU vs GPU

Choose CPU when:

No GPU acceleration needed
Data processing only
Budget constrained
Simple workloads

Choose GPU when:

Model inference/training
GPU-accelerated libraries
Need faster processing
Large models

a10g vs a100

Choose a10g when:

Model <13B parameters
Budget conscious
Processing time not critical

Choose a100 when:

Model 13B+ parameters
Need fastest processing
Memory requirements high
Budget allows

Single vs Multi-GPU

Choose single GPU when:

Model <7B parameters
Budget constrained
Simpler debugging

Choose multi-GPU when:

Model >13B parameters
Need faster processing
Large batch sizes required
Cost-effective for large jobs

Quick Reference

All Available Flavors

# Official flavor list (updated 07/2025)
FLAVORS = {
    # CPU
    "cpu-basic",      # Testing, lightweight
    "cpu-upgrade",    # Data processing
    
    # GPU - Single
    "t4-small",       # 16GB - <1B models
    "t4-medium",      # 16GB - 1-3B models
    "l4x1",           # 24GB - 3-7B models
    "a10g-small",     # 24GB - 3-7B production
    "a10g-large",     # 24GB - 7-13B models
    "a100-large",     # 40GB - 13B+ models
    
    # GPU - Multi
    "l4x4",           # 4x L4 (96GB total)
    "a10g-largex2",   # 2x A10G (48GB total)
    "a10g-largex4",   # 4x A10G (96GB total)
    
    # TPU
    "v5e-1x1",        # TPU v5e 1x1
    "v5e-2x2",        # TPU v5e 2x2
    "v5e-2x4",        # TPU v5e 2x4
}

Workload → Hardware Mapping

HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
    "training": "see model-trainer skill"
}

CLI Examples

# CPU job
hf jobs run python:3.12 python script.py

# GPU job
hf jobs run --flavor a10g-large pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel python script.py

# TPU job
hf jobs run --flavor v5e-1x1 your-tpu-image python script.py

# UV script with GPU
hf jobs uv run --flavor a10g-small my_script.py

8.1 KiB Raw Blame History Unescape Escape