Hardware Selection Guide
Choosing the right hardware (flavor) is critical for cost-effective workloads.
Reference: HF Jobs Hardware Documentation (updated 07/2025)
Available Hardware
CPU Flavors
| Flavor | Description | Use Case |
|---|---|---|
| cpu-basic | Basic CPU instance | Testing, lightweight scripts |
| cpu-upgrade | Enhanced CPU instance | Data processing, parallel workloads |
Use cases: Data processing, testing scripts, lightweight workloads
Not recommended for: Model training, GPU-accelerated workloads
GPU Flavors
| Flavor | GPU | VRAM | Use Case |
|---|---|---|---|
| t4-small | NVIDIA T4 | 16GB | <1B models, demos, quick tests |
| t4-medium | NVIDIA T4 | 16GB | 1-3B models, development |
| l4x1 | NVIDIA L4 | 24GB | 3-7B models, efficient workloads |
| l4x4 | 4x NVIDIA L4 | 96GB | Multi-GPU, parallel workloads |
| a10g-small | NVIDIA A10G | 24GB | 3-7B models, production |
| a10g-large | NVIDIA A10G | 24GB | 7-13B models, batch inference |
| a10g-largex2 | 2x NVIDIA A10G | 48GB | Multi-GPU, large models |
| a10g-largex4 | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models |
| a100-large | NVIDIA A100 | 40GB | 13B+ models, fastest GPU option |
TPU Flavors
| Flavor | Configuration | Use Case |
|---|---|---|
| v5e-1x1 | TPU v5e (1x1) | Small TPU workloads |
| v5e-2x2 | TPU v5e (2x2) | Medium TPU workloads |
| v5e-2x4 | TPU v5e (2x4) | Large TPU workloads |
TPU Use Cases:
- JAX/Flax model training
- Large-scale inference
- TPU-optimized workloads
Selection Guidelines
By Workload Type
Data Processing
- Recommended: cpu-upgrade or l4x1
- Use case: Transform, filter, analyze datasets
- Batch size: Depends on data size
- Time: Varies by dataset size
Batch Inference
- Recommended: a10g-large or a100-large
- Use case: Run inference on thousands of samples
- Batch size: 8-32 depending on model
- Time: Depends on number of samples
Experiments & Benchmarks
- Recommended: a10g-small or a10g-large
- Use case: Reproducible ML experiments
- Batch size: Varies
- Time: Depends on experiment complexity
Model Training (see model-trainer skill for details)
- Recommended: See model-trainer skill
- Use case: Fine-tuning models
- Batch size: Depends on model size
- Time: Hours to days
Synthetic Data Generation
- Recommended: a10g-large or a100-large
- Use case: Generate datasets using LLMs
- Batch size: Depends on generation method
- Time: Hours for large datasets
By Budget
Minimal Budget (<$5 total)
- Use cpu-basic or t4-small
- Process small datasets
- Quick tests and demos
Small Budget ($5-20)
- Use t4-medium or a10g-small
- Process medium datasets
- Run experiments
Medium Budget ($20-50)
- Use a10g-small or a10g-large
- Process large datasets
- Production workloads
Large Budget ($50-200)
- Use a10g-large or a100-large
- Large-scale processing
- Multiple experiments
By Model Size (for inference/processing)
Tiny Models (<1B parameters)
- Recommended: t4-small
- Example: Qwen2.5-0.5B, TinyLlama
- Batch size: 8-16
Small Models (1-3B parameters)
- Recommended: t4-medium or a10g-small
- Example: Qwen2.5-1.5B, Phi-2
- Batch size: 4-8
Medium Models (3-7B parameters)
- Recommended: a10g-small or a10g-large
- Example: Qwen2.5-7B, Mistral-7B
- Batch size: 2-4
Large Models (7-13B parameters)
- Recommended: a10g-large or a100-large
- Example: Llama-3-8B
- Batch size: 1-2
Very Large Models (13B+ parameters)
- Recommended: a100-large
- Example: Llama-2-13B, Llama-3-70B (quantized only; unquantized 70B weights exceed 40GB)
- Batch size: 1
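The size tiers above can be written as a small lookup. This is a minimal sketch: `recommend_flavor` and `MODEL_SIZE_TIERS` are hypothetical names, the thresholds come from the tiers in this guide, and the first recommended flavor of each tier is used. The guide leaves tier boundaries ambiguous, so the sketch assigns a boundary value (for example exactly 7B) to the larger tier.

```python
# Tiers from this guide: (upper bound in billions of parameters, first recommended flavor).
# recommend_flavor is a hypothetical helper, not part of any Hugging Face library.
MODEL_SIZE_TIERS = [
    (1.0, "t4-small"),     # tiny: <1B
    (3.0, "t4-medium"),    # small: 1-3B
    (7.0, "a10g-small"),   # medium: 3-7B
    (13.0, "a10g-large"),  # large: 7-13B
]


def recommend_flavor(params_billions: float) -> str:
    """Return this guide's first-choice flavor for a model of the given size."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    for upper_bound, flavor in MODEL_SIZE_TIERS:
        # A value exactly on a boundary falls through to the larger tier.
        if params_billions < upper_bound:
            return flavor
    return "a100-large"  # very large: 13B+
```

For example, `recommend_flavor(0.5)` selects the tiny tier and `recommend_flavor(8)` selects the large tier.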
Memory Considerations
Estimating Memory Requirements
For inference:
Memory (GB) ≈ (Model params in billions) × 2-4
For training:
Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
Examples:
- Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
- Qwen2.5-7B inference: ~14-28GB ✅ fits a10g-large (24GB) at the low end of the range; the high end needs a100-large
- Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA
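The rules of thumb above can be checked against a flavor's VRAM with a few lines of Python. This is a rough sketch, not a measurement: the function names are hypothetical, the multipliers are the ones stated in this section, and the VRAM figures are copied from the GPU table in this guide.

```python
# VRAM per flavor in GB, copied from the GPU table in this guide (subset).
FLAVOR_VRAM_GB = {"t4-small": 16, "a10g-large": 24, "a100-large": 40}


def estimate_inference_gb(params_billions):
    """Return the (low, high) range from the x2-4 rule of thumb."""
    return (params_billions * 2, params_billions * 4)


def estimate_training_gb(params_billions, lora=False):
    """Return the x20 (full fine-tuning) or x4 (LoRA) rule-of-thumb estimate."""
    return params_billions * (4 if lora else 20)


def fits(required_gb, flavor):
    """True if the estimate is within the flavor's listed VRAM."""
    return required_gb <= FLAVOR_VRAM_GB[flavor]


low, high = estimate_inference_gb(7)
print(low, high, fits(low, "a10g-large"), fits(high, "a10g-large"))
```

For a 7B model the low estimate (14GB) fits the 24GB listed for a10g-large, while the high estimate (28GB) does not, so the result depends on the precision the model is loaded in.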
Memory Optimization
If hitting memory limits:
- Reduce batch size
  batch_size = 1
- Process in chunks
  for chunk in chunks: process(chunk)
- Use smaller models
  - Use quantized models
  - Use LoRA adapters
- Upgrade hardware
  - cpu → t4 → a10g → a100
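Chunked processing, mentioned in the list above, might look like the following. It is a minimal sketch in plain Python: `iter_chunks` is a hypothetical helper, and the chunk size is something to tune against available memory.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(items: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size items.

    Only one chunk is held in memory at a time, so peak memory is bounded
    by chunk_size rather than by the full dataset.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[T] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final partial chunk
        yield chunk


for chunk in iter_chunks(range(5), 2):
    print(chunk)
```

Because the helper accepts any iterable, it also works with generators or streamed datasets that never fit in memory at once.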
Cost Estimation
Formula
Total Cost = (Hours of runtime) × (Cost per hour)
Example Calculations
Data processing:
- Hardware: cpu-upgrade ($0.50/hour)
- Time: 1 hour
- Cost: $0.50
Batch inference:
- Hardware: a10g-large ($5/hour)
- Time: 2 hours
- Cost: $10.00
Experiments:
- Hardware: a10g-small ($3.50/hour)
- Time: 4 hours
- Cost: $14.00
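The three calculations above follow directly from the formula. Here is a small sketch that reproduces them; `estimate_cost` is a hypothetical helper, and the hourly rates are the illustrative figures used in these examples, not an official price list.

```python
# Illustrative hourly rates taken from the examples in this guide.
# Check current pricing before relying on these numbers.
EXAMPLE_RATES_USD = {
    "cpu-upgrade": 0.50,
    "a10g-small": 3.50,
    "a10g-large": 5.00,
}


def estimate_cost(flavor, hours, rates=EXAMPLE_RATES_USD):
    """Total cost = hours of runtime x cost per hour, rounded to cents."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(rates[flavor] * hours, 2)


for flavor, hours in [("cpu-upgrade", 1), ("a10g-large", 2), ("a10g-small", 4)]:
    print(f"{flavor}: ${estimate_cost(flavor, hours):.2f}")
```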
Cost Optimization Tips
- Start small: Test on cpu-basic or t4-small
- Monitor runtime: Set appropriate timeouts
- Optimize code: Reduce unnecessary compute
- Choose right hardware: Don't over-provision
- Use checkpoints: Resume if job fails
- Monitor costs: Check running jobs regularly
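One way to act on the timeout tip is to derive the timeout from a spending cap. The sketch below is an assumption-laden helper, not part of any HF tooling: `timeout_for_budget` is a hypothetical name, and it assumes the timeout option accepts a minutes suffix such as `90m` alongside the `4h` form shown in this guide, which should be confirmed against the Jobs documentation.

```python
import math


def timeout_for_budget(budget_usd, rate_usd_per_hour):
    """Return a timeout string, in whole minutes, that keeps a job within budget.

    The result is rounded down so the job cannot exceed the budget.
    The "<N>m" format is an assumption about what the timeout option accepts.
    """
    if budget_usd <= 0 or rate_usd_per_hour <= 0:
        raise ValueError("budget and rate must be positive")
    minutes = math.floor(budget_usd / rate_usd_per_hour * 60)
    if minutes < 1:
        raise ValueError("budget is too small for even one minute at this rate")
    return f"{minutes}m"


print(timeout_for_budget(10, 5))
```

With a $10 cap at an illustrative $5/hour rate, the helper yields a two-hour limit expressed in minutes.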
Multi-GPU Workloads
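Multi-GPU flavors expose several GPUs to a single job. The script must be written to use them, for example through data parallelism or tensor parallelism.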
Multi-GPU flavors:
- l4x4: 4x L4 GPUs (96GB total VRAM)
- a10g-largex2: 2x A10G GPUs (48GB total VRAM)
- a10g-largex4: 4x A10G GPUs (96GB total VRAM)
When to use:
- Large models (>13B parameters)
- Need faster processing (near-linear speedup at best, and only when the workload parallelizes well)
- Large datasets (>100K samples)
- Parallel workloads
- Tensor parallelism for inference
MCP Tool Example:
hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
CLI Equivalent:
hf jobs uv run --flavor a10g-largex2 --timeout 4h process.py
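One common way for a script such as `process.py` to spread samples across the GPUs of a multi-GPU flavor is data-parallel sharding: each worker process takes every N-th sample. The sketch below is plain Python with a hypothetical helper name, `shard_for_worker`. It assumes each worker already knows its index and the worker count, for example from environment variables set by a launcher; binding each worker to a GPU is left out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_worker(samples: Sequence[T], worker_index: int, num_workers: int) -> List[T]:
    """Return the strided slice of samples that one worker should process.

    Worker i takes samples i, i + num_workers, i + 2 * num_workers, ...
    so the shards are disjoint and together cover every sample.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    return list(samples[worker_index::num_workers])


samples = list(range(10))
for index in range(2):  # e.g. two workers on a10g-largex2
    print(index, shard_for_worker(samples, index, 2))
```

Strided sharding keeps shard sizes within one sample of each other, which helps the workers finish at about the same time.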
Choosing Between Options
CPU vs GPU
Choose CPU when:
- No GPU acceleration needed
- Data processing only
- Budget constrained
- Simple workloads
Choose GPU when:
- Model inference/training
- GPU-accelerated libraries
- Need faster processing
- Large models
a10g vs a100
Choose a10g when:
- Model <13B parameters
- Budget conscious
- Processing time not critical
Choose a100 when:
- Model 13B+ parameters
- Need fastest processing
- Memory requirements high
- Budget allows
Single vs Multi-GPU
Choose single GPU when:
- Model <7B parameters
- Budget constrained
- Simpler debugging
Choose multi-GPU when:
- Model >13B parameters
- Need faster processing
- Large batch sizes required
- Cost-effective for large jobs
Quick Reference
All Available Flavors
# Official flavor list (updated 07/2025)
FLAVORS = {
    # CPU
    "cpu-basic",     # Testing, lightweight
    "cpu-upgrade",   # Data processing
    # GPU - Single
    "t4-small",      # 16GB - <1B models
    "t4-medium",     # 16GB - 1-3B models
    "l4x1",          # 24GB - 3-7B models
    "a10g-small",    # 24GB - 3-7B production
    "a10g-large",    # 24GB - 7-13B models
    "a100-large",    # 40GB - 13B+ models
    # GPU - Multi
    "l4x4",          # 4x L4 (96GB total)
    "a10g-largex2",  # 2x A10G (48GB total)
    "a10g-largex4",  # 4x A10G (96GB total)
    # TPU
    "v5e-1x1",       # TPU v5e 1x1
    "v5e-2x2",       # TPU v5e 2x2
    "v5e-2x4",       # TPU v5e 2x4
}
Workload → Hardware Mapping
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
    "training": "see model-trainer skill"
}
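A mapping like this can be turned into a command line. The sketch below uses a hypothetical helper, `build_uv_command`, which only assembles the argument list in the `hf jobs uv run --flavor <flavor> <script>` form used in this guide and does not launch anything. The "training" entry is left out because it points to another skill rather than naming a flavor.

```python
from typing import List

# Copy of the workload mapping above, without the "training" entry,
# which refers to the model-trainer skill instead of a flavor.
WORKLOAD_FLAVORS = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
}


def build_uv_command(workload: str, script: str) -> List[str]:
    """Return the argument list for running a UV script on the mapped flavor."""
    flavor = WORKLOAD_FLAVORS.get(workload)
    if flavor is None:
        raise KeyError(f"no flavor mapped for workload {workload!r}")
    return ["hf", "jobs", "uv", "run", "--flavor", flavor, script]


print(" ".join(build_uv_command("experiments", "my_script.py")))
```

Returning a list rather than a single string lets the caller pass it to `subprocess.run` without shell quoting issues.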
CLI Examples
# CPU job
hf jobs run python:3.12 python script.py
# GPU job
hf jobs run --flavor a10g-large pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel python script.py
# TPU job
hf jobs run --flavor v5e-1x1 your-tpu-image python script.py
# UV script with GPU
hf jobs uv run --flavor a10g-small my_script.py