# Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective workloads.

> **Reference:** [HF Jobs Hardware Documentation](https://huggingface.co/docs/hub/en/spaces-config-reference) (updated 07/2025)

## Available Hardware

### CPU Flavors

| Flavor | Description | Use Case |
|--------|-------------|----------|
| `cpu-basic` | Basic CPU instance | Testing, lightweight scripts |
| `cpu-upgrade` | Enhanced CPU instance | Data processing, parallel workloads |

**Use cases:** Data processing, testing scripts, lightweight workloads

**Not recommended for:** Model training, GPU-accelerated workloads

### GPU Flavors

| Flavor | GPU | VRAM (total) | Use Case |
|--------|-----|--------------|----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, quick tests |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU, parallel workloads |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models, batch inference |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fastest GPU option |

### TPU Flavors

| Flavor | Configuration | Use Case |
|--------|---------------|----------|
| `v5e-1x1` | TPU v5e (1x1) | Small TPU workloads |
| `v5e-2x2` | TPU v5e (2x2) | Medium TPU workloads |
| `v5e-2x4` | TPU v5e (2x4) | Large TPU workloads |

**TPU Use Cases:**
- JAX/Flax model training
- Large-scale inference
- TPU-optimized workloads

## Selection Guidelines

### By Workload Type

**Data Processing**
- **Recommended:** `cpu-upgrade` or `l4x1`
- **Use case:** Transform, filter, analyze datasets
- **Batch size:** Depends on data size
- **Time:** Varies by dataset size

**Batch Inference**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Run inference on thousands of samples
- **Batch size:** 8-32 depending on model
- **Time:** Depends on number of samples

**Experiments & Benchmarks**
- **Recommended:** `a10g-small` or `a10g-large`
- **Use case:** Reproducible ML experiments
- **Batch size:** Varies
- **Time:** Depends on experiment complexity

**Model Training** (see `model-trainer` skill for details)
- **Recommended:** See the `model-trainer` skill
- **Use case:** Fine-tuning models
- **Batch size:** Depends on model size
- **Time:** Hours to days

**Synthetic Data Generation**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Generate datasets using LLMs
- **Batch size:** Depends on generation method
- **Time:** Hours for large datasets

### By Budget

**Minimal Budget (<$5 total)**
- Use `cpu-basic` or `t4-small`
- Process small datasets
- Quick tests and demos

**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Process medium datasets
- Run experiments

**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Process large datasets
- Production workloads

**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Large-scale processing
- Multiple experiments

### By Model Size (for inference/processing)

**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B (TinyLlama, at 1.1B, sits just above this range)
- **Batch size:** 8-16

**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 4-8

**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 2-4

**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B
- **Batch size:** 1-2

**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large`
- **Example:** Llama-2-13B; Llama-3-70B requires quantization to fit
- **Batch size:** 1

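The size tiers above can be sketched as a small lookup. This is a minimal illustration, not an official API: the function name `recommend_flavor` is hypothetical, it returns only the first recommended flavor of each tier, and it assumes that a model sitting exactly on a boundary (for example 7B) belongs to the larger tier.

```python
def recommend_flavor(params_billion: float) -> str:
    """Return the first-choice flavor for a model of the given size.

    Tiers mirror the "By Model Size" list. Upper bounds are exclusive,
    which is an assumption: the guide does not say where 7B or 13B fall.
    """
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")

    # (exclusive upper bound in billions of parameters, flavor)
    tiers = [
        (1.0, "t4-small"),     # tiny models
        (3.0, "t4-medium"),    # small models
        (7.0, "a10g-small"),   # medium models
        (13.0, "a10g-large"),  # large models
    ]
    for upper_bound, flavor in tiers:
        if params_billion < upper_bound:
            return flavor
    return "a100-large"        # very large models (13B+)


if __name__ == "__main__":
    for size in (0.5, 1.5, 3.0, 8.0, 70.0):
        print(f"{size}B -> {recommend_flavor(size)}")
```

Treat the result as a starting point: the memory estimate for the specific model and precision should still be checked against the flavor's VRAM.
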
## Memory Considerations

### Estimating Memory Requirements

**For inference:**
```
Memory (GB) ≈ (Model params in billions) × 2-4
```

The ×2 factor corresponds to 16-bit weights (2 bytes per parameter) and ×4 to 32-bit weights.

**For training:**
```
Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
```

**Examples:**
- Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
- Qwen2.5-7B inference: ~14GB (×2) ✅ fits a10g-large; ~28GB (×4) ❌ exceeds its 24GB
- Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA (~28GB with LoRA)

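The rules of thumb above translate directly into a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the VRAM figures are copied from the single-GPU rows of the GPU table in this guide, and the estimate ignores activations and KV cache.

```python
# VRAM per single-GPU flavor in GB, taken from the GPU table in this guide.
FLAVOR_VRAM_GB = {
    "t4-small": 16,
    "t4-medium": 16,
    "l4x1": 24,
    "a10g-small": 24,
    "a10g-large": 24,
    "a100-large": 40,
}


def inference_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weights-only estimate: 2 bytes/param for 16-bit, 4 for 32-bit."""
    return params_billion * bytes_per_param


def training_memory_gb(params_billion: float, lora: bool = False) -> float:
    """Rule of thumb from this guide: x20 for full fine-tuning, x4 for LoRA."""
    return params_billion * (4 if lora else 20)


def fits(required_gb: float, flavor: str) -> bool:
    """True if the estimate is within the flavor's VRAM."""
    return required_gb <= FLAVOR_VRAM_GB[flavor]


if __name__ == "__main__":
    need = inference_memory_gb(7)                # 14 GB in 16-bit
    print(need, fits(need, "a10g-large"))        # 14 True
    need = inference_memory_gb(7, 4)             # 28 GB in 32-bit
    print(need, fits(need, "a10g-large"))        # 28 False
    print(training_memory_gb(7))                 # 140
    print(training_memory_gb(7, lora=True))      # 28
```

Leave some headroom below the VRAM limit in practice, since batch size and sequence length add memory on top of the weights.
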
### Memory Optimization

If you hit memory limits:

1. **Reduce batch size**
   ```python
   batch_size = 1
   ```

2. **Process in chunks**
   ```python
   for chunk in chunks:
       process(chunk)
   ```

3. **Use smaller models**
   - Use quantized models
   - Use LoRA adapters

4. **Upgrade hardware**
   - cpu → t4 → a10g → a100

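The first two steps above (smaller batches, chunked processing) can be combined into one loop. The following is a sketch under stated assumptions: `iter_chunks` and `process_with_backoff` are hypothetical helpers, and the built-in `MemoryError` stands in for whatever out-of-memory exception your framework raises (for example `torch.cuda.OutOfMemoryError` in PyTorch).

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def process_with_backoff(
    items: Sequence[T],
    process: Callable[[Sequence[T]], List[R]],
    batch_size: int,
) -> List[R]:
    """Process items in batches, halving the batch size on MemoryError.

    MemoryError is a placeholder for the framework-specific OOM exception.
    """
    results: List[R] = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(process(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # cannot shrink further: use a smaller model or bigger GPU
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results
```

Once the batch size has been reduced it stays reduced for the rest of the run, which avoids repeatedly triggering the same failure.
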
## Cost Estimation

### Formula

```
Total Cost = (Hours of runtime) × (Cost per hour)
```

### Example Calculations

Hourly rates change over time; confirm current pricing before budgeting.

**Data processing:**
- Hardware: cpu-upgrade ($0.50/hour)
- Time: 1 hour
- Cost: $0.50

**Batch inference:**
- Hardware: a10g-large ($5/hour)
- Time: 2 hours
- Cost: $10.00

**Experiments:**
- Hardware: a10g-small ($3.50/hour)
- Time: 4 hours
- Cost: $14.00

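The formula and the three worked examples can be reproduced with a short helper. This is a sketch: the function names are hypothetical, and the hourly rates are simply the figures used in the examples above, passed in as arguments rather than looked up from any pricing source.

```python
def estimate_cost(hours: float, rate_per_hour: float) -> float:
    """Total cost = hours of runtime x cost per hour, rounded to cents."""
    if hours < 0 or rate_per_hour < 0:
        raise ValueError("hours and rate_per_hour must be non-negative")
    return round(hours * rate_per_hour, 2)


def max_hours(budget: float, rate_per_hour: float) -> float:
    """How long a job can run before it exhausts the budget."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return budget / rate_per_hour


if __name__ == "__main__":
    print(estimate_cost(1, 0.50))   # 0.5   (data processing example)
    print(estimate_cost(2, 5.00))   # 10.0  (batch inference example)
    print(estimate_cost(4, 3.50))   # 14.0  (experiments example)
    print(max_hours(20, 5.00))      # 4.0   hours within a $20 budget
```

`max_hours` is a convenient way to choose a job timeout: set the timeout at or below the number of hours the budget allows.
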
### Cost Optimization Tips

1. **Start small:** Test on cpu-basic or t4-small
2. **Monitor runtime:** Set appropriate timeouts
3. **Optimize code:** Reduce unnecessary compute
4. **Choose the right hardware:** Don't over-provision
5. **Use checkpoints:** Resume if the job fails
6. **Monitor costs:** Check running jobs regularly

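Tip 5 can be illustrated with a very small resumable loop. This is a sketch, not a prescribed mechanism: `run_resumable` is a hypothetical helper, the JSON checkpoint format is invented for the example, and it assumes `checkpoint_path` points to storage that survives a job restart.

```python
import json
from pathlib import Path
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_resumable(
    items: Sequence[T],
    process: Callable[[T], None],
    checkpoint_path: str,
    save_every: int = 100,
) -> int:
    """Process items in order, recording progress so a rerun can resume.

    Returns the index the run started from. Items handled after the last
    saved checkpoint are processed again on resume (at-least-once), so
    `process` should be safe to repeat.
    """
    path = Path(checkpoint_path)
    start = json.loads(path.read_text())["next_index"] if path.exists() else 0

    for i in range(start, len(items)):
        process(items[i])
        done = i + 1
        # Save periodically and once more when the last item is finished.
        if done % save_every == 0 or done == len(items):
            path.write_text(json.dumps({"next_index": done}))
    return start
```

A smaller `save_every` wastes less work after a failure at the price of more frequent writes.
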
## Multi-GPU Workloads

Multi-GPU flavors expose several GPUs to a single job. The workload is not split automatically: your script has to distribute it, for example with data parallelism or tensor parallelism.

**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs (96GB total VRAM)
- `a10g-largex2` - 2x A10G GPUs (48GB total VRAM)
- `a10g-largex4` - 4x A10G GPUs (96GB total VRAM)

**When to use:**
- Large models (>13B parameters)
- Need faster processing (near-linear speedup at best, for workloads that parallelize cleanly)
- Large datasets (>100K samples)
- Parallel workloads
- Tensor parallelism for inference

**MCP Tool Example:**
```python
hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**CLI Equivalent:**
```bash
hf jobs uv run --flavor a10g-largex2 --timeout 4h process.py
```

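For the "large datasets" and "parallel workloads" cases above, the simplest way to use several GPUs is to give each worker process its own slice of the data. The sketch below shows only the index arithmetic; `shard_for_rank` is a hypothetical helper, and how `rank` and `world_size` are obtained is an assumption (launchers such as `torchrun` expose them through the `RANK` and `WORLD_SIZE` environment variables).

```python
def shard_for_rank(n_samples: int, world_size: int, rank: int) -> range:
    """Return the sample indices that worker `rank` should process.

    Indices are interleaved (rank, rank + world_size, ...), so shard
    sizes differ by at most one and every sample is assigned exactly once.
    """
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return range(rank, n_samples, world_size)


if __name__ == "__main__":
    # Two workers, e.g. one per GPU on a10g-largex2.
    print(list(shard_for_rank(10, world_size=2, rank=0)))  # [0, 2, 4, 6, 8]
    print(list(shard_for_rank(10, world_size=2, rank=1)))  # [1, 3, 5, 7, 9]
```

This kind of data parallelism only helps when the model fits on a single GPU; models that exceed one GPU's VRAM need tensor or pipeline parallelism instead.
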
## Choosing Between Options

### CPU vs GPU

**Choose CPU when:**
- No GPU acceleration needed
- Data processing only
- Budget constrained
- Simple workloads

**Choose GPU when:**
- Model inference/training
- GPU-accelerated libraries
- Need faster processing
- Large models

### a10g vs a100

**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Processing time not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need fastest processing
- Memory requirements high
- Budget allows

### Single vs Multi-GPU

**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster processing
- Large batch sizes required
- Cost-effective for large jobs

## Quick Reference

### All Available Flavors

```python
# Official flavor list (updated 07/2025)
FLAVORS = {
    # CPU
    "cpu-basic",      # Testing, lightweight
    "cpu-upgrade",    # Data processing

    # GPU - Single
    "t4-small",       # 16GB - <1B models
    "t4-medium",      # 16GB - 1-3B models
    "l4x1",           # 24GB - 3-7B models
    "a10g-small",     # 24GB - 3-7B production
    "a10g-large",     # 24GB - 7-13B models
    "a100-large",     # 40GB - 13B+ models

    # GPU - Multi
    "l4x4",           # 4x L4 (96GB total)
    "a10g-largex2",   # 2x A10G (48GB total)
    "a10g-largex4",   # 4x A10G (96GB total)

    # TPU
    "v5e-1x1",        # TPU v5e 1x1
    "v5e-2x2",        # TPU v5e 2x2
    "v5e-2x4",        # TPU v5e 2x4
}
```

### Workload → Hardware Mapping

```python
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
    "training": "see model-trainer skill"  # not a flavor: defer to that skill
}
```

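The mapping above can drive command construction. The sketch below is illustrative: `build_uv_command` is a hypothetical helper, it uses only the `hf jobs uv run`, `--flavor`, and `--timeout` forms that appear elsewhere in this guide, and it leaves out the `training` entry because that value is not a flavor.

```python
import shlex
from typing import List, Optional

# Workload-to-flavor entries copied from this guide, minus "training".
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
}


def build_uv_command(
    workload: str, script: str, timeout: Optional[str] = None
) -> List[str]:
    """Build the argument list for an `hf jobs uv run` invocation."""
    if workload not in HARDWARE_MAP:
        raise KeyError(f"unknown workload: {workload!r}")
    cmd = ["hf", "jobs", "uv", "run", "--flavor", HARDWARE_MAP[workload]]
    if timeout is not None:
        cmd += ["--timeout", timeout]
    cmd.append(script)  # options go before the script path
    return cmd


if __name__ == "__main__":
    cmd = build_uv_command("batch_inference_medium", "process.py", timeout="4h")
    print(shlex.join(cmd))
    # hf jobs uv run --flavor a10g-large --timeout 4h process.py
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.
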
### CLI Examples

```bash
# CPU job
hf jobs run python:3.12 python script.py

# GPU job
hf jobs run --flavor a10g-large pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel python script.py

# TPU job
hf jobs run --flavor v5e-1x1 your-tpu-image python script.py

# UV script with GPU
hf jobs uv run --flavor a10g-small my_script.py
```