# Hardware Selection Guide
Choosing the right hardware (flavor) is critical for cost-effective workloads.
> **Reference:** [HF Jobs Hardware Documentation](https://huggingface.co/docs/hub/en/spaces-config-reference) (updated 07/2025)
## Available Hardware
### CPU Flavors
| Flavor | Description | Use Case |
|--------|-------------|----------|
| `cpu-basic` | Basic CPU instance | Testing, lightweight scripts |
| `cpu-upgrade` | Enhanced CPU instance | Data processing, parallel workloads |

**Use cases:** Data processing, testing scripts, lightweight workloads

**Not recommended for:** Model training, GPU-accelerated workloads
### GPU Flavors
| Flavor | GPU | VRAM | Use Case |
|--------|-----|------|----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, quick tests |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU, parallel workloads |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models, batch inference |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fastest GPU option |
### TPU Flavors
| Flavor | Configuration | Use Case |
|--------|---------------|----------|
| `v5e-1x1` | TPU v5e (1x1) | Small TPU workloads |
| `v5e-2x2` | TPU v5e (2x2) | Medium TPU workloads |
| `v5e-2x4` | TPU v5e (2x4) | Large TPU workloads |

**TPU Use Cases:**
- JAX/Flax model training
- Large-scale inference
- TPU-optimized workloads
## Selection Guidelines
### By Workload Type
**Data Processing**
- **Recommended:** `cpu-upgrade` or `l4x1`
- **Use case:** Transform, filter, analyze datasets
- **Batch size:** Depends on data size
- **Time:** Varies by dataset size

**Batch Inference**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Run inference on thousands of samples
- **Batch size:** 8-32 depending on model
- **Time:** Depends on number of samples

**Experiments & Benchmarks**
- **Recommended:** `a10g-small` or `a10g-large`
- **Use case:** Reproducible ML experiments
- **Batch size:** Varies
- **Time:** Depends on experiment complexity

**Model Training** (see `model-trainer` skill for details)
- **Recommended:** See `model-trainer` skill
- **Use case:** Fine-tuning models
- **Batch size:** Depends on model size
- **Time:** Hours to days

**Synthetic Data Generation**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Generate datasets using LLMs
- **Batch size:** Depends on generation method
- **Time:** Hours for large datasets
### By Budget
**Minimal Budget (<$5 total)**
- Use `cpu-basic` or `t4-small`
- Process small datasets
- Quick tests and demos

**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Process medium datasets
- Run experiments

**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Process large datasets
- Production workloads

**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Large-scale processing
- Multiple experiments
### By Model Size (for inference/processing)
**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 8-16

**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 4-8

**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 2-4

**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B
- **Batch size:** 1-2

**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large`
- **Example:** Llama-2-13B; Llama-3-70B only with aggressive quantization or a multi-GPU flavor
- **Batch size:** 1
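
The tiers above can be expressed as a small lookup. This is a minimal sketch, not part of any HF tooling: the function name `flavor_for_model` is hypothetical, it returns only the first recommended flavor of each tier, and it assumes boundary sizes resolve the way the examples suggest (a 7B model counts as medium, a 13B model as very large).

```python
def flavor_for_model(params_billions: float) -> str:
    """Return the first recommended flavor for a model size, per the tiers in this guide."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    if params_billions < 1:      # tiny: <1B
        return "t4-small"
    if params_billions <= 3:     # small: 1-3B
        return "t4-medium"
    if params_billions <= 7:     # medium: 3-7B (7B models are listed as medium)
        return "a10g-small"
    if params_billions < 13:     # large: 7-13B
        return "a10g-large"
    return "a100-large"          # very large: 13B+


print(flavor_for_model(0.5))  # Qwen2.5-0.5B
print(flavor_for_model(8))    # Llama-3-8B
```

Treat the result as a starting point and confirm it against the memory estimate for the precision you actually load the model in.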
## Memory Considerations
### Estimating Memory Requirements
**For inference:**
```
Memory (GB) ≈ (Model params in billions) × 2-4
```
**For training:**
```
Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
```
**Examples:**
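- Qwen2.5-0.5B inference: ~1-2GB → fits `t4-small`
- Qwen2.5-7B inference: ~14-28GB → the low end fits `a10g-large` (24GB); the high end needs `a100-large`
- Qwen2.5-7B full training: ~140GB → not feasible on a single GPU; with LoRA (~28GB) it fits `a100-large`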
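
The rules of thumb above are easy to turn into a helper. The following is a rough sketch using only the multipliers from this section (×2-4 for inference, ×20 for full training, ×4 for LoRA); the function name and mode strings are hypothetical, and real usage also depends on sequence length, batch size, and precision.

```python
def estimate_memory_gb(params_billions: float, mode: str = "inference") -> tuple:
    """Return a (low, high) memory estimate in GB using this guide's multipliers."""
    multipliers = {
        "inference": (2, 4),        # roughly fp16 to fp32 weights
        "full_training": (20, 20),  # weights, gradients, optimizer state
        "lora": (4, 4),             # frozen base model plus small adapters
    }
    if mode not in multipliers:
        raise ValueError(f"unknown mode: {mode!r}")
    low, high = multipliers[mode]
    return (params_billions * low, params_billions * high)


print(estimate_memory_gb(7))                   # inference range for a 7B model
print(estimate_memory_gb(7, "full_training"))  # full fine-tuning
print(estimate_memory_gb(7, "lora"))           # LoRA fine-tuning
```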
### Memory Optimization
If hitting memory limits:
1. **Reduce batch size**
   ```python
   batch_size = 1
   ```
2. **Process in chunks**
   ```python
   for chunk in chunks:
       process(chunk)
   ```
3. **Use smaller models**
   - Use quantized models
   - Use LoRA adapters
4. **Upgrade hardware**
   - cpu → t4 → a10g → a100
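
The "process in chunks" step can be fleshed out as follows. This is a minimal sketch: `iter_chunks` and `process` are hypothetical names, and `process` is a stand-in for whatever per-chunk work the job does. Only one chunk is held in memory at a time.

```python
from itertools import islice


def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process(chunk):
    """Hypothetical stand-in for real work, e.g. running a model on a batch."""
    return [len(text) for text in chunk]


results = []
for chunk in iter_chunks(["a", "bb", "ccc", "dddd", "eeeee"], chunk_size=2):
    results.extend(process(chunk))
print(results)
```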
## Cost Estimation
### Formula
```
Total Cost = (Hours of runtime) × (Cost per hour)
```
### Example Calculations
**Data processing:**
- Hardware: cpu-upgrade ($0.50/hour)
- Time: 1 hour
- Cost: $0.50

**Batch inference:**
- Hardware: a10g-large ($5/hour)
- Time: 2 hours
- Cost: $10.00

**Experiments:**
- Hardware: a10g-small ($3.50/hour)
- Time: 4 hours
- Cost: $14.00
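
The same arithmetic as a short script. This is a sketch only: the hourly rates are copied from the examples above and are assumptions for illustration rather than a current price list, and both function names are hypothetical.

```python
# Hourly rates taken from the examples in this guide; check current pricing before relying on them.
EXAMPLE_RATES = {
    "cpu-upgrade": 0.50,
    "a10g-small": 3.50,
    "a10g-large": 5.00,
}


def estimate_cost(hours: float, rate_per_hour: float) -> float:
    """Total cost = hours of runtime x cost per hour, rounded to cents."""
    return round(hours * rate_per_hour, 2)


def affordable_hours(budget: float, rate_per_hour: float) -> float:
    """How many hours a budget buys at a given hourly rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return budget / rate_per_hour


print(estimate_cost(2, EXAMPLE_RATES["a10g-large"]))      # batch inference example
print(affordable_hours(20, EXAMPLE_RATES["a10g-large"]))  # hours a $20 budget covers
```

Running the second calculation before submitting a job gives a natural upper bound for the job timeout.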
### Cost Optimization Tips
1. **Start small:** Test on cpu-basic or t4-small
2. **Monitor runtime:** Set appropriate timeouts
3. **Optimize code:** Reduce unnecessary compute
4. **Choose right hardware:** Don't over-provision
5. **Use checkpoints:** Resume if job fails
6. **Monitor costs:** Check running jobs regularly
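
Tip 5 can be illustrated with a minimal resume loop. This is a sketch under stated assumptions: the function names are hypothetical, progress is a single index stored in a JSON file, and the checkpoint path must live somewhere that outlives the job (how to persist it is outside this sketch).

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return the index of the next unprocessed item, or 0 if no checkpoint exists."""
    checkpoint = Path(path)
    if checkpoint.exists():
        return json.loads(checkpoint.read_text())["next_index"]
    return 0


def run_with_checkpoints(items, path, handle):
    """Process items from the last checkpoint onward; return how many were handled."""
    start = load_checkpoint(path)
    for index in range(start, len(items)):
        handle(items[index])
        # Record progress after every item so a failed job can resume here.
        Path(path).write_text(json.dumps({"next_index": index + 1}))
    return len(items) - start
```

Writing the checkpoint after every item keeps the example simple; for cheap items, saving every N items reduces I/O at the cost of redoing at most N items after a failure.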
## Multi-GPU Workloads
Multi-GPU flavors expose several GPUs to the job. The script itself must spread the work across them, for example with data parallelism or tensor parallelism.

**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs (96GB total VRAM)
- `a10g-largex2` - 2x A10G GPUs (48GB total VRAM)
- `a10g-largex4` - 4x A10G GPUs (96GB total VRAM)

**When to use:**
- Large models (>13B parameters)
- Need faster processing (near-linear speedup at best, and only for workloads that parallelize well)
- Large datasets (>100K samples)
- Parallel workloads
- Tensor parallelism for inference

**MCP Tool Example:**
```python
hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
**CLI Equivalent:**
```bash
hf jobs uv run --flavor a10g-largex2 --timeout 4h process.py
```
## Choosing Between Options
### CPU vs GPU
**Choose CPU when:**
- No GPU acceleration needed
- Data processing only
- Budget constrained
- Simple workloads

**Choose GPU when:**
- Model inference/training
- GPU-accelerated libraries
- Need faster processing
- Large models

### a10g vs a100
**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Processing time not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need fastest processing
- Memory requirements high
- Budget allows

### Single vs Multi-GPU
**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster processing
- Large batch sizes required
- Cost-effective for large jobs
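
One way to make the single versus multi-GPU choice concrete is to filter flavors by the memory the workload needs. This is a sketch, not an official selector: the VRAM figures are copied from the tables in this guide, the function name is hypothetical, and it assumes the total VRAM of a multi-GPU flavor is usable only when the script shards the model across GPUs.

```python
# Total VRAM per flavor in GB, as listed in this guide.
GPU_VRAM_GB = {
    "t4-small": 16, "t4-medium": 16,
    "l4x1": 24, "a10g-small": 24, "a10g-large": 24,
    "a100-large": 40,
    "a10g-largex2": 48, "l4x4": 96, "a10g-largex4": 96,
}
MULTI_GPU = {"l4x4", "a10g-largex2", "a10g-largex4"}


def flavors_that_fit(required_gb: float) -> list:
    """Flavors whose total VRAM covers the requirement: single-GPU first, then by size."""
    fitting = [
        (name in MULTI_GPU, vram, name)
        for name, vram in GPU_VRAM_GB.items()
        if vram >= required_gb
    ]
    return [name for _, _, name in sorted(fitting)]


print(flavors_that_fit(28))  # a single GPU is still enough
print(flavors_that_fit(45))  # only multi-GPU flavors remain
```

An empty list means no flavor in this guide has enough memory, so the model needs quantization or a smaller variant.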
## Quick Reference
### All Available Flavors
```python
# Official flavor list (updated 07/2025)
FLAVORS = {
    # CPU
    "cpu-basic",     # Testing, lightweight
    "cpu-upgrade",   # Data processing
    # GPU - Single
    "t4-small",      # 16GB - <1B models
    "t4-medium",     # 16GB - 1-3B models
    "l4x1",          # 24GB - 3-7B models
    "a10g-small",    # 24GB - 3-7B production
    "a10g-large",    # 24GB - 7-13B models
    "a100-large",    # 40GB - 13B+ models
    # GPU - Multi
    "l4x4",          # 4x L4 (96GB total)
    "a10g-largex2",  # 2x A10G (48GB total)
    "a10g-largex4",  # 4x A10G (96GB total)
    # TPU
    "v5e-1x1",       # TPU v5e 1x1
    "v5e-2x2",       # TPU v5e 2x2
    "v5e-2x4",       # TPU v5e 2x4
}
```
### Workload → Hardware Mapping
```python
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
    "training": "see model-trainer skill"
}
```
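
A mapping like this can feed a job configuration directly. The sketch below is illustrative only: `job_config` is a hypothetical helper, the returned dictionary merely mirrors the shape of the MCP tool example earlier in this guide, and the `training` entry is left out because it points to another skill rather than naming a flavor.

```python
# Same mapping as in this guide, minus "training", which is not a flavor name.
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
}


def job_config(script: str, workload: str, timeout: str = "1h") -> dict:
    """Build a job config dict with the default flavor for a workload type."""
    if workload not in HARDWARE_MAP:
        known = ", ".join(sorted(HARDWARE_MAP))
        raise KeyError(f"unknown workload {workload!r}; expected one of: {known}")
    return {"script": script, "flavor": HARDWARE_MAP[workload], "timeout": timeout}


print(job_config("process.py", "experiments", timeout="4h"))
```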
### CLI Examples
```bash
# CPU job
hf jobs run python:3.12 python script.py
# GPU job
hf jobs run --flavor a10g-large pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel python script.py
# TPU job
hf jobs run --flavor v5e-1x1 your-tpu-image python script.py
# UV script with GPU
hf jobs uv run --flavor a10g-small my_script.py
```