playbook/antigravity-awesome-skills/skills/hugging-face-jobs/references/hardware_guide.md

# Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective workloads.

> **Reference:** [HF Jobs Hardware Documentation](https://huggingface.co/docs/hub/en/spaces-config-reference) (updated 07/2025)

## Available Hardware

### CPU Flavors
| Flavor | Description | Use Case |
|--------|-------------|----------|
| `cpu-basic` | Basic CPU instance | Testing, lightweight scripts |
| `cpu-upgrade` | Enhanced CPU instance | Data processing, parallel workloads |

**Use cases:** Data processing, testing scripts, lightweight workloads
**Not recommended for:** Model training, GPU-accelerated workloads

### GPU Flavors

| Flavor | GPU | VRAM | Use Case |
|--------|-----|------|----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, quick tests |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU, parallel workloads |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models, batch inference |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fastest GPU option |

### TPU Flavors

| Flavor | Configuration | Use Case |
|--------|---------------|----------|
| `v5e-1x1` | TPU v5e (1x1) | Small TPU workloads |
| `v5e-2x2` | TPU v5e (2x2) | Medium TPU workloads |
| `v5e-2x4` | TPU v5e (2x4) | Large TPU workloads |

**TPU Use Cases:**
- JAX/Flax model training
- Large-scale inference
- TPU-optimized workloads

## Selection Guidelines

### By Workload Type

**Data Processing**
- **Recommended:** `cpu-upgrade` or `l4x1`
- **Use case:** Transform, filter, analyze datasets
- **Batch size:** Depends on data size
- **Time:** Varies by dataset size

**Batch Inference**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Run inference on thousands of samples
- **Batch size:** 8-32 depending on model
- **Time:** Depends on number of samples

**Experiments & Benchmarks**
- **Recommended:** `a10g-small` or `a10g-large`
- **Use case:** Reproducible ML experiments
- **Batch size:** Varies
- **Time:** Depends on experiment complexity

**Model Training** (see `model-trainer` skill for details)
- **Recommended:** See model-trainer skill
- **Use case:** Fine-tuning models
- **Batch size:** Depends on model size
- **Time:** Hours to days

**Synthetic Data Generation**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Generate datasets using LLMs
- **Batch size:** Depends on generation method
- **Time:** Hours for large datasets

### By Budget

**Minimal Budget (<$5 total)**
- Use `cpu-basic` or `t4-small`
- Process small datasets
- Quick tests and demos

**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Process medium datasets
- Run experiments

**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Process large datasets
- Production workloads

**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Large-scale processing
- Multiple experiments

### By Model Size (for inference/processing)

**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 8-16

**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 4-8

**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 2-4

**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B
- **Batch size:** 1-2

**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large`
- **Example:** Llama-3-13B, Llama-3-70B
- **Batch size:** 1

## Memory Considerations

### Estimating Memory Requirements

**For inference:**
```
Memory (GB) ≈ (Model params in billions) × 2-4
```

**For training:**
```
Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
```

**Examples:**
- Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
- Qwen2.5-7B inference: ~14-28GB ✅ fits a10g-large
- Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA

### Memory Optimization

If hitting memory limits:

1. **Reduce batch size**
   ```python
   batch_size = 1
   ```

2. **Process in chunks**
   ```python
   for chunk in chunks:
       process(chunk)
   ```

3. **Use smaller models**
   - Use quantized models
   - Use LoRA adapters

4. **Upgrade hardware**
   - cpu → t4 → a10g → a100

## Cost Estimation

### Formula

```
Total Cost = (Hours of runtime) × (Cost per hour)
```

### Example Calculations

**Data processing:**
- Hardware: cpu-upgrade ($0.50/hour)
- Time: 1 hour
- Cost: $0.50

**Batch inference:**
- Hardware: a10g-large ($5/hour)
- Time: 2 hours
- Cost: $10.00

**Experiments:**
- Hardware: a10g-small ($3.50/hour)
- Time: 4 hours
- Cost: $14.00

### Cost Optimization Tips

1. **Start small:** Test on cpu-basic or t4-small
2. **Monitor runtime:** Set appropriate timeouts
3. **Optimize code:** Reduce unnecessary compute
4. **Choose right hardware:** Don't over-provision
5. **Use checkpoints:** Resume if job fails
6. **Monitor costs:** Check running jobs regularly

## Multi-GPU Workloads

Multi-GPU flavors automatically distribute workloads:

**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs (96GB total VRAM)
- `a10g-largex2` - 2x A10G GPUs (48GB total VRAM)
- `a10g-largex4` - 4x A10G GPUs (96GB total VRAM)

**When to use:**
- Large models (>13B parameters)
- Need faster processing (linear speedup)
- Large datasets (>100K samples)
- Parallel workloads
- Tensor parallelism for inference

**MCP Tool Example:**
```python
hf_jobs("uv", {
    "script": "process.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**CLI Equivalent:**
```bash
hf jobs uv run process.py --flavor a10g-largex2 --timeout 4h
```

## Choosing Between Options

### CPU vs GPU

**Choose CPU when:**
- No GPU acceleration needed
- Data processing only
- Budget constrained
- Simple workloads

**Choose GPU when:**
- Model inference/training
- GPU-accelerated libraries
- Need faster processing
- Large models

### a10g vs a100

**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Processing time not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need fastest processing
- Memory requirements high
- Budget allows

### Single vs Multi-GPU

**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster processing
- Large batch sizes required
- Cost-effective for large jobs

## Quick Reference

### All Available Flavors

```python
# Official flavor list (updated 07/2025)
FLAVORS = {
    # CPU
    "cpu-basic",      # Testing, lightweight
    "cpu-upgrade",    # Data processing

    # GPU - Single
    "t4-small",       # 16GB - <1B models
    "t4-medium",      # 16GB - 1-3B models
    "l4x1",           # 24GB - 3-7B models
    "a10g-small",     # 24GB - 3-7B production
    "a10g-large",     # 24GB - 7-13B models
    "a100-large",     # 40GB - 13B+ models

    # GPU - Multi
    "l4x4",           # 4x L4 (96GB total)
    "a10g-largex2",   # 2x A10G (48GB total)
    "a10g-largex4",   # 4x A10G (96GB total)

    # TPU
    "v5e-1x1",        # TPU v5e 1x1
    "v5e-2x2",        # TPU v5e 2x2
    "v5e-2x4",        # TPU v5e 2x4
}
```

### Workload → Hardware Mapping

```python
HARDWARE_MAP = {
    "data_processing": "cpu-upgrade",
    "batch_inference_small": "t4-small",
    "batch_inference_medium": "a10g-large",
    "batch_inference_large": "a100-large",
    "experiments": "a10g-small",
    "tpu_workloads": "v5e-1x1",
    "training": "see model-trainer skill"
}
```

### CLI Examples

```bash
# CPU job
hf jobs run python:3.12 python script.py

# GPU job
hf jobs run --flavor a10g-large pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel python script.py

# TPU job
hf jobs run --flavor v5e-1x1 your-tpu-image python script.py

# UV script with GPU
hf jobs uv run --flavor a10g-small my_script.py
```