297 lines
9.6 KiB
Markdown
297 lines
9.6 KiB
Markdown
# GGUF Conversion Guide
|
|
|
|
After training models with TRL on Hugging Face Jobs, convert them to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.
|
|
|
|
**This guide provides production-ready, tested code based on successful conversions.** All critical dependencies and build steps are included.
|
|
|
|
## What is GGUF?
|
|
|
|
**GGUF** (GPT-Generated Unified Format):
|
|
- Optimized format for CPU/GPU inference with llama.cpp
|
|
- Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
|
|
- Compatible with: Ollama, LM Studio, Jan, GPT4All, llama.cpp
|
|
- Typically 2-8GB for 7B models (vs 14GB unquantized)
|
|
|
|
## When to Convert to GGUF
|
|
|
|
**Convert when:**
|
|
- Running models locally with Ollama or LM Studio
|
|
- Using CPU-optimized inference
|
|
- Reducing model size with quantization
|
|
- Deploying to edge devices
|
|
- Sharing models for local-first use
|
|
|
|
## Critical Success Factors
|
|
|
|
Based on production testing, these are **essential** for reliable conversion:
|
|
|
|
### 1. ✅ Install Build Tools FIRST
|
|
**Before cloning llama.cpp**, install build dependencies:
|
|
```python
|
|
subprocess.run(["apt-get", "update", "-qq"], check=True, capture_output=True)
|
|
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True, capture_output=True)
|
|
```
|
|
|
|
**Why:** The quantization tool requires gcc and cmake. Installing after cloning doesn't help.
|
|
|
|
### 2. ✅ Use CMake (Not Make)
|
|
**Build the quantize tool with CMake:**
|
|
```python
|
|
# Create build directory
|
|
os.makedirs("/tmp/llama.cpp/build", exist_ok=True)
|
|
|
|
# Configure
|
|
subprocess.run([
|
|
"cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
|
|
"-DGGML_CUDA=OFF" # Faster build, CUDA not needed for quantization
|
|
], check=True, capture_output=True, text=True)
|
|
|
|
# Build
|
|
subprocess.run([
|
|
"cmake", "--build", "/tmp/llama.cpp/build",
|
|
"--target", "llama-quantize", "-j", "4"
|
|
], check=True, capture_output=True, text=True)
|
|
|
|
# Binary path
|
|
quantize_bin = "/tmp/llama.cpp/build/bin/llama-quantize"
|
|
```
|
|
|
|
**Why:** CMake is more reliable than `make` and produces consistent binary paths.
|
|
|
|
### 3. ✅ Include All Dependencies
|
|
**PEP 723 header must include:**
|
|
```python
|
|
# /// script
|
|
# dependencies = [
|
|
# "transformers>=4.36.0",
|
|
# "peft>=0.7.0",
|
|
# "torch>=2.0.0",
|
|
# "accelerate>=0.24.0",
|
|
# "huggingface_hub>=0.20.0",
|
|
# "sentencepiece>=0.1.99", # Required for tokenizer
|
|
# "protobuf>=3.20.0", # Required for tokenizer
|
|
# "numpy",
|
|
# "gguf",
|
|
# ]
|
|
# ///
|
|
```
|
|
|
|
**Why:** `sentencepiece` and `protobuf` are critical for tokenizer conversion. Missing them causes silent failures.
|
|
|
|
### 4. ✅ Verify Names Before Use
|
|
**Always verify repos exist:**
|
|
```python
|
|
# Before submitting job, verify:
|
|
hub_repo_details([ADAPTER_MODEL], repo_type="model")
|
|
hub_repo_details([BASE_MODEL], repo_type="model")
|
|
```
|
|
|
|
**Why:** Non-existent dataset/model names cause job failures that could be caught in seconds.
|
|
|
|
## Complete Conversion Script
|
|
|
|
See `scripts/convert_to_gguf.py` for the complete, production-ready script.
|
|
|
|
**Key features:**
|
|
- ✅ All dependencies in PEP 723 header
|
|
- ✅ Build tools installed automatically
|
|
- ✅ CMake build process (reliable)
|
|
- ✅ Comprehensive error handling
|
|
- ✅ Environment variable configuration
|
|
- ✅ Automatic README generation
|
|
|
|
## Quick Conversion Job
|
|
|
|
```python
|
|
# Before submitting: VERIFY MODELS EXIST
|
|
hub_repo_details(["username/my-finetuned-model"], repo_type="model")
|
|
hub_repo_details(["Qwen/Qwen2.5-0.5B"], repo_type="model")
|
|
|
|
# Submit conversion job
|
|
hf_jobs("uv", {
|
|
"script": open("trl/scripts/convert_to_gguf.py").read(), # Or inline the script
|
|
"flavor": "a10g-large",
|
|
"timeout": "45m",
|
|
"secrets": {"HF_TOKEN": "$HF_TOKEN"},
|
|
"env": {
|
|
"ADAPTER_MODEL": "username/my-finetuned-model",
|
|
"BASE_MODEL": "Qwen/Qwen2.5-0.5B",
|
|
"OUTPUT_REPO": "username/my-model-gguf",
|
|
"HF_USERNAME": "username" # Optional, for README
|
|
}
|
|
})
|
|
```
|
|
|
|
## Conversion Process
|
|
|
|
The script performs these steps:
|
|
|
|
1. **Load and Merge** - Load base model and LoRA adapter, merge them
|
|
2. **Install Build Tools** - Install gcc, cmake (CRITICAL: before cloning llama.cpp)
|
|
3. **Setup llama.cpp** - Clone repo, install Python dependencies
|
|
4. **Convert to GGUF** - Create FP16 GGUF using llama.cpp converter
|
|
5. **Build Quantize Tool** - Use CMake to build `llama-quantize`
|
|
6. **Quantize** - Create Q4_K_M, Q5_K_M, Q8_0 versions
|
|
7. **Upload** - Upload all versions + README to Hub
|
|
|
|
## Quantization Options
|
|
|
|
Common quantization formats (from smallest to largest):
|
|
|
|
| Format | Size | Quality | Use Case |
|
|
|--------|------|---------|----------|
|
|
| **Q4_K_M** | ~300MB | Good | **Recommended** - best balance of size/quality |
|
|
| **Q5_K_M** | ~350MB | Better | Higher quality, slightly larger |
|
|
| **Q8_0** | ~500MB | Very High | Near-original quality |
|
|
| **F16** | ~1GB | Original | Full precision, largest file |
|
|
|
|
**Recommendation:** Create Q4_K_M, Q5_K_M, and Q8_0 versions to give users options.
|
|
|
|
## Hardware Requirements
|
|
|
|
**For conversion:**
|
|
- Small models (<1B): CPU-basic works, but slow
|
|
- Medium models (1-7B): a10g-large recommended
|
|
- Large models (7B+): a10g-large or a100-large
|
|
|
|
**Time estimates:**
|
|
- 0.5B model: ~15-25 minutes on A10G
|
|
- 3B model: ~30-45 minutes on A10G
|
|
- 7B model: ~45-60 minutes on A10G
|
|
|
|
## Using GGUF Models
|
|
|
|
**GGUF models work on both CPU and GPU.** They're optimized for CPU inference but can also leverage GPU acceleration when available.
|
|
|
|
### With Ollama (auto-detects GPU)
|
|
```bash
|
|
# Download GGUF
|
|
hf download username/my-model-gguf model-q4_k_m.gguf
|
|
|
|
# Create Modelfile
|
|
echo "FROM ./model-q4_k_m.gguf" > Modelfile
|
|
|
|
# Create and run (uses GPU automatically if available)
|
|
ollama create my-model -f Modelfile
|
|
ollama run my-model
|
|
```
|
|
|
|
### With llama.cpp
|
|
```bash
|
|
# CPU only
|
|
./llama-cli -m model-q4_k_m.gguf -p "Your prompt"
|
|
|
|
# With GPU acceleration (offload 32 layers to GPU)
|
|
./llama-cli -m model-q4_k_m.gguf -ngl 32 -p "Your prompt"
|
|
```
|
|
|
|
### With LM Studio
|
|
1. Download the `.gguf` file
|
|
2. Import into LM Studio
|
|
3. Start chatting
|
|
|
|
## Best Practices
|
|
|
|
### ✅ DO:
|
|
1. **Verify repos exist** before submitting jobs (use `hub_repo_details`)
|
|
2. **Install build tools FIRST** before cloning llama.cpp
|
|
3. **Use CMake** for building quantize tool (not make)
|
|
4. **Include all dependencies** in PEP 723 header (especially sentencepiece, protobuf)
|
|
5. **Create multiple quantizations** - Give users choice
|
|
6. **Test on known models** before production use
|
|
7. **Use A10G GPU** for faster conversion
|
|
|
|
### ❌ DON'T:
|
|
1. **Assume repos exist** - Always verify with hub tools
|
|
2. **Use make** instead of CMake - Less reliable
|
|
3. **Remove dependencies** to "simplify" - They're all needed
|
|
4. **Skip build tools** - Quantization will fail silently
|
|
5. **Use default paths** - CMake puts binaries in build/bin/
|
|
|
|
## Common Issues
|
|
|
|
### Out of memory during merge
|
|
**Fix:**
|
|
- Use larger GPU (a10g-large or a100-large)
|
|
- Ensure `device_map="auto"` for automatic placement
|
|
- Use `dtype=torch.float16` or `torch.bfloat16`
|
|
|
|
### Conversion fails with architecture error
|
|
**Fix:**
|
|
- Ensure llama.cpp supports the model architecture
|
|
- Check for standard architecture (Qwen, Llama, Mistral, etc.)
|
|
- Update llama.cpp to latest: `git clone --depth 1 https://github.com/ggerganov/llama.cpp.git`
|
|
- Check llama.cpp documentation for model support
|
|
|
|
### Quantization fails
|
|
**Fix:**
|
|
- Verify build tools installed: `apt-get install build-essential cmake`
|
|
- Use CMake (not make) to build quantize tool
|
|
- Check binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
|
|
- Verify FP16 GGUF exists before quantizing
|
|
|
|
### Missing sentencepiece error
|
|
**Fix:**
|
|
- Add to PEP 723 header: `"sentencepiece>=0.1.99", "protobuf>=3.20.0"`
|
|
- Don't remove dependencies to "simplify" - all are required
|
|
|
|
### Upload fails or times out
|
|
**Fix:**
|
|
- Large models (>2GB) need longer timeout: `"timeout": "1h"`
|
|
- Upload quantized versions separately if needed
|
|
- Check network/Hub status
|
|
|
|
## Lessons Learned
|
|
|
|
These are from production testing and real failures:
|
|
|
|
### 1. Always Verify Before Use
|
|
**Lesson:** Don't assume repos/datasets exist. Check first.
|
|
```python
|
|
# BEFORE submitting job
|
|
hub_repo_details(["trl-lib/argilla-dpo-mix-7k"], repo_type="dataset") # Would catch error
|
|
```
|
|
**Prevented failures:** Non-existent dataset names, typos in model names
|
|
|
|
### 2. Prioritize Reliability Over Performance
|
|
**Lesson:** Default to what's most likely to succeed.
|
|
- Use CMake (not make) - more reliable
|
|
- Disable CUDA in build - faster, not needed
|
|
- Include all dependencies - don't "simplify"
|
|
|
|
**Prevented failures:** Build failures, missing binaries
|
|
|
|
### 3. Create Atomic, Self-Contained Scripts
|
|
**Lesson:** Don't remove dependencies or steps. Scripts should work as a unit.
|
|
- All dependencies in PEP 723 header
|
|
- All build steps included
|
|
- Clear error messages
|
|
|
|
**Prevented failures:** Missing tokenizer libraries, build tool failures
|
|
|
|
## References
|
|
|
|
**In this skill:**
|
|
- `scripts/convert_to_gguf.py` - Complete, production-ready script
|
|
|
|
**External:**
|
|
- [llama.cpp Repository](https://github.com/ggerganov/llama.cpp)
|
|
- [GGUF Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
|
|
- [Ollama Documentation](https://ollama.ai)
|
|
- [LM Studio](https://lmstudio.ai)
|
|
|
|
## Summary
|
|
|
|
**Critical checklist for GGUF conversion:**
|
|
- [ ] Verify adapter and base models exist on Hub
|
|
- [ ] Use production script from `scripts/convert_to_gguf.py`
|
|
- [ ] All dependencies in PEP 723 header (including sentencepiece, protobuf)
|
|
- [ ] Build tools installed before cloning llama.cpp
|
|
- [ ] CMake used for building quantize tool (not make)
|
|
- [ ] Correct binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
|
|
- [ ] A10G GPU selected for reasonable conversion time
|
|
- [ ] Timeout set to 45m minimum
|
|
- [ ] HF_TOKEN in secrets for Hub upload
|
|
|
|
**The script in `scripts/convert_to_gguf.py` incorporates all these lessons and has been tested successfully in production.**
|