# GGUF Conversion Guide

After training models with TRL on Hugging Face Jobs, convert them to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.

**This guide provides production-ready, tested code based on successful conversions.** All critical dependencies and build steps are included.

## What is GGUF?

**GGUF** (GPT-Generated Unified Format):
- Optimized format for CPU/GPU inference with llama.cpp
- Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
- Compatible with: Ollama, LM Studio, Jan, GPT4All, llama.cpp
- Typically 2-8GB for 7B models (vs 14GB unquantized)

## When to Convert to GGUF

**Convert when:**
- Running models locally with Ollama or LM Studio
- Using CPU-optimized inference
- Reducing model size with quantization
- Deploying to edge devices
- Sharing models for local-first use

## Critical Success Factors

Based on production testing, these are **essential** for reliable conversion:

### 1. ✅ Install Build Tools FIRST

**Before cloning llama.cpp**, install build dependencies:

```python
import subprocess

# Refresh the package index, then install the compiler toolchain and CMake
subprocess.run(["apt-get", "update", "-qq"], check=True, capture_output=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True, capture_output=True)
```

**Why:** The quantization tool is compiled from source with gcc and cmake, so both must be installed before the build step runs. Installing them at the start of the script guarantees that.

### 2. ✅ Use CMake (Not Make)

**Build the quantize tool with CMake:**

```python
import os
import subprocess

# Create build directory
os.makedirs("/tmp/llama.cpp/build", exist_ok=True)

# Configure
subprocess.run([
    "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
    "-DGGML_CUDA=OFF"  # Faster build, CUDA not needed for quantization
], check=True, capture_output=True, text=True)

# Build only the quantize target
subprocess.run([
    "cmake", "--build", "/tmp/llama.cpp/build",
    "--target", "llama-quantize", "-j", "4"
], check=True, capture_output=True, text=True)

# Binary path
quantize_bin = "/tmp/llama.cpp/build/bin/llama-quantize"
```

**Why:** CMake is more reliable than `make` and produces consistent binary paths.

### 3. ✅ Include All Dependencies

**PEP 723 header must include:**

```python
# /// script
# dependencies = [
#     "transformers>=4.36.0",
#     "peft>=0.7.0",
#     "torch>=2.0.0",
#     "accelerate>=0.24.0",
#     "huggingface_hub>=0.20.0",
#     "sentencepiece>=0.1.99",  # Required for tokenizer
#     "protobuf>=3.20.0",       # Required for tokenizer
#     "numpy",
#     "gguf",
# ]
# ///
```

**Why:** `sentencepiece` and `protobuf` are critical for tokenizer conversion. Missing them causes silent failures.

### 4. ✅ Verify Names Before Use

**Always verify repos exist:**

```python
# Before submitting job, verify:
hub_repo_details([ADAPTER_MODEL], repo_type="model")
hub_repo_details([BASE_MODEL], repo_type="model")
```

**Why:** Non-existent dataset/model names cause job failures that could be caught in seconds.

## Complete Conversion Script

See `scripts/convert_to_gguf.py` for the complete, production-ready script.

**Key features:**
- ✅ All dependencies in PEP 723 header
- ✅ Build tools installed automatically
- ✅ CMake build process (reliable)
- ✅ Comprehensive error handling
- ✅ Environment variable configuration
- ✅ Automatic README generation

## Quick Conversion Job

```python
# Before submitting: VERIFY MODELS EXIST
hub_repo_details(["username/my-finetuned-model"], repo_type="model")
hub_repo_details(["Qwen/Qwen2.5-0.5B"], repo_type="model")

# Submit conversion job
hf_jobs("uv", {
    "script": open("trl/scripts/convert_to_gguf.py").read(),  # Or inline the script
    "flavor": "a10g-large",
    "timeout": "45m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
    "env": {
        "ADAPTER_MODEL": "username/my-finetuned-model",
        "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
        "OUTPUT_REPO": "username/my-model-gguf",
        "HF_USERNAME": "username"  # Optional, for README
    }
})
```

## Conversion Process

The script performs these steps:

1. **Load and Merge** - Load base model and LoRA adapter, merge them
2. **Install Build Tools** - Install gcc, cmake (CRITICAL: before cloning llama.cpp)
3. **Setup llama.cpp** - Clone repo, install Python dependencies
4. **Convert to GGUF** - Create FP16 GGUF using llama.cpp converter
5. **Build Quantize Tool** - Use CMake to build `llama-quantize`
6. **Quantize** - Create Q4_K_M, Q5_K_M, Q8_0 versions
7. **Upload** - Upload all versions + README to Hub

## Quantization Options

Common quantization formats (from smallest to largest):

| Format | Size (0.5B model) | Quality | Use Case |
|--------|-------------------|---------|----------|
| **Q4_K_M** | ~300MB | Good | **Recommended** - best balance of size/quality |
| **Q5_K_M** | ~350MB | Better | Higher quality, slightly larger |
| **Q8_0** | ~500MB | Very High | Near-original quality |
| **F16** | ~1GB | Original | Unquantized 16-bit weights, largest file |

**Recommendation:** Create Q4_K_M, Q5_K_M, and Q8_0 versions to give users options.

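
The sizes in the table are consistent with a model of roughly 0.5B parameters, and they scale about linearly with parameter count. The sketch below shows that arithmetic. The bits-per-weight figures are commonly quoted approximations and are an assumption here, not values taken from the conversion script; real files differ somewhat because some tensors are stored at higher precision. The helper name `estimate_size_mb` is hypothetical.

```python
# Approximate average bits per weight for each format (assumed, rough figures).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_mb(n_params: float, fmt: str) -> float:
    """Return an approximate GGUF file size in megabytes (10**6 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e6


# Example: a hypothetical model with 0.5 billion parameters
n_params = 0.5e9
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{estimate_size_mb(n_params, fmt):.0f} MB")
```

Passing `7e9` instead gives a rough figure for a 7B model, which is useful when choosing a timeout or checking that the output repository upload is feasible.
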
## Hardware Requirements

**For conversion:**
- Small models (<1B): CPU-basic works, but slow
- Medium models (1-7B): a10g-large recommended
- Large models (7B+): a10g-large or a100-large

**Time estimates:**
- 0.5B model: ~15-25 minutes on A10G
- 3B model: ~30-45 minutes on A10G
- 7B model: ~45-60 minutes on A10G

## Using GGUF Models

**GGUF models work on both CPU and GPU.** They're optimized for CPU inference but can also leverage GPU acceleration when available.

### With Ollama (auto-detects GPU)

```bash
# Download GGUF into the current directory
hf download username/my-model-gguf model-q4_k_m.gguf --local-dir .

# Create Modelfile
echo "FROM ./model-q4_k_m.gguf" > Modelfile

# Create and run (uses GPU automatically if available)
ollama create my-model -f Modelfile
ollama run my-model
```

### With llama.cpp

```bash
# CPU only
./llama-cli -m model-q4_k_m.gguf -p "Your prompt"

# With GPU acceleration (offload 32 layers to GPU)
./llama-cli -m model-q4_k_m.gguf -ngl 32 -p "Your prompt"
```

### With LM Studio

1. Download the `.gguf` file
2. Import into LM Studio
3. Start chatting

## Best Practices

### ✅ DO:

1. **Verify repos exist** before submitting jobs (use `hub_repo_details`)
2. **Install build tools FIRST** before cloning llama.cpp
3. **Use CMake** for building quantize tool (not make)
4. **Include all dependencies** in PEP 723 header (especially sentencepiece, protobuf)
5. **Create multiple quantizations** - Give users choice
6. **Test on known models** before production use
7. **Use A10G GPU** for faster conversion

### ❌ DON'T:

1. **Assume repos exist** - Always verify with hub tools
2. **Use make** instead of CMake - Less reliable
3. **Remove dependencies** to "simplify" - They're all needed
4. **Skip build tools** - Quantization will fail silently
5. **Use default paths** - CMake puts binaries in `build/bin/`

## Common Issues

### Out of memory during merge

**Fix:**
- Use larger GPU (a10g-large or a100-large)
- Ensure `device_map="auto"` for automatic placement
- Use `dtype=torch.float16` or `torch.bfloat16` (the argument is named `torch_dtype` in older transformers releases)

### Conversion fails with architecture error

**Fix:**
- Ensure llama.cpp supports the model architecture
- Check for standard architecture (Qwen, Llama, Mistral, etc.)
- Update llama.cpp to latest: `git clone --depth 1 https://github.com/ggerganov/llama.cpp.git`
- Check llama.cpp documentation for model support

### Quantization fails

**Fix:**
- Verify build tools installed: `apt-get install build-essential cmake`
- Use CMake (not make) to build quantize tool
- Check binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
- Verify FP16 GGUF exists before quantizing

### Missing sentencepiece error

**Fix:**
- Add to PEP 723 header: `"sentencepiece>=0.1.99", "protobuf>=3.20.0"`
- Don't remove dependencies to "simplify" - all are required

### Upload fails or times out

**Fix:**
- Large models (>2GB) need longer timeout: `"timeout": "1h"`
- Upload quantized versions separately if needed
- Check network/Hub status

## Lessons Learned

These are from production testing and real failures:

### 1. Always Verify Before Use

**Lesson:** Don't assume repos/datasets exist. Check first.

```python
# BEFORE submitting job
hub_repo_details(["trl-lib/argilla-dpo-mix-7k"], repo_type="dataset")  # Would catch error
```

**Prevented failures:** Non-existent dataset names, typos in model names

### 2. Prioritize Reliability Over Performance

**Lesson:** Default to what's most likely to succeed.

- Use CMake (not make) - more reliable
- Disable CUDA in build - faster, not needed
- Include all dependencies - don't "simplify"

**Prevented failures:** Build failures, missing binaries

### 3. Create Atomic, Self-Contained Scripts

**Lesson:** Don't remove dependencies or steps. Scripts should work as a unit.

- All dependencies in PEP 723 header
- All build steps included
- Clear error messages

**Prevented failures:** Missing tokenizer libraries, build tool failures

## References

**In this skill:**
- `scripts/convert_to_gguf.py` - Complete, production-ready script

**External:**
- [llama.cpp Repository](https://github.com/ggerganov/llama.cpp)
- [GGUF Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- [Ollama Documentation](https://ollama.ai)
- [LM Studio](https://lmstudio.ai)

## Summary

**Critical checklist for GGUF conversion:**
- [ ] Verify adapter and base models exist on Hub
- [ ] Use production script from `scripts/convert_to_gguf.py`
- [ ] All dependencies in PEP 723 header (including sentencepiece, protobuf)
- [ ] Build tools installed before cloning llama.cpp
- [ ] CMake used for building quantize tool (not make)
- [ ] Correct binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
- [ ] A10G GPU selected for reasonable conversion time
- [ ] Timeout set to 45m minimum
- [ ] HF_TOKEN in secrets for Hub upload

**The script in `scripts/convert_to_gguf.py` incorporates all these lessons and has been tested successfully in production.**
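
A few of the checklist items above can be checked mechanically inside the job before the expensive steps run. The following is a minimal sketch and is not taken from `scripts/convert_to_gguf.py`; the `preflight` helper, its parameters, and its messages are hypothetical names introduced for illustration. It uses only the Python standard library.

```python
import os
import shutil


def preflight(quantize_bin: str = "/tmp/llama.cpp/build/bin/llama-quantize",
              require_binary: bool = False) -> list:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # Build tools must be on PATH before the CMake build step
    for tool in ("gcc", "cmake"):
        if shutil.which(tool) is None:
            problems.append(f"build tool not found on PATH: {tool}")

    # The Hub upload needs a token passed in through job secrets
    if not os.environ.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")

    # After the build step, the quantize binary should exist at the CMake output path
    if require_binary and not os.path.isfile(quantize_bin):
        problems.append(f"quantize binary missing: {quantize_bin}")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("PREFLIGHT:", problem)
```

Calling `preflight()` at the top of the script and `preflight(require_binary=True)` right after the build step would surface a missing toolchain or token as an explicit message instead of a later, less obvious failure.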