# Troubleshooting Guide

Common issues and solutions for Hugging Face Jobs.

## Authentication Issues

### Error: 401 Unauthorized

**Symptoms:**
```
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
```

**Causes:**
- Token missing from job
- Token invalid or expired
- Token not passed correctly

**Solutions:**
1. Add the token to `secrets`:
   - `hf_jobs` MCP: use `"$HF_TOKEN"` (auto-replaced)
   - `HfApi().run_uv_job()`: MUST use `get_token()` from `huggingface_hub` (the literal string `"$HF_TOKEN"` will NOT work with the Python API)
2. Verify `hf_whoami()` works locally
3. Re-login: `hf auth login`
4. Check that the token hasn't expired

**Verification:**
```python
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
```

### Error: 403 Forbidden

**Symptoms:**
```
403 Client Error: Forbidden for url: https://huggingface.co/api/...
```

**Causes:**
- Token lacks required permissions
- No access to private repository
- Organization permissions insufficient

**Solutions:**
1. Ensure token has write permissions
2. Check token type at https://huggingface.co/settings/tokens
3. Verify access to target repository
4. Use organization token if needed

### Error: Token not found in environment

**Symptoms:**
```
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
```

**Causes:**
- `secrets` not passed in job config
- Wrong key name (should be `HF_TOKEN`)
- Using `env` instead of `secrets`

**Solutions:**
1. Use `secrets` (not `env`): with `hf_jobs` MCP, pass `"$HF_TOKEN"`; with `HfApi().run_uv_job()`, pass `get_token()`
2. Verify key name is exactly `HF_TOKEN`
3. Check job config syntax

## Job Execution Issues

### Error: Job Timeout

**Symptoms:**
- Job stops unexpectedly
- Status shows "TIMEOUT"
- Partial results only

**Causes:**
- Default 30-minute timeout exceeded
- Job takes longer than expected
- No timeout specified

**Solutions:**
1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks
5. Add 20-30% buffer to estimated time

**MCP Tool Example:**
```python
hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})
```

**Python API Example:**
```python
from huggingface_hub import run_uv_job, inspect_job, fetch_job_logs

job = run_uv_job("script.py", timeout="4h")

# Check if job failed
job_info = inspect_job(job_id=job.id)
if job_info.status.stage == "ERROR":
    print(f"Job failed: {job_info.status.message}")
    # Check logs for details
    for log in fetch_job_logs(job_id=job.id):
        print(log)
```

### Error: Out of Memory (OOM)

**Symptoms:**
```
RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array
```

**Causes:**
- Batch size too large
- Model too large for hardware
- Insufficient GPU memory

**Solutions:**
1. Reduce batch size
2. Process data in smaller chunks
3. Upgrade hardware: cpu → t4 → a10g → a100
4. Use smaller models or quantization
5. Enable gradient checkpointing (for training)

**Example:**
```python
# Reduce batch size
batch_size = 1

# Process in chunks
# (`chunks` and `process` are placeholders for your own data iterator and handler)
for chunk in chunks:
    process(chunk)
```

### Error: Missing Dependencies

**Symptoms:**
```
ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'
```

**Causes:**
- Package not in dependencies
- Wrong package name
- Version mismatch

**Solutions:**
1. Add to PEP 723 header:
   ```python
   # /// script
   # dependencies = ["package-name>=1.0.0"]
   # ///
   ```
2. Check package name spelling
3. Specify version if needed
4. Check package availability

### Error: Script Not Found

**Symptoms:**
```
FileNotFoundError: script.py not found
```

**Causes:**
- Local file path used (not supported)
- URL incorrect
- Script not accessible

**Solutions:**
1. Use inline script (recommended)
2. Use publicly accessible URL
3. Upload script to Hub first
4. Check URL is correct

**Correct approaches:**
```python
# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n"})

# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
```

## Hub Push Issues

### Error: Push Failed

**Symptoms:**
```
Error pushing to Hub
Upload failed
```

**Causes:**
- Network issues
- Token missing or invalid
- Repository access denied
- File too large

**Solutions:**
1. Check token: `assert "HF_TOKEN" in os.environ`
2. Verify repository exists or can be created
3. Check network connectivity in logs
4. Retry push operation
5. Split large files into chunks

### Error: Repository Not Found

**Symptoms:**
```
404 Client Error: Not Found
Repository not found
```

**Causes:**
- Repository doesn't exist
- Wrong repository name
- No access to private repo

**Solutions:**
1. Create repository first:
   ```python
   from huggingface_hub import HfApi
   api = HfApi()
   api.create_repo("username/repo-name", repo_type="dataset")
   ```
2. Check repository name format
3. Verify namespace exists
4. Check repository visibility

### Error: Results Not Saved

**Symptoms:**
- Job completes successfully
- No results visible on Hub
- Files not persisted

**Causes:**
- No persistence code in script
- Push code not executed
- Push failed silently

**Solutions:**
1. Add persistence code to script
2. Verify push executes successfully
3. Check logs for push errors
4. Add error handling around push

**Example:**
```python
try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise
```

## Hardware Issues

### Error: GPU Not Available

**Symptoms:**
```
CUDA not available
No GPU found
```

**Causes:**
- CPU flavor used instead of GPU
- GPU not requested
- CUDA not installed in image

**Solutions:**
1. Use GPU flavor: `"flavor": "a10g-large"`
2. Check image has CUDA support
3. Verify GPU availability in logs

### Error: Slow Performance

**Symptoms:**
- Job takes longer than expected
- Low GPU utilization
- CPU bottleneck

**Causes:**
- Wrong hardware selected
- Inefficient code
- Data loading bottleneck

**Solutions:**
1. Upgrade hardware
2. Optimize code
3. Use batch processing
4. Profile code to find bottlenecks

## General Issues

### Error: Job Status Unknown

**Symptoms:**
- Can't check job status
- Status API returns error

**Solutions:**
1. Use job URL: `https://huggingface.co/jobs/username/job-id`
2. Check logs: `hf_jobs("logs", {"job_id": "..."})`
3. Inspect job: `hf_jobs("inspect", {"job_id": "..."})`

### Error: Logs Not Available

**Symptoms:**
- No logs visible
- Logs delayed

**Causes:**
- Job just started (logs delayed 30-60s)
- Job failed before logging
- Logs not yet generated

**Solutions:**
1. Wait 30-60 seconds after job start
2. Check job status first
3. Use job URL for web interface

### Error: Cost Unexpectedly High

**Symptoms:**
- Job costs more than expected
- Longer runtime than estimated

**Causes:**
- Timeout missing or too generous, so the job ran longer than expected
- Wrong hardware selected
- Inefficient code

**Solutions:**
1. Monitor job runtime
2. Set appropriate timeout
3. Optimize code
4. Choose right hardware
5. Check cost estimates before running

## Debugging Tips

### 1. Add Logging

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting processing...")
count = 0  # placeholder: increment as your script processes items
logger.info(f"Processed {count} items")
```

### 2. Verify Environment

```python
import os
import sys

import torch  # must be listed in the script's dependencies

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
```

### 3. Test Locally First

Run the script locally before submitting to catch errors early:
```bash
python script.py
# Or with uv
uv run script.py
```

### 4. Check Job Logs

**MCP Tool:**
```python
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
```

**CLI:**
```bash
hf jobs logs your-job-id
```

**Python API:**
```python
from huggingface_hub import fetch_job_logs

for log in fetch_job_logs(job_id="your-job-id"):
    print(log)
```

**Or use job URL:** `https://huggingface.co/jobs/username/job-id`

### 5. Add Error Handling

```python
import traceback

try:
    # Your code
    process_data()
except Exception as e:
    print(f"Error: {e}")
    traceback.print_exc()
    raise
```

### 6. Check Job Status Programmatically

```python
from huggingface_hub import inspect_job, fetch_job_logs

job_info = inspect_job(job_id="your-job-id")
print(f"Status: {job_info.status.stage}")
print(f"Message: {job_info.status.message}")

if job_info.status.stage == "ERROR":
    print("Job failed! Logs:")
    for log in fetch_job_logs(job_id="your-job-id"):
        print(log)
```

## Quick Reference

### Common Error Codes

| Code | Meaning | Solution |
|------|---------|----------|
| 401 | Unauthorized | Add token to secrets: MCP uses `"$HF_TOKEN"`, Python API uses `get_token()` |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |

### Checklist Before Submitting

- [ ] Token configured: MCP uses `secrets={"HF_TOKEN": "$HF_TOKEN"}`, Python API uses `secrets={"HF_TOKEN": get_token()}`
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Timeout set appropriately
- [ ] Hardware selected correctly
- [ ] Dependencies listed in PEP 723 header
- [ ] Persistence code included
- [ ] Error handling added
- [ ] Logging added for debugging

## Getting Help

If issues persist:

1. **Check logs** - Most errors include detailed messages
2. **Review documentation** - See main SKILL.md
3. **Check Hub status** - https://status.huggingface.co
4. **Community forums** - https://discuss.huggingface.co
5. **GitHub issues** - For bugs in huggingface_hub

## Key Takeaways

1. **Always include token** - MCP: `secrets={"HF_TOKEN": "$HF_TOKEN"}`, Python API: `secrets={"HF_TOKEN": get_token()}`
2. **Set appropriate timeout** - Default 30min may be insufficient
3. **Verify persistence** - Results won't persist without code
4. **Check logs** - Most issues visible in job logs
5. **Test locally** - Catch errors before submitting
6. **Add error handling** - Better debugging information
7. **Monitor costs** - Set timeouts to avoid unexpected charges
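
As a rough illustration of the timeout takeaways, the 20-30% buffer recommended in the Job Timeout section can be computed before submitting a job. This is a minimal sketch, not part of `huggingface_hub`: `timeout_with_buffer` is a hypothetical helper, and rounding up to whole hours is an assumption made so the result matches the `"2h"`-style strings used in this guide.

```python
import math


def timeout_with_buffer(estimated_minutes: float, buffer: float = 0.3) -> str:
    """Return an hour-based timeout string padded by `buffer` (0.3 means 30%).

    Hypothetical helper for illustration only.
    """
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    if buffer < 0:
        raise ValueError("buffer must be non-negative")

    # Pad the estimate, then round up to whole hours (assumed granularity).
    padded_minutes = estimated_minutes * (1 + buffer)
    hours = max(1, math.ceil(padded_minutes / 60))
    return f"{hours}h"


print(timeout_with_buffer(90))   # 90 min padded to 117 min -> "2h"
print(timeout_with_buffer(150))  # 150 min padded to 195 min -> "4h"
```

The returned string can then be passed as the `timeout` value in either the MCP or the Python API examples. Rounding up trades a slightly higher cost ceiling for a lower risk of a job being cut off just before it finishes.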