Troubleshooting Guide
Common issues and solutions for Hugging Face Jobs.
Authentication Issues
Error: 401 Unauthorized
Symptoms:
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
Causes:
- Token missing from job
- Token invalid or expired
- Token not passed correctly
Solutions:
- Add token to secrets: the hf_jobs MCP tool uses "$HF_TOKEN" (auto-replaced); HfApi().run_uv_job() MUST use get_token() from huggingface_hub (the literal string "$HF_TOKEN" will NOT work with the Python API)
- Verify hf_whoami() works locally
- Re-login: hf auth login
- Check token hasn't expired
Verification:
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
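The bare assert above is skipped when Python runs with -O, and its message does not point at a fix. A slightly more defensive variant might look like the sketch below; require_hf_token is a hypothetical helper name, not part of huggingface_hub.

```python
import os


def require_hf_token(environ=None):
    """Return the HF_TOKEN value or raise with a hint about the likely cause."""
    # `environ` is injectable so the check can be exercised without touching os.environ.
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is missing or empty; pass it through the job's "
            "`secrets` (not `env`) and check that the key is exactly HF_TOKEN."
        )
    return token
```

Calling require_hf_token() at the top of the script makes a missing token fail early, before any Hub request is attempted.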
Error: 403 Forbidden
Symptoms:
403 Client Error: Forbidden for url: https://huggingface.co/api/...
Causes:
- Token lacks required permissions
- No access to private repository
- Organization permissions insufficient
Solutions:
- Ensure token has write permissions
- Check token type at https://huggingface.co/settings/tokens
- Verify access to target repository
- Use organization token if needed
Error: Token not found in environment
Symptoms:
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
Causes:
- secrets not passed in job config
- Wrong key name (should be HF_TOKEN)
- Using env instead of secrets
Solutions:
- Use secrets (not env): with the hf_jobs MCP tool, pass "$HF_TOKEN"; with HfApi().run_uv_job(), pass get_token()
- Verify key name is exactly HF_TOKEN
- Check job config syntax
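For the Python API path, the rule above (a real token, never the "$HF_TOKEN" placeholder) can be enforced before submission. The following is a minimal sketch; build_job_secrets is a hypothetical helper, and the commented usage assumes run_uv_job accepts a secrets argument as described in this guide.

```python
def build_job_secrets(token):
    """Build the `secrets` mapping for a job submitted through the Python API."""
    if not token:
        raise ValueError("No token available; run `hf auth login` first.")
    if token.strip() == "$HF_TOKEN":
        # The placeholder is only substituted by the hf_jobs MCP tool.
        raise ValueError('Pass the real token from get_token(), not "$HF_TOKEN".')
    return {"HF_TOKEN": token}


# Intended use (requires huggingface_hub and a prior login):
#   from huggingface_hub import get_token, run_uv_job
#   job = run_uv_job("script.py", secrets=build_job_secrets(get_token()))
```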
Job Execution Issues
Error: Job Timeout
Symptoms:
- Job stops unexpectedly
- Status shows "TIMEOUT"
- Partial results only
Causes:
- Default 30min timeout exceeded
- Job takes longer than expected
- No timeout specified
Solutions:
- Check logs for actual runtime
- Increase timeout with buffer: "timeout": "3h"
- Optimize code for faster execution
- Process data in chunks
- Add 20-30% buffer to estimated time
MCP Tool Example:
hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})
Python API Example:
from huggingface_hub import run_uv_job, inspect_job, fetch_job_logs

job = run_uv_job("script.py", timeout="4h")

# Check if job failed
job_info = inspect_job(job_id=job.id)
if job_info.status.stage == "ERROR":
    print(f"Job failed: {job_info.status.message}")
    # Check logs for details
    for log in fetch_job_logs(job_id=job.id):
        print(log)
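The 20-30% buffer recommended in the solutions list is simple arithmetic, sketched below. timeout_with_buffer is a hypothetical helper, and the minutes suffix ("150m") is an assumption about the accepted timeout format; this guide itself only shows hour values such as "2h".

```python
import math


def timeout_with_buffer(estimated_minutes, buffer_percent=25):
    """Return a timeout string that pads the estimate by `buffer_percent`."""
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    padded = math.ceil(estimated_minutes * (100 + buffer_percent) / 100)
    # Assumption: the timeout field accepts a minutes suffix such as "150m".
    return f"{padded}m"


print(timeout_with_buffer(120))  # 120 min estimate + 25% buffer -> 150m
```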
Error: Out of Memory (OOM)
Symptoms:
RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array
Causes:
- Batch size too large
- Model too large for hardware
- Insufficient GPU memory
Solutions:
- Reduce batch size
- Process data in smaller chunks
- Upgrade hardware: cpu → t4 → a10g → a100
- Use smaller models or quantization
- Enable gradient checkpointing (for training)
Example:
# Reduce batch size
batch_size = 1

# Process in chunks
for chunk in chunks:
    process(chunk)
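The snippet above leaves chunks and process undefined. One way to fill them in is sketched here, with iter_chunks as a hypothetical helper and process as a stand-in for the real per-chunk work.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of `items` with at most `chunk_size` elements."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def process(chunk):
    # Placeholder for the real per-chunk work (e.g. a model forward pass).
    return sum(chunk)


results = [process(chunk) for chunk in iter_chunks(list(range(10)), chunk_size=4)]
print(results)  # [6, 22, 17]
```

Only one chunk is held by the processing step at a time, so peak memory is bounded by chunk_size rather than by the full dataset.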
Error: Missing Dependencies
Symptoms:
ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'
Causes:
- Package not in dependencies
- Wrong package name
- Version mismatch
Solutions:
- Add to PEP 723 header:
# /// script
# dependencies = ["package-name>=1.0.0"]
# ///
- Check package name spelling
- Specify version if needed
- Check package availability
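When the script is submitted inline, the PEP 723 header can be generated rather than typed by hand, which avoids formatting slips. A minimal sketch follows; with_pep723_header is a hypothetical helper and the requests dependency is only an example.

```python
import json


def with_pep723_header(code, dependencies):
    """Prepend a PEP 723 `script` metadata block listing `dependencies`."""
    # json.dumps yields a double-quoted list, which is also valid TOML for plain strings.
    header = "\n".join([
        "# /// script",
        f"# dependencies = {json.dumps(list(dependencies))}",
        "# ///",
    ])
    return f"{header}\n\n{code}"


script = with_pep723_header(
    "import requests\nprint(requests.__version__)\n",
    ["requests>=2.31"],
)
print(script)
```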
Error: Script Not Found
Symptoms:
FileNotFoundError: script.py not found
Causes:
- Local file path used (not supported)
- URL incorrect
- Script not accessible
Solutions:
- Use inline script (recommended)
- Use publicly accessible URL
- Upload script to Hub first
- Check URL is correct
Correct approaches:
# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})
# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
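Because a local path is not supported, a quick check before submission can catch the mistake. The sketch below is a heuristic only; classify_script_argument is a hypothetical helper and does not reflect how the hf_jobs tool itself inspects the value.

```python
def classify_script_argument(script):
    """Guess whether `script` is a URL, inline source, or a local path."""
    value = script.strip()
    if value.startswith(("http://", "https://")):
        return "url"
    if "\n" not in value and value.endswith(".py"):
        # A bare file name cannot be read from inside the remote job.
        return "local_path"
    return "inline"


for candidate in ("script.py", "https://huggingface.co/user/repo/resolve/main/script.py"):
    print(candidate, "->", classify_script_argument(candidate))
```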
Hub Push Issues
Error: Push Failed
Symptoms:
Error pushing to Hub
Upload failed
Causes:
- Network issues
- Token missing or invalid
- Repository access denied
- File too large
Solutions:
- Check token: assert "HF_TOKEN" in os.environ
- Verify repository exists or can be created
- Check network connectivity in logs
- Retry push operation
- Split large files into chunks
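The "retry push operation" step can be wrapped in a small helper with exponential backoff. This is a generic sketch rather than a huggingface_hub feature; retry is a hypothetical name, and the commented usage assumes dataset is a datasets.Dataset as in the push example elsewhere in this guide.

```python
import time


def retry(operation, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                raise  # Surface the final failure so the job is marked as failed.
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Intended use inside a job script:
#   retry(lambda: dataset.push_to_hub("username/dataset"))
```

Retrying helps with transient network errors only; a 401 or 403 will fail the same way on every attempt and should be fixed at the token level.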
Error: Repository Not Found
Symptoms:
404 Client Error: Not Found
Repository not found
Causes:
- Repository doesn't exist
- Wrong repository name
- No access to private repo
Solutions:
- Create repository first:
from huggingface_hub import HfApi
api = HfApi()
api.create_repo("username/repo-name", repo_type="dataset")
- Check repository name format
- Verify namespace exists
- Check repository visibility
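A malformed repository name is easy to catch before any Hub call. The sketch below checks only the "namespace/repo-name" shape used in this guide; split_repo_id is a hypothetical helper, and the Hub applies further naming rules that are not checked here.

```python
def split_repo_id(repo_id):
    """Split "namespace/name" and reject obviously malformed ids."""
    parts = repo_id.strip().split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected 'namespace/repo-name', got {repo_id!r}")
    namespace, name = parts
    return namespace, name


print(split_repo_id("username/repo-name"))  # ('username', 'repo-name')
```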
Error: Results Not Saved
Symptoms:
- Job completes successfully
- No results visible on Hub
- Files not persisted
Causes:
- No persistence code in script
- Push code not executed
- Push failed silently
Solutions:
- Add persistence code to script
- Verify push executes successfully
- Check logs for push errors
- Add error handling around push
Example:
try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise
Hardware Issues
Error: GPU Not Available
Symptoms:
CUDA not available
No GPU found
Causes:
- CPU flavor used instead of GPU
- GPU not requested
- CUDA not installed in image
Solutions:
- Use GPU flavor: "flavor": "a10g-large"
- Check image has CUDA support
- Verify GPU availability in logs
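To make GPU availability visible in the logs, the script can print a one-line summary at startup. The following sketch assumes PyTorch is the framework in use; describe_accelerator is a hypothetical helper that also handles the case where torch is not installed.

```python
import importlib.util


def describe_accelerator():
    """Return a one-line description of the GPU situation for the job log."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed; add it to the script dependencies"
    import torch

    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "CUDA not available; check the job flavor and image"


print(describe_accelerator())
```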
Error: Slow Performance
Symptoms:
- Job takes longer than expected
- Low GPU utilization
- CPU bottleneck
Causes:
- Wrong hardware selected
- Inefficient code
- Data loading bottleneck
Solutions:
- Upgrade hardware
- Optimize code
- Use batch processing
- Profile code to find bottlenecks
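A lightweight way to find the bottleneck is to time each stage and print the totals at the end of the job. The sketch below uses only the standard library; timed is a hypothetical helper and the two stages are stand-ins for real loading and compute steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


timings = {}
with timed("load", timings):
    data = list(range(100_000))  # stand-in for data loading
with timed("compute", timings):
    total = sum(x * x for x in data)  # stand-in for model work

# Slowest stage first
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds:.4f}s")
```

If loading dominates, a larger GPU is unlikely to help; if compute dominates on a CPU flavor, a hardware upgrade is the more promising fix.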
General Issues
Error: Job Status Unknown
Symptoms:
- Can't check job status
- Status API returns error
Solutions:
- Use job URL: https://huggingface.co/jobs/username/job-id
- Check logs: hf_jobs("logs", {"job_id": "..."})
- Inspect job: hf_jobs("inspect", {"job_id": "..."})
Error: Logs Not Available
Symptoms:
- No logs visible
- Logs delayed
Causes:
- Job just started (logs delayed 30-60s)
- Job failed before logging
- Logs not yet generated
Solutions:
- Wait 30-60 seconds after job start
- Check job status first
- Use job URL for web interface
Error: Cost Unexpectedly High
Symptoms:
- Job costs more than expected
- Longer runtime than estimated
Causes:
- Timeout missing or too generous, so a slow job kept running
- Wrong hardware selected
- Inefficient code
Solutions:
- Monitor job runtime
- Set appropriate timeout
- Optimize code
- Choose right hardware
- Check cost estimates before running
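Since the timeout caps the runtime, it also caps the spend. A rough upper bound can be computed as below, assuming cost scales linearly with runtime; max_cost is a hypothetical helper and the hourly rate is a made-up placeholder, so substitute the current price of the chosen flavor.

```python
def max_cost(hourly_rate_usd, timeout_hours):
    """Upper bound on spend, assuming cost scales linearly with runtime."""
    if hourly_rate_usd < 0 or timeout_hours < 0:
        raise ValueError("rate and timeout must be non-negative")
    return round(hourly_rate_usd * timeout_hours, 2)


# The rate below is a made-up placeholder, not a real price.
print(max_cost(hourly_rate_usd=1.50, timeout_hours=3))  # 4.5
```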
Debugging Tips
1. Add Logging
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting processing...")
count = 0  # placeholder: replace with the number of items actually processed
logger.info("Processed %d items", count)
2. Verify Environment
import os
import sys
import torch  # assumes torch is listed in the script's dependencies

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
3. Test Locally First
Run script locally before submitting to catch errors early:
python script.py
# Or with uv
uv run script.py
4. Check Job Logs
MCP Tool:
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
CLI:
hf jobs logs <job-id>
Python API:
from huggingface_hub import fetch_job_logs

for log in fetch_job_logs(job_id="your-job-id"):
    print(log)
Or use job URL: https://huggingface.co/jobs/username/job-id
5. Add Error Handling
import traceback

try:
    # Your code
    process_data()
except Exception as e:
    print(f"Error: {e}")
    traceback.print_exc()
    raise
6. Check Job Status Programmatically
from huggingface_hub import inspect_job, fetch_job_logs

job_info = inspect_job(job_id="your-job-id")
print(f"Status: {job_info.status.stage}")
print(f"Message: {job_info.status.message}")

if job_info.status.stage == "ERROR":
    print("Job failed! Logs:")
    for log in fetch_job_logs(job_id="your-job-id"):
        print(log)
Quick Reference
Common Error Codes
| Code | Meaning | Solution |
|---|---|---|
| 401 | Unauthorized | Add token to secrets: MCP uses "$HF_TOKEN", Python API uses get_token() |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |
Checklist Before Submitting
- Token configured: MCP uses secrets={"HF_TOKEN": "$HF_TOKEN"}, Python API uses secrets={"HF_TOKEN": get_token()}
- Script checks for token: assert "HF_TOKEN" in os.environ
- Timeout set appropriately
- Hardware selected correctly
- Dependencies listed in PEP 723 header
- Persistence code included
- Error handling added
- Logging added for debugging
Getting Help
If issues persist:
- Check logs - Most errors include detailed messages
- Review documentation - See main SKILL.md
- Check Hub status - https://status.huggingface.co
- Community forums - https://discuss.huggingface.co
- GitHub issues - For bugs in huggingface_hub
Key Takeaways
- Always include token - MCP: secrets={"HF_TOKEN": "$HF_TOKEN"}, Python API: secrets={"HF_TOKEN": get_token()}
- Set appropriate timeout - Default 30min may be insufficient
- Verify persistence - Results won't persist without code
- Check logs - Most issues visible in job logs
- Test locally - Catch errors before submitting
- Add error handling - Better debugging information
- Monitor costs - Set timeouts to avoid unexpected charges