Troubleshooting Guide

Common issues and solutions for Hugging Face Jobs.

Authentication Issues

Error: 401 Unauthorized

Symptoms:

401 Client Error: Unauthorized for url: https://huggingface.co/api/...

Causes:

  • Token missing from job
  • Token invalid or expired
  • Token not passed correctly

Solutions:

  1. Add the token to secrets. With the hf_jobs MCP tool, pass the placeholder "$HF_TOKEN" (it is replaced automatically). With HfApi().run_uv_job(), pass get_token() from huggingface_hub instead; the literal string "$HF_TOKEN" does NOT work with the Python API
  2. Verify hf_whoami() works locally
  3. Re-login: hf auth login
  4. Check token hasn't expired

Verification:

# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
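
On the submitting side, solution 1 can be sketched as follows. This is a minimal sketch, assuming a recent huggingface_hub that exposes run_uv_job and get_token; build_secrets and submit are hypothetical helper names, and the huggingface_hub import is deferred so the validation logic can be exercised without network access.

```python
def build_secrets(token):
    """Build the secrets mapping for a job submitted through the Python API."""
    if not token:
        raise ValueError("No token found; run `hf auth login` first")
    if token == "$HF_TOKEN":
        # The placeholder is only substituted by the hf_jobs MCP tool.
        raise ValueError('Pass a real token (e.g. get_token()), not "$HF_TOKEN"')
    return {"HF_TOKEN": token}


def submit(script, timeout="30m"):
    """Submit a UV script with the locally stored token as a secret."""
    # Imported lazily: only needed when a job is actually submitted.
    from huggingface_hub import get_token, run_uv_job

    return run_uv_job(script, secrets=build_secrets(get_token()), timeout=timeout)
```

Calling submit("script.py") would then fail early, before any job is created, if no token is stored locally.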

Error: 403 Forbidden

Symptoms:

403 Client Error: Forbidden for url: https://huggingface.co/api/...

Causes:

  • Token lacks required permissions
  • No access to private repository
  • Organization permissions insufficient

Solutions:

  1. Ensure token has write permissions
  2. Check token type at https://huggingface.co/settings/tokens
  3. Verify access to target repository
  4. Use organization token if needed

Error: Token not found in environment

Symptoms:

KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found

Causes:

  • secrets not passed in job config
  • Wrong key name (should be HF_TOKEN)
  • Using env instead of secrets

Solutions:

  1. Use secrets (not env): with the hf_jobs MCP tool pass "$HF_TOKEN"; with HfApi().run_uv_job() pass get_token()
  2. Verify key name is exactly HF_TOKEN
  3. Check job config syntax
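
Inside the job script, a bare KeyError gives little guidance. The sketch below shows one way to fail with a message that points at the likely cause; require_token is a hypothetical helper name, and the optional environ argument exists only so the function can be tested with a plain dict.

```python
import os


def require_token(environ=None):
    """Return HF_TOKEN from the environment or fail with an actionable message."""
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN")
    if not token:
        raise ValueError(
            "HF_TOKEN not found. Pass it via `secrets` (not `env`) "
            "and check that the key is spelled exactly HF_TOKEN."
        )
    return token
```

Call require_token() at the top of the script so a misconfigured job stops before doing any expensive work.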

Job Execution Issues

Error: Job Timeout

Symptoms:

  • Job stops unexpectedly
  • Status shows "TIMEOUT"
  • Partial results only

Causes:

  • Default 30min timeout exceeded
  • Job takes longer than expected
  • No timeout specified

Solutions:

  1. Check logs for actual runtime
  2. Increase timeout with buffer: "timeout": "3h"
  3. Optimize code for faster execution
  4. Process data in chunks
  5. Add 20-30% buffer to estimated time

MCP Tool Example:

hf_jobs("uv", {
    "script": "...",
    "timeout": "2h"  # Set appropriate timeout
})

Python API Example:

from huggingface_hub import run_uv_job, inspect_job, fetch_job_logs

job = run_uv_job("script.py", timeout="4h")

# Check whether the job failed (run this after the job has had time to finish)
job_info = inspect_job(job_id=job.id)
if job_info.status.stage == "ERROR":
    print(f"Job failed: {job_info.status.message}")
    # Check logs for details
    for log in fetch_job_logs(job_id=job.id):
        print(log)
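
The 20-30% buffer from the solutions above is simple arithmetic. The sketch below is one way to compute it, assuming the timeout field accepts minute-suffixed strings such as "78m" in the same way as the "2h" form used above; timeout_with_buffer is a hypothetical helper name.

```python
def timeout_with_buffer(estimated_minutes, buffer_percent=30):
    """Return a timeout string such as "78m" that includes a safety margin.

    Both arguments are expected to be integers.
    """
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    total = estimated_minutes * (100 + buffer_percent)
    minutes = -(-total // 100)  # integer ceiling division
    return f"{minutes}m"
```

For an estimated 60-minute run, timeout_with_buffer(60) yields "78m", which can be passed as the timeout value.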

Error: Out of Memory (OOM)

Symptoms:

RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array

Causes:

  • Batch size too large
  • Model too large for hardware
  • Insufficient GPU memory

Solutions:

  1. Reduce batch size
  2. Process data in smaller chunks
  3. Upgrade hardware: cpu → t4 → a10g → a100
  4. Use smaller models or quantization
  5. Enable gradient checkpointing (for training)

Example:

# Reduce batch size
batch_size = 1

# Process in chunks
for chunk in chunks:
    process(chunk)
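
The example above leaves chunks and process undefined. A self-contained version might look like the sketch below; iter_chunks is a hypothetical helper, and process is a stand-in for the real per-batch work (for instance, running a model on one batch).

```python
def iter_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


def process(chunk):
    """Placeholder for real work on one chunk."""
    return [value * 2 for value in chunk]


results = []
for chunk in iter_chunks(range(5), chunk_size=2):
    results.extend(process(chunk))
```

Because iter_chunks is a generator, only one chunk is held in memory at a time, which is the point when the full dataset does not fit.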

Error: Missing Dependencies

Symptoms:

ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'

Causes:

  • Package not in dependencies
  • Wrong package name
  • Version mismatch

Solutions:

  1. Add to PEP 723 header:
    # /// script
    # dependencies = ["package-name>=1.0.0"]
    # ///
    
  2. Check package name spelling
  3. Specify version if needed
  4. Check package availability

Error: Script Not Found

Symptoms:

FileNotFoundError: script.py not found

Causes:

  • Local file path used (not supported)
  • URL incorrect
  • Script not accessible

Solutions:

  1. Use inline script (recommended)
  2. Use publicly accessible URL
  3. Upload script to Hub first
  4. Check URL is correct

Correct approaches:

# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})

# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
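
When the script lives in a local file, one workaround is to read it and pass its contents as inline code. The sketch below illustrates that idea; to_script_payload is a hypothetical helper, and treating any string with a newline as inline code is a heuristic of this sketch, not a rule of the hf_jobs tool.

```python
from pathlib import Path


def to_script_payload(script):
    """Return a value usable as the "script" field: a URL or inline source."""
    if script.startswith(("http://", "https://")):
        return script  # remote URL, passed through unchanged
    if "\n" in script:
        return script  # already inline source code
    path = Path(script)
    if path.is_file():
        return path.read_text(encoding="utf-8")  # inline the local file
    raise FileNotFoundError(
        f"{script!r} is not a URL, inline code, or an existing file"
    )
```

The result can then be used as hf_jobs("uv", {"script": to_script_payload("script.py")}).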

Hub Push Issues

Error: Push Failed

Symptoms:

Error pushing to Hub
Upload failed

Causes:

  • Network issues
  • Token missing or invalid
  • Repository access denied
  • File too large

Solutions:

  1. Check token: assert "HF_TOKEN" in os.environ
  2. Verify repository exists or can be created
  3. Check network connectivity in logs
  4. Retry push operation
  5. Split large files into chunks
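
Retrying the push (solution 4) is easiest with a small wrapper. The following is a generic sketch of retry with exponential backoff, not something provided by huggingface_hub; retry is a hypothetical helper, and the sleep parameter is injectable only to make the function testable.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on any exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)
```

A push would then be wrapped as retry(lambda: dataset.push_to_hub("username/dataset")). Retrying helps with transient network errors only; a 401 or 403 will fail the same way on every attempt.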

Error: Repository Not Found

Symptoms:

404 Client Error: Not Found
Repository not found

Causes:

  • Repository doesn't exist
  • Wrong repository name
  • No access to private repo

Solutions:

  1. Create repository first:
    from huggingface_hub import HfApi
    api = HfApi()
    api.create_repo("username/repo-name", repo_type="dataset")
    
  2. Check repository name format
  3. Verify namespace exists
  4. Check repository visibility
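
Solutions 1 and 2 can be combined into one idempotent step. The sketch below takes the HfApi instance as an argument and relies on the exist_ok parameter of create_repo; ensure_repo is a hypothetical helper, and requiring an explicit namespace/name form is a choice made for this sketch.

```python
def ensure_repo(api, repo_id, repo_type="dataset"):
    """Create repo_id if needed; succeed quietly if it already exists."""
    namespace, _, name = repo_id.partition("/")
    if not namespace or not name:
        raise ValueError(f"Expected 'namespace/name', got {repo_id!r}")
    # exist_ok=True makes the call safe to repeat across job runs.
    return api.create_repo(repo_id, repo_type=repo_type, exist_ok=True)
```

In a job script this would be called as ensure_repo(HfApi(), "username/repo-name") before the first push.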

Error: Results Not Saved

Symptoms:

  • Job completes successfully
  • No results visible on Hub
  • Files not persisted

Causes:

  • No persistence code in script
  • Push code not executed
  • Push failed silently

Solutions:

  1. Add persistence code to script
  2. Verify push executes successfully
  3. Check logs for push errors
  4. Add error handling around push

Example:

try:
    dataset.push_to_hub("username/dataset")
    print("✅ Push successful")
except Exception as e:
    print(f"❌ Push failed: {e}")
    raise

Hardware Issues

Error: GPU Not Available

Symptoms:

CUDA not available
No GPU found

Causes:

  • CPU flavor used instead of GPU
  • GPU not requested
  • CUDA not installed in image

Solutions:

  1. Use GPU flavor: "flavor": "a10g-large"
  2. Check image has CUDA support
  3. Verify GPU availability in logs
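
A GPU check at the very start of the script turns a confusing failure deep in the run into an immediate, readable one. The sketch below assumes PyTorch is the framework in use; cuda_available and require_gpu are hypothetical helper names.

```python
def cuda_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False  # torch missing from the script's dependencies
    return bool(torch.cuda.is_available())


def require_gpu():
    """Stop the job early when no GPU is visible."""
    if not cuda_available():
        raise RuntimeError(
            "No CUDA device visible. Check that a GPU flavor was requested "
            "and that the image ships CUDA-enabled PyTorch."
        )
```

Calling require_gpu() before loading any model keeps a misconfigured job from running on CPU for the whole timeout.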

Error: Slow Performance

Symptoms:

  • Job takes longer than expected
  • Low GPU utilization
  • CPU bottleneck

Causes:

  • Wrong hardware selected
  • Inefficient code
  • Data loading bottleneck

Solutions:

  1. Upgrade hardware
  2. Optimize code
  3. Use batch processing
  4. Profile code to find bottlenecks
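
For solution 4, the standard library profiler is usually enough to find the dominant cost. The sketch below wraps a single call with cProfile and returns a text report; profile_call is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Printing the report from inside the job makes it appear in the job logs, so data loading can be compared against compute without attaching a debugger.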

General Issues

Error: Job Status Unknown

Symptoms:

  • Can't check job status
  • Status API returns error

Solutions:

  1. Use job URL: https://huggingface.co/jobs/username/job-id
  2. Check logs: hf_jobs("logs", {"job_id": "..."})
  3. Inspect job: hf_jobs("inspect", {"job_id": "..."})

Error: Logs Not Available

Symptoms:

  • No logs visible
  • Logs delayed

Causes:

  • Job just started (logs delayed 30-60s)
  • Job failed before logging
  • Logs not yet generated

Solutions:

  1. Wait 30-60 seconds after job start
  2. Check job status first
  3. Use job URL for web interface

Error: Cost Unexpectedly High

Symptoms:

  • Job costs more than expected
  • Longer runtime than estimated

Causes:

  • Job ran longer than estimated (timeout missing or set too high)
  • Wrong hardware selected
  • Inefficient code

Solutions:

  1. Monitor job runtime
  2. Set appropriate timeout
  3. Optimize code
  4. Choose right hardware
  5. Check cost estimates before running
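
A rough upper bound on cost follows from the timeout and the hourly price of the chosen hardware. The sketch below assumes cost scales linearly with runtime and that timeouts use s, m, h or d suffixes; the helper names are hypothetical, and the hourly rate must be taken from the current pricing page, since no real prices are assumed here.

```python
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def timeout_to_seconds(timeout):
    """Convert "90m" or "2h" to seconds; bare numbers are read as seconds."""
    text = str(timeout).strip()
    if text and text[-1] in _UNIT_SECONDS:
        return float(text[:-1]) * _UNIT_SECONDS[text[-1]]
    return float(text)


def max_cost(timeout, hourly_rate):
    """Estimate the cost of a job that runs all the way to its timeout."""
    return timeout_to_seconds(timeout) / 3600 * hourly_rate
```

With a placeholder rate of 1.50 per hour, max_cost("2h", 1.50) gives 3.0, which is the most such a job should cost under these assumptions.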

Debugging Tips

1. Add Logging

import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info("Starting processing...")
count = 0  # placeholder: set this to the number of items your script processed
logger.info(f"Processed {count} items")

2. Verify Environment

import os
import sys
import torch  # requires torch in the script's dependencies

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")

3. Test Locally First

Run script locally before submitting to catch errors early:

python script.py
# Or with uv
uv run script.py

4. Check Job Logs

MCP Tool:

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})

CLI:

hf jobs logs <job-id>

Python API:

from huggingface_hub import fetch_job_logs
for log in fetch_job_logs(job_id="your-job-id"):
    print(log)

Or use job URL: https://huggingface.co/jobs/username/job-id

5. Add Error Handling

try:
    # Your code
    process_data()
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
    raise

6. Check Job Status Programmatically

from huggingface_hub import inspect_job, fetch_job_logs

job_info = inspect_job(job_id="your-job-id")
print(f"Status: {job_info.status.stage}")
print(f"Message: {job_info.status.message}")

if job_info.status.stage == "ERROR":
    print("Job failed! Logs:")
    for log in fetch_job_logs(job_id="your-job-id"):
        print(log)
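
The check above reads the status once. To wait for a job to finish, the same call can be polled in a loop. The sketch below is one way to do that; wait_for_job is a hypothetical helper, and the list of terminal stage names is an assumption that should be checked against the installed huggingface_hub version.

```python
import time

# Assumed terminal stage names; verify against your huggingface_hub version.
TERMINAL_STAGES = ("COMPLETED", "ERROR", "CANCELED", "DELETED")


def wait_for_job(get_stage, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_stage() until it returns a terminal stage, then return it."""
    for _ in range(max_polls):
        stage = get_stage()
        if stage in TERMINAL_STAGES:
            return stage
        sleep(poll_seconds)
    raise TimeoutError(f"Job still not finished after {max_polls} polls")
```

A typical call would be wait_for_job(lambda: inspect_job(job_id="your-job-id").status.stage), followed by fetching the logs if the returned stage is "ERROR".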

Quick Reference

Common Error Codes

| Code | Meaning | Solution |
|------|---------|----------|
| 401 | Unauthorized | Add token to secrets: MCP uses "$HF_TOKEN", Python API uses get_token() |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |

Checklist Before Submitting

  • Token configured: MCP uses secrets={"HF_TOKEN": "$HF_TOKEN"}, Python API uses secrets={"HF_TOKEN": get_token()}
  • Script checks for token: assert "HF_TOKEN" in os.environ
  • Timeout set appropriately
  • Hardware selected correctly
  • Dependencies listed in PEP 723 header
  • Persistence code included
  • Error handling added
  • Logging added for debugging

Getting Help

If issues persist:

  1. Check logs - Most errors include detailed messages
  2. Review documentation - See main SKILL.md
  3. Check Hub status - https://status.huggingface.co
  4. Community forums - https://discuss.huggingface.co
  5. GitHub issues - For bugs in huggingface_hub

Key Takeaways

  1. Always include token - MCP: secrets={"HF_TOKEN": "$HF_TOKEN"}, Python API: secrets={"HF_TOKEN": get_token()}
  2. Set appropriate timeout - Default 30min may be insufficient
  3. Verify persistence - Results won't persist without code
  4. Check logs - Most issues visible in job logs
  5. Test locally - Catch errors before submitting
  6. Add error handling - Better debugging information
  7. Monitor costs - Set timeouts to avoid unexpected charges