Saving Vision Models to Hugging Face Hub

Why Hub Push is Required
Required Configuration (TrainingArguments, job config)
Complete Example
What Gets Saved
Important: Save Image Processor
Checkpoint Saving
Model Card Configuration
Saving Label Mappings
Authentication Methods
Verification Checklist
Repository Setup (automatic/manual creation, naming)
Troubleshooting (401, 403, push failures, inference issues)
Manual Push After Training
Example: Full Production Setup
Inference Example

CRITICAL: Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.

Why Hub Push is Required

When running on Hugging Face Jobs:

Environment is temporary
All files deleted on job completion
No local disk persistence
Cannot access results after job ends

Without Hub push, training is completely wasted.

Required Configuration

1. Training Configuration

In your TrainingArguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="my-object-detector",
    push_to_hub=True,                    # Enable Hub push
    hub_model_id="username/model-name",   # Target repository
)

2. Job Configuration

When submitting the job:

hf_jobs("uv", {
    "script": training_script_content,  # Pass the Python script content directly as a string
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Provide authentication
})

The $HF_TOKEN syntax references your actual Hugging Face token value.
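
One way to avoid forgetting the secret is to build the job configuration in a single place. The sketch below is illustrative only: the helper name build_job_config and its default values are hypothetical, and only the "script", "flavor", "timeout", and "secrets" keys are taken from the hf_jobs examples in this guide.

```python
def build_job_config(script_content, flavor="a10g-large", timeout="4h"):
    """Return a job config dict that always carries the HF_TOKEN secret.

    Hypothetical helper: the keys mirror the hf_jobs examples in this guide.
    """
    if not script_content.strip():
        raise ValueError("script_content must be the script text, not empty")
    return {
        "script": script_content,
        "flavor": flavor,
        "timeout": timeout,
        # "$HF_TOKEN" is a placeholder that stands in for the real token
        "secrets": {"HF_TOKEN": "$HF_TOKEN"},
    }


config = build_job_config("print('hello from the job')")
print(sorted(config))
```

The returned dict can then be passed as the second argument of hf_jobs("uv", ...).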

Complete Example

# train_detector.py
# /// script
# dependencies = ["transformers", "torch", "torchvision", "datasets"]
# ///

from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import os
import torch

# Load dataset
dataset = load_dataset("cppe-5", split="train")

# Load model and processor
model_name = "facebook/detr-resnet-50"
image_processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForObjectDetection.from_pretrained(
    model_name,
    num_labels=5,  # Number of classes
    ignore_mismatched_sizes=True
)

# Configure with Hub push
training_args = TrainingArguments(
    output_dir="my-detector",
    num_train_epochs=10,
    per_device_train_batch_size=8,

    # ✅ CRITICAL: Hub push configuration
    push_to_hub=True,
    hub_model_id="myusername/cppe5-detector",

    # Optional: Push strategy
    hub_strategy="checkpoint",  # Push checkpoints during training
)

# ✅ CRITICAL: Authenticate with Hub BEFORE creating Trainer
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN") or os.environ.get("hfjob")
if hf_token:
    login(token=hf_token)
    training_args.hub_token = hf_token
elif training_args.push_to_hub:
    raise ValueError("HF_TOKEN not found! Add secrets={'HF_TOKEN': '$HF_TOKEN'} to job config.")

# Define collate function
def collate_fn(batch):
    pixel_values = [item["pixel_values"] for item in batch]
    labels = [item["labels"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    return {
        "pixel_values": encoding["pixel_values"],
        "labels": labels
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,
)

trainer.train()

# ✅ Push final model and processor
trainer.push_to_hub()
image_processor.push_to_hub("myusername/cppe5-detector")

print("✅ Model saved to: https://huggingface.co/myusername/cppe5-detector")

Submit with authentication:

hf_jobs("uv", {
    "script": training_script_content,  # Pass script content as a string, NOT a filename
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Required!
})

What Gets Saved

When push_to_hub=True:

Model weights - Final trained parameters
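Image processor - Preprocessing configuration (preprocessor_config.json), only if it is passed to the Trainer or pushed separately as described in the next section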
Configuration - Model config (config.json) including:
- Number of labels/classes
- Architecture details (backbone, num_queries, etc.)
- Label mappings (id2label, label2id)
Training arguments - Hyperparameters used
Model card - Auto-generated documentation
Checkpoints - Only if hub_strategy is "checkpoint" or "all_checkpoints"

Important: Save Image Processor

Object detection models require the image processor to be saved separately:

# After training completes
trainer.push_to_hub()

# ✅ Also push the image processor
image_processor.push_to_hub(
    repo_id="username/model-name",
    commit_message="Upload image processor"
)

Why this matters:

Models need specific image preprocessing (resizing, normalization)
Image processor contains critical configuration
Without it, model cannot be used for inference
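
A quick local check before pushing can catch a missing processor. The sketch below is a minimal example that assumes unsharded weights and the standard file names written by save_pretrained() (config.json, preprocessor_config.json, and model.safetensors or pytorch_model.bin). The helper name missing_artifacts is hypothetical.

```python
import tempfile
from pathlib import Path

# File names written by save_pretrained() for a model and its image processor.
REQUIRED_FILES = ("config.json", "preprocessor_config.json")
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def missing_artifacts(model_dir):
    """List the files that inference would need but that are absent."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


# Demonstration on a throwaway directory that lacks the processor config.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"")
    report = missing_artifacts(tmp)
print(report)  # ['preprocessor_config.json']
```

In a training script the same function could be pointed at the Trainer's output directory just before the job exits.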

Checkpoint Saving

Save intermediate checkpoints during training:

TrainingArguments(
    output_dir="my-detector",
    push_to_hub=True,
    hub_model_id="username/my-detector",

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=500,              # Save every 500 steps
    save_total_limit=3,          # Keep only last 3 checkpoints
    hub_strategy="checkpoint",   # Push checkpoints to Hub
)

Benefits:

Resume training if job fails
Compare checkpoint performance
Use intermediate models
Track training progress

With hub_strategy="checkpoint", the latest checkpoint is pushed to the same repo (username/my-detector) in a subfolder named last-checkpoint.
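
To choose save_steps sensibly, it helps to estimate how many checkpoints a run will produce. The following is a rough sketch for a single-device run without gradient accumulation. The helper name checkpoint_plan is hypothetical and the 1,000-example dataset size is an illustrative assumption.

```python
import math


def checkpoint_plan(num_examples, batch_size, epochs, save_steps, save_total_limit):
    """Estimate optimizer steps and checkpoint counts for a single-device run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    saved = total_steps // save_steps  # one checkpoint every save_steps steps
    kept = min(saved, save_total_limit)  # older checkpoints are rotated out
    return {"total_steps": total_steps, "saved": saved, "kept": kept}


plan = checkpoint_plan(
    num_examples=1000, batch_size=8, epochs=10, save_steps=500, save_total_limit=3
)
print(plan)  # {'total_steps': 1250, 'saved': 2, 'kept': 2}
```

With these numbers only two step-based checkpoints are written, so a smaller save_steps may be worth considering for short runs.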

Model Card Configuration

Add metadata for better discoverability:

# At the end of the training script
# Trainer.push_to_hub() forwards extra keyword arguments to create_model_card()
trainer.push_to_hub(
    commit_message="Upload trained object detection model",
    tags=["object-detection", "vision", "cppe-5"],
    license="apache-2.0",
    dataset="cppe-5",
    tasks="object-detection",
)

Saving Label Mappings

Critical for object detection: Save class labels with the model:

# Define your label mappings
id2label = {0: "Coverall", 1: "Face_Shield", 2: "Gloves", 3: "Goggles", 4: "Mask"}
label2id = {v: k for k, v in id2label.items()}

# Update model config before training
model.config.id2label = id2label
model.config.label2id = label2id

# Now train and push
trainer.train()
trainer.push_to_hub()

Without label mappings:

Model outputs will be numeric IDs only
No human-readable class names
Difficult to interpret results
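
Building both mappings from a single ordered list keeps them consistent. The sketch below also shows one detail worth knowing if you read config.json yourself rather than through the library: JSON object keys are always strings, so integer ids have to be converted back. The variable names are illustrative.

```python
import json

class_names = ["Coverall", "Face_Shield", "Gloves", "Goggles", "Mask"]

# Build both directions from one ordered list so they cannot drift apart.
id2label = dict(enumerate(class_names))
label2id = {name: idx for idx, name in id2label.items()}

# JSON object keys are always strings, so integer ids come back as "0", "1", ...
restored = json.loads(json.dumps({"id2label": id2label}))["id2label"]
restored_int_keys = {int(idx): name for idx, name in restored.items()}

print(restored_int_keys == id2label)  # True
```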

Authentication Methods

For a complete guide on token types, $HF_TOKEN automatic replacement, secrets vs env differences, and security best practices, see the hugging-face-jobs skill → Token Usage Guide.

Recommended: Always pass tokens via secrets (encrypted server-side):

"secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement with your logged-in token

Verification Checklist

Before submitting any training job, verify:

push_to_hub=True in TrainingArguments
hub_model_id is specified (format: username/model-name)
Image processor will be saved separately
Label mappings (id2label, label2id) are configured
Repository name doesn't conflict with existing repos
You have write access to the target namespace
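
Part of this checklist can be automated before submission. The following is a minimal sketch under stated assumptions: the helper name preflight_problems and the settings keys (including pushes_image_processor) are hypothetical, and the check cannot verify write access or name conflicts, which still need a manual look.

```python
def preflight_problems(settings):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not settings.get("push_to_hub"):
        problems.append("push_to_hub is not enabled")
    hub_model_id = settings.get("hub_model_id") or ""
    if hub_model_id.count("/") != 1 or not all(hub_model_id.split("/")):
        problems.append("hub_model_id must look like username/model-name")
    if not settings.get("pushes_image_processor"):
        problems.append("image processor is not pushed")
    id2label = settings.get("id2label") or {}
    label2id = settings.get("label2id") or {}
    if not id2label or label2id != {v: k for k, v in id2label.items()}:
        problems.append("id2label/label2id are missing or inconsistent")
    return problems


settings = {
    "push_to_hub": True,
    "hub_model_id": "detector-name",  # namespace is missing on purpose
    "pushes_image_processor": True,
    "id2label": {0: "Mask"},
    "label2id": {"Mask": 0},
}
print(preflight_problems(settings))
```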

Repository Setup

Automatic Creation

If the repository doesn't exist, it is created automatically: by the Trainer at initialization when push_to_hub=True, or otherwise on the first push_to_hub() call.

Manual Creation

Create repository before training:

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/detector-name",
    repo_type="model",
    private=False,  # or True for private repo
)

Repository Naming

Valid names:

username/detr-cppe5
username/yolos-object-detector
organization/custom-detector

Invalid or discouraged names:

detector-name (missing username)
username/detector name (spaces not allowed)
username/DETECTOR (uppercase discouraged)

Recommended naming:

Include model architecture: detr-, yolos-, deta-
Include dataset: -cppe5, -coco, -voc
Be descriptive: detr-resnet50-cppe5 > model1
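
The naming rules above can be approximated with a small check. This is a simplified sketch, not the Hub's actual validation logic: the pattern and the helper name check_repo_id are assumptions that only encode the rules listed in this section.

```python
import re

# Simplified pattern: letters, digits, '-', '_' and '.' on both sides of one '/'.
REPO_ID_PATTERN = re.compile(r"^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$", re.ASCII)


def check_repo_id(repo_id):
    """Classify a repo id as 'ok', 'discouraged', or 'invalid'."""
    if not REPO_ID_PATTERN.match(repo_id):
        return "invalid"
    name = repo_id.split("/", 1)[1]
    if name != name.lower():
        return "discouraged"  # accepted by the pattern, but lowercase is preferred
    return "ok"


for candidate in ["username/detr-cppe5", "detector-name", "username/DETECTOR"]:
    print(candidate, "->", check_repo_id(candidate))
```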

Troubleshooting

Error: 401 Unauthorized

Cause: HF_TOKEN not provided, invalid, or not authenticated before Trainer init

Solutions:

Verify secrets={"HF_TOKEN": "$HF_TOKEN"} in job config
Verify script calls login(token=hf_token) AND sets training_args.hub_token = hf_token BEFORE creating the Trainer
Check you're logged in locally: hf auth whoami
Re-login: hf auth login

Root cause: The Trainer calls create_repo(token=self.args.hub_token) during __init__() when push_to_hub=True. Relying on implicit env-var token resolution is unreliable in Jobs. Calling login() saves the token globally, and setting training_args.hub_token ensures the Trainer passes it explicitly to all Hub API calls.
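
The token lookup can be isolated in a small function so that a missing secret fails early with a clear message. This sketch follows the lookup order used by the training scripts in this guide (HF_TOKEN first, then hfjob). The function name resolve_hf_token is hypothetical, and the env parameter exists only to make the function easy to test.

```python
import os


def resolve_hf_token(env=None):
    """Return the Hub token from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN") or env.get("hfjob")
    if not token:
        raise RuntimeError(
            "No Hub token found. Add secrets={'HF_TOKEN': '$HF_TOKEN'} to the job config."
        )
    return token


# Demonstration with a fake environment; a real job would call resolve_hf_token().
print(resolve_hf_token({"HF_TOKEN": "hf_example"}))
```

The returned value would then be passed to login(token=...) and assigned to training_args.hub_token before the Trainer is created.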

Error: 403 Forbidden

Cause: No write access to repository

Solutions:

Check repository namespace matches your username
Verify you're a member of organization (if using org namespace)
Check repository isn't private (if accessing org repo)

Error: Repository not found

Cause: Repository doesn't exist and auto-creation failed

Solutions:

Manually create repository first
Check repository name format
Verify namespace exists

Error: Push failed during training

Cause: Network issues or Hub unavailable

Solutions:

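Training continues, but the final push may fail
Checkpoints pushed earlier may already be on the Hub
Re-run the push manually before the job exits, while the local files still exist (see Manual Push After Training)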
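
For transient network failures, one option is to wrap the final push in a retry loop with exponential backoff. The sketch below is an illustration, not part of any library: push_with_retry is a hypothetical helper, and the flaky function only simulates a push that fails twice.

```python
import time


def push_with_retry(push_fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call push_fn until it succeeds, backing off exponentially between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return push_fn()
        except Exception as error:  # narrow this to network errors in real use
            if attempt == attempts:
                raise
            wait = base_delay * 2 ** (attempt - 1)
            print(f"Push attempt {attempt} failed ({error}); retrying in {wait:.0f}s")
            sleep(wait)


calls = {"count": 0}


def flaky_push():
    """Simulated push that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "pushed"


result = push_with_retry(flaky_push, sleep=lambda seconds: None)
print(result, calls["count"])  # pushed 3
```

In a training script, trainer.push_to_hub could be passed as push_fn so that the retries happen while the job's files still exist.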

Issue: Model loads but inference fails

Possible causes:

Image processor not saved—verify it's pushed separately
Label mappings missing—check config.json has id2label
Wrong image size—verify image processor matches training config

Issue: Model saved but not visible

Possible causes:

Repository is private—check https://huggingface.co/username
Wrong namespace—verify hub_model_id matches login
Push still in progress—wait a few minutes

Manual Push After Training

If training completes but push fails, push manually:

import os
from transformers import AutoModelForObjectDetection, AutoImageProcessor

# Load from the local output directory
model = AutoModelForObjectDetection.from_pretrained("./output_dir")
image_processor = AutoImageProcessor.from_pretrained("./output_dir")

# Push to Hub; read the token from the environment instead of hard-coding it
token = os.environ["HF_TOKEN"]
model.push_to_hub("username/model-name", token=token)
image_processor.push_to_hub("username/model-name", token=token)

Note: This only works while the job is still running, because the local files are deleted once it completes.

Best Practices

Always enable push_to_hub=True
Save image processor separately - critical for inference
Configure label mappings before training
Use checkpoint saving for long training runs
Verify Hub push in logs before job completes
Set appropriate save_total_limit to avoid excessive checkpoints
Use descriptive repo names (e.g., detr-cppe5 not detector1)
Add model card with:
- Training dataset
- Evaluation metrics (mAP, IoU)
- Example usage code
- Limitations
Tag models appropriately:
- object-detection
- Architecture: detr, yolos, deta
- Dataset: coco, voc, cppe-5

Monitoring Push Progress

Check logs for push progress:

hf_jobs("logs", {"job_id": "your-job-id"})

Look for:

Pushing model to username/detector-name...
Upload file pytorch_model.bin: 100%
✅ Model pushed successfully
Pushing image processor...
✅ Image processor pushed successfully
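
Scanning the log text for the expected messages can be scripted. The sketch below assumes the logs are available as a plain string and that the script prints the marker lines shown above. Both are assumptions, since the exact output depends on the training script and library versions. The helper name missing_markers is hypothetical.

```python
def missing_markers(
    log_text,
    markers=("Model pushed successfully", "Image processor pushed successfully"),
):
    """Return the expected markers that never appear in the log text."""
    return [marker for marker in markers if marker not in log_text]


sample_log = """Pushing model to username/detector-name...
Model pushed successfully
Pushing image processor...
"""
print(missing_markers(sample_log))  # ['Image processor pushed successfully']
```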

Example: Full Production Setup

# production_detector.py
# /// script
# dependencies = [
#     "transformers>=4.41.0",
#     "torch>=2.0.0",
#     "torchvision>=0.15.0",
#     "datasets>=2.12.0",
#     "evaluate>=0.4.0"
# ]
# ///

from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import os
import torch

# Configuration
MODEL_NAME = "facebook/detr-resnet-50"
DATASET_NAME = "cppe-5"
HUB_MODEL_ID = "myusername/detr-cppe5-detector"
NUM_CLASSES = 5

# Class labels
id2label = {0: "Coverall", 1: "Face_Shield", 2: "Gloves", 3: "Goggles", 4: "Mask"}
label2id = {v: k for k, v in id2label.items()}

print(f"🔧 Loading dataset: {DATASET_NAME}")
dataset = load_dataset(DATASET_NAME, split="train")
print(f"✅ Dataset loaded: {len(dataset)} examples")

print(f"🔧 Loading model: {MODEL_NAME}")
image_processor = AutoImageProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForObjectDetection.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_CLASSES,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)
print("✅ Model loaded")

# Configure with comprehensive Hub settings
training_args = TrainingArguments(
    output_dir="detr-cppe5",

    # Hub configuration
    push_to_hub=True,
    hub_model_id=HUB_MODEL_ID,
    hub_strategy="checkpoint",  # Push checkpoints

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,

    # Training settings
    num_train_epochs=10,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    warmup_steps=500,

    # Evaluation
    eval_strategy="steps",
    eval_steps=500,

    # Logging
    logging_steps=50,
    logging_first_step=True,

    # Performance
    fp16=True,  # Mixed precision training
    dataloader_num_workers=4,
)

# ✅ CRITICAL: Authenticate with Hub BEFORE creating Trainer
# login() saves the token globally so ALL hub operations can find it.
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN") or os.environ.get("hfjob")
if hf_token:
    login(token=hf_token)
    training_args.hub_token = hf_token
elif training_args.push_to_hub:
    raise ValueError("HF_TOKEN not found! Add secrets={'HF_TOKEN': '$HF_TOKEN'} to job config.")

# Data collator
def collate_fn(batch):
    pixel_values = [item["pixel_values"] for item in batch]
    labels = [item["labels"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    return {
        "pixel_values": encoding["pixel_values"],
        "labels": labels
    }

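# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    # eval_strategy="steps" needs an evaluation set; this assumes the test
    # split is preprocessed the same way as the training split
    eval_dataset=load_dataset(DATASET_NAME, split="test"),
    data_collator=collate_fn,
)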

print("🚀 Starting training...")
trainer.train()

print("💾 Pushing final model to Hub...")
trainer.push_to_hub(
    commit_message="Upload trained DETR model on CPPE-5",
    tags=["object-detection", "detr", "cppe-5", "vision"],
)

print("💾 Pushing image processor to Hub...")
image_processor.push_to_hub(
    repo_id=HUB_MODEL_ID,
    commit_message="Upload image processor"
)

print("✅ Training complete!")
print(f"Model available at: https://huggingface.co/{HUB_MODEL_ID}")
print(f"\nTo use your model:")
print(f"```python")
print(f"from transformers import AutoImageProcessor, AutoModelForObjectDetection")
print(f"")
print(f"processor = AutoImageProcessor.from_pretrained('{HUB_MODEL_ID}')")
print(f"model = AutoModelForObjectDetection.from_pretrained('{HUB_MODEL_ID}')")
print(f"```")

Submit:

hf_jobs("uv", {
    "script": training_script_content,  # Pass script content as a string, NOT a filename
    "flavor": "a10g-large",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})

Inference Example

After training, use your model:

from transformers import AutoImageProcessor, AutoModelForObjectDetection
from PIL import Image
import torch

# Load model from Hub
processor = AutoImageProcessor.from_pretrained("username/detr-cppe5-detector")
model = AutoModelForObjectDetection.from_pretrained("username/detr-cppe5-detector")

# Load and process image
image = Image.open("test_image.jpg")
inputs = processor(images=image, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Post-process results
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs,
    threshold=0.5,
    target_sizes=target_sizes
)[0]

# Print detections
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
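
For downstream use, the post-processed results are often easier to handle as plain Python data. The sketch below is a hypothetical helper (summarize_detections) that works on ordinary lists, assuming boxes are given as (xmin, ymin, xmax, ymax) in pixels. The sample values are made up for illustration.

```python
def summarize_detections(scores, labels, boxes, id2label, threshold=0.5):
    """Turn parallel lists of scores, label ids, and boxes into readable dicts."""
    detections = []
    for score, label_id, (xmin, ymin, xmax, ymax) in zip(scores, labels, boxes):
        if score < threshold:
            continue  # drop low-confidence detections
        detections.append({
            "label": id2label[label_id],
            "score": round(score, 3),
            "width": round(xmax - xmin, 2),
            "height": round(ymax - ymin, 2),
        })
    # Highest-confidence detections first
    return sorted(detections, key=lambda d: d["score"], reverse=True)


id2label = {0: "Coverall", 1: "Face_Shield", 2: "Gloves", 3: "Goggles", 4: "Mask"}
summary = summarize_detections(
    scores=[0.91, 0.42, 0.77],
    labels=[4, 2, 0],
    boxes=[[10.0, 20.0, 110.0, 220.0], [0.0, 0.0, 5.0, 5.0], [50.0, 60.0, 80.0, 100.0]],
    id2label=id2label,
)
for detection in summary:
    print(detection)
```

With the inference code above, the inputs could come from results["scores"].tolist(), results["labels"].tolist(), and results["boxes"].tolist().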

Key Takeaway

Without push_to_hub=True and secrets={"HF_TOKEN": "$HF_TOKEN"}, all training results are permanently lost.

For object detection, also remember to:

Save the image processor separately
Configure label mappings (id2label, label2id)
Include appropriate model card metadata

Always verify all three are configured before submitting any training job.
