Saving Vision Models to Hugging Face Hub

Contents

  • Why Hub Push is Required
  • Required Configuration (TrainingArguments, job config)
  • Complete Example
  • What Gets Saved
  • Important: Save Image Processor
  • Checkpoint Saving
  • Model Card Configuration
  • Saving Label Mappings
  • Authentication Methods
  • Verification Checklist
  • Repository Setup (automatic/manual creation, naming)
  • Troubleshooting (401, 403, push failures, inference issues)
  • Manual Push After Training
  • Best Practices
  • Monitoring Push Progress
  • Example: Full Production Setup
  • Inference Example
  • Key Takeaway

CRITICAL: Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.

Why Hub Push is Required

When running on Hugging Face Jobs:

  • Environment is temporary
  • All files deleted on job completion
  • No local disk persistence
  • Cannot access results after job ends

Without Hub push, training is completely wasted.

Required Configuration

1. Training Configuration

In your TrainingArguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="my-object-detector",
    push_to_hub=True,                    # Enable Hub push
    hub_model_id="username/model-name",   # Target repository
)

2. Job Configuration

When submitting the job:

hf_jobs("uv", {
    "script": training_script_content,  # Pass the Python script content directly as a string
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Provide authentication
})

The $HF_TOKEN placeholder is replaced automatically with your logged-in Hugging Face token; the job then exposes it to the script as the HF_TOKEN environment variable.
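Inside the training script, it helps to fail fast when the secret did not arrive, rather than discovering the problem at the first push. The helper below is a minimal sketch using only the standard library; the name `resolve_hf_token` is hypothetical, and it checks only the HF_TOKEN variable.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hub token from the environment or fail with a clear message."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. "
            "Add secrets={'HF_TOKEN': '$HF_TOKEN'} to the job config."
        )
    return token


# Demonstration with a fake environment (never print a real token).
fake_env = {"HF_TOKEN": "  hf_example_value  "}
token = resolve_hf_token(fake_env)
print(f"Token found, length {len(token)}")
```

Calling `resolve_hf_token()` with no argument reads the real process environment, so the same function can be used at the top of a training script before any Hub call is made.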

Complete Example

# train_detector.py
# /// script
# dependencies = ["transformers", "torch", "torchvision", "datasets"]
# ///

from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import os
import torch

# Load dataset
dataset = load_dataset("cppe-5", split="train")

# Load model and processor
model_name = "facebook/detr-resnet-50"
image_processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForObjectDetection.from_pretrained(
    model_name,
    num_labels=5,  # Number of classes
    ignore_mismatched_sizes=True
)

# Configure with Hub push
training_args = TrainingArguments(
    output_dir="my-detector",
    num_train_epochs=10,
    per_device_train_batch_size=8,

    # ✅ CRITICAL: Hub push configuration
    push_to_hub=True,
    hub_model_id="myusername/cppe5-detector",

    # Optional: Push strategy
    hub_strategy="checkpoint",  # Push checkpoints during training
)

# ✅ CRITICAL: Authenticate with Hub BEFORE creating Trainer
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN") or os.environ.get("hfjob")
if hf_token:
    login(token=hf_token)
    training_args.hub_token = hf_token
elif training_args.push_to_hub:
    raise ValueError("HF_TOKEN not found! Add secrets={'HF_TOKEN': '$HF_TOKEN'} to job config.")

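# Define collate function
# Assumes each example was already preprocessed with image_processor and
# therefore has "pixel_values" and "labels" fields.
def collate_fn(batch):
    pixel_values = [item["pixel_values"] for item in batch]
    labels = [item["labels"] for item in batch]
    # Pad images to a common size; the mask marks real pixels vs. padding
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    return {
        "pixel_values": encoding["pixel_values"],
        "pixel_mask": encoding["pixel_mask"],
        "labels": labels
    }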

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,
)

trainer.train()

# ✅ Push final model and processor
trainer.push_to_hub()
image_processor.push_to_hub("myusername/cppe5-detector")

print("✅ Model saved to: https://huggingface.co/myusername/cppe5-detector")

Submit with authentication:

hf_jobs("uv", {
    "script": training_script_content,  # Pass script content as a string, NOT a filename
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Required!
})

What Gets Saved

When push_to_hub=True:

  1. Model weights - Final trained parameters
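  2. Image processor - Preprocessing configuration, uploaded only if the processor was passed to the Trainer; otherwise push it separately (see Important: Save Image Processor)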
  3. Configuration - Model config (config.json) including:
    • Number of labels/classes
    • Architecture details (backbone, num_queries, etc.)
    • Label mappings (id2label, label2id)
  4. Training arguments - Hyperparameters used
  5. Model card - Auto-generated documentation
  6. Checkpoints - If a save_strategy is set together with hub_strategy="checkpoint" or "all_checkpoints"

Important: Save Image Processor

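trainer.push_to_hub() uploads the image processor only if the processor was passed to the Trainer. The examples in this guide do not do that, so push the image processor separately: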

# After training completes
trainer.push_to_hub()

# ✅ Also push the image processor
image_processor.push_to_hub(
    repo_id="username/model-name",
    commit_message="Upload image processor"
)

Why this matters:

  • Models need specific image preprocessing (resizing, normalization)
  • Image processor contains critical configuration
  • Without it, model cannot be used for inference

Checkpoint Saving

Save intermediate checkpoints during training:

TrainingArguments(
    output_dir="my-detector",
    push_to_hub=True,
    hub_model_id="username/my-detector",

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=500,              # Save every 500 steps
    save_total_limit=3,          # Keep only last 3 checkpoints
    hub_strategy="checkpoint",   # Push checkpoints to Hub
)

Benefits:

  • Resume training if job fails
  • Compare checkpoint performance
  • Use intermediate models
  • Track training progress

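Checkpoints are pushed to the same repo, username/my-detector. With hub_strategy="checkpoint", the latest checkpoint is also pushed to a last-checkpoint subfolder, which makes it easy to resume training.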
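To make the interaction of save_steps and save_total_limit concrete, the sketch below lists which checkpoint folders would remain on local disk at the end of a run. It assumes the Trainer's `checkpoint-<step>` folder naming, simple oldest-first rotation, and a total step count that is a multiple of save_steps; the helper name `surviving_checkpoints` is hypothetical.

```python
def surviving_checkpoints(
    total_steps: int, save_steps: int, save_total_limit: int
) -> list[str]:
    """Return the checkpoint folder names left after oldest-first rotation."""
    if save_steps <= 0 or save_total_limit <= 0:
        raise ValueError("save_steps and save_total_limit must be positive")
    # One checkpoint is written every `save_steps` optimizer steps.
    saved = list(range(save_steps, total_steps + 1, save_steps))
    # Only the most recent `save_total_limit` checkpoints are kept.
    kept = saved[-save_total_limit:]
    return [f"checkpoint-{step}" for step in kept]


print(surviving_checkpoints(total_steps=2000, save_steps=500, save_total_limit=3))
# → ['checkpoint-1000', 'checkpoint-1500', 'checkpoint-2000']
```

With the settings shown above, the first checkpoint (step 500) is rotated out once the fourth one is written.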

Model Card Configuration

Add metadata for better discoverability:

# At the end of the training script
# Trainer.push_to_hub() forwards extra keyword arguments to the
# auto-generated model card. Evaluation metrics logged by the Trainer
# are added to the card automatically.
trainer.push_to_hub(
    commit_message="Upload trained object detection model",
    license="apache-2.0",
    tags=["object-detection", "vision", "cppe-5"],
    dataset="cppe-5",
    tasks="object-detection",
)

Saving Label Mappings

Critical for object detection: Save class labels with the model:

# Define your label mappings
id2label = {0: "Coverall", 1: "Face_Shield", 2: "Gloves", 3: "Goggles", 4: "Mask"}
label2id = {v: k for k, v in id2label.items()}

# Update model config before training
model.config.id2label = id2label
model.config.label2id = label2id

# Now train and push
trainer.train()
trainer.push_to_hub()

Without label mappings:

  • Model outputs will be numeric IDs only
  • No human-readable class names
  • Difficult to interpret results
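The two mappings are easiest to keep consistent when both are derived from a single ordered list of class names. The sketch below also shows a detail that matters when inspecting config.json by hand: JSON object keys are always strings, so integer ids appear as "0", "1", and so on in the file. The variable names are illustrative.

```python
import json

class_names = ["Coverall", "Face_Shield", "Gloves", "Goggles", "Mask"]

# Build both directions from one ordered list so they cannot drift apart.
id2label = dict(enumerate(class_names))
label2id = {name: idx for idx, name in id2label.items()}

# JSON object keys are always strings, so integer ids become "0", "1", ...
round_tripped = json.loads(json.dumps(id2label))
print(list(round_tripped)[:2])  # → ['0', '1']

# Convert keys back to int when reading such a file by hand.
restored = {int(idx): name for idx, name in round_tripped.items()}
print(restored == id2label)  # → True
```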

Authentication Methods

For a complete guide on token types, $HF_TOKEN automatic replacement, secrets vs env differences, and security best practices, see the hugging-face-jobs skill → Token Usage Guide.

Recommended: Always pass tokens via secrets (encrypted server-side):

"secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement with your logged-in token

Verification Checklist

Before submitting any training job, verify:

  • push_to_hub=True in TrainingArguments
  • hub_model_id is specified (format: username/model-name)
  • Image processor will be saved separately
  • Label mappings (id2label, label2id) are configured
  • Repository name doesn't conflict with existing repos
  • You have write access to the target namespace
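Part of this checklist can be automated before a job is submitted. The function below is a minimal sketch over a plain dictionary of settings; the function name and the keys (`pushes_image_processor` in particular) are hypothetical and not part of any library. It cannot check repository conflicts or write access, which require a call to the Hub.

```python
def preflight_issues(settings: dict) -> list[str]:
    """Return a list of checklist violations; an empty list means all clear."""
    issues = []
    if not settings.get("push_to_hub"):
        issues.append("push_to_hub must be True")

    # hub_model_id must have both a namespace and a model name.
    hub_model_id = settings.get("hub_model_id") or ""
    namespace, _, name = hub_model_id.partition("/")
    if not namespace or not name:
        issues.append("hub_model_id must look like username/model-name")

    if not settings.get("pushes_image_processor"):
        issues.append("image processor is not pushed")

    # The two label mappings must exist and be exact inverses.
    id2label = settings.get("id2label") or {}
    label2id = settings.get("label2id") or {}
    if not id2label or {v: k for k, v in id2label.items()} != label2id:
        issues.append("id2label/label2id missing or inconsistent")
    return issues


good = {
    "push_to_hub": True,
    "hub_model_id": "username/detr-cppe5",
    "pushes_image_processor": True,
    "id2label": {0: "Coverall", 1: "Mask"},
    "label2id": {"Coverall": 0, "Mask": 1},
}
print(preflight_issues(good))  # → []
print(preflight_issues({"push_to_hub": True, "hub_model_id": "detr-cppe5"}))
```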

Repository Setup

Automatic Creation

If the repository doesn't exist, it is created automatically on the first push.

Manual Creation

Create repository before training:

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/detector-name",
    repo_type="model",
    private=False,  # or True for private repo
)

Repository Naming

Valid names:

  • username/detr-cppe5
  • username/yolos-object-detector
  • organization/custom-detector

Invalid or discouraged names:

  • detector-name (missing username)
  • username/detector name (spaces not allowed)
  • username/DETECTOR (uppercase discouraged)

Recommended naming:

  • Include model architecture: detr-, yolos-, deta-
  • Include dataset: -cppe5, -coco, -voc
  • Be descriptive: detr-resnet50-cppe5 > model1
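The naming rules above can be expressed as a small check. This is a sketch of this guide's conventions only, not the Hub's full validation (the huggingface_hub library performs its own checks when a repository is created); the function name `check_repo_id` is hypothetical.

```python
import re

# Letters, digits, '.', '_' and '-', starting with a letter or digit.
_ALLOWED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")


def check_repo_id(repo_id: str) -> list[str]:
    """Return a list of problems with a repo id; empty means it looks fine."""
    parts = repo_id.split("/")
    if len(parts) != 2 or not all(parts):
        return ["expected the form namespace/name"]

    problems = []
    namespace, name = parts
    for part in (namespace, name):
        if not _ALLOWED.match(part):
            problems.append(
                f"{part!r} contains characters outside letters, digits, '.', '_', '-'"
            )
    if name != name.lower():
        problems.append("uppercase letters in the repository name are discouraged")
    return problems


for candidate in [
    "username/detr-cppe5",
    "detector-name",
    "username/detector name",
    "username/DETECTOR",
]:
    print(candidate, "->", check_repo_id(candidate) or "ok")
```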

Troubleshooting

Error: 401 Unauthorized

Cause: HF_TOKEN not provided, invalid, or not authenticated before Trainer init

Solutions:

  1. Verify secrets={"HF_TOKEN": "$HF_TOKEN"} in job config
  2. Verify script calls login(token=hf_token) AND sets training_args.hub_token = hf_token BEFORE creating the Trainer
  3. Check you're logged in locally: hf auth whoami
  4. Re-login: hf auth login

Root cause: The Trainer calls create_repo(token=self.args.hub_token) during __init__() when push_to_hub=True. Relying on implicit env-var token resolution is unreliable in Jobs. Calling login() saves the token globally, and setting training_args.hub_token ensures the Trainer passes it explicitly to all Hub API calls.

Error: 403 Forbidden

Cause: No write access to repository

Solutions:

  1. Check repository namespace matches your username
  2. Verify you're a member of organization (if using org namespace)
  3. Check repository isn't private (if accessing org repo)

Error: Repository not found

Cause: Repository doesn't exist and auto-creation failed

Solutions:

  1. Manually create repository first
  2. Check repository name format
  3. Verify namespace exists

Error: Push failed during training

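Cause: Network issues or Hub unavailable

What to expect:

  1. Training continues, but the final push fails
  2. Checkpoints pushed earlier may already be on the Hub

Solutions:

  1. Re-run the push manually before the job completes (see Manual Push After Training); local files are deleted once the job ends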
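Because the job's files disappear when it exits, one option is to retry the final push inside the training script itself. The wrapper below is a minimal sketch with exponential backoff; `push_with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary choices, not recommended values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def push_with_retries(
    push: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `push` until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return push()
        except Exception as exc:
            if attempt == attempts:
                raise  # Out of retries: surface the last error.
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Push attempt {attempt} failed ({exc!r}); retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable")


# In a training script this could wrap the real calls, for example:
#   push_with_retries(trainer.push_to_hub)
#   push_with_retries(lambda: image_processor.push_to_hub("username/model-name"))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in a real script the default `time.sleep` is used.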

Issue: Model loads but inference fails

Possible causes:

  1. Image processor not saved—verify it's pushed separately
  2. Label mappings missing—check config.json has id2label
  3. Wrong image size—verify image processor matches training config
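The first two causes can be narrowed down by looking at the list of files in the repository. The sketch below checks a file listing for the files inference needs; it assumes a single-file (unsharded) checkpoint, and `missing_for_inference` is a hypothetical helper. The listing itself could come from `HfApi().list_repo_files("username/model-name")` in huggingface_hub.

```python
from typing import Iterable

# config.json holds id2label; preprocessor_config.json is the image processor.
REQUIRED_FILES = ("config.json", "preprocessor_config.json")
# Either weight format is acceptable (assumption: unsharded checkpoint).
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def missing_for_inference(repo_files: Iterable[str]) -> list[str]:
    """Return the names of files that inference needs but the repo lacks."""
    files = set(repo_files)
    missing = [name for name in REQUIRED_FILES if name not in files]
    if not files.intersection(WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing


# Example listing from a repo where only trainer.push_to_hub() was called.
listing = ["README.md", "config.json", "model.safetensors", "training_args.bin"]
print(missing_for_inference(listing))  # → ['preprocessor_config.json']
```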

Issue: Model saved but not visible

Possible causes:

  1. Repository is private—check https://huggingface.co/username
  2. Wrong namespace—verify hub_model_id matches login
  3. Push still in progress—wait a few minutes

Manual Push After Training

If training completes but push fails, push manually:

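import os

from transformers import AutoModelForObjectDetection, AutoImageProcessor

# Read the token from the environment instead of hardcoding it
hf_token = os.environ["HF_TOKEN"]

# Load from local checkpoint
model = AutoModelForObjectDetection.from_pretrained("./output_dir")
# Requires that the image processor was saved to this directory,
# e.g. with image_processor.save_pretrained("./output_dir")
image_processor = AutoImageProcessor.from_pretrained("./output_dir")

# Push to Hub
model.push_to_hub("username/model-name", token=hf_token)
image_processor.push_to_hub("username/model-name", token=hf_token)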

Note: This is only possible while the job is still running, because the local files are deleted when it completes.

Best Practices

  1. Always enable push_to_hub=True
  2. Save image processor separately - critical for inference
  3. Configure label mappings before training
  4. Use checkpoint saving for long training runs
  5. Verify Hub push in logs before job completes
  6. Set appropriate save_total_limit to avoid excessive checkpoints
  7. Use descriptive repo names (e.g., detr-cppe5 not detector1)
  8. Add model card with:
    • Training dataset
    • Evaluation metrics (mAP, IoU)
    • Example usage code
    • Limitations
  9. Tag models appropriately:
    • object-detection
    • Architecture: detr, yolos, deta
    • Dataset: coco, voc, cppe-5

Monitoring Push Progress

Check logs for push progress:

hf_jobs("logs", {"job_id": "your-job-id"})

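Look for output similar to the following (exact wording depends on library versions and on the print statements in your script):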

Pushing model to username/detector-name...
Upload file pytorch_model.bin: 100%
✅ Model pushed successfully
Pushing image processor...
✅ Image processor pushed successfully

Example: Full Production Setup

# production_detector.py
# /// script
# dependencies = [
#     "transformers>=4.41.0",
#     "torch>=2.0.0",
#     "torchvision>=0.15.0",
#     "datasets>=2.12.0",
#     "evaluate>=0.4.0"
# ]
# ///

from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import os
import torch

# Configuration
MODEL_NAME = "facebook/detr-resnet-50"
DATASET_NAME = "cppe-5"
HUB_MODEL_ID = "myusername/detr-cppe5-detector"
NUM_CLASSES = 5

# Class labels
id2label = {0: "Coverall", 1: "Face_Shield", 2: "Gloves", 3: "Goggles", 4: "Mask"}
label2id = {v: k for k, v in id2label.items()}

print(f"🔧 Loading dataset: {DATASET_NAME}")
dataset = load_dataset(DATASET_NAME, split="train")
print(f"✅ Dataset loaded: {len(dataset)} examples")

print(f"🔧 Loading model: {MODEL_NAME}")
image_processor = AutoImageProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForObjectDetection.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_CLASSES,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)
print("✅ Model loaded")

# Configure with comprehensive Hub settings
training_args = TrainingArguments(
    output_dir="detr-cppe5",

    # Hub configuration
    push_to_hub=True,
    hub_model_id=HUB_MODEL_ID,
    hub_strategy="checkpoint",  # Push checkpoints

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,

    # Training settings
    num_train_epochs=10,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    warmup_steps=500,

    # Evaluation
    eval_strategy="steps",
    eval_steps=500,

    # Logging
    logging_steps=50,
    logging_first_step=True,

    # Performance
    fp16=True,  # Mixed precision training
    dataloader_num_workers=4,
)

# ✅ CRITICAL: Authenticate with Hub BEFORE creating Trainer
# login() saves the token globally so ALL hub operations can find it.
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN") or os.environ.get("hfjob")
if hf_token:
    login(token=hf_token)
    training_args.hub_token = hf_token
elif training_args.push_to_hub:
    raise ValueError("HF_TOKEN not found! Add secrets={'HF_TOKEN': '$HF_TOKEN'} to job config.")

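# Data collator
# Assumes each example was already preprocessed with image_processor and
# therefore has "pixel_values" and "labels" fields.
def collate_fn(batch):
    pixel_values = [item["pixel_values"] for item in batch]
    labels = [item["labels"] for item in batch]
    # Pad images to a common size; the mask marks real pixels vs. padding
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    return {
        "pixel_values": encoding["pixel_values"],
        "pixel_mask": encoding["pixel_mask"],
        "labels": labels
    }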

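# Create trainer
# eval_strategy="steps" requires an eval_dataset; the test split must be
# preprocessed the same way as the training split.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=load_dataset(DATASET_NAME, split="test"),
    data_collator=collate_fn,
)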

print("🚀 Starting training...")
trainer.train()

print("💾 Pushing final model to Hub...")
trainer.push_to_hub(
    commit_message="Upload trained DETR model on CPPE-5",
    tags=["object-detection", "detr", "cppe-5", "vision"],
)

print("💾 Pushing image processor to Hub...")
image_processor.push_to_hub(
    repo_id=HUB_MODEL_ID,
    commit_message="Upload image processor"
)

print("✅ Training complete!")
print(f"Model available at: https://huggingface.co/{HUB_MODEL_ID}")
print(f"\nTo use your model:")
print(f"```python")
print(f"from transformers import AutoImageProcessor, AutoModelForObjectDetection")
print(f"")
print(f"processor = AutoImageProcessor.from_pretrained('{HUB_MODEL_ID}')")
print(f"model = AutoModelForObjectDetection.from_pretrained('{HUB_MODEL_ID}')")
print(f"```")

Submit:

hf_jobs("uv", {
    "script": training_script_content,  # Pass script content as a string, NOT a filename
    "flavor": "a10g-large",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})

Inference Example

After training, use your model:

from transformers import AutoImageProcessor, AutoModelForObjectDetection
from PIL import Image
import torch

# Load model from Hub
processor = AutoImageProcessor.from_pretrained("username/detr-cppe5-detector")
model = AutoModelForObjectDetection.from_pretrained("username/detr-cppe5-detector")

# Load and process image
image = Image.open("test_image.jpg")
inputs = processor(images=image, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Post-process results
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs,
    threshold=0.5,
    target_sizes=target_sizes
)[0]

# Print detections
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )

Key Takeaway

Without push_to_hub=True and secrets={"HF_TOKEN": "$HF_TOKEN"}, all training results are permanently lost.

For object detection, also remember to:

  1. Save the image processor separately
  2. Configure label mappings (id2label, label2id)
  3. Include appropriate model card metadata

Always verify all three are configured before submitting any training job.