# Saving Vision Models to Hugging Face Hub

## Contents

- Why Hub Push is Required
- Required Configuration (TrainingArguments, job config)
- Complete Example
- What Gets Saved
- Important: Save Image Processor
- Checkpoint Saving
- Model Card Configuration
- Saving Label Mappings
- Authentication Methods
- Verification Checklist
- Repository Setup (automatic/manual creation, naming)
- Troubleshooting (401, 403, push failures, inference issues)
- Manual Push After Training
- Best Practices
- Monitoring Push Progress
- Example: Full Production Setup
- Inference Example
- Key Takeaway

---

**CRITICAL:** Training environments are ephemeral. ALL results are lost when a job completes unless they are pushed to the Hub.

## Why Hub Push is Required

When running on Hugging Face Jobs:
- The environment is temporary
- All files are deleted when the job completes
- There is no local disk persistence
- Results cannot be accessed after the job ends

**Without a Hub push, the training run is completely wasted.**

## Required Configuration

### 1. Training Configuration

In your `TrainingArguments`:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="my-object-detector",
    push_to_hub=True,                    # Enable Hub push
    hub_model_id="username/model-name",  # Target repository
)
```

### 2. Job Configuration

When submitting the job:

```python
hf_jobs("uv", {
    "script": training_script_content,    # Pass the Python script content directly as a string
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Provide authentication
})
```

**The `$HF_TOKEN` placeholder is replaced automatically with the Hugging Face token of your logged-in session.**

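Because the `script` entry must hold the script's text and not a filename, it can help to build the job config from a file on disk. The following is a minimal sketch under stated assumptions: `build_job_config` is a hypothetical helper (not part of any library), and the `flavor` and `timeout` defaults simply mirror the values used in the submission examples in this guide.

```python
from pathlib import Path


def build_job_config(script_path: str, flavor: str = "a10g-large", timeout: str = "4h") -> dict:
    """Return a job config whose "script" entry is the file's text, not its path.

    Hypothetical helper for illustration only.
    """
    script_content = Path(script_path).read_text(encoding="utf-8")
    return {
        "script": script_content,
        "flavor": flavor,
        "timeout": timeout,
        # "$HF_TOKEN" stays a literal placeholder here; it is substituted at submission time.
        "secrets": {"HF_TOKEN": "$HF_TOKEN"},
    }
```

The returned dictionary can then be passed as the second argument of `hf_jobs("uv", ...)`.
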
## Complete Example

```python
# train_detector.py
# /// script
# dependencies = ["transformers", "torch", "torchvision", "datasets", "timm", "scipy"]
# ///
# timm and scipy are listed because DETR's backbone and matching loss rely on them.

from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import os
import torch

# Load dataset
# NOTE: preprocessing is omitted here. Before training, each example must be run
# through image_processor so that it contains "pixel_values" and "labels".
dataset = load_dataset("cppe-5", split="train")

# Load model and processor
model_name = "facebook/detr-resnet-50"
image_processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForObjectDetection.from_pretrained(
    model_name,
    num_labels=5,  # Number of classes
    ignore_mismatched_sizes=True
)

# Configure with Hub push
training_args = TrainingArguments(
    output_dir="my-detector",
    num_train_epochs=10,
    per_device_train_batch_size=8,

    # ✅ CRITICAL: Hub push configuration
    push_to_hub=True,
    hub_model_id="myusername/cppe5-detector",

    # Optional: Push strategy
    hub_strategy="checkpoint",  # Push checkpoints during training
)

# ✅ CRITICAL: Authenticate with Hub BEFORE creating Trainer
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN") or os.environ.get("hfjob")
if hf_token:
    login(token=hf_token)
    training_args.hub_token = hf_token
elif training_args.push_to_hub:
    raise ValueError("HF_TOKEN not found! Add secrets={'HF_TOKEN': '$HF_TOKEN'} to job config.")

# Define collate function
def collate_fn(batch):
    pixel_values = [item["pixel_values"] for item in batch]
    labels = [item["labels"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    return {
        "pixel_values": encoding["pixel_values"],
        "labels": labels
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,
)

trainer.train()

# ✅ Push final model and processor
trainer.push_to_hub()
image_processor.push_to_hub("myusername/cppe5-detector")

print("✅ Model saved to: https://huggingface.co/myusername/cppe5-detector")
```

**Submit with authentication:**

```python
hf_jobs("uv", {
    "script": training_script_content,  # Pass script content as a string, NOT a filename
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Required!
})
```

## What Gets Saved

When `push_to_hub=True`:

1. **Model weights** - Final trained parameters
2. **Image processor** - Associated preprocessing configuration, once it has been pushed as described in the next section
3. **Configuration** - Model config (`config.json`) including:
   - Number of labels/classes
   - Architecture details (backbone, num_queries, etc.)
   - Label mappings (id2label, label2id)
4. **Training arguments** - Hyperparameters used
5. **Model card** - Auto-generated documentation
6. **Checkpoints** - Intermediate saves, if `save_strategy="steps"` is enabled and `hub_strategy` is set to push them

## Important: Save Image Processor

**Object detection models require the image processor to be saved separately:**

```python
# After training completes
trainer.push_to_hub()

# ✅ Also push the image processor
image_processor.push_to_hub(
    repo_id="username/model-name",
    commit_message="Upload image processor"
)
```

**Why this matters:**
- Models need specific image preprocessing (resizing, normalization)
- The image processor holds that configuration
- Without it, the model cannot be used for inference

## Checkpoint Saving

Save intermediate checkpoints during training:

```python
TrainingArguments(
    output_dir="my-detector",
    push_to_hub=True,
    hub_model_id="username/my-detector",

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=500,             # Save every 500 steps
    save_total_limit=3,         # Keep only last 3 checkpoints
    hub_strategy="checkpoint",  # Push checkpoints to Hub
)
```

**Benefits:**
- Resume training if job fails
- Compare checkpoint performance
- Use intermediate models
- Track training progress

**Checkpoints are pushed to:** `username/my-detector` (same repo). With `hub_strategy="checkpoint"`, the most recent checkpoint is also stored in a `last-checkpoint` subfolder.

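To get a feel for how `save_steps` and `save_total_limit` interact, the arithmetic can be sketched as below. This is a simplified model, not the Trainer's actual bookkeeping: it assumes checkpoints are written at every multiple of `save_steps` and that the oldest ones are deleted first. The total of 2500 steps is an arbitrary illustrative number, and both function names are hypothetical.

```python
def checkpoint_steps(total_steps: int, save_steps: int) -> list[int]:
    """Steps at which a step-based save strategy would write a checkpoint."""
    return list(range(save_steps, total_steps + 1, save_steps))


def retained_checkpoints(total_steps: int, save_steps: int, save_total_limit: int) -> list[int]:
    """Checkpoints left at the end, assuming the oldest are removed first
    and save_total_limit is at least 1."""
    return checkpoint_steps(total_steps, save_steps)[-save_total_limit:]


print(checkpoint_steps(2500, 500))         # [500, 1000, 1500, 2000, 2500]
print(retained_checkpoints(2500, 500, 3))  # [1500, 2000, 2500]
```

Options such as loading the best model at the end of training can change which checkpoints are kept, so treat this only as a rough planning aid.
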
## Model Card Configuration

Add metadata for better discoverability:

```python
# At the end of training script
trainer.push_to_hub(
    commit_message="Upload trained object detection model",
    # Extra keyword arguments are forwarded to the auto-generated model card
    license="apache-2.0",
    tags=["object-detection", "vision", "cppe-5"],
    dataset="cppe-5",
    tasks="object-detection",
)
# Metrics (mAP, recall, precision) are not passed here; the generated card
# reports what the Trainer logged during evaluation.
```

## Saving Label Mappings

**Critical for object detection:** Save class labels with the model:

```python
# Define your label mappings
id2label = {0: "Coverall", 1: "Face_Shield", 2: "Gloves", 3: "Goggles", 4: "Mask"}
label2id = {v: k for k, v in id2label.items()}

# Update model config before training
model.config.id2label = id2label
model.config.label2id = label2id

# Now train and push
trainer.train()
trainer.push_to_hub()
```

**Without label mappings:**
- Model outputs will be numeric IDs only
- No human-readable class names
- Difficult to interpret results

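One detail worth knowing when inspecting a saved `config.json` by hand: JSON object keys are always strings, so the integer IDs in `id2label` appear as `"0"`, `"1"`, and so on. The sketch below builds both mappings from a single list of class names and shows the round trip with the standard `json` module. It illustrates JSON behavior only; Transformers is expected to convert the keys back when it loads a config, so the manual conversion matters mainly when reading the file directly.

```python
import json

class_names = ["Coverall", "Face_Shield", "Gloves", "Goggles", "Mask"]

# Build both directions from one ordered list so they cannot drift apart.
id2label = dict(enumerate(class_names))
label2id = {name: idx for idx, name in id2label.items()}

# JSON object keys are always strings, so integer IDs come back as "0", "1", ...
restored = json.loads(json.dumps({"id2label": id2label}))["id2label"]
print(list(restored)[:2])  # ['0', '1']

# Convert the keys back to int before indexing with a predicted class ID.
id2label_restored = {int(k): v for k, v in restored.items()}
assert id2label_restored == id2label
print(id2label_restored[4])  # Mask
```
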
## Authentication Methods

For a complete guide on token types, `$HF_TOKEN` automatic replacement, `secrets` vs `env` differences, and security best practices, see the `hugging-face-jobs` skill → *Token Usage Guide*.

**Recommended:** Always pass tokens via `secrets` (encrypted server-side):

```python
"secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Automatic replacement with your logged-in token
```

## Verification Checklist

Before submitting any training job, verify:

- [ ] `push_to_hub=True` in TrainingArguments
- [ ] `hub_model_id` is specified (format: `username/model-name`)
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` is in the job config
- [ ] Script calls `login(token=hf_token)` and sets `training_args.hub_token` before creating the `Trainer`
- [ ] Image processor will be saved separately
- [ ] Label mappings (id2label, label2id) are configured
- [ ] Repository name doesn't conflict with existing repos
- [ ] You have write access to the target namespace

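Some of these checks can be automated before a job is submitted. The sketch below is one possible approach, not an established tool: `preflight_issues` is a hypothetical helper that inspects plain dictionaries (the keyword arguments intended for `TrainingArguments` and the job config) and reports the items from the checklist that can be verified without network access.

```python
def preflight_issues(training_kwargs: dict, job_config: dict) -> list[str]:
    """Return human-readable problems; an empty list means these checks passed.

    Hypothetical helper: covers only the checklist items that can be
    verified from the configuration alone.
    """
    issues = []
    if training_kwargs.get("push_to_hub") is not True:
        issues.append("push_to_hub is not True")
    hub_model_id = training_kwargs.get("hub_model_id") or ""
    if "/" not in hub_model_id:
        issues.append("hub_model_id should look like 'username/model-name'")
    if "HF_TOKEN" not in job_config.get("secrets", {}):
        issues.append("job config has no HF_TOKEN secret")
    return issues


ok = preflight_issues(
    {"push_to_hub": True, "hub_model_id": "username/model-name"},
    {"secrets": {"HF_TOKEN": "$HF_TOKEN"}},
)
print(ok)  # []
```

Write access, name conflicts, and the image processor push still need to be checked by hand.
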
## Repository Setup

### Automatic Creation

If the repository doesn't exist, it is created automatically on the first push.

### Manual Creation

Create the repository before training:

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/detector-name",
    repo_type="model",
    private=False,  # or True for private repo
    exist_ok=True,  # Don't fail if the repository already exists
)
```

### Repository Naming

**Valid names:**
- `username/detr-cppe5`
- `username/yolos-object-detector`
- `organization/custom-detector`

**Invalid or discouraged names:**
- `detector-name` (missing username)
- `username/detector name` (spaces not allowed)
- `username/DETECTOR` (uppercase discouraged)

**Recommended naming:**
- Include the model architecture: `detr-`, `yolos-`, `deta-`
- Include the dataset: `-cppe5`, `-coco`, `-voc`
- Be descriptive: prefer `detr-resnet50-cppe5` over `model1`

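The naming rules above can be approximated with a small check. This is a simplified sketch, not the Hub's actual validation logic: `check_repo_id` and its regular expression are hypothetical and only encode the three cases listed in this section (namespace present, no spaces or unusual characters, lowercase preferred). The `huggingface_hub` library has its own validation, which remains authoritative.

```python
import re

# Simplified pattern: "namespace/name" built from letters, digits, '-', '_' and '.'.
_REPO_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*$")


def check_repo_id(repo_id: str) -> list[str]:
    """Return a list of warnings for a repository ID; empty means it looks fine."""
    warnings = []
    if "/" not in repo_id:
        warnings.append("missing namespace")
    elif not _REPO_ID.match(repo_id):
        warnings.append("contains spaces or unsupported characters")
    if repo_id != repo_id.lower():
        warnings.append("uppercase is discouraged")
    return warnings


for candidate in [
    "username/detr-cppe5",
    "detector-name",
    "username/detector name",
    "username/DETECTOR",
]:
    print(candidate, "->", check_repo_id(candidate) or "ok")
```
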
## Troubleshooting

### Error: 401 Unauthorized

**Cause:** `HF_TOKEN` was not provided, is invalid, or was not used to authenticate before the `Trainer` was initialized

**Solutions:**
1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
2. Verify script calls `login(token=hf_token)` AND sets `training_args.hub_token = hf_token` BEFORE creating the `Trainer`
3. Check you're logged in locally: `hf auth whoami`
4. Re-login: `hf auth login`

**Root cause:** The `Trainer` calls `create_repo(token=self.args.hub_token)` during `__init__()` when `push_to_hub=True`. Relying on implicit env-var token resolution is unreliable in Jobs. Calling `login()` saves the token globally, and setting `training_args.hub_token` ensures the Trainer passes it explicitly to all Hub API calls.

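The token lookup used in the training scripts in this guide can be factored into a small function that fails early with a clear message. This is a minimal sketch: `resolve_hub_token` is a hypothetical helper, and the `hfjob` fallback variable is taken from the scripts in this guide. The optional `env` argument exists only so the function can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional


def resolve_hub_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Hub token from HF_TOKEN (or the "hfjob" fallback), or raise early."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN") or env.get("hfjob")
    if not token:
        raise ValueError(
            "HF_TOKEN not found! Add secrets={'HF_TOKEN': '$HF_TOKEN'} to job config."
        )
    return token
```

In a training script, the result would be passed to `login(token=...)` and assigned to `training_args.hub_token` before the `Trainer` is created.
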
### Error: 403 Forbidden

**Cause:** No write access to repository

**Solutions:**
1. Check that the repository namespace matches your username
2. Verify you're a member of the organization (if using an org namespace)
3. For a private organization repository, confirm your role in the organization allows writing to it
4. Confirm the token itself has write permission (a read-only token cannot push)

### Error: Repository not found

**Cause:** Repository doesn't exist and auto-creation failed

**Solutions:**
1. Manually create the repository first
2. Check the repository name format
3. Verify the namespace exists

### Error: Push failed during training

**Cause:** Network issues or Hub unavailable

**Solutions:**
1. Expect training to continue even though the final push fails
2. Check whether earlier checkpoints reached the Hub
3. Re-run the push manually while the job's files still exist (see Manual Push After Training)

### Issue: Model loads but inference fails

**Possible causes:**
1. Image processor not saved: verify it was pushed separately
2. Label mappings missing: check that `config.json` has `id2label`
3. Wrong image size: verify the image processor matches the training config

### Issue: Model saved but not visible

**Possible causes:**
1. Repository is private: check https://huggingface.co/username while logged in
2. Wrong namespace: verify `hub_model_id` matches the account you logged in with
3. Push still in progress: wait a few minutes

## Manual Push After Training

If training completes but push fails, push manually:

```python
from transformers import AutoModelForObjectDetection, AutoImageProcessor

# Load from local checkpoint
model = AutoModelForObjectDetection.from_pretrained("./output_dir")
# Requires the image processor config to be present in the same directory
image_processor = AutoImageProcessor.from_pretrained("./output_dir")

# Push to Hub
model.push_to_hub("username/model-name", token="hf_abc123...")
image_processor.push_to_hub("username/model-name", token="hf_abc123...")
```

**Note:** This only works while the job is still running and its files still exist.

## Best Practices

1. **Always enable `push_to_hub=True`**
2. **Save image processor separately** - critical for inference
3. **Configure label mappings** before training
4. **Use checkpoint saving** for long training runs
5. **Verify Hub push** in logs before job completes
6. **Set appropriate `save_total_limit`** to avoid excessive checkpoints
7. **Use descriptive repo names** (e.g., `detr-cppe5` not `detector1`)
8. **Add model card** with:
   - Training dataset
   - Evaluation metrics (mAP, IoU)
   - Example usage code
   - Limitations
9. **Tag models appropriately**:
   - `object-detection`
   - Architecture: `detr`, `yolos`, `deta`
   - Dataset: `coco`, `voc`, `cppe-5`

## Monitoring Push Progress

Check logs for push progress:

```python
hf_jobs("logs", {"job_id": "your-job-id"})
```

**Look for lines like these (exact wording depends on your script's print statements and library versions):**
```
Pushing model to username/detector-name...
Upload file pytorch_model.bin: 100%
✅ Model pushed successfully
Pushing image processor...
✅ Image processor pushed successfully
```

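Scanning the log text for known markers can be automated. The sketch below assumes the logs are available as a single string (how `hf_jobs("logs", ...)` returns them is not specified here) and uses the messages printed by the production script in this guide as markers. Both `EXPECTED_MARKERS` and `missing_markers` are hypothetical names for illustration.

```python
# Markers taken from the print statements in this guide's production script.
EXPECTED_MARKERS = [
    "Pushing final model to Hub...",
    "Pushing image processor to Hub...",
    "Training complete!",
]


def missing_markers(log_text: str, markers=EXPECTED_MARKERS) -> list[str]:
    """Return the markers that never appear in the log text, in their original order."""
    return [marker for marker in markers if marker not in log_text]


sample_log = "🚀 Starting training...\n💾 Pushing final model to Hub...\n"
print(missing_markers(sample_log))
```

An empty result suggests the script reached its final message; it does not by itself prove that the upload succeeded, so checking the repository page is still worthwhile.
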
## Example: Full Production Setup

```python
# production_detector.py
# /// script
# dependencies = [
#     "transformers>=4.41.0",
#     "torch>=2.0.0",
#     "torchvision>=0.15.0",
#     "datasets>=2.12.0",
#     "evaluate>=0.4.0",
#     "timm",
#     "scipy"
# ]
# ///
# transformers>=4.41.0 is needed for the eval_strategy argument; timm and scipy
# are listed because DETR's backbone and matching loss rely on them.

from transformers import (
    AutoImageProcessor,
    AutoModelForObjectDetection,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import os
import torch

# Configuration
MODEL_NAME = "facebook/detr-resnet-50"
DATASET_NAME = "cppe-5"
HUB_MODEL_ID = "myusername/detr-cppe5-detector"
NUM_CLASSES = 5

# Class labels
id2label = {0: "Coverall", 1: "Face_Shield", 2: "Gloves", 3: "Goggles", 4: "Mask"}
label2id = {v: k for k, v in id2label.items()}

print(f"🔧 Loading dataset: {DATASET_NAME}")
# NOTE: preprocessing is omitted here. Each example must be run through
# image_processor so that it contains "pixel_values" and "labels".
dataset = load_dataset(DATASET_NAME, split="train")
# Hold out 10% for the step-based evaluation configured below
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = split["train"], split["test"]
print(f"✅ Dataset loaded: {len(train_dataset)} train / {len(eval_dataset)} eval examples")

print(f"🔧 Loading model: {MODEL_NAME}")
image_processor = AutoImageProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForObjectDetection.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_CLASSES,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)
print("✅ Model loaded")

# Configure with comprehensive Hub settings
training_args = TrainingArguments(
    output_dir="detr-cppe5",

    # Hub configuration
    push_to_hub=True,
    hub_model_id=HUB_MODEL_ID,
    hub_strategy="checkpoint",  # Push checkpoints

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,

    # Training settings
    num_train_epochs=10,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    warmup_steps=500,

    # Evaluation
    eval_strategy="steps",
    eval_steps=500,

    # Logging
    logging_steps=50,
    logging_first_step=True,

    # Performance
    fp16=True,  # Mixed precision training
    dataloader_num_workers=4,
)

# ✅ CRITICAL: Authenticate with Hub BEFORE creating Trainer
# login() saves the token globally so ALL hub operations can find it.
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN") or os.environ.get("hfjob")
if hf_token:
    login(token=hf_token)
    training_args.hub_token = hf_token
elif training_args.push_to_hub:
    raise ValueError("HF_TOKEN not found! Add secrets={'HF_TOKEN': '$HF_TOKEN'} to job config.")

# Data collator
def collate_fn(batch):
    pixel_values = [item["pixel_values"] for item in batch]
    labels = [item["labels"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    return {
        "pixel_values": encoding["pixel_values"],
        "labels": labels
    }

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # Required because eval_strategy="steps"
    data_collator=collate_fn,
)

print("🚀 Starting training...")
trainer.train()

print("💾 Pushing final model to Hub...")
trainer.push_to_hub(
    commit_message="Upload trained DETR model on CPPE-5",
    tags=["object-detection", "detr", "cppe-5", "vision"],
)

print("💾 Pushing image processor to Hub...")
image_processor.push_to_hub(
    repo_id=HUB_MODEL_ID,
    commit_message="Upload image processor"
)

print("✅ Training complete!")
print(f"Model available at: https://huggingface.co/{HUB_MODEL_ID}")
print("\nTo use your model:")
print("```python")
print("from transformers import AutoImageProcessor, AutoModelForObjectDetection")
print("")
print(f"processor = AutoImageProcessor.from_pretrained('{HUB_MODEL_ID}')")
print(f"model = AutoModelForObjectDetection.from_pretrained('{HUB_MODEL_ID}')")
print("```")
```

**Submit:**

```python
hf_jobs("uv", {
    "script": training_script_content,  # Pass script content as a string, NOT a filename
    "flavor": "a10g-large",
    "timeout": "8h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

## Inference Example

After training, use your model:

```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection
from PIL import Image
import torch

# Load model from Hub
processor = AutoImageProcessor.from_pretrained("username/detr-cppe5-detector")
model = AutoModelForObjectDetection.from_pretrained("username/detr-cppe5-detector")

# Load and process image (convert to RGB so grayscale or RGBA files also work)
image = Image.open("test_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Post-process results; target_sizes expects (height, width)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs,
    threshold=0.5,
    target_sizes=target_sizes
)[0]

# Print detections
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```

## Key Takeaway

**Without `push_to_hub=True` and `secrets={"HF_TOKEN": "$HF_TOKEN"}`, all training results are permanently lost.**

**For object detection, also remember to:**
1. Save the image processor separately
2. Configure label mappings (id2label, label2id)
3. Include appropriate model card metadata

Always verify all three are configured before submitting any training job.