---
title: {{TITLE}}
authors: {{AUTHORS}}
date: {{DATE}}
type: ml-experiment-report
tags: [machine-learning, experiment-report]
---

# {{TITLE}}

**Machine Learning Experiment Report**

**Researchers**: {{AUTHORS}}
**Date**: {{DATE}}
**Status**: Draft / Final / In Review

---

## Executive Summary

{{ABSTRACT}}

### Key Findings
- Finding 1
- Finding 2
- Finding 3

### Recommendations
- Recommendation 1
- Recommendation 2

---

## 1. Objective

### 1.1 Research Question

What specific question are we trying to answer?

### 1.2 Success Criteria

How will we measure success?

- **Metric 1**: Target value
- **Metric 2**: Target value
- **Metric 3**: Target value

### 1.3 Constraints

- Computational budget
- Time constraints
- Data availability

---

## 2. Dataset

### 2.1 Data Description

| Property | Value |
|----------|-------|
| **Name** | Dataset name |
| **Source** | Origin of data |
| **Size** | Number of examples |
| **Features** | Feature count and types |
| **Target** | What we're predicting |
| **License** | Usage rights |

### 2.2 Data Splits

| Split | Size | Percentage |
|-------|------|------------|
| Train | X examples | Y% |
| Validation | X examples | Y% |
| Test | X examples | Y% |
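
The Percentage column follows directly from the split sizes. A minimal sketch, assuming the example counts per split are already known; `split_percentages` and the counts below are hypothetical and only illustrate the arithmetic:

```python
def split_percentages(sizes: dict[str, int]) -> dict[str, float]:
    """Return each split's share of all examples, in percent (one decimal)."""
    total = sum(sizes.values())
    if total <= 0:
        raise ValueError("split sizes must sum to a positive number")
    return {name: round(100 * count / total, 1) for name, count in sizes.items()}


# Hypothetical counts, for illustration only
print(split_percentages({"train": 8000, "validation": 1000, "test": 1000}))
```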

### 2.3 Data Quality

- **Missing Values**: Analysis and handling
- **Outliers**: Detection and treatment
- **Imbalance**: Class distribution
- **Preprocessing**: Transformations applied
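
For the Imbalance entry, one simple way to summarize class distribution is a per-class fraction plus a majority-to-minority ratio. This is a sketch under the assumption that labels fit in memory as a non-empty list; the helper names and sample labels are hypothetical:

```python
from collections import Counter


def class_distribution(labels):
    """Map each class to its fraction of the labels (labels must be non-empty)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical labels, for illustration only
labels = ["neg"] * 3 + ["pos"]
print(class_distribution(labels), imbalance_ratio(labels))
```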

### 2.4 Exploratory Analysis

Key insights from data exploration:

1. Pattern 1
2. Pattern 2
3. Pattern 3

---

## 3. Model

### 3.1 Architecture

Describe the model architecture:

```
Input → Layer 1 → Layer 2 → ... → Output
```

### 3.2 Model Specifications

| Component | Configuration |
|-----------|--------------|
| **Type** | Model family |
| **Parameters** | Total count |
| **Layers** | Number and types |
| **Activation** | Functions used |
| **Dropout** | Regularization rate |

### 3.3 Baseline Models

What are we comparing against?

1. **Baseline 1**: Simple baseline (e.g., majority class)
2. **Baseline 2**: Standard approach (e.g., logistic regression)
3. **Baseline 3**: Previous best method

---

## 4. Training

### 4.1 Hyperparameters

| Hyperparameter | Value | Rationale |
|----------------|-------|-----------|
| Learning Rate | 1e-4 | Tuned via grid search |
| Batch Size | 32 | GPU memory constraint |
| Epochs | 100 | Validation loss plateaus by this point |
| Optimizer | AdamW | Standard for transformers |
| Weight Decay | 0.01 | Regularization |
| LR Schedule | Cosine | Smooth convergence |
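
If a cosine schedule is used, it helps to state its exact form in the report. The sketch below shows one common variant, assuming no warmup and decay from the base learning rate to a minimum over a fixed number of steps; `cosine_lr`, `min_lr`, and `total_steps` are illustrative names, and the schedule actually used in an experiment may differ:

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr (step 0) to min_lr (step total_steps), no warmup."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so steps outside [0, total_steps] stay at the endpoints
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a hypothetical 100-step run
print([cosine_lr(s, 100) for s in (0, 50, 100)])
```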

### 4.2 Training Process

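```python
# Training pseudocode: train_one_epoch, validate, and save_checkpoint are placeholders
best_loss = float("inf")
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = validate(model, val_loader)
    # Checkpoint only when validation loss improves on the best seen so far
    if val_loss < best_loss:
        best_loss = val_loss
        save_checkpoint(model)
```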

### 4.3 Computational Resources

| Resource | Specification |
|----------|--------------|
| **Hardware** | GPU model and count |
| **Memory** | RAM and VRAM |
| **Training Time** | Hours/days |
| **Cost** | Estimated compute cost |

### 4.4 Training Curves

Include plots of:
- Training loss over time
- Validation loss over time
- Learning rate schedule
- Other relevant metrics

---

## 5. Results

### 5.1 Quantitative Results

| Model | Accuracy | Precision | Recall | F1 | AUC |
|-------|----------|-----------|--------|-------|-----|
| Baseline 1 | 0.65 | 0.64 | 0.66 | 0.65 | 0.70 |
| Baseline 2 | 0.78 | 0.77 | 0.79 | 0.78 | 0.82 |
| **Ours** | **0.89** | **0.88** | **0.90** | **0.89** | **0.93** |
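
When filling in the table, the Precision, Recall, and F1 columns should agree with each other. A minimal sketch of the standard binary definitions, assuming true-positive, false-positive, and false-negative counts are available; the function name and the counts are hypothetical:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Binary precision, recall, and F1 from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only
print(precision_recall_f1(tp=90, fp=10, fn=30))
```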

### 5.2 Statistical Significance

- **P-value**: Test used (e.g., paired t-test) and the resulting p-value
- **Confidence Intervals**: 95% CI for key metrics
- **Multiple Runs**: Mean ± std over N runs
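
One way to produce the mean ± std and a 95% interval from repeated runs is shown below. It is a sketch, assuming one scalar score per run and a percentile bootstrap over runs (a swapped-in technique; a t-based interval is an equally valid choice). `summarize_runs` and the scores are hypothetical:

```python
import random
import statistics


def summarize_runs(scores, n_boot=2000, seed=0):
    """Return (mean, sample std, (ci_low, ci_high)) for a list of per-run scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    # Percentile bootstrap: resample runs with replacement, collect the means
    boot_means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    )
    ci_low = boot_means[int(0.025 * n_boot)]
    ci_high = boot_means[int(0.975 * n_boot) - 1]
    return mean, std, (ci_low, ci_high)


# Hypothetical accuracies from 5 seeds, for illustration only
mean, std, ci = summarize_runs([0.88, 0.89, 0.90, 0.89, 0.91])
print(f"{mean:.3f} ± {std:.3f}, 95% CI [{ci[0]:.3f}, {ci[1]:.3f}]")
```

With only a handful of runs the bootstrap interval is coarse, so report N alongside it.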

### 5.3 Per-Class Performance

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|-----|---------|
| Class 1 | 0.90 | 0.88 | 0.89 | 500 |
| Class 2 | 0.87 | 0.91 | 0.89 | 450 |
| Class 3 | 0.88 | 0.89 | 0.88 | 550 |
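
Per-class rows are often summarized as macro and support-weighted averages. The sketch below assumes each class contributes an `(f1, support)` pair and reuses the placeholder values from the table above; `macro_and_weighted_f1` is a hypothetical helper:

```python
def macro_and_weighted_f1(rows):
    """rows: list of (f1, support) pairs, one per class."""
    total_support = sum(support for _, support in rows)
    macro = sum(f1 for f1, _ in rows) / len(rows)  # every class counts equally
    weighted = sum(f1 * support for f1, support in rows) / total_support
    return macro, weighted


# Placeholder values copied from the example table
rows = [(0.89, 500), (0.89, 450), (0.88, 550)]
macro, weighted = macro_and_weighted_f1(rows)
print(f"macro F1 = {macro:.4f}, weighted F1 = {weighted:.4f}")
```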

### 5.4 Qualitative Results

#### Success Cases

Examples where the model performs well.

#### Failure Cases

Examples where the model fails and why.

---

## 6. Analysis

### 6.1 Ablation Study

| Configuration | Score | Change |
|---------------|-------|--------|
| Full Model | 0.89 | - |
| Without Feature Set A | 0.85 | -0.04 |
| Without Feature Set B | 0.87 | -0.02 |
| Without Augmentation | 0.86 | -0.03 |
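
The Change column is each ablated score minus the full-model score. A small sketch, assuming a single scalar score per configuration; `ablation_deltas` is a hypothetical helper and the numbers are the placeholders from the table above:

```python
def ablation_deltas(full_score: float, ablated: dict[str, float]) -> dict[str, float]:
    """Score change of each ablated configuration relative to the full model."""
    return {name: round(score - full_score, 2) for name, score in ablated.items()}


# Placeholder values copied from the example table
print(ablation_deltas(0.89, {"Feature Set A": 0.85, "Feature Set B": 0.87, "Augmentation": 0.86}))
```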

### 6.2 Error Analysis

What types of errors is the model making?

1. **Error Type 1**: Frequency and cause
2. **Error Type 2**: Frequency and cause
3. **Error Type 3**: Frequency and cause

### 6.3 Feature Importance

Which features matter most?

| Feature | Importance | Notes |
|---------|------------|-------|
| Feature 1 | 0.35 | Most predictive |
| Feature 2 | 0.28 | Secondary signal |
| Feature 3 | 0.15 | Marginal impact |

---

## 7. Robustness

### 7.1 Cross-Dataset Evaluation

How does the model generalize to other datasets?

| Dataset | Score | Notes |
|---------|-------|-------|
| Original | 0.89 | Training distribution |
| Dataset A | 0.82 | Similar domain |
| Dataset B | 0.71 | Different domain |

### 7.2 Adversarial Robustness

Performance under adversarial conditions.

### 7.3 Fairness Analysis

Performance across demographic groups or sensitive attributes.

---

## 8. Deployment Considerations

### 8.1 Model Size

- **Parameters**: Total count
- **Disk Size**: MB/GB on disk
- **Memory**: Runtime memory usage
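
A rough disk-size figure can be estimated from the parameter count and numeric precision. This sketch assumes weights dominate the checkpoint and ignores metadata, optimizer state, and compression; `estimated_size_mb` and the 110M-parameter example are hypothetical:

```python
def estimated_size_mb(num_parameters: int, bytes_per_param: int = 4) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_parameters * bytes_per_param / 1e6


# Hypothetical 110M-parameter model in fp32 (4 bytes) and fp16 (2 bytes)
print(estimated_size_mb(110_000_000), estimated_size_mb(110_000_000, bytes_per_param=2))
```

Measured values from the actual checkpoint file should replace the estimate once available.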

### 8.2 Inference Speed

| Batch Size | Latency (ms) | Throughput (QPS) |
|------------|--------------|------------------|
| 1 | 10 | 100 |
| 8 | 45 | 178 |
| 32 | 150 | 213 |
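
Latency and throughput in the table should be mutually consistent. A minimal sketch, assuming batches are processed one at a time on a single stream so that throughput is batch size divided by per-batch latency; `throughput_qps` is a hypothetical helper:

```python
def throughput_qps(batch_size: int, latency_ms: float) -> float:
    """Queries per second when one batch of `batch_size` completes every `latency_ms`."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return batch_size * 1000 / latency_ms


# Placeholder values copied from the example table
for batch_size, latency_ms in [(1, 10), (8, 45), (32, 150)]:
    print(batch_size, round(throughput_qps(batch_size, latency_ms)))
```

Concurrent requests or pipelined execution would break this relationship, in which case throughput should be measured directly.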

### 8.3 Production Requirements

- **Dependencies**: Software requirements
- **Infrastructure**: Hardware needs
- **Monitoring**: What to track in production
- **Fallback**: Backup strategy

---

## 9. Conclusions

### 9.1 Summary

Key takeaways from the experiment.

### 9.2 Did We Meet Objectives?

| Objective | Status | Notes |
|-----------|--------|-------|
| Objective 1 | ✅ Met | Achieved target |
| Objective 2 | ⚠️ Partial | Close to target |
| Objective 3 | ❌ Not Met | Needs more work |

### 9.3 Lessons Learned

What did we learn from this experiment?

1. Lesson 1
2. Lesson 2
3. Lesson 3

---

## 10. Next Steps

### 10.1 Short-term (1-2 weeks)

- [ ] Task 1
- [ ] Task 2
- [ ] Task 3

### 10.2 Medium-term (1-2 months)

- [ ] Task 1
- [ ] Task 2
- [ ] Task 3

### 10.3 Long-term (3+ months)

- [ ] Task 1
- [ ] Task 2
- [ ] Task 3

---

## References

1. Reference 1
2. Reference 2
3. Reference 3

---

## Appendix

### A. Hyperparameter Search

Results from hyperparameter tuning.

### B. Additional Experiments

Supplementary experiments not included in main text.

### C. Code

Links to code repositories:
- Training code: [link]
- Evaluation code: [link]
- Model checkpoint: [link]

### D. Data Card

Detailed data documentation: provenance, collection process, licensing, and known limitations (e.g., in dataset card format).

### E. Model Card

Model documentation: intended use, training data, evaluation results, limitations, and ethical considerations (e.g., in model card format).