---
title: "{{TITLE}}"
authors: "{{AUTHORS}}"
date: "{{DATE}}"
type: ml-experiment-report
tags:
  - machine-learning
  - experiment-report
---

{{TITLE}}

Machine Learning Experiment Report

Researchers: {{AUTHORS}}
Date: {{DATE}}
Status: Draft / Final / In Review


Executive Summary

{{ABSTRACT}}

Key Findings

  • Finding 1
  • Finding 2
  • Finding 3

Recommendations

  • Recommendation 1
  • Recommendation 2

1. Objective

1.1 Research Question

What specific question are we trying to answer?

1.2 Success Criteria

How will we measure success?

  • Metric 1: Target value
  • Metric 2: Target value
  • Metric 3: Target value

1.3 Constraints

  • Computational budget
  • Time constraints
  • Data availability

2. Dataset

2.1 Data Description

| Property | Value |
|----------|-------|
| Name | Dataset name |
| Source | Origin of data |
| Size | Number of examples |
| Features | Feature count and types |
| Target | What we're predicting |
| License | Usage rights |

2.2 Data Splits

| Split | Size | Percentage |
|-------|------|------------|
| Train | X examples | Y% |
| Validation | X examples | Y% |
| Test | X examples | Y% |
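
The Size and Percentage columns can be derived from a total example count and a set of split fractions. The sketch below is one way to do that; the function name `split_sizes` and the 80/10/10 fractions are illustrative assumptions, not part of the template.

```python
def split_sizes(n_examples, fractions):
    """Return {split: (size, percent)} for the given split fractions.

    The last split absorbs rounding so that the sizes sum to n_examples.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")
    names = list(fractions)
    sizes = {name: round(n_examples * fractions[name]) for name in names[:-1]}
    sizes[names[-1]] = n_examples - sum(sizes.values())
    return {name: (size, 100.0 * size / n_examples) for name, size in sizes.items()}


# Hypothetical 80/10/10 split of 10,000 examples
table = split_sizes(10_000, {"Train": 0.8, "Validation": 0.1, "Test": 0.1})
for name, (size, percent) in table.items():
    print(f"{name}: {size} examples, {percent:.1f}%")
```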

2.3 Data Quality

  • Missing Values: Analysis and handling
  • Outliers: Detection and treatment
  • Imbalance: Class distribution
  • Preprocessing: Transformations applied
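
Two of the checks above, missing values and class imbalance, can be quantified with a few lines of standard-library Python. This is a minimal sketch that assumes the data is a list of dictionaries; `missing_rate` and `class_distribution` are hypothetical helper names.

```python
from collections import Counter


def missing_rate(rows, field):
    """Fraction of rows where `field` is absent or None."""
    missing = sum(1 for row in rows if row.get(field) is None)
    return missing / len(rows)


def class_distribution(labels):
    """Map each class label to its share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy data for illustration only
rows = [{"age": 31}, {"age": None}, {}, {"age": 47}]
labels = ["spam", "ham", "ham", "ham"]
print(missing_rate(rows, "age"))
print(class_distribution(labels))
```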

2.4 Exploratory Analysis

Key insights from data exploration:

  1. Pattern 1
  2. Pattern 2
  3. Pattern 3

3. Model

3.1 Architecture

Describe the model architecture:

Input → Layer 1 → Layer 2 → ... → Output

3.2 Model Specifications

| Component | Configuration |
|-----------|---------------|
| Type | Model family |
| Parameters | Total count |
| Layers | Number and types |
| Activation | Functions used |
| Dropout | Regularization rate |

3.3 Baseline Models

What are we comparing against?

  1. Baseline 1: Simple baseline (e.g., majority class)
  2. Baseline 2: Standard approach (e.g., logistic regression)
  3. Baseline 3: Previous best method

4. Training

4.1 Hyperparameters

| Hyperparameter | Value | Rationale |
|----------------|-------|-----------|
| Learning Rate | 1e-4 | Tuned via grid search |
| Batch Size | 32 | GPU memory constraint |
| Epochs | 100 | Based on validation |
| Optimizer | AdamW | Standard for transformers |
| Weight Decay | 0.01 | Regularization |
| LR Schedule | Cosine | Smooth convergence |
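
The table names a cosine schedule without fixing a variant. As a reference point, the sketch below implements plain cosine decay from the base learning rate to a floor, with no warmup and no restarts; `cosine_lr` and its `min_lr` parameter are assumptions for illustration.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 100-step run
for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100))
```

If warmup is used, report its length alongside the schedule, since it changes the effective peak learning rate early in training.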

4.2 Training Process

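# Training pseudocode: keep the checkpoint with the lowest validation loss
best_loss = float("inf")
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = validate(model, val_loader)
    if val_loss < best_loss:
        best_loss = val_loss  # remember the best validation loss so far
        save_checkpoint(model)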

4.3 Computational Resources

| Resource | Specification |
|----------|---------------|
| Hardware | GPU model and count |
| Memory | RAM and VRAM |
| Training Time | Hours/days |
| Cost | Estimated compute cost |

4.4 Training Curves

Include plots of:

  • Training loss over time
  • Validation loss over time
  • Learning rate schedule
  • Other relevant metrics

5. Results

5.1 Quantitative Results

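| Model | Accuracy | Precision | Recall | F1 | AUC |
|-------|----------|-----------|--------|----|-----|
| Baseline 1 | 0.65 | 0.64 | 0.66 | 0.65 | 0.70 |
| Baseline 2 | 0.78 | 0.77 | 0.79 | 0.78 | 0.82 |
| Ours | 0.89 | 0.88 | 0.90 | 0.89 | 0.93 |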
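
Precision, recall, and F1 in the table should be mutually consistent, since F1 is the harmonic mean of the other two. The sketch below computes all three from raw counts for a binary task; the helper name and the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```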

5.2 Statistical Significance

  • P-value: Statistical test results
  • Confidence Intervals: 95% CI for key metrics
  • Multiple Runs: Mean ± std over N runs
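
A minimal sketch of the "mean ± std over N runs" and confidence-interval entries, using only the standard library. It uses a normal approximation for the interval, which is an assumption made here for simplicity; with only a handful of runs, a Student's t interval is wider and usually more appropriate. The scores are made up.

```python
import math
import statistics


def summarize_runs(scores, confidence=0.95):
    """Return (mean, sample std, (ci_low, ci_high)) using a normal approximation."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * std / math.sqrt(len(scores))
    return mean, std, (mean - half_width, mean + half_width)


# Hypothetical accuracy from three runs with different seeds
mean, std, (ci_low, ci_high) = summarize_runs([0.88, 0.89, 0.90])
print(f"{mean:.3f} ± {std:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```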

5.3 Per-Class Performance

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| Class 1 | 0.90 | 0.88 | 0.89 | 500 |
| Class 2 | 0.87 | 0.91 | 0.89 | 450 |
| Class 3 | 0.88 | 0.89 | 0.88 | 550 |
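
When summarizing per-class results in a single number, state whether the average is macro (every class counts equally) or weighted by support. The sketch below computes both from the placeholder F1 and Support values in the table above; `macro_and_weighted` is a hypothetical helper.

```python
def macro_and_weighted(rows):
    """rows: list of (class_name, score, support). Returns (macro_avg, weighted_avg)."""
    macro = sum(score for _, score, _ in rows) / len(rows)
    total_support = sum(support for _, _, support in rows)
    weighted = sum(score * support for _, score, support in rows) / total_support
    return macro, weighted


# Placeholder F1 and support values from the per-class table
rows = [("Class 1", 0.89, 500), ("Class 2", 0.89, 450), ("Class 3", 0.88, 550)]
macro_f1, weighted_f1 = macro_and_weighted(rows)
print(f"macro F1={macro_f1:.4f} weighted F1={weighted_f1:.4f}")
```

The two averages diverge most on imbalanced datasets, where a weighted average can hide poor performance on rare classes.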

5.4 Qualitative Results

Success Cases

Examples where the model performs well.

Failure Cases

Examples where the model fails and why.


6. Analysis

6.1 Ablation Study

| Configuration | Score | Change |
|---------------|-------|--------|
| Full Model | 0.89 | - |
| - Feature Set A | 0.85 | -0.04 |
| - Feature Set B | 0.87 | -0.02 |
| - Augmentation | 0.86 | -0.03 |

6.2 Error Analysis

What types of errors is the model making?

  1. Error Type 1: Frequency and cause
  2. Error Type 2: Frequency and cause
  3. Error Type 3: Frequency and cause

6.3 Feature Importance

Which features matter most?

| Feature | Importance | Notes |
|---------|------------|-------|
| Feature 1 | 0.35 | Most predictive |
| Feature 2 | 0.28 | Secondary signal |
| Feature 3 | 0.15 | Marginal impact |
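
The template does not prescribe how importance is measured. Permutation importance is one common, model-agnostic choice: shuffle a single feature column and record how much the score drops. The sketch below is a simplified single-shuffle version; `predict`, `rows`, and the toy data are assumptions, and in practice the drop is usually averaged over several shuffles.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, feature_idx, seed=0):
    """Drop in accuracy after shuffling one feature column across rows."""
    baseline = accuracy(y_true, [predict(row) for row in rows])
    column = [row[feature_idx] for row in rows]
    random.Random(seed).shuffle(column)
    shuffled = [
        row[:feature_idx] + [value] + row[feature_idx + 1:]
        for row, value in zip(rows, column)
    ]
    return baseline - accuracy(y_true, [predict(row) for row in shuffled])


# Toy model that only looks at feature 0, so feature 1 has zero importance
rows = [[0, 5], [1, 5], [0, 7], [1, 7]]
y_true = [0, 1, 0, 1]
predict = lambda row: row[0]
print(permutation_importance(predict, rows, y_true, feature_idx=0))
print(permutation_importance(predict, rows, y_true, feature_idx=1))
```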

7. Robustness

7.1 Cross-Dataset Evaluation

How does the model generalize to other datasets?

| Dataset | Score | Notes |
|---------|-------|-------|
| Original | 0.89 | Training distribution |
| Dataset A | 0.82 | Similar domain |
| Dataset B | 0.71 | Different domain |

7.2 Adversarial Robustness

Performance under adversarial conditions.

7.3 Fairness Analysis

Performance across demographic groups or sensitive attributes.
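
One simple way to report this is accuracy per group together with the largest gap between groups. The sketch below does that with the standard library; the helper names and toy data are illustrative, and an accuracy gap is only one of several possible fairness measures.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for predictions split by a sensitive attribute."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    return max(per_group.values()) - min(per_group.values())


# Toy example with two hypothetical groups
per_group = accuracy_by_group(
    y_true=[1, 0, 1, 1],
    y_pred=[1, 0, 0, 1],
    groups=["a", "a", "b", "b"],
)
print(per_group, max_accuracy_gap(per_group))
```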


8. Deployment Considerations

8.1 Model Size

  • Parameters: Total count
  • Disk Size: MB/GB on disk
  • Memory: Runtime memory usage
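
A rough disk-size estimate follows directly from the parameter count and the storage precision. The sketch below covers weights only and ignores file-format overhead, optimizer state, and activation memory; the 110M parameter count is a hypothetical example.

```python
# Bytes needed to store one parameter at common precisions
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimated_size_mb(num_params, dtype="float32"):
    """Approximate size of the raw weights in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Hypothetical 110M-parameter model
for dtype in ("float32", "float16", "int8"):
    print(dtype, estimated_size_mb(110_000_000, dtype), "MB")
```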

8.2 Inference Speed

| Batch Size | Latency | Throughput |
|------------|---------|------------|
| 1 | 10 ms | 100 QPS |
| 8 | 45 ms | 178 QPS |
| 32 | 150 ms | 213 QPS |
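
The throughput column follows from batch size and latency when one batch is processed at a time: throughput equals batch size divided by latency in seconds. The sketch below reproduces the placeholder rows under that assumption; with concurrent batches or pipelining the relationship no longer holds.

```python
def throughput_qps(batch_size, latency_ms):
    """Queries per second when batches are processed one after another."""
    return batch_size * 1000.0 / latency_ms


# Placeholder (batch size, latency in ms) pairs from the table above
for batch_size, latency_ms in [(1, 10), (8, 45), (32, 150)]:
    print(batch_size, round(throughput_qps(batch_size, latency_ms)), "QPS")
```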

8.3 Production Requirements

  • Dependencies: Software requirements
  • Infrastructure: Hardware needs
  • Monitoring: What to track in production
  • Fallback: Backup strategy

9. Conclusions

9.1 Summary

Key takeaways from the experiment.

9.2 Did We Meet Objectives?

| Objective | Status | Notes |
|-----------|--------|-------|
| Objective 1 | Met | Achieved target |
| Objective 2 | ⚠️ Partial | Close to target |
| Objective 3 | Not Met | Needs more work |

9.3 Lessons Learned

What did we learn from this experiment?

  1. Lesson 1
  2. Lesson 2
  3. Lesson 3

10. Next Steps

10.1 Short-term (1-2 weeks)

  • Task 1
  • Task 2
  • Task 3

10.2 Medium-term (1-2 months)

  • Task 1
  • Task 2
  • Task 3

10.3 Long-term (3+ months)

  • Task 1
  • Task 2
  • Task 3

References

  1. Reference 1
  2. Reference 2
  3. Reference 3

Appendix

A. Hyperparameter Tuning

Results from hyperparameter tuning.

B. Additional Experiments

Supplementary experiments not included in main text.

C. Code

Links to code repositories:

  • Training code: [link]
  • Evaluation code: [link]
  • Model checkpoint: [link]

D. Data Card

Detailed data documentation following standard practices.

E. Model Card

Model documentation following responsible AI practices.