Benchmarking Methodology

Rigorous performance measurement techniques for reliable optimization decisions.

Core Principles

Statistical rigor — account for variance, run multiple iterations, report confidence intervals.

Environmental isolation — eliminate noise from other processes, network, disk I/O.

Realistic workload — use production-representative data, not toy examples.

Consistent conditions — same hardware, OS load, data set across runs.

Benchmark Design

1. Define Success Criteria

Before benchmarking, specify:

  • Target metric (latency, throughput, memory)
  • Acceptable threshold (e.g., P95 < 100ms)
  • Minimum improvement to justify change (e.g., 20% faster)
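Writing the criteria down as data before measuring makes the later accept/reject decision mechanical. A minimal sketch, assuming latency is the target metric; `SuccessCriteria`, `Measurement`, and `meetsCriteria` are hypothetical names, and the thresholds are the example values from the list above:

```typescript
// Hypothetical shape for criteria recorded before any benchmark is run.
interface SuccessCriteria {
  metric: "latency" | "throughput" | "memory";
  p95ThresholdMs: number; // e.g. P95 < 100ms
  minImprovementPct: number; // e.g. at least 20% faster than baseline
}

interface Measurement {
  baselineP95Ms: number;
  candidateP95Ms: number;
}

// True only if the candidate is under the absolute threshold AND
// improves on the baseline by at least the required percentage.
function meetsCriteria(c: SuccessCriteria, m: Measurement): boolean {
  const improvementPct =
    ((m.baselineP95Ms - m.candidateP95Ms) / m.baselineP95Ms) * 100;
  return (
    m.candidateP95Ms < c.p95ThresholdMs &&
    improvementPct >= c.minImprovementPct
  );
}

const criteria: SuccessCriteria = {
  metric: "latency",
  p95ThresholdMs: 100,
  minImprovementPct: 20,
};
```

Checking both conditions separately matters: a change can be much faster than the baseline and still miss the absolute target, or meet the target with an improvement too small to justify the change.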

2. Choose Workload

Representative data:

  • Production dataset sample
  • Realistic data distribution
  • Edge cases included
  • Sufficient size (not trivially small)

Load patterns:

  • Typical request rate
  • Burst scenarios
  • Concurrent users/requests
  • Data size variations

3. Isolate Environment

Eliminate interference:

  • Close unnecessary applications
  • Disable background services
  • Stop cron jobs during testing
  • Use dedicated hardware if critical

System configuration:

  • Document CPU, RAM, OS version
  • Pin process to specific cores (avoid migration)
  • Disable CPU frequency scaling
  • Clear filesystem caches between runs

4. Warm Up

JIT compilation:

  • Run warm-up iterations before measurement
  • Allow JIT to optimize hot paths
  • Discard initial slow runs

Caching:

  • Decide: cold cache or warm cache testing
  • Document cache state
  • Be consistent across runs
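When a benchmarking library is not available, the warm-up step can be sketched by hand. The harness below is an illustration only; `measure` is a hypothetical helper, and it assumes a runtime that provides the global `performance.now()` (Bun, Node, and browsers do):

```typescript
// Run `warmup` untimed iterations, then time `iterations` runs individually.
function measure(
  fn: () => unknown,
  warmup: number,
  iterations: number,
): number[] {
  let sink: unknown; // keeps results referenced so calls are not trivially removable
  for (let i = 0; i < warmup; i++) {
    sink = fn(); // timings discarded: gives the JIT a chance to optimize hot paths
  }
  const samplesMs: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    sink = fn();
    samplesMs.push(performance.now() - start);
  }
  void sink;
  return samplesMs;
}
```

The number of warm-up iterations needed depends on the runtime and the workload, so it is worth recording alongside the results.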

Statistical Methodology

Multiple Runs

Never trust a single measurement:

  • Run at least 10 iterations, preferably 30 or more
  • More iterations for high-variance operations
  • Discard outliers (carefully, document why)
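One common, documentable rule for flagging outliers is Tukey's 1.5 × IQR fence. This is a swapped-in convention chosen for illustration, not something this guide prescribes; `quantile` and `splitOutliers` are hypothetical names:

```typescript
// Linear-interpolated quantile of an ascending-sorted array (q in [0, 1]).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Split samples into kept values and flagged outliers using 1.5 * IQR fences.
function splitOutliers(samples: number[]): {
  kept: number[];
  outliers: number[];
} {
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const low = q1 - 1.5 * iqr;
  const high = q3 + 1.5 * iqr;
  return {
    kept: samples.filter((x) => x >= low && x <= high),
    outliers: samples.filter((x) => x < low || x > high),
  };
}
```

Returning the flagged values instead of silently dropping them keeps the decision auditable: the report can state how many samples were excluded and why.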

Measure Variance

Report distribution, not just mean:

Operation: parse_json
Runs: 50
Mean: 42.3ms
Median (P50): 41.8ms
P95: 48.2ms
P99: 52.1ms
Std Dev: 3.2ms
Range: 38.1ms - 54.3ms
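A report like the one above can be computed from raw samples with a few lines of code. This is a sketch under stated assumptions: `Summary`, `percentile`, and `summarize` are hypothetical names, percentiles use linear interpolation between closest ranks (other conventions exist and give slightly different values), and the standard deviation is the sample form with n - 1 in the denominator:

```typescript
interface Summary {
  runs: number;
  mean: number;
  median: number;
  p95: number;
  p99: number;
  stdDev: number;
  min: number;
  max: number;
}

// Percentile (0-100) of an ascending-sorted array, linearly interpolated.
function percentile(sorted: number[], p: number): number {
  const pos = ((sorted.length - 1) * p) / 100;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function summarize(samplesMs: number[]): Summary {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((s, x) => s + x, 0) / n;
  // Sample variance: divide by n - 1, so at least two samples are required.
  const variance = sorted.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  return {
    runs: n,
    mean,
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    stdDev: Math.sqrt(variance),
    min: sorted[0],
    max: sorted[n - 1],
  };
}
```

With only a few dozen runs, P99 is effectively an interpolation near the maximum, so tail percentiles need far more samples than the mean to be meaningful.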

Statistical Significance

Use a t-test (roughly normal timings) or a Mann-Whitney U test (skewed timings or unknown distribution):

  • Null hypothesis: no difference between implementations
  • Reject if p-value < 0.05 (5% significance level)
  • Use a stricter threshold (p < 0.01) for critical changes

Effect size:

  • Report percentage improvement: (old - new) / old * 100%
  • Cohen's d for standardized effect size
  • Confidence interval around improvement estimate
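The effect-size quantities above are simple to compute from two sets of samples. The sketch below uses hypothetical helper names and covers the percentage improvement, Cohen's d with a pooled standard deviation, and Welch's t statistic. It deliberately stops short of a p-value, which needs a t-distribution CDF from a statistics library:

```typescript
function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

// Unbiased sample variance (n - 1 in the denominator).
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
}

// Percentage improvement: (old - new) / old * 100.
function improvementPct(oldMs: number[], newMs: number[]): number {
  return ((mean(oldMs) - mean(newMs)) / mean(oldMs)) * 100;
}

// Cohen's d: difference in means divided by the pooled standard deviation.
function cohensD(oldMs: number[], newMs: number[]): number {
  const pooledVariance =
    ((oldMs.length - 1) * variance(oldMs) +
      (newMs.length - 1) * variance(newMs)) /
    (oldMs.length + newMs.length - 2);
  return (mean(oldMs) - mean(newMs)) / Math.sqrt(pooledVariance);
}

// Welch's t statistic: does not assume the two groups share a variance.
function welchT(oldMs: number[], newMs: number[]): number {
  const standardError = Math.sqrt(
    variance(oldMs) / oldMs.length + variance(newMs) / newMs.length,
  );
  return (mean(oldMs) - mean(newMs)) / standardError;
}
```

Reporting the effect size next to the p-value matters because with enough runs even a negligible difference becomes statistically significant.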

Tool Selection

TypeScript/Bun

mitata (recommended):

import { bench, run } from 'mitata'

bench('fast implementation', () => {
  // code to benchmark
})

bench('slow implementation', () => {
  // code to benchmark
})

await run()

Benchmark.js:

import Benchmark from 'benchmark'

const suite = new Benchmark.Suite()

suite
  .add('implementation A', () => { /* code */ })
  .add('implementation B', () => { /* code */ })
  .on('cycle', (event) => console.log(String(event.target)))
  .on('complete', function() {
    console.log('Fastest is ' + this.filter('fastest').map('name'))
  })
  .run({ async: true })

Rust

criterion (recommended):

use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};

fn benchmark_implementations(c: &mut Criterion) {
    let mut group = c.benchmark_group("comparison");

    for size in [10, 100, 1000].iter() {
        group.bench_with_input(BenchmarkId::new("fast", size), size, |b, &size| {
            b.iter(|| fast_implementation(black_box(size)));
        });

        group.bench_with_input(BenchmarkId::new("slow", size), size, |b, &size| {
            b.iter(|| slow_implementation(black_box(size)));
        });
    }

    group.finish();
}

criterion_group!(benches, benchmark_implementations);
criterion_main!(benches);

cargo bench output:

  • Automatic outlier detection
  • Statistical analysis included
  • Regression detection across runs
  • HTML reports with plots

Comparison Techniques

Before/After Comparison

Document baseline:

Baseline (commit abc123):
  Operation: process_batch
  Mean: 125ms
  P95: 142ms
  Throughput: 8000 ops/sec

Measure improvement:

Optimized (commit def456):
  Operation: process_batch
  Mean: 78ms (-37.6%)
  P95: 89ms (-37.3%)
  Throughput: 12800 ops/sec (+60%)

Statistical significance: p < 0.001

A/B Comparison

Side-by-side testing:

  • Run both implementations with same data
  • Randomize order to avoid bias
  • Use same hardware/environment
  • Report relative performance

Example output:

Implementation A vs B (1000 runs each):

  A: 42.3ms ± 3.2ms
  B: 38.1ms ± 2.8ms

  Improvement: 9.9% faster (p < 0.01)
  Effect size: Cohen's d = 1.40 (large)
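Randomizing the run order can be as simple as flipping a coin in each round to decide which implementation goes first. The following is an illustrative sketch, not a prescribed harness: `runPaired` is a hypothetical name, `rand` is injectable so a seeded generator can make the order reproducible, and the global `performance.now()` is assumed to be available:

```typescript
type Impl = () => unknown;
type Label = "a" | "b";

// Each round runs both implementations once, in a coin-flipped order,
// so neither one systematically benefits from running first or second.
function runPaired(
  a: Impl,
  b: Impl,
  rounds: number,
  rand: () => number = Math.random,
): { a: number[]; b: number[] } {
  const samples: Record<Label, number[]> = { a: [], b: [] };
  let sink: unknown; // keeps results referenced
  for (let r = 0; r < rounds; r++) {
    const order: Label[] = rand() < 0.5 ? ["a", "b"] : ["b", "a"];
    for (const which of order) {
      const start = performance.now();
      sink = which === "a" ? a() : b();
      samples[which].push(performance.now() - start);
    }
  }
  void sink;
  return samples;
}
```

Pairing the runs round by round also means slow drift in the environment (thermal throttling, background load) affects both implementations roughly equally.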

Scaling Analysis

Test multiple input sizes:

Input Size | Time (ms) | Ops/sec
-----------|-----------|--------
10         | 1.2       | 833
100        | 11.5      | 87.0
1000       | 118.3     | 8.45
10000      | 1205.7    | 0.83

Complexity: O(n) confirmed
Slope: 0.12ms per item
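The complexity and slope figures can be estimated with an ordinary least-squares fit: on log-log axes the slope approximates the exponent k in time ≈ c · n^k, and on linear axes it gives the cost per item. A minimal sketch using the input sizes and times from the table; `slope` and `scalingExponent` are hypothetical helper names:

```typescript
// Ordinary least-squares slope of ys on xs.
function slope(xs: number[], ys: number[]): number {
  const n = xs.length;
  const meanX = xs.reduce((s, x) => s + x, 0) / n;
  const meanY = ys.reduce((s, y) => s + y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) ** 2;
  }
  return sxy / sxx;
}

// Exponent k in time ~ c * n^k, estimated as the slope on log-log axes.
function scalingExponent(sizes: number[], timesMs: number[]): number {
  return slope(
    sizes.map((n) => Math.log(n)),
    timesMs.map((t) => Math.log(t)),
  );
}

const inputSizes = [10, 100, 1000, 10000];
const measuredMs = [1.2, 11.5, 118.3, 1205.7];

const exponent = scalingExponent(inputSizes, measuredMs); // about 1.0
const msPerItem = slope(inputSizes, measuredMs); // about 0.12
```

An exponent near 1 is consistent with linear scaling, while a value near 2 would point to quadratic behavior. Four data points support the claim but do not prove it, so it helps to extend the range when the result drives a design decision.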

Common Pitfalls

Dead Code Elimination

The optimizer may remove work whose result is never used:

// Bad: result never used, might be optimized away
bench('compute', () => {
  compute_expensive()
})

// Good: consume the result so the call cannot be eliminated
bench('compute', () => {
  const result = compute_expensive()
  assert(result !== undefined) // forces computation
})

The same applies in Rust with criterion:

// Bad: optimizer removes unused work
b.iter(|| expensive_function());

// Good: black_box prevents elimination
b.iter(|| black_box(expensive_function()));
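TypeScript has no built-in `black_box`. One library-free option is to fold every result into a module-level variable so that each call has an observable effect. A minimal sketch: `consume`, `computeExpensive`, and `benchBody` are hypothetical names, and how aggressively a given engine optimizes is not guaranteed either way:

```typescript
// Module-level sink: results written here stay observable after the loop.
let sink = 0;

function consume(value: number): void {
  sink ^= value | 0; // cheap mix of every result into the sink
}

// Hypothetical workload standing in for the function under test.
function computeExpensive(n: number): number {
  let total = 0;
  for (let i = 0; i < n; i++) total += i % 7;
  return total;
}

// Benchmark body: the result is consumed instead of being discarded.
function benchBody(): void {
  consume(computeExpensive(1000));
}
```

Reading `sink` once after the benchmark finishes (for example, logging it) completes the chain of use. Some benchmarking libraries ship their own helper for this purpose, which is usually preferable when available.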

Memory Effects

Cache effects distort results:

  • Small dataset fits in L1 cache (unrealistic)
  • Repeated access to same data (cache hot)
  • Sequential access vs random (cache friendly)

Mitigation:

  • Use realistic data sizes
  • Randomize access patterns
  • Clear caches between runs
  • Test with cold cache scenario
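To compare sequential and randomized access on the same data, the index order can be shuffled while the work itself stays identical. This is a sketch with hypothetical helper names; `rand` is injectable so a seeded generator can make runs reproducible:

```typescript
// Fisher-Yates shuffle of the indices 0..n-1.
function shuffledIndices(
  n: number,
  rand: () => number = Math.random,
): number[] {
  const indices: number[] = [];
  for (let i = 0; i < n; i++) indices.push(i);
  for (let i = n - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = indices[i];
    indices[i] = indices[j];
    indices[j] = tmp;
  }
  return indices;
}

// Sum `data` in the order given by `order`: same work, different access pattern.
function sumInOrder(data: number[], order: number[]): number {
  let total = 0;
  for (let i = 0; i < order.length; i++) total += data[order[i]];
  return total;
}
```

Timing `sumInOrder` with sequential indices and again with `shuffledIndices(data.length)` on an array larger than the CPU caches gives a rough picture of how much the access pattern contributes to the result.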

Timing Overhead

Measurement itself affects the result:

  • Timer resolution too coarse (use nanoseconds)
  • Timer overhead significant for fast operations
  • Loop overhead in benchmark

Mitigation:

  • Batch operations for fast functions
  • Subtract timer overhead from results
  • Use high-resolution timers
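Batching means reading the timer once around many calls and dividing, so that the cost of each timer read is spread across the batch. A minimal sketch, assuming the global `performance.now()` is available; `timePerCallMs` and `timerOverheadMs` are hypothetical names:

```typescript
// Time `batchSize` calls with a single pair of timer reads and
// report the mean cost of one call in milliseconds.
function timePerCallMs(fn: () => unknown, batchSize: number): number {
  let sink: unknown; // keeps results referenced
  const start = performance.now();
  for (let i = 0; i < batchSize; i++) sink = fn();
  const elapsed = performance.now() - start;
  void sink;
  return elapsed / batchSize;
}

// Rough estimate of the cost of one timer read, for comparison
// against the per-call time of the operation being measured.
function timerOverheadMs(reads: number): number {
  const start = performance.now();
  for (let i = 0; i < reads; i++) performance.now();
  return (performance.now() - start) / reads;
}
```

If the per-call time is within an order of magnitude of the timer overhead, a larger batch size is the safer fix. Subtracting an estimated overhead works too, but it adds its own measurement error.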

Confirmation Bias

Expecting an improvement makes it easy to find one:

  • Cherry-picking favorable runs
  • Ignoring variance in results
  • Stopping when desired result appears

Mitigation:

  • Pre-register hypothesis and methodology
  • Use automated statistical tests
  • Report all results, not just favorable ones
  • Peer review benchmark design

Documentation Template

## Performance Benchmark: {OPERATION}

### Goal
{PERFORMANCE_GOAL}

### Environment
- Hardware: {CPU, RAM, DISK}
- OS: {VERSION}
- Runtime: {LANGUAGE_VERSION}
- Date: {YYYY-MM-DD}

### Methodology
- Workload: {DESCRIPTION}
- Data size: {SIZE}
- Iterations: {N}
- Warm-up: {N} iterations
- Cache state: {COLD/WARM}

### Baseline (commit {SHA})
```text
Mean:   {X}ms
Median: {X}ms
P95:    {X}ms
P99:    {X}ms
Std:    {X}ms
```

### Optimized (commit {SHA})
```text
Mean:   {X}ms (-{X}%)
Median: {X}ms (-{X}%)
P95:    {X}ms (-{X}%)
P99:    {X}ms (-{X}%)
Std:    {X}ms
```

### Statistical Analysis
- t-test: p < {VALUE}
- Effect size: {COHENS_D}
- Conclusion: {SIGNIFICANT/NOT_SIGNIFICANT}

### Tradeoffs
- {TRADEOFF_1}
- {TRADEOFF_2}

### Recommendation
{ACCEPT/REJECT} optimization based on {CRITERIA}


Resources

Papers:

  • "Statistically Rigorous Java Performance Evaluation" (Georges et al.)
  • "Producing Wrong Data Without Doing Anything Obviously Wrong!" (Mytkowicz et al.)

Tools:

  • Criterion (Rust): statistical benchmarking
  • mitata (JavaScript): modern benchmarking
  • perf (Linux): low-level profiling
  • Flamegraph: visualization

Validation:

  • Always review benchmark methodology with team
  • Reproduce results on different hardware
  • Document assumptions and limitations
  • Update benchmarks as codebase evolves