# Benchmarking Methodology

Rigorous performance measurement techniques for reliable optimization decisions.

## Core Principles

**Statistical rigor** — account for variance, run multiple iterations, report confidence intervals.

**Environmental isolation** — eliminate noise from other processes, network, disk I/O.

**Realistic workload** — use production-representative data, not toy examples.

**Consistent conditions** — same hardware, OS load, data set across runs.

## Benchmark Design

### 1. Define Success Criteria

**Before benchmarking, specify:**
- Target metric (latency, throughput, memory)
- Acceptable threshold (e.g., P95 < 100ms)
- Minimum improvement to justify change (e.g., 20% faster)
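
Writing the criteria down as data before measuring makes the accept/reject decision mechanical. The following is a minimal sketch, not a prescribed format: `SuccessCriteria`, `Measurement`, and `meetsCriteria` are hypothetical names, and the thresholds are the example values from the list above.

```typescript
// Hypothetical shape for pre-declared success criteria; names are illustrative.
interface SuccessCriteria {
  metric: 'latency' | 'throughput' | 'memory'
  p95ThresholdMs: number // e.g. P95 must stay below 100 ms
  minImprovementPct: number // e.g. the change must be at least 20% faster
}

interface Measurement {
  baselineP95Ms: number
  candidateP95Ms: number
}

// True only if the candidate meets the absolute threshold AND the
// minimum relative improvement over the baseline.
function meetsCriteria(c: SuccessCriteria, m: Measurement): boolean {
  const improvementPct =
    ((m.baselineP95Ms - m.candidateP95Ms) / m.baselineP95Ms) * 100
  return m.candidateP95Ms < c.p95ThresholdMs && improvementPct >= c.minImprovementPct
}

const criteria: SuccessCriteria = {
  metric: 'latency',
  p95ThresholdMs: 100,
  minImprovementPct: 20,
}

console.log(meetsCriteria(criteria, { baselineP95Ms: 142, candidateP95Ms: 89 }))
console.log(meetsCriteria(criteria, { baselineP95Ms: 100, candidateP95Ms: 95 }))
```

The second call fails on the relative-improvement rule even though it is under the absolute threshold, which is the case the "minimum improvement" criterion exists to catch.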

### 2. Choose Workload

**Representative data:**
- Production dataset sample
- Realistic data distribution
- Edge cases included
- Sufficient size (not trivially small)

**Load patterns:**
- Typical request rate
- Burst scenarios
- Concurrent users/requests
- Data size variations

### 3. Isolate Environment

**Eliminate interference:**
- Close unnecessary applications
- Disable background services
- Stop cron jobs during testing
- Use dedicated hardware if critical

**System configuration:**
- Document CPU, RAM, OS version
- Pin process to specific cores (avoid migration)
- Disable CPU frequency scaling
- Clear filesystem caches between runs when measuring cold-cache behavior
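
The "document CPU, RAM, OS version" step can be automated so every result carries its environment. A minimal sketch, assuming a runtime that provides the `node:os` module (Node or Bun); `EnvironmentInfo` and `captureEnvironment` are hypothetical names. Core pinning and frequency scaling are OS-level settings and are not covered here.

```typescript
import * as os from 'node:os'

// Snapshot of the machine a benchmark ran on, to attach to the report.
interface EnvironmentInfo {
  cpuModel: string
  logicalCores: number
  totalMemoryGiB: number
  platform: string
  osRelease: string
}

function captureEnvironment(): EnvironmentInfo {
  const cpus = os.cpus()
  return {
    // os.cpus() can be empty in some sandboxes, so fall back to 'unknown'
    cpuModel: cpus[0]?.model ?? 'unknown',
    logicalCores: cpus.length,
    // Total RAM in GiB, rounded to one decimal place
    totalMemoryGiB: Math.round((os.totalmem() / 2 ** 30) * 10) / 10,
    platform: os.platform(),
    osRelease: os.release(),
  }
}

const envInfo = captureEnvironment()
console.log(JSON.stringify(envInfo, null, 2))
```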

### 4. Warm Up

**JIT compilation:**
- Run warm-up iterations before measurement
- Allow JIT to optimize hot paths
- Discard initial slow runs

**Caching:**
- Decide: cold cache or warm cache testing
- Document cache state
- Be consistent across runs
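
The warm-up and measurement phases can be kept separate in a hand-rolled harness. This is a sketch under stated assumptions: `measureWithWarmup` is a hypothetical helper, a global `performance.now()` is assumed, and the iteration counts are arbitrary. Dedicated tools such as mitata or Criterion handle warm-up themselves.

```typescript
// Hypothetical helper: runs `fn` unmeasured first, then collects timed samples.
// Assumes a global `performance.now()` (present in Bun, Node, and browsers).
function measureWithWarmup<T>(
  fn: () => T,
  warmupIterations: number,
  measuredIterations: number,
): { samplesMs: number[]; lastResult: T | undefined } {
  let lastResult: T | undefined
  // Warm-up phase: gives the JIT a chance to optimize hot paths; timings are discarded
  for (let i = 0; i < warmupIterations; i++) {
    lastResult = fn()
  }
  // Measurement phase: only these runs are recorded
  const samplesMs: number[] = []
  for (let i = 0; i < measuredIterations; i++) {
    const start = performance.now()
    lastResult = fn()
    samplesMs.push(performance.now() - start)
  }
  // Returning the last result keeps the calls observable to the optimizer
  return { samplesMs, lastResult }
}

const payload = JSON.stringify({ items: Array.from({ length: 1000 }, (_, i) => i) })
const warm = measureWithWarmup(() => JSON.parse(payload), 200, 50)
console.log(`recorded ${warm.samplesMs.length} samples after 200 warm-up runs`)
```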

## Statistical Methodology

### Multiple Runs

**Never trust a single measurement:**
- Run at least 10 iterations, ideally 30 or more
- Use more iterations for high-variance operations
- Discard outliers only with care, and document why
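
One way to make outlier handling explicit and documentable is to flag runs with a fixed rule instead of by eye. The sketch below uses Tukey's 1.5 × IQR fences, which is one common convention chosen here for illustration and not the only valid rule; `splitOutliers` is a hypothetical helper and the sample timings are made up.

```typescript
// Quantile by linear interpolation on an ascending, non-empty array (q in [0, 1]).
function quantileSorted(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q
  const lo = Math.floor(pos)
  const hi = Math.ceil(pos)
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo)
}

// Splits samples into kept runs and flagged outliers using Tukey's fences:
// anything outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is flagged.
function splitOutliers(samples: number[]): { kept: number[]; outliers: number[] } {
  const sorted = [...samples].sort((a, b) => a - b)
  const q1 = quantileSorted(sorted, 0.25)
  const q3 = quantileSorted(sorted, 0.75)
  const iqr = q3 - q1
  const low = q1 - 1.5 * iqr
  const high = q3 + 1.5 * iqr
  return {
    kept: samples.filter((x) => x >= low && x <= high),
    outliers: samples.filter((x) => x < low || x > high),
  }
}

// Illustrative timings in milliseconds (made-up data with one slow run)
const runsMs = [41.2, 42.0, 41.8, 43.1, 42.5, 41.9, 42.2, 97.4, 42.7, 41.5]
const { kept, outliers } = splitOutliers(runsMs)
console.log(`kept ${kept.length} runs, flagged ${outliers.length}: ${outliers.join(', ')}`)
```

Returning the flagged values instead of silently dropping them makes it straightforward to list them in the report along with the rule that excluded them.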

### Measure Variance

**Report distribution, not just mean:**

```text
Operation: parse_json
Runs: 50
Mean: 42.3ms
Median (P50): 41.8ms
P95: 48.2ms
P99: 52.1ms
Std Dev: 3.2ms
Range: 38.1ms - 54.3ms
```
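
A report like the one above can be produced from raw samples with a few lines of arithmetic. This is a minimal sketch: `summarize` and `percentile` are hypothetical helpers, percentiles use linear interpolation between ranks (other definitions give slightly different values), and the sample data is illustrative.

```typescript
interface Summary {
  runs: number
  mean: number
  median: number
  p95: number
  p99: number
  stdDev: number
  min: number
  max: number
}

// Percentile by linear interpolation between closest ranks (p in [0, 100]).
// Assumes `sorted` is ascending and non-empty.
function percentile(sorted: number[], p: number): number {
  const pos = ((sorted.length - 1) * p) / 100
  const lo = Math.floor(pos)
  const hi = Math.ceil(pos)
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo)
}

function summarize(samples: number[]): Summary {
  if (samples.length < 2) throw new Error('need at least two samples')
  const sorted = [...samples].sort((a, b) => a - b)
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length
  // Sample variance (n - 1 in the denominator)
  const variance =
    sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (sorted.length - 1)
  return {
    runs: sorted.length,
    mean,
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    stdDev: Math.sqrt(variance),
    min: sorted[0],
    max: sorted[sorted.length - 1],
  }
}

// Illustrative timings in milliseconds
const summary = summarize([41.8, 38.1, 42.3, 54.3, 40.9, 43.0, 41.2, 39.7])
console.log(summary)
```

With only a handful of runs, P95 and P99 are effectively interpolations near the maximum, so tail percentiles are only meaningful with a larger sample.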

### Statistical Significance

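**Use a t-test or the Mann-Whitney U test:**
- Use a t-test when samples are roughly normal, Mann-Whitney U when timings are skewed or heavy-tailed
- Null hypothesis: no difference between implementations
- Reject if p-value < 0.05 (5% significance level)
- Use a stricter threshold (p < 0.01) for critical changes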
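
As an illustration of the t-test route, the sketch below computes Welch's variant of the t statistic (chosen here because it does not assume equal variances) together with the Welch-Satterthwaite degrees of freedom. It deliberately stops short of a p-value, which requires the t distribution from a statistics library or table; `welchT` and the sample data are hypothetical.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length
}

// Sample variance (n - 1 in the denominator)
function sampleVariance(xs: number[]): number {
  const m = mean(xs)
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1)
}

// Welch's t statistic and Welch-Satterthwaite degrees of freedom.
// Each sample needs at least two values and non-zero variance overall.
function welchT(a: number[], b: number[]): { t: number; df: number } {
  const va = sampleVariance(a) / a.length // squared standard error of mean(a)
  const vb = sampleVariance(b) / b.length // squared standard error of mean(b)
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb)
  const df = (va + vb) ** 2 / (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1))
  return { t, df }
}

// Made-up timings (ms) for an old and a new implementation
const oldRuns = [10, 12, 11, 13, 14]
const newRuns = [8, 9, 7, 10, 6]
const result = welchT(oldRuns, newRuns)
console.log(`t = ${result.t.toFixed(2)}, df = ${result.df.toFixed(1)}`)
```

The null hypothesis is rejected when |t| exceeds the critical value for the computed degrees of freedom; for 8 degrees of freedom at a two-sided 5% level that value is roughly 2.31.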

**Effect size:**
- Report percentage improvement: `(old - new) / old * 100%`
- Cohen's d for standardized effect size
- Confidence interval around improvement estimate
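
The first two effect-size measures are short enough to compute directly. A minimal sketch, assuming a "lower is better" metric and Cohen's d with a pooled standard deviation; `percentImprovement`, `cohensD`, and the sample timings are hypothetical. The confidence interval is left out because it depends on the chosen method (for example, bootstrap or t-based).

```typescript
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length
}

// Sample variance (n - 1 in the denominator)
function sampleVariance(xs: number[]): number {
  const m = mean(xs)
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1)
}

// Percentage improvement for "lower is better" metrics such as latency:
// (old - new) / old * 100
function percentImprovement(oldValue: number, newValue: number): number {
  return ((oldValue - newValue) / oldValue) * 100
}

// Cohen's d: difference in means divided by the pooled standard deviation.
function cohensD(oldSamples: number[], newSamples: number[]): number {
  const nOld = oldSamples.length
  const nNew = newSamples.length
  const pooledVariance =
    ((nOld - 1) * sampleVariance(oldSamples) + (nNew - 1) * sampleVariance(newSamples)) /
    (nOld + nNew - 2)
  return (mean(oldSamples) - mean(newSamples)) / Math.sqrt(pooledVariance)
}

// Made-up timings (ms)
const oldMs = [10, 12, 11, 13, 14]
const newMs = [8, 9, 7, 10, 6]
console.log(`improvement: ${percentImprovement(mean(oldMs), mean(newMs)).toFixed(1)}%`)
console.log(`Cohen's d: ${cohensD(oldMs, newMs).toFixed(2)}`)
```

By Cohen's conventional thresholds, d around 0.2 is a small effect, 0.5 medium, and 0.8 or more large.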

## Tool Selection

### TypeScript/Bun

**mitata (recommended):**

```typescript
import { bench, run } from 'mitata'

bench('fast implementation', () => {
  // code to benchmark
})

bench('slow implementation', () => {
  // code to benchmark
})

await run()
```

**Benchmark.js:**

```typescript
import Benchmark from 'benchmark'

const suite = new Benchmark.Suite()

suite
  .add('implementation A', () => { /* code */ })
  .add('implementation B', () => { /* code */ })
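  // Log each benchmark's summary line as it completes
  .on('cycle', (event: Benchmark.Event) => console.log(String(event.target)))
  // A regular function is required here: `this` is the suite
  .on('complete', function (this: Benchmark.Suite) {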
    console.log('Fastest is ' + this.filter('fastest').map('name'))
  })
  .run({ async: true })
```

### Rust

**criterion (recommended):**

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};

fn benchmark_implementations(c: &mut Criterion) {
    let mut group = c.benchmark_group("comparison");

    for size in [10, 100, 1000].iter() {
        group.bench_with_input(BenchmarkId::new("fast", size), size, |b, &size| {
            b.iter(|| fast_implementation(black_box(size)));
        });

        group.bench_with_input(BenchmarkId::new("slow", size), size, |b, &size| {
            b.iter(|| slow_implementation(black_box(size)));
        });
    }

    group.finish();
}

criterion_group!(benches, benchmark_implementations);
criterion_main!(benches);
```

**Criterion output (via `cargo bench`):**
- Automatic outlier detection
- Statistical analysis included
- Regression detection across runs
- HTML reports with plots

## Comparison Techniques

### Before/After Comparison

**Document baseline:**

```text
Baseline (commit abc123):
  Operation: process_batch
  Mean: 125ms
  P95: 142ms
  Throughput: 8000 ops/sec
```

**Measure improvement:**

```text
Optimized (commit def456):
  Operation: process_batch
  Mean: 78ms (-37.6%)
  P95: 89ms (-37.3%)
  Throughput: 12800 ops/sec (+60%)

Statistical significance: p < 0.001
```
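
The percentage deltas in a before/after report are easy to get wrong by hand, so it can help to compute them from the two snapshots. A minimal sketch using the example figures above; `Snapshot` and `pctChange` are hypothetical names.

```typescript
interface Snapshot {
  commit: string
  meanMs: number
  p95Ms: number
  throughputOpsPerSec: number
}

// Signed percentage change from `before` to `after`, one decimal place.
function pctChange(before: number, after: number): string {
  const pct = ((after - before) / before) * 100
  const sign = pct > 0 ? '+' : ''
  return `${sign}${pct.toFixed(1)}%`
}

const baseline: Snapshot = { commit: 'abc123', meanMs: 125, p95Ms: 142, throughputOpsPerSec: 8000 }
const optimized: Snapshot = { commit: 'def456', meanMs: 78, p95Ms: 89, throughputOpsPerSec: 12800 }

console.log(`Mean: ${optimized.meanMs}ms (${pctChange(baseline.meanMs, optimized.meanMs)})`)
console.log(`P95: ${optimized.p95Ms}ms (${pctChange(baseline.p95Ms, optimized.p95Ms)})`)
console.log(
  `Throughput: ${optimized.throughputOpsPerSec} ops/sec ` +
    `(${pctChange(baseline.throughputOpsPerSec, optimized.throughputOpsPerSec)})`,
)
```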

### A/B Comparison

**Interleaved testing:**
- Run both implementations on the same data
- Randomize run order to avoid ordering bias
- Use the same hardware and environment
- Report relative performance
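
Randomizing the run order can be done with a shuffled schedule, so that drift in machine state (thermal throttling, cache warmth) affects both implementations equally. The sketch below is one possible harness, not a standard API: `buildSchedule` and `runInterleaved` are hypothetical helpers, a small seeded generator (mulberry32) keeps the order reproducible, and a global `performance.now()` is assumed.

```typescript
type Variant = 'A' | 'B'

// Small seeded PRNG (mulberry32) so the order is random but reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Fisher-Yates shuffle of an equal number of A and B slots.
function buildSchedule(runsPerVariant: number, seed: number): Variant[] {
  const slots: Variant[] = []
  for (let i = 0; i < runsPerVariant; i++) slots.push('A', 'B')
  const rand = mulberry32(seed)
  for (let i = slots.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1))
    const tmp = slots[i]
    slots[i] = slots[j]
    slots[j] = tmp
  }
  return slots
}

// Runs both implementations in the shuffled order and collects timings per variant.
function runInterleaved(
  implA: () => unknown,
  implB: () => unknown,
  runsPerVariant: number,
  seed: number,
): Record<Variant, number[]> {
  const samples: Record<Variant, number[]> = { A: [], B: [] }
  for (const variant of buildSchedule(runsPerVariant, seed)) {
    const fn = variant === 'A' ? implA : implB
    const start = performance.now()
    fn()
    samples[variant].push(performance.now() - start)
  }
  return samples
}

const abSamples = runInterleaved(
  () => [3, 1, 2].sort(),
  () => [3, 1, 2].slice().sort(),
  100,
  42,
)
console.log(`A runs: ${abSamples.A.length}, B runs: ${abSamples.B.length}`)
```

Recording the seed in the benchmark report lets the exact run order be reproduced later.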

**Example output:**

```text
Implementation A vs B (1000 runs each):

  A: 42.3ms ± 3.2ms
  B: 38.1ms ± 2.8ms

  Improvement: 9.9% faster (p < 0.01)
  Effect size: Cohen's d = 1.40 (large)
```

### Scaling Analysis

**Test multiple input sizes:**

```text
Input Size | Time (ms) | Ops/sec
-----------|-----------|--------
10         | 1.2       | 833
100        | 11.5      | 87.0
1000       | 118.3     | 8.45
10000      | 1205.7    | 0.83

Complexity: O(n) confirmed
Slope: 0.12ms per item
```
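
One way to check a complexity claim from such a table is to fit a line to log(time) against log(size): the slope approximates the exponent. A minimal sketch, using the timings from the table above; `logLogSlope` is a hypothetical helper, and the fit is only a rough indicator because constant overheads and logarithmic factors both bend the line.

```typescript
// Least-squares slope of log(time) against log(size).
// A slope near 1 suggests O(n) and near 2 suggests O(n^2);
// O(n log n) tends to land slightly above 1.
function logLogSlope(sizes: number[], timesMs: number[]): number {
  if (sizes.length !== timesMs.length || sizes.length < 2) {
    throw new Error('need two or more (size, time) pairs')
  }
  const xs = sizes.map((s) => Math.log(s))
  const ys = timesMs.map((t) => Math.log(t))
  const meanX = xs.reduce((a, b) => a + b, 0) / xs.length
  const meanY = ys.reduce((a, b) => a + b, 0) / ys.length
  let sxy = 0
  let sxx = 0
  for (let i = 0; i < xs.length; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY)
    sxx += (xs[i] - meanX) ** 2
  }
  return sxy / sxx
}

const sizes = [10, 100, 1000, 10000]
const timesMs = [1.2, 11.5, 118.3, 1205.7]
console.log(`log-log slope: ${logLogSlope(sizes, timesMs).toFixed(3)}`)
```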

## Common Pitfalls

### Dead Code Elimination

**Optimizer removes unused results:**

```typescript
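import assert from 'node:assert'
import { bench } from 'mitata'

// Bad: result never used, so the JIT may optimize the call away
bench('compute (result unused)', () => {
  compute_expensive()
})

// Good: consume the result so the computation must happen
bench('compute (result checked)', () => {
  const result = compute_expensive()
  assert(result !== undefined) // forces computation
})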
```

```rust
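// Bad: the trailing semicolon discards the result, so the work may be optimized away
b.iter(|| { expensive_function(); });

// Good: black_box keeps the result observable (Criterion also black-boxes
// whatever the closure returns)
b.iter(|| black_box(expensive_function()));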
```

### Memory Effects

**Cache effects distort results:**
- Small dataset fits entirely in L1 cache (unrealistically fast)
- Repeated access to the same data keeps the cache hot
- Sequential access is cache friendly; random access is not

**Mitigation:**
- Use realistic data sizes
- Randomize access patterns
- Clear caches between runs
- Test with cold cache scenario
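
To see how much access pattern matters on a given machine, the same work can be timed with sequential and shuffled index orders over a buffer larger than the small caches. This is an illustrative sketch only: `buildOrder` and `sumInOrder` are hypothetical helpers, the 8 MiB buffer size is an assumption about typical L1/L2 sizes, and a JIT-compiled runtime adds noise of its own.

```typescript
// Index order over n elements: sequential, or shuffled with a seeded
// Fisher-Yates pass driven by a simple LCG (Numerical Recipes constants).
function buildOrder(n: number, randomize: boolean, seed = 1): Uint32Array {
  const order = new Uint32Array(n)
  for (let i = 0; i < n; i++) order[i] = i
  if (!randomize) return order
  let state = seed >>> 0
  for (let i = n - 1; i > 0; i--) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0
    const j = Math.floor((state / 4294967296) * (i + 1))
    const tmp = order[i]
    order[i] = order[j]
    order[j] = tmp
  }
  return order
}

// Sums `data` by visiting elements in the given order.
function sumInOrder(data: Float64Array, order: Uint32Array): number {
  let total = 0
  for (let i = 0; i < order.length; i++) total += data[order[i]]
  return total
}

const N = 1 << 20 // 1M doubles = 8 MiB, assumed to exceed L1/L2 on most CPUs
const data = new Float64Array(N).fill(1)

for (const [label, randomize] of [['sequential', false], ['random', true]] as const) {
  const order = buildOrder(N, randomize)
  const start = performance.now()
  const total = sumInOrder(data, order)
  console.log(`${label}: ${(performance.now() - start).toFixed(1)}ms (sum=${total})`)
}
```

Both passes do identical arithmetic, so any gap between the two timings comes from memory access behavior and not from the computation itself.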

### Timing Overhead

**Measurement affects result:**
- Timer resolution too coarse (use nanoseconds)
- Timer overhead significant for fast operations
- Loop overhead in benchmark

**Mitigation:**
- Batch operations for fast functions
- Subtract timer overhead from results
- Use high-resolution timers
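
Batching and overhead subtraction can be combined in a small hand-rolled timer. This is a sketch under stated assumptions: `estimateTimerOverheadMs` and `timePerCallMs` are hypothetical helpers, a global `performance.now()` is assumed, and some runtimes deliberately coarsen its resolution, which is another reason to batch.

```typescript
// Estimates the cost of a start/stop timer pair by timing back-to-back calls.
function estimateTimerOverheadMs(samples = 10000): number {
  let total = 0
  for (let i = 0; i < samples; i++) {
    const t0 = performance.now()
    const t1 = performance.now()
    total += t1 - t0
  }
  return total / samples
}

// Times `batchSize` calls as one unit, subtracts the estimated timer
// overhead once, and returns the average time per call in milliseconds.
function timePerCallMs(fn: () => unknown, batchSize: number, overheadMs: number): number {
  let sink: unknown
  const start = performance.now()
  for (let i = 0; i < batchSize; i++) {
    sink = fn()
  }
  const elapsedMs = performance.now() - start
  void sink // read the last result so the loop body stays observable
  return Math.max(0, elapsedMs - overheadMs) / batchSize
}

const overheadMs = estimateTimerOverheadMs()
const perCallMs = timePerCallMs(() => JSON.stringify({ a: 1, b: [1, 2, 3] }), 100000, overheadMs)
console.log(
  `timer overhead ~${(overheadMs * 1e6).toFixed(0)}ns, per call ~${(perCallMs * 1e6).toFixed(0)}ns`,
)
```

The per-call figure still includes loop overhead, so for very fast functions it is an upper bound and not an exact cost.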

### Confirmation Bias

**Expecting an improvement makes it easy to find one:**
- Cherry-picking favorable runs
- Ignoring variance in results
- Stopping when desired result appears

**Mitigation:**
- Pre-register hypothesis and methodology
- Use automated statistical tests
- Report all results, not just favorable ones
- Peer review benchmark design

## Documentation Template

````markdown
## Performance Benchmark: {OPERATION}

### Goal
{PERFORMANCE_GOAL}

### Environment
- Hardware: {CPU, RAM, DISK}
- OS: {VERSION}
- Runtime: {LANGUAGE_VERSION}
- Date: {YYYY-MM-DD}

### Methodology
- Workload: {DESCRIPTION}
- Data size: {SIZE}
- Iterations: {N}
- Warm-up: {N} iterations
- Cache state: {COLD/WARM}

### Baseline (commit {SHA})

```text
Mean:   {X}ms
Median: {X}ms
P95:    {X}ms
P99:    {X}ms
Std:    {X}ms
```

### Optimized (commit {SHA})

```text
Mean:   {X}ms (-{X}%)
Median: {X}ms (-{X}%)
P95:    {X}ms (-{X}%)
P99:    {X}ms (-{X}%)
Std:    {X}ms
```

### Statistical Analysis

- t-test: p < {VALUE}
- Effect size: {COHENS_D}
- Conclusion: {SIGNIFICANT/NOT_SIGNIFICANT}

### Tradeoffs

- {TRADEOFF_1}
- {TRADEOFF_2}

### Recommendation

{ACCEPT/REJECT} optimization based on {CRITERIA}
````

## Resources

**Papers:**
- "Statistically Rigorous Java Performance Evaluation" (Georges et al.)
- "Producing Wrong Data Without Doing Anything Obviously Wrong!" (Mytkowicz et al.)

**Tools:**
- Criterion (Rust) — statistical benchmarking
- mitata (JavaScript) — modern benchmarking
- perf (Linux) — low-level profiling
- Flamegraph — visualization

**Validation:**
- Always review benchmark methodology with team
- Reproduce results on different hardware
- Document assumptions and limitations
- Update benchmarks as codebase evolves