7.7 KiB

Raw Blame History

Benchmarking Methodology

Rigorous performance measurement techniques for reliable optimization decisions.

Core Principles

Statistical rigor — account for variance, run multiple iterations, report confidence intervals.

Environmental isolation — eliminate noise from other processes, network, disk I/O.

Realistic workload — use production-representative data, not toy examples.

Consistent conditions — same hardware, OS load, data set across runs.

Benchmark Design

1. Define Success Criteria

Before benchmarking, specify:

Target metric (latency, throughput, memory)
Acceptable threshold (e.g., P95 < 100ms)
Minimum improvement to justify change (e.g., 20% faster)

2. Choose Workload

Representative data:

Production dataset sample
Realistic data distribution
Edge cases included
Sufficient size (not trivially small)

Load patterns:

Typical request rate
Burst scenarios
Concurrent users/requests
Data size variations

3. Isolate Environment

Eliminate interference:

Close unnecessary applications
Disable background services
Stop cron jobs during testing
Use dedicated hardware if critical

System configuration:

Document CPU, RAM, OS version
Pin process to specific cores (avoid migration)
Disable CPU frequency scaling
Clear filesystem caches between runs

4. Warm Up

JIT compilation:

Run warm-up iterations before measurement
Allow JIT to optimize hot paths
Discard initial slow runs

Caching:

Decide: cold cache or warm cache testing
Document cache state
Be consistent across runs

Statistical Methodology

Multiple Runs

Never trust single measurement:

Run at least 10-30 iterations
More iterations for high-variance operations
Discard outliers (carefully, document why)

Measure Variance

Report distribution, not just mean:

Operation: parse_json
Runs: 50
Mean: 42.3ms
Median (P50): 41.8ms
P95: 48.2ms
P99: 52.1ms
Std Dev: 3.2ms
Range: 38.1ms - 54.3ms

Statistical Significance

Use t-test or Mann-Whitney U test:

Null hypothesis: no difference between implementations
Reject if p-value < 0.05 (95% confidence)
Higher confidence (p < 0.01) for critical changes

Effect size:

Report percentage improvement: (old - new) / old * 100%
Cohen's d for standardized effect size
Confidence interval around improvement estimate

Tool Selection

TypeScript/Bun

microbench (recommended):

import { bench, run } from 'mitata'

bench('fast implementation', () => {
  // code to benchmark
})

bench('slow implementation', () => {
  // code to benchmark
})

await run()

Benchmark.js:

import Benchmark from 'benchmark'

const suite = new Benchmark.Suite()

suite
  .add('implementation A', () => { /* code */ })
  .add('implementation B', () => { /* code */ })
  .on('cycle', (event) => console.log(String(event.target)))
  .on('complete', function() {
    console.log('Fastest is ' + this.filter('fastest').map('name'))
  })
  .run({ async: true })

Rust

criterion (recommended):

use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};

fn benchmark_implementations(c: &mut Criterion) {
    let mut group = c.benchmark_group("comparison");

    for size in [10, 100, 1000].iter() {
        group.bench_with_input(BenchmarkId::new("fast", size), size, |b, &size| {
            b.iter(|| fast_implementation(black_box(size)));
        });

        group.bench_with_input(BenchmarkId::new("slow", size), size, |b, &size| {
            b.iter(|| slow_implementation(black_box(size)));
        });
    }

    group.finish();
}

criterion_group!(benches, benchmark_implementations);
criterion_main!(benches);

cargo bench output:

Automatic outlier detection
Statistical analysis included
Regression detection across runs
HTML reports with plots

Comparison Techniques

Before/After Comparison

Document baseline:

Baseline (commit abc123):
  Operation: process_batch
  Mean: 125ms
  P95: 142ms
  Throughput: 8000 ops/sec

Measure improvement:

Optimized (commit def456):
  Operation: process_batch
  Mean: 78ms (-37.6%)
  P95: 89ms (-37.3%)
  Throughput: 12800 ops/sec (+60%)

Statistical significance: p < 0.001

A/B Comparison

Concurrent testing:

Run both implementations with same data
Randomize order to avoid bias
Use same hardware/environment
Report relative performance

Example output:

Implementation A vs B (1000 runs each):

  A: 42.3ms ± 3.2ms
  B: 38.1ms ± 2.8ms

  Improvement: 9.9% faster (p < 0.01)
  Effect size: Cohen's d = 1.42 (large)

Scaling Analysis

Test multiple input sizes:

Input Size | Time (ms) | Ops/sec
-----------|-----------|--------
10         | 1.2       | 8333
100        | 11.5      | 869
1000       | 118.3     | 85
10000      | 1205.7    | 8.3

Complexity: O(n) confirmed
Slope: 0.12ms per item

Common Pitfalls

Dead Code Elimination

Optimizer removes unused results:

// Bad: result never used, might be optimized away
bench('compute', () => {
  compute_expensive()
})

// Good: use black_box or assert result
bench('compute', () => {
  const result = compute_expensive()
  assert(result !== undefined) // forces computation
})

// Bad: optimizer removes unused work
b.iter(|| expensive_function());

// Good: black_box prevents elimination
b.iter(|| black_box(expensive_function()));

Memory Effects

Cache effects distort results:

Small dataset fits in L1 cache (unrealistic)
Repeated access to same data (cache hot)
Sequential access vs random (cache friendly)

Mitigation:

Use realistic data sizes
Randomize access patterns
Clear caches between runs
Test with cold cache scenario

Timing Overhead

Measurement affects result:

Timer resolution too coarse (use nanoseconds)
Timer overhead significant for fast operations
Loop overhead in benchmark

Mitigation:

Batch operations for fast functions
Subtract timer overhead from results
Use high-resolution timers

Confirmation Bias

Expecting improvement, find it:

Cherry-picking favorable runs
Ignoring variance in results
Stopping when desired result appears

Mitigation:

Pre-register hypothesis and methodology
Use automated statistical tests
Report all results, not just favorable
Peer review benchmark design

Documentation Template

## Performance Benchmark: {OPERATION}

### Goal
{PERFORMANCE_GOAL}

### Environment
- Hardware: {CPU, RAM, DISK}
- OS: {VERSION}
- Runtime: {LANGUAGE_VERSION}
- Date: {YYYY-MM-DD}

### Methodology
- Workload: {DESCRIPTION}
- Data size: {SIZE}
- Iterations: {N}
- Warm-up: {N} iterations
- Cache state: {COLD/WARM}

### Baseline (commit {SHA})
```text
Mean:   {X}ms
Median: {X}ms
P95:    {X}ms
P99:    {X}ms
Std:    {X}ms

Optimized (commit {SHA})

Mean:   {X}ms (-{X}%)
Median: {X}ms (-{X}%)
P95:    {X}ms (-{X}%)
P99:    {X}ms (-{X}%)
Std:    {X}ms

Statistical Analysis

t-test: p < {VALUE}
Effect size: {COHENS_D}
Conclusion: {SIGNIFICANT/NOT_SIGNIFICANT}

Tradeoffs

{TRADEOFF_1}
{TRADEOFF_2}

Recommendation

{ACCEPT/REJECT} optimization based on {CRITERIA}


## Resources

**Papers:**
- "Statistically Rigorous Java Performance Evaluation" (Georges et al.)
- "Producing Wrong Data Without Doing Anything Obviously Wrong!" (Mytkowicz et al.)

**Tools:**
- Criterion (Rust) — statistical benchmarking
- mitata (JavaScript) — modern benchmarking
- perf (Linux) — low-level profiling
- Flamegraph — visualization

**Validation:**
- Always review benchmark methodology with team
- Reproduce results on different hardware
- Document assumptions and limitations
- Update benchmarks as codebase evolves

7.7 KiB Raw Blame History

Benchmarking Methodology

Core Principles

Benchmark Design

1. Define Success Criteria

2. Choose Workload

3. Isolate Environment

4. Warm Up

Statistical Methodology

Multiple Runs

Measure Variance

Statistical Significance

Tool Selection

TypeScript/Bun

Rust

Comparison Techniques

Before/After Comparison

A/B Comparison

Scaling Analysis

Common Pitfalls

Dead Code Elimination

Memory Effects

Timing Overhead

Confirmation Bias

Documentation Template

Optimized (commit {SHA})

Statistical Analysis

Tradeoffs

Recommendation

7.7 KiB

Raw Blame History