# Elimination Techniques
Systematic methods for narrowing problem scope.
## Binary Search
Halving the problem space with each test.
### When to Use
- Large problem space
- Changes have clear ordering (time, code versions, config options)
- Tests are quick relative to problem size
### Process
```
1. Identify range: known-good state → known-bad state
2. Test midpoint: does issue exist here?
3. Narrow range: move to half containing issue
4. Repeat: until single change identified
```
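The steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: `find_first_bad`, the `versions` list, and the `is_bad` predicate are hypothetical names, and the sketch assumes the first state is known good, the last is known bad, and the issue never disappears once it has appeared.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def find_first_bad(states: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first bad state in an ordered sequence.

    Assumes states[0] is known good, states[-1] is known bad, and every
    state after the first bad one is also bad.
    """
    if len(states) < 2:
        raise ValueError("need at least one known-good and one known-bad state")
    good, bad = 0, len(states) - 1  # indices of the known-good / known-bad ends
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(states[mid]):
            bad = mid   # issue present at midpoint: culprit is at or before mid
        else:
            good = mid  # issue absent at midpoint: culprit is after mid
    return states[bad]


# Hypothetical history: versions 0..99, with the bug introduced in version 63.
versions = list(range(100))
tests_run: List[int] = []


def is_bad(version: int) -> bool:
    tests_run.append(version)  # record each test so the cost is visible
    return version >= 63


culprit = find_first_bad(versions, is_bad)
print(f"first bad version: {culprit}, tests run: {len(tests_run)}")
```

Only the midpoints are ever tested; the two endpoints are taken on trust, so it is worth confirming them before starting.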
### Example: Git Bisect
```bash
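# Automated binary search through commits
git bisect start
git bisect bad HEAD        # Current commit is bad
git bisect good v1.2.0     # Known good version
# test.sh must exit 0 for good commits, 1-127 (except 125) for bad ones, 125 to skip
git bisect run ./test.sh   # Automatically find breaking commit
git bisect reset           # Return to the original HEAD when done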
```
### Example: Configuration
```
50 config options, one causes issue
Round 1: Test with first 25 options only
→ Issue present → problem in first 25
Round 2: Test with first 12 options only
→ Issue absent → problem in options 13-25
Round 3: Test with options 13-18
→ Issue present → problem in 13-18
...continue until single option found
```
### Efficiency
| Problem Size | Binary Search Steps (worst case) | Linear Search Steps (worst case) |
|--------------|---------------------|---------------------|
| 10 items | ~4 | 10 |
| 100 items | ~7 | 100 |
| 1000 items | ~10 | 1000 |
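
The binary search column is the base-2 logarithm of the problem size, rounded up. A short sketch of that arithmetic, where `worst_case_steps` is a hypothetical helper name:

```python
def worst_case_steps(n: int) -> int:
    """Worst-case number of halving tests needed to isolate one item out of n.

    Equivalent to ceil(log2(n)), computed with integers to avoid float rounding.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


for n in (10, 100, 1000):
    print(n, worst_case_steps(n))
```

Doubling the problem size adds only one extra test, which is why halving pays off most on large search spaces.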
## Variable Isolation
Changing one thing at a time.
### When to Use
- Multiple variables could be the cause
- Interactions between variables are possible
- Need to establish clear causation
### Process
```
1. Baseline: measure with all defaults
2. Change X only: measure impact
3. Revert X, change Y only: measure impact
4. Repeat for each variable
5. If interactions suspected: test combinations
```
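A minimal sketch of steps 1 to 4, assuming the system can be described by a configuration dictionary and measured by a single function. `isolate_variables`, `measure_latency`, and the settings shown are hypothetical and only illustrate the one-change-at-a-time discipline.

```python
from typing import Any, Callable, Dict


def isolate_variables(
    baseline: Dict[str, Any],
    changes: Dict[str, Any],
    measure: Callable[[Dict[str, Any]], float],
) -> Dict[str, float]:
    """Apply each change alone to a fresh copy of the baseline.

    Returns, per variable, the measured difference from the baseline.
    """
    base_value = measure(baseline)
    impact: Dict[str, float] = {}
    for name, new_value in changes.items():
        trial = dict(baseline)   # fresh copy, so the previous change is reverted
        trial[name] = new_value  # change exactly one variable
        impact[name] = measure(trial) - base_value
    return impact


# Hypothetical system: latency (ms) depends on pool size and cache only.
def measure_latency(cfg: Dict[str, Any]) -> float:
    latency = 100.0
    if cfg["pool_size"] < 10:
        latency += 50.0
    if not cfg["cache"]:
        latency += 20.0
    return latency


baseline = {"pool_size": 20, "cache": True, "log_level": "info"}
changes = {"pool_size": 5, "cache": False, "log_level": "debug"}
impact = isolate_variables(baseline, changes, measure_latency)
print(impact)
```

Copying the baseline for every trial is what enforces the "revert X" part of step 3; mutating one shared configuration would silently stack changes.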
### Example: Performance Degradation
```
Suspects: new library version, config change, increased data volume
Test 1: Revert library only → no change → not library
Test 2: Revert config only → partial improvement → config contributes
Test 3: Restore config, reduce data volume only → partial improvement → data also contributes
Test 4: Revert config + reduce data volume → full improvement → both factors
Root cause: Config change + data growth interaction
```
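One way to check for an interaction like this is to compare the joint effect of reverting two suspects with the sum of their individual effects. The sketch below is illustrative only: `interacting_pairs` and the latency model in `measure` are hypothetical, with numbers invented to mirror the example.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple


def interacting_pairs(
    suspects: Iterable[str],
    measure: Callable[[FrozenSet[str]], float],
    tolerance: float = 1e-9,
) -> List[Tuple[str, str]]:
    """Return pairs whose joint improvement differs from the sum of solo improvements.

    `measure` takes the set of suspects that are reverted and returns a metric
    where lower is better.
    """
    names = list(suspects)
    base = measure(frozenset())
    solo: Dict[str, float] = {s: base - measure(frozenset({s})) for s in names}
    pairs = []
    for a, b in combinations(names, 2):
        joint = base - measure(frozenset({a, b}))
        if abs(joint - (solo[a] + solo[b])) > tolerance:
            pairs.append((a, b))
    return pairs


# Hypothetical latency model (ms): the extra 200 ms appears only when the
# config change and the data growth are both present.
def measure(reverted: FrozenSet[str]) -> float:
    config_changed = "config" not in reverted
    data_grown = "data" not in reverted
    latency = 100.0
    if config_changed:
        latency += 40.0
    if data_grown:
        latency += 60.0
    if config_changed and data_grown:
        latency += 200.0
    return latency


result = interacting_pairs(["library", "config", "data"], measure)
print(result)
```

With real measurements the tolerance would need to reflect run-to-run noise rather than the near-zero default used here.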
### Common Mistakes
- Changing multiple variables at once
- Not reverting between tests
- Assuming the first positive result is the complete answer
- Not testing combinations when interactions are possible
## Process of Elimination
Systematically ruling out possibilities.
### When to Use
- Finite set of possible causes
- Can definitively rule things out
- Structured environment
### Process
```
Start with: All possible causes
For each possibility:
- Design test to rule out
- Execute test
- If ruled out: remove from list
- If not ruled out: keep on list
Continue until: single possibility remains
```
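The loop above might look like this in Python. It is a sketch under assumptions: each possible cause comes with a hypothetical rule-out test that returns `True` only when the cause is definitively excluded, and `eliminate` and the stubbed tests are invented for illustration.

```python
from typing import Callable, Dict


def eliminate(rule_out_tests: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run rule-out tests until a single possibility remains.

    Returns a status per cause: "ruled out", "still possible",
    or "not yet tested".
    """
    status = {cause: "not yet tested" for cause in rule_out_tests}
    remaining = set(status)
    for cause, test in rule_out_tests.items():
        if len(remaining) == 1:
            break  # single possibility remains: stop testing
        if test():
            status[cause] = "ruled out"
            remaining.discard(cause)
        else:
            status[cause] = "still possible"
    return status


# Hypothetical tests, stubbed with fixed outcomes for illustration.
tests = {
    "queue": lambda: True,               # reproduced with the queue bypassed
    "worker": lambda: True,              # worker handled a test message
    "database": lambda: True,            # write succeeded in isolation
    "api serialization": lambda: False,  # never reached: last one standing
}
status = eliminate(tests)
for cause, state in status.items():
    print(f"{cause}: {state}")
```

The last cause standing is only a suspect by exclusion, so it still deserves a direct confirming test before it is recorded as the root cause.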
### Documentation Format
```
Possible causes:
✗ Component A — ruled out: reproduced without A present
✗ Component B — ruled out: tested in isolation, worked
✗ External factor — ruled out: reproduced in clean environment
○ Component C — not yet tested
✓ Component D — confirmed: removing D fixes issue
```
### Example: Integration Failure
```
System: API → Queue → Worker → Database
Test 1: Call API directly, bypass queue
→ Issue persists → not queue-related
Test 2: Worker processes test message
→ Success → worker + database OK
Test 3: Inspect the message the API hands to the queue
→ Found: message format incorrect
Root cause: API serialization bug
```
## Divide and Conquer
Breaking a complex system into testable segments.
### When to Use
- Complex multi-component systems
- Don't know which area to focus on
- Want to parallelize investigation
### Process
```
1. Map system components
2. Identify boundaries between components
3. Test at each boundary: is data correct here?
4. Find boundary where data becomes incorrect
5. Focus investigation on that component
```
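A minimal sketch of steps 3 and 4, assuming each stage can be run in isolation and paired with a correctness check on its output. `first_bad_boundary`, the stage functions, and the checks are hypothetical.

```python
from typing import Any, Callable, List, Optional, Tuple

# (stage name, stage function, check applied to that stage's output)
Stage = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


def first_bad_boundary(data: Any, stages: List[Stage]) -> Optional[str]:
    """Return the name of the first stage whose output fails its check."""
    for name, run, looks_correct in stages:
        data = run(data)
        if not looks_correct(data):
            return name  # data was correct going in and incorrect coming out
    return None


def has_four_rows(rows: List[int]) -> bool:
    return len(rows) == 4


# Hypothetical pipeline with a bug in validation: it drops rows equal to zero.
stages: List[Stage] = [
    ("ingestion", lambda rows: [int(r) for r in rows], has_four_rows),
    ("transform", lambda rows: [r * 2 for r in rows], has_four_rows),
    ("validation", lambda rows: [r for r in rows if r], has_four_rows),
    ("storage", lambda rows: rows, has_four_rows),
]

suspect = first_bad_boundary(["0", "1", "2", "3"], stages)
print(suspect)
```

The checks here are deliberately crude (a row count); in practice each boundary check would encode whatever "data correct" means at that point.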
### Example: Data Pipeline
```
Source → Ingestion → Transform → Validation → Storage → API
Check at each stage:
- After Ingestion: data correct ✓
- After Transform: data correct ✓
- After Validation: data INCORRECT ✗
Root cause is in the Validation stage.
```
## Environment Bisection
Isolating environment-specific factors.
### When to Use
- "Works on my machine" situations
- Environment-dependent bugs
- Deployment issues
### Process
```
1. List environment differences (OS, versions, config, resources)
2. Create minimal diff between working and failing environments
3. Align the failing environment with the working one, one difference at a time, retesting after each change
4. Identify the minimum difference that causes failure
```
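The process above can be sketched as a greedy alignment loop, assuming each environment can be captured as a flat dictionary and the failure can be reproduced on demand. `differing_keys`, `minimal_failing_difference`, the environment values, and the `fails` predicate are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Env = Dict[str, Any]
_MISSING = object()  # distinguishes "key absent" from "key set to None"


def differing_keys(working: Env, failing: Env) -> List[str]:
    """List keys whose values differ between the two environments."""
    keys = sorted(set(working) | set(failing))
    return [k for k in keys if working.get(k, _MISSING) != failing.get(k, _MISSING)]


def minimal_failing_difference(
    working: Env, failing: Env, fails: Callable[[Env], bool]
) -> List[str]:
    """Align the failing environment one key at a time.

    A difference is kept only if removing it makes the failure disappear.
    """
    env = dict(failing)
    essential: List[str] = []
    for key in differing_keys(working, failing):
        trial = dict(env)
        if key in working:
            trial[key] = working[key]
        else:
            trial.pop(key, None)
        if fails(trial):
            env = trial            # still fails: this difference is irrelevant
        else:
            essential.append(key)  # failure went away: this difference matters
    return essential


# Hypothetical environments and failure condition.
working = {"os": "ubuntu-22.04", "runtime": "3.11", "TZ": "UTC", "ulimit_nofile": 65536}
failing = {"os": "debian-12", "runtime": "3.11", "TZ": "Europe/Berlin", "ulimit_nofile": 1024}


def fails(env: Env) -> bool:
    return env["ulimit_nofile"] < 4096


culprits = minimal_failing_difference(working, failing, fails)
print(culprits)
```

The greedy pass needs one test per difference; with a long difference list, halving the set of differences first reduces the number of tests.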
### Difference Checklist
| Category | Working | Failing |
|----------|---------|---------|
| OS/Version | | |
| Runtime version | | |
| Dependencies | | |
| Config files | | |
| Environment variables | | |
| Network/ports | | |
| Permissions | | |
| Resource limits | | |
## Technique Selection Guide
| Situation | Recommended Technique |
|-----------|----------------------|
| Many commits to check | Binary search (git bisect) |
| A few suspect variables or config options | Variable isolation |
| Finite component list | Process of elimination |
| Multi-stage pipeline | Divide and conquer |
| "Works elsewhere" | Environment bisection |
| Unknown scope | Start with divide and conquer, then specialize |
## Combining Techniques
Multiple techniques are often used together:
```
1. Divide and conquer: narrow to subsystem
2. Process of elimination: rule out components in subsystem
3. Variable isolation: identify specific configuration
4. Binary search: find when it broke
```
Each technique narrows scope; combine for efficiency.