223 lines
4.5 KiB
Markdown
223 lines
4.5 KiB
Markdown
# Documentation Templates
|
|
|
|
Templates for investigation logging and root cause reports.
|
|
|
|
## Investigation Log
|
|
|
|
Real-time documentation during investigation.
|
|
|
|
### Format
|
|
|
|
```
|
|
[TIMESTAMP] STAGE: Action → Result
|
|
|
|
[10:15] DISCOVERY: Gathered error logs → Found NullPointerException in UserService
|
|
[10:22] HYPOTHESIS: Suspect user object not initialized when accessed
|
|
[10:28] TEST: Added null check logging → Confirmed user is null
|
|
[10:35] EVIDENCE: Traced call path → findById returning null for valid ID
|
|
[10:42] HYPOTHESIS: Database connection issue
|
|
[10:48] TEST: Direct DB query → Returns data correctly
|
|
[10:52] HYPOTHESIS: Caching returning stale null
|
|
[10:58] TEST: Disabled cache → Issue resolved
|
|
[11:05] ROOT CAUSE: Cache not invalidated on user update
|
|
```
|
|
|
|
### Entry Types
|
|
|
|
| Prefix | Use For |
|
|
|--------|---------|
|
|
| DISCOVERY | New information gathered |
|
|
| HYPOTHESIS | New theory formed |
|
|
| TEST | Experiment executed |
|
|
| EVIDENCE | Data that supports/refutes hypothesis |
|
|
| ROOT CAUSE | Final determination |
|
|
| BLOCKED | Cannot proceed, need help |
|
|
|
|
### Benefits
|
|
|
|
- Prevents revisiting same ground
|
|
- Enables handoff to others
|
|
- Creates learning artifact
|
|
- Catches circular investigation
|
|
- Documents what was tried
|
|
|
|
## Root Cause Report
|
|
|
|
Post-investigation documentation.
|
|
|
|
### Template
|
|
|
|
```markdown
|
|
# Root Cause Analysis: {Issue Title}
|
|
|
|
## Summary
|
|
Brief description of issue and resolution in 2-3 sentences.
|
|
|
|
## Timeline
|
|
- {DATE/TIME}: Issue first observed
|
|
- {DATE/TIME}: Investigation started
|
|
- {DATE/TIME}: Root cause identified
|
|
- {DATE/TIME}: Fix deployed
|
|
- {DATE/TIME}: Issue verified resolved
|
|
|
|
## Symptoms
|
|
What users/systems experienced:
|
|
- {Observable symptom 1}
|
|
- {Observable symptom 2}
|
|
|
|
## Root Cause
|
|
**Primary cause**: {What ultimately caused the issue}
|
|
|
|
**Contributing factors**:
|
|
- {Factor that made issue worse or harder to catch}
|
|
- {Factor that enabled issue to occur}
|
|
|
|
## Evidence
|
|
How we confirmed this was the root cause:
|
|
- {Evidence 1}
|
|
- {Evidence 2}
|
|
- {Test that confirmed}
|
|
|
|
## Resolution
|
|
**Immediate fix**: {What was done to resolve}
|
|
|
|
**Code/config changes**: {Links to PRs, commits}
|
|
|
|
## Prevention
|
|
**How to prevent recurrence**:
|
|
- {Preventive measure 1}
|
|
- {Preventive measure 2}
|
|
|
|
**Detection improvements**:
|
|
- {How to catch this earlier next time}
|
|
|
|
## Lessons Learned
|
|
- {What went well in investigation}
|
|
- {What could improve}
|
|
- {Knowledge gap discovered}
|
|
|
|
## Appendix
|
|
- Investigation log
|
|
- Relevant logs/screenshots
|
|
- Related incidents
|
|
```
|
|
|
|
## Quick Incident Notes
|
|
|
|
For minor issues not requiring full RCA.
|
|
|
|
### Template
|
|
|
|
```markdown
|
|
## Incident: {Brief title}
|
|
**Date**: {When}
|
|
**Duration**: {How long}
|
|
**Impact**: {Who/what affected}
|
|
|
|
**Cause**: {One sentence}
|
|
**Fix**: {One sentence}
|
|
**Prevention**: {One sentence}
|
|
```
|
|
|
|
## Hypothesis Tracking
|
|
|
|
When testing multiple hypotheses.
|
|
|
|
### Template
|
|
|
|
```markdown
|
|
## Hypotheses
|
|
|
|
### H1: {Description}
|
|
- **Likelihood**: High/Medium/Low
|
|
- **Evidence for**: {Supporting data}
|
|
- **Evidence against**: {Contradicting data}
|
|
- **Test**: {How to verify}
|
|
- **Status**: Pending/Testing/Confirmed/Ruled Out
|
|
|
|
### H2: {Description}
|
|
...
|
|
```
|
|
|
|
### Status Flow
|
|
|
|
```
|
|
Pending → Testing → Confirmed
|
|
→ Ruled Out
|
|
```
|
|
|
|
## Environment Snapshot
|
|
|
|
Document system state at time of issue.
|
|
|
|
### Template
|
|
|
|
```markdown
|
|
## Environment Snapshot
|
|
|
|
**System**: {Service/application name}
|
|
**Instance**: {Server/container ID}
|
|
**Time**: {Timestamp}
|
|
|
|
### Versions
|
|
- Application: {version}
|
|
- Runtime: {version}
|
|
- Key dependencies: {versions}
|
|
|
|
### Configuration
|
|
- {Relevant config values}
|
|
|
|
### State
|
|
- Memory: {usage}
|
|
- CPU: {usage}
|
|
- Connections: {counts}
|
|
- Queue depth: {if applicable}
|
|
|
|
### Recent Changes
|
|
- {Deploy/config change within last 24-48h}
|
|
```
|
|
|
|
## Postmortem Meeting Notes
|
|
|
|
For team discussions.
|
|
|
|
### Agenda Template
|
|
|
|
```markdown
|
|
## Postmortem: {Incident}
|
|
**Date**: {Meeting date}
|
|
**Attendees**: {Names}
|
|
|
|
### Summary (5 min)
|
|
{Owner presents incident summary}
|
|
|
|
### Timeline Review (10 min)
|
|
{Walk through what happened when}
|
|
|
|
### Root Cause Discussion (15 min)
|
|
- Confirmed root cause
|
|
- Contributing factors
|
|
- Why wasn't this caught earlier?
|
|
|
|
### Action Items (15 min)
|
|
| Action | Owner | Due Date |
|
|
|--------|-------|----------|
|
|
| {Preventive measure} | {Name} | {Date} |
|
|
|
|
### Follow-up
|
|
- Next review: {Date if needed}
|
|
- Documentation: {Where RCA will be stored}
|
|
```
|
|
|
|
## Checklist: Documentation Quality
|
|
|
|
Before closing investigation:
|
|
|
|
- [ ] Root cause clearly stated
|
|
- [ ] Evidence documented
|
|
- [ ] Alternative hypotheses addressed
|
|
- [ ] Resolution steps recorded
|
|
- [ ] Prevention measures identified
|
|
- [ ] Learning captured
|
|
- [ ] Stakeholders informed
|