| name | description |
|---|---|
| root-cause-tracing | Root cause analysis (RCA) and tracing failures back to the original trigger across layers. Triggers: root cause, RCA, tracing, 回溯, 根因, 追溯, 为什么会发生. |
Root Cause Tracing (RCA)
When to Use
- Incidents, regressions, flaky tests, recurring bugs
- “Fix the symptom” patches where the underlying trigger is unknown
- Multi-layer failures (client → service → DB → async jobs)
Inputs (required)
- Evidence: logs, stack traces, metrics, failing test output
- Timeline: when it started, what changed, rollout events
- Scope: affected users/paths, frequency, severity
- Verification: how to reproduce (or how to detect reliably)
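
The required inputs above can be treated as a checklist before tracing starts. The following is a minimal sketch, assuming a hypothetical `RcaInputs` container whose field names simply mirror the list; it is not part of any existing tool. It reports which inputs are still empty so they can be requested rather than guessed.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class RcaInputs:
    """Hypothetical container; field names mirror the required inputs."""
    evidence: List[str] = field(default_factory=list)  # logs, stack traces, metrics, test output
    timeline: List[str] = field(default_factory=list)  # when it started, what changed, rollouts
    scope: str = ""         # affected users/paths, frequency, severity
    verification: str = ""  # repro steps, or how to detect the failure reliably

    def missing(self) -> List[str]:
        """Return the names of inputs that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only.
inputs = RcaInputs(
    evidence=["stack trace from failing test output"],
    scope="checkout path, intermittent",
)
print(inputs.missing())  # ['timeline', 'verification']
```

An empty result from `missing()` only means every input is present, not that the evidence is sufficient.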
Procedure (default)
- Frame the Failure
  - Define expected vs actual behavior
  - Identify the earliest known bad signal
- Trace Backwards
  - Walk back through layers: surface error → caller → upstream trigger
  - Look for the first point where invariants were violated
- Find the Trigger
  - What input/state/sequence causes it?
  - What changed around that area (code/config/deps/data)?
- Fix at the Right Layer
  - Prefer root-cause fix + defense-in-depth guardrails
  - Add regression test or a deterministic repro harness
- Validate
  - Reproduce before fix; verify after fix
  - Add monitoring/alerts if appropriate
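
The "Trace Backwards" and "Find the Trigger" steps can be sketched in code. This is an illustration under stated assumptions, not a prescribed implementation: `Hop`, `earliest_violation`, and `first_bad_change` are hypothetical names, each layer is assumed to expose a checkable invariant, and the change search assumes that the last change is bad and that every change after the first bad one is also bad (the same assumption a bisection makes).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Hop:
    """One layer in the causal chain, ordered from origin to surface error."""
    layer: str
    state: dict
    invariant: Callable[[dict], bool]  # True while the state is still valid


def earliest_violation(chain: Sequence[Hop]) -> Optional[Hop]:
    """Walk back from the surface error through consecutive violated hops."""
    earliest = None
    for hop in reversed(chain):  # surface error first, origin last
        if hop.invariant(hop.state):
            break  # first healthy layer: the violation started after it
        earliest = hop
    return earliest


def first_bad_change(changes: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search ordered changes for the first one where the failure appears.

    Assumes changes[-1] is bad and badness is monotonic (once bad, stays bad).
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid       # first bad change is at mid or earlier
        else:
            lo = mid + 1   # first bad change is after mid
    return changes[lo]


# Illustrative data only: the crash surfaces in the async job,
# but the tenant was already missing at the service layer.
chain = [
    Hop("client", {"user_id": 42}, lambda s: s["user_id"] is not None),
    Hop("service", {"tenant": None}, lambda s: s["tenant"] is not None),
    Hop("db", {"tenant": None}, lambda s: s["tenant"] is not None),
    Hop("async job", {"error": "KeyError"}, lambda s: "error" not in s),
]
print(earliest_violation(chain).layer)  # service

changes = ["c1", "c2", "c3", "c4", "c5"]
bad = {"c4", "c5"}
print(first_bad_change(changes, lambda c: c in bad))  # c4
```

The walk stops at the first healthy layer, so it can only find a violation that the invariants are able to detect; a layer with a weak invariant can hide an earlier cause.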
Output Contract (stable)
- Summary: what broke and impact
- Root cause: the earliest causal violation + why it happened
- Trigger: minimal repro steps / conditions
- Fix: what changed and why it prevents recurrence
- Verification: tests/commands + evidence
- Follow-ups: guardrails/observability/rollout notes
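
Because the contract is meant to be stable, it can help to build the report from a fixed structure. The following is a minimal sketch, assuming a hypothetical `RcaReport` class whose fields follow the contract order above; the example values are invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RcaReport:
    """Hypothetical report type; fields follow the output contract order."""
    summary: str
    root_cause: str
    trigger: str
    fix: str
    verification: str
    follow_ups: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as one bullet per contract section."""
        lines = [
            f"- Summary: {self.summary}",
            f"- Root cause: {self.root_cause}",
            f"- Trigger: {self.trigger}",
            f"- Fix: {self.fix}",
            f"- Verification: {self.verification}",
            "- Follow-ups: " + ("; ".join(self.follow_ups) or "none"),
        ]
        return "\n".join(lines)


# Illustrative values only.
report = RcaReport(
    summary="Async job failed for requests without a tenant",
    root_cause="Service layer forwarded a missing tenant instead of rejecting it",
    trigger="Request with no tenant header",
    fix="Validate tenant at the service boundary",
    verification="Regression test for the missing-tenant request",
    follow_ups=["Alert on rejected requests"],
)
print(report.render())
```

Keeping the section order in one place means every report lists the same fields in the same sequence.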
Guardrails
- Don’t stop at “where it crashed”; find “why the bad state existed”
- Separate contributing factors from the root cause
- Avoid speculative RCA; label assumptions and request missing evidence
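
One way to apply the last guardrail is to label every claim by whether evidence backs it. This is a minimal sketch, assuming hypothetical `Claim` and `label_claims` names and an invented `FINDING` / `ASSUMPTION` prefix convention; the example claims are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """A statement in the RCA plus the evidence (if any) that supports it."""
    statement: str
    evidence: List[str] = field(default_factory=list)


def label_claims(claims: List[Claim]) -> List[str]:
    """Prefix each claim so unsupported statements are visibly assumptions."""
    labeled = []
    for claim in claims:
        if claim.evidence:
            refs = ", ".join(claim.evidence)
            labeled.append(f"FINDING: {claim.statement} (evidence: {refs})")
        else:
            labeled.append(f"ASSUMPTION: {claim.statement} (evidence needed)")
    return labeled


# Illustrative claims only.
claims = [
    Claim("Tenant was missing at the service layer", ["service log line"]),
    Claim("A config change removed the default tenant"),
]
for line in label_claims(claims):
    print(line)
```

Anything labeled `ASSUMPTION` is a prompt to request the missing evidence before it is reported as a cause.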