---
name: root-cause-tracing
description: "Root cause analysis (RCA) and tracing failures back to the original trigger across layers. Triggers: root cause, RCA, tracing, backtracking, trace back, why did this happen."
---

# Root Cause Tracing (RCA)

## When to Use

- Incidents, regressions, flaky tests, recurring bugs
- “Fix the symptom” patches where the underlying trigger is unknown
- Multi-layer failures (client → service → DB → async jobs)

## Inputs (required)

- Evidence: logs, stack traces, metrics, failing test output
- Timeline: when it started, what changed, rollout events
- Scope: affected users/paths, frequency, severity
- Verification: how to reproduce (or how to detect reliably)

## Procedure (default)

1. **Frame the Failure**
   - Define expected vs actual behavior
   - Identify the earliest known bad signal
2. **Trace Backwards**
   - Walk back through layers: surface error → caller → upstream trigger
   - Look for the first point where invariants were violated
3. **Find the Trigger**
   - What input/state/sequence causes it?
   - What changed around that area (code/config/deps/data)?
4. **Fix at the Right Layer**
   - Prefer a root-cause fix plus defense-in-depth guardrails
   - Add a regression test or a deterministic repro harness
5. **Validate**
   - Reproduce before the fix; verify after the fix
   - Add monitoring/alerts if appropriate

## Output Contract (stable)

- Summary: what broke and the impact
- Root cause: the earliest causal violation + why it happened
- Trigger: minimal repro steps / conditions
- Fix: what changed and why it prevents recurrence
- Verification: tests/commands + evidence
- Follow-ups: guardrails/observability/rollout notes

## Guardrails

- Don’t stop at “where it crashed”; find “why the bad state existed”
- Separate contributing factors from the root cause
- Avoid speculative RCA; label assumptions and request missing evidence
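
## Example: Tracing Backwards (sketch)

The "Trace Backwards" step of the procedure can be illustrated with a minimal Python sketch. It assumes the state seen at each layer has already been collected as evidence, ordered from origin to surface, and that the violated invariant can be written as a simple predicate. The names `Hop`, `trace_back`, `amount_present`, and the sample data are hypothetical and exist only for illustration; they are not part of any tool this document describes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Hop:
    """State observed at one layer of the request path (hypothetical shape)."""
    layer: str
    state: dict


def trace_back(
    hops: Sequence[Hop],
    invariant_holds: Callable[[dict], bool],
) -> Optional[Hop]:
    """Walk from the surface error back toward the origin.

    `hops` is ordered origin -> surface. Returns the earliest hop in the
    unbroken run of violations that ends at the surface, or None if the
    surface state itself satisfies the invariant.
    """
    earliest_bad = None
    for hop in reversed(hops):          # start at the surface error
        if invariant_holds(hop.state):  # first healthy layer: stop walking
            break
        earliest_bad = hop              # still bad, keep going upstream
    return earliest_bad


def amount_present(state: dict) -> bool:
    """Illustrative invariant: every layer must carry a non-null amount."""
    return state.get("amount") is not None


# Made-up evidence for a client -> service -> DB -> async job failure.
HOPS = [
    Hop("client", {"amount": 10}),
    Hop("service", {"amount": None}),
    Hop("db", {"amount": None}),
    Hop("async job", {"amount": None}),
]

culprit = trace_back(HOPS, amount_present)
print(culprit.layer if culprit else "no violation found")  # service
```

In this made-up data the crash surfaces in the async job, but the walk stops at the service, the first layer where the bad state existed. That hop is only a candidate: why the service produced the bad value still has to be confirmed against the timeline and change history before it is called the root cause.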