---
name: root-cause-tracing
description: "Root cause analysis (RCA) and tracing failures back to the original trigger across layers. Triggers: root cause, RCA, tracing, 回溯, 根因, 追溯, 为什么会发生."
---

# Root Cause Tracing (RCA)

## When to Use
- Incidents, regressions, flaky tests, recurring bugs
- “Fix the symptom” patches where the underlying trigger is unknown
- Multi-layer failures (client → service → DB → async jobs)

## Inputs (required)
- Evidence: logs, stack traces, metrics, failing test output
- Timeline: when it started, what changed, rollout events
- Scope: affected users/paths, frequency, severity
- Verification: how to reproduce (or how to detect reliably)
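
The required inputs above can be checked before any analysis starts. The following is a minimal sketch, not a prescribed implementation: `RcaInputs` and `missing_inputs` are hypothetical names introduced here for illustration, and the field shapes (lists of strings, free-text fields) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class RcaInputs:
    """Hypothetical container for the four required inputs."""
    evidence: list[str] = field(default_factory=list)  # logs, stack traces, metrics, test output
    timeline: list[str] = field(default_factory=list)  # start time, changes, rollout events
    scope: str = ""                                     # affected users/paths, frequency, severity
    verification: str = ""                              # repro steps or a reliable detection method


def missing_inputs(inputs: RcaInputs) -> list[str]:
    """Return the names of required inputs that are still empty."""
    required = {
        "evidence": inputs.evidence,
        "timeline": inputs.timeline,
        "scope": inputs.scope.strip(),
        "verification": inputs.verification.strip(),
    }
    return [name for name, value in required.items() if not value]


# Illustrative values only; they are not taken from a real incident.
inputs = RcaInputs(
    evidence=["stack trace from the failing test"],
    scope="checkout path, intermittent",
)
print(missing_inputs(inputs))  # ['timeline', 'verification']
```

Anything this check reports as missing is a candidate to request before drawing conclusions.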

## Procedure (default)
1. **Frame the Failure**
   - Define expected vs actual behavior
   - Identify the earliest known bad signal

2. **Trace Backwards**
   - Walk back through layers: surface error → caller → upstream trigger
   - Look for the first point where invariants were violated

3. **Find the Trigger**
   - What input/state/sequence causes it?
   - What changed around that area (code/config/deps/data)?

4. **Fix at the Right Layer**
   - Prefer root-cause fix + defense-in-depth guardrails
   - Add regression test or a deterministic repro harness

5. **Validate**
   - Reproduce the failure before applying the fix; confirm the same repro passes after it
   - Add monitoring/alerts if appropriate
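
The "Trace Backwards" step can be sketched in code. This is an illustration under stated assumptions, not a definitive method: it assumes the failure path has already been reconstructed as an ordered list of per-layer state snapshots, and `Event`, `earliest_violation`, and the `quantity` invariant are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Event:
    """One observed step in the failure path (hypothetical shape)."""
    layer: str   # e.g. "client", "service", "db", "job"
    state: dict  # snapshot of the state that layer saw


def earliest_violation(
    events: list[Event],
    invariant: Callable[[dict], bool],
) -> Optional[Event]:
    """Walk back from the surface error to the first point where the invariant broke.

    `events` is ordered oldest to newest; the last entry is where the error surfaced.
    Returns None if the last event still satisfies the invariant.
    """
    culprit = None
    for event in reversed(events):
        if invariant(event.state):
            break  # state was still valid here, so the violation began after this event
        culprit = event
    return culprit


# Illustrative data only: the job crashes, but the bad state first appears in the service.
events = [
    Event("client", {"quantity": 2}),
    Event("service", {"quantity": -1}),
    Event("db", {"quantity": -1}),
    Event("job", {"quantity": -1}),
]
culprit = earliest_violation(events, lambda state: state["quantity"] >= 0)
print(culprit.layer)  # service
```

Here the crash site is the job, while the earliest invariant violation is in the service layer, which is where the search for the trigger would start.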

## Output Contract (stable)
- Summary: what broke and impact
- Root cause: the earliest causal violation + why it happened
- Trigger: minimal repro steps / conditions
- Fix: what changed and why it prevents recurrence
- Verification: tests/commands + evidence
- Follow-ups: guardrails/observability/rollout notes
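
One way to keep the output shape stable is to model the six fields explicitly and render them in a fixed order. The sketch below assumes a plain Markdown bullet rendering; `RcaReport` and `to_markdown` are hypothetical names, and the example values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RcaReport:
    """Hypothetical record mirroring the six output fields."""
    summary: str
    root_cause: str
    trigger: str
    fix: str
    verification: str
    follow_ups: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the fields as bullets, always in the same order."""
        follow_ups = "; ".join(self.follow_ups) if self.follow_ups else "none"
        lines = [
            f"- Summary: {self.summary}",
            f"- Root cause: {self.root_cause}",
            f"- Trigger: {self.trigger}",
            f"- Fix: {self.fix}",
            f"- Verification: {self.verification}",
            f"- Follow-ups: {follow_ups}",
        ]
        return "\n".join(lines)


# Invented example values, for illustration only.
report = RcaReport(
    summary="Nightly job crashed on negative quantities",
    root_cause="Service accepted a negative quantity without validation",
    trigger="Request with quantity below zero",
    fix="Validate quantity at the service boundary",
    verification="Regression test added; repro passes after the fix",
    follow_ups=["Alert on rejected requests"],
)
print(report.to_markdown())
```

Required fields without defaults make an incomplete report fail at construction time rather than at review time.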

## Guardrails
- Don’t stop at “where it crashed”; find “why the bad state existed”
- Separate contributing factors from the root cause
- Avoid speculative RCA; label assumptions and request missing evidence