playbook/codex/skills/root-cause-tracing/SKILL.md

name: root-cause-tracing
description: Root cause analysis (RCA) and tracing failures back to the original trigger across layers. Triggers: root cause, RCA, tracing, 回溯, 根因, 追溯, 为什么会发生.

Root Cause Tracing / RCA

When to Use

  • Incidents, regressions, flaky tests, recurring bugs
  • “Fix the symptom” patches where the underlying trigger is unknown
  • Multi-layer failures (client → service → DB → async jobs)

Inputs (required)

  • Evidence: logs, stack traces, metrics, failing test output
  • Timeline: when it started, what changed, rollout events
  • Scope: affected users/paths, frequency, severity
  • Verification: how to reproduce (or how to detect reliably)

Procedure (default)

  1. Frame the Failure

    • Define expected vs actual behavior
    • Identify the earliest known bad signal
  2. Trace Backwards

    • Walk back through layers: surface error → caller → upstream trigger
    • Look for the first point where invariants were violated
  3. Find the Trigger

    • What input/state/sequence causes it?
    • What changed around that area (code/config/deps/data)?
  4. Fix at the Right Layer

    • Prefer root-cause fix + defense-in-depth guardrails
    • Add a regression test or a deterministic repro harness
  5. Validate

    • Reproduce before fix; verify after fix
    • Add monitoring/alerts if appropriate
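
The "Trace Backwards" step can be sketched in a few lines. This is only an illustration under stated assumptions: the `Event` type, the `total_is_non_negative` invariant, and the sample trace are all hypothetical, and a real trace would be assembled from logs or request IDs rather than written by hand. The idea is to scan an ordered trace for the first point where an invariant stops holding, which is often earlier than the layer that actually crashed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Event:
    """One observed step in a request's path (hypothetical shape)."""
    layer: str   # e.g. "client", "service", "db", "job"
    state: dict  # snapshot of the relevant state at this step


def total_is_non_negative(state: dict) -> bool:
    """Hypothetical invariant: an order total must never be negative."""
    return state.get("total", 0) >= 0


def first_violation(
    events: list[Event], invariant: Callable[[dict], bool]
) -> Optional[Event]:
    """Return the earliest event whose state breaks the invariant, or None.

    Assumes `events` is ordered oldest first.
    """
    for event in events:
        if not invariant(event.state):
            return event
    return None


# Hand-written sample trace, for illustration only.
trace = [
    Event("client", {"total": 40}),
    Event("service", {"total": -5}),  # bad state is introduced here
    Event("db", {"total": -5}),
    Event("job", {"total": -5}),      # the crash surfaces here
]

culprit = first_violation(trace, total_is_non_negative)
print(culprit.layer if culprit else "no violation found")  # prints "service"
```

In this sample the async job is where the failure shows up, but the service layer is where the bad state first existed, so that is where the investigation of "what changed" would start.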

Output Contract (stable)

  • Summary: what broke and impact
  • Root cause: the earliest causal violation + why it happened
  • Trigger: minimal repro steps / conditions
  • Fix: what changed and why it prevents recurrence
  • Verification: tests/commands + evidence
  • Follow-ups: guardrails/observability/rollout notes
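
One way to keep reports consistent with the contract above is to model it as a small data structure. The sketch below is an assumption, not part of the skill itself: the `RcaReport` class, its field names, and the sample incident text are all hypothetical and only mirror the six items listed.

```python
from dataclasses import dataclass, field


@dataclass
class RcaReport:
    """Hypothetical container mirroring the six output-contract items."""
    summary: str        # what broke and impact
    root_cause: str     # earliest causal violation + why it happened
    trigger: str        # minimal repro steps / conditions
    fix: str            # what changed and why it prevents recurrence
    verification: str   # tests/commands + evidence
    follow_ups: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text, one contract item per line."""
        lines = [
            f"Summary: {self.summary}",
            f"Root cause: {self.root_cause}",
            f"Trigger: {self.trigger}",
            f"Fix: {self.fix}",
            f"Verification: {self.verification}",
        ]
        lines += [f"Follow-up: {item}" for item in self.follow_ups]
        return "\n".join(lines)


# Invented example content, for illustration only.
report = RcaReport(
    summary="Checkout job crashed on orders with negative totals",
    root_cause="Service applied a discount twice, producing a negative total",
    trigger="Submit an order with a discount and retry the request once",
    fix="Made discount application idempotent; reject negative totals on write",
    verification="Regression test fails before the fix and passes after",
    follow_ups=["Alert on negative totals"],
)
print(report.render())
```

Making every field except `follow_ups` mandatory means a report cannot be constructed without a stated root cause, trigger, and verification.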

Guardrails

  • Don't stop at “where it crashed”; find “why the bad state existed”
  • Separate contributing factors from the root cause
  • Avoid speculative RCA; label assumptions and request missing evidence