Investigate

Auto review

Root cause analysis, log analysis, and timeline reconstruction

Hats

Review

Auto

Unit Types

Investigation, Analysis

Inputs

Triage

Dependencies

Triage: incident-brief

Hat Sequence

Investigator

Focus: Reconstruct the incident timeline, form and test root cause hypotheses, and distinguish the root cause from contributing factors. Follow the evidence — resist the urge to blame the most recent deploy without proof.

Produces: Root cause analysis with timeline, hypothesis testing results, and contributing factor assessment.

Reads: Incident brief from triage, application logs, deployment history, configuration changes, metrics.

Anti-patterns:

Assuming the most recent change is the cause without evidence
Stopping at the first plausible explanation without testing alternatives
Confusing correlation with causation (e.g., "it broke after the deploy" is not proof the deploy caused it)
Not documenting ruled-out hypotheses and the evidence that eliminated them
Investigating in isolation without sharing findings with the log-analyst
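A minimal sketch of a hypothesis ledger that guards against several of these anti-patterns. It assumes a hypothetical `Hypothesis` record with a `Status` of untested, supported, or refuted; these names are illustrative and not part of any defined schema. A verdict is rejected unless it cites evidence, and refuted hypotheses are kept rather than discarded so they can be documented.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    UNTESTED = "untested"
    SUPPORTED = "supported"
    REFUTED = "refuted"


@dataclass
class Hypothesis:
    """One candidate explanation for the incident (hypothetical record shape)."""
    statement: str
    status: Status = Status.UNTESTED
    evidence: list[str] = field(default_factory=list)

    def record(self, verdict: Status, evidence: str) -> None:
        # Refuse a verdict that cites no evidence: that would be an
        # assumption (e.g. "it broke after the deploy"), not a test result.
        if verdict is Status.UNTESTED or not evidence.strip():
            raise ValueError("a verdict must be supported or refuted, with cited evidence")
        self.status = verdict
        self.evidence.append(evidence)


def ruled_out(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    """Refuted hypotheses stay in the ledger so the report can document them."""
    return [h for h in hypotheses if h.status is Status.REFUTED]


# Usage with made-up, purely illustrative entries:
deploy = Hypothesis("The 14:02 deploy introduced the failure")
pool = Hypothesis("The database connection pool was exhausted")
deploy.record(Status.REFUTED, "first errors logged at 13:47, before the deploy")
pool.record(Status.SUPPORTED, "pool wait time rises from 13:46 in metrics")
print([h.statement for h in ruled_out([deploy, pool])])
```

Keeping the refuted "deploy" hypothesis in the ledger, together with the evidence that eliminated it, is what lets the final report show alternatives were tested rather than assumed away.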

Log Analyst

Focus: Deep-dive into logs, metrics, and traces to find concrete evidence supporting or refuting root cause hypotheses. The log analyst turns raw observability data into structured evidence.

Produces: Evidence report with timestamped log entries, metric correlations, and trace analysis supporting the root cause determination.

Reads: Incident brief from triage, investigator's hypotheses, application logs, APM traces, infrastructure metrics.

Anti-patterns:

Searching logs without a hypothesis to test — fishing expeditions waste time during incidents
Presenting raw log output without synthesis or interpretation
Ignoring logs from adjacent systems that may reveal upstream causes
Not correlating timestamps across different data sources
Treating absence of error logs as evidence of no problem
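One way to avoid the timestamp-correlation anti-pattern is to normalize every event to UTC before merging sources into a single timeline. The sketch below assumes each source can export ISO-8601 timestamps with an explicit UTC offset; `Event`, `parse_event`, `merge_timeline`, and the source names are hypothetical illustrations, not a defined tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    at: datetime   # timezone-aware, normalized to UTC
    source: str    # e.g. "app-log", "infra-metrics" (hypothetical names)
    detail: str


def parse_event(source: str, raw_ts: str, detail: str) -> Event:
    """Parse an ISO-8601 timestamp that carries an explicit UTC offset."""
    ts = datetime.fromisoformat(raw_ts)
    if ts.tzinfo is None:
        # A naive timestamp cannot be placed safely against other sources.
        raise ValueError(f"{source}: timestamp {raw_ts!r} has no UTC offset")
    return Event(ts.astimezone(timezone.utc), source, detail)


def merge_timeline(*streams: list[Event]) -> list[Event]:
    """Combine per-source events into one chronologically ordered timeline."""
    return sorted((e for stream in streams for e in stream), key=lambda e: e.at)


# Usage with made-up events: the app logs in UTC+2, the metrics store in UTC.
app = [parse_event("app-log", "2024-05-01T15:47:10+02:00", "first 503 from checkout")]
metrics = [parse_event("infra-metrics", "2024-05-01T13:46:55+00:00", "db pool wait > 2s")]
for event in merge_timeline(app, metrics):
    print(event.at.isoformat(), event.source, event.detail)
```

Compared naively as local wall-clock times, the app error (15:47) would appear an hour after the metric anomaly; normalized to UTC, the two are 15 seconds apart. This sketch does not correct clock skew between hosts, which may still need to be checked separately.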


Criteria Guidance

Good criteria examples:

"Timeline reconstructs the incident from first anomaly to detection with timestamps from at least 2 independent sources"
"Root cause hypothesis is supported by log evidence with specific entries cited"
"Contributing factors are distinguished from the root cause with evidence for each"

Bad criteria examples:

"Root cause is found"
"Logs are analyzed"
"Investigation is thorough"
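What separates the good examples from the bad ones is that each good criterion can be checked. As an illustration, the first good criterion can be expressed as a predicate. This is a sketch assuming timeline entries are dicts with hypothetical keys `at`, `source`, and `kind`; a criterion such as "Investigation is thorough" offers nothing comparable to test.

```python
from datetime import datetime


def timeline_meets_criterion(entries: list[dict], min_sources: int = 2) -> bool:
    """Check: the timeline runs from first anomaly to detection, with
    timestamps drawn from at least `min_sources` independent sources."""
    # Map each entry kind to its timestamp (assumes one entry per marker kind).
    marks = {e["kind"]: e["at"] for e in entries}
    if "first_anomaly" not in marks or "detection" not in marks:
        return False
    if marks["first_anomaly"] > marks["detection"]:
        return False
    return len({e["source"] for e in entries}) >= min_sources


# Usage with made-up entries:
entries = [
    {"at": datetime(2024, 5, 1, 13, 46), "source": "metrics", "kind": "first_anomaly"},
    {"at": datetime(2024, 5, 1, 14, 5), "source": "pager", "kind": "detection"},
]
print(timeline_meets_criterion(entries))
```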

Completion Signal

Root cause document exists with a reconstructed timeline from first anomaly through detection and escalation. Root cause hypothesis is stated with supporting evidence from logs, metrics, or code. Contributing factors are identified separately. Investigator has ruled out at least 2 alternative hypotheses with evidence. The root cause is specific enough to inform a targeted mitigation.
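The mechanical parts of this completion signal can be checked automatically. The sketch below assumes a hypothetical `RootCauseDoc` shape for the stage output; the field names are illustrative only. The last condition, that the root cause is specific enough to inform a targeted mitigation, still needs human judgment and is deliberately not checked.

```python
from dataclasses import dataclass, field

REQUIRED_PHASES = {"first_anomaly", "detection", "escalation"}


@dataclass
class RootCauseDoc:
    """Hypothetical shape of the investigate stage's output."""
    timeline_phases: set[str]        # phases the timeline covers
    root_cause: str
    root_cause_evidence: list[str]   # cited log lines, metrics, or code references
    contributing_factors: list[str] = field(default_factory=list)
    ruled_out: dict[str, str] = field(default_factory=dict)  # hypothesis -> refuting evidence


def completion_gaps(doc: RootCauseDoc) -> list[str]:
    """Return unmet completion conditions; an empty list means the checkable
    conditions are satisfied."""
    gaps = []
    missing = REQUIRED_PHASES - doc.timeline_phases
    if missing:
        gaps.append(f"timeline missing phases: {sorted(missing)}")
    if not doc.root_cause.strip() or not doc.root_cause_evidence:
        gaps.append("root cause lacks supporting evidence")
    if doc.root_cause in doc.contributing_factors:
        gaps.append("root cause is not separated from contributing factors")
    # Only count alternatives whose refutation actually cites evidence.
    refuted = [h for h, ev in doc.ruled_out.items() if ev.strip()]
    if len(refuted) < 2:
        gaps.append("fewer than 2 alternative hypotheses ruled out with evidence")
    return gaps
```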