Causal Diagrams
management
sre
Incidents often result from several contributing factors rather than a single root cause. Causal diagrams are therefore an effective tool for illustrating how an incident unfolded.
Example
This is an example of an incident impacting availability of a service endpoint:
graph TD
A(Instance Terminated)
B(New Instance Health Check Passes & Receives Traffic)
C(#6 External Service Connection Fails)
D(Instance Endpoint Returns Error)
F(#5 External Service Connections In Use)
G(Implemented Scaling Policy)
E(Purchase External Service Plan With #5 External Service Connection Limit)
A --> B --> C --> D
E --> F
G --> B
G --> F
F --> C
Tip: A causal diagram should be a graph of linked events that contributed to the incident. Each event should be something that happened rather than the absence of something.
Insight
From the above example we can infer that the incident might have been avoided if we had removed a contributing factor:
- Connection limit.
- Connections in use.
Or broke a link in a sequence:
- The instance wasn’t terminated.
- The health check didn’t pass, so the new instance didn’t receive traffic.
Or addressed systemic factors:
- Policy to minimize cost could be tempered with capacity planning.
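The "what if" reasoning above can be sketched in code. The following is a minimal Python sketch, not a definitive method: it encodes the example diagram as cause-to-effect edges and checks whether the final error still occurs once one event is taken away. The identifiers are hypothetical shorthand for the diagram's node labels, and the model assumes that an event happens only if all of its direct causes happen, which is a simplification that holds for this example but not for every incident.

```python
# Cause -> effect edges, transcribed from the example diagram.
# The snake_case names are hypothetical shorthand for the node labels.
EDGES = [
    ("instance_terminated", "new_instance_receives_traffic"),
    ("new_instance_receives_traffic", "sixth_connection_fails"),
    ("sixth_connection_fails", "endpoint_returns_error"),
    ("plan_with_limit_of_five", "five_connections_in_use"),
    ("scaling_policy", "new_instance_receives_traffic"),
    ("scaling_policy", "five_connections_in_use"),
    ("five_connections_in_use", "sixth_connection_fails"),
]


def occurs(event, edges, removed=frozenset()):
    """Return True if `event` still happens once `removed` events are taken out.

    Assumption: an event happens only if ALL of its direct causes happen.
    Events with no recorded causes happen unless they are removed.
    """
    if event in removed:
        return False
    causes = [cause for cause, effect in edges if effect == event]
    return all(occurs(cause, edges, removed) for cause in causes)


INCIDENT = "endpoint_returns_error"

# With nothing removed, the incident occurs.
print(occurs(INCIDENT, EDGES))

# Remove one event at a time and check whether the incident is avoided.
for candidate in [
    "plan_with_limit_of_five",        # contributing factor: connection limit
    "five_connections_in_use",        # contributing factor: connections in use
    "instance_terminated",            # link in the sequence
    "new_instance_receives_traffic",  # link in the sequence
    "scaling_policy",                 # systemic factor
]:
    print(candidate, "->", occurs(INCIDENT, EDGES, removed={candidate}))
```

Under the all-causes-required assumption, removing any single event listed above is enough to prevent the endpoint error, which mirrors the bullets in this section.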
Tip: 5 Whys can be useful for finding preceding events.
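On a diagram that already exists, asking "why?" repeatedly amounts to walking the edges backward from the incident. The Python sketch below illustrates that idea under stated assumptions: the `CAUSES` mapping and its short names are hypothetical shorthand for the example diagram, and the limit of five steps is only the conventional number, not a rule.

```python
# Effect -> direct causes, transcribed from the example diagram.
# The snake_case names are hypothetical shorthand for the node labels.
CAUSES = {
    "endpoint_returns_error": ["sixth_connection_fails"],
    "sixth_connection_fails": [
        "new_instance_receives_traffic",
        "five_connections_in_use",
    ],
    "new_instance_receives_traffic": ["instance_terminated", "scaling_policy"],
    "five_connections_in_use": ["plan_with_limit_of_five", "scaling_policy"],
}


def whys(event, causes, depth=0, max_depth=5):
    """Yield (depth, cause) pairs, asking "why?" at most `max_depth` times."""
    if depth >= max_depth:
        return
    for cause in causes.get(event, []):
        yield depth + 1, cause
        yield from whys(cause, causes, depth + 1, max_depth)


# Print the chain of preceding events, indented by how many "whys" deep it is.
for depth, cause in whys("endpoint_returns_error", CAUSES):
    print("  " * depth + "why? " + cause)
```

An event that feeds more than one branch, such as the scaling policy here, shows up once per branch, which is a useful hint that it may be a systemic factor.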