Consulting SRE Engagements

This post describes how I would shape a consulting-style Site Reliability Engineering (SRE) engagement.

SRE is seen as a high modernist project, intent on scientifically managing their systems, all techne and no metis; all SLOs and Kubernetes and no systems knowledge and craft.

See Seeing Like an SRE: Site Reliability Engineering as High Modernism

1. Engagement Charter

Start by formalizing the scope and activities as a charter.

e.g. production readiness, operational responsibility, …

2. Critical User Journey Mapping

Discover and document user journeys prioritized by criticality to facilitate the remaining activities.

| Critical User Journey | Interaction | Valid Event | Impact |
|---|---|---|---|
| Checkout | GET /checkout/new | HTTP 200 | 100% |
| Checkout | POST /checkout | HTTP 301 | 100% |
| Checkout | GET /orders/[id] | HTTP 200,404 | 0% |
| Add to cart | PUT /cart/[product_id] | HTTP 200 | 10-100% |
| View Product | GET /products/[id] | HTTP 200,404 | 5-100% |
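A minimal sketch of how the journey map above could be kept as data, so that later steps (risk analysis, SLOs) can reuse it. The class name `Interaction` and the helpers `is_valid_event` and `valid_ratio` are my own illustrative names, not part of any tool; the rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    journey: str                # Critical User Journey the interaction belongs to
    method: str                 # HTTP method
    path: str                   # route template, as written in the table
    valid_statuses: frozenset   # status codes that count as a valid event
    impact: str                 # share of the journey affected, kept as free text


# Rows copied from the journey table.
INTERACTIONS = [
    Interaction("Checkout", "GET", "/checkout/new", frozenset({200}), "100%"),
    Interaction("Checkout", "POST", "/checkout", frozenset({301}), "100%"),
    Interaction("Checkout", "GET", "/orders/[id]", frozenset({200, 404}), "0%"),
    Interaction("Add to cart", "PUT", "/cart/[product_id]", frozenset({200}), "10-100%"),
    Interaction("View Product", "GET", "/products/[id]", frozenset({200, 404}), "5-100%"),
]


def is_valid_event(interaction, status):
    """True if the observed status code is one the journey map treats as valid."""
    return status in interaction.valid_statuses


def valid_ratio(interaction, statuses):
    """Fraction of observed responses that are valid events, or None with no data."""
    if not statuses:
        return None
    good = sum(1 for status in statuses if is_valid_event(interaction, status))
    return good / len(statuses)
```

For example, `valid_ratio(INTERACTIONS[1], [301, 301, 500, 301])` treats three of the four `POST /checkout` responses as valid and returns `0.75`.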

3. Risk Analysis

Capture concrete and systemic risks against the Critical User Journeys (CUJs). Examples of systemic risks are “production access” or “lack of monitoring”. Examples of concrete risks are “deployments cause downtime” or a “minor defect”.

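| Risk | ETTD | ETTR | Impact | ETTF | Incidents/Year | Bad Minutes/Year |
|---|---|---|---|---|---|---|
| | mins | mins | % | days | 365/ETTF | (ETTD + ETTR) * Impact * Incidents/Year |
| deployment downtime | 0 mins | 3 mins | 100% | 7 days | 52 | 156 mins |
| minor defect | 60 mins | 60 mins | 2% | 21 days | 17 | 41 mins |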
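The arithmetic in the risk table can be sketched in a few lines. I am reading ETTD, ETTR and ETTF as estimated time to detect, to resolve, and to (the next) failure, which is an assumption on my part. The table's figures appear to round Incidents/Year to a whole number before multiplying, so this sketch does the same; the `Risk` class and its field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    ettd_mins: float   # minutes until the problem is detected
    ettr_mins: float   # minutes until the problem is resolved
    impact: float      # fraction of users affected, 0.0 to 1.0
    ettf_days: float   # days between occurrences

    def incidents_per_year(self) -> int:
        # 365 / ETTF, rounded to whole incidents (assumed from the table's figures).
        return round(365 / self.ettf_days)

    def bad_minutes_per_year(self) -> float:
        # (ETTD + ETTR) * Impact * Incidents/Year
        return (self.ettd_mins + self.ettr_mins) * self.impact * self.incidents_per_year()


# Rows copied from the risk table.
RISKS = [
    Risk("deployment downtime", ettd_mins=0, ettr_mins=3, impact=1.00, ettf_days=7),
    Risk("minor defect", ettd_mins=60, ettr_mins=60, impact=0.02, ettf_days=21),
]

# Rank risks by how much of the yearly error budget they are expected to consume.
for risk in sorted(RISKS, key=Risk.bad_minutes_per_year, reverse=True):
    print(f"{risk.name}: {risk.bad_minutes_per_year():.0f} bad minutes/year")
```

Sorting by bad minutes gives a rough priority order: here the deployment downtime (156 minutes) outweighs the minor defect (about 41 minutes), even though each deployment incident is much shorter.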

4. Service Level Objective Development

Figure out which metrics, used as Service Level Indicators (SLIs), will most accurately track the user experience. 80% of the time, this is availability.

98% of POST /checkout requests should return an HTTP 301 status code over a rolling 28-day window.
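To make that objective measurable, here is a rough sketch of the SLI and the remaining error budget, assuming you can count "good" events (requests that returned 301) and valid events (all `POST /checkout` requests) over the 28-day window. The function names and the request counts in the note below are made up for illustration.

```python
def availability_sli(good_events: int, valid_events: int) -> float:
    """Ratio of good events to valid events over the window."""
    if valid_events == 0:
        # Assumption: a window with no traffic is treated as meeting the objective.
        return 1.0
    return good_events / valid_events


def error_budget_remaining(good_events: int, valid_events: int, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if valid_events == 0:
        return 1.0
    allowed_bad = (1 - slo_target) * valid_events   # bad events the SLO tolerates
    actual_bad = valid_events - good_events
    return 1 - actual_bad / allowed_bad
```

With a hypothetical 10,000 checkout requests of which 9,900 returned 301, the SLI is 0.99 against a target of 0.98, and roughly half of the error budget is still available.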

See Art of SLOs

| Availability % | Downtime per year | Downtime per day (24 hours) | Cost Example |
|---|---|---|---|
| 90% (“one nine”) | 36.53 days | 2.40 hours | $0 |
| 99% (“two nines”) | 3.65 days | 14.40 minutes | $1,000 |
| 99.9% (“three nines”) | 8.77 hours | 1.44 minutes | $10,000 |
| 99.99% (“four nines”) | 52.60 minutes | 8.64 seconds | $100,000 |
| 99.999% (“five nines”) | 5.26 minutes | 864.00 milliseconds | $1,000,000 |
| 99.9999% (“six nines”) | 31.56 seconds | 86.40 milliseconds | $10,000,000 |
| 99.99999% (“seven nines”) | 3.16 seconds | 8.64 milliseconds | $100,000,000 |
| 99.999999% (“eight nines”) | 315.58 milliseconds | 864.00 microseconds | $… |
| 99.9999999% (“nine nines”) | 31.56 milliseconds | 86.40 microseconds | $… |

E.g. a passenger plane engine might be designed for “five nines” of availability.
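The downtime columns follow directly from the availability percentage. A small sketch, assuming a year of 365.25 days (which is what the yearly figures in the table are consistent with); the function and constant names are mine.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: average year length


def allowed_downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Downtime permitted in a period at the given availability target."""
    return (1 - availability_percent / 100) * period_seconds


# Print the downtime budget for one to five nines.
for target in (90, 99, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(target, SECONDS_PER_YEAR)
    per_day = allowed_downtime_seconds(target, SECONDS_PER_DAY)
    print(f"{target}%: {per_year / 60:.2f} minutes/year, {per_day:.2f} seconds/day")
```

The same function works for any window, so you can pass the length of a 28-day SLO window instead of a day or a year.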

5. Production Readiness Review

Verify that the service meets accepted standards of production and operational readiness. Examples of production readiness include an established development process, regular and reliable deployments, operational monitoring, and documented procedures.

See Evolving SRE Engagement Model

6. Review Periodically

Requirements will change and new information will become available. Here is some guidance from the SRE Workbook - Implementing SLOs on how to respond to your SLO measures.

| SLO | Toil | Customer satisfaction | Action |
|---|---|---|---|
| Met | Low | High | Choose to (a) relax release and deployment processes and increase velocity, or (b) step back from the engagement and focus engineering time on services that need more reliability. |
| Met | Low | Low | Tighten SLO. |
| Met | High | High | If alerting is generating false positives, reduce sensitivity. Otherwise, temporarily loosen the SLOs (or offload toil) and fix product and/or improve automated fault mitigation. |
| Met | High | Low | Tighten SLO. |
| Missed | Low | High | Loosen SLO. |
| Missed | Low | Low | Increase alerting sensitivity. |
| Missed | High | High | Loosen SLO. |
| Missed | High | Low | Offload toil and fix product and/or improve automated fault mitigation. |

See SRE Workbook - Implementing SLOs
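If you want the periodic review to be repeatable, the decision matrix above can be encoded as a simple lookup. This is only a sketch: the action strings are copied from the table, while the `recommended_action` helper and the way inputs are normalised are my own choices.

```python
# Keys are (SLO, Toil, Customer satisfaction), exactly as in the table.
ACTIONS = {
    ("Met", "Low", "High"): (
        "Choose to (a) relax release and deployment processes and increase velocity, "
        "or (b) step back from the engagement and focus engineering time on services "
        "that need more reliability."
    ),
    ("Met", "Low", "Low"): "Tighten SLO.",
    ("Met", "High", "High"): (
        "If alerting is generating false positives, reduce sensitivity. Otherwise, "
        "temporarily loosen the SLOs (or offload toil) and fix product and/or improve "
        "automated fault mitigation."
    ),
    ("Met", "High", "Low"): "Tighten SLO.",
    ("Missed", "Low", "High"): "Loosen SLO.",
    ("Missed", "Low", "Low"): "Increase alerting sensitivity.",
    ("Missed", "High", "High"): "Loosen SLO.",
    ("Missed", "High", "Low"): (
        "Offload toil and fix product and/or improve automated fault mitigation."
    ),
}


def recommended_action(slo: str, toil: str, satisfaction: str) -> str:
    """Look up the suggested action for one row of the decision matrix."""
    key = (slo.strip().capitalize(), toil.strip().capitalize(), satisfaction.strip().capitalize())
    if key not in ACTIONS:
        raise ValueError(f"unknown combination: {key}")
    return ACTIONS[key]
```

For example, `recommended_action("missed", "low", "low")` returns `"Increase alerting sensitivity."`. Deciding whether toil or satisfaction counts as "high" is still a judgment call that the code does not make for you.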

Useful activities

weekly production in review, runbooks, “Wheel of Misfortune”, pre-mortem, causal maps, human factors, team building activities, …