Making Sites Reliable (как сделать систему надежной) (Павел Уваров, Андрей Татаринов)

Making Sites ReliableHuman processes to increase reliability

Andrey Tatarinov, Pavel Uvarov

Site Reliability Engineering, Google

Engineering aspects and roles

● Software Development (SWE)○ Designing○ Coding

● Provisioning (Long term preventing of outages) (SWE+SRE)○ Capacity planning○ Automation○ Operations feedback

● Operations (Short term preventing of outages) (SRE)○ Manual response (oncall)○ Site (system) administration

Spreading knowledge

● Mad genius problem● Sociopath problem● Introducing new team member● Shuffling teams● Low "bus factor"

● Medium and large teams 30+ people● Startups have other problems

Sweet spot (or somewhere around)

● TLDR: Speak more to teammates ● Peer-to-peer review in every process● Data driven decision making● Group decision making● Priorities

○ Widespread knowledge○ Predictable quality○ Leveling extremes

● Key point

○ Human processes that scale○ Principles are omnipresent○ Not educational or disciplinary, but part of day-to-day life

Design review

● Each significant change triggers

● Document that captures○ Problem statement○ Requirements○ Proposed solution○ Costs and benefits

■ Development■ Resources

○ Risk factors and solutions

● Design review meeting○ Different experts: PM,

SWEs, SREs○ Different priorities: new

features, architecture sanity, stability

● Result: ○ Widespread knowledge○ Balanced compromise

with new functionality, sanity and reliability

Code review

● Before commit, not after● Style guide

○ "code is good when you can't tell who wrote it"○ Readability

● Peer-to-peer○ No individual ownership○ Each change should be reviewed by other engineer with expertise

in this area● Result

○ Widespread knowledge○ Reliable changes○ Easy to read and modify, consistent code

Knowledge externalization

● Oncall engineer wakes SWE up at 2am ● External knowledge database● Long-term memory

○ Playbook● Short-term memory

○ Alert history○ Hand-off

Manual response

● No L0, L1, L2 operations levels● Anyone with basic knowledge can react on 80~90% of alerts

○ Knowledge is included○ Human is an intellegent executor

■ Prevents feedback loops■ Can identify anomalies■ Feedback for provisioning and automation

● Weekly/Monthly/Quarterly oncall review○ More people are aware○ Top issues identified○ Automation/rearchitecturing planned

● To get into production every service has to comply with● Checklist derived from experience

○ Continuous builds/testing○ Load testing○ Capacity planning○ General health and user-facing monitoring/alerting○ Identified potential issues

■ Monitoring and failover scenarios○ Playbook entries

● Result: service which behaves predictably; any SRE can react on outage

Productionization

Post-mortem

● Each large outage triggers post-mortem and p-m meeting● Document with

○ Impact○ Timeline○ Root cause○ Action items to prevent root cause from happening

● Post-mortem review● Result

○ Widespread knowledge○ This class of root causes is likely not to happen again

Site Reliability Engineering

Reliability

● The ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances.

Wikipedia ● Unexpected situations cannot be handled by a computer program.

Otherwise it's not unexpected.Common Logic

We will talk about human processes

Site

● Distributed complicated system○ Many machines, switchers, routers, racks, etc○ Lots of data○ Many services○ But not so many people (machines:admins > 4000:1)

● Everything breaks all the time○ Hardware...

■ Fan stopped, bad memory, disk died, ...○ Software...

■ Server not running, wrong version, slow response, ...○ Network...

Site Reliability

● Suppose we have the software to run on the site● It must just work● How to insure it?

○ Engineer the production environment for reliability■ Automate whatever possible

○ Engineer systems and tools that increase reliability■ Monitoring and alerting■ Databases that keep track of machines, tasks and even traffic

○ Handle unexpected situations manually■ Oncall

Automate...

● If a person doesn't make any desicions he/she can be replaced with a script which is much more scalable and reliable

● Automate○ failovers○ load balancing○ response to failures○ routine repairs (reinstalling, draining)

● Determine physical configuration automatically● Expect the unexpected

○ rare events are completely normal

Preventing outages

● In reality the software is always being upgraded● External world is changing too● Preventing problems.

○ Capacity planning & management○ Review design docs○ Prelaunch (onboarding) review○ Adapt production environment to (external) changes

Oncall

● Monitoring○ Every program/server is universally monitored

● Alerting○ Alert rules○ Alert escalation○ Pager

● Oncall "memory"○ Playbook (runbook)○ Oncall hand-off○ Tracking of issues

● Oncall review weekly/monthly -> feedback● Shifts

Monitoring and Alerting

alert rules

variables history

monitoring variables

services

alert (page)primary oncall

secondary oncall

the team

escalation

monitoring system

Preserving oncall "memory"

● Long term "memory"○ Playbook (useful hints to handle alerts)

● Short term "memory"○ Hand-off (description of the shift for the next oncall)○ Tracking recent issues

Carrying the pager. Shifts Are Geographically Distributed

SWE vs SRE

SWE: I want to believe!

Software Engineers are very devout believers. They believe that: ● Their software does not contain bugs● The datacenters and machines are always "up" and gratis● The network has infinite capacity● The speed of light does not apply to their system● Disks never fail and seek times are close to zero msec.● Configuration files do not contain (syntax) errors

SREs are cynics!

Site reliability engineers know that:● All software sucks● Everything always fails, preferably in the most inconvenient order● Complexity is the enemy of reliability● The laws of physics are real, even inside a virtual machine● All processes that contain a manual component are highly failure

prone● Traffic forecasts are bogus

Ouch

Questions?

Thanks!

Technology

Making Sites Reliable (как сделать систему надежной) (Павел Уваров, Андрей Татаринов)