Making Sites Reliable Human processes to increase reliability Andrey Tatarinov, Pavel Uvarov Site Reliability Engineering, Google

Making Sites Reliable (how to make a system reliable) (Pavel Uvarov, Andrey Tatarinov)

Page 1

Making Sites Reliable
Human processes to increase reliability

Andrey Tatarinov, Pavel Uvarov

Site Reliability Engineering, Google

Page 2

Engineering aspects and roles

● Software Development (SWE)
  ○ Designing
  ○ Coding

● Provisioning (long-term prevention of outages) (SWE+SRE)
  ○ Capacity planning
  ○ Automation
  ○ Operations feedback

● Operations (short-term prevention of outages) (SRE)
  ○ Manual response (oncall)
  ○ Site (system) administration

Page 3

Spreading knowledge

● Mad genius problem
● Sociopath problem
● Introducing a new team member
● Shuffling teams
● Low "bus factor"

● Medium and large teams (30+ people)
● Startups have other problems

Page 4

Sweet spot (or somewhere around)

● TL;DR: Speak more to teammates
● Peer-to-peer review in every process
● Data-driven decision making
● Group decision making
● Priorities
  ○ Widespread knowledge
  ○ Predictable quality
  ○ Leveling extremes

● Key point
  ○ Human processes that scale
  ○ Principles are omnipresent
  ○ Not educational or disciplinary, but part of day-to-day life

Page 5

Design review

● Each significant change triggers a design review
● Document that captures
  ○ Problem statement
  ○ Requirements
  ○ Proposed solution
  ○ Costs and benefits
    ■ Development
    ■ Resources
  ○ Risk factors and solutions

● Design review meeting
  ○ Different experts: PM, SWEs, SREs
  ○ Different priorities: new features, architecture sanity, stability

● Result:
  ○ Widespread knowledge
  ○ Balanced compromise between new functionality, sanity and reliability

Page 6

Code review

● Before commit, not after
● Style guide
  ○ "Code is good when you can't tell who wrote it"
  ○ Readability
● Peer-to-peer
  ○ No individual ownership
  ○ Each change should be reviewed by another engineer with expertise in this area
● Result
  ○ Widespread knowledge
  ○ Reliable changes
  ○ Easy to read and modify, consistent code

Page 7

Knowledge externalization

● Oncall engineer wakes SWE up at 2am
● External knowledge database
● Long-term memory
  ○ Playbook
● Short-term memory
  ○ Alert history
  ○ Hand-off

Page 8

Manual response

● No L0, L1, L2 operations levels
● Anyone with basic knowledge can react to 80-90% of alerts
  ○ Knowledge is included
  ○ Human is an intelligent executor
    ■ Prevents feedback loops
    ■ Can identify anomalies
    ■ Feedback for provisioning and automation

● Weekly/Monthly/Quarterly oncall review
  ○ More people are aware
  ○ Top issues identified
  ○ Automation/rearchitecting planned
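The "top issues identified" step of an oncall review can be sketched as a simple count over the alert history. This is only an illustration: the `AlertEvent` schema, the alert names and the sample data are made up and do not come from the talk.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    """One fired alert as it might appear in an alert history (hypothetical schema)."""
    alert_name: str
    service: str


def top_issues(history: list[AlertEvent], limit: int = 3) -> list[tuple[str, int]]:
    """Return the most frequent alert names with their counts, most frequent first."""
    counts = Counter(event.alert_name for event in history)
    return counts.most_common(limit)


# Made-up alert history for one review period.
history = [
    AlertEvent("DiskFull", "storage"),
    AlertEvent("HighLatency", "frontend"),
    AlertEvent("DiskFull", "storage"),
    AlertEvent("DiskFull", "logs"),
    AlertEvent("HighLatency", "frontend"),
    AlertEvent("TaskCrashLoop", "indexer"),
]

for name, count in top_issues(history, limit=2):
    print(f"{name}: {count}")
```

The alerts that top such a list are natural candidates for the automation or rearchitecting work planned in the review.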

Page 9

Productionization

● To get into production, every service has to comply with a checklist
● Checklist derived from experience
  ○ Continuous builds/testing
  ○ Load testing
  ○ Capacity planning
  ○ General health and user-facing monitoring/alerting
  ○ Identified potential issues
    ■ Monitoring and failover scenarios
  ○ Playbook entries

● Result: a service which behaves predictably; any SRE can react to an outage
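A productionization checklist like the one on this slide can be expressed as data and checked mechanically. The sketch below is an assumption about how that might look: the item descriptions are taken from the slide, while the keys and function names are hypothetical.

```python
from __future__ import annotations

# Item descriptions come from the slide; the keys are hypothetical identifiers.
CHECKLIST = {
    "continuous_builds_and_testing": "Continuous builds/testing",
    "load_testing": "Load testing",
    "capacity_planning": "Capacity planning",
    "monitoring_and_alerting": "General health and user-facing monitoring/alerting",
    "potential_issues_identified": "Identified potential issues",
    "playbook_entries": "Playbook entries",
}


def missing_items(completed: set[str]) -> list[str]:
    """Return descriptions of checklist items the service has not completed yet."""
    return [desc for key, desc in CHECKLIST.items() if key not in completed]


def ready_for_production(completed: set[str]) -> bool:
    """A service is ready only when no checklist item is missing."""
    return not missing_items(completed)


# Made-up state of a service that still lacks load testing and playbook entries.
done = {
    "continuous_builds_and_testing",
    "capacity_planning",
    "monitoring_and_alerting",
    "potential_issues_identified",
}
print(ready_for_production(done))
for item in missing_items(done):
    print("missing:", item)
```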

Page 10

Post-mortem

● Each large outage triggers a post-mortem and a post-mortem meeting
● Document with
  ○ Impact
  ○ Timeline
  ○ Root cause
  ○ Action items to prevent root cause from happening

● Post-mortem review
● Result
  ○ Widespread knowledge
  ○ This class of root causes is likely not to happen again
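The four sections of the post-mortem document listed above can be modeled as a small record that reports which sections are still empty before the review. This is a minimal sketch, not the format used by the authors: the class names, fields beyond the four sections, and the sample text are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ActionItem:
    """A follow-up task; the `done` flag is an assumed addition."""
    description: str
    done: bool = False


@dataclass
class PostMortem:
    """The four sections named on the slide."""
    impact: str
    timeline: list[str]
    root_cause: str
    action_items: list[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of the required sections that are still empty."""
        missing = []
        if not self.impact.strip():
            missing.append("impact")
        if not self.timeline:
            missing.append("timeline")
        if not self.root_cause.strip():
            missing.append("root cause")
        if not self.action_items:
            missing.append("action items")
        return missing

    def open_action_items(self) -> list[ActionItem]:
        """Action items that have not been completed yet."""
        return [item for item in self.action_items if not item.done]


# Made-up draft with only the impact filled in.
draft = PostMortem(impact="Search unavailable for some users", timeline=[], root_cause="")
print(draft.missing_sections())
```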

Page 11

Site Reliability Engineering

Page 12

Reliability

● The ability of a person or system to perform and maintain its functions in routine circumstances, as well as hostile or unexpected circumstances. (Wikipedia)
● Unexpected situations cannot be handled by a computer program. Otherwise they are not unexpected. (Common Logic)

We will talk about human processes

Page 13

Site

● Distributed complicated system
  ○ Many machines, switches, routers, racks, etc.
  ○ Lots of data
  ○ Many services
  ○ But not so many people (machines:admins > 4000:1)

● Everything breaks all the time
  ○ Hardware...
    ■ Fan stopped, bad memory, disk died, ...
  ○ Software...
    ■ Server not running, wrong version, slow response, ...
  ○ Network...

Page 14

Site Reliability

● Suppose we have the software to run on the site
● It must just work
● How to ensure it?
  ○ Engineer the production environment for reliability
    ■ Automate whatever possible
  ○ Engineer systems and tools that increase reliability
    ■ Monitoring and alerting
    ■ Databases that keep track of machines, tasks and even traffic
  ○ Handle unexpected situations manually
    ■ Oncall

Page 15

Automate...

● If a person doesn't make any decisions, he or she can be replaced with a script, which is much more scalable and reliable

● Automate
  ○ failovers
  ○ load balancing
  ○ response to failures
  ○ routine repairs (reinstalling, draining)

● Determine physical configuration automatically
● Expect the unexpected
  ○ rare events are completely normal
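As an illustration of an automated failover, the sketch below picks the first healthy replica in priority order and raises an error when none is left, which is the point where a human would be paged. The `Replica` type, the health flag and the replica names are hypothetical; a real system would obtain health from its monitoring.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """A serving replica with a health flag (hypothetical model)."""
    name: str
    healthy: bool


class NoHealthyReplica(Exception):
    """Raised when automation has nothing left to try and a human should take over."""


def pick_serving_replica(replicas: list[Replica]) -> Replica:
    """Return the first healthy replica; index 0 is the preferred one."""
    for replica in replicas:
        if replica.healthy:
            return replica
    raise NoHealthyReplica("all replicas are unhealthy")


# Made-up state: the preferred replica is down, so traffic fails over to the next one.
replicas = [
    Replica("primary", healthy=False),
    Replica("backup-1", healthy=True),
    Replica("backup-2", healthy=True),
]
print(pick_serving_replica(replicas).name)
```

The script makes no decisions of its own: it applies a fixed rule and hands the unexpected case (nothing healthy) to the oncall.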

Page 16

Preventing outages

● In reality the software is always being upgraded
● The external world is changing too
● Preventing problems
  ○ Capacity planning & management
  ○ Review design docs
  ○ Prelaunch (onboarding) review
  ○ Adapt production environment to (external) changes

Page 17

Oncall

● Monitoring
  ○ Every program/server is universally monitored

● Alerting
  ○ Alert rules
  ○ Alert escalation
  ○ Pager

● Oncall "memory"
  ○ Playbook (runbook)
  ○ Oncall hand-off
  ○ Tracking of issues

● Oncall review weekly/monthly -> feedback
● Shifts

Page 18

Monitoring and Alerting

[Diagram. Labels: services, monitoring variables, variables history, alert rules, monitoring system, alert (page), primary oncall, escalation, secondary oncall, the team]
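The labels on this slide suggest a flow in which alert rules are evaluated over the history of monitoring variables, and a fired alert pages the primary oncall, then the secondary oncall, then the team. The sketch below is one possible reading of that flow, not the actual monitoring system: the rule shape (threshold over consecutive samples), the variable name and the acknowledgement callback are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Escalation order as labeled on the slide.
ESCALATION_CHAIN = ["primary oncall", "secondary oncall", "the team"]


@dataclass(frozen=True)
class AlertRule:
    """Fires when the last `for_samples` values exceed `threshold` (assumed rule shape)."""
    name: str
    variable: str
    threshold: float
    for_samples: int  # assumed to be >= 1

    def fires(self, history: dict[str, list[float]]) -> bool:
        recent = history.get(self.variable, [])[-self.for_samples:]
        return len(recent) == self.for_samples and all(v > self.threshold for v in recent)


def escalate(acknowledged_by: Callable[[str], bool]) -> str | None:
    """Page each level in order; return the first level that acknowledges, or None."""
    for level in ESCALATION_CHAIN:
        if acknowledged_by(level):
            return level
    return None


# Made-up variable history: the error rate stays high for the last three samples.
history = {"error_rate": [0.01, 0.06, 0.07, 0.09]}
rule = AlertRule("HighErrorRate", "error_rate", threshold=0.05, for_samples=3)

if rule.fires(history):
    # Pretend the primary oncall does not answer and the secondary does.
    handler = escalate(lambda level: level == "secondary oncall")
    print(f"{rule.name} handled by {handler}")
```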

Page 19

Preserving oncall "memory"

● Long-term "memory"
  ○ Playbook (useful hints to handle alerts)

● Short-term "memory"
  ○ Hand-off (description of the shift for the next oncall)
  ○ Tracking recent issues

Page 20

Carrying the pager. Shifts Are Geographically Distributed

Page 21

SWE vs SRE

Page 22

SWE: I want to believe!

Software Engineers are very devout believers. They believe that:

● Their software does not contain bugs
● The datacenters and machines are always "up" and gratis
● The network has infinite capacity
● The speed of light does not apply to their system
● Disks never fail and seek times are close to zero ms
● Configuration files do not contain (syntax) errors

Page 23

SREs are cynics!

Site reliability engineers know that:

● All software sucks
● Everything always fails, preferably in the most inconvenient order
● Complexity is the enemy of reliability
● The laws of physics are real, even inside a virtual machine
● All processes that contain a manual component are highly failure-prone
● Traffic forecasts are bogus

Page 24

Ouch

Page 25

Questions?

Page 26

Thanks!