Jun Liu, liujun01@baidu.com, April 2016, Shopping Event Reliability, SREcon16


Jun Liu

liujun01@baidu.com

April 2016

Shopping Event Reliability

SREcon16


@Jun Liu

• Architect, SRE team, Baidu Nuomi, China
  – Baidu Nuomi (Local Life O2O Service Platform)
• Areas of focus
  – Infrastructure: distributed computing/storage/scheduling
  – Performance
  – Reliability
  – Capacity

liujun01@baidu.com


Agenda

• Introduction

• The Challenges of SER (Shopping Event Reliability)

• Practice of SER

• What we learned


Introduction: Baidu Nuomi

• Nuomi (糯米) means sticky rice in Chinese
• Local Life O2O Service Platform
  – Food, Movie, Hotel, Takeaway, Tourism, Ticket, Home Services …
• Spiral-Up Business
  – Millions of orders per day
  – Hundreds of millions of GMV per day


Introduction: Shopping Events of Nuomi @2015

• Mar., Girls' Day
• May, Chowhound Day
• Jun., 5th Anniversary
• Aug., Chinese Valentine's Day
• Oct., Mid-Autumn Festival
• Nov., Singles' Day

Shopping events are one of the most significant marketing approaches in China. For SRE, it is like fighting a war nearly every month.

[Chart: GMV/DAU]


The Challenges

• Much More User Traffic
• Much More Seckill (flash sales)
• Much More Different Engineers
• Much More Marketing Events

[Figure labels: Seckill (repeated); SWE, SRE, INF, MKT, SYS, …; Shopping event vs. The Day before]


Overview: SRE FOR SER

1. Traffic Control
   • Overload Protection
   • Graceful Degradation
2. Capacity Management
3. Monitoring
4. Multi-homed Systems
5. Process Management
   • Standardization Exercise


1. Practice of Traffic Control


Practice of Traffic Control: Overload Protection

[Diagram labels: Native App, Baidu Front End (BFE), Nuomi Service, Redis, DB, Other Platform]

• Avoid Avalanche
• Handling Overload
  – Native app-side throttling
  – Global Protection (BFE)
  – Service Protection
  – Database Protection
  – Other Platform
• Based on
  – Utilization of resource
  – Active state of CPU
  – Quota of QPS
  – Receive buffer
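The signals listed on this slide (resource utilization, QPS quota) can be combined in a small admission check. The following is a minimal sketch only, not BFE's or Nuomi's actual implementation: the class name `AdmissionController`, the 0.85 CPU ceiling, and the token-bucket policy are all assumptions made for illustration.

```python
import time


class AdmissionController:
    """Hypothetical overload guard: QPS quota (token bucket) plus a CPU ceiling."""

    def __init__(self, qps_quota, cpu_ceiling=0.85, clock=time.monotonic):
        self.qps_quota = float(qps_quota)   # tokens refilled per second
        self.cpu_ceiling = cpu_ceiling      # assumed threshold, tune per service
        self.clock = clock                  # injectable clock, useful for tests
        self.tokens = float(qps_quota)      # start with a full bucket
        self.last = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Add tokens for the elapsed time, capped at one second's worth of quota.
        self.tokens = min(self.qps_quota, self.tokens + elapsed * self.qps_quota)

    def admit(self, cpu_utilization):
        """Return True to serve the request, False to shed it early."""
        self._refill()
        if cpu_utilization > self.cpu_ceiling:
            return False                    # resource-based protection
        if self.tokens < 1.0:
            return False                    # quota-based protection
        self.tokens -= 1.0
        return True
```

Shedding a request before any work is done is what keeps one overloaded layer from dragging down the layers behind it; the same check could in principle sit at the app, front end, service, or database tier.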


Practice of Traffic Control: Graceful Degradation

• Flexible Reliability
  – Grade and decouple the features
  – Independent seckill service
• Static environment
  – Old service and data mirror image
  – Simple, Independent, Automatic, Keep high hit rate
• Comfortable Notice
  – Widely used when services fail
  – Simple, Independent, Automatic
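The three fallbacks on this slide (graded features, a static environment, a comfortable notice) can be sketched as one decision function. This is an illustration under assumptions: the `Grade` levels, the 0 to 2 `load_level` signal, and the notice text are hypothetical and do not come from the talk.

```python
from enum import IntEnum


class Grade(IntEnum):
    """Hypothetical feature grades: a lower value means a more critical feature."""
    CORE = 0        # e.g. ordering and payment
    IMPORTANT = 1   # e.g. search
    OPTIONAL = 2    # e.g. recommendations


NOTICE = "The service is busy. Please try again in a moment."


def handle(feature_grade, load_level, live_handler, static_snapshot=None):
    """Serve live, fall back to a static snapshot, or return a friendly notice.

    load_level is an assumed signal: 0 normal, 1 busy, 2 overloaded.
    Features whose grade exceeds (2 - load_level) are not served live.
    """
    max_live_grade = 2 - load_level
    if feature_grade <= max_live_grade:
        try:
            return live_handler()
        except Exception:
            pass  # fall through to the degraded paths below
    if static_snapshot is not None:
        return static_snapshot   # mirrored data from the static environment
    return NOTICE                # last resort: a comfortable notice
```

Because the fallbacks need no live dependency, they stay available even when the service behind `live_handler` is failing.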


2. Practice of Capacity Management


Practice of Capacity Planning

Service Level Objectives
• Latency
• Reliability

Load Testing Tools
• Simulation tool: read and write, independent
• Playback tool: historical request data
• Data collection tool: hardware (utilization) & software (SLOs)

Business Objectives
• Active users
• QPS
• GMV
• The number of coupons
• …

Capacity Indicators

Capacity Model
• Correlation analysis
• Finding Minimum water level

Load Testing

Scenario-Based Shopping Event Capacity Planning
• Marketing Event Scene
• Customer behavior Scene
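The capacity model step can be illustrated with load-test samples of QPS against CPU utilization. This is a rough sketch under assumptions: the linear relationship, the function names, and the 0.6 `cpu_limit` are placeholders. In particular, `cpu_limit` only stands in for the "water level" named on the slide, whose exact definition the talk does not give.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def linear_fit(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    slope = cov / var_x
    return slope, my - slope * mx


def instances_needed(qps_samples, cpu_samples, target_qps, cpu_limit=0.6):
    """Estimate the instance count that keeps per-instance CPU under cpu_limit."""
    slope, intercept = linear_fit(qps_samples, cpu_samples)
    max_qps_per_instance = (cpu_limit - intercept) / slope
    return math.ceil(target_qps / max_qps_per_instance)
```

A strong correlation from `pearson` is the check that QPS is a usable capacity indicator for that resource; `target_qps` would come from the business objectives for the event scenario.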


Practice of Elastic Container Infrastructure

Global resource management (Matrix@Baidu)

Online Runtime Platform (ORP@Baidu)

Baidu Auto Scaling Strategy (BAS@Baidu)

Containers in Datacenters
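The slide names an auto-scaling strategy but does not describe its rule. As a generic illustration only, and not a description of BAS, a proportional scaling rule can be sketched as follows; the function name, the 0.5 utilization target, and the replica bounds are assumptions.

```python
import math


def desired_replicas(current, utilization, target=0.5, min_replicas=2, max_replicas=100):
    """Proportional rule: size the fleet so average utilization approaches target."""
    if current <= 0:
        return min_replicas
    wanted = math.ceil(current * utilization / target)
    # Clamp to the allowed range so a bad reading cannot scale to zero or without bound.
    return max(min_replicas, min(max_replicas, wanted))
```

The same rule covers both directions: utilization above the target scales out, and utilization below it scales in.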


3. Practice of Monitoring


Monitoring ALL Events During Shopping Event

[Diagram labels]
• Levels: Country, Region, Cluster, Network, Rack, Machine, Product, Subsystem, Module, Service, Instance
• Products and platforms: passport, Nuomi, nmq, redis, mysql
• Deploys: Deploy marketing strategy, Continuous Deployment
• Event types: Infrastructure Failure Events, Configuration Events, Deploy Events, Dependent Platform Events, Key Indicators Abnormal Events
• Key indicators: Latency, Traffic, Errors, Saturation
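Monitoring every event type across every level suggests a single event record that all sources can emit. The sketch below is an assumption about how such a record might look; the field names, the `scope` path, and the helper functions are hypothetical and not taken from the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float   # seconds since epoch
    kind: str          # e.g. "deploy", "configuration", "infrastructure_failure"
    scope: tuple       # path from coarse to fine, e.g. ("CN", "north", "cluster-a")
    detail: str = ""


def events_under(events, prefix):
    """Return events whose scope path starts with the given prefix."""
    prefix = tuple(prefix)
    return [e for e in events if e.scope[:len(prefix)] == prefix]


def count_by_kind(events):
    """Summarize how many events of each kind occurred."""
    return Counter(e.kind for e in events)
```

With one record shape, an operator can ask the same question at any level, for example all events under one region or one cluster.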


Time-varying events

Event-Flow Graph

[Graph labels: Key Indicators Abnormal Events, Deploy Event, Configuration Event]
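One use of placing events on a shared timeline is to find the changes that happened shortly before an abnormal indicator. The following is a minimal sketch of that lookup; the function name, the tuple layout, and the 600-second window are assumptions for illustration.

```python
def suspects(anomaly_time, change_events, window_seconds=600):
    """List change events that happened shortly before an anomaly, newest first.

    change_events is a list of (timestamp, kind, detail) tuples, where kind
    might be "deploy" or "configuration".
    """
    start = anomaly_time - window_seconds
    recent = [e for e in change_events if start <= e[0] <= anomaly_time]
    return sorted(recent, key=lambda e: e[0], reverse=True)
```

Ordering newest first reflects the common heuristic that the most recent change is the first one worth checking.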


4. Practice of Multi-homed Systems


Multi-homed Systems

[Diagram 1: Latency < 10 ms]
• Unit 1: Service, DB master, Redis master
• Unit 2: Service, DB slave, Redis slave
• Unit 3: Service, DB slave, Redis slave
• Customer traffic labels: WR/WR, RW

[Diagram 2: Latency > 30 ms]
• Unit 1: Service, DB master, Redis master, Customer R&W
• Unit 2: Service, DB master, Redis master, Customer R&W
• Unit 3: Service, DB master, Redis master, Customer R&W
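The two layouts on this slide differ in where writes may land: a single master shared by all units, or a master in every unit. The routing sketch below illustrates that difference under assumptions. The talk does not say how customers are mapped to units, so the CRC32 hash, the function names, and the `unitized` flag are hypothetical.

```python
import zlib

UNITS = ["unit-1", "unit-2", "unit-3"]   # names follow the slide


def home_unit(customer_id, units=UNITS):
    """Pin a customer to one unit with a stable hash (assumed sharding rule)."""
    return units[zlib.crc32(str(customer_id).encode("utf-8")) % len(units)]


def route(customer_id, operation, nearest_unit, unitized, master_unit="unit-1"):
    """Pick the unit that should serve a request.

    unitized=True  : every unit has its own masters, so reads and writes
                     both stay in the customer's home unit.
    unitized=False : one master unit takes all writes; reads stay local.
    """
    if unitized:
        return home_unit(customer_id)
    return master_unit if operation == "write" else nearest_unit
```

In the unitized layout no request has to cross units, which matters when the latency between units is high.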


5. Practice of Process Management


Standardization

Before Event, During Event, After Event

• Marketing Forecast
• Capacity Planning
• Failure Drill
• Deploy Restrict
• On-call in Operation Center
• Postmortem
• Data Review
• Auto-Scaling (Out)
• Auto-Scaling (In)
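A step such as "Deploy Restrict" is easier to standardize when it is enforced by a check rather than by convention. The sketch below shows one possible guard; the function name, the freeze-window parameters, and the emergency override are assumptions, not the process described in the talk.

```python
from datetime import datetime


def deploy_allowed(now, freeze_start, freeze_end, emergency=False):
    """Block routine deploys inside the event window; allow flagged emergencies."""
    in_freeze = freeze_start <= now < freeze_end
    return emergency or not in_freeze
```

A deployment pipeline could call this check before each rollout, passing the event window agreed on during planning.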


What we learned

Traffic Control
• Both overload protection and graceful degradation are important because shopping events are highly unpredictable

Capacity Management
• Scenario-Based
• Considering business objectives and SLOs
• Collecting data of hardware (utilization) & software (SLOs)
• Using historical request data

Monitoring
• Monitoring all events in shopping event
• Building an event-flow graph

Multi-homed Systems
• Unitized

Process Management
• Standardization
• On-call in operation center


Thank You!

“This is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.”

- Winston Churchill