Resilient Design 101
Avishai Ish-Shalom
github.com/avishai-ish-shalom · @nukemberg · [email protected]

Resilient Design 101 (JeeConf 2017)



Page 1: Resilient Design 101 (JeeConf 2017)

Resilient design 101

Avishai Ish-Shalom

github.com/avishai-ish-shalom · @nukemberg · [email protected]

Page 2: Resilient Design 101 (JeeConf 2017)

Wix in numbers

~ 500 Engineers~ 1500 employees

~ 100M users

~ 500 micro services

Lithuania

Ukraine

Vilnius

Kyiv

Dnipro

Wix Engineering Locations

Israel

Tel-Aviv

Be’er Sheva

Page 3: Resilient Design 101 (JeeConf 2017)

Queues

01

Page 4: Resilient Design 101 (JeeConf 2017)

Queues are everywhere!

▪ Futures/Executors

▪ Sockets

▪ Locks (DB Connection pools)

▪ Callbacks in node.js/Netty

Anything async?!

Page 5: Resilient Design 101 (JeeConf 2017)

Queues

▪ Incoming load (arrival rate)

▪ Service from the queue (service rate)

▪ Service discipline (FIFO/LIFO/Priority)

▪ Latency = Wait time + Service time

▪ Service time independent of queue

Page 6: Resilient Design 101 (JeeConf 2017)

It varies

▪ Arrival rate fluctuates

▪ Service times fluctuate

▪ Delays accumulate

▪ Idle time wasted

Queues are almost always full or near-empty!

Page 7: Resilient Design 101 (JeeConf 2017)

Capacity & Latency

▪ Latency (and queue size) rises to infinity as utilization approaches 1

▪ For decent QoS, keep ρ well below 0.75

▪ Decent latency requires over-provisioned capacity

ρ = arrival rate / service rate (utilization)

Page 8: Resilient Design 101 (JeeConf 2017)

Implications

Infinite queues:

▪ Memory pressure / OOM

▪ High latency

▪ Stale work

Always limit queue size!

Work item TTL*
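One way to cap a queue on the JVM is a bounded `ArrayBlockingQueue` whose non-blocking `offer` rejects work when full. A minimal sketch (the class and item names here are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class BoundedIngest {
    // Hard cap on queue size: memory stays bounded, staleness stays bounded
    static final ArrayBlockingQueue<String> QUEUE = new ArrayBlockingQueue<>(100);

    // Non-blocking offer: returns false when the queue is full, so the
    // caller can shed the work or redirect it instead of queueing forever
    static boolean submit(String workItem) {
        return QUEUE.offer(workItem);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 150; i++) {
            if (!submit("item-" + i)) {
                System.out.println("shedding from item-" + i); // item-100 onwards
                break;
            }
        }
    }
}
```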

Page 9: Resilient Design 101 (JeeConf 2017)

Latency & Service time

λ = wait time
σ = service time
ρ = utilization

Page 10: Resilient Design 101 (JeeConf 2017)

Utilization fluctuates!

▪ A 10% fluctuation at ρ = 0.5 hardly affects latency (~ 1.1x)

▪ A 10% fluctuation at ρ = 0.9 will kill you (~ 10x latency)

▪ Be careful when overloading resources

▪ During peak load we must be extra careful

▪ Highly varied load must be capped
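The blow-up near saturation can be seen with the M/M/1 queueing factor 1/(1 - ρ); real systems are messier, but the shape holds:

```java
public class UtilizationDemo {
    // M/M/1 approximation: time in system = service time * 1 / (1 - rho)
    static double latencyFactor(double rho) {
        return 1.0 / (1.0 - rho);
    }

    public static void main(String[] args) {
        // The same 10% relative bump at low vs high utilization:
        System.out.printf("rho 0.50 -> 0.55: %.2fx -> %.2fx%n",
                latencyFactor(0.50), latencyFactor(0.55)); // 2.00x -> 2.22x
        System.out.printf("rho 0.90 -> 0.99: %.2fx -> %.2fx%n",
                latencyFactor(0.90), latencyFactor(0.99)); // 10.00x -> 100.00x
    }
}
```

At ρ = 0.5 the bump barely registers; at ρ = 0.9 the same bump multiplies latency tenfold.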

Page 11: Resilient Design 101 (JeeConf 2017)

Practical advice

▪ Use chokepoints (throttling/load shedding)

▪ Plan for low utilization of slow resources

Example

Resource             Latency   Planned Utilization
RPC thread pool      1ms       0.75
DB connection pool   10ms      0.5

Page 12: Resilient Design 101 (JeeConf 2017)

Backpressure

▪ Internal queues fill up and cause latency

▪ Front layer will continue sending traffic

▪ We need to inform the client that we’re out of capacity

▪ E.g.: blocking clients, HTTP 503, finite queues for thread pools

Page 13: Resilient Design 101 (JeeConf 2017)

Backpressure

▪ Blocking code has backpressure by default

▪ Executors, remote calls and async code need explicit backpressure

▪ E.g. producer/consumer through Kafka
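For a JVM executor, one way to get explicit backpressure is a bounded work queue plus `CallerRunsPolicy`, which makes a saturated pool run the task on the submitting thread and so slows the producer down. A sketch (pool and queue sizes are illustrative):

```java
import java.util.concurrent.*;

public class BackpressurePool {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4,                                     // fixed pool of 4 threads
                0, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(16),             // bounded queue
                new ThreadPoolExecutor.CallerRunsPolicy() // backpressure on saturation
        );
        for (int i = 0; i < 100; i++) {
            final int n = i;
            // When the queue is full, execute() runs the task on THIS thread,
            // naturally throttling the producer to the pool's pace
            pool.execute(() -> doWork(n));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    static void doWork(int n) {
        try { Thread.sleep(5); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```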

Page 14: Resilient Design 101 (JeeConf 2017)

Load shedding

▪ A tradeoff between latency and error rate

▪ Cap the queue size / throttle arrival rate

▪ Reject excess work or send to fallback service

Example: Facebook uses LIFO queue and rejects stale work

http://queue.acm.org/detail.cfm?id=2839461
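The Facebook approach can be sketched as a LIFO stack with a TTL check on dequeue; the class and TTL value below are illustrative, not Facebook's actual code:

```java
import java.util.ArrayDeque;

public class LifoShed {
    record Item(String payload, long enqueuedAtMillis) {}

    static final long TTL_MILLIS = 100;
    static final ArrayDeque<Item> STACK = new ArrayDeque<>();

    // LIFO: serve the newest request first; drop anything older than the
    // TTL, since its caller has most likely timed out already
    static Item take(long nowMillis) {
        while (!STACK.isEmpty()) {
            Item item = STACK.pop();
            if (nowMillis - item.enqueuedAtMillis() <= TTL_MILLIS) {
                return item;
            }
            // else: stale work, silently dropped
        }
        return null;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        STACK.push(new Item("stale", now - 500));
        STACK.push(new Item("fresh", now));
        System.out.println(take(now).payload()); // fresh
    }
}
```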

Page 15: Resilient Design 101 (JeeConf 2017)

Thread Pools

02

Page 16: Resilient Design 101 (JeeConf 2017)

Jetty architecture

Thread pool (QTP)

Socket

Acceptor thread

Page 17: Resilient Design 101 (JeeConf 2017)

Too many threads

▪ O/S also has a queue

▪ Threads take memory, FDs, etc.

▪ What about shared resources?

Bad QoS, GC storms, ungraceful degradation

Not enough threads

▪ Work will queue up

▪ Not enough RUNNING threads

High latency, low resource utilization

Page 18: Resilient Design 101 (JeeConf 2017)

Capacity/Latency tradeoffs: optimizing for latency

For low latency, resources must be available when needed

Keep the queue empty

▪ Block or apply backpressure

▪ Keep the queue small

▪ Overprovision

Page 19: Resilient Design 101 (JeeConf 2017)

Capacity/Latency tradeoffs: optimizing for capacity

For max capacity, resources must always have work waiting

Keep the queue full

▪ We use a large queue to buffer work

▪ Queueing increases latency

▪ Queue size >> concurrency

Page 20: Resilient Design 101 (JeeConf 2017)

How many threads?

▪ Assuming CPU is the limiting resource

▪ Compute by maximal load (opt. latency)

▪ With a Grid: How many cores???

Java Concurrency in Practice (http://jcip.net/)

Page 21: Resilient Design 101 (JeeConf 2017)

How many threads? How to compute:

▪ Transaction time = W + C

▪ C ~ Total CPU time / throughput

▪ U ~ 0.5 – 0.7 (account for O/S, JVM, GC - and 0.75 utilization target)

▪ Memory and other resource limits
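Java Concurrency in Practice gives the sizing formula N_threads = N_cpu · U_target · (1 + W/C). A minimal sketch (the example numbers are made up):

```java
public class PoolSizer {
    // JCiP: threads = cores * target CPU utilization * (1 + wait time / compute time)
    static int poolSize(int nCpu, double targetUtilization,
                        double waitMillis, double computeMillis) {
        return (int) Math.ceil(nCpu * targetUtilization * (1 + waitMillis / computeMillis));
    }

    public static void main(String[] args) {
        // 8 cores, U = 0.5, each request waits 50ms on I/O per 5ms of CPU
        System.out.println(poolSize(8, 0.5, 50, 5)); // 44
    }
}
```

The more time a request spends waiting relative to computing, the more threads it takes to keep the cores busy.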

Page 22: Resilient Design 101 (JeeConf 2017)

What about async servers?

Page 23: Resilient Design 101 (JeeConf 2017)

Async servers architecture

Socket

Event loop

epoll

Callbacks

O/S

Syscalls

Page 24: Resilient Design 101 (JeeConf 2017)

Async systems

▪ Event loop callback/handler queue

▪ The callback queue is unlimited (!!!)

▪ Event loop can block (ouch)

▪ No inherent concurrency limit

▪ No backpressure (*)

Page 25: Resilient Design 101 (JeeConf 2017)

Async systems - overload

▪ No preemption -> no QoS

▪ No backpressure -> overload

▪ Hard to tune

▪ Hard to limit concurrency/queue size

▪ Hard to debug

Page 26: Resilient Design 101 (JeeConf 2017)

So what’s the point?

▪ High concurrency

▪ More control

▪ I/O heavy servers

Still evolving…. let’s revisit in a few years?

Page 27: Resilient Design 101 (JeeConf 2017)

Little’s Law

03

Page 28: Resilient Design 101 (JeeConf 2017)

Little’s law

▪ Holds for all distributions

▪ For “stable” systems

▪ Holds for systems and their subsystems

▪ “Throughput” is either Arrival rate or Service rate depending on the context.

Be careful!

L = λ⋅W

L = Avg clients in the system
λ = Avg throughput
W = Avg latency

Page 29: Resilient Design 101 (JeeConf 2017)

Using Little’s law

▪ How many requests queued inside the system?

▪ Verifying load tests / benchmarks

▪ Calculating latency when no direct measurement is possible

Go watch Gil Tene’s "How NOT to Measure Latency"

Read Benchmarking Blunders and Things That Go Bump in the Night
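As a sanity check on a benchmark, the three quantities in Little's law must agree. A minimal sketch with made-up numbers:

```java
public class LittlesLaw {
    // L = lambda * W: avg requests in flight = throughput * avg latency
    static double inFlight(double throughputPerSec, double latencySec) {
        return throughputPerSec * latencySec;
    }

    public static void main(String[] args) {
        // A benchmark claiming 2000 req/s at 50ms average latency implies
        // ~100 in-flight requests; if the tool only opened 10 connections,
        // the reported numbers cannot all be right.
        System.out.println(inFlight(2000, 0.050));
    }
}
```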

Page 30: Resilient Design 101 (JeeConf 2017)

Timeouts

04

Page 31: Resilient Design 101 (JeeConf 2017)

How not to timeout

People use arbitrary timeout values

▪ DB timeout > Overall transaction timeout

▪ Cache timeout > DB latency

▪ Huge unrealistic timeouts

▪ Refusing to return errors

P.S: connection timeout, read timeout & transaction timeout are not the same thing

Page 32: Resilient Design 101 (JeeConf 2017)

Deciding on timeouts

Use the distribution, Luke!

▪ Resources/Errors tradeoff

▪ Cumulative distribution chart

▪ Watch out for multiple modes

▪ Context, context, context

Page 33: Resilient Design 101 (JeeConf 2017)

Timeouts should be derived from real world constraints!

Page 34: Resilient Design 101 (JeeConf 2017)

UX numbers every developer needs to know

▪ Smooth motion perception threshold: ~ 20ms

▪ Immediate reaction threshold: ~ 100ms

▪ Delay perception threshold: ~ 300ms

▪ Focus threshold: ~ 1sec

▪ Frustration threshold: ~ 10sec

Google's RAIL model
UX powers of 10

Page 35: Resilient Design 101 (JeeConf 2017)

Hardware latency numbers every developer needs to know

▪ SSD disk seek: ~ 0.15ms

▪ Magnetic disk seek: ~ 10ms

▪ Round trip within same datacenter: ~ 0.5ms

▪ Packet roundtrip US->EU->US: ~ 150ms

▪ Send 1MB over a typical user WAN: ~ 1sec

Latency numbers every developer needs to know (updated)

Page 36: Resilient Design 101 (JeeConf 2017)

Timeout Budgets

▪ Decide on global timeouts

▪ Pass context object

▪ Each stage decrements budget

▪ Local timeouts according to budget

▪ If the budget is too low, terminate preemptively

Think microservices

Example

Global: 500ms

Stage             Used    Budget   Timeout
Authorization     6ms     494ms    100ms
Data fetch (DB)   123ms   371ms    200ms
Processing        47ms    324ms    371ms
Rendering         89ms    235ms    324ms
Audit             2ms     -        -
Filter            10ms    223ms    233ms
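A minimal budget-passing sketch; the `Deadline` class here is hypothetical, not a specific library:

```java
public class Deadline {
    private final long deadlineNanos;

    Deadline(long budgetMillis) {
        // Global budget fixed once at the edge, then passed along as context
        this.deadlineNanos = System.nanoTime() + budgetMillis * 1_000_000L;
    }

    // Each stage reads the remaining budget before starting work
    long remainingMillis() {
        return Math.max(0, (deadlineNanos - System.nanoTime()) / 1_000_000L);
    }

    public static void main(String[] args) {
        Deadline d = new Deadline(500); // global 500ms budget
        long remaining = d.remainingMillis();
        if (remaining == 0) {
            // Budget exhausted: terminate preemptively instead of doing doomed work
            throw new IllegalStateException("timeout budget exhausted");
        }
        // Local timeout = min(the stage's own cap, whatever budget is left)
        long stageTimeout = Math.min(200, remaining);
        System.out.println("stage timeout: " + stageTimeout + "ms");
    }
}
```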

Page 37: Resilient Design 101 (JeeConf 2017)

The debt buyer

▪ Transactions may return eventually after timeout

▪ Does the client really have to wait?

▪ Timeout and return error/default response to client (50ms)

▪ Keep waiting asynchronously (1 sec)

Can’t be used when client is expecting data back

Page 38: Resilient Design 101 (JeeConf 2017)

Questions?

github.com/avishai-ish-shalom · @nukemberg · [email protected]

Page 39: Resilient Design 101 (JeeConf 2017)

Thank You

github.com/avishai-ish-shalom · @nukemberg · [email protected]