Resilient Design 101
Avishai Ish-Shalom
github.com/avishai-ish-shalom · @nukemberg · [email protected]

Resilient Design 101 (JeeConf 2017)



Page 1: Resilient Design 101 (JeeConf 2017)

Resilient design 101

Avishai Ish-Shalom

github.com/avishai-ish-shalom · @nukemberg · [email protected]

Page 2: Resilient Design 101 (JeeConf 2017)

Wix in numbers

~ 500 Engineers~ 1500 employees

~ 100M users

~ 500 micro services

Lithuania

Ukraine

Vilnius

Kyiv

Dnipro

Wix Engineering Locations

Israel

Tel-Aviv

Be’er Sheva

Page 3: Resilient Design 101 (JeeConf 2017)

Queues

01

Page 4: Resilient Design 101 (JeeConf 2017)

Queues are everywhere!

▪ Futures/Executors

▪ Sockets

▪ Locks (DB Connection pools)

▪ Callbacks in node.js/Netty

Anything async?!

Page 5: Resilient Design 101 (JeeConf 2017)

Queues

▪ Incoming load (arrival rate)

▪ Service from the queue (service rate)

▪ Service discipline (FIFO/LIFO/Priority)

▪ Latency = Wait time + Service time

▪ Service time independent of queue

Page 6: Resilient Design 101 (JeeConf 2017)

It varies

▪ Arrival rate fluctuates

▪ Service times fluctuate

▪ Delays accumulate

▪ Idle time wasted

Queues are almost always full or near-empty!

Page 7: Resilient Design 101 (JeeConf 2017)

Capacity & Latency

▪ Latency (and queue size) rises to infinity as utilization approaches 1

▪ For decent QoS, keep ρ well below 0.75

▪ Decent latency requires over-provisioned capacity

ρ = arrival rate / service rate (utilization)

Page 8: Resilient Design 101 (JeeConf 2017)

Implications

Infinite queues:

▪ Memory pressure / OOM

▪ High latency

▪ Stale work

Always limit queue size!

Work item TTL*
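One way to cap a queue on the JVM is a bounded `ArrayBlockingQueue` whose non-blocking `offer` rejects work when full. A minimal sketch (the class and item names here are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class BoundedIngest {
    // Hard cap on queue size: memory stays bounded, staleness stays bounded
    static final ArrayBlockingQueue<String> QUEUE = new ArrayBlockingQueue<>(100);

    // Non-blocking offer: returns false when the queue is full, so the
    // caller can shed the work or redirect it instead of queueing forever
    static boolean submit(String workItem) {
        return QUEUE.offer(workItem);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 150; i++) {
            if (!submit("item-" + i)) {
                System.out.println("shedding from item-" + i); // item-100 onwards
                break;
            }
        }
    }
}
```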

Page 9: Resilient Design 101 (JeeConf 2017)

Latency & Service time

λ = wait time
σ = service time
ρ = utilization

Page 10: Resilient Design 101 (JeeConf 2017)

Utilization fluctuates!

▪ A 10% fluctuation at ρ = 0.5 hardly affects latency (~ 1.1x)

▪ A 10% fluctuation at ρ = 0.9 will kill you (~ 10x latency)

▪ Be careful when overloading resources

▪ During peak load we must be extra careful

▪ Highly varied load must be capped
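The blow-up near saturation can be seen with the M/M/1 queueing factor 1/(1 - ρ); real systems are messier, but the shape holds:

```java
public class UtilizationDemo {
    // M/M/1 approximation: time in system = service time * 1 / (1 - rho)
    static double latencyFactor(double rho) {
        return 1.0 / (1.0 - rho);
    }

    public static void main(String[] args) {
        // The same 10% relative bump at low vs high utilization:
        System.out.printf("rho 0.50 -> 0.55: %.2fx -> %.2fx%n",
                latencyFactor(0.50), latencyFactor(0.55)); // 2.00x -> 2.22x
        System.out.printf("rho 0.90 -> 0.99: %.2fx -> %.2fx%n",
                latencyFactor(0.90), latencyFactor(0.99)); // 10.00x -> 100.00x
    }
}
```

At ρ = 0.5 the bump barely registers; at ρ = 0.9 the same bump multiplies latency tenfold.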

Page 11: Resilient Design 101 (JeeConf 2017)

Practical advice

▪ Use chokepoints (throttling/load shedding)

▪ Plan for low utilization of slow resources

Example

Resource             Latency   Planned Utilization
RPC thread pool      1ms       0.75
DB connection pool   10ms      0.5

Page 12: Resilient Design 101 (JeeConf 2017)

Backpressure

▪ Internal queues fill up and cause latency

▪ Front layer will continue sending traffic

▪ We need to inform the client that we’re out of capacity

▪ E.g.: blocking clients, HTTP 503, finite queues for thread pools

Page 13: Resilient Design 101 (JeeConf 2017)

Backpressure

▪ Blocking code has backpressure by default

▪ Executors, remote calls and async code need explicit backpressure

▪ E.g. producer/consumer through Kafka
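For a JVM executor, one way to get explicit backpressure is a bounded work queue plus `CallerRunsPolicy`, which makes a saturated pool run the task on the submitting thread and so slows the producer down. A sketch (pool and queue sizes are illustrative):

```java
import java.util.concurrent.*;

public class BackpressurePool {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4,                                     // fixed pool of 4 threads
                0, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(16),             // bounded queue
                new ThreadPoolExecutor.CallerRunsPolicy() // backpressure on saturation
        );
        for (int i = 0; i < 100; i++) {
            final int n = i;
            // When the queue is full, execute() runs the task on THIS thread,
            // naturally throttling the producer to the pool's pace
            pool.execute(() -> doWork(n));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    static void doWork(int n) {
        try { Thread.sleep(5); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```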

Page 14: Resilient Design 101 (JeeConf 2017)

Load shedding

▪ A tradeoff between latency and error rate

▪ Cap the queue size / throttle arrival rate

▪ Reject excess work or send to fallback service

Example: Facebook uses LIFO queue and rejects stale work

http://queue.acm.org/detail.cfm?id=2839461
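The Facebook approach can be sketched as a LIFO stack with a TTL check on dequeue; the class and TTL value below are illustrative, not Facebook's actual code:

```java
import java.util.ArrayDeque;

public class LifoShed {
    record Item(String payload, long enqueuedAtMillis) {}

    static final long TTL_MILLIS = 100;
    static final ArrayDeque<Item> STACK = new ArrayDeque<>();

    // LIFO: serve the newest request first; drop anything older than the
    // TTL, since its caller has most likely timed out already
    static Item take(long nowMillis) {
        while (!STACK.isEmpty()) {
            Item item = STACK.pop();
            if (nowMillis - item.enqueuedAtMillis() <= TTL_MILLIS) {
                return item;
            }
            // else: stale work, silently dropped
        }
        return null;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        STACK.push(new Item("stale", now - 500));
        STACK.push(new Item("fresh", now));
        System.out.println(take(now).payload()); // fresh
    }
}
```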

Page 15: Resilient Design 101 (JeeConf 2017)

Thread Pools

02

Page 16: Resilient Design 101 (JeeConf 2017)

Jetty architecture

Thread pool (QTP)

Socket

Acceptor thread

Page 17: Resilient Design 101 (JeeConf 2017)

Too many threads

▪ O/S also has a queue

▪ Threads take memory, FDs, etc.

▪ What about shared resources?

Bad QoS, GC storms, ungraceful degradation

Not enough threads

▪ Work will queue up

▪ Not enough RUNNING threads

High latency, low resource utilization

Page 18: Resilient Design 101 (JeeConf 2017)

Capacity/Latency tradeoffs: optimizing for latency

For low latency, resources must be available when needed

Keep the queue empty

▪ Block or apply backpressure

▪ Keep the queue small

▪ Overprovision

Page 19: Resilient Design 101 (JeeConf 2017)

Capacity/Latency tradeoffs: optimizing for capacity

For max capacity, resources must always have work waiting

Keep the queue full

▪ We use a large queue to buffer work

▪ Queueing increases latency

▪ Queue size >> concurrency

Page 20: Resilient Design 101 (JeeConf 2017)

How many threads?

▪ Assuming CPU is the limiting resource

▪ Compute by maximal load (opt. latency)

▪ With a Grid: How many cores???

Java Concurrency in Practice (http://jcip.net/)

Page 21: Resilient Design 101 (JeeConf 2017)

How many threads? How to compute:

▪ Transaction time = W + C

▪ C ~ Total CPU time / throughput

▪ U ~ 0.5 – 0.7 (account for O/S, JVM, GC - and 0.75 utilization target)

▪ Memory and other resource limits
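Java Concurrency in Practice gives the sizing formula N_threads = N_cpu · U_target · (1 + W/C). A minimal sketch (the example numbers are made up):

```java
public class PoolSizer {
    // JCiP: threads = cores * target CPU utilization * (1 + wait time / compute time)
    static int poolSize(int nCpu, double targetUtilization,
                        double waitMillis, double computeMillis) {
        return (int) Math.ceil(nCpu * targetUtilization * (1 + waitMillis / computeMillis));
    }

    public static void main(String[] args) {
        // 8 cores, U = 0.5, each request waits 50ms on I/O per 5ms of CPU
        System.out.println(poolSize(8, 0.5, 50, 5)); // 44
    }
}
```

The more time a request spends waiting relative to computing, the more threads it takes to keep the cores busy.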

Page 22: Resilient Design 101 (JeeConf 2017)

What about async servers?

Page 23: Resilient Design 101 (JeeConf 2017)

Async servers architecture

Socket

Event loop

epoll

Callbacks

O/S

Syscalls

Page 24: Resilient Design 101 (JeeConf 2017)

Async systems

▪ Event loop callback/handler queue

▪ The callback queue is unlimited (!!!)

▪ Event loop can block (ouch)

▪ No inherent concurrency limit

▪ No backpressure (*)

Page 25: Resilient Design 101 (JeeConf 2017)

Async systems - overload

▪ No preemption -> no QoS

▪ No backpressure -> overload

▪ Hard to tune

▪ Hard to limit concurrency/queue size

▪ Hard to debug

Page 26: Resilient Design 101 (JeeConf 2017)

So what’s the point?

▪ High concurrency

▪ More control

▪ I/O heavy servers

Still evolving…. let’s revisit in a few years?

Page 27: Resilient Design 101 (JeeConf 2017)

Little’s Law

03

Page 28: Resilient Design 101 (JeeConf 2017)

Little’s law

▪ Holds for all distributions

▪ For “stable” systems

▪ Holds for systems and their subsystems

▪ “Throughput” is either Arrival rate or Service rate depending on the context.

Be careful!

L = λ⋅W

L = Avg clients in the system
λ = Avg throughput
W = Avg latency

Page 29: Resilient Design 101 (JeeConf 2017)

Using Little’s law

▪ How many requests queued inside the system?

▪ Verifying load tests / benchmarks

▪ Calculating latency when no direct measurement is possible

Go watch Gil Tene’s "How NOT to Measure Latency"

Read Benchmarking Blunders and Things That Go Bump in the Night
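As a sanity check on a benchmark, the three quantities in Little's law must agree. A minimal sketch with made-up numbers:

```java
public class LittlesLaw {
    // L = lambda * W: avg requests in flight = throughput * avg latency
    static double inFlight(double throughputPerSec, double latencySec) {
        return throughputPerSec * latencySec;
    }

    public static void main(String[] args) {
        // A benchmark claiming 2000 req/s at 50ms average latency implies
        // ~100 in-flight requests; if the tool only opened 10 connections,
        // the reported numbers cannot all be right.
        System.out.println(inFlight(2000, 0.050));
    }
}
```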

Page 30: Resilient Design 101 (JeeConf 2017)

Timeouts

04

Page 31: Resilient Design 101 (JeeConf 2017)

How not to timeout

People use arbitrary timeout values

▪ DB timeout > Overall transaction timeout

▪ Cache timeout > DB latency

▪ Huge unrealistic timeouts

▪ Refusing to return errors

P.S: connection timeout, read timeout & transaction timeout are not the same thing

Page 32: Resilient Design 101 (JeeConf 2017)

Deciding on timeouts

Use the distribution, Luke!

▪ Resources/Errors tradeoff

▪ Cumulative distribution chart

▪ Watch out for multiple modes

▪ Context, context, context

Page 33: Resilient Design 101 (JeeConf 2017)

Timeouts should be derived from real world constraints!

Page 34: Resilient Design 101 (JeeConf 2017)

UX numbers every developer needs to know

▪ Smooth motion perception threshold: ~ 20ms

▪ Immediate reaction threshold: ~ 100ms

▪ Delay perception threshold: ~ 300ms

▪ Focus threshold: ~ 1sec

▪ Frustration threshold: ~ 10sec

Google's RAIL model
UX powers of 10

Page 35: Resilient Design 101 (JeeConf 2017)

Hardware latency numbers every developer needs to know

▪ SSD disk seek: ~ 0.15ms

▪ Magnetic disk seek: ~ 10ms

▪ Round trip within same datacenter: ~ 0.5ms

▪ Packet roundtrip US->EU->US: ~ 150ms

▪ Send 1MB over a typical user WAN: ~ 1sec

Latency numbers every developer needs to know (updated)

Page 36: Resilient Design 101 (JeeConf 2017)

Timeout Budgets

▪ Decide on global timeouts

▪ Pass context object

▪ Each stage decrements budget

▪ Local timeouts according to budget

▪ If the budget is too low, terminate preemptively

Think microservices

Example

Global: 500ms

Stage             Used    Budget   Timeout
Authorization     6ms     494ms    100ms
Data fetch (DB)   123ms   371ms    200ms
Processing        47ms    324ms    371ms
Rendering         89ms    235ms    324ms
Audit             2ms     -        -
Filter            10ms    223ms    233ms
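A minimal budget-passing sketch; the `Deadline` class here is hypothetical, not a specific library:

```java
public class Deadline {
    private final long deadlineNanos;

    Deadline(long budgetMillis) {
        // Global budget fixed once at the edge, then passed along as context
        this.deadlineNanos = System.nanoTime() + budgetMillis * 1_000_000L;
    }

    // Each stage reads the remaining budget before starting work
    long remainingMillis() {
        return Math.max(0, (deadlineNanos - System.nanoTime()) / 1_000_000L);
    }

    public static void main(String[] args) {
        Deadline d = new Deadline(500); // global 500ms budget
        long remaining = d.remainingMillis();
        if (remaining == 0) {
            // Budget exhausted: terminate preemptively instead of doing doomed work
            throw new IllegalStateException("timeout budget exhausted");
        }
        // Local timeout = min(the stage's own cap, whatever budget is left)
        long stageTimeout = Math.min(200, remaining);
        System.out.println("stage timeout: " + stageTimeout + "ms");
    }
}
```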

Page 37: Resilient Design 101 (JeeConf 2017)

The debt buyer

▪ Transactions may return eventually after timeout

▪ Does the client really have to wait?

▪ Timeout and return error/default response to client (50ms)

▪ Keep waiting asynchronously (1 sec)

Can’t be used when client is expecting data back

Page 38: Resilient Design 101 (JeeConf 2017)

Questions?

github.com/avishai-ish-shalom · @nukemberg · [email protected]

Page 39: Resilient Design 101 (JeeConf 2017)

Thank You

github.com/avishai-ish-shalom · @nukemberg · [email protected]