On the way to low latency Artem Orobets Smartling Inc

On the way to low latency (2nd edition)


Page 1: On the way to low latency (2nd edition)

On the way to low latency

Artem Orobets Smartling Inc

Page 2: On the way to low latency (2nd edition)

Long story short

We realized that latency is important for us

Our fabulous architecture was supposed to work, but it didn’t

The issues that we have faced on the way

Page 3: On the way to low latency (2nd edition)

Those guys consider 10µs latencies slow

We only have a 100 ms threshold

We are not a trading company

Page 4: On the way to low latency (2nd edition)

What is low latency?

Page 5: On the way to low latency (2nd edition)

Latency is a time interval between the stimulation and response

Page 6: On the way to low latency (2nd edition)

What is latency?

total response time = service time + time waiting for service

Page 7: On the way to low latency (2nd edition)

Why is it important?

• SLA

• Negative correlation with income

Page 8: On the way to low latency (2nd edition)

Latencies of about 50 ms are barely noticeable to humans

Page 9: On the way to low latency (2nd edition)

You mostly care about throughput

Page 10: On the way to low latency (2nd edition)

How to measure it?

Page 11: On the way to low latency (2nd edition)

Duration of a single test run

Page 12: On the way to low latency (2nd edition)

Average of test run durations

Page 13: On the way to low latency (2nd edition)
Page 14: On the way to low latency (2nd edition)

Quantiles of test run durations

(usually 95th, 99th percentiles)
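A minimal sketch of why quantiles say more than the average, assuming the nearest-rank definition of a percentile (the slides do not say which definition was used). The class name and the sample data are hypothetical.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over measured request durations. */
final class LatencyPercentiles {

    /**
     * Returns the smallest sample such that at least p percent of all
     * samples are less than or equal to it (nearest-rank method).
     */
    static long percentile(long[] durationsMs, double p) {
        if (durationsMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    static double average(long[] durationsMs) {
        return Arrays.stream(durationsMs).average().getAsDouble();
    }

    public static void main(String[] args) {
        // Hypothetical data: 98 fast requests and 2 slow ones.
        long[] samples = new long[100];
        Arrays.fill(samples, 2);
        samples[98] = 150;
        samples[99] = 900;

        System.out.println("avg  = " + average(samples));
        System.out.println("p95  = " + percentile(samples, 95));
        System.out.println("p99  = " + percentile(samples, 99));
        System.out.println("p100 = " + percentile(samples, 100));
    }
}
```

With this data the average lands between the fast and the slow requests and describes neither, while the 99th percentile exposes the slow tail directly.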

Page 15: On the way to low latency (2nd edition)

Latency is more difficult:

• to test

• to analyze

• to control

Page 16: On the way to low latency (2nd edition)

Design

Page 17: On the way to low latency (2nd edition)

Storage

Page 18: On the way to low latency (2nd edition)

* where latency is the 99th percentile

Page 19: On the way to low latency (2nd edition)
Page 20: On the way to low latency (2nd edition)

Context switch problem

In production we have about 4k connections

opened simultaneously

Page 21: On the way to low latency (2nd edition)

Context switch problem

• Thread per request doesn’t work

• Too much overhead on context switching

• Too much overhead on memory

Usually a thread takes from 256 KB to 1 MB of memory for its stack space!
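A back-of-the-envelope sketch of the memory cost, assuming one thread per open connection and the 4k connections mentioned earlier. The class and method names are hypothetical, and the result is reserved stack space, which the OS does not necessarily commit in full.

```java
/** Rough stack memory estimate for a thread-per-connection design. */
final class ThreadStackEstimate {

    /** Total reserved stack space in megabytes for the given thread count. */
    static long totalStackMb(int threads, int stackKbPerThread) {
        return (long) threads * stackKbPerThread / 1024;
    }

    public static void main(String[] args) {
        int connections = 4_000; // "about 4k connections" from the slides
        System.out.println("256 KB stacks: " + totalStackMb(connections, 256) + " MB");
        System.out.println("1 MB stacks:   " + totalStackMb(connections, 1024) + " MB");
    }
}
```

That is roughly 1 GB to 4 GB of stack space before a single request is served.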

Page 22: On the way to low latency (2nd edition)

Troubleshooting framework

1. Discovery.

2. Problem Reproduction.

3. Isolate the variables that relate directly to the problem.

4. Analyze your findings to determine the cause of the problem.

Page 23: On the way to low latency (2nd edition)

We have fixed a lot of things that we believed were the most problematic parts.

But they weren’t.

Page 24: On the way to low latency (2nd edition)

Find evidence that proves your hypothesis

Page 25: On the way to low latency (2nd edition)

A good tool can give you a clue

• Proper logging and log analysis tool

• Performance tests

• Monitoring

Page 26: On the way to low latency (2nd edition)

Performance benchmark

Throughput: 750 rps

Latency percentiles:

98.47% <= 2 ms
99.95% <= 10 ms
99.98% <= 16 ms
99.99% <= 17 ms
100.00% <= 18 ms

Page 27: On the way to low latency (2nd edition)

A good tool can give you a clue

Page 28: On the way to low latency (2nd edition)

A KPI is a necessity

Page 29: On the way to low latency (2nd edition)

Problem that we faced

Page 30: On the way to low latency (2nd edition)

Some requests take almost a second

And it seems it always happens after deploy

Page 31: On the way to low latency (2nd edition)

is so lazy

Page 32: On the way to low latency (2nd edition)

Smoke tests

• A good practice when you have continuous delivery

• It makes all your code initialized by the time real load comes in
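A minimal warm-up sketch, assuming the smoke test simply exercises the request path a number of times before real traffic arrives. `WarmUp`, `run` and the sample handler are hypothetical names, not part of the original system.

```java
import java.util.function.Supplier;

/** Exercises a handler repeatedly so lazy initialization happens before real load. */
final class WarmUp {

    /**
     * Calls the handler the given number of times and returns the duration
     * in nanoseconds of the first and the last call.
     */
    static long[] run(Supplier<?> handler, int iterations) {
        if (iterations < 1) {
            throw new IllegalArgumentException("iterations must be >= 1");
        }
        long first = 0;
        long last = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            handler.get(); // triggers class loading and static initializers on first use
            long elapsed = System.nanoTime() - start;
            if (i == 0) {
                first = elapsed;
            }
            last = elapsed;
        }
        return new long[] {first, last};
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real request handler.
        long[] times = run(() -> String.format("%s-%d", "order", 42), 1_000);
        System.out.println("first call: " + times[0] + " ns, last call: " + times[1] + " ns");
    }
}
```

The first call usually takes much longer than the last one, which is the cost that would otherwise be paid by the first real request after a deploy.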

Page 33: On the way to low latency (2nd edition)

Logging

Synchronous logging is not appropriate for an asynchronous application

Page 34: On the way to low latency (2nd edition)

log4j2: Asynchronous Loggers for Low-Latency Logging

http://logging.apache.org/log4j/2.x/manual/async.html
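The idea behind asynchronous logging can be sketched with the standard library alone. This is not log4j2's API or implementation (log4j2's async loggers use the LMAX Disruptor); it only shows the principle that the request thread enqueues a message and a background thread does the slow write. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustration only: request threads enqueue, one writer thread does the I/O. */
final class AsyncLogSketch implements AutoCloseable {

    // Unique sentinel instance, compared by identity to stop the writer.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> sink;
    private final Thread writer;

    AsyncLogSketch(List<String> sink) {
        this.sink = sink;
        this.writer = new Thread(this::drain, "log-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    /** Returns immediately; the caller never waits for the write. */
    void log(String message) {
        queue.add(message);
    }

    private void drain() {
        try {
            for (String m = queue.take(); m != STOP; m = queue.take()) {
                sink.add(m); // a real appender would do slow file or network I/O here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Flushes everything queued so far and stops the writer thread. */
    @Override
    public void close() throws InterruptedException {
        queue.add(STOP);
        writer.join();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> written = new ArrayList<>();
        try (AsyncLogSketch log = new AsyncLogSketch(written)) {
            log.log("request 1 handled");
            log.log("request 2 handled");
        }
        System.out.println(written);
    }
}
```

The unbounded queue keeps the sketch short; a production logger needs a bounded buffer and a policy for what happens when it fills up.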

Page 35: On the way to low latency (2nd edition)

Logging: sync vs async

Sync: 769.05 rps

98.47% <= 2 ms
99.95% <= 10 ms
99.98% <= 16 ms
99.99% <= 17 ms
100.00% <= 18 ms

Async: 1658 rps

98.85% <= 1 ms
99.95% <= 7 ms
99.98% <= 13 ms
99.99% <= 15 ms
100.00% <= 18 ms

Page 36: On the way to low latency (2nd edition)

Pauses of 50-150 ms

In the network, according to the logs

Page 37: On the way to low latency (2nd edition)

They disappear when I scroll through the logs via SSH

Page 38: On the way to low latency (2nd edition)
Page 39: On the way to low latency (2nd edition)

Any ideas?

Page 40: On the way to low latency (2nd edition)

TCP_NODELAY

Page 41: On the way to low latency (2nd edition)
Page 42: On the way to low latency (2nd edition)

Nagle's algorithm

• the "small packet problem"

• TCP packets have a 40 byte header (20 bytes for TCP, 20 bytes for IPv4)

• combining a number of small outgoing messages, and sending them all at once
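In Java, Nagle's algorithm is switched off per socket with `Socket.setTcpNoDelay(true)`, which sets the TCP_NODELAY option. A minimal sketch, assuming a plain blocking socket; the local listener exists only so the example runs without an external service, and the class name is hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Opens a client socket with Nagle's algorithm disabled. */
final class NoDelayExample {

    /** Connects to host:port and sets TCP_NODELAY on the new socket. */
    static Socket connectWithNoDelay(InetAddress host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        try {
            socket.setTcpNoDelay(true); // small writes are sent immediately
        } catch (IOException e) {
            socket.close();
            throw e;
        }
        return socket;
    }

    public static void main(String[] args) throws IOException {
        // Throwaway listener on an ephemeral loopback port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = connectWithNoDelay(server.getInetAddress(), server.getLocalPort())) {
            System.out.println("TCP_NODELAY enabled: " + client.getTcpNoDelay());
        }
    }
}
```

The trade-off is more, smaller packets on the wire, which is usually acceptable for request/response traffic where each message should leave at once.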

Page 43: On the way to low latency (2nd edition)

• Pauses ~100 ms every couple of hours

• During connection creation

• Doesn’t reproduce on a local setup

Page 44: On the way to low latency (2nd edition)

How to diagnose that?

Page 45: On the way to low latency (2nd edition)

tcpdump -i eth0

Page 46: On the way to low latency (2nd edition)

TCPDUMP

15:47:57.250119 IP (tos 0x0, ttl 64, id 44402, offset 0, flags [DF], proto TCP (6), length 569)
    192.168.3.131.58749 > 93.184.216.34.80: Flags [P.], cksum 0x76b5 (correct), seq 3847355529:3847356046, ack 3021125542, win 4096, options [nop,nop,TS val 848825338 ecr 1053000005], length 517: HTTP, length: 517
    GET / HTTP/1.1
    Host: example.com
    Connection: keep-alive
    …

Page 47: On the way to low latency (2nd edition)

TCPDUMP

15:58:32.009884 IP (tos 0x0, ttl 255, id 39809, offset 0, flags [none], proto UDP (17), length 63)
    192.168.3.131.56546 > 192.168.3.1.53: [udp sum ok] 52969+ A? www.google.com.ua. …

15:58:32.012844 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 127)
    192.168.3.1.53 > 192.168.3.131.56546: [udp sum ok] 52969 q: A? www.google.com.ua. …

Page 48: On the way to low latency (2nd edition)

DNS lookups

• After hours of looking through tcp dumps

• We have found that DNS lookups sometimes take more than 100ms
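Once DNS is a suspect, lookup time can also be measured from inside the JVM. A minimal sketch, assuming resolution goes through `InetAddress.getByName`; the class name and the default host are placeholders, and the slides do not say how the issue was eventually fixed.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Measures how long a host name lookup takes from inside the JVM. */
final class DnsLookupTimer {

    /** Resolves the host name and returns the elapsed time in microseconds. */
    static long timeLookupMicros(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress.getByName(host);
        return (System.nanoTime() - start) / 1_000;
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = args.length > 0 ? args[0] : "localhost"; // placeholder host
        for (int i = 0; i < 3; i++) {
            System.out.println(host + " lookup " + i + ": " + timeLookupMicros(host) + " us");
        }
    }
}
```

The JVM caches successful lookups for a period controlled by the `networkaddress.cache.ttl` security property, so repeated lookups of the same name are usually much faster than the first one; a slow lookup shows up when the cache entry is missing or has expired.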

Page 49: On the way to low latency (2nd edition)

How much time could GC take?

Page 50: On the way to low latency (2nd edition)
Page 51: On the way to low latency (2nd edition)
Page 52: On the way to low latency (2nd edition)

GC logging

• -Xloggc:path_to_log_file

• -XX:+PrintGCDetails

• -XX:+PrintGCDateStamps

• -XX:+PrintHeapAtGC

• -XX:+PrintTenuringDistribution

Page 53: On the way to low latency (2nd edition)

-XX:+PrintGCDetails

[GC (Allocation Failure) 260526.491: [ParNew

[Times: user=0.02 sys=0.00, real=0.01 secs]
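The `real=` value is the wall-clock pause, so it is the number to compare against a latency threshold. A minimal sketch that extracts it from a log line in the format shown above; the class name is hypothetical, and the pattern only covers this classic log format (JDK 9 and later use unified logging via `-Xlog:gc*` with a different layout).

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts the wall-clock pause from a "[Times: ...]" GC log line. */
final class GcPauseParser {

    private static final Pattern REAL = Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** Returns the real (wall-clock) pause in milliseconds, if the line has one. */
    static OptionalDouble realPauseMs(String line) {
        Matcher m = REAL.matcher(line);
        if (!m.find()) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(Double.parseDouble(m.group(1)) * 1000.0);
    }

    public static void main(String[] args) {
        String line = "[Times: user=0.02 sys=0.00, real=0.01 secs]";
        System.out.println("pause ms: " + realPauseMs(line));
    }
}
```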

Page 54: On the way to low latency (2nd edition)

-XX:+PrintHeapAtGC

Heap after GC invocations=43363 (full 3):

par new generation total 59008K, used 1335K

eden space 52480K, 0% used

from space 6528K, 20% used

to space 6528K, 0% used

concurrent mark-sweep generation total 2031616K, used 1830227K

Page 55: On the way to low latency (2nd edition)

-XX:+PrintTenuringDistribution

Desired survivor size 3342336 bytes, new threshold 2 (max 2)

- age 1: 878568 bytes, 878568 total

- age 2: 1616 bytes, 880184 total

: 53829K->1380K(59008K), 0.0083140 secs] 1884058K->1831609K(2090624K), 0.0084006 secs]

Page 56: On the way to low latency (2nd edition)

A large number of wrappers

Significant allocation pressure
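A small illustration of how wrappers create allocation pressure, assuming "wrappers" means boxed primitives such as `Long` (the slides do not say which wrappers were involved). The class and method names are hypothetical.

```java
/** Same arithmetic with a boxed and a primitive accumulator. */
final class BoxingPressure {

    /** Each iteration unboxes, adds, and boxes the result into a Long again. */
    static long sumBoxed(int n) {
        Long total = 0L;
        for (long i = 0; i < n; i++) {
            total += i; // may allocate a new Long once the value leaves the small-value cache
        }
        return total;
    }

    /** No objects are created: the accumulator stays a primitive. */
    static long sumPrimitive(int n) {
        long total = 0L;
        for (long i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("boxed:     " + sumBoxed(n));
        System.out.println("primitive: " + sumPrimitive(n));
    }
}
```

Both methods return the same result; the difference is the short-lived garbage the boxed version can feed into the young generation, although the JIT may eliminate part of it.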

Page 57: On the way to low latency (2nd edition)

~100ms GC pauses in logs

Page 58: On the way to low latency (2nd edition)

-XX:+UseConcMarkSweepGC

Page 59: On the way to low latency (2nd edition)

Note: CMS collector on young generation uses the same algorithm as that of the parallel collector.

Java GC documentation at oracle.com

* http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html

Page 60: On the way to low latency (2nd edition)

Too many live objects during young gen GC

• Minimize survivors

• Watch the tenuring threshold; you might need to tune it to tenure long-lived objects faster

• Reduce NewSize

• Reduce survivor spaces

Page 61: On the way to low latency (2nd edition)

Watch your GC

*time span is 2h
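Besides GC logs, collection counts and accumulated collection time can be read from inside the JVM through the standard `java.lang.management` API and fed into monitoring. A minimal sketch; the class name is hypothetical, and the slides do not say which monitoring tool produced the charts.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Reads GC counters from the running JVM. */
final class GcWatch {

    /** Sum of accumulated collection time over all collectors, in milliseconds. */
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
        System.out.println("total GC time: " + totalGcTimeMs() + " ms");
    }
}
```

Sampling these counters periodically and plotting the deltas gives a trend over hours without having to parse log files.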

Page 62: On the way to low latency (2nd edition)

Watch your GC

Page 63: On the way to low latency (2nd edition)

You should

• have a deeper understanding of the JVM, OS, hardware …

• be brave

Page 65: On the way to low latency (2nd edition)

http://tech.smartling.com/
