How NOT to Measure Latency, Gil Tene, London, Oct. 2013

DESCRIPTION

Time is money. Understanding application responsiveness and latency is critical not only for delivering good application behavior but also for maintaining profitability and containing risk. But good characterization of bad data is useless. When measurements of response time present false or misleading latency information, even the best analysis can lead to wrong operational decisions and poor application experience. This presentation discusses common pitfalls encountered in measuring and characterizing latency. It demonstrates and discusses some false assumptions and measurement techniques that lead to dramatically incorrect reporting and covers ways to do a sanity check and correct such situations.

Page 1: How NOT to Measure Latency

An attempt to confer wisdom...

Gil Tene, CTO & co-Founder, Azul Systems

©2013 Azul Systems, Inc.

Page 2: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

©2013 Azul Systems, Inc.

High level agenda

Page 3: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

©2013 Azul Systems, Inc.

High level agendaSome latency behavior background

Page 4: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

©2013 Azul Systems, Inc.

High level agendaSome latency behavior background

The pitfalls of using “statistics”

Page 5: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

©2013 Azul Systems, Inc.

High level agendaSome latency behavior background

The pitfalls of using “statistics”

Latency “philosophy” questions

Page 6: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

©2013 Azul Systems, Inc.

High level agendaSome latency behavior background

The pitfalls of using “statistics”

Latency “philosophy” questions

The Coordinated Omission Problem

Page 7: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

©2013 Azul Systems, Inc.

High level agendaSome latency behavior background

The pitfalls of using “statistics”

Latency “philosophy” questions

The Coordinated Omission Problem

Some useful tools

Page 8: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

©2013 Azul Systems, Inc.

High level agendaSome latency behavior background

The pitfalls of using “statistics”

Latency “philosophy” questions

The Coordinated Omission Problem

Some useful tools

Use tools for bragging

Pages 9-15: About me: Gil Tene

co-founder, CTO @Azul Systems

Have been working on “think different” GC approaches since 2002

Created Pauseless & C4 core GC algorithms (Tene, Wolf)

A long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc...

JCP EC Member...

* working on real-world trash compaction issues, circa 2004

Pages 16-24: About Azul

We make scalable Virtual Machines

Have built “whatever it takes to get job done” since 2002

3 generations of custom SMP Multi-core HW (Vega)

Zing: Pure software for commodity x86

Known for Low Latency, Consistent execution, and Large data set excellence

Pages 25-30: Common fallacies

Computers run application code continuously

Response time can be measured as work units/time

Response time exhibits a normal distribution

“Glitches” or “Semi-random omissions” in measurement don’t have a big effect.

Wrong! Wrong! Wrong! Wrong!

Pages 31-34: A classic look at response time behavior

Response time as a function of load

source: IBM CICS server documentation, “understanding response times”

Average? Max? Median? 90%? 99.9%?

Pages 35-42: Response time over time

When we measure behavior over time, we often see:

source: ZOHO QEngine White Paper: performance testing report analysis

[Screenshot of the source’s response-time charts. Excerpt: “Response time is one of the most important characteristics of load testing. Response time reports and graphs measure the web user experience, indicating how long the user waits for the server to respond to a request: the time taken, in seconds, to receive the full response from the server, including images, scripts, and stylesheets. The Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test. From this graph it is possible to identify peaks in response time over a period of time, and to identify issues when the server is continuously loaded for a long duration.”]

“Hiccups”

Pages 43-45: “Hiccups”

[Chart: “Hiccups by Time Interval”: Hiccup Duration (msec), 0-9000, vs. Elapsed Time (sec), 0-180; series: Max per Interval, 99%, 99.90%, 99.99%, Max]

What happened here?

Source: Gil running an idle program and suspending it five times in the middle
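The hiccup charts in this deck come from observing platform stalls directly, rather than measuring any application work. A minimal sketch of such a “hiccup meter”, in the spirit of Gil’s jHiccup tool (an illustration, not jHiccup’s actual code): a thread tries to wake every millisecond and records by how much it overslept.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Minimal "hiccup meter" sketch: a thread tries to sleep for a fixed
// interval and records how much longer than that interval it actually took.
// Any excess is a platform-induced stall (GC pause, scheduler delay, etc.)
// that ANY thread running at that moment would have experienced.
public class HiccupMeter implements Runnable {
    static final long INTERVAL_NS = TimeUnit.MILLISECONDS.toNanos(1);

    @Override
    public void run() {
        long next = System.nanoTime() + INTERVAL_NS;
        while (!Thread.currentThread().isInterrupted()) {
            long now = System.nanoTime();
            if (now < next) {
                LockSupport.parkNanos(next - now);
            }
            long woke = System.nanoTime();
            record(Math.max(0, woke - next)); // time beyond intended wakeup
            // Because "next" advances by a fixed interval, a long stall is
            // followed by a burst of catch-up samples, one per missed
            // interval: the coordination-free record a real, steady request
            // stream would have produced.
            next += INTERVAL_NS;
        }
    }

    void record(long hiccupNs) {
        // In practice: record into a histogram; here, just log large stalls.
        if (hiccupNs > TimeUnit.MILLISECONDS.toNanos(10)) {
            System.out.printf("hiccup: %.1f msec%n", hiccupNs / 1e6);
        }
    }

    public static void main(String[] args) {
        new Thread(new HiccupMeter(), "hiccup-meter").start();
    }
}
```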

Pages 46-48: The real world (a low latency example)

[Charts: “Hiccups by Time Interval” (Hiccup Duration (msec), 0-25, vs. Elapsed Time (sec), 0-600) and “Hiccups by Percentile Distribution” (Hiccup Duration (msec) vs. Percentile, 0% to 99.999%); Max = 22.656 msec]

99%'ile is ~60 usec

Max is ~30,000% higher than “typical”

Pages 49-56: Hiccups are [typically] strongly multi-modal

They don’t look anything like a normal distribution

They usually look like periodic freezes

A complete shift from one mode/behavior to another

Mode A: “good”.

Mode B: “Somewhat bad”

Mode C: “terrible”, ...

....

Pages 57-59: Common ways people deal with hiccups

Averages and Standard Deviation

Always Wrong!

Pages 60-62: Better ways people can deal with hiccups: Actually measuring percentiles

Telco App Example

[Charts: “Hiccups by Time Interval” (Hiccup Duration (msec), 0-140, vs. Elapsed Time (sec), 0-2500) and “Hiccups by Percentile Distribution” (Hiccup Duration (msec) vs. Percentile, 0% to 99.9999%); series: Hiccups by Percentile, SLA]

Response Time Percentile plot line
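Producing a percentile distribution like the one plotted above means recording every result, not just a running average. A minimal sketch, assuming an offline sorted-sample approach (real measurement code would typically use a histogram such as HdrHistogram instead of keeping and sorting raw samples):

```java
import java.util.Arrays;

// Minimal percentile reporting sketch: record every latency, then read
// percentiles off the sorted results.
public class PercentileReport {
    public static double percentile(long[] sortedNanos, double pct) {
        int idx = (int) Math.ceil((pct / 100.0) * sortedNanos.length) - 1;
        return sortedNanos[Math.max(0, idx)] / 1e6; // msec
    }

    public static void main(String[] args) {
        long[] latenciesNanos = recordedLatencies(); // from the test run
        Arrays.sort(latenciesNanos);
        for (double pct : new double[] {50, 90, 99, 99.9, 99.99, 100}) {
            System.out.printf("%7.2f%%'ile: %10.3f msec%n",
                    pct, percentile(latenciesNanos, pct));
        }
    }

    static long[] recordedLatencies() {
        // Placeholder data: in a real run these come from the load test log.
        java.util.Random r = new java.util.Random(42);
        long[] a = new long[100_000];
        for (int i = 0; i < a.length; i++) {
            a[i] = (long) (1e6 + r.nextInt(1_000_000)); // ~1-2 msec
        }
        return a;
    }
}
```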

Page 63: Requirements

Why we measure latency and response times to begin with...

Pages 64-70: Latency tells us how long something took

But what do we WANT the latency to be?

What do we want the latency to BEHAVE like?

Latency requirements are usually a PASS/FAIL test of some predefined criteria

Different applications have different needs

Requirements should reflect application needs

Measurements should provide data to evaluate requirements

Pages 71-76: The Olympics, aka “ring the bell first”

Goal: Get gold medals

Need to be faster than everyone else at SOME races

Ok to be slower in some, as long as fastest at some (the average speed doesn’t matter)

Ok to not even finish or compete (the worst case and 99%'ile don’t matter)

Different strategies can apply. E.g. compete in only 3 races to not risk burning out, or compete in 8 races in hope of winning two

Pages 77-82: Pacemaker, aka “hard” real time

Goal: Keep heart beating

Need to never be slower than X

“Your heart will keep beating 99.9% of the time” is not reassuring

Having a good average and a nice standard deviation don’t matter or help

The worst case is all that matters

Pages 83-89: “Low Latency” Trading, aka “soft” real time

Goal A: Be fast enough to make some good plays

Goal B: Contain risk and exposure while making plays

E.g. want to “typically” react within 200 usec.

But can’t afford to hold an open position for 20 msec, or react to 30 msec stale information

So we want a very good “typical” (median, 50%'ile)

But we also need a reasonable Max, or 99.99%'ile

Pages 90-95: Interactive applications, aka “squishy” real time

Goal: Keep users happy enough to not complain/leave

Need to have “typically snappy” behavior

Ok to have occasional longer times, but not too high, and not too often

Example: 90% of responses should be below 0.2 sec, 99% should be below 0.5 sec, 99.9% should be better than 2 seconds. And a >10 second response should never happen.

Remember: A single user may have 100s of interactions per session...

Pages 96-106: Establishing Requirements: an interactive interview (or thought) process

Q: What are your latency requirements?

A: We need an avg. response of 20 msec

Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?

A: We don’t have one

Q: So it’s ok for some things to take more than 5 hours?

A: No way in H%%&!

Q: So I’ll write down “5 hours worst case...”

A: No. That’s not what I said. Make that “nothing worse than 100 msec”

Q: Are you sure? Even if it’s only two times a day?

A: Ok... Make it “nothing worse than 2 seconds...”

Pages 107-116: Establishing Requirements: an interactive interview (or thought) process (continued)

Q: Ok. So we need a typical of 20 msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response?

A: (Annoyed) I thought you said only a few times a day

Q: That was for the worst case. But if half the results are better than 20 msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level?

A: Oh. Let’s see. We have to be better than 50 msec 90% of the time, or we’ll be losing money even when we are fast the rest of the time. We need to be better than 500 msec 99.9% of the time, or our customers will complain and go elsewhere

Now we have a service level expectation (checked mechanically in the sketch below):

50% better than 20 msec

90% better than 50 msec

99.9% better than 500 msec

100% better than 2 seconds
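Once requirements take this form, they become a mechanical pass/fail test against measured percentiles. A minimal sketch, assuming the PercentileReport helper from the earlier sketch; the thresholds below are the hypothetical ones from this interview:

```java
// Minimal pass/fail SLA check sketch: compare measured percentiles against
// the service level expectation arrived at in the interview above.
public class SlaCheck {
    // {percentile, max allowed msec} - hypothetical thresholds
    static final double[][] SLA = {
            {50.0,    20.0},
            {90.0,    50.0},
            {99.9,   500.0},
            {100.0, 2000.0},
    };

    public static boolean passes(long[] sortedLatenciesNanos) {
        boolean ok = true;
        for (double[] req : SLA) {
            double measuredMsec =
                    PercentileReport.percentile(sortedLatenciesNanos, req[0]);
            boolean pass = measuredMsec <= req[1];
            System.out.printf("%7.1f%%'ile: %10.3f msec (limit %8.1f) %s%n",
                    req[0], measuredMsec, req[1], pass ? "PASS" : "FAIL");
            ok &= pass;
        }
        return ok;
    }
}
```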

Page 117: Latency does not live in a vacuum

Pages 118-122: Remember this? How much load can this system handle?

[Annotations on the classic response-time-vs-load curve:]

What the marketing benchmarks will say

Where users complain

Where the sysadmin is willing to go

Sustainable Throughput Level

Pages 123-125: Sustainable Throughput: The throughput achieved while safely maintaining service levels

[Annotation: Unsustainable Throughput]

Page 126: Comparing behavior under different throughputs and/or configurations

[Chart: “Duration by Percentile Distribution”: Duration (msec), 0-120, vs. Percentile, 0% to 99.999%; series: Setup A, Setup B, Setup C, Setup D, Setup E, Setup F]

Page 127: Comparing latency behavior under different throughputs, configurations

[Chart: “Latency by Percentile Distribution”: Latency (milliseconds), 0-90, vs. Percentile, 0% to 99.999%; series: HotSpot 1K, 5K, 10K, 15K; Zing 1K, 5K, 10K, 15K; Required Service Level]

Pages 128-129: Instance capacity test: “Fat Portal”. HotSpot CMS: Peaks at ~3GB / 45 concurrent users

Pages 130-132: Instance capacity test: “Fat Portal”. C4: still smooth @ 800 concurrent users

* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times

Page 133: The coordinated omission problem

An accidental conspiracy...

Pages 134-136: The coordinated omission problem

Common Example A (load testing): build/buy a load tester to measure system behavior

each “client” issues requests one by one at a certain rate

measure and log response time for each request

results log used to produce histograms, percentiles, etc.

So what’s wrong with that? (see the sketch below)

works well only when all responses fit within the rate interval

the technique includes an implicit “automatic backoff” and coordination

But the requirements are interested in random, uncoordinated requests
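A minimal sketch of the Example A pattern (the request call and rate are hypothetical). Because each simulated client waits for a response before issuing its next request, a stalled system silently slows the tester down with it, and the stall leaves almost no trace in the results log:

```java
// Sketch of a coordinating load-test client (the Example A anti-pattern).
// One thread issues requests back-to-back at a target rate and logs each
// response time. When the system stalls, this loop stalls WITH it: it stops
// issuing requests, so only ONE long sample is recorded per stall, instead
// of the many delayed requests a real, uncoordinated user population would
// have experienced.
public class NaiveLoadClient {
    static final long INTERVAL_MS = 10; // hypothetical: 100 requests/sec

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long start = System.nanoTime();
            issueRequestAndAwaitResponse();        // blocks during a stall
            long latencyNs = System.nanoTime() - start;
            log(latencyNs);                        // one sample per request
            // "Backoff" is implicit: the next request cannot start until
            // this one finished, so the intended rate quietly degrades.
            Thread.sleep(INTERVAL_MS);
        }
    }

    static void issueRequestAndAwaitResponse() { /* hypothetical system call */ }
    static void log(long latencyNs) { /* append to results log */ }
}
```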

Pages 137-139: The coordinated omission problem

Common Example B (monitoring): the system monitors and records each transaction latency

Latency measured between start and end of each operation

keeps observed latency stats of some sort (log, histogram, etc.)

So what’s wrong with that? (see the sketch below)

works correctly only when no queuing occurs

Long operations only get measured once

delays outside of the timing window do not get measured at all

queued operations are measured wrong
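A minimal sketch of the Example B monitoring pattern (the handler and recorder are hypothetical). The timestamps only bracket the work itself, so any time spent stalled or queued before the timing window opens is never observed:

```java
// Sketch of the Example B monitoring anti-pattern. The timing window opens
// only once the operation actually starts running. If the process was
// frozen (GC pause, etc.) or the request sat in a queue before onRequest()
// was entered, that delay falls OUTSIDE the window and is never recorded;
// a 100-second stall during which thousands of requests would have arrived
// produces at most one long measurement.
public class MonitoredService {
    void onRequest(Request req) {
        long start = System.nanoTime(); // window opens here, not at arrival
        handle(req);                    // the measured work
        long latencyNs = System.nanoTime() - start;
        recordLatency(latencyNs);       // stats see only in-window time
    }

    void handle(Request req) { /* hypothetical application work */ }
    void recordLatency(long ns) { /* histogram / log */ }
    static class Request {}
}
```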

Pages 140-141: Common Example B: Coordinated Omission in Monitoring Code

Long operations only get measured once

delays outside of the timing window do not get measured at all

Pages 142-148: How bad can this get?

[Timeline: the system easily handles 100 requests/sec, responding to each in 1 msec; then the system is stalled for 100 sec of elapsed time]

How would you characterize this system?

Avg. is 1 msec over the 1st 100 sec

Avg. is 50 sec over the next 100 sec

Overall average response time is ~25 sec

~50%'ile is 1 msec, ~75%'ile is 50 sec, 99.99%'ile is ~100 sec

Pages 149-153: Measurement in practice

[Same timeline, as recorded by a coordinating load tester]

Naïve Characterization: 10,000 results @ 1 msec, 1 result @ 100 seconds

99.99%'ile is 1 msec! (should be ~100 sec)

Average is 10.9 msec! (should be ~25 sec)

Std. Dev. is 0.99 sec!

Page 158: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Proper measurement

[Same timeline diagram. A correct characterization has 10,000 results @ 1 msec each, plus 10,000 results varying linearly from 100 sec down to 10 msec: the requests the stall prevented from being issued.]

~50%'ile is 1 msec, ~75%'ile is 50 sec, 99.99%'ile is ~100 sec.

The gap between the naive and the proper measurement is Coordinated Omission.
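To make the arithmetic above concrete, here is a minimal plain-Java sketch (hypothetical, written for this example) that computes both characterizations of the stalled system:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class StallArithmetic {
        public static void main(String[] args) {
            // Naive view: 10,000 results @ 1 msec, plus the one request that
            // happened to be in flight during the 100 sec stall.
            List<Long> naive = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) naive.add(1L);
            naive.add(100_000L);
            report("Naive ", naive);  // avg ~11 msec, 99.99%'ile = 1 msec

            // Proper view: also count the 10,000 results the stall suppressed,
            // varying linearly from 100 sec down to 10 msec.
            List<Long> proper = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) proper.add(1L);
            for (int i = 0; i < 10_000; i++) proper.add(100_000L - 10L * i);
            report("Proper", proper); // avg ~25 sec, 99.99%'ile ~100 sec
        }

        static void report(String label, List<Long> msec) {
            Collections.sort(msec);
            double avg = msec.stream().mapToLong(Long::longValue).average().orElse(0);
            long p = msec.get((int) Math.ceil(msec.size() * 0.9999) - 1);
            System.out.printf("%s: avg = %.1f msec, 99.99%%'ile = %d msec%n", label, avg, p);
        }
    }

The naive numbers average the stall almost entirely away, while the proper numbers yield the ~25 sec average and ~100 sec 99.99%'ile that actually describe this system.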

Page 159: How NOT to Measure Latency, Gil Tene, London, Oct. 2013


The coordinated omission problem

It is MUCH more common than you may think...

Page 160: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

JMeter makes this mistake... (so do others)

[Charts: latency by percentile distribution, before correction vs. after correcting for omission.]

Page 165: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

The "real" world

[JMeter results screenshot.] Results were collected by a single client thread. 26.182 seconds represents 1.29% of the total time, so the 99%'ile MUST be at least 0.29% of total time (1.29% - 1%), which would be 5.9 seconds.

The reported 99%'ile was wrong by a factor of 1,000x.

Page 169: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

The "real" world

A world record SPECjEnterprise2010 result: 305.197 seconds represents 8.4% of the timing run, and the max is 762 (!!!) standard deviations away from the mean.

Page 172: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Real World Coordinated Omission effects

[Charts: Latency by Percentile Distribution (0 to 9000 msec, percentiles 0% to 99.999%), before correction vs. after correction. Max = 7995.39 msec in both, but the uncorrected curve under-reports the high percentiles: wrong by 7x.]

Page 176: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Real World Coordinated Omission effects

[Charts: Duration by Percentile Distribution (0 to 600,000 usec, percentiles 0% to 99.9999%). "HotSpot Base" shows the uncorrected data; "HotSpot Corrected1" and "HotSpot Corrected2" show the same data corrected for Coordinated Omission.]

Page 180: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Real World Coordinated Omission effects (Why I care)

[Chart: Duration by Percentile Distribution (0 to 600,000 usec) with six curves: HotSpot Base/Corrected1/Corrected2 (the "other" JVM) and Zing Base/Corrected1/Corrected2 (Zing).]

A ~2500x difference in reported percentile levels for the problem that Zing eliminates.

Page 187: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Suggestions

Whatever your measurement technique is, TEST IT.

Run your measurement method against an artificial system that creates hypothetical pause scenarios. See if your reported results agree with how you would describe that system's behavior. (A sketch of such a sanity check follows below.)

Don't waste time analyzing until you establish sanity.

Don't EVER use or derive from std. deviation.

ALWAYS measure Max time. Consider what it means... Be suspicious.

Measure %'iles. Lots of them.
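As a starting point, a hypothetical sketch of such a sanity check: drive a fake workload that responds in 1 msec but contains one deliberate 2 sec stall, and verify that the reported Max and high percentiles say what you would say about that system:

    import org.HdrHistogram.Histogram;

    public class MeasurementSanityCheck {
        public static void main(String[] args) throws InterruptedException {
            final long expectedIntervalMsec = 1L;
            Histogram histogram = new Histogram(3_600_000L, 3); // msec, up to 1 hour

            for (int i = 0; i < 10_000; i++) {
                long start = System.currentTimeMillis();
                fakeRequest(i == 5_000); // inject one known 2 sec pause
                long latency = System.currentTimeMillis() - start;
                // Record with CO compensation so the requests the stall
                // suppressed are accounted for as well.
                histogram.recordValueWithExpectedInterval(latency, expectedIntervalMsec);
            }

            // Sanity: we KNOW a 2 sec stall happened, so Max must be ~2000 msec
            // and the high percentiles must reflect it. If they don't, fix the
            // measurement setup before analyzing anything.
            System.out.println("Max:       " + histogram.getMaxValue() + " msec");
            System.out.println("99.9%'ile: " + histogram.getValueAtPercentile(99.9) + " msec");
        }

        static void fakeRequest(boolean stall) throws InterruptedException {
            Thread.sleep(stall ? 2_000L : 1L); // a 1 msec "system" with one pause
        }
    }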

Page 188: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Some Tools

Page 189: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

HdrHistogram

Page 190: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

[Chart: Latency by Percentile Distribution, 0 to 160 msec, percentiles 0% to 99.9999%, Max = 137.472 msec.]

HdrHistogram: if you want to be able to produce graphs like this, you need both good dynamic range and good resolution.

Page 194: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

HdrHistogram background

Goal: collect data for good latency characterization... including acceptable precision at and between varying percentile levels.

Existing alternatives: record all data and analyze later (e.g. sort and get the 99.9%'ile), or record in traditional histograms.

Traditional histograms use linear bins, logarithmic bins, or arbitrary bins. Linear requires lots of storage to cover the range with good resolution. Logarithmic covers a wide range but has terrible precision. Arbitrary is... arbitrary, and works only when you have a good feel for the interesting parts of the value range.

Page 199: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

HdrHistogram: A High Dynamic Range Histogram

Covers a configurable dynamic value range, at configurable precision (expressed as a number of significant digits).

For example: track values between 1 microsecond and 1 hour, with 3 decimal points of resolution.

Built-in [optional] compensation for Coordinated Omission.

Open Source. On github, released to the public domain (Creative Commons CC0).
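A minimal sketch of that example configuration using HdrHistogram's Java API (the values recorded here are made up for illustration):

    import org.HdrHistogram.Histogram;
    import java.util.concurrent.TimeUnit;

    public class HdrHistogramExample {
        public static void main(String[] args) {
            // Track values between 1 usec and 1 hour, at 3 significant digits.
            Histogram histogram = new Histogram(TimeUnit.HOURS.toMicros(1), 3);

            histogram.recordValue(1_234L); // a 1.234 msec response, in usec

            // Optional Coordinated Omission compensation: given an expected
            // interval of 1,000 usec between samples, recording a 100,000 usec
            // result also back-fills the samples a stall would have swallowed.
            histogram.recordValueWithExpectedInterval(100_000L, 1_000L);

            System.out.println("Max:        " + histogram.getMaxValue() + " usec");
            System.out.println("99.99%'ile: " + histogram.getValueAtPercentile(99.99) + " usec");
        }
    }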

Page 202: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

HdrHistogram: fixed cost in both space and time

Built with "latency sensitive" applications in mind: recording values does not allocate or grow any data structures; recording uses a fixed computation to determine the location (no searches, no variability in recording cost, FAST); even iterating through a histogram can be done with no allocation.

Internals work like a "floating point" data structure, with an "exponent" and a "mantissa". The exponent determines which "mantissa bucket" to use; mantissa buckets provide a linear value range for a given exponent, each with enough linear entries to support the required precision.

Page 207: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

HdrHistogram

Provides tools for iteration: linear, logarithmic, percentile.

Supports percentile iterators, practical due to the high dynamic range.

Convenient percentile output: 10% intervals between 0 and 50%, 5% intervals between 50% and 75%, 2.5% intervals between 75% and 87.5%, ...

Very useful for feeding percentile distribution graphs...
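For example, a short sketch of feeding such a graph (synthetic data; the 1000.0 scaling ratio prints recorded usec values as msec):

    import org.HdrHistogram.Histogram;
    import org.HdrHistogram.HistogramIterationValue;

    public class PercentileOutputExample {
        public static void main(String[] args) {
            Histogram histogram = new Histogram(3_600_000_000L, 3);
            for (int i = 0; i < 100_000; i++) {
                histogram.recordValue(1_000L + (i % 50)); // synthetic usec values
            }

            // Prints the halving percentile intervals described above, ready
            // for a percentile distribution graph (usec scaled to msec).
            histogram.outputPercentileDistribution(System.out, 1000.0);

            // Or walk percentile levels programmatically, 5 ticks per half distance.
            for (HistogramIterationValue v : histogram.percentiles(5)) {
                System.out.printf("%9.5f%%'ile: %d usec%n",
                        v.getPercentileLevelIteratedTo(), v.getValueIteratedTo());
            }
        }
    }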

Page 208: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

[Chart: an HdrHistogram-fed Latency by Percentile Distribution graph, Max = 137.472 msec.]

Page 209: How NOT to Measure Latency, Gil Tene, London, Oct. 2013


jHiccup

Page 210: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Discontinuities in Java platform execution

[Charts: "Hiccups by Time Interval" (Max per Interval, 99%, 99.90%, 99.99%, Max; 0 to 1800 msec over 1800 sec of elapsed time) and "Hiccups by Percentile Distribution", Max = 1665.024 msec.]

Page 214: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

jHiccup: a tool for capturing and displaying platform hiccups

Records any observed non-continuity of the underlying platform and plots the results in a simple, consistent format.

Simple and non-intrusive. As simple as adding jHiccup.jar as a java agent:

% java -javaagent:jHiccup.jar myApp myflags

or attaching jHiccup to a running process:

% jHiccup -p <pid>

Adds a background thread that samples time @ 1000/sec into an HdrHistogram.

Open Source. Released to the public domain.
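The sampling idea is simple enough to sketch. A hedged illustration of the mechanism described above, not jHiccup's actual source:

    import org.HdrHistogram.Histogram;
    import java.util.concurrent.TimeUnit;

    public class HiccupMeterSketch {
        public static void main(String[] args) throws InterruptedException {
            // Hiccups from 1 usec up to 1 hour, 3 significant digits.
            Histogram histogram = new Histogram(TimeUnit.HOURS.toMicros(1), 3);
            final long intervalNsec = 1_000_000L; // 1 msec => ~1000 samples/sec

            while (!Thread.currentThread().isInterrupted()) {
                long start = System.nanoTime();
                TimeUnit.NANOSECONDS.sleep(intervalNsec);
                // Whatever we observe beyond the requested sleep is a platform hiccup.
                long hiccupUsec = Math.max(0,
                        (System.nanoTime() - start - intervalNsec) / 1_000);
                // Compensate for Coordinated Omission: a long stall also swallowed
                // the samples this thread would have taken while it was stalled.
                histogram.recordValueWithExpectedInterval(hiccupUsec, intervalNsec / 1_000);
            }
        }
    }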

Page 220: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Telco App Example

[Charts: "Hiccups by Time Interval" (Max per Interval, 99%, 99.90%, 99.99%, Max; 0 to 140 msec over 2500 sec) and "Hiccups by Percentile Distribution". Callouts: Max Time per interval; Hiccup duration at percentile levels; Optional SLA plotting.]

Page 221: How NOT to Measure Latency, Gil Tene, London, Oct. 2013


Fun with jHiccup

Page 224: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Oracle HotSpot CMS, 1GB in an 8GB heap vs. Zing 5, 1GB in an 8GB heap

[Charts, drawn to scale on the same axes: "Hiccups by Time Interval" and "Hiccups by Percentile Distribution" over ~3500 sec. HotSpot CMS: Max = 13156.352 msec. Zing: Max = 20.384 msec.]

Page 225: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Good for both "squishy" real time (human response times) and "soft" real time (low latency software systems).

Page 229: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Low latency trading application

[Charts, drawn to scale on the same axes: "Hiccups by Time Interval" and "Hiccups by Percentile Distribution" over 600 sec. Oracle HotSpot (pure newgen): Max = 22.656 msec. Zing: Max = 1.568 msec.]

Page 230: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Shameless bragging

Page 236: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Zing: a JVM for Linux/x86 servers

ELIMINATES Garbage Collection as a concern for enterprise applications.

Very wide operating range: used in both low latency and large scale enterprise application spaces.

Decouples scale metrics from response time concerns: transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.

Page 241: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

What is Zing good for?

If you have a server-based Java application, and you are running on Linux, and you are using more than ~300MB of memory, then Zing will likely deliver superior behavior metrics.

Page 247: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Where Zing shines

Low latency: eliminates behavior blips down to the sub-millisecond-units level.

Machine-to-machine "stuff": supports higher *sustainable* throughput (the kind that meets SLAs).

Human response times: eliminates user-annoying response time blips. Multi-second and even fraction-of-a-second blips will be completely gone. Supports larger memory JVMs *if needed* (e.g. larger virtual user counts, larger caches, in-memory state, or consolidating multiple instances).

"Large" data and in-memory analytics.

Page 252: How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Takeaways

Standard Deviation and application latency should never show up on the same page...

If you haven't stated percentiles and a Max, you haven't specified your requirements.

Measuring throughput without latency behavior is [usually] meaningless.

Mistakes in measurement/analysis can cause orders-of-magnitude errors and lead to bad business decisions.