
BKK16-300 Benchmarking 102


Page 1: BKK16-300 Benchmarking 102

Benchmarking Best Practices 102

Presented by: Maxim Kuvyrkov
Date: March 9, 2016
Event: Linaro Connect BKK16
Session: BKK16-300

Page 2: BKK16-300 Benchmarking 102

Overview

● Revision (Benchmarking Best Practices 101)
● Reproducibility
● Reporting

Page 3: BKK16-300 Benchmarking 102

Revision

Page 4: BKK16-300 Benchmarking 102

Previously, in Benchmarking-101...

● Approach benchmarking as an experiment. Be scientific.
● Design the experiment in light of your goal.
● Repeatability:
  ○ Understand and control noise.
  ○ Use statistical methods to find truth in noise (see the sketch below).
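
As an illustration of that last point, here is a minimal Python sketch, using invented timings, of extracting a mean, standard deviation, and a rough 95% confidence interval from noisy repeated runs:

```python
import math
import statistics

# Invented timings (seconds) from repeated runs of a hypothetical benchmark.
run_times = [4.02, 3.98, 4.11, 4.05, 3.97, 4.08, 4.01, 4.04]

n = len(run_times)
mean = statistics.mean(run_times)
stdev = statistics.stdev(run_times)  # sample standard deviation

# Rough 95% confidence interval via the normal approximation; for small n,
# a t-distribution critical value would be more accurate than 1.96.
half_width = 1.96 * stdev / math.sqrt(n)

print(f"runs={n} mean={mean:.3f}s stdev={stdev:.3f}s "
      f"95% CI=[{mean - half_width:.3f}, {mean + half_width:.3f}]s")
```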

Page 5: BKK16-300 Benchmarking 102

And we briefly mentioned

● Reproducibility
● Reporting

So let’s talk some more about those.

Page 6: BKK16-300 Benchmarking 102

Reproducibility

Page 7: BKK16-300 Benchmarking 102

Reproducibility

An experiment is reproducible if external teams can run the same experiment over large periods of time and get commensurate (comparable) results.

Achieved if others can repeat what we did and get the same results as us, within the given confidence interval.

Page 8: BKK16-300 Benchmarking 102

From Repeatability to Reproducibility

We must log enough information that others can use it to repeat our experiments.

We have achieved reproducibility if they can get the same results, within the given confidence interval.

Page 9: BKK16-300 Benchmarking 102

Logging: Target

● CPU/SoC/Board
  ○ Revision, patch level, firmware version…
● Instance of the board
  ○ Is board 1 really identical to board 2?
● Kernel version and configuration
● Distribution

Page 10: BKK16-300 Benchmarking 102

Example: Target

Board: Juno r0
CPU: 2 * Cortex-A57 r0p0, 4 * Cortex-A53 r0p0
Firmware version: 0.11.3
Hostname: juno-01
Kernel: 3.16.0-4-generic #1 SMP
Distribution: Debian Jessie
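
A minimal sketch, in Python, of how such target details might be captured automatically at run time. The JSON log format and file name are assumptions, not part of any Linaro tooling:

```python
import json
import platform
import socket

target = {
    "hostname": socket.gethostname(),  # e.g. juno-01
    "kernel": platform.release(),      # e.g. 3.16.0-4-generic
    "machine": platform.machine(),     # e.g. aarch64
}

# On Linux, /proc/cpuinfo identifies the exact CPU revisions (r0p0 etc.).
try:
    with open("/proc/cpuinfo") as f:
        target["cpuinfo"] = f.read()
except OSError:
    pass  # not on Linux, or /proc unavailable

with open("target-log.json", "w") as f:
    json.dump(target, f, indent=2)
```

Board and firmware versions usually come from the board's management interface and may need to be recorded by hand.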

Page 11: BKK16-300 Benchmarking 102

Logging: Build

● Exact toolchain version
● Exact libraries used
● Exact benchmark source
● Build system (scripts, makefiles etc.)
● Full build log

Others should be able to acquire and rebuild all of these components.

Page 12: BKK16-300 Benchmarking 102

Example: Build

Toolchain: Linaro GCC 2015.04
CLI: -O2 -fno-tree-vectorize -DFOO
Libraries: libBar.so.1.3.2, git.linaro.org/foo/bar #8d30a2c508468bb534bb937bd488b18b8636d3b1
Benchmark: MyBenchmark, git.linaro.org/foo/mb #d00fb95a1b5dbe3a84fa158df872e1d2c4c49d06
Build System: abe, git.linaro.org/toolchain/abe #d758ec431131655032bc7de12c0e6f266d9723c2
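
A sketch of recording this kind of build provenance. The repository directories and the git_sha helper are hypothetical; the flag string is taken from the slide:

```python
import subprocess

def git_sha(repo_dir):
    """Return the exact commit a component was built from."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True).strip()

build_log = {
    # First line of `gcc --version` identifies the exact toolchain build.
    "toolchain": subprocess.check_output(
        ["gcc", "--version"], text=True).splitlines()[0],
    "cflags": "-O2 -fno-tree-vectorize -DFOO",
    "benchmark_sha": git_sha("MyBenchmark"),  # assumed local checkout
    "build_system_sha": git_sha("abe"),       # assumed local checkout
}
print(build_log)
```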

Page 13: BKK16-300 Benchmarking 102

Logging: Run-time Environment

● Environment variables
● Command-line options passed to benchmark
● Mitigation measures taken
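
A sketch of capturing that environment just before a run; the benchmark command line and the listed mitigation measures are illustrative:

```python
import json
import os
import shlex

cmd = ["./mybenchmark", "--iterations", "100"]  # hypothetical invocation

runtime_env = {
    "command": shlex.join(cmd),
    "env": dict(os.environ),  # every environment variable, verbatim
    "mitigations": [
        "cpufreq governor set to 'performance'",
        "benchmark pinned to core 0 with taskset",
    ],
}
with open("runtime-env.json", "w") as f:
    json.dump(runtime_env, f, indent=2)
```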

Page 14: BKK16-300 Benchmarking 102

Logging: Other

All of the above may need modification depending on what is being measured.
● Network-sensitive benchmarks may need details of network configuration
● IO-sensitive benchmarks may need details of storage devices
● And so on...

Page 15: BKK16-300 Benchmarking 102

Long Term Storage

All results should be stored with the information required for reproducibility.
Results should be kept for the long term:
● Someone may ask you for some information
● You may want to do some new analysis in the future
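
A sketch of archiving a run together with the reproducibility data from the previous slides. The directory layout is an assumption, and the file names reuse the hypothetical logs from the earlier sketches:

```python
import json
import time
from pathlib import Path

record = {
    "timestamp": time.strftime("%Y-%m-%dT%H-%M-%SZ", time.gmtime()),
    "target": json.load(open("target-log.json")),
    "runtime": json.load(open("runtime-env.json")),
    # Store the raw timings, not just summary statistics, so that new
    # analyses remain possible later.
    "results": {"run_times_s": [4.02, 3.98, 4.11, 4.05]},
}

archive = Path("results-archive")
archive.mkdir(exist_ok=True)
out = archive / f"mybenchmark-{record['timestamp']}.json"
out.write_text(json.dumps(record, indent=2))
```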

Page 16: BKK16-300 Benchmarking 102

Reporting

Page 17: BKK16-300 Benchmarking 102

Reporting

● Clear, concise reporting allows others to utilise benchmark results.

● Does not have to include all data required for reproducibility.

● But that data should be available.
● Do not assume too much reader knowledge.

○ Err on the side of over-explanation

Page 18: BKK16-300 Benchmarking 102

Reporting: Goal

Explain the goal of the experiment:
● What decision will it help you to make?
● What improvement will it allow you to deliver?

Explain the question that the experiment asks.
Explain how the answer to that question helps you to achieve the goal.

Page 19: BKK16-300 Benchmarking 102

Reporting

● Method: Sufficient high-level detail
  ○ Target, toolchain, build options, source, mitigation
● Limitations: Acknowledge and justify
  ○ What are the consequences for this experiment?
● Results: Discuss in context of goal
  ○ Co-locate data, graphs, discussion
  ○ Include units - numbers without units are useless
  ○ Include statistical data
  ○ Use the benchmark’s metrics

Page 20: BKK16-300 Benchmarking 102

Presentation of Results

Graphs are always useful.
Tables of raw data are also useful.
Statistical context is essential:
● Number of runs
● (Which) mean
● Standard deviation
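
A sketch of a results table that carries this context; benchmark names and timings are invented:

```python
import statistics

results = {  # invented raw timings, in seconds
    "A": [4.02, 3.98, 4.11, 4.05],
    "B": [5.40, 5.52, 5.47, 5.43],
}

# State the number of runs, which mean was used, and the deviation.
print(f"{'bench':<8}{'runs':>6}{'arith. mean (s)':>18}{'stdev (s)':>12}")
for name, times in results.items():
    print(f"{name:<8}{len(times):>6}"
          f"{statistics.mean(times):>18.3f}{statistics.stdev(times):>12.3f}")
```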

Page 21: BKK16-300 Benchmarking 102

Experimental Conditions

Precisely what to report depends on what is relevant to the results.
The following are guidelines.
Of course, all the environmental data should be logged and therefore available on request.

Page 22: BKK16-300 Benchmarking 102

Include

Highlight key information, even if it could be derived, including:
● All toolchain options
● Noise mitigation measures
● Testing domain
● E.g. for a memory-sensitive benchmark, report bus speed and cache hierarchy

Page 23: BKK16-300 Benchmarking 102

Leave Out

Everything not essential to the main point:
● Environment variables
● Build logs
● Firmware
● ...

All of this information should be available on request.

Page 24: BKK16-300 Benchmarking 102

Graphs: Strong Suggestions

Page 25: BKK16-300 Benchmarking 102

Speedup Over Baseline (1/3)

Misleading scale:
● A is about 3.5% faster than it was before, not 103.5%

Obfuscated regression:
● B is a regression

Page 26: BKK16-300 Benchmarking 102

Speedup Over Baseline (2/3)

Baseline becomes 0.
Title now correct.
Regression clear.

But, no confidence interval.

Page 27: BKK16-300 Benchmarking 102

Speedup Over Baseline (3/3)

Error bars tell us more:
● Effect on D can be disregarded
● Effect on A is real, but noisy
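
A sketch of producing such a chart with matplotlib: baseline at 0, error bars from confidence half-widths, and units plus the direction of "good" in the labels. The data is invented to mirror the slide (D's effect lies within the noise; A's is real but noisy):

```python
import matplotlib.pyplot as plt

benchmarks = ["A", "B", "C", "D"]
speedup_pct = [3.5, -2.0, 5.1, 0.4]  # % change vs baseline, not 103.5%
ci_pct = [2.8, 0.5, 1.0, 1.2]        # 95% confidence half-widths

fig, ax = plt.subplots()
ax.bar(benchmarks, speedup_pct, yerr=ci_pct, capsize=4)
ax.axhline(0, linewidth=0.8)  # the baseline sits at 0, not 100
ax.set_ylabel("Speedup over baseline (%), higher is better")
ax.set_title("Speedup over baseline, 8 runs, 95% CI")
fig.savefig("speedup.png")
```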

Page 28: BKK16-300 Benchmarking 102

Labelling (1/2)

What is the unit?
What are we comparing?

Page 29: BKK16-300 Benchmarking 102

Labelling (2/2)

Page 30: BKK16-300 Benchmarking 102

Graphs: Weak Suggestions

Page 31: BKK16-300 Benchmarking 102

Show the mean

Page 32: BKK16-300 Benchmarking 102

Direction of ‘Good’ (1/2)

“Speedup” changes to “time to execute”.
Direction of “good” flips.
If possible, maintain a constant direction of good.

Page 33: BKK16-300 Benchmarking 102

Direction of ‘Good’ (2/2)

If you have to change the direction of ‘good’, flag the direction (everywhere).

It can be helpful to flag it anyway.

Page 34: BKK16-300 Benchmarking 102

Consistent Order

Presents improvements neatly.
But, hard to compare different graphs in the same report.

Page 35: BKK16-300 Benchmarking 102

Scale (1/2)

A few high scores make other results hard to see.
A couple of alternatives may be clearer...

Page 36: BKK16-300 Benchmarking 102

Scale (2/2)

Page 37: BKK16-300 Benchmarking 102

Summary

Page 38: BKK16-300 Benchmarking 102

Summary

● Log everything, in detail
● Be clear about:
  ○ What the goal of your experiment is
  ○ What your method is, and how it achieves your purpose
● Present results:
  ○ Unambiguously
  ○ With statistical context
● Relate results to your goal