36
Chapter 4 Assessing and Understanding Performance 授授授授 : 授授授 授授 (Chuan-Yu Chang Ph.D.) E-mail: [email protected] Tel: (05)5342601 ext. 4337

Chapter 4 Assessing and Understanding Performance

  • Upload
    brick

  • View
    50

  • Download
    0

Embed Size (px)

DESCRIPTION

Chapter 4 Assessing and Understanding Performance. 授課教師 : 張傳育 博士 (Chuan-Yu Chang Ph.D.) E-mail: [email protected] Tel: (05)5342601 ext. 4337. Which of these airplanes has the best performance?. Airplane Passengers Range (mi) Speed (mph) Passenger throughput - PowerPoint PPT Presentation

Citation preview

Page 1: Chapter 4 Assessing and Understanding Performance

Chapter 4Assessing and Understanding

Performance

Chapter 4Assessing and Understanding

Performance

授課教師 : 張傳育 博士 (Chuan-Yu Chang Ph.D.)E-mail: [email protected]: (05)5342601 ext. 4337

Page 2: Chapter 4 Assessing and Understanding Performance

2

Airplane Passengers Range (mi) Speed (mph) Passenger throughput

Boeing 777375 4630 610 228,750Boeing 747470 4150 610 286,700BAC/Sud Concorde132 4000 1350 178,200Douglas DC-8-50146 8720 544 79,424

How much faster is the Concorde compared to the 747?

How much bigger is the 747 than the Douglas DC-8?

For different types of applications, different performance metrics may be appropriate.

Understanding how best to measure performance and the limitations of performance measurements is important in selecting a machine.

Which of these airplanes has the best performance?Which of these airplanes has the best performance?

Page 3: Chapter 4 Assessing and Understanding Performance

3

Response Time (execution time)

The time between the start and completion of a task.

How long does it take for my job to run?

How long does it take to execute a job?

How long must I wait for the database query?

Throughput

The total amount of work done in a given time.

How many jobs can the machine run at once?

What is the average execution rate?

How much work is getting done?

If we upgrade a machine with a new processor what do we increase?

Response time and throughput are improved.

If we add additional processor to a system that uses multiple processors for separate tasks what do we

increase?

Throughput is increasing.

Computer Performance: TIME, TIME, TIME

Page 4: Chapter 4 Assessing and Understanding Performance

4

Computer Performance:Computer Performance:

To maximize performance, to minimize response time or execution time for some task.

Relative Performance

XX timeExecution

ePerformanc1

ntimeExecution

timeExecution

ePerformanc

ePerformanc

nePerformanc

ePerformanc

timeExecutiontimeExecution

timeExecutiontimeExecution

ePerformancePerformanc

X

Y

Y

X

Y

X

XY

YX

YX

11

If the performance of X is greater than the performance of Y

X is n times faster than Y

Page 5: Chapter 4 Assessing and Understanding Performance

5

Book's Definition of Performance

For some program running on machine X, PerformanceX = 1 / Execution timeX

"X is n times faster than Y"PerformanceX / PerformanceY = n

Example:

Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds, how much faster is A than B?

Answer5.1

10

15

A

B

timeExecution

timeExecution

A is 1.5 times faster than B.

Page 6: Chapter 4 Assessing and Understanding Performance

6

Time is the measure of computer performance

The computer that performs the same amount of work in the least time is the fastest.

Elapsed Time (response time)

counts everything (disk and memory accesses, I/O , etc.)

a useful number, but often not good for comparison purposes

CPU time (CPU execution time)

doesn't count I/O or time spent running other programs

can be broken up into system time, and user time

Our focus: user CPU time

time spent executing the lines of code that are "in" our program

CPU performance

System CPU time

Time spent in the OS performing tasks on behalf of the program

System performance

Execution TimeExecution Time

Page 7: Chapter 4 Assessing and Understanding Performance

7

Clock Cycles

Instead of reporting execution time in seconds, we often use cycles

Clock “ticks” indicate when to start activities:

cycle time = time between ticks = seconds per cycle

clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)

A 200 Mhz. clock has a cycle time

time

seconds

program

cycles

program

seconds

cycle

snanosecond 5910 610200

1

Page 8: Chapter 4 Assessing and Understanding Performance

8

How to Improve PerformanceHow to Improve Performance

CPU execution time for a program = CPU clock cycles for a program * clock cycle time

CPU execution time for a program = CPU clock cycles for a program / clock rate

seconds

program

cycles

program

seconds

cycle

•So, to improve performance (everything else being equal) you can either

•Reduce the # of required cycles for a program, or•Reduce the clock cycle time or,•Increase the clock rate.

Page 9: Chapter 4 Assessing and Understanding Performance

9

Example

Our favorite program runs in 10 seconds on computer A, which has a 4GHz clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as machine A for this program. What clock rate should we tell the designer to target?

GHzG

rateClock

rateClock

Gond

rateClock

cyclesclockCPUtimeCPU

B

B

B

BB

86

402.1

402.1sec6

2.1

Answer

GGcyclesclockCPUGHz

cyclesclockCPUond

rateClock

cyclesclockCPUtimeCPU

A

A

A

AA

404104

sec10

Page 10: Chapter 4 Assessing and Understanding Performance

10

Comparison of two different implementationsComparison of two different implementations

The execution time must depend on the number of instructions in a program

CPU clock cycles for a program = Instructions for a program * Average clock cycles per instructionCPI (clock cycles per instruction)

The average number of clock cycles each instruction takes to execute.

CPU time for a program = Instruction count * CPI * clock cycle timeCPU time for a program = (Instruction count * CPI ) / clock rateIf the CPU clock cycles are consist of different types of instruction and using individual clock cycle count.

n

iii CCPIcyclesclockCPU

1

)(

the number of instructions of class i executed

Page 11: Chapter 4 Assessing and Understanding Performance

11

CPI Example

Suppose we have two implementations of the same instruction set architecture (ISA). For some program,Computer A has a clock cycle time of 250 ps. and a CPI of 2.0 Computer B has a clock cycle time of 500 ps. and a CPI of 1.2 Which computer is faster for this program, and by how much?

Answer:

2.1500

600

6005002.1

5002500.2

2.1

0.2

psI

psI

timeExecution

timeExecution

eperformancCPU

eperformancCPU

psIpsItimeCPU

psIpsI

timecycleClockcyclesclockCPUtimeCPU

IcyclesclockCPU

IcyclesclockCPU

A

B

B

A

B

AAA

B

A

Machine A is 1.2 times faster than machine B

Page 12: Chapter 4 Assessing and Understanding Performance

12

How many cycles are required for a program?How many cycles are required for a program?

Could assume that # of cycles = # of instructions?

This assumption is incorrect,

different instructions take different amounts of time on different machines.

Why? hint: remember that these are machine instructions, not lines of C code

time

1st

inst

ruct

ion

2nd

inst

ruct

ion

3rd

inst

ruct

ion

4th

5th

6th ...

Page 13: Chapter 4 Assessing and Understanding Performance

13

Different numbers of cycles for different instructionsDifferent numbers of cycles for different instructions

Multiplication takes more time than addition

Floating point operations take longer than integer ones

Accessing memory takes more time than accessing registers

Important point: changing the cycle time often changes the number of cycles required for various instructions

time

Page 14: Chapter 4 Assessing and Understanding Performance

14

Now that we understand cyclesNow that we understand cycles

A given program will require

some number of instructions (machine instructions)

some number of cycles

some number of seconds

We have a vocabulary that relates these quantities:

cycle time (seconds per cycle)

clock rate (cycles per second)

CPI (cycles per instruction)

a floating point intensive application might have a higher CPI

MIPS (millions of instructions per second)

this would be higher for a program using simple instructions

Page 15: Chapter 4 Assessing and Understanding Performance

15

Basic Components of Performance

Performance is determined by execution time

# of cycles to execute program

# of instructions in program

# of cycles per second

average # of cycles per instruction

average # of instructions per second

Page 16: Chapter 4 Assessing and Understanding Performance

16

A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).

The first code sequence has 5 instructions:

2 of A, 1 of B, and 2 of C

The second sequence has 6 instructions:

4 of A, 1 of B, and 1 of C.

Which code sequence executes the most instructions? Which code sequence will be faster? What is the CPI for each sequence?

Example : # of Instructions

Sequence 1 execute fewer instruction

Page 17: Chapter 4 Assessing and Understanding Performance

17

# of Instructions Example (cont.)# of Instructions Example (cont.)

Answer:

cyclescyclesclockCPU

cyclescyclesclockCPU

CCPIcyclesclockCPUn

iii

9)31()21()14(

10)32()21()12(

)(

2

1

1

So code sequence 2 is faster

5.16

9

25

10

2

22

1

11

countnInstructio

cyclesclockCPUCPI

countnInstructio

cyclesclockCPUCPI

countnInstructio

cyclesclockCPUCPI 只使用一個 fact

or 來衡量效能是危險的

Page 18: Chapter 4 Assessing and Understanding Performance

18

Using MIPS as a performance metricUsing MIPS as a performance metric

MIPS (Million instructions per second)

MIPS = Instruction count / Execution time*106

Faster machines have a higher MIPS rating.

There are three problems with using MIPS as a measure for comparing machines:

MIPS specifies the instruction execution rate but does not take into account the capabilities of the instructions.

We cannot compare computers with different instruction sets using MIPS, since the instruction counts will certainly differ.

MIPS varies between programs on the same computer

A machine cannot have a single MIPS rating for all program.

MIPS can inversely with performance.

Page 19: Chapter 4 Assessing and Understanding Performance

19

Two different compilers are being tested for a 500 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.

The first compiler's code uses 5 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions.

The second compiler's code uses 10 billion Class A instructions, 1 billion Class B instructions, and 1 billion Class C instructions.

Which sequence will be faster according to MIPS?

Which sequence will be faster according to execution time?

MIPS example (Ed. 2, page 78)

Page 20: Chapter 4 Assessing and Understanding Performance

20

MIPS example (Ed. 2, page 78, cont.)MIPS example (Ed. 2, page 78, cont.)

second3010500

1015timeExecution

second2010500

1010timeExecution

101510)3121110(cyclesclockCPU

101010)312115(cyclesclockCPU

)(cyclesclockCPU

rateClock

cyclesclockCPUtimeExecution

6

9

2

6

9

1

992

991

1

n

iii CCPI

4001030

10)1110(

3501020

10)115(

10

6

9

2

6

9

1

6

MIPS

MIPS

timeExecution

countnInstructioMIPS

Compiler 1 faster than compiler 2 Compiler 2 has higher MIPS than Compiler 1

Page 21: Chapter 4 Assessing and Understanding Performance

21

Choosing programs to evaluate performanceChoosing programs to evaluate performance

Performance best determined by running a real application

Use programs typical of expected workload

The set of programs run would form a workload.

Compare the execution time of the workload on the two machines.

Using a set of benchmarks to evaluate machine.

benchmark: Programs specifically chosen to measure performance.

The best type of programs to use for benchmarks are real application

e.g., compilers/editors, scientific applications, graphics, etc.

Page 22: Chapter 4 Assessing and Understanding Performance

22

Choosing programs to evaluate performance( cont.)Choosing programs to evaluate performance( cont.)

Small benchmarks

nice for architects and designers

easy to standardize

can be abused

Different classes and applications of computers will require different types of benchmarks.

For desktop computer, the most common benchmarks are either measures of CPU performance or benchmarks focusing on a specific task.

SPEC (System Performance Evaluation Cooperative)

companies have agreed on a set of real program and inputs

can still be abused (Intel’s “other” bug)

valuable indicator of performance (and compiler technology)

For embedded computer

EEMBC

Page 23: Chapter 4 Assessing and Understanding Performance

23

SPEC ‘89

Compiler “enhancements” and performance

SPEC89 performance ratios for the IBM Powerstation 550 using two different compilers. (compare with VAX-11/780)

0

100

200

300

400

500

600

700

800

tomcatvfppppmatrix300eqntottlinasa7doducspiceespressogcc

BenchmarkCompiler

Enhanced compiler

SP

EC

pe

rfo

rman

ce r

atio

Page 24: Chapter 4 Assessing and Understanding Performance

24

The SPEC’95 Benchmarks

Benchmark Description

go Artificial intelligence; plays the game of Gom88ksim Motorola 88k chip simulator; runs test programgcc The Gnu C compiler generating SPARC codecompress Compresses and decompresses file in memoryli Lisp interpreterijpeg Graphic compression and decompressionperl Manipulates strings and prime numbers in the special-purpose programming language Perlvortex A database programtomcatv A mesh generation programswim Shallow water model with 513 x 513 gridsu2cor quantum physics; Monte Carlo simulationhydro2d Astrophysics; Hydrodynamic Naiver Stokes equationsmgrid Multigrid solver in 3-D potential fieldapplu Parabolic/elliptic partial differential equationstrub3d Simulates isotropic, homogeneous turbulence in a cubeapsi Solves problems regarding temperature, wind velocity, and distribution of pollutantfpppp Quantum chemistrywave5 Plasma physics; electromagnetic particle simulation

Page 25: Chapter 4 Assessing and Understanding Performance

25

SPEC ‘95

Does doubling the clock rate double the performance?

Clock rate (MHz)

SP

EC

int

2

0

4

6

8

3

1

5

7

9

10

200 25015010050

Pentium

Pentium Pro

PentiumClock rate (MHz)

SP

EC

fp

Pentium Pro

2

0

4

6

8

3

1

5

7

9

10

200 25015010050

造成 clock rate 増快 1 倍,但 performance 只增加 1.3~1.7 倍的原因在於記憶體的存取速度未提昇

Page 26: Chapter 4 Assessing and Understanding Performance

26

SPEC CPU2000 (the latest release)SPEC CPU2000 (the latest release)

12個整數程式

14個浮點數程式

Page 27: Chapter 4 Assessing and Understanding Performance

27

SPEC 2000

Does doubling the clock rate double the performance?

Can a machine with a slower clock rate have better performance?

Clock rate in MHz

500 1000 1500 30002000 2500 35000

200

400

600

800

1000

1200

1400

Pentium III CINT2000

Pentium 4 CINT2000

Pentium III CFP2000

Pentium 4 CFP2000

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

SPECINT2000 SPECFP2000 SPECINT2000 SPECFP2000 SPECINT2000 SPECFP2000

Always on/maximum clock Laptop mode/adaptiveclock

Minimum power/minimumclock

Benchmark and power mode

Pentium M @ 1.6/0.6 GHz

Pentium 4-M @ 2.4/1.2 GHz

Pentium III-M @ 1.2/0.8 GHz

Page 28: Chapter 4 Assessing and Understanding Performance

28

Performance ReportPerformance Report

The guiding principle in reporting performance measurements should be “reproducibility”

We should list everything another experimenter would need to duplicate the results.

The list must include

The version of the OS

Compiler

The input

Machine configuration

Ex.:(Fig.4.3)

Page 29: Chapter 4 Assessing and Understanding Performance

29

Example of Performance ReportExample of Performance Report

Page 30: Chapter 4 Assessing and Understanding Performance

30

Example-Comparing and Summarizing PerformanceExample-Comparing and Summarizing Performance

How should a summary be computed?Execution times of two programs on two different machines.

Computer A Computer B

Program 1 1 10

Program 2 1000 100

Total time 1001 110

A is 10 times faster than B for program 1.B is 10 times faster than A for program 2.Total Execution Time (the simplest approach to summarizing relative performance)

1.9110

1001

A

B

A

B

TimeExecution

TimeExecution

ePerformanc

ePerformanc

B is 9.1 times faster than A for programs 1 and 2 together.

Page 31: Chapter 4 Assessing and Understanding Performance

31

Arithmetic Mean

The average of the execution times

Weighted Arithmetic Mean

Assign a weighting factor wi to each program to indicate t

he frequency of the program in that workload.

n

iiTime

nAM

1

1

Example-Comparing and Summarizing Performance (cont.)Example-Comparing and Summarizing Performance (cont.)

Page 32: Chapter 4 Assessing and Understanding Performance

32

Amdahl's Law

Execution Time After Improvement =Execution Time Unaffected +( Execution Time Affected / Amount of Improvement )

Example:Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times and 5 times faster?“

Solution:

(1) run 4 times faster (2) run 5 times faster

Principle: Make the common case fast16

802025

nn

n

n80

0

802020

Page 33: Chapter 4 Assessing and Understanding Performance

33

Example

Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?

Solution:Execution Time After Improvement = Execution Time Unaffected +( Execution Time Affected / Amount of Improvement )Execution Time After Improvement =5+5/5=6

speedup=10/6=1.667

timprovemenaftertimeExecution

timprovemenbeforetimeExecution

timprovemenbeforeePerformanc

timprovemenafterePerformancSpeedup

Page 34: Chapter 4 Assessing and Understanding Performance

34

Performance is specific to a particular program/s

Total execution time is a consistent summary of performance

For a given architecture performance increases come from three sources:

increases in clock rate (without adverse CPI affects)

improvements in processor organization that lower CPI

compiler enhancements that lower CPI and/or instruction count

Pitfall: ( 陷阱 )

expecting improvement in one aspect of a machine’s performance to affect the total performance

Using MIPS as a performance metric

Using the arithmetic mean of normalized execution times to predict performance.

Remember

Page 35: Chapter 4 Assessing and Understanding Performance

35

Example:Example:Time on A Time on B Normalized to A Normalized to B

A B A B

Program 1 1 10 1 10 0.1 1

Program 2 1000 100 1 0.1 10 1

Arithmetic mean of time or normalized time

500.5 55 1 5.05 5.05 1

Geometric mean of time or normalized time

31.6 31.6 1 1 1 1

n

n

ii

n

n

ii

aaaaa

where

ratiotimetimeExecutionmeanGeometric

...3211

1

The geometric mean is independent of which dataseries use for normalization because of

i

i

i

i

Y

XmeanGeometric

YmeanGeometric

XmeanGeometric

)(

)(

Page 36: Chapter 4 Assessing and Understanding Performance

36

Fallacy( 謬誤 )Hardware-independent metrics predict performance.

Use code size as a measure of speed.

Synthetic benchmarks predict performance.The synthetic benchmarks are not real applications because these programs compute anything a user would not interesting. Compiler and hardware optimization can inflate performance of these benchmarks.

The geometric mean of execution time ratios is proportional to total execution time.

在圖 2.10 中,若執行程式一 100 次,程式二 1 次則Machine A 的總執行時間 =100+1000=1100Machine B 的總執行時間 =1000+100=1100所以 Machine A 和 Machine B 有相同的 performance?