Chapter 1: Fundamentals of Computer Design
吳俊興, Department of Computer Science and Information Engineering, National University of Kaohsiung
September 2004
EEF011 Computer Architecture (計算機結構)
Chapter 1. Fundamentals of Computer Design
1.1 Introduction
1.2 The Changing Face of Computing and the Task of the Computer Designer
1.3 Technology Trends
1.4 Cost, Price, and Their Trends
1.5 Measuring and Reporting Performance
1.6 Quantitative Principles of Computer Design
1.7 Putting It All Together
1.8 Another View
1.9 Fallacies and Pitfalls
1.10 Concluding Remarks
Computer Architecture – 1.1 Introduction
• Rapid rate of improvement over roughly the last 55 years, driven by
  – the technology used to build computers
  – innovation in computer design
Beginning in about 1970, computer designers became largely dependent upon integrated circuit (IC) technology
• Emergence of the microprocessor: late 1970s
  – C: virtual elimination of assembly language programming (portability)
  – UNIX: standardized, vendor-independent OS
  Together these lowered the cost and risk of bringing out a new architecture
• Development of RISC (Reduced Instruction Set Computer): early 1980s
  Focused the attention of designers on
  – the exploitation of instruction-level parallelism
    • initially through pipelining, and
    • later through multiple instruction issue
  – the use of caches
    • initially in simple forms, and
    • later using more sophisticated organizations and optimizations
1.1 Introduction (cont.)
Figure 1.1: 20 years of sustained growth in performance at an annual rate of over 50% (two different versions of SPEC, i.e., SPEC92 and SPEC95)
Effects
– Significantly enhanced the capability available to users
– Dominance of microprocessor-based computers across the entire range of computer design
  • Minicomputers and mainframes replaced
  • Supercomputers built with collections of microprocessors
1.2 Changing Face of Computing
• 1960s: dominated by mainframes
  – business data processing, large-scale scientific computing
• 1970s: the birth of minicomputers
  – scientific laboratories, time-sharing multiple-user environments
• 1980s: the rise of the desktop computer (personal computer and workstation)
• 1990s: the emergence of the Internet and the World Wide Web
  – handheld computing devices
  – emergence of high-performance digital consumer electronics
Segmented Markets
• Desktop computing: optimize price-performance
  – the largest market in dollar terms
  – Web-centric, interactive applications
• Servers
  – Web-based services, replacing the traditional mainframe
  – Availability, scalability, throughput
• Embedded computers
  – Features
    • The presence of the computer is not immediately obvious
    • The application can usually be carefully tuned for the processor and system
  – Widest range of processing power and cost
    • Price is a key factor in the design of these computers
    • Real-time performance requirements
  – Key characteristics
    • the need to minimize memory and power
    • the use of processor cores together with application-specific circuitry
Task of Computer Designer
• Tasks
  – Determine what attributes are important for a new machine, then
  – design a machine to maximize performance while staying within cost and power constraints
• Efforts
  – instruction set design, functional organization, logic design, and implementation (IC design, packaging, power, and cooling)
Design a computer to meet functional requirements as well as price, power, and performance goals (see Figure 1.4)
Task of a Computer Designer
[Design-cycle diagram: Evaluate existing systems for bottlenecks → Simulate new designs and organizations → Implement next-generation system, driven by technology trends, benchmarks, workloads, and implementation complexity]
Effect of trends: the designer often needs to design for the next technology, because during a single product cycle (typically 4-5 years) key technologies, such as memory, change sufficiently that the designer must plan for these changes.
1.3 Technology Trends
The designer must be especially aware of rapidly occurring changes in implementation technology.
1. Integrated circuit logic technology
   – transistor density increases by about 35% per year
   – increases in die size are less predictable and slower
   – combined effect: growth rate in transistor count of about 55% per year
2. Semiconductor DRAM
   – density increases by between 40% and 60% per year
   – cycle time has decreased by about one-third in 10 years
   – bandwidth per chip increases about twice as fast as latency decreases
3. Magnetic disk technology
4. Network technology
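To get a feel for how these annual rates compound over a design cycle, here is a small illustrative calculation in Python; the 4-year cycle and the growth rates are simply the figures quoted above:

```python
# Illustrative compounding of the annual growth rates quoted above.
density_rate = 1.35   # transistor density: ~35% per year
count_rate = 1.55     # transistor count (density + die size): ~55% per year
years = 4             # a typical 4-5 year product cycle

print(f"Density growth over {years} years: {density_rate ** years:.1f}x")          # ~3.3x
print(f"Transistor-count growth over {years} years: {count_rate ** years:.1f}x")   # ~5.8x
```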
Moore’s Law
Gordon Moore (co-founder of Intel) observed in 1965 that the number of transistors that could be crammed onto a chip doubles every year.
http://www.intel.com/research/silicon/mooreslaw.htm
Technology Trends
In terms of speed, the CPU has improved dramatically, but memory and disk have improved only a little. This has led to dramatic changes in architecture, operating systems, and programming practices.
        Capacity        Speed (latency)
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years
1.4 Cost, Price, and Their Trends
• Price: what you sell a finished good for
  – 1999: more than half the PCs sold were priced at less than $1000
• Cost: the amount spent to produce it, including overhead
• Time
  – Learning curve: manufacturing costs decrease over time
  – Yield: the percentage of manufactured devices that survive the testing procedure
  – As the technology matures over time, the yield improves and hence things get cheaper
• Volume
  – Increasing volume decreases cost (it shortens the time on the learning curve)
  – It also increases purchasing and manufacturing efficiency
• Commodities
  – Products that are sold by multiple vendors in large volumes and are essentially identical
  – The low-end business, such as DRAMs, disks, and monitors
  – Increased competition reduces cost
Cost and Its Trend
Between the start of a project and the shipping of a product, say two years, the cost of a new DRAM drops by a factor of between 5 and 10 in constant dollars.
Prices of six generations of DRAMs
Price of an Intel Pentium III at a Given Frequency
It decreases over time as yield enhancements decrease the cost of a good die and competition forces price reductions.
Figure 1.7: Intel Pentium 4 microprocessor die
Figure 1.8: This wafer contains 564 MIPS64 R20K processors implemented in a 0.18-micron process
Wafer and Dies
Exponential cost decrease while the technology stays basically the same:
a wafer is tested and chopped into dies that are packaged.

Cost of an Integrated Circuit (IC)
(a greater portion of the cost that varies between machines)

  Cost of die = Cost of wafer / (Dies per wafer × Die yield)

  Dies per wafer = π × (Wafer diameter / 2)² / Die area − π × Wafer diameter / √(2 × Die area)
  (the second term subtracts the partial dies along the edge of the wafer)

  Die yield = Wafer yield × (1 + (Defect density × Die area) / α)^(−α)   (sensitive to die size)

Today's technology: α ≈ 4, defect density between 0.4 and 0.8 per cm²
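As a rough sketch of how these formulas combine, the following Python puts them together; the wafer cost, wafer diameter, die area, and defect density used in the example call are hypothetical numbers chosen for illustration, not figures from the text:

```python
import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """pi*(d/2)^2 / A, minus the partial dies along the wafer edge."""
    return (math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))

def die_yield(wafer_yield: float, defect_density: float, die_area_cm2: float,
              alpha: float = 4.0) -> float:
    """Wafer yield * (1 + defect density * die area / alpha)^(-alpha)."""
    return wafer_yield * (1 + defect_density * die_area_cm2 / alpha) ** (-alpha)

def die_cost(wafer_cost: float, wafer_diameter_cm: float, die_area_cm2: float,
             defect_density: float, wafer_yield: float = 1.0) -> float:
    """Cost of die = cost of wafer / (dies per wafer * die yield)."""
    return wafer_cost / (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                         * die_yield(wafer_yield, defect_density, die_area_cm2))

# Hypothetical numbers: a $5000, 30 cm wafer, a 1 cm^2 die, 0.6 defects/cm^2
print(f"${die_cost(5000, 30, 1.0, 0.6):.2f} per good die")   # about $13.66
```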
Cost of an IC (cont.)
Cost Example in a $1000 PC in 2001
1.5 Measuring and Reporting Performance
• Designing high performance computers is one of the major goals of any computer architect.
• As a result, assessing the performance of computer hardware is at the heart of computer design and greatly affects the demand for and market value of the computer.
• However, measuring the performance of a computer system is not a straightforward task:
  – Metrics: how do we describe the performance of a computer numerically?
  – What tools do we use to find those metrics?
  – How do we summarize the performance?
What is the computer user interested in?
• Reduce the time to run a certain task
  – Execution time (response time): the time between the start and the completion of an event
• Increase the number of tasks per week, day, hour, second, ns, ...
  – Throughput: the total amount of work done in a given time
Example: Do the following changes to a computer system increase throughput, reduce response time, or both?
1) Replacing the processor in a computer with a faster version
2) Adding additional processors to a system that uses multiple processors for separate tasks, for example handling an airline reservation system

Answer
1) Both response time and throughput are improved
2) Only throughput increases
Execution Time and Performance Comparison
• In this subject, we will be primarily interested in execution time as a measure of performance
• The relationship between performance and execution time on a computer X (reciprocal) is defined as follows:
• To compare design alternatives, we use the following equation:
• “X is n times faster than Y” or “the throughput of X is n times higher than that of Y” means that the execution time is n times shorter on X than on Y
• To maximize performance of an application, we need to minimize its execution time
  Performance_X = 1 / Execution time_X

  n = Performance_X / Performance_Y = Execution time_Y / Execution time_X
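A minimal sketch of these two relations in Python; the 10 s and 15 s figures in the example call are made up for illustration:

```python
def performance(execution_time_s: float) -> float:
    # Performance_X = 1 / Execution time_X
    return 1.0 / execution_time_s

def speedup(exec_time_x_s: float, exec_time_y_s: float) -> float:
    # "X is n times faster than Y": n = Execution time_Y / Execution time_X
    return exec_time_y_s / exec_time_x_s

# If X runs a program in 10 s and Y runs it in 15 s, X is 1.5 times faster than Y.
print(speedup(10.0, 15.0))   # 1.5
```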
Measure Performance – user CPU time
• Response time may include disk access, memory access, input/output activities, CPU event and operating system overhead - everything
• In order to get an accurate measure of performance, we use CPU time instead of using response time
• CPU time is the time the CPU spends computing a program and does not include time spent waiting for I/O or running other programs
• CPU time can also be divided into user CPU time (program) and system CPU time (OS)
• Keying in the UNIX command time, we get, for example: 90.7u 12.9s 2:39 65% (user CPU time, system CPU time, total elapsed time, and the percentage of elapsed time that is CPU time: (90.7 + 12.9) / 159 ≈ 65%)
• In our performance measures, we use user CPU time, because it is independent of the OS and other factors
Programs for Measuring Performance
• Real applications: text processing software (Word), compilers (C), and other applications like Photoshop – they have inputs, outputs, and options that the user selects when using them
  One major downside: real applications often encounter portability problems arising from dependences on the OS or compiler
• Modified (or scripted) applications: Modification is to enhance portability or focus on one particular aspect of system performance. Scripts are used to simulate the interaction behaviors
• Kernels: small, key pieces from real programs. Typically used to evaluate individual features of the machine
• Toy benchmarks: typically between 10 and 100 lines of code and produce a known result
• Synthetic benchmarks: artificially created code to match an average execution profile
Benchmark Suites
• A collection of programs (a workload) that tries to explore and capture the strengths and weaknesses of a computer system (real programs, kernels)
• A key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of the other benchmarks
• Good vs. bad benchmarks
  – Improving the product for real programs vs. improving the product for benchmarks to get more sales
  – If the benchmarks are inadequate, then sales win!
SPEC Benchmarks
SPEC: Standard Performance Evaluation Corporation
• Most successful attempt and widely adopted
• First generation, 1989
  – 10 programs yielding a single number (“SPECmarks”)
• Second generation, 1992
  – SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
  – Unlimited compiler flags
• Third generation, 1995
  – New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point programs)
  – Single flag setting for all programs: SPECint_base95, SPECfp_base95
  – “benchmarks useful for 3 years”
• SPEC CPU2000
  – CINT2000 (12 integer programs) and CFP2000 (14 floating point programs)
SPEC Benchmarks
CINT2000 (Integer Component of SPEC CPU2000)

Program      Language  What Is It
164.gzip     C         Compression
175.vpr      C         FPGA Circuit Placement and Routing
176.gcc      C         C Programming Language Compiler
181.mcf      C         Combinatorial Optimization
186.crafty   C         Game Playing: Chess
197.parser   C         Word Processing
252.eon      C++       Computer Visualization
253.perlbmk  C         PERL Programming Language
254.gap      C         Group Theory, Interpreter
255.vortex   C         Object-oriented Database
256.bzip2    C         Compression
300.twolf    C         Place and Route Simulator
http://www.spec.org/osg/cpu2000/CINT2000/
SPEC Benchmarks
CFP2000 (Floating Point Component of SPEC CPU2000)

Program       Language    What Is It
168.wupwise   Fortran 77  Physics / Quantum Chromodynamics
171.swim      Fortran 77  Shallow Water Modeling
172.mgrid     Fortran 77  Multi-grid Solver: 3D Potential Field
173.applu     Fortran 77  Parabolic / Elliptic Differential Equations
177.mesa      C           3-D Graphics Library
178.galgel    Fortran 90  Computational Fluid Dynamics
179.art       C           Image Recognition / Neural Networks
183.equake    C           Seismic Wave Propagation Simulation
187.facerec   Fortran 90  Image Processing: Face Recognition
188.ammp      C           Computational Chemistry
189.lucas     Fortran 90  Number Theory / Primality Testing
191.fma3d     Fortran 90  Finite-element Crash Simulation
200.sixtrack  Fortran 77  High Energy Physics Accelerator Design
301.apsi      Fortran 77  Meteorology: Pollutant Distribution
http://www.spec.org/osg/cpu2000/CFP2000/
More Benchmarks
• TPC: Transaction Processing Council
  – Measures the ability of a system to handle transactions, which consist of database accesses and updates
  – Many variants depending on transaction complexity
  – TPC-A: simple bank teller transaction style
  – TPC-C: complex database query
• EDN Embedded Microprocessor Benchmark Consortium (EEMBC, pronounced “embassy”)
  – 34 kernels in 5 classes
  – 16 automotive/industrial; 5 consumer; 3 networking; 4 office automation; 6 telecommunications
Management would like to have one number.
Technical people want more:
1. Evidence of reproducibility: there should be enough information so that you or someone else can repeat the experiment
2. Consistency when the measurements are done multiple times
How to Summarize Performance
How would you report these results?
                    Computer A   Computer B   Computer C
Program P1 (secs)   1            10           20
Program P2 (secs)   1000         100          20
Total Time (secs)   1001         110          40
Comparing and Summarizing Performance
Comparing the performance by looking at individual programs is not fair
Total execution time: a consistent summary measure
Arithmetic Mean – provides a simple average

  AM = (1/n) × Σ_{i=1}^{n} Time_i

  – Time_i: execution time for program i in the workload
  – Doesn't account for weight: all programs are treated equally
Weighted Variants
What is the proper mixture of programs for the workload? A weight is a weighting factor assigned to each program to indicate the relative frequency of that program in the workload (% of use).

Weighted Arithmetic Mean

  WAM = Σ_{i=1}^{n} Weight_i × Time_i

  – Weight_i: frequency of program i in the workload
  – May be better, but beware of a dominant program time
                              A       B       C      W(1)    W(2)    W(3)
Program P1 (secs)             1       10      20     0.5     0.909   0.999
Program P2 (secs)             1000    100     20     0.5     0.091   0.001
Arithmetic Mean               500.5   55      20
Weighted Arithmetic Mean (1)  500.5   55      20
Weighted Arithmetic Mean (2)  91.82   18.18   20
Weighted Arithmetic Mean (3)  2       10.09   20

Example: Figure 1.16

  Weighted Arithmetic Mean = Σ_{i=1}^{n} Weight_i × Time_i
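A small sketch that reproduces the arithmetic and weighted means in the table above; the W(2) weights are written as the exact fractions 10/11 and 1/11 that the rounded 0.909/0.091 stand for:

```python
def arithmetic_mean(times):
    return sum(times) / len(times)

def weighted_mean(weights, times):
    return sum(w * t for w, t in zip(weights, times))

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}          # P1, P2 in seconds
weightings = {"W(1)": [0.5, 0.5], "W(2)": [10/11, 1/11], "W(3)": [0.999, 0.001]}

for machine, t in times.items():
    wams = ", ".join(f"{name}={weighted_mean(w, t):.2f}" for name, w in weightings.items())
    print(f"{machine}: AM={arithmetic_mean(t):.1f}, {wams}")
# A: AM=500.5, W(1)=500.50, W(2)=91.82, W(3)=2.00
# B: AM=55.0,  W(1)=55.00,  W(2)=18.18, W(3)=10.09
# C: AM=20.0,  W(1)=20.00,  W(2)=20.00, W(3)=20.00
```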
Normalized Time Metrics
Normalized execution time metrics: measure performance by normalizing execution times to a reference machine, giving an execution time ratio for each program i.
Geometric Mean
Geometric mean is consistent no matter which machine is the reference
The arithmetic mean should not be used to average normalized execution times
However, the geometric mean still does not form an accurate prediction model (it does not predict execution time). It also encourages designers to improve the easiest benchmarks rather than the slowest ones (because there is no weighting).
  Geometric mean = (Π_{i=1}^{n} Execution time ratio_i)^(1/n)
Example
                              A       B      C       W(1)
Program P1 (secs)             1       10     20      100/101
Program P2 (secs)             1000    100    20      1/101

Normalized to A:
Program P1                    1       10     20
Program P2                    1       0.1    0.02

Geometric Mean                1       1      0.63
Arithmetic Mean               500.5   55     20
Weighted Arithmetic Mean (1)  10.89   10.89  20
Machines A and B have the same performance according to the geometric mean measure, yet this would be true only for a workload in which P1 runs 100 times more often than P2, as the weighted arithmetic mean with W(1) shows.
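A short sketch of the normalized geometric-mean calculation for this table (reference machine A; math.prod requires Python 3.8+):

```python
import math

def geometric_mean(ratios):
    # nth root of the product of the n execution-time ratios
    return math.prod(ratios) ** (1 / len(ratios))

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}   # P1, P2 in seconds
reference = times["A"]

for machine, t in times.items():
    normalized = [x / r for x, r in zip(t, reference)]
    print(f"{machine}: geometric mean of normalized times = {geometric_mean(normalized):.2f}")
# A and B both give 1.00 even though their total run times differ (1001 s vs. 110 s)
```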
1.6 Quantitative Principles of Computer Design
Now that we know how to define, measure, and summarize performance, we can explore some of the principles and guidelines useful in the design and analysis of computers.
Make the common case fast
– In making a design trade-off, favor the frequent case over the infrequent case
– Improving the frequent event, rather than the rare event, will obviously help performance
– The frequent case is often simpler and can be done faster than the infrequent case
– We have to decide what the frequent case is and how much performance can be improved by making that case faster
Two equations to evaluate design alternatives
• Amdahl's Law
  – States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
  – Defines the speedup that can be gained by using a particular feature
  – The performance gain obtained by improving some portion of a computer can be calculated using Amdahl's Law
• The CPU Performance Equation
  – Essentially all computers are constructed using a clock running at a constant rate
  – CPU time can then be expressed in terms of the number of clock cycles
Amdahl's Law
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.

Speedup due to enhancement E:

  Speedup(E) = Execution time without enhancement / Execution time with enhancement

• Fraction_enhanced: the fraction of the execution time in the original machine that can be converted to take advantage of the enhancement
• Speedup_enhanced: the improvement gained by the enhanced execution mode

  ExTime_new = ExTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Amdahl's Law
Example: Floating point (FP) instructions are improved to run faster by a factor of 2, but FP instructions account for only 10% of the execution time of the original program. What is the overall speedup gained by this improvement?

Answer

  ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

  Speedup_overall = ExTime_old / ExTime_new = 1 / 0.95 = 1.053
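A one-function sketch of Amdahl's Law that reproduces this result:

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    # Speedup_overall = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# FP instructions (10% of execution time) made 2x faster:
print(f"{amdahl_speedup(0.10, 2.0):.3f}")   # 1.053
```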
Amdahl's Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost-performance.
It is particularly useful for comparing the performance, of both the overall system and the CPU, of two design alternatives.
More examples: pages 41 and 42.
All computers are constructed using a clock to operate their circuits.
• The clock is typically described by two basic metrics
  – Clock rate: today in MHz and GHz
  – Clock cycle time: clock cycle time = 1 / clock rate
  – E.g., a 1 GHz clock rate corresponds to a 1 ns cycle time
Thus the CPU time for a program is given by:

  CPU time = CPU clock cycles for a program × Clock cycle time

or,

  CPU time = CPU clock cycles for a program / Clock rate
CPU Performance
More typically, we count the number of instructions executed, known as the Instruction Count (IC).
CPI: the average number of clock cycles per instruction

  CPI = CPU clock cycles for a program / IC

Hence, an alternative way to get the CPU time:

  CPU time = IC × CPI × Clock cycle time = (IC × CPI) / Clock rate
CPU performance is equally dependent upon three characteristics: clock cycle time, CPI, and IC. It is difficult to change one in isolation, and making one better often makes another worse, because the basic technologies involved in changing each characteristic are interdependent:
– Clock cycle time: hardware technology and organization
– CPI: organization and instruction set architecture
– IC: instruction set architecture and compiler technology (depends on the instruction set of the computer and its compiler)
CPU Performance
Example: Suppose we have 2 implementations of the same instruction set architecture. Computer A has a clock cycle time of 10 nsec and a CPI of 2.0 for some program, and computer B has a clock cycle time of 20 nsec and a CPI of 1.2 for the same program.
What machine is faster for this program?
Answer
Assume the program requires IN instructions to be executed:

  CPU clock cycles_A = IN × 2.0
  CPU clock cycles_B = IN × 1.2
  CPU time_A = IN × 2.0 × 10 ns = 20 × IN ns
  CPU time_B = IN × 1.2 × 20 ns = 24 × IN ns

So computer A is faster than computer B (by a factor of 24/20 = 1.2 for this program).
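The same comparison as a Python sketch; the instruction count is arbitrary, since it cancels out of the ratio:

```python
def cpu_time_ns(instruction_count: float, cpi: float, cycle_time_ns: float) -> float:
    # CPU time = IC x CPI x Clock cycle time
    return instruction_count * cpi * cycle_time_ns

IN = 1e9  # arbitrary instruction count; it cancels out in the ratio
time_a = cpu_time_ns(IN, 2.0, 10)
time_b = cpu_time_ns(IN, 1.2, 20)
print(f"B is {time_b / time_a:.1f}x slower than A")   # 1.2x
```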
  CPU time = IC × CPI × Clock cycle time
CPU Performance
Often it is easier to deal with the overall performance on a per-instruction-type basis, counting how many times each type of instruction is executed. The overall CPI can then be expressed as:
  Overall CPI = (Σ_{i=1}^{n} IC_i × CPI_i) / Instruction count
              = Σ_{i=1}^{n} (IC_i / Instruction count) × CPI_i

  CPU clock cycles = Σ_{i=1}^{n} IC_i × CPI_i

  CPU time = (Σ_{i=1}^{n} IC_i × CPI_i) × Clock cycle time

where IC_i is the number of times instruction i is executed and CPI_i is the CPI for instruction i.
CPU Performance
Cycles Per Instruction
Example: Suppose we have a machine where we can count the frequency with which instructions are executed. We also know how many cycles each instruction type takes.

Base Machine (Reg / Reg)
Instruction   Freq   CPI   (% Time)
ALU           50%    1     (33%)
Load          20%    2     (27%)
Store         10%    2     (13%)
Branch        20%    2     (27%)
(Total CPI 1.5)

How do we get the total CPI? How do we get the % of time?
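A short sketch answering both questions from the table above:

```python
# frequency and CPI per instruction class, from the table above
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

total_cpi = sum(freq * cpi for freq, cpi in mix.values())
print(f"Total CPI = {total_cpi:.2f}")                          # 1.50
for name, (freq, cpi) in mix.items():
    print(f"{name}: {freq * cpi / total_cpi:.0%} of time")     # 33%, 27%, 13%, 27%
```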
Example: Suppose we have made the following measurements:
  Frequency of FP operations (other than FPSQR) = 25%
  Average CPI of FP operations = 4.0
  Average CPI of other instructions = 1.33
  Frequency of FPSQR = 2%
  CPI of FPSQR = 20
Compare two design alternatives:
1) decrease the CPI of FPSQR to 2;
2) decrease the average CPI of all FP operations to 2.5.
Answer
First, find the original CPI:

  CPI_original = Σ_{i=1}^{n} (IC_i / Instruction count) × CPI_i = 25% × 4.0 + 75% × 1.33 = 2.0

The CPI with design 1 (FPSQR improved from CPI 20 to CPI 2):

  CPI_design1 = CPI_original − 2% × (20 − 2) = 2.0 − 0.36 = 1.64

The CPI with design 2:

  CPI_design2 = 25% × 2.5 + 75% × 1.33 ≈ 1.625

So design 2 is slightly better.
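The same comparison as a sketch; it assumes, as the example does, that clock cycle time and instruction count are unchanged, so the lower CPI wins. The slide's 1.33 is taken as the exact 4/3 so the rounded figures above are reproduced:

```python
freq_fp, cpi_fp = 0.25, 4.0          # FP operations (including FPSQR)
freq_other = 0.75
cpi_other = 4 / 3                    # the slide's 1.33, taken here as exactly 4/3
freq_fpsqr, cpi_fpsqr = 0.02, 20.0   # FPSQR alone

cpi_original = freq_fp * cpi_fp + freq_other * cpi_other      # 2.0
cpi_design1 = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)   # FPSQR CPI 20 -> 2
cpi_design2 = freq_fp * 2.5 + freq_other * cpi_other          # all FP CPI -> 2.5

print(f"original={cpi_original:.3f}, design1={cpi_design1:.3f}, design2={cpi_design2:.3f}")
# original=2.000, design1=1.640, design2=1.625 -> design 2 is slightly better
```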
CPU Performance
Other important fundamental observations come from the properties of programs.
Principle of locality: Programs tend to reuse data and instructions they have used recently.
There are two different types of locality (Ref. Chapter 5):
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (loops, reuse, etc.)
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close to one another tend to be referenced soon (straight-line code, array access, etc.)
We can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the past.
A program spends 90% of its execution time in only 10% of the code.
Principle of Locality
Some “Misleading” Performance Measures
There are certain computer performance measures that are popular with computer manufacturers and sellers – but they may be misleading.
MIPS (Million Instructions Per Second)
MIPS depends on the instruction set, which makes it difficult to compare the MIPS of computers with different instruction sets.
MIPS varies between programs on the same computer – different programs use different instruction mixes.
Most importantly, MIPS can vary inversely to performance.
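For reference, a sketch of the usual definition behind these points (the standard textbook formulation; it is not stated on the slide itself):

\[
\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6},
\qquad
\text{Execution time} = \frac{\text{Instruction count}}{\text{MIPS} \times 10^6}
\]

Since both CPI and instruction count depend on the instruction set and the program, a machine can post a higher MIPS figure while actually taking longer on the real workload.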
Some “Misleading” Performance Measures
MFLOPS: focuses on one type of work
– MFLOPS (Million Floating-point Operations Per Second) depends on the program; the program must be FP-intensive
– MFLOPS depends on the computer as well
– The floating-point operations vary in complexity (e.g., add vs. divide)
Peak Performance: the performance that the manufacturer guarantees you won't exceed
Difference between peak performance and average performance is huge.
Peak performance is not useful in predicting observed performance.
Summary
1.1 Introduction
1.2 The Changing Face of Computing and the Task of the Computer Designer
1.3 Technology Trends
1.4 Cost, Price, and Their Trends
1.5 Measuring and Reporting Performance
  – Benchmarks
  – (Weighted) Arithmetic Mean, Normalized Geometric Mean
1.6 Quantitative Principles of Computer Design
  – Amdahl's Law
  – CPU Performance: CPU Time, CPI
  – Locality