The Improvement of the Personal Computer
By 金明浩, Associate Professor, Department of Information Engineering, I-Shou University
What is Computer Design?

Computer design spans many levels of abstraction, from application software down to physical layout:
• Application
• Operating System
• Compiler / Firmware
• Instruction Set Architecture (ISA) — the interface between software and hardware
• Instruction Set Processor and I/O system
• Datapath & Control
• Digital Design
• Circuit Design
• Layout
Forces on Computer Architecture

Computer architecture sits at the center of five forces: Technology, Programming Languages, Operating Systems, History, and Applications.
Technology
• Around 1985 the single-chip 32-bit processor and the single-board computer emerged; workstations, personal computers, and multiprocessors have been riding this wave ever since.
• In the 2002+ timeframe, these machines may well look like mainframes compared to a single-chip computer (perhaps two chips).
DRAM chip capacity

  Year   Size
  1980   64 Kb
  1983   256 Kb
  1986   1 Mb
  1989   4 Mb
  1992   16 Mb
  1996   64 Mb
  1999   256 Mb
  2002   1 Gb
Levels of Representation

High-Level Language Program (C):
temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;

↓ Compiler

Assembly Language Program (MIPS):
lw  $15, 0($2)
lw  $16, 4($2)
sw  $16, 0($2)
sw  $15, 4($2)

↓ Assembler

Machine Language Program:
0000 1001 1100 0110 1010 1111 0101 1000
1010 1111 0101 1000 0000 1001 1100 0110
1100 0110 1010 1111 0101 1000 0000 1001
0101 1000 0000 1001 1100 0110 1010 1111

↓ Machine Interpretation

Control Signal Specification:
ALUOP[0:3] <= InstReg[9:11] & MASK
Levels of Organization

A computer such as the SPARCstation 20 is organized as a Processor (Control + Datapath), Memory, and Input/Output devices.

Workstation cost design target: 25% on the processor, 25% on memory (minimum memory size), and the rest on I/O devices, power supplies, and the box.
Processor and Caches (SPARCstation 20)

The processor sits on an MBus module (MBus slots 0 and 1) and contains the control, the datapath with its registers, and an internal cache; an external cache sits on the module beside it.
Input and Output (I/O) Devices (SPARCstation 20)

I/O attaches through several buses: four SBus slots (0–3), a SCSI bus (via the SEC/MACIO chips) for disk and tape, and an external bus for the keyboard, mouse, and floppy disk.
• SCSI Bus: standard I/O devices
• SBus: high-speed I/O devices
• PCI Bus: compatible with PCs
• External Bus: low-speed I/O devices
Architecture Design of a Processor — Outline
• The design of a processor and its major components
• The CPU execution cycle
• Single-cycle and multiple-cycle datapath design
• The pipelined CPU
• Data, control, and structural hazards
• Advanced CPU architecture
• The memory hierarchy
• The need for cache memory
• The status and the future
Processor Design is a Process
• Bottom-up: assemble components in the target technology to establish critical timing.
• Top-down: specify component behavior from high-level requirements.
• Iterative refinement: establish a partial solution, then expand and improve it.

Instruction Set Architecture => processor (datapath + control) => register file, mux, ALU, registers, memory, decoder, sequencer => cells and gates.
Execution Cycle
• Instruction Fetch: obtain the instruction from program storage.
• Instruction Decode: determine the required actions and the instruction size.
• Operand Fetch: locate and obtain the operand data.
• Execute: compute the result value or status.
• Result Store: deposit results in storage for later use.
• Next Instruction: determine the successor instruction.
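The six steps above can be sketched as an interpreter loop. This is a minimal illustration for an invented toy accumulator machine, not any real ISA; the opcode names and instruction format are assumptions.

```python
def run(program, memory):
    """Interpret a toy accumulator ISA, walking the execution cycle."""
    pc, acc = 0, 0
    while True:
        op, addr = program[pc]          # 1. Instruction Fetch, 2. Decode
        if op == "HALT":
            return acc
        if op in ("LOAD", "ADD"):
            value = memory[addr]        # 3. Operand Fetch
        if op == "LOAD":                # 4. Execute
            acc = value
        elif op == "ADD":
            acc = acc + value
        elif op == "STORE":             # 5. Result Store
            memory[addr] = acc
        pc += 1                         # 6. Next Instruction

mem = [3, 4, 0]
prog = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", 0)]
print(run(prog, mem), mem)  # 7 [3, 4, 7]
```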
A Single-Cycle Datapath
• We have every datapath element; all that remains is to generate the control signals.
  – Today's lecture will show you how to generate the control signals.

(Figure: the 32-bit datapath. The instruction fetch unit produces Instruction<31:0>; fields <21:25>, <16:20>, <11:15>, and <0:15> supply Rs, Rt, Rd, and imm16. A 32 x 32-bit register file (ports Rw, Ra, Rb; buses busW, busA, busB), an extender for imm16, the ALU, and the data memory (Data In, Adr, WrEn) are connected through multiplexers. The control signals are nPC_sel, RegDst, RegWr, ALUSrc, ExtOp, ALUctr, MemWr, and MemtoReg.)
The Truth Table for the Main Control

  Signal            R-type    ori       lw        sw        beq       jump
  op<5:0>           00 0000   00 1101   10 0011   10 1011   00 0100   00 0010
  RegDst            1         0         0         x         x         x
  ALUSrc            0         1         1         1         0         x
  MemtoReg          0         0         1         x         x         x
  RegWrite          1         1         1         0         0         0
  MemWrite          0         0         0         1         0         0
  Branch            0         0         0         0         1         0
  Jump              0         0         0         0         0         1
  ExtOp             x         0         1         1         x         x
  ALUop (Symbolic)  R-type    Or        Add       Add       Subtract  xxx
  ALUop<2>          1         0         0         0         0         x
  ALUop<1>          0         1         0         0         0         x
  ALUop<0>          0         0         0         0         1         x

The Main Control decodes the 6-bit op field into RegDst, ALUSrc, and the other signals, plus the 3-bit ALUop; a local ALU Control then combines ALUop with the 6-bit func field to produce the 3-bit ALUctr.
Systematic Generation of Control
• In our single-cycle processor, each instruction is realized by exactly one control command, or "microinstruction".
  – In general, the controller is a finite state machine.
  – A microinstruction can also control sequencing (see later).

(Figure: the control logic/store — a PLA or ROM — takes the decoded opcode and condition inputs and drives the datapath's control points with a microinstruction.)
The Big Picture: Where Are We Now?
• The five classic components of a computer: the Processor (Control + Datapath), Memory, Input, and Output.
• Today's topic: designing the datapath for the multiple-clock-cycle implementation.
Pipelining is Natural!
• Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, fold, and stash.
• The washer takes 30 minutes.
• The dryer takes 30 minutes.
• The folder takes 30 minutes.
• The stasher takes 30 minutes to put clothes into drawers.
Sequential Laundry
• Sequential laundry takes 8 hours for 4 loads: each load runs its four 30-minute steps to completion (6 PM to 2 AM) before the next load starts.
• If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start Work ASAP
• Pipelined laundry takes only 3.5 hours for 4 loads: load B starts washing as soon as load A moves to the dryer, so a load finishes every 30 minutes once the pipeline is full.
Pipelining Lessons
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
• Multiple tasks operate simultaneously using different resources.
• Potential speedup = number of pipe stages.
• The pipeline rate is limited by the slowest pipeline stage.
• Unbalanced pipe-stage lengths reduce the speedup.
• The time to fill the pipeline and the time to drain it reduce the speedup.
• Stall for dependences.
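The laundry arithmetic can be checked with a short sketch; the function names and parameters are ours, and the pipelined formula assumes balanced stages with no stalls.

```python
def sequential_time(n_tasks, n_stages, stage_time):
    # Each task runs all of its stages before the next task may start.
    return n_tasks * n_stages * stage_time

def pipelined_time(n_tasks, n_stages, stage_time):
    # Fill the pipeline once (n_stages steps for the first task), then
    # one task completes per stage time.
    return (n_stages + n_tasks - 1) * stage_time

# 4 loads, 4 stages (wash/dry/fold/stash), 30 minutes per stage:
print(sequential_time(4, 4, 30) / 60)  # 8.0 hours
print(pipelined_time(4, 4, 30) / 60)   # 3.5 hours
```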
The Five Stages of Load
• Ifetch: fetch the instruction from the instruction memory.
• Reg/Dec: fetch the registers and decode the instruction.
• Exec: calculate the memory address.
• Mem: read the data from the data memory.
• Wr: write the data back to the register file.

Cycle 1: Ifetch | Cycle 2: Reg/Dec | Cycle 3: Exec | Cycle 4: Mem | Cycle 5: Wr
Graphically Representing Pipelines
• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
• Use this representation to help understand datapaths.
Single Cycle, Multiple Cycle, vs. Pipeline
• Single-cycle implementation: every instruction takes one long clock cycle (cycle 1: Load; cycle 2: Store), so short instructions such as R-type waste the rest of their cycle.
• Multiple-cycle implementation: Load takes five short cycles (Ifetch, Reg, Exec, Mem, Wr); Store and R-type finish in fewer.
• Pipelined implementation: Load, Store, and R-type overlap, with a new instruction entering Ifetch each cycle.
Pipelined Execution
• Utilization? With instructions flowing through IFetch, Dcd, Exec, Mem, and WB one stage apart, every unit is busy every cycle once the pipeline fills.
• Now we just have to make it work.
Why Pipeline?
• Suppose we execute 100 instructions.
• Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle machine: 10 ns/cycle x 4.6 CPI (due to instruction mix) x 100 inst = 4600 ns
• Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycles drain) = 1040 ns
Why Pipeline? Because the resources are there!

(Figure: five instructions, Inst 0 through Inst 4, each flowing through instruction memory (Im), register read (Reg), ALU, data memory (Dm), and register write-back (Reg) in successive clock cycles, so every resource is in use every cycle.)
Can pipelining get us into trouble? Yes: pipeline hazards.
• Structural hazards: attempting to use the same resource in two different ways at the same time.
  – E.g., a combined washer/dryer would be a structural hazard, as would the folder being busy doing something else (watching TV).
• Data hazards: attempting to use an item before it is ready.
  – E.g., one sock of a pair is in the dryer and one is in the washer; you can't fold until you get the sock from the washer through the dryer.
  – An instruction depends on the result of a prior instruction still in the pipeline.
• Control hazards: attempting to make a decision before the condition is evaluated.
  – E.g., washing football uniforms and needing the proper detergent level; you need to see the result after the dryer before starting the next load.
  – Branch instructions.
• We can always resolve hazards by waiting:
  – pipeline control must detect the hazard,
  – then take action (or delay action) to resolve it.
Data Hazard on r1
• Dependencies backwards in time are hazards.

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

(Figure: each instruction passes through IF, ID/RF, EX, MEM, and WB one cycle apart; sub, and, and or read r1 in ID before add has written it back in WB.)
Issues in Pipelined Design

  Approach         Description                                          Limitation
  Pipelining       issue one instruction per cycle                      issue rate, FU stalls, FU depth
  Super-pipeline   issue one instruction per (fast) cycle;              clock skew, FU stalls, FU depth
                   the ALU takes multiple cycles
  Super-scalar     issue multiple scalar instructions per cycle         hazard resolution
  VLIW             each instruction specifies multiple scalar           packing
                   operations; the compiler determines the parallelism
  Vector           each instruction specifies a series of               applicability
                   identical operations

(The original slide illustrates each approach, including the VLIW and vector processors, with IF-D-Ex-M-W stage diagrams.)
Limits of Superscalar
• While an integer/FP split is simple for the hardware, a CPI of 0.5 is reached only for programs with:
  – exactly 50% FP operations
  – no hazards
• The more instructions that issue at the same time, the greater the difficulty of decode and issue.
  – Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue.
• VLIW: trade instruction space for simple decoding.
  – The long instruction word has room for many operations.
  – By definition, all the operations the compiler puts in the long instruction word can execute in parallel.
  – E.g., 2 integer operations, 2 FP operations, 2 memory references, and 1 branch.
    • At 16 to 24 bits per field, that is 7 x 16 = 112 bits to 7 x 24 = 168 bits.
  – Needs a compiling technique that schedules across several branches.
Software Pipelining Example

Before: unrolled 3 times
 1  LD    F0,0(R1)
 2  ADDD  F4,F0,F2
 3  SD    0(R1),F4
 4  LD    F6,-8(R1)
 5  ADDD  F8,F6,F2
 6  SD    -8(R1),F8
 7  LD    F10,-16(R1)
 8  ADDD  F12,F10,F2
 9  SD    -16(R1),F12
10  SUBI  R1,R1,#24
11  BNEZ  R1,LOOP

After: software pipelined
 1  SD    0(R1),F4     ; stores M[i]
 2  ADDD  F4,F0,F2     ; adds to M[i-1]
 3  LD    F0,-16(R1)   ; loads M[i-2]
 4  SUBI  R1,R1,#8
 5  BNEZ  R1,LOOP

• Symbolic loop unrolling:
  – less code space
  – fill & drain the pipe only once, vs. on each iteration with loop unrolling
Software Pipelining
• Observation: if the iterations of a loop are independent, we can get ILP by taking instructions from different iterations.
• Software pipelining reorganizes a loop so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in software).

(Figure: iterations 0 through 4 overlapped in time; one software-pipelined iteration draws its instructions from several of them.)
Unrolled Loop that Minimizes Stalls for Scalar

 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16  ; 8-32 = -24

14 clock cycles, or 3.5 per iteration
(Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles)
Loop Unrolling in VLIW

  Memory ref 1     Memory ref 2     FP op 1          FP op 2          Int op / branch  Clock
  LD F0,0(R1)      LD F6,-8(R1)                                                        1
  LD F10,-16(R1)   LD F14,-24(R1)                                                      2
  LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2                     3
  LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2                   4
                                    ADDD F20,F18,F2  ADDD F24,F22,F2                   5
  SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                    6
  SD -16(R1),F12   SD -24(R1),F16                                                      7
  SD -32(R1),F20   SD -40(R1),F24                                    SUBI R1,R1,#48    8
  SD -0(R1),F28                                                      BNEZ R1,LOOP      9

• Unrolled 7 times to avoid delays.
• 7 results in 9 clocks, or 1.3 clocks per iteration.
• Need more registers in VLIW (EPIC => 128 int + 128 FP).
Levels of the Memory Hierarchy

  Level (upper → lower)  Capacity       Access time    Cost              Staging/xfer unit        Managed by
  CPU registers          100s of bytes  < 2 ns         —                 instr. operands, 1-8 B   program/compiler
  Cache (SRAM)           KBytes         2-100 ns       $.01-.001/bit     blocks, 8-128 B          cache controller
  Main memory (DRAM)     MBytes         100 ns - 1 us  $.01-.001         pages, 512 B - 4 KB      OS
  Disk                   GBytes         ms             10^-3 - 10^-4 c   files, MBytes            user/operator
  Tape                   infinite       sec - min      10^-6 c           —                        —

The upper levels are faster; the lower levels are larger.
Processor-DRAM Gap (Latency)

(Figure: performance on a log scale (1 to 1000) versus year, 1980-2000. Processor performance ("Moore's Law") grows at ~60%/yr while DRAM improves at only ~7%/yr, so the processor-memory performance gap grows ~50% per year.)
Memory Hierarchy
° The Principle of Locality:
  • Programs access a relatively small portion of the address space at any instant of time.
    – Temporal locality: locality in time.
    – Spatial locality: locality in space.
° Three major categories of cache misses:
  • Compulsory misses: sad facts of life. Example: cold-start misses.
  • Conflict misses: reduced by increasing cache size and associativity.
  • Capacity misses: reduced by increasing cache size.
° Virtual memory was invented as another level of the hierarchy.
  – Today VM allows many processes to share a single memory without having to swap all processes to disk; protection is now the more important role.
  – TLBs are important for fast translation/checking.
Reducing Misses
• Classifying misses: the 3 Cs
  – Compulsory: the first access to a block cannot be in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses. (These are misses even in an infinite cache.)
  – Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur as blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
  – Conflict: if the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision or interference misses. (Misses in an N-way set-associative cache of size X.)
How Can We Reduce Misses?
• 3 Cs: compulsory, capacity, conflict.
• In all cases, assume the total cache size is unchanged. What happens if we:
  1) change the block size — which of the 3 Cs is obviously affected?
  2) change the associativity — which of the 3 Cs is obviously affected?
  3) change the compiler — which of the 3 Cs is obviously affected?
The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two different types of locality:
  – Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  – Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
• For the last 15 years, hardware has relied on locality for speed.
Memory Hierarchy: Terminology
• Hit: the data appears in some block in the upper level (example: block X).
  – Hit rate: the fraction of memory accesses found in the upper level.
  – Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: the data must be retrieved from a block in the lower level (block Y).
  – Miss rate = 1 - hit rate.
  – Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit time << miss penalty (a miss costs 500 instructions on the 21264!)
Cache Measures
• Hit rate: the fraction of accesses found in that level.
  – Usually so high that we talk about the miss rate instead.
  – Miss-rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance — an indirect measure that can mislead.
• Average memory access time = hit time + miss rate x miss penalty (in ns or clocks).
• Miss penalty: time to replace a block from the lower level, including the time to replace it in the CPU.
  – Access time: time to reach the lower level = f(latency to lower level).
  – Transfer time: time to transfer the block = f(bandwidth between the upper and lower levels).
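The average memory access time formula is easy to exercise; the numbers below are illustrative, not from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time = hit time + miss rate x miss penalty.
    return hit_time + miss_rate * miss_penalty

# E.g., a 1-cycle hit, 5% miss rate, 50-cycle miss penalty:
print(amat(1, 0.05, 50))  # 3.5 cycles
```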
Simplest Cache: Direct Mapped

(Figure: a 16-location memory, addresses 0 through F, mapped onto a 4-byte direct-mapped cache with cache indexes 0 through 3.)

• Location 0 can be occupied by data from:
  – memory locations 0, 4, 8, ... etc.
  – in general, any memory location whose 2 LSBs are 0s.
  – Address<1:0> => cache index.
• Which one should we place in the cache?
• How can we tell which one is in the cache?
1 KB Direct Mapped Cache, 32-Byte Blocks
• For a 2^N-byte cache:
  – the uppermost (32 - N) bits of the address are always the cache tag.
  – the lowest M bits are the byte select (block size = 2^M).

(Figure: a 32-bit address split into cache tag <31:10> (example: 0x50), cache index <9:5> (example: 0x01), and byte select <4:0> (example: 0x00). Each of the 32 cache entries holds a valid bit and the cache tag, stored as part of the cache state, alongside a 32-byte block of cache data: bytes 0-31 at index 0, bytes 32-63 at index 1, and so on up to byte 1023.)
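The field split can be sketched directly. The helper below is ours, not from the slides; the 1 KB cache with 32-byte blocks gives M = 5 byte-select bits, 5 index bits, and 32 - 10 = 22 tag bits.

```python
def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Split an address into (tag, index, byte_select) for a
    direct-mapped cache; sizes must be powers of two."""
    m = block_bytes.bit_length() - 1   # byte-select bits: 2**M = block size
    n = cache_bytes.bit_length() - 1   # index + select bits: 2**N = cache size
    byte_select = addr & (block_bytes - 1)
    index = (addr >> m) & (cache_bytes // block_bytes - 1)
    tag = addr >> n
    return tag, index, byte_select

# Rebuild the slide's example: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(f) for f in split_address(addr)])  # ['0x50', '0x1', '0x0']
```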
Two-Way Set Associative Cache
• N-way set associative: N entries for each cache index.
  – N direct-mapped caches operate in parallel (N is typically 2 to 4).
• Example: a two-way set-associative cache.
  – The cache index selects a set from the cache.
  – The two tags in the set are compared in parallel.
  – Data is selected based on the tag comparison result.

(Figure: two banks of valid bit / cache tag / cache data entries share one cache index. The address tag is compared against both stored tags; the two comparison results are ORed into the Hit signal and drive a mux (Sel1, Sel0) that selects the matching cache block.)
3Cs Relative Miss Rate

(Figure: miss rate per type, normalized to 100%, versus cache size from 1 KB to 128 KB, with associativities of 1-way, 2-way, 4-way, and 8-way; each bar is divided into compulsory, capacity, and conflict components.)

• Flaw: holds the block size fixed.
• Good: the insight led to invention.
Why Do We Need Cache Memory?

  Configuration      L1    L2     CPU clock  Speed
  P II w/o cache     0     0      400 MHz    10 MIPS
  386 w/ L1 cache    8K    0      33 MHz     27 MIPS
  Celeron 333        32K   0      333 MHz    330/100 MIPS
  P II 350           32K   512K   350 MHz    330/240 MIPS
Main Memory Performance
• Simple: CPU, cache, bus, and memory are all the same width (32 bits).
• Wide: the CPU/mux path is 1 word; the mux/cache, bus, and memory are N words (Alpha: 64 bits and 256 bits).
• Interleaved: CPU, cache, and bus are 1 word; memory is N modules (4 modules here); the example is word-interleaved.

Timing model: 1 cycle to send the address, 6 cycles of access time, 1 cycle to send a word of data; the cache block is 4 words.
  Simple miss penalty      = 4 x (1 + 6 + 1) = 32
  Wide miss penalty        = 1 + 6 + 1 = 8
  Interleaved miss penalty = 1 + 6 + 4 x 1 = 11
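The three miss penalties follow directly from the timing model; a short sketch (function names are ours):

```python
ADDR, ACCESS, XFER = 1, 6, 1  # cycles: send address, access, send one word

def simple_penalty(words):
    # One full address/access/transfer round trip per word.
    return words * (ADDR + ACCESS + XFER)

def wide_penalty():
    # The whole block comes back in a single wide transfer.
    return ADDR + ACCESS + XFER

def interleaved_penalty(words):
    # The modules' access times overlap; words return one per cycle.
    return ADDR + ACCESS + words * XFER

print(simple_penalty(4), wide_penalty(), interleaved_penalty(4))  # 32 8 11
```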
I/O System Design Issues

(Figure: the processor and its cache connect through a memory-I/O bus to main memory and to I/O controllers; the controllers drive disks, graphics, and the network, and signal the processor with interrupts.)

• Systems have a hierarchy of buses as well (PC: memory, PCI, ESA).
Key Technologies
° Fast, cheap, highly integrated computers-on-a-chip
  • IDT R4640, NEC VR4300, StrongARM, Superchips
° Affordable access to fast networks
  • ISDN, cable modems, ATM, ...
° Platform-independent programming languages
  • Java, JavaScript, Visual Basic Script
° Lightweight operating systems
  • GEOS, NCOS, RISCOS
° ???
Future of Computer Architecture and Engineering
• Performance
• High Level Computer Architecture
• Multiprocessors
• IRAM
Processor Performance

(Figure: performance (0-300) versus year, 1982-1995. After the introduction of RISC, performance grows at roughly 35%/yr, with RISC processors pulling ahead of the Intel x86 line.)
SPECfp95 Base Performance (Oct. 1997)

(Figure: SPECfp base results, 0-60, on the benchmarks tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, and wave5, comparing the PA-8000, the Alpha 21164, and the Pentium Pro (PPro).)
1985 Computer Food Chain
PC → workstation → minicomputer → mainframe → vector supercomputer ("big iron")
1995 Computer Food Chain
PC, workstation, mainframe, vector supercomputer, massively parallel processors, minicomputer — with some categories "hitting the wall soon" and facing a "bleak" future.
2005 Computer Food Chain
Mainframe, vector supercomputer, minicomputer, portable computers, networks of workstations/PCs, and massively parallel processors.
Interconnection Networks
° Switched vs. shared media: can pairs communicate at the same time over point-to-point connections?

(Figure: three organizations built from processors (P), memories (M), and network interfaces (NI):
• SMP — several processors share one memory and I/O bus; general purpose.
• Cluster/Network of Workstations (NOW) — complete computers, each with P, M, NI, and disk, on a slow, scalable network; incremental scalability, timeliness.
• Distributed computing / MPP — P-M-NI-disk nodes on a fast, switched network; fast communication.)
Intelligent DRAM (IRAM)
• IRAM motivation (2000 to 2005):
  – 256 Mbit / 1 Gbit DRAMs in the near future (128 MByte)
  – current CPUs are starved for memory bandwidth
  – on-chip memory BW = SQRT(size)/RAS, or 80 GB/sec
  – 1% of a Gbit DRAM = 10M transistors for a processor
  – even in a DRAM process, a 10M-transistor CPU is attractive
  – the package could be a network interface instead of address/data pins
  – embedded computers are increasingly important
• Why not re-examine the computer design that is based on the separation of memory and processor?
  – compact code & data?
  – vector instructions?
  – operating systems? compilers? data structures?
IRAM Vision Statement

Microprocessor & DRAM on a single chip:
• on-chip memory latency 5-10X better, bandwidth 50-100X better
• energy efficiency improved 2X-4X (no off-chip bus)
• serial I/O 5-10X faster than buses
• smaller board area/volume
• adjustable memory size/width

(Figure: today, a processor with caches from a logic fab sits on a bus beside separate DRAM chips from a DRAM fab; in the IRAM vision, the processor, L2 cache, DRAM, and I/O share a single chip.)
And why not:
° multiprocessors on a chip?
° complete systems on a chip (memory + processor + I/O)?
° computers in your credit card?
° networking in your kitchen? your car?
° eye-tracking input devices?