CS 152 Computer Architecture and Engineering
Lecture 22: Final Lecture
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152
5/6/2008 2CS152-Spring’08
Today’s Lecture
• Review entire semester
  – What you learned
• Follow-on classes
• What’s next in computer architecture?
The New CS152 Executive Summary (what was promised in lecture 1)
The processor your predecessors built in CS152
What you’ll understand and experiment with in the new CS152
Plus, the technology behind chip-scale multiprocessors (CMPs)
From Babbage to IBM 650
IBM 360: Initial Implementations
                 Model 30         ...   Model 70
Storage          8K – 64 KB             256K – 512 KB
Datapath         8-bit                  64-bit
Circuit Delay    30 nsec/level          5 nsec/level
Local Store      Main Store             Transistor Registers
Control Store    Read only 1 µsec       Conventional circuits
IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models.
Milestone: The first true ISA designed as portable hardware-software interface!
With minor modifications it still survives today!
Microcoded Microarchitecture
(Figure: a µcoded controller (ROM) holds fixed microcode instructions and drives the Datapath; it receives the opcode and zero?/busy? status and asserts controls such as enMem and MemWrt; the Memory (RAM), accessed via Addr/Data, holds the user program written in macrocode instructions (e.g., MIPS, x86, etc.).)
Implementing Complex Instructions
(Figure: bus-based microcoded datapath — memory (addr/data), a register file of 32 GPRs plus PC (register 32) and Link (register 31), a 32-bit ALU with input registers A and B, the IR, and an immediate extender all share one bus; control signals include ldIR, ldA, ldB, ldMA, RegSel, RegWrt, enReg, enMem, MemWrt, OpSel, enALU, enImm, and ExtSel, with opcode and busy?/zero? status fed back to the controller.)

rd ← M[(rs)] op (rt)            Reg-Memory-src ALU op
M[(rd)] ← (rs) op (rt)          Reg-Memory-dst ALU op
M[(rd)] ← M[(rs)] op M[(rt)]    Mem-Mem ALU op
From CISC to RISC

• Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
  – Can change contents of fast instruction memory to fit what application needs right now
• Use simple ISA to enable hardwired pipelined implementation
  – Most compiled code only used a few of the available CISC instructions
  – Simpler encoding allowed pipelined implementations
• Further benefit with integration
  – In early ’80s, can fit 32-bit datapath + small caches on a single chip
  – No chip crossings in common case allows faster operation
Nanocoding
• MC68000 had 17-bit µcode containing either a 10-bit µjump or a 9-bit nanoinstruction pointer
  – Nanoinstructions were 68 bits wide, decoded to give 196 control signals

(Figure: User PC → Inst. Cache → Hardwired Decode selects the µcode PC (state); the µcode ROM supplies a next-state address and a nanoaddress into the nanoinstruction ROM, whose data drives the control signals.)

Exploits recurring control signal patterns in µcode, e.g., many µroutines begin the same way:
  ALU0:  A ← Reg[rs] ...
  ALUi0: A ← Reg[rs] ...
“Iron Law” of Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

– Instructions per program depends on source code, compiler technology, and ISA
– Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
– Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture          CPI   Cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long
Pipelined                  1     short
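The Iron Law can be turned into a few lines of arithmetic. A minimal Python sketch, using made-up instruction counts, CPIs, and cycle times purely to illustrate the trade-off in the table above:

```python
# Iron Law: Time/Program = (Insts/Program) x (Cycles/Inst) x (Time/Cycle)
def exec_time_ns(instructions, cpi, cycle_time_ns):
    return instructions * cpi * cycle_time_ns

# Hypothetical numbers, chosen only to mirror the table:
insts = 1_000_000
microcoded   = exec_time_ns(insts, cpi=4.0, cycle_time_ns=1.0)  # CPI > 1, short cycle
single_cycle = exec_time_ns(insts, cpi=1.0, cycle_time_ns=5.0)  # CPI = 1, long cycle
pipelined    = exec_time_ns(insts, cpi=1.0, cycle_time_ns=1.0)  # CPI ~ 1, short cycle

assert pipelined < microcoded and pipelined < single_cycle
```

Pipelining wins on both factors at once: it keeps the microcoded design's short cycle while restoring the single-cycle design's CPI of 1.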
5-Stage Pipelined Execution
time          t0   t1   t2   t3   t4   t5   t6   t7  . . . .
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5

Stages: I-Fetch (IF); Decode, Reg. Fetch (ID); Execute (EX); Memory (MA); Write-Back (WB)

(Figure: 5-stage datapath — PC and IR; an adder increments PC by 0x4; Inst. Memory (addr → rdata); GPRs with read ports rs1/rs2 → rd1/rd2 and write port ws/wd gated by we; immediate extender ImmExt; ALU; Data Memory with addr, wdata, rdata, and we.)
Pipeline Hazards
• Pipelining instructions is complicated by HAZARDS:
  – Structural hazards (two instructions want same hardware resource)
  – Data hazards (earlier instruction produces value needed by later instruction)
  – Control hazards (instruction changes control flow, e.g., branches or exceptions)
• Techniques to handle hazards:
  – Interlock (hold newer instruction until older instructions drain out of pipeline)
  – Bypass (transfer value from older instruction to newer instruction as soon as available somewhere in machine)
  – Speculate (guess effect of earlier instruction)
• Speculation needs predictor, prediction check, and recovery mechanism
Exception Handling 5-Stage Pipeline
(Figure: pipeline PC → Inst. Mem → Decode (D) → Execute (E) → Data Mem (M) → Writeback (W). Exception sources: PC address exception at fetch, illegal opcode at decode, overflow at execute, data address exceptions at memory, and asynchronous interrupts. Exception flags (ExcD, ExcE, ExcM) and PCs (PCD, PCE, PCM) travel down the pipeline to the commit point, where Cause and EPC are recorded, the handler PC is selected, and the F, D, and E stages and the writeback are killed.)
Processor-DRAM Gap (latency)
(Chart: performance vs. year, 1980–2000, log scale 1–1000. CPU performance (“Moore’s Law” curve) improves ~60%/year while DRAM improves only ~7%/year, so the processor-memory performance gap grows ~50%/year.)

Four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during time for one memory access!
Common Predictable Patterns
Two predictable properties of memory references:
– Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
– Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
Memory Reference Patterns
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
(Figure: memory address vs. time, one dot per access — repeated accesses to the same addresses form horizontal runs (temporal locality), while sweeps through neighboring addresses form diagonal bands (spatial locality).)
Causes for Cache Misses
• Compulsory: first reference to a block, a.k.a. cold-start misses
  – misses that would occur even with infinite cache
• Capacity: cache is too small to hold all data needed by the program
  – misses that would occur even under perfect replacement policy
• Conflict: misses that occur because of collisions due to block-placement strategy
  – misses that would not occur with full associativity
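The three C’s can be illustrated with a toy cache model. A hedged sketch (trace, sizes, and block size are all made up): it simulates a direct-mapped cache and separates compulsory misses from the rest; fully classifying capacity vs. conflict misses would additionally require replaying the trace against a fully associative LRU cache of the same size.

```python
# Direct-mapped cache model: count hits, compulsory misses, and other
# (conflict/capacity) misses for a trace of byte addresses.
def simulate(trace, num_sets, block_bytes):
    cache = {}              # set index -> resident tag
    seen = set()            # blocks ever touched (for compulsory misses)
    hits = compulsory = other = 0
    for addr in trace:
        block = addr // block_bytes
        idx, tag = block % num_sets, block // num_sets
        if cache.get(idx) == tag:
            hits += 1
        elif block not in seen:
            compulsory += 1          # first-ever reference to this block
        else:
            other += 1               # block was evicted earlier
        cache[idx] = tag
        seen.add(block)
    return hits, compulsory, other

# Blocks 0 and 4 collide in set 0 of a 4-set cache with 16 B blocks:
trace = [0, 64, 0, 64, 0, 64]
assert simulate(trace, num_sets=4, block_bytes=16) == (0, 2, 4)  # ping-pong
assert simulate(trace, num_sets=8, block_bytes=16) == (4, 2, 0)  # no collision
```

Doubling the number of sets removes every non-compulsory miss here, which is exactly what marks them as conflict rather than capacity misses.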
A Typical Memory Hierarchy c.2006
(Figure: CPU with multiported register file (part of CPU) → split L1 instruction and data caches (on-chip SRAM) → large unified L2 cache (on-chip SRAM) → multiple interleaved memory banks (DRAM).)
Modern Virtual Memory Systems: Illusion of a large, private, uniform store

Protection & Privacy: several users, each with their private address space and one or more shared address spaces (page table ≡ name space)

Demand Paging: provides the ability to run programs larger than the primary memory; hides differences in machine configurations

The price is address translation on each memory reference.

(Figure: OS and user_i addresses are translated — VA → mapping (via TLB) → PA — into primary memory, backed by a swapping store.)
Hierarchical Page Table
(Figure: a processor register holds the root of the current page table, which points to the Level 1 page table; each level-1 PTE points to a Level 2 page table, whose PTEs point to data pages. A page may be in primary memory or in secondary memory, and a PTE may mark a nonexistent page.)

Virtual address fields: p1 = 10-bit L1 index (bits 31–22) | p2 = 10-bit L2 index (bits 21–12) | offset (bits 11–0)
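The two-level walk can be sketched in a few lines. A hypothetical Python model, with dictionaries standing in for the page tables and `None` standing in for a page-fault trap:

```python
# Dictionaries stand in for the level-1 and level-2 page tables; a
# missing entry models a nonexistent or swapped-out page (page fault).
def translate(root, vaddr):
    p1 = (vaddr >> 22) & 0x3FF     # 10-bit L1 index (bits 31-22)
    p2 = (vaddr >> 12) & 0x3FF     # 10-bit L2 index (bits 21-12)
    offset = vaddr & 0xFFF         # 12-bit page offset
    l2 = root.get(p1)              # level-1 PTE -> level-2 table
    if l2 is None:
        return None                # page fault: no L2 table
    ppn = l2.get(p2)               # level-2 PTE -> physical page number
    if ppn is None:
        return None                # page fault: page not resident
    return (ppn << 12) | offset

# Hypothetical mapping: virtual page (p1=1, p2=2) -> physical page 7
root = {1: {2: 7}}
vaddr = (1 << 22) | (2 << 12) | 0x34
assert translate(root, vaddr) == 0x7034
assert translate(root, 0) is None      # unmapped address faults
```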
Address Translation & Protection
• Every instruction and data access needs address translation and protection checks

A good VM design needs to be fast (~ one cycle) and space-efficient → Translation Lookaside Buffer (TLB)

(Figure: Virtual Address = Virtual Page No. (VPN) + offset → Address Translation → Physical Address = Physical Page No. (PPN) + offset; a Protection Check against Kernel/User mode and Read/Write raises an exception on violation.)
Address Translation in CPU Pipeline
• Software handlers need restartable exception on page fault or protection violation
• Handling a TLB miss needs a hardware or software mechanism to refill TLB
• Need mechanisms to cope with the additional latency of a TLB:
– slow down the clock
– pipeline the TLB and cache access
– virtual address caches
– parallel TLB/cache access
(Figure: pipeline with an Inst TLB in front of the Inst. Cache and a Data TLB in front of the Data Cache; either TLB access can raise a TLB miss, page fault, or protection violation.)
Concurrent Access to TLB & Cache
The index bits L come from the untranslated part of the address, so the index is available without consulting the TLB: cache and TLB accesses can begin simultaneously, and the tag comparison is made after both accesses are completed.

Cases: L + b = k, L + b < k, L + b > k
(L = index bits, b = block-offset bits, k = page-offset bits)

(Figure: VA = VPN | L | b; the virtual index selects one of 2^L direct-mapped cache blocks of 2^b bytes while the TLB translates VPN → PPN; hit? = (physical tag = PPN).)
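The case analysis can be checked numerically. A small sketch (cache and page parameters are illustrative) that tests whether the index and block-offset bits fit within the page offset, the condition for cache and TLB access to proceed safely in parallel:

```python
import math

# Parallel TLB/cache access is safe when index bits (L) + block-offset
# bits (b) come entirely from the untranslated page offset (k bits).
def can_access_in_parallel(cache_bytes, block_bytes, ways, page_bytes):
    sets = cache_bytes // (block_bytes * ways)
    L = int(math.log2(sets))         # index bits
    b = int(math.log2(block_bytes))  # block-offset bits
    k = int(math.log2(page_bytes))   # page-offset bits
    return L + b <= k

# 16 KB direct-mapped cache, 32 B blocks, 4 KB pages: L+b = 9+5 > 12
assert can_access_in_parallel(16 * 1024, 32, 1, 4096) is False
# Same capacity, 4-way set-associative: L+b = 7+5 = 12 <= 12
assert can_access_in_parallel(16 * 1024, 32, 4, 4096) is True
```

This is one reason L1 caches often raise associativity rather than capacity: extra ways shrink the index, keeping it inside the page offset.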
CS152 Administrivia
• Lab 4 competition winners!
• Quiz 6 on Thursday, May 8– L19-21, PS 6, Lab 6
• Last 15 minutes, course survey
  – HKN survey
– Informal feedback survey for those who’ve not done it already
• Quiz 5 results
Complex Pipeline Structure
(Figure: IF → ID → Issue, reading GPRs and FPRs, feeding parallel units — ALU, Mem, Fadd, Fmul, Fdiv — which write back at WB.)
Superscalar In-Order Pipeline
• Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and other is floating-point
• Inexpensive way of increasing throughput, examples include Alpha 21064 (1992) & MIPS R5000 series (1996)
• Same idea can be extended to wider issue by duplicating functional units (e.g. 4-issue UltraSPARC) but register file ports and bypassing costs grow quickly
(Figure: dual-issue pipeline — 2 PCs fetch from Inst. Mem into Dual Decode; the integer/memory pipe runs X1 → X2 → Data Mem → W against the GPRs; the floating-point pipe reads FPRs into X1 → Fadd X2 → X3 → W or X1 → Fmul X2 → X3 → W, plus an unpipelined FDiv; the commit point is at W.)
Types of Data Hazards
Consider executing a sequence of instructions of the form rk ← (ri) op (rj).

Data dependence:     r3 ← (r1) op (r2)     Read-after-Write
                     r5 ← (r3) op (r4)     (RAW) hazard

Anti-dependence:     r3 ← (r1) op (r2)     Write-after-Read
                     r1 ← (r4) op (r5)     (WAR) hazard

Output dependence:   r3 ← (r1) op (r2)     Write-after-Write
                     r3 ← (r6) op (r7)     (WAW) hazard
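The three hazard types follow mechanically from the destination and source registers of the two instructions. A minimal Python classifier reproducing the three examples above:

```python
# Classify data hazards between an older and a newer instruction, each
# given as (destination_register, set_of_source_registers).
def hazards(older, newer):
    old_dst, old_srcs = older
    new_dst, new_srcs = newer
    found = []
    if old_dst in new_srcs:
        found.append("RAW")   # newer reads what older writes
    if new_dst in old_srcs:
        found.append("WAR")   # newer writes what older reads
    if new_dst == old_dst:
        found.append("WAW")   # both write the same register
    return found

# r3 <- r1 op r2 ; r5 <- r3 op r4  => RAW on r3
assert hazards(("r3", {"r1", "r2"}), ("r5", {"r3", "r4"})) == ["RAW"]
# r3 <- r1 op r2 ; r1 <- r4 op r5  => WAR on r1
assert hazards(("r3", {"r1", "r2"}), ("r1", {"r4", "r5"})) == ["WAR"]
# r3 <- r1 op r2 ; r3 <- r6 op r7  => WAW on r3
assert hazards(("r3", {"r1", "r2"}), ("r3", {"r6", "r7"})) == ["WAW"]
```

Only RAW is a true dependence; WAR and WAW are name dependences that register renaming removes.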
Phases of Instruction Execution

(Figure: PC → I-cache → Fetch Buffer → Issue Buffer → Func. Units → Result Buffer → Arch. State)

Fetch: instruction bits retrieved from cache.
Decode: instructions placed in appropriate issue (aka “dispatch”) stage buffer.
Execute: instructions and operands sent to execution units. When execution completes, all results and exception flags are available.
Commit: instruction irrevocably updates architectural state (aka “graduation” or “completion”).
Pipeline Design with Physical Regfile
(Figure: in-order front end — PC with branch prediction, Fetch, Decode & Rename, Reorder Buffer; out-of-order execution — Branch Unit, ALU, and MEM (with store buffer and D$) reading and writing a Physical Reg. File; in-order Commit. Branch resolution updates the predictors and kills wrong-path instructions throughout the machine.)
Reorder Buffer Holds Active Instruction Window

Window contents (older instructions first):
  ld  r1, (r3)
  add r3, r1, r2
  sub r6, r7, r9
  add r3, r3, r6
  ld  r6, (r1)
  add r6, r6, r3
  st  r6, (r1)
  ld  r6, (r1)

(Figure: the same window shown at cycle t and cycle t+1 — between cycles, commit retires the oldest completed instruction, instructions in the middle execute, and fetch appends newer instructions at the tail.)
Branch History Table
4K-entry BHT, 2 bits/entry, ~80–90% correct predictions

(Figure: the fetch PC indexes the I-Cache and, with k low-order bits, a 2^k-entry BHT with 2 bits per entry; the fetched instruction’s opcode and offset identify a branch and its target PC, while the BHT entry supplies the Taken/¬Taken prediction.)
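Each BHT entry is a 2-bit saturating counter. A small Python sketch (table size and the loop-branch trace are illustrative):

```python
# 2-bit saturating counters: 0-1 predict not-taken, 2-3 predict taken;
# increment on a taken branch, decrement on a not-taken branch.
class TwoBitPredictor:
    def __init__(self, entries):
        self.table = [1] * entries       # start weakly not-taken

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.table)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 9 times, then falling through at loop exit:
p = TwoBitPredictor(4096)
correct = 0
for taken in [True] * 9 + [False]:
    correct += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(correct)   # mispredicts only at warm-up and at loop exit: 8 of 10
```

The 2-bit hysteresis is what keeps a single loop-exit mispredict from flipping the prediction for the next execution of the loop.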
Two-Level Branch Predictor

Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)

(Figure: k bits of the fetch PC index the BHT; a 2-bit global branch history shift register, shifting in the Taken/¬Taken result of each branch, selects which of the four sets of 2-bit entries provides the Taken/¬Taken prediction.)
Branch Target Buffer (BTB)
• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded

(Figure: a 2^k-entry direct-mapped BTB (can also be associative); k bits of the fetch PC index an entry holding a valid bit, the entry PC, and the predicted target PC; on a match the predicted target becomes the next I-Cache fetch address.)
Combining BTB and BHT

• BTB entries are considerably more expensive than BHT entries, but can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR)
• BHT can hold many more entries and is more accurate

Pipeline stages:
  A: PC Generation/Mux
  P: Instruction Fetch Stage 1
  F: Instruction Fetch Stage 2
  B: Branch Address Calc/Begin Decode
  I: Complete Decode
  J: Steer Instructions to Functional Units
  R: Register File Read
  E: Integer Execute

(Figure: the BTB acts during the A stage; the BHT, in a later pipeline stage, corrects when the BTB misses a predicted-taken branch. BTB/BHT are only updated after the branch resolves in the E stage.)
Sequential ISA Bottleneck

(Figure: sequential source code — e.g. a = foo(b); for (i=0, i< … — is fed to a superscalar compiler, which finds independent operations and schedules them, but must emit sequential machine code; a superscalar processor must then re-check instruction dependencies and re-schedule execution at run time.)
VLIW: Very Long Instruction Word
• Multiple operations packed into one instruction
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• Architecture requires guarantee of:
  – Parallelism within an instruction ⇒ no cross-operation RAW check
  – No data use before data ready ⇒ no data interlocks

Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
(two integer units, single-cycle latency; two load/store units, three-cycle latency; two floating-point units, four-cycle latency)
Scheduling Loop Unrolled Code
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

Schedule (unroll 4 ways):

(Figure: VLIW schedule across the Int1, Int2, M1, M2, FP+, FPx slots — the loads pair up on the two memory units, the four fadds follow on the FP adder as their operands arrive, then the stores pair up on the memory units, with add r1, add r2, and bne filling integer slots.)
Software Pipelining
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      add  r2, 32
      sd   f8, -8(r2)
      bne  r1, r3, loop

(Figure: software-pipelined schedule, unrolled 4 ways — a prolog issues the first iterations’ loads and fadds; in the steady-state loop body the machine overlaps loads from a later iteration, fadds from the previous one, and stores from the one before that; an epilog drains the remaining fadds and stores.)
Vector Programming Model
(Figure:
  Scalar registers r0–r15; vector registers v0–v15, each with elements [0], [1], [2], …, [VLRMAX-1]; the Vector Length Register (VLR) gives the number of active elements.

  Vector arithmetic instructions operate element-wise, e.g. ADDV v3, v1, v2 adds v1 and v2 into v3 for elements [0] … [VLR-1].

  Vector load and store instructions move a register to or from memory, e.g. LV v1, r1, r2 loads v1 starting at base address r1 with stride r2.)
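A machine with VLRMAX elements per register handles longer vectors by strip-mining: set VLR to the remaining length, capped at VLRMAX, on each trip around the loop. A Python sketch with a made-up VLRMAX:

```python
VLRMAX = 64          # hypothetical maximum vector length

def vector_add(a, b):
    """C = A + B for arbitrary-length vectors via strip-mining."""
    n, out, i = len(a), [0] * len(a), 0
    while i < n:
        vlr = min(VLRMAX, n - i)      # set the vector length register
        # one ADDV then performs vlr element-wise additions
        out[i:i + vlr] = [x + y for x, y in zip(a[i:i + vlr], b[i:i + vlr])]
        i += vlr                      # advance to the next strip
    return out

# 150 elements = two full 64-element strips plus a 22-element remainder
assert vector_add(list(range(150)), [1] * 150) == list(range(1, 151))
```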
Vector Unit Structure
(Figure: the vector registers and functional units are partitioned into four lanes sharing the memory subsystem — lane 0 holds elements 0, 4, 8, …; lane 1 holds elements 1, 5, 9, …; lane 2 holds elements 2, 6, 10, …; lane 3 holds elements 3, 7, 11, ….)
Vector Instruction Parallelism

Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes

(Figure: issuing one vector instruction per cycle, the load unit, multiply unit, and add unit each work on a different vector instruction at once, 8 lanes apiece.)

Complete 24 operations/cycle while issuing 1 short instruction/cycle
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on same pipeline
Interleave 4 threads, T1–T4, on a non-bypassed 5-stage pipe:

t0: T1: LW   r1, 0(r2)
t1: T2: ADD  r7, r1, r4
t2: T3: XORI r5, r4, #12
t3: T4: SW   0(r7), r5
t4: T1: LW   r5, 12(r1)

(Each instruction proceeds F, D, X, M, W in successive cycles t0–t9.)

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file.
Multithreaded Categories

(Figure: issue slots over time (in processor cycles) for five organizations — Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading — with slots colored by Thread 1–5 or left as idle slots.)
Power 4 vs. SMT in Power 5

(Figure: Power 4 and Power 5 pipelines; Power 5 adds SMT — 2 fetches (PC) and 2 initial decodes at the front end, 2 commits (architected register sets) at the back end.)
A Producer-Consumer Example
The program is written assuming instructions are executed in order.

Producer posting item x:
  Load  Rtail, (tail)
  Store (Rtail), x
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
  Load  Rhead, (head)
spin:
  Load  Rtail, (tail)
  if Rhead == Rtail goto spin
  Load  R, (Rhead)
  Rhead = Rhead + 1
  Store (head), Rhead
  process(R)

(Figure: producer and consumer communicate through a shared memory buffer indexed by tail and head pointers.)
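The producer and consumer can be mimicked directly in Python; the list and the two integers below are stand-ins for the shared buffer and the tail/head words in memory. The comment marks the store pair whose ordering the memory-model discussion that follows worries about:

```python
# Shared state: a list stands in for the buffer; two integers stand in
# for the tail and head words in memory.
buffer, tail, head = [], 0, 0

def produce(x):
    global tail
    buffer.append(x)       # Store (Rtail), x
    tail += 1              # Store (tail), Rtail
    # On a weakly ordered machine these two stores could be reordered,
    # letting the consumer see the new tail before the data is written.

def consume():
    global head
    while head == tail:    # spin: if Rhead == Rtail goto spin
        pass
    x = buffer[head]       # Load R, (Rhead)
    head += 1              # Store (head), Rhead
    return x               # process(R)

produce(42)
assert consume() == 42
```

Run in a single Python thread this always works; the point of the next slides is that on a real multiprocessor its correctness depends on the memory ordering model.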
Sequential Consistency: A Memory Model

“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program.”
— Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

(Figure: six processors P sharing a single memory M.)
Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?

T1:                        T2:
Store (X), 1    (X = 1)    Load  R1, (Y)
Store (Y), 11   (Y = 11)   Store (Y’), R1   (Y’ = Y)
                           Load  R2, (X)
                           Store (X’), R2   (X’ = X)

(The program-order constraints within each thread are the additional SC requirements.)
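One way to see what sequential consistency allows is to enumerate every order-preserving interleaving of the two threads and collect the resulting (Y’, X’) pairs. A brute-force Python sketch, with Y2/X2 standing in for Y’/X’:

```python
from itertools import combinations

# T1: Store (X),1 ; Store (Y),11
# T2: Load R1,(Y) ; Store (Y2),R1 ; Load R2,(X) ; Store (X2),R2
T1 = [("st", "X", 1), ("st", "Y", 11)]
T2 = [("ld", "R1", "Y"), ("str", "Y2", "R1"),
      ("ld", "R2", "X"), ("str", "X2", "R2")]

def run(order):
    """Execute one interleaving; return the final (Y2, X2) values."""
    mem = {"X": 0, "Y": 0, "X2": 0, "Y2": 0}
    regs = {}
    idx = {1: 0, 2: 0}
    for who in order:
        kind, a, b = (T1 if who == 1 else T2)[idx[who]]
        idx[who] += 1
        if kind == "st":
            mem[a] = b            # store an immediate to memory
        elif kind == "ld":
            regs[a] = mem[b]      # load memory into a register
        else:
            mem[a] = regs[b]      # store a register to memory
    return (mem["Y2"], mem["X2"])

outcomes = set()
for slots in combinations(range(6), 2):   # where T1's 2 ops fall among 6
    outcomes.add(run([1 if i in slots else 2 for i in range(6)]))

assert (11, 0) not in outcomes     # SC forbids Y' = 11 with X' = 0
print(sorted(outcomes))
```

Because T2 reads Y before X and T1 writes X before Y, any SC interleaving that sees the new Y must also see the new X; a weaker model that reorders either pair could expose the forbidden (11, 0) result.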
Mutual Exclusion and Locks
Want to guarantee only one process is active in a critical section

• Blocking atomic read-modify-write instructions
  e.g., Test&Set, Fetch&Add, Swap
vs
• Non-blocking atomic read-modify-write instructions
  e.g., Compare&Swap, Load-reserve/Store-conditional
vs
• Protocols based on ordinary Loads and Stores
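A blocking Test&Set lock can be sketched in Python, with a `threading.Lock` standing in only for the atomicity of the hardware read-modify-write; everything else is ordinary loads and stores:

```python
import threading

# Spinlock built from a simulated atomic Test&Set.
class SpinLock:
    def __init__(self):
        self.flag = 0
        self._rmw = threading.Lock()   # models hardware RMW atomicity only

    def test_and_set(self):
        with self._rmw:                # atomically: old = flag; flag = 1
            old, self.flag = self.flag, 1
            return old

    def acquire(self):
        while self.test_and_set() == 1:
            pass                       # spin until we observed flag == 0

    def release(self):
        self.flag = 0                  # ordinary store releases the lock

lock = SpinLock()
counter = 0

def worker():
    global counter
    for _ in range(250):
        lock.acquire()
        counter += 1                   # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 1000                 # no increments lost
```

On real hardware the release store would also need the right memory-ordering fence; the sketch elides that, as Python's interpreter provides it for free.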
Snoopy Cache Protocols
Use snoopy mechanism to keep all processors’ view of memory coherent

(Figure: processors M1, M2, M3, each behind its own snoopy cache, share a memory bus with physical memory and a DMA engine attached to disks.)
MESI: An Enhanced MSI Protocol — increased performance for private data

M: Modified Exclusive
E: Exclusive, unmodified
S: Shared
I: Invalid

Each cache line has state bits alongside its address tag.

(Figure: MESI state diagram for the cache state in processor P1 —
  I → S on a read miss, shared; I → E on a read miss, not shared;
  I/S → M on P1 write (intent to write); E → M on P1 write;
  M stays M on P1 read or write; E stays E on P1 read;
  M/E → S when another processor reads (P1 writes back);
  any state → I on a write miss or another processor’s intent to write.)
Basic Operation of Directory
• k processors
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit

(Figure: CPUs with caches connected through an interconnection network to memory and its directory of presence bits and dirty bit.)

• Read from main memory by processor i:
  – If dirty-bit OFF then { read from main memory; turn p[i] ON; }
  – If dirty-bit ON then { recall line from dirty processor (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }
• Write to main memory by processor i:
  – If dirty-bit OFF then { send invalidations to all caches that have the block; turn dirty-bit ON; supply data to i; turn p[i] ON; ... }
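The read and write rules translate almost line-for-line into code. A Python sketch of one directory entry (the dirty-ON write case, elided on the slide, is left out here too):

```python
# Per-block home-node state: k presence bits, one dirty bit, plus an
# "owner" field recording which cache holds the dirty copy.
class DirectoryEntry:
    def __init__(self, k):
        self.present = [False] * k
        self.dirty = False
        self.owner = None

    def read(self, i):
        if self.dirty:
            # recall line from the dirty processor: its cache state goes
            # to shared, memory is updated, dirty bit turns OFF
            self.present[self.owner] = True
            self.dirty, self.owner = False, None
        self.present[i] = True          # turn p[i] ON, supply data to i

    def write(self, i):
        if not self.dirty:
            # send invalidations to all caches that have the block,
            # then grant dirty ownership to processor i
            self.present = [False] * len(self.present)
            self.present[i] = True
            self.dirty, self.owner = True, i
        # (dirty-ON write case elided, as on the slide)

d = DirectoryEntry(4)
d.read(0); d.read(2)
assert d.present == [True, False, True, False] and not d.dirty
d.write(1)                              # invalidates P0 and P2
assert d.present == [False, True, False, False] and d.owner == 1
d.read(3)                               # recall from P1; both now share
assert d.present == [False, True, False, True] and not d.dirty
```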
Directory Cache Protocol (Handout 6)

• Assumptions: reliable network, FIFO message delivery between any given source-destination pair

(Figure: multiple CPU+cache nodes and multiple directory-controller+DRAM-bank nodes connected by an interconnection network.)
Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:

1. Uniprocessor cache miss traffic
2. Traffic caused by communication
   – results in invalidations and subsequent cache misses

• Adds a 4th C: coherence misses
   – joins Compulsory, Capacity, Conflict
   – (sometimes called Communication misses)
Intel “Nehalem” (2008)
• 2-8 cores
• SMT (2 threads/core)
• Private L2$/core
• Shared L3$
• Initially in 45nm
Related Courses
(Course map: CS 61C (basic computer organization, first look at pipelines + caches) is a strong prerequisite for CS 152 (computer architecture, first look at parallel architectures), which leads on to CS 252 (graduate computer architecture, advanced topics) and CS 258 (parallel architectures, languages, systems); related are CS 150 (digital logic design) and CS 194-6 (new FPGA-based architecture lab class).)
Advice: Get involved in research
E.g.,
• RADLab - data center
• ParLab - parallel clients
• Undergrad research experience is the most important part of application to top grad schools.
End of CS152
• Thanks for being such patient guinea pigs!
  – Hopefully your pain will help future generations of CS152 students
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252