CS 152 Computer Architecture and Engineering
Lecture 22: Final Lecture
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152
5/6/2008 2CS152-Spring’08
Today’s Lecture
• Review entire semester
  – What you learned
• Follow-on classes
• What’s next in computer architecture?
The New CS152 Executive Summary (what was promised in lecture 1)
The processor your predecessors built in CS152
What you’ll understand and experiment with in the new CS152
Plus, the technology behind chip-scale multiprocessors (CMPs)
From Babbage to IBM 650
IBM 360: Initial Implementations
                 Model 30         ...   Model 70
Storage          8K – 64 KB             256K – 512 KB
Datapath         8-bit                  64-bit
Circuit Delay    30 nsec/level          5 nsec/level
Local Store      Main Store             Transistor Registers
Control Store    Read only 1 µsec       Conventional circuits
IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models.
Milestone: The first true ISA designed as portable hardware-software interface!
With minor modifications it still survives today!
Microcoded Microarchitecture
(Figure: a µcoded controller (ROM) holds fixed microcode instructions and drives the Datapath; it receives the opcode and zero?/busy? status and asserts controls such as enMem and MemWrt; the Memory (RAM), accessed via Addr/Data, holds the user program written in macrocode instructions (e.g., MIPS, x86, etc.).)
Implementing Complex Instructions
(Figure: bus-based microcoded datapath — memory (addr/data), a register file of 32 GPRs plus PC (register 32) and Link (register 31), a 32-bit ALU with input registers A and B, the IR, and an immediate extender all share one bus; control signals include ldIR, ldA, ldB, ldMA, RegSel, RegWrt, enReg, enMem, MemWrt, OpSel, enALU, enImm, and ExtSel, with opcode and busy?/zero? status fed back to the controller.)

rd ← M[(rs)] op (rt)            Reg-Memory-src ALU op
M[(rd)] ← (rs) op (rt)          Reg-Memory-dst ALU op
M[(rd)] ← M[(rs)] op M[(rt)]    Mem-Mem ALU op
From CISC to RISC

• Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
  – Can change contents of fast instruction memory to fit what application needs right now
• Use simple ISA to enable hardwired pipelined implementation
  – Most compiled code only used a few of the available CISC instructions
  – Simpler encoding allowed pipelined implementations
• Further benefit with integration
  – In early ’80s, can fit 32-bit datapath + small caches on a single chip
  – No chip crossings in common case allows faster operation
Nanocoding
• MC68000 had 17-bit µcode containing either a 10-bit µjump or a 9-bit nanoinstruction pointer
  – Nanoinstructions were 68 bits wide, decoded to give 196 control signals

(Figure: User PC → Inst. Cache → Hardwired Decode selects the µcode PC (state); the µcode ROM supplies a next-state address and a nanoaddress into the nanoinstruction ROM, whose data drives the control signals.)

Exploits recurring control signal patterns in µcode, e.g., many µroutines begin the same way:
  ALU0:  A ← Reg[rs] ...
  ALUi0: A ← Reg[rs] ...
“Iron Law” of Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

– Instructions per program depends on source code, compiler technology, and ISA
– Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
– Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture          CPI   Cycle time
Microcoded                 >1    short
Single-cycle unpipelined   1     long
Pipelined                  1     short
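The Iron Law can be turned into a few lines of arithmetic. A minimal Python sketch, using made-up instruction counts, CPIs, and cycle times purely to illustrate the trade-off in the table above:

```python
# Iron Law: Time/Program = (Insts/Program) x (Cycles/Inst) x (Time/Cycle)
def exec_time_ns(instructions, cpi, cycle_time_ns):
    return instructions * cpi * cycle_time_ns

# Hypothetical numbers, chosen only to mirror the table:
insts = 1_000_000
microcoded   = exec_time_ns(insts, cpi=4.0, cycle_time_ns=1.0)  # CPI > 1, short cycle
single_cycle = exec_time_ns(insts, cpi=1.0, cycle_time_ns=5.0)  # CPI = 1, long cycle
pipelined    = exec_time_ns(insts, cpi=1.0, cycle_time_ns=1.0)  # CPI ~ 1, short cycle

assert pipelined < microcoded and pipelined < single_cycle
```

Pipelining wins on both factors at once: it keeps the microcoded design's short cycle while restoring the single-cycle design's CPI of 1.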
5-Stage Pipelined Execution
time          t0   t1   t2   t3   t4   t5   t6   t7  . . . .
instruction1  IF1  ID1  EX1  MA1  WB1
instruction2       IF2  ID2  EX2  MA2  WB2
instruction3            IF3  ID3  EX3  MA3  WB3
instruction4                 IF4  ID4  EX4  MA4  WB4
instruction5                      IF5  ID5  EX5  MA5  WB5

Stages: I-Fetch (IF); Decode, Reg. Fetch (ID); Execute (EX); Memory (MA); Write-Back (WB)

(Figure: 5-stage datapath — PC and IR; an adder increments PC by 0x4; Inst. Memory (addr → rdata); GPRs with read ports rs1/rs2 → rd1/rd2 and write port ws/wd gated by we; immediate extender ImmExt; ALU; Data Memory with addr, wdata, rdata, and we.)
Pipeline Hazards
• Pipelining instructions is complicated by HAZARDS:
  – Structural hazards (two instructions want same hardware resource)
  – Data hazards (earlier instruction produces value needed by later instruction)
  – Control hazards (instruction changes control flow, e.g., branches or exceptions)
• Techniques to handle hazards:
  – Interlock (hold newer instruction until older instructions drain out of pipeline)
  – Bypass (transfer value from older instruction to newer instruction as soon as available somewhere in machine)
  – Speculate (guess effect of earlier instruction)
• Speculation needs predictor, prediction check, and recovery mechanism
Exception Handling 5-Stage Pipeline
(Figure: pipeline PC → Inst. Mem → Decode (D) → Execute (E) → Data Mem (M) → Writeback (W). Exception sources: PC address exception at fetch, illegal opcode at decode, overflow at execute, data address exceptions at memory, and asynchronous interrupts. Exception flags (ExcD, ExcE, ExcM) and PCs (PCD, PCE, PCM) travel down the pipeline to the commit point, where Cause and EPC are recorded, the handler PC is selected, and the F, D, and E stages and the writeback are killed.)
Processor-DRAM Gap (latency)
(Chart: performance vs. year, 1980–2000, log scale 1–1000. CPU performance (“Moore’s Law” curve) improves ~60%/year while DRAM improves only ~7%/year, so the processor-memory performance gap grows ~50%/year.)

Four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during time for one memory access!
Common Predictable Patterns
Two predictable properties of memory references:
– Temporal Locality: If a location is referenced it is likely to be referenced again in the near future.
– Spatial Locality: If a location is referenced it is likely that locations near it will be referenced in the near future.
Memory Reference Patterns
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
(Figure: memory address vs. time, one dot per access — repeated accesses to the same addresses form horizontal runs (temporal locality), while sweeps through neighboring addresses form diagonal bands (spatial locality).)
Causes for Cache Misses
• Compulsory: first reference to a block, a.k.a. cold-start misses
  – misses that would occur even with infinite cache
• Capacity: cache is too small to hold all data needed by the program
  – misses that would occur even under perfect replacement policy
• Conflict: misses that occur because of collisions due to block-placement strategy
  – misses that would not occur with full associativity
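The three C’s can be illustrated with a toy cache model. A hedged sketch (trace, sizes, and block size are all made up): it simulates a direct-mapped cache and separates compulsory misses from the rest; fully classifying capacity vs. conflict misses would additionally require replaying the trace against a fully associative LRU cache of the same size.

```python
# Direct-mapped cache model: count hits, compulsory misses, and other
# (conflict/capacity) misses for a trace of byte addresses.
def simulate(trace, num_sets, block_bytes):
    cache = {}              # set index -> resident tag
    seen = set()            # blocks ever touched (for compulsory misses)
    hits = compulsory = other = 0
    for addr in trace:
        block = addr // block_bytes
        idx, tag = block % num_sets, block // num_sets
        if cache.get(idx) == tag:
            hits += 1
        elif block not in seen:
            compulsory += 1          # first-ever reference to this block
        else:
            other += 1               # block was evicted earlier
        cache[idx] = tag
        seen.add(block)
    return hits, compulsory, other

# Blocks 0 and 4 collide in set 0 of a 4-set cache with 16 B blocks:
trace = [0, 64, 0, 64, 0, 64]
assert simulate(trace, num_sets=4, block_bytes=16) == (0, 2, 4)  # ping-pong
assert simulate(trace, num_sets=8, block_bytes=16) == (4, 2, 0)  # no collision
```

Doubling the number of sets removes every non-compulsory miss here, which is exactly what marks them as conflict rather than capacity misses.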
A Typical Memory Hierarchy c.2006
(Figure: CPU with multiported register file (part of CPU) → split L1 instruction and data caches (on-chip SRAM) → large unified L2 cache (on-chip SRAM) → multiple interleaved memory banks (DRAM).)
Modern Virtual Memory Systems: Illusion of a large, private, uniform store

Protection & Privacy: several users, each with their private address space and one or more shared address spaces (page table ≡ name space)

Demand Paging: provides the ability to run programs larger than the primary memory; hides differences in machine configurations

The price is address translation on each memory reference.

(Figure: OS and user_i addresses are translated — VA → mapping (via TLB) → PA — into primary memory, backed by a swapping store.)
Hierarchical Page Table
(Figure: a processor register holds the root of the current page table, which points to the Level 1 page table; each level-1 PTE points to a Level 2 page table, whose PTEs point to data pages. A page may be in primary memory or in secondary memory, and a PTE may mark a nonexistent page.)

Virtual address fields: p1 = 10-bit L1 index (bits 31–22) | p2 = 10-bit L2 index (bits 21–12) | offset (bits 11–0)
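The two-level walk can be sketched in a few lines. A hypothetical Python model, with dictionaries standing in for the page tables and `None` standing in for a page-fault trap:

```python
# Dictionaries stand in for the level-1 and level-2 page tables; a
# missing entry models a nonexistent or swapped-out page (page fault).
def translate(root, vaddr):
    p1 = (vaddr >> 22) & 0x3FF     # 10-bit L1 index (bits 31-22)
    p2 = (vaddr >> 12) & 0x3FF     # 10-bit L2 index (bits 21-12)
    offset = vaddr & 0xFFF         # 12-bit page offset
    l2 = root.get(p1)              # level-1 PTE -> level-2 table
    if l2 is None:
        return None                # page fault: no L2 table
    ppn = l2.get(p2)               # level-2 PTE -> physical page number
    if ppn is None:
        return None                # page fault: page not resident
    return (ppn << 12) | offset

# Hypothetical mapping: virtual page (p1=1, p2=2) -> physical page 7
root = {1: {2: 7}}
vaddr = (1 << 22) | (2 << 12) | 0x34
assert translate(root, vaddr) == 0x7034
assert translate(root, 0) is None      # unmapped address faults
```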
Address Translation & Protection
• Every instruction and data access needs address translation and protection checks

A good VM design needs to be fast (~ one cycle) and space-efficient → Translation Lookaside Buffer (TLB)

(Figure: Virtual Address = Virtual Page No. (VPN) + offset → Address Translation → Physical Address = Physical Page No. (PPN) + offset; a Protection Check against Kernel/User mode and Read/Write raises an exception on violation.)
Address Translation in CPU Pipeline
• Software handlers need restartable exception on page fault or protection violation
• Handling a TLB miss needs a hardware or software mechanism to refill TLB
• Need mechanisms to cope with the additional latency of a TLB:
– slow down the clock
– pipeline the TLB and cache access
– virtual address caches
– parallel TLB/cache access
(Figure: pipeline with an Inst TLB in front of the Inst. Cache and a Data TLB in front of the Data Cache; either TLB access can raise a TLB miss, page fault, or protection violation.)
Concurrent Access to TLB & Cache
The index bits L come from the untranslated part of the address, so the index is available without consulting the TLB: cache and TLB accesses can begin simultaneously, and the tag comparison is made after both accesses are completed.

Cases: L + b = k, L + b < k, L + b > k
(L = index bits, b = block-offset bits, k = page-offset bits)

(Figure: VA = VPN | L | b; the virtual index selects one of 2^L direct-mapped cache blocks of 2^b bytes while the TLB translates VPN → PPN; hit? = (physical tag = PPN).)
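The case analysis can be checked numerically. A small sketch (cache and page parameters are illustrative) that tests whether the index and block-offset bits fit within the page offset, the condition for cache and TLB access to proceed safely in parallel:

```python
import math

# Parallel TLB/cache access is safe when index bits (L) + block-offset
# bits (b) come entirely from the untranslated page offset (k bits).
def can_access_in_parallel(cache_bytes, block_bytes, ways, page_bytes):
    sets = cache_bytes // (block_bytes * ways)
    L = int(math.log2(sets))         # index bits
    b = int(math.log2(block_bytes))  # block-offset bits
    k = int(math.log2(page_bytes))   # page-offset bits
    return L + b <= k

# 16 KB direct-mapped cache, 32 B blocks, 4 KB pages: L+b = 9+5 > 12
assert can_access_in_parallel(16 * 1024, 32, 1, 4096) is False
# Same capacity, 4-way set-associative: L+b = 7+5 = 12 <= 12
assert can_access_in_parallel(16 * 1024, 32, 4, 4096) is True
```

This is one reason L1 caches often raise associativity rather than capacity: extra ways shrink the index, keeping it inside the page offset.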
CS152 Administrivia
• Lab 4 competition winners!
• Quiz 6 on Thursday, May 8– L19-21, PS 6, Lab 6
• Last 15 minutes, course survey
  – HKN survey
– Informal feedback survey for those who’ve not done it already
• Quiz 5 results
Complex Pipeline Structure
(Figure: IF → ID → Issue, reading GPRs and FPRs, feeding parallel units — ALU, Mem, Fadd, Fmul, Fdiv — which write back at WB.)
Superscalar In-Order Pipeline
• Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and other is floating-point
• Inexpensive way of increasing throughput, examples include Alpha 21064 (1992) & MIPS R5000 series (1996)
• Same idea can be extended to wider issue by duplicating functional units (e.g. 4-issue UltraSPARC) but register file ports and bypassing costs grow quickly
(Figure: dual-issue pipeline — 2 PCs fetch from Inst. Mem into Dual Decode; the integer/memory pipe runs X1 → X2 → Data Mem → W against the GPRs; the floating-point pipe reads FPRs into X1 → Fadd X2 → X3 → W or X1 → Fmul X2 → X3 → W, plus an unpipelined FDiv; the commit point is at W.)
Types of Data Hazards
Consider executing a sequence of instructions of the form rk ← (ri) op (rj).

Data dependence:     r3 ← (r1) op (r2)     Read-after-Write
                     r5 ← (r3) op (r4)     (RAW) hazard

Anti-dependence:     r3 ← (r1) op (r2)     Write-after-Read
                     r1 ← (r4) op (r5)     (WAR) hazard

Output dependence:   r3 ← (r1) op (r2)     Write-after-Write
                     r3 ← (r6) op (r7)     (WAW) hazard
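The three hazard types follow mechanically from the destination and source registers of the two instructions. A minimal Python classifier reproducing the three examples above:

```python
# Classify data hazards between an older and a newer instruction, each
# given as (destination_register, set_of_source_registers).
def hazards(older, newer):
    old_dst, old_srcs = older
    new_dst, new_srcs = newer
    found = []
    if old_dst in new_srcs:
        found.append("RAW")   # newer reads what older writes
    if new_dst in old_srcs:
        found.append("WAR")   # newer writes what older reads
    if new_dst == old_dst:
        found.append("WAW")   # both write the same register
    return found

# r3 <- r1 op r2 ; r5 <- r3 op r4  => RAW on r3
assert hazards(("r3", {"r1", "r2"}), ("r5", {"r3", "r4"})) == ["RAW"]
# r3 <- r1 op r2 ; r1 <- r4 op r5  => WAR on r1
assert hazards(("r3", {"r1", "r2"}), ("r1", {"r4", "r5"})) == ["WAR"]
# r3 <- r1 op r2 ; r3 <- r6 op r7  => WAW on r3
assert hazards(("r3", {"r1", "r2"}), ("r3", {"r6", "r7"})) == ["WAW"]
```

Only RAW is a true dependence; WAR and WAW are name dependences that register renaming removes.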
Phases of Instruction Execution

(Figure: PC → I-cache → Fetch Buffer → Issue Buffer → Func. Units → Result Buffer → Arch. State)

Fetch: instruction bits retrieved from cache.
Decode: instructions placed in appropriate issue (aka “dispatch”) stage buffer.
Execute: instructions and operands sent to execution units. When execution completes, all results and exception flags are available.
Commit: instruction irrevocably updates architectural state (aka “graduation” or “completion”).
Pipeline Design with Physical Regfile
(Figure: in-order front end — PC with branch prediction, Fetch, Decode & Rename, Reorder Buffer; out-of-order execution — Branch Unit, ALU, and MEM (with store buffer and D$) reading and writing a Physical Reg. File; in-order Commit. Branch resolution updates the predictors and kills wrong-path instructions throughout the machine.)
Reorder Buffer Holds Active Instruction Window

Window contents (older instructions first):
  ld  r1, (r3)
  add r3, r1, r2
  sub r6, r7, r9
  add r3, r3, r6
  ld  r6, (r1)
  add r6, r6, r3
  st  r6, (r1)
  ld  r6, (r1)

(Figure: the same window shown at cycle t and cycle t+1 — between cycles, commit retires the oldest completed instruction, instructions in the middle execute, and fetch appends newer instructions at the tail.)
Branch History Table
4K-entry BHT, 2 bits/entry, ~80–90% correct predictions

(Figure: the fetch PC indexes the I-Cache and, with k low-order bits, a 2^k-entry BHT with 2 bits per entry; the fetched instruction’s opcode and offset identify a branch and its target PC, while the BHT entry supplies the Taken/¬Taken prediction.)
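Each BHT entry is a 2-bit saturating counter. A small Python sketch (table size and the loop-branch trace are illustrative):

```python
# 2-bit saturating counters: 0-1 predict not-taken, 2-3 predict taken;
# increment on a taken branch, decrement on a not-taken branch.
class TwoBitPredictor:
    def __init__(self, entries):
        self.table = [1] * entries       # start weakly not-taken

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.table)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch taken 9 times, then falling through at loop exit:
p = TwoBitPredictor(4096)
correct = 0
for taken in [True] * 9 + [False]:
    correct += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(correct)   # mispredicts only at warm-up and at loop exit: 8 of 10
```

The 2-bit hysteresis is what keeps a single loop-exit mispredict from flipping the prediction for the next execution of the loop.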
Two-Level Branch Predictor

Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)

(Figure: k bits of the fetch PC index the BHT; a 2-bit global branch history shift register, shifting in the Taken/¬Taken result of each branch, selects which of the four sets of 2-bit entries provides the Taken/¬Taken prediction.)
Branch Target Buffer (BTB)
• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded

(Figure: a 2^k-entry direct-mapped BTB (can also be associative); k bits of the fetch PC index an entry holding a valid bit, the entry PC, and the predicted target PC; on a match the predicted target becomes the next I-Cache fetch address.)
Combining BTB and BHT

• BTB entries are considerably more expensive than BHT entries, but can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR)
• BHT can hold many more entries and is more accurate

Pipeline stages:
  A: PC Generation/Mux
  P: Instruction Fetch Stage 1
  F: Instruction Fetch Stage 2
  B: Branch Address Calc/Begin Decode
  I: Complete Decode
  J: Steer Instructions to Functional Units
  R: Register File Read
  E: Integer Execute

(Figure: the BTB acts during the A stage; the BHT, in a later pipeline stage, corrects when the BTB misses a predicted-taken branch. BTB/BHT are only updated after the branch resolves in the E stage.)
Sequential ISA Bottleneck

(Figure: sequential source code — e.g. a = foo(b); for (i=0, i< … — is fed to a superscalar compiler, which finds independent operations and schedules them, but must emit sequential machine code; a superscalar processor must then re-check instruction dependencies and re-schedule execution at run time.)
VLIW: Very Long Instruction Word
• Multiple operations packed into one instruction
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• Architecture requires guarantee of:
  – Parallelism within an instruction ⇒ no cross-operation RAW check
  – No data use before data ready ⇒ no data interlocks

Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
(two integer units, single-cycle latency; two load/store units, three-cycle latency; two floating-point units, four-cycle latency)
Scheduling Loop Unrolled Code
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

Schedule (unroll 4 ways):

(Figure: VLIW schedule across the Int1, Int2, M1, M2, FP+, FPx slots — the loads pair up on the two memory units, the four fadds follow on the FP adder as their operands arrive, then the stores pair up on the memory units, with add r1, add r2, and bne filling integer slots.)
Software Pipelining
loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      add  r2, 32
      sd   f8, -8(r2)
      bne  r1, r3, loop

(Figure: software-pipelined schedule, unrolled 4 ways — a prolog issues the first iterations’ loads and fadds; in the steady-state loop body the machine overlaps loads from a later iteration, fadds from the previous one, and stores from the one before that; an epilog drains the remaining fadds and stores.)
Vector Programming Model
(Figure:
  Scalar registers r0–r15; vector registers v0–v15, each with elements [0], [1], [2], …, [VLRMAX-1]; the Vector Length Register (VLR) gives the number of active elements.

  Vector arithmetic instructions operate element-wise, e.g. ADDV v3, v1, v2 adds v1 and v2 into v3 for elements [0] … [VLR-1].

  Vector load and store instructions move a register to or from memory, e.g. LV v1, r1, r2 loads v1 starting at base address r1 with stride r2.)
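A machine with VLRMAX elements per register handles longer vectors by strip-mining: set VLR to the remaining length, capped at VLRMAX, on each trip around the loop. A Python sketch with a made-up VLRMAX:

```python
VLRMAX = 64          # hypothetical maximum vector length

def vector_add(a, b):
    """C = A + B for arbitrary-length vectors via strip-mining."""
    n, out, i = len(a), [0] * len(a), 0
    while i < n:
        vlr = min(VLRMAX, n - i)      # set the vector length register
        # one ADDV then performs vlr element-wise additions
        out[i:i + vlr] = [x + y for x, y in zip(a[i:i + vlr], b[i:i + vlr])]
        i += vlr                      # advance to the next strip
    return out

# 150 elements = two full 64-element strips plus a 22-element remainder
assert vector_add(list(range(150)), [1] * 150) == list(range(1, 151))
```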
Vector Unit Structure
(Figure: the vector registers and functional units are partitioned into four lanes sharing the memory subsystem — lane 0 holds elements 0, 4, 8, …; lane 1 holds elements 1, 5, 9, …; lane 2 holds elements 2, 6, 10, …; lane 3 holds elements 3, 7, 11, ….)
Vector Instruction Parallelism

Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes

(Figure: issuing one vector instruction per cycle, the load unit, multiply unit, and add unit each work on a different vector instruction at once, 8 lanes apiece.)

Complete 24 operations/cycle while issuing 1 short instruction/cycle
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on same pipeline
Interleave 4 threads, T1–T4, on a non-bypassed 5-stage pipe:

t0: T1: LW   r1, 0(r2)
t1: T2: ADD  r7, r1, r4
t2: T3: XORI r5, r4, #12
t3: T4: SW   0(r7), r5
t4: T1: LW   r5, 12(r1)

(Each instruction proceeds F, D, X, M, W in successive cycles t0–t9.)

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file.
Multithreaded Categories

(Figure: issue slots over time (in processor cycles) for five organizations — Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading — with slots colored by Thread 1–5 or left as idle slots.)
Power 4 vs. SMT in Power 5

(Figure: Power 4 and Power 5 pipelines; Power 5 adds SMT — 2 fetches (PC) and 2 initial decodes at the front end, 2 commits (architected register sets) at the back end.)
A Producer-Consumer Example
The program is written assuming instructions are executed in order.

Producer posting item x:
  Load  Rtail, (tail)
  Store (Rtail), x
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
  Load  Rhead, (head)
spin:
  Load  Rtail, (tail)
  if Rhead == Rtail goto spin
  Load  R, (Rhead)
  Rhead = Rhead + 1
  Store (head), Rhead
  process(R)

(Figure: producer and consumer communicate through a shared memory buffer indexed by tail and head pointers.)
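The producer and consumer can be mimicked directly in Python; the list and the two integers below are stand-ins for the shared buffer and the tail/head words in memory. The comment marks the store pair whose ordering the memory-model discussion that follows worries about:

```python
# Shared state: a list stands in for the buffer; two integers stand in
# for the tail and head words in memory.
buffer, tail, head = [], 0, 0

def produce(x):
    global tail
    buffer.append(x)       # Store (Rtail), x
    tail += 1              # Store (tail), Rtail
    # On a weakly ordered machine these two stores could be reordered,
    # letting the consumer see the new tail before the data is written.

def consume():
    global head
    while head == tail:    # spin: if Rhead == Rtail goto spin
        pass
    x = buffer[head]       # Load R, (Rhead)
    head += 1              # Store (head), Rhead
    return x               # process(R)

produce(42)
assert consume() == 42
```

Run in a single Python thread this always works; the point of the next slides is that on a real multiprocessor its correctness depends on the memory ordering model.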
Sequential Consistency: A Memory Model

“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program.”
— Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

(Figure: six processors P sharing a single memory M.)
Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are these in our example?

T1:                        T2:
Store (X), 1    (X = 1)    Load  R1, (Y)
Store (Y), 11   (Y = 11)   Store (Y’), R1   (Y’ = Y)
                           Load  R2, (X)
                           Store (X’), R2   (X’ = X)

(The program-order constraints within each thread are the additional SC requirements.)
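One way to see what sequential consistency allows is to enumerate every order-preserving interleaving of the two threads and collect the resulting (Y’, X’) pairs. A brute-force Python sketch, with Y2/X2 standing in for Y’/X’:

```python
from itertools import combinations

# T1: Store (X),1 ; Store (Y),11
# T2: Load R1,(Y) ; Store (Y2),R1 ; Load R2,(X) ; Store (X2),R2
T1 = [("st", "X", 1), ("st", "Y", 11)]
T2 = [("ld", "R1", "Y"), ("str", "Y2", "R1"),
      ("ld", "R2", "X"), ("str", "X2", "R2")]

def run(order):
    """Execute one interleaving; return the final (Y2, X2) values."""
    mem = {"X": 0, "Y": 0, "X2": 0, "Y2": 0}
    regs = {}
    idx = {1: 0, 2: 0}
    for who in order:
        kind, a, b = (T1 if who == 1 else T2)[idx[who]]
        idx[who] += 1
        if kind == "st":
            mem[a] = b            # store an immediate to memory
        elif kind == "ld":
            regs[a] = mem[b]      # load memory into a register
        else:
            mem[a] = regs[b]      # store a register to memory
    return (mem["Y2"], mem["X2"])

outcomes = set()
for slots in combinations(range(6), 2):   # where T1's 2 ops fall among 6
    outcomes.add(run([1 if i in slots else 2 for i in range(6)]))

assert (11, 0) not in outcomes     # SC forbids Y' = 11 with X' = 0
print(sorted(outcomes))
```

Because T2 reads Y before X and T1 writes X before Y, any SC interleaving that sees the new Y must also see the new X; a weaker model that reorders either pair could expose the forbidden (11, 0) result.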
Mutual Exclusion and Locks
Want to guarantee only one process is active in a critical section

• Blocking atomic read-modify-write instructions
  e.g., Test&Set, Fetch&Add, Swap
vs
• Non-blocking atomic read-modify-write instructions
  e.g., Compare&Swap, Load-reserve/Store-conditional
vs
• Protocols based on ordinary Loads and Stores
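A blocking Test&Set lock can be sketched in Python, with a `threading.Lock` standing in only for the atomicity of the hardware read-modify-write; everything else is ordinary loads and stores:

```python
import threading

# Spinlock built from a simulated atomic Test&Set.
class SpinLock:
    def __init__(self):
        self.flag = 0
        self._rmw = threading.Lock()   # models hardware RMW atomicity only

    def test_and_set(self):
        with self._rmw:                # atomically: old = flag; flag = 1
            old, self.flag = self.flag, 1
            return old

    def acquire(self):
        while self.test_and_set() == 1:
            pass                       # spin until we observed flag == 0

    def release(self):
        self.flag = 0                  # ordinary store releases the lock

lock = SpinLock()
counter = 0

def worker():
    global counter
    for _ in range(250):
        lock.acquire()
        counter += 1                   # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 1000                 # no increments lost
```

On real hardware the release store would also need the right memory-ordering fence; the sketch elides that, as Python's interpreter provides it for free.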
Snoopy Cache Protocols
Use snoopy mechanism to keep all processors’ view of memory coherent

(Figure: processors M1, M2, M3, each behind its own snoopy cache, share a memory bus with physical memory and a DMA engine attached to disks.)
MESI: An Enhanced MSI Protocol — increased performance for private data

M: Modified Exclusive
E: Exclusive, unmodified
S: Shared
I: Invalid

Each cache line has state bits alongside its address tag.

(Figure: MESI state diagram for the cache state in processor P1 —
  I → S on a read miss, shared; I → E on a read miss, not shared;
  I/S → M on P1 write (intent to write); E → M on P1 write;
  M stays M on P1 read or write; E stays E on P1 read;
  M/E → S when another processor reads (P1 writes back);
  any state → I on a write miss or another processor’s intent to write.)
Basic Operation of Directory
• k processors
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit

(Figure: CPUs with caches connected through an interconnection network to memory and its directory of presence bits and dirty bit.)

• Read from main memory by processor i:
  – If dirty-bit OFF then { read from main memory; turn p[i] ON; }
  – If dirty-bit ON then { recall line from dirty processor (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }
• Write to main memory by processor i:
  – If dirty-bit OFF then { send invalidations to all caches that have the block; turn dirty-bit ON; supply data to i; turn p[i] ON; ... }
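The read and write rules translate almost line-for-line into code. A Python sketch of one directory entry (the dirty-ON write case, elided on the slide, is left out here too):

```python
# Per-block home-node state: k presence bits, one dirty bit, plus an
# "owner" field recording which cache holds the dirty copy.
class DirectoryEntry:
    def __init__(self, k):
        self.present = [False] * k
        self.dirty = False
        self.owner = None

    def read(self, i):
        if self.dirty:
            # recall line from the dirty processor: its cache state goes
            # to shared, memory is updated, dirty bit turns OFF
            self.present[self.owner] = True
            self.dirty, self.owner = False, None
        self.present[i] = True          # turn p[i] ON, supply data to i

    def write(self, i):
        if not self.dirty:
            # send invalidations to all caches that have the block,
            # then grant dirty ownership to processor i
            self.present = [False] * len(self.present)
            self.present[i] = True
            self.dirty, self.owner = True, i
        # (dirty-ON write case elided, as on the slide)

d = DirectoryEntry(4)
d.read(0); d.read(2)
assert d.present == [True, False, True, False] and not d.dirty
d.write(1)                              # invalidates P0 and P2
assert d.present == [False, True, False, False] and d.owner == 1
d.read(3)                               # recall from P1; both now share
assert d.present == [False, True, False, True] and not d.dirty
```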
Directory Cache Protocol (Handout 6)

• Assumptions: reliable network, FIFO message delivery between any given source-destination pair

(Figure: multiple CPU+cache nodes and multiple directory-controller+DRAM-bank nodes connected by an interconnection network.)
Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:

1. Uniprocessor cache miss traffic
2. Traffic caused by communication
   – results in invalidations and subsequent cache misses

• Adds a 4th C: coherence misses
   – joins Compulsory, Capacity, Conflict
   – (sometimes called Communication misses)
Intel “Nehalem” (2008)
• 2-8 cores
• SMT (2 threads/core)
• Private L2$/core
• Shared L3$
• Initially in 45nm
Related Courses
(Course map: CS 61C (basic computer organization, first look at pipelines + caches) is a strong prerequisite for CS 152 (computer architecture, first look at parallel architectures), which leads on to CS 252 (graduate computer architecture, advanced topics) and CS 258 (parallel architectures, languages, systems); related are CS 150 (digital logic design) and CS 194-6 (new FPGA-based architecture lab class).)
Advice: Get involved in research
E.g.,
• RADLab - data center
• ParLab - parallel clients
• Undergrad research experience is the most important part of application to top grad schools.
End of CS152
• Thanks for being such patient guinea pigs!
  – Hopefully your pain will help future generations of CS152 students
Acknowledgements
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252