Module 2
Review of Processor Architectures
Theory: 3 hours
채수익
Seoul National University
Learning Objectives
• Understand the basic concepts of an SoC
• Understand SoC architecture and its components
• Introduction to the SoC design process
• Understand the basic concepts of an SoC platform
Module Organization
Elective / Required
SoC Architecture
Embedded Processor I, II
Embedded Memory
Typical Logic Blocks
External Interface
On-Chip Bus Architecture
Bus Interface Design
Lab
Processor Design Lab I, II
Simple SoC Design Lab I, II
Reconfigurable Processor Architecture
Low Power SoC Design I, II
Network-on-Chip
Introduction
Introduction to SoC Architecture
Review of Processor Architecture
SoC Design Examples
Guidelines for Using the Lecture Materials
• 3-hour lecture
• All of the content is covered.
Contents
1. Overview of Processor Architecture
2. Pipelining
3. Memory Systems
4. Advanced Topics
5. Summary
6. Exercises
7. References
Overview of Processor Architecture
• A computer is composed of
– Processor (datapath, control)
– Input (keyboard, mouse, joystick)
– Output (LCD, CRT, printer)
– Memory (CD, HDD, DRAM, SRAM)
– Network
• Processor : the heart of the computer
– A VLSI (Very Large Scale Integration) system
– State-of-the-art processors integrate a few billion transistors
• Brief History
– Intel 4004 (in 1971, 2,250 transistors)
– Intel 8086 (in 1978, 29,000 transistors)
– Intel 80486 (in 1989, 1,200,000 transistors)
– Pentium (in 1993, 3,100,000 transistors)
– Core 2 (in 2006, 291,000,000 transistors)
– Dual-core Itanium 2 (in 2006, 1,720,000,000 transistors)
History of Processor Architecture
[Figure: processor generations - 4004, 8086, 80486, Pentium]
Instruction Set Architectures
• What is an Instruction Set Architecture?
– Like architecture in building construction : a common framework for making things
– An important factor in performance and cost
• The architecture must satisfy user requirements
• At the same time, cost should be minimized from an engineering perspective
• Architecture evolution
– History
• Single-cycle processor
• Multi-cycle processor
• Pipelined processor
• Multi-issue processor
• Configurable processor
– Toward higher integration
• Multi-core
• More caches
– Low power : keep both clock frequency and voltage low
Hardware and Software Interface
• The Instruction Set Architecture is
– the interface between the processor and the software
• The compiler and assembler on the software side, and the control and datapath in the processor, must all conform to the instruction set architecture
– Instruction set architecture design, together with performance evaluation, has been researched for several decades
Classification of Instruction Set Architectures
• Classification (ref [2])
– Stack
• No registers, Short instruction length
• Restricted instruction order
• Used in real chips in the 1960s to 1970s
• Java Virtual Machine
– Accumulator based
• One accumulator register
• Rather short instructions
• e.g. UNIVAC I, EDSAC, Intel 8080
– Register-memory
• A set of registers
• The other operand comes from memory
• Most CISC
– Register-register
• All operands come from the register file
• Load-store architecture
• RISC
Basic Instruction Format
• Basic instruction format
– opcode
– operand1, operand2, ...
• Instruction types
– Arithmetic, logical, compare
– Branch, jump
– Load, store
• Addressing modes
– How does the processor obtain operands from the instruction word?
– Register
• Add R1, R2      // Reg[R1] = Reg[R1] + Reg[R2]
– Immediate
• Add R1, #2      // Reg[R1] = Reg[R1] + 2
– Displacement (register indirect)
• Add R1, 4(R2)   // Reg[R1] = Reg[R1] + Mem[Reg[R2] + 4]
– Auto-increment/decrement (pre, post)
• Add R1, (R2)+   // Post increment
//   Reg[R1] = Reg[R1] + Mem[Reg[R2]]
//   Reg[R2] = Reg[R2] + s   // s is the operand size
• Add R1, -(R2)   // Pre decrement
//   Reg[R2] = Reg[R2] - s   // s is the operand size
//   Reg[R1] = Reg[R1] + Mem[Reg[R2]]
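To make these semantics concrete, here is a small C sketch that emulates each addressing mode for an ADD; the register/memory arrays and the 4-byte operand size are assumptions for illustration, not part of any specific ISA.

#include <stdint.h>

/* Toy machine state, for illustration only (sizes are assumptions). */
uint32_t Reg[32];
uint8_t  Mem[1024];

/* Little-endian 32-bit load from the toy memory. */
static uint32_t load32(uint32_t a) {
    return (uint32_t)Mem[a] | (uint32_t)Mem[a+1] << 8 |
           (uint32_t)Mem[a+2] << 16 | (uint32_t)Mem[a+3] << 24;
}

void add_register(int r1, int r2)       { Reg[r1] += Reg[r2]; }   /* Add R1, R2    */
void add_immediate(int r1, uint32_t im) { Reg[r1] += im; }        /* Add R1, #im   */
void add_displacement(int r1, int r2, uint32_t d) {               /* Add R1, d(R2) */
    Reg[r1] += load32(Reg[r2] + d);
}
void add_postinc(int r1, int r2) {                                /* Add R1, (R2)+ */
    Reg[r1] += load32(Reg[r2]);
    Reg[r2] += 4;                         /* s = 4 bytes assumed */
}
void add_predec(int r1, int r2) {                                 /* Add R1, -(R2) */
    Reg[r2] -= 4;                         /* s = 4 bytes assumed */
    Reg[r1] += load32(Reg[r2]);
}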
MIPS Instruction Set Architecture
• Example : MIPS architecture
– Register-register architecture (load-store architecture)
– Three addressing modes
• Displacement, immediate, register indirect
– 32 64-bit GPRs, 32 32-/64-bit floating-point registers
– 8-, 16-, 32-, and 64-bit integer data types
– 32-bit single-precision and 64-bit double-precision floating-point data
– Three basic instruction formats
• R-type : three register operands (e.g. ADD R0,R1,R2)
• I-type : two register with immediate operands (e.g. ADD R0,R1,#4)
• J-type : long immediate for jump instruction
R-type : opcode (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
I-type : opcode (31:26) | rs (25:21) | rt (20:16) | immediate (15:0)
J-type : opcode (31:26) | address (25:0)
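A minimal sketch of how the field boundaries above can be extracted from a 32-bit instruction word in C; the struct and function names are illustrative.

#include <stdint.h>

typedef struct {
    uint32_t opcode, rs, rt, rd, shamt, funct;  /* R-type fields   */
    uint32_t immediate;                         /* I-type, bits 15:0 */
    uint32_t address;                           /* J-type, bits 25:0 */
} mips_fields_t;

/* Slice the fixed bit positions shown above out of one instruction word. */
mips_fields_t decode_fields(uint32_t inst) {
    mips_fields_t f;
    f.opcode    = (inst >> 26) & 0x3F;      /* bits 31:26 */
    f.rs        = (inst >> 21) & 0x1F;      /* bits 25:21 */
    f.rt        = (inst >> 16) & 0x1F;      /* bits 20:16 */
    f.rd        = (inst >> 11) & 0x1F;      /* bits 15:11 */
    f.shamt     = (inst >>  6) & 0x1F;      /* bits 10:6  */
    f.funct     =  inst        & 0x3F;      /* bits 5:0   */
    f.immediate =  inst        & 0xFFFF;    /* bits 15:0  */
    f.address   =  inst        & 0x3FFFFFF; /* bits 25:0  */
    return f;
}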
Datapath Design
• The datapath supports the instruction functionality
• The datapath consists of
– Instruction-related parts
• Program counter (PC)
• Instruction memory (IMEM)
• Instruction register (IR)
– Data-processing-related parts
• Register file (RF)
• Arithmetic and logic unit (ALU)
• Data memory (DMEM)
• Instruction processing flow
1. The processor reads the instruction indicated by the current PC
2. After accessing IMEM, the instruction is fetched into the IR
3. The instruction is decoded and operands are read from the RF
4. The ALU performs the operation on the operands
5. The result is stored to DMEM or the RF
Functional Datapath Diagram
• Functional datapath diagram
– Two separate memory elements
• Instruction (read only)
• Data (read/write)
– Program counter and instruction register
– Register file
– ALU
[Figure: functional datapath - Program Counter (PC), Instruction Memory (IMEM), Instruction Register (IR), Register File (RF), Data Memory (DMEM), ALU]
Architectural Datapath Diagram
• Architectural datapath diagram
– The memories have an address bus and input/output data buses
– Support for branch and jump instructions
– Two read, one write port (2R1W) register file
– Write mux for register file from ALU and DMEM
– ALU gets operand from instruction word – immediate operand
• Sign extension unit
[Figure: architectural datapath - Program Counter (PC) and Instruction Register (IR) added around the functional datapath]
ALU Operations
• ALU performs
– Arithmetic operation
• Addition and subtraction
• Multiplication and division
• Integer operation or floating-point operation
– A separate FP unit is used because of its long execution time
– Shift operation
• Shift left/right logical
• Shift right arithmetic (sign extension)
• Rotate left/right (truncated bits concatenated to other side)
– Logical operation
• NOT, AND, OR, NAND, NOR, XOR, ...
– Comparison
• ==, !=, <, <=, >, >=
• Integer or floating-point
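As a behavioural illustration, a 32-bit integer ALU covering the operation classes above might look like the following C sketch; the operation encoding is an assumption, and floating point is left to a separate unit as noted.

#include <stdint.h>

typedef enum { ALU_ADD, ALU_SUB, ALU_SLL, ALU_SRL, ALU_SRA,
               ALU_AND, ALU_OR, ALU_XOR, ALU_SLT } alu_op_t;

/* Behavioural 32-bit integer ALU. */
uint32_t alu(alu_op_t op, uint32_t a, uint32_t b) {
    switch (op) {
    case ALU_ADD: return a + b;
    case ALU_SUB: return a - b;
    case ALU_SLL: return a << (b & 31);
    case ALU_SRL: return a >> (b & 31);
    case ALU_SRA: return (uint32_t)((int32_t)a >> (b & 31)); /* sign-extending shift */
    case ALU_AND: return a & b;
    case ALU_OR:  return a | b;
    case ALU_XOR: return a ^ b;
    case ALU_SLT: return (int32_t)a < (int32_t)b;            /* signed compare */
    }
    return 0;
}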
Adders
• Adder, Subtractor (ref[3])
– Ripple carry adder
– Carry lookahead adder
– Carry skip adder
– Carry select adder
– Tree adder
Carry Lookahead adder
[Figure: 16-bit carry-lookahead adder organized as four 4-bit groups (inputs A/B 4:1 through 16:13, sums S, group propagate signals P4:1, P8:5, P12:9, P16:13, carries Cin, C4, C8, C12, Cout)]
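To illustrate the lookahead idea in software, the C sketch below computes per-bit generate (g = a AND b) and propagate (p = a XOR b) signals and evaluates the carry recurrence c[i+1] = g[i] OR (p[i] AND c[i]); it models the recurrence itself, not the exact 4-bit group structure of the figure.

#include <stdint.h>

/* Carry-lookahead formulation of 32-bit addition:
 * g[i] = a[i] & b[i], p[i] = a[i] ^ b[i],
 * c[i+1] = g[i] | (p[i] & c[i]), sum[i] = p[i] ^ c[i]. */
uint32_t cla_add32(uint32_t a, uint32_t b, unsigned cin) {
    uint32_t g = a & b, p = a ^ b;
    uint32_t c = cin & 1u;              /* c[0] */
    uint32_t carries = c;
    for (int i = 0; i < 31; i++) {
        c = ((g >> i) & 1u) | (((p >> i) & 1u) & c);  /* lookahead recurrence */
        carries |= c << (i + 1);
    }
    return p ^ carries;                 /* sum[i] = p[i] ^ c[i] */
}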
Carry skip adder
[Figure: 16-bit carry-skip adder - four 4-bit groups with skip multiplexers forwarding C4, C8, and C12 toward Cout]
Carry select adder
Brent-Kung tree adder
[Figure: carry-select adder and Brent-Kung prefix-tree adder - prefix computation over bit spans (1:0, 3:2, ..., 15:0) producing all carries]
Multipliers
• Multiplier
– Multi-cycle : shift, and accumulate
– Booth multiplier
• Generate partial products
• Reduce partial products : (3,2) counter, (4,2) compressor
• Fast carry propagate addition
– Array multiplier
• Regular structure
[Figure: 4x4 array multiplier - partial-product cells (inputs x0..x3, y0..y3) feeding a carry-save adder (CSA) array and a final carry-propagate adder (CPA) to produce p0..p7; the critical path runs through the array]
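A minimal C model of the multi-cycle shift-and-accumulate scheme mentioned above; the Booth recoding and the array structure shown in the figure are not modelled here.

#include <stdint.h>

/* Multi-cycle multiplication by repeated shift and accumulate:
 * examine one multiplier bit per "cycle" and add the shifted
 * multiplicand when that bit is 1. */
uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t addend  = multiplicand;    /* shifted left each cycle */
    for (int cycle = 0; cycle < 32; cycle++) {
        if (multiplier & 1u)
            product += addend;          /* accumulate partial product */
        addend   <<= 1;
        multiplier >>= 1;
    }
    return product;
}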
Instruction decoding
• Instruction decoding
– For a given instruction word, select the operation to perform (opcode)
– Read register address (operand 1, operand 2)
– Write register address (operand 3)
• Program counter
– increment (sequential process)
– jump (by jump instruction)
– branch (by branch instruction)
• Register file
– Read operands indicated by instruction
– Write enable and address indicated by instruction
• ALU
– Operation type selection (ADD, SUB, SLL, AND, ...)
– Signed/Unsigned operation selection
• Data memory
– For a load instruction, generate the read address and read enable, and save the result to the register file
– For a store instruction, generate the write address, write data, and write enable
– The address is typically generated by the ALU (displacement addressing mode)
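To make the decoder's role concrete, here is a hedged C sketch that maps a few assumed opcodes onto the control signals listed above; both the opcode set and the signal names are illustrative, not those of a particular ISA.

#include <stdbool.h>

typedef enum { OP_ADD, OP_LOAD, OP_STORE, OP_BRANCH } opcode_t;  /* assumed opcodes */

typedef struct {
    bool rf_write;    /* write enable for the register file           */
    bool dmem_read;   /* read enable for data memory (load)           */
    bool dmem_write;  /* write enable for data memory (store)         */
    bool branch;      /* PC may be redirected by the branch condition */
    bool alu_src_imm; /* ALU operand B comes from the immediate field */
} controls_t;

controls_t decode(opcode_t op) {
    controls_t c = {0};
    switch (op) {
    case OP_ADD:    c.rf_write = true;                            break;
    case OP_LOAD:   c.rf_write = true;  c.dmem_read = true;
                    c.alu_src_imm = true;                         break; /* addr = base + disp */
    case OP_STORE:  c.dmem_write = true; c.alu_src_imm = true;    break;
    case OP_BRANCH: c.branch = true;                              break;
    }
    return c;
}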
Multiplexer Control for Program Counter
• Multiplexer control for Program counter
– Incrementer (sequential execution)
– Adder or ALU (branch operation)
– Immediate (jump operation)
[Figure: single-cycle datapath (PC, IMEM, instruction decoder, RF, ALU, DMEM) with the decoder driving the multiplexer select signals and the read/write enables]
Multiplexer Control for Register File
• Multiplexer control for register file
– ALU
– Data memory
– Immediate
Multiplexer Control for ALU Operands
• Multiplexer control for ALU operands
– Register file
– Immediate (from instruction word)
– Data memory (CISC case)
– Program Counter
– Special registers
Multiplexer Control For Data Memory
• Multiplexer control for Data Memory
– Register file
– ALU
– Immediate
Finite State Machine
• Implementing Controller : Finite State Machine
– Input : opcode
– Output : mux selection, write/read enable for RF, DMEM
– State : pipeline enable/disable (or processor stall)
FSM Implementation
[Figure: ROM-based controller - the opcode and current state form the address into a microcode memory (ROM) whose data outputs provide the control signals and the next state]
Microprogramming
• Implementing FSM
– Logic gates
– PLA (Programmable Logic Array)
– ROM (Read Only Memory)
– Microprogramming
[Figure: PLA-based controller - the opcode and current state enter the AND plane; the OR plane produces the control signals and the next state]
Contents
1. Overview of Processor Architecture
2. Pipelining
3. Memory Systems
4. Advanced Topics
5. Summary
6. Exercises
7. References
Pipelining
• Pipelining
– Slices the unified datapath into multiple clock stages
– The processor works on several instructions at the same time
• Better instruction throughput
• Worse instruction latency
• Benefits
– High clock frequency (short critical path)
– Resource sharing
• Drawbacks
– Pipeline registers
– Hazard detection
Multi-Cycle Implementation
• From a single cycle to multiple cycles (ref [1])
– 5 cycle implementation
• IF : Instruction Fetch
• ID : Instruction Decode
• EX : Execution
• MEM : Memory Access
• WB : Write Back
[Figure: datapath divided into five stages - IF (PC, IMEM, IR), ID (instruction decoder, RF read), EX (ALU), MEM (DMEM), WB (register write)]
Pipelined Architecture
• For the multi-cycle architecture
– One instruction needs 5 cycles
– Two instructions need 10 cycles
• Because of the exclusive use of the datapath
• Pipelined architecture
– One instruction needs 5 cycles
– Two instructions need 6 cycles
• Each stage can be used by a different instruction
• Implementation
– Slice the datapath with pipeline registers
– Store the control signals for each instruction at the decoding stage
3-Stage Pipelined Architecture
• 3 stage pipelining
– IF : instruction fetch
– ID : instruction decode
– EX : execution (including memory access, write back)
[Figure: 3-stage pipeline - IMEM (IF stage), IF/ID register, register file and decode (ID stage), ID/EX register, ALU and data memory access (EX stage)]
5-Stage Pipelined Architecture
• 5 stage pipelining
– like 5 cycle implementation
• IF, ID, EX, MEM, WB
[Figure: 5-stage pipeline - IF (program counter, IMEM), ID (instruction register, register file read), EX (ALU), MEM (DMEM), WB (register file write), separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
Forwarding
• What about an instruction sequence with dependencies?
INST1 : ADD R1, R2, R3
INST2 : SLL R4, R2, R1
INST3 : CMPEQ R3, R4, R1
• Correct results (with forwarding)
           R1  R2  R3  R4
Before 1    5   1   3   5
Before 2    4   1   3   5
Before 3    4   1   3  16
After 3     4   1   0  16
• Without forwarding (5-stage case), the results will be wrong
           R1            R2  R3           R4
Before 1    5             1   3            5
Before 2    4 (CORRECT)   1   3            5
Before 3    4             1   3           32 (WRONG)
After 3     4             1   1 (WRONG)   32
Solution for Data Dependency
• What is the problem?
– It is caused by pipelining
– Intermediate results must be forwarded to the execution stage
– Forwarding from
• The execution stage result (EX/MEM pipeline register)
• The memory access stage result (MEM/WB pipeline register)
Modification for Forwarding
• Modification
– Adding datapaths
• EX/MEM stage (ALU output) → ALU operand
• MEM/WB stage (DMEM data) → ALU operand
– Adding controls
• ALU operand A selection
• ALU operand B selection
– Adding pipelining control register
• Write register address
– MEM stage
– WB stage
• Decoded control signals
– Adding hazard detection logic
• Compare execution stage operands with MEM stage or WB stage registers
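The comparisons performed by such a hazard-detection/forwarding unit can be sketched in C as below; the pipeline-register field names mirror the description above and are otherwise assumptions.

#include <stdbool.h>

typedef enum { FWD_NONE, FWD_FROM_EXMEM, FWD_FROM_MEMWB } fwd_t;

/* Decide where ALU operand A of the instruction in EX should come from.
 * rs_ex: source register read in EX; rd_exmem / rd_memwb: destination
 * registers of the two older instructions still in the pipeline. */
fwd_t forward_a(int rs_ex,
                int rd_exmem, bool regwrite_exmem,
                int rd_memwb, bool regwrite_memwb) {
    if (regwrite_exmem && rd_exmem != 0 && rd_exmem == rs_ex)
        return FWD_FROM_EXMEM;          /* newest value wins: EX/MEM first */
    if (regwrite_memwb && rd_memwb != 0 && rd_memwb == rs_ex)
        return FWD_FROM_MEMWB;
    return FWD_NONE;                    /* read the register file normally */
}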
Pipeline Hazards
• Pipelining introduces pipeline hazards, problems that do not arise in the simple single-cycle or multi-cycle implementations
• There are three kinds of hazards caused by pipelining
– Structural hazard
– Control hazard
– Data hazard
Pipeline Hazards
• Structural hazard
– Collision of functional unit or resource conflicts
– Hardware cannot support the instructions at the same time
– e.g.1. ALU (sharing IF stage and EX stage)
– e.g.2. Memory (sharing instruction and data memory)
• Control hazard
– Unexpected change of program counter
– Due to branch instruction and other instructions
• Data hazard
– A subsequent instruction needs the result of a previous instruction
– Solution : forwarding logic
Structural Hazards
• Structural hazard
– Arises when some functional unit is not fully pipelined
– Can be solved by duplicating the functional unit
• e.g. 1 : Conflict on the register file write port
– Some instructions finish in 4 cycles (add, shift, ...)
– A load instruction finishes in 5 cycles
– A load instruction followed by an add instruction causes a structural hazard
– This is due to having only one register write port
• e.g. 2 : Conflict on memory
– In some architectures, only one memory is provided
– Reading instructions and data both need to access the memory
– A memory access instruction (load/store) causes a memory conflict
• e.g. 3 : Conflict on the ALU
– In some architectures, program counter control needs the ALU for branch instructions
– The next program counter is calculated before the EX stage
– When the ALU is shared between PC calculation and instruction execution, a branch instruction causes a pipeline stall
Control Hazards
• Control hazard
– Control hazards are generated by branch or jump instructions
– When processing non-sequential control flow
• Instructions fetched after such an instruction must be invalidated
• e.g. 1 : Hazard from a branch instruction
– If the branch address is calculated in the execution stage
• Two instructions (in the ID and IF stages) must be invalidated
– If the branch address is calculated in the decoding stage
• One instruction (in the IF stage) must be invalidated
• e.g. 2 : Hazard from a jump instruction
– As with a branch, one or two instructions must be invalidated
– However, some techniques can reduce the pipeline stall
• e.g. the jump target is taken directly from the instruction word, so it is known without an address calculation
Types of Data Dependencies
• RAR (read after read)
• WAR (write after read)
• WAW (write after write)
• RAW (read after write)
Data Hazards
• RAR (read after read) is not actually a hazard
– ADD R3, R1, R2
– SLL R4, R1, R2 // pipelining does not affect the results at all
• WAR (write after read) is called antidependence
– ADD R3, R1, R2 // reading R1 must be done before writing R1
– SLL R1, R4, R5
– The pipelined structure here reads operands in the ID stage and writes results back in the WB stage, which keeps these in order, so correct behavior is guaranteed
• WAW (write after write) is called output dependence
– ADD R3, R1, R2
– SLL R3, R4, R5 // writing R3 must commit after the add instruction
– Register write-back must be committed through a single path or in program order
• Register renaming prevents WAR and WAW hazards
– These hazards are not true (logical) dependencies
– They are caused by the reuse of register names during register allocation
RAW Hazards and Forwarding
• Read-after-write hazards are the ones actually handled by forwarding
– ADD R3, R1, R2
– SLL R4, R3, R5
– This RAW hazard can be solved with forwarding
– The hazard detection unit tells the ALU operand multiplexer to select the forwarded path
– However, some instruction sequences must still be stalled
– LD R1, 4(R2)
– ADD R3, R1, R2
• The data from the load instruction only becomes available at the MEM/WB boundary
• The subsequent add instruction must be stalled before its EX stage
[Figure: dependence chain of the example - (R1 + R2) produces R3, which feeds the shift by R5 to produce R4]
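The load-use check described above can be sketched as follows (same assumed pipeline-register naming as the forwarding sketch); it flags the one case forwarding alone cannot fix.

#include <stdbool.h>

/* A load's data is only available after MEM, so an immediately following
 * instruction that reads the loaded register must be stalled for one cycle. */
bool load_use_stall(bool idex_is_load, int idex_rd, int ifid_rs, int ifid_rt) {
    return idex_is_load && idex_rd != 0 &&
           (idex_rd == ifid_rs || idex_rd == ifid_rt);
}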
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
Memory Systems
• Memory is the processor's working space, from which it reads instructions and reads/writes data
– Fast memory is required to reach higher clock frequencies
– Fast memory is usually more expensive than slow memory
– So the trade-off between speed and cost is the main topic in designing memory systems
• How to obtain speed against cost?
– Memory hierarchy
Memory Hierarchy
• Memory hierarchy
Registers
L1 Cache
L2 Cache
Main Memory (DRAM)
Secondary Storage (DISK)
Faster, but expensive
Slower and Larger
Cache
• A key characteristic of memory accesses : locality
• Temporal locality (locality in time)
– If an instruction/data element is referenced, it is likely to be referenced again soon
– Keep the most recently referenced elements close to the processor
• Spatial locality (locality in space)
– If an instruction/data element is referenced, its neighboring elements are likely to be referenced soon
– Because of the latency of larger memories (DRAM or disk), it helps to fetch the neighboring elements at the same time as the requested element
[Figure: probability of referencing vs. address space (spatial locality), and probability of referencing the same element vs. time duration (temporal locality)]
Average Memory Access Time
• The memory system design goal is to reduce the average memory access time
• Performance factors
– The probability that the data is found in the faster memory
– The latency of getting data from the slower memory
• Cache design
– Reduces the access frequency to main memory
– Average access time = cache hit time + cache miss rate * DRAM access time
• A cache memory includes
– Tag : the address of the cache line
– Data : data elements from DRAM
– Valid bit : indicates that the tag and data are valid
– Dirty bit : indicates that the data differs from DRAM
Average Access Time = Hit time + Miss rate * Miss penalty
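As an illustrative example with assumed numbers: a 1-cycle hit time, a 5% miss rate, and a 40-cycle miss penalty give an average access time of 1 + 0.05 * 40 = 3 cycles.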
Source of Cache Misses
• Compulsory miss
– Called cold miss or first-reference miss
– Unavoidable
• Invalidation miss
– From context switching
• Capacity miss
– Due to the cache being smaller than the working set
– Reduced by making the cache larger
– e.g. large matrix multiplication
• Conflict miss
– Due to the cache access pattern
– Some cache lines are loaded and discarded repeatedly
– Can be reduced with a higher-associativity cache
Cache Organization
• Three kind of cache organization
– Direct mapped cache
– Set associative cache
– Fully associative cache
Direct Mapped Cache
• Direct mapped cache organization
– 2^N-byte direct-mapped cache
– 2^M-byte line size
– Number of lines : 2^(N-M)
• Memory address
– M-bit offset
– (N-M)-bit index
– (32-N)-bit tag
• Simplest organization
• Highest conflict miss ratio
• e.g. 8 KB direct-mapped cache with 16-byte lines
– 512 lines
– 4-bit offset
– 9-bit index
– 19-bit tag
– Tag memory size : 19 bits * 512 = 9.5 Kbit
– With the valid bit and dirty bit, the total cache storage is
• 8 KB * 8 bits/byte + 9.5 Kbit + 512 bit + 512 bit = 74.5 Kbit = 9.3 KB
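For the 8 KB, 16-byte-line example above, the following C sketch splits a 32-bit address into offset, index, and tag with exactly the field widths listed; the function name and sample address are illustrative.

#include <stdint.h>
#include <stdio.h>

/* 8 KB direct-mapped cache, 16-byte lines: 4-bit offset, 9-bit index, 19-bit tag. */
void split_address(uint32_t addr) {
    unsigned offset = addr & 0xFu;          /* bits  3:0  */
    unsigned index  = (addr >> 4) & 0x1FFu; /* bits 12:4  */
    unsigned tag    = addr >> 13;           /* bits 31:13 */
    printf("addr=0x%08x -> tag=0x%x index=%u offset=%u\n",
           (unsigned)addr, tag, index, offset);
}

int main(void) {
    split_address(0x1234ABCD);              /* arbitrary sample address */
    return 0;
}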
Direct Mapped Cache
[Figure: direct-mapped cache - the address is split into tag (31:13), index (12:4), and offset (3:0); the index selects one of 512 lines, each holding a valid bit, dirty bit, tag, and the data words of one 16-byte line]
Set Associative Cache
• Set-associative cache organization
– 2^N-byte set-associative cache
– 2^M-byte line size
– 2^O-way set associativity
– Number of lines : 2^(N-M)
• Memory address
– M-bit offset
– (N-M-O)-bit index
– (32-N+O)-bit tag
• Needs a replacement algorithm
• e.g. 8 KB 4-way set-associative cache with 16-byte lines
– 512 lines (128 sets)
– 4-bit offset
– 7-bit index
– 21-bit tag
– Tag memory size : 21 bits * 512 = 10.5 Kbit
– With the valid bit and dirty bit, the total cache storage is
• 8 KB * 8 bits/byte + 10.5 Kbit + 512 bit + 512 bit = 75.5 Kbit = 9.4 KB
– A set-associative cache needs more multiplexers
Set Associative Cache
[Figure: 4-way set-associative cache - the address is split into tag (31:11), index (10:4), and offset (3:0); the index selects one of 128 sets, the four ways of the set are read in parallel, and their tags are compared to select the hit data]
Fully Associative Cache
• Fully associative cache organization
– 2^N-byte fully associative cache
– 2^M-byte line size
– Number of lines : 2^(N-M)
• Memory address
– M-bit offset
– (32-M)-bit tag
• No conflict misses (highest hit ratio)
• But needs many more comparators
• e.g. 8 KB fully associative cache with 16-byte lines
– 512 lines
– 4-bit offset
– 28-bit tag
– Tag memory size : 28 bits * 512 = 14 Kbit
– With the valid bit and dirty bit, the total cache storage is
• 8 KB * 8 bits/byte + 14 Kbit + 512 bit + 512 bit = 79 Kbit = 9.9 KB
– A fully associative cache is usually implemented with a CAM (Content Addressable Memory), which consists of RAM cells with built-in comparators
Fully Associative Cache
[Figure: fully associative cache - the address is split into tag (31:4) and offset (3:0); the tag is matched against all lines in a CAM, and the hit line's data words are selected through a multiplexer]
Virtual Memory
• Motivation for virtual memory (VM)
– Multiple processes
• Each process has its own address space
• The address spaces overlap
• Without VM, the programmer would have to change the addresses inside the code
– Memory protection
• Each process must access only its own memory region
• Without memory protection, a process can interfere with the memory region of another process, and one problematic process can crash the whole system
Address Translation
• Virtual memory
– A memory management technique that treats main memory as a cache for secondary storage
– Separates the programmer's view of the address space from the physical view
• Address space
– V = {0, 1, ..., N-1} // virtual address space
– P = {0, 1, ..., M-1} // physical address space
– N > M
• Address translation
– MAP : V → P ∪ {Φ}
– MAP(A) = A'
• the data at virtual address A is at physical address A'
– MAP(A) = Φ
• the data at virtual address A is not in physical memory
• invalid address, or stored on disk
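A minimal single-level page-table walk illustrating MAP; the 4 KB page size, table size, and field names are assumptions for the sketch.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12                  /* assumed 4 KB pages          */
#define NUM_PAGES 1024                /* assumed virtual space size  */

typedef struct { bool valid; uint32_t ppn; } pte_t;
pte_t page_table[NUM_PAGES];

/* MAP(A) = A' when the page is resident; MAP(A) = Phi (page fault) otherwise. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                 /* page fault: the OS must bring the page in */
    *paddr = (page_table[vpn].ppn << PAGE_BITS) | offset;
    return true;
}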
Address Translation Process
• Translation process
– The processor requests the data at address A
– The address translator translates virtual address A using the page table, TLB, MMU, etc.
– On a translation hit (the data is in main memory), the data is read from main memory
– On a translation miss, a page fault exception occurs and the OS handler brings the page into main memory from secondary storage
[Figure: the processor sends virtual address A to the address translator (MMU, TLB, page table); on a hit it produces physical address A' for main memory, on a miss (Φ) a page fault is raised and the OS fault handler brings the page in from secondary storage]
Page Table
• Page table
– Stores physical page number indexed by virtual page number
– Valid bit indicates the requested page is in main memory
– Protection bits check whether the requested access is allowed
• Read only / writable
• Kernel mode only / user mode
Translation Lookaside Buffer
• Translation Lookaside Buffer (TLB)
– The address translation process itself needs one or more additional memory accesses
• The page table is also stored in main memory
• Multi-level translation needs even more memory accesses
– The TLB reduces the translation overhead, like a cache
• Part of the page table can be translated directly from the TLB
• On a TLB miss, the TLB is refilled with the physical page number from the page table
[Figure: the virtual address is split into a virtual page number and page offset; the TLB is looked up by the virtual page number (valid bit, tag, physical page number PPN), and the physical address is formed from the PPN and the page offset]
Cache and Address Translation
• Cache with address translation
– Cache with physical address
• Address translation first, then access cache
• Long access time to cache
– Cache with virtual address
• Access cache without address translation
• Parallel checking of TLB and cache
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
Advanced Topics
• Deeper pipelines allow higher clock frequencies, but
– pipeline hazards cause performance degradation
• Modern processors exploit more parallelism than ever before
– Instruction level parallelism
– Data level parallelism
– Thread level parallelism
– Multiprocessor
• More complex architectures and additional logic blocks are required for better performance
– Branch prediction
– Exploiting parallelism
Branch Prediction
• Branch instructions affect the pipeline state because the instructions following a branch are only determined after the branch resolves
– If the branch is taken, the instructions already fetched after the branch become bubbles
– This is called a control hazard
• However, a typical program has many branches related to loops
– e.g. for (i=0; i<10; i++) { c[i] = a[i] + b[i]; }
– // 9 branches are taken, but 1 branch is not taken
• Branch prediction exploits locality, like a cache
– Branch prediction resolves a branch hazard by assuming an outcome instead of waiting for the actual result
– Some branch instructions have a tendency to be taken or not taken
• e.g. an if statement that checks a corner case
– At some moments, branch instructions are taken more often
• e.g. a loop statement, which is mostly taken
Branch Prediction Schemes
• Branch prediction scheme (ref [5])
– Program-based predictors vs profile-based predictors
– Or static schemes vs dynamic schemes
• Static schemes
– Determine the direction at compile time
– Less hardware complexity
– Schemes
• Always taken
• Taken when backward, not taken when forward
• Profile-based decision : may be inaccurate for a different program
• Dynamic schemes
– The prediction changes as the program runs
– Hardware overhead : branch history table, branch target buffer, etc.
– 1-bit prediction scheme
– 2-bit prediction scheme
– Bimodal branch prediction scheme
– Local branch prediction scheme
– Global branch prediction scheme
Dynamic Branch Prediction
• 1-bit branch prediction
– One bit stores whether the previous branch was taken or not
– A loop branch is mispredicted twice per loop execution (at the exit and again on re-entry)
• 2-bit branch prediction
– A 2-bit counter changes state according to the previous branches
– Fewer mispredictions than the 1-bit predictor
[Figure: 2-bit predictor state diagram - Predict Taken (11), Predict Taken (10), Predict Not Taken (01), Predict Not Taken (00), with Taken/Not Taken transitions between the states]
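A behavioural C sketch of the 2-bit predictor above: the two upper states predict taken, and the transitions (including the weak-state jumps to the opposite strong state) follow the example trace below; the type and function names are illustrative.

#include <stdbool.h>

/* 2-bit branch predictor FSM: 11, 10 predict taken; 01, 00 predict not taken. */
typedef enum { STRONG_NT = 0, WEAK_NT = 1, WEAK_T = 2, STRONG_T = 3 } state_t;

bool predict(state_t s) { return s == WEAK_T || s == STRONG_T; }

state_t update(state_t s, bool taken) {
    switch (s) {
    case STRONG_T:  return taken ? STRONG_T : WEAK_T;
    case WEAK_T:    return taken ? STRONG_T : STRONG_NT;
    case WEAK_NT:   return taken ? STRONG_T : STRONG_NT;
    case STRONG_NT: return taken ? WEAK_NT  : STRONG_NT;
    }
    return s;
}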
Example of Branch Prediction
• Example) Doubly nested loop
– for (i=0; i<10; i++)
–   for (j=0; j<10; j++)
–     Loop Body;
– The first two predictions miss; afterwards, there is one miss per execution of the inner loop (at its exit)
j 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 ...
Prediction NT NT T T T T T T T T T T T T T T ...
State 00 01 11 11 11 11 11 11 11 11 11 10 11 11 11 11 ...
Action T T T T T T T T T T NT T T T T T ...
Advanced Branch Prediction
• Expansion of 2-bit branch prediction
– Bimodal branch prediction
– Local history prediction
– Global history prediction
• Bimodal branch prediction
– Table indexed by some bits of branch address
• Local branch prediction
– Store temporary local history
– Consists of two tables
• Table indexed by some bits of program counter
• Table indexed by local taken history
• Global branch prediction
– Store temporary global history
– Table indexed by global taken history
Advanced Branch Prediction
[Figure: bimodal branch prediction (the PC indexes a table of counters), local branch prediction (the PC indexes a history table whose shift-register contents index a table of counters), and global branch prediction (a global history shift register GR indexes a table of counters)]
Exploiting Parallelism
• Parallelism is a key factor in performance improvement
– More instructions, threads, and processes executed simultaneously
• There are several architectural approaches to exploiting parallelism
– VLIW
– Superscalar
– Simultaneous multithreading
– Vector processor
– Multiprocessor
Very Long Instruction Word Architecture
• Very Long Instruction Word (VLIW) architecture
– One instruction word contains two or more issue slots
– A number of operations are packed into one long instruction word
• Architectural support
– Register file
• Two or more write ports (depending on the number of issue slots)
• Four or more read ports
– Multiple ALUs
• Duplicated ALUs are used for arithmetic operations
• e.g. IA-64 architecture
– Explicitly Parallel Instruction Computing (EPIC) architecture
– 128-bit instruction bundle
• Three instructions (41 bits each)
• 5-bit template
Superscalar Architecture
• Superscalar architecture
– With hardware support, more than one instruction is issued at once
– Originally sequential instructions are dynamically grouped into wider issue packets
– Register renaming
– Out-of-order execution
• e.g. Two ALU superscalar machine
ADD R5,R1,R2
SUB R6,R3,R4
SLL R7,R1,R5
AND R8,R3,R6
cycle 0 : ADD R5,R1,R2 || SUB R6,R3,R4
cycle 1 : SLL R7,R1,R5 || AND R8,R3,R6
ADD/SUB : IF ID EX MEM WB
SLL/AND :    IF ID EX MEM WB
VLIW vs Superscalar
• Comparison of VLIW and superscalar architectures
– Compilation
• A VLIW machine needs recompilation of legacy code
• A superscalar machine can run existing single-issue binaries
– Scheduling window
• A VLIW machine has an (in principle) unlimited scheduling window
– Limited by compilation time to fewer than about a hundred instructions
• A superscalar machine has a limited window
– Dependent on the hardware implementation
– Hardware complexity
• A superscalar machine detects hazards dynamically
• A VLIW machine detects hazards statically at compile time
– Software complexity
• A VLIW machine needs a new compiler
Simultaneous Multithreading
• Simultaneous Multithreading (SMT) (ref [6])
– Combines wide-issue superscalar and multithreaded processors
– Instruction issue slots are filled from multiple threads
• Instructions from a single thread are not enough to fill all issue slots
– Register renaming distinguishes the different threads
– Per-thread hardware contexts (program counter, registers), instruction retirement, branch target buffer, translation lookaside buffer, etc.
Vector Processor
• Vector Processor
– A single vector instruction implies many operations
• Fewer instruction fetches
– Each result is independent of the other results
• Multiple operations can be executed in parallel
• A vectorizing compiler checks the dependencies
– Reduces branches and their side effects
– Vector instructions access memory with a known pattern
• Effective prefetching
• Memory interleaving for higher bandwidth
for (i=0; i<64; i++)
    Y[i] = X[i] * C;
VMUL V2, V1, R1
// V1, V2 are vector registers, each holding 64 elements
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
Summary
• We covered the pipeline structure and pipeline hazards.
• It is important to understand the organization and characteristics of the various cache types (direct mapped, set associative, fully associative) and the various causes of cache misses.
• We covered the need for virtual memory, page faults, and TLB operation.
• We covered various branch prediction methods for reducing the branch penalty, which matters even more in high-performance processors.
• We covered the operating principles of VLIW, superscalar, and SMT, which execute more than one instruction per clock cycle.
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
Exercises
1. Describe the types of pipeline hazards and briefly describe how each hazard is resolved.
2. List the types of caches and explain the organization of each.
3. Explain the various causes of cache misses.
4. Describe what you know about page faults.
5. Describe how branch prediction is performed and why it is used.
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
References
• [1] David A. Patterson and John L. Hennessy, “Computer Organization and Design,” 3rd ed., Morgan Kaufmann, 2005
• [2] John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach,” 3rd ed., Morgan Kaufmann, 2003
• [3] Neil H. E. Weste and David Harris, “CMOS VLSI Design,” Addison Wesley, 2005
• [4] Steve Furber, “ARM System-on-Chip Architecture,” 2nd ed., Addison Wesley, 2000
• [5] Chih-Cheng Cheng, “The Schemes and Performances of Dynamic Branch Predictors”
• [6] Susan J. Eggers et al., “Simultaneous Multithreading: A Platform for Next-Generation Processors”, IEEE Micro, 1997