Module 2
Review of Processor Architectures
Theory: 3 hours
채수익
Seoul National University
Learning Objectives
• Understand the basic concepts of an SoC
• Understand SoC architecture and its components
• Introduction to the SoC design process
• Understand the basic concepts of an SoC platform
Module Organization
Elective / Required
SoC Architecture
Embedded Processor I, II
Embedded Memory
Typical Logic Blocks
External Interface
On-Chip Bus Architecture
Bus Interface Design
Lab
Processor Design Lab I, II
Simple SoC Design Lab I, II
Reconfigurable Processor Architecture
Low Power SoC Design I, II
Network-on-Chip
Introduction
Introduction to SoC Architecture
Review of Processor Architecture
SoC Design Examples
Guidelines for Using the Lecture Materials
• 3-hour lecture
• All of the content is covered.
Contents
1. Overview of Processor Architecture
2. Pipelining
3. Memory Systems
4. Advanced Topics
5. Summary
6. Exercises
7. References
Overview of Processor Architecture
• A computer is composed of
– Processor (datapath, control)
– Input (keyboard, mouse, joystick)
– Output (LCD, CRT, printer)
– Memory (CD, HDD, DRAM, SRAM)
– Network
• Processor : the heart of the computer
– A VLSI (Very Large Scale Integration) system
– State-of-the-art processors integrate a few billion transistors
• Brief History
– Intel 4004 (in 1971, 2,250 transistors)
– Intel 8086 (in 1978, 29,000 transistors)
– Intel 80486 (in 1989, 1,200,000 transistors)
– Pentium (in 1993, 3,100,000 transistors)
– Core 2 (in 2006, 291,000,000 transistors)
– Dual-core Itanium 2 (in 2006, 1,720,000,000 transistors)
History of Processor Architecture
[Figure: processor generations - 4004, 8086, 80486, Pentium]
Instruction Set Architectures
• What is an Instruction Set Architecture?
– Like architecture in building construction : a common framework for making things
– An important factor in performance and cost
• The architecture must satisfy user requirements
• At the same time, cost should be minimized from an engineering perspective
• Architecture evolution
– History
• Single-cycle processor
• Multi-cycle processor
• Pipelined processor
• Multi-issue processor
• Configurable processor
– Toward higher integration
• Multi-core
• More caches
– Low power : keep both clock frequency and voltage low
Hardware and Software Interface
• The Instruction Set Architecture is
– the interface between the processor and the software
• The compiler and assembler on the software side, and the control and datapath in the processor, must all conform to the instruction set architecture
– Instruction set architecture design, together with performance evaluation, has been researched for several decades
Classification of Instruction Set Architectures
• Classification (ref [2])
– Stack
• No registers, Short instruction length
• Restricted instruction order
• Used in real chips in the 1960s to 1970s
• Java Virtual Machine
– Accumulator based
• One accumulator register
• Rather short instructions
• e.g. UNIVAC I, EDSAC, Intel 8080
– Register-memory
• A set of registers
• The other operand comes from memory
• Most CISC
– Register-register
• All operands come from the register file
• Load-store architecture
• RISC
Basic Instruction Format
• Basic instruction format
– opcode
– operand1, operand2, ...
• Instruction types
– Arithmetic, logical, compare
– Branch, jump
– Load, store
• Addressing modes
– How does the processor obtain operands from the instruction word?
– Register
• Add R1, R2      // Reg[R1] = Reg[R1] + Reg[R2]
– Immediate
• Add R1, #2      // Reg[R1] = Reg[R1] + 2
– Displacement (register indirect)
• Add R1, 4(R2)   // Reg[R1] = Reg[R1] + Mem[Reg[R2] + 4]
– Auto-increment/decrement (pre, post)
• Add R1, (R2)+   // Post increment
//   Reg[R1] = Reg[R1] + Mem[Reg[R2]]
//   Reg[R2] = Reg[R2] + s   // s is the operand size
• Add R1, -(R2)   // Pre decrement
//   Reg[R2] = Reg[R2] - s   // s is the operand size
//   Reg[R1] = Reg[R1] + Mem[Reg[R2]]
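To make these semantics concrete, here is a small C sketch that emulates each addressing mode for an ADD; the register/memory arrays and the 4-byte operand size are assumptions for illustration, not part of any specific ISA.

#include <stdint.h>

/* Toy machine state, for illustration only (sizes are assumptions). */
uint32_t Reg[32];
uint8_t  Mem[1024];

/* Little-endian 32-bit load from the toy memory. */
static uint32_t load32(uint32_t a) {
    return (uint32_t)Mem[a] | (uint32_t)Mem[a+1] << 8 |
           (uint32_t)Mem[a+2] << 16 | (uint32_t)Mem[a+3] << 24;
}

void add_register(int r1, int r2)       { Reg[r1] += Reg[r2]; }   /* Add R1, R2    */
void add_immediate(int r1, uint32_t im) { Reg[r1] += im; }        /* Add R1, #im   */
void add_displacement(int r1, int r2, uint32_t d) {               /* Add R1, d(R2) */
    Reg[r1] += load32(Reg[r2] + d);
}
void add_postinc(int r1, int r2) {                                /* Add R1, (R2)+ */
    Reg[r1] += load32(Reg[r2]);
    Reg[r2] += 4;                         /* s = 4 bytes assumed */
}
void add_predec(int r1, int r2) {                                 /* Add R1, -(R2) */
    Reg[r2] -= 4;                         /* s = 4 bytes assumed */
    Reg[r1] += load32(Reg[r2]);
}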
MIPS Instruction Set Architecture
• Example : MIPS architecture
– Register-register architecture (load-store architecture)
– Three addressing modes
• Displacement, immediate, register indirect
– 32 64-bit GPRs, 32 32-/64-bit floating-point registers
– 8-, 16-, 32-, and 64-bit integer data types
– 32-bit single-precision and 64-bit double-precision floating-point data
– Three basic instruction formats
• R-type : three register operands (e.g. ADD R0,R1,R2)
• I-type : two register with immediate operands (e.g. ADD R0,R1,#4)
• J-type : long immediate for jump instruction
R-type : opcode (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
I-type : opcode (31:26) | rs (25:21) | rt (20:16) | immediate (15:0)
J-type : opcode (31:26) | address (25:0)
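A minimal sketch of how the field boundaries above can be extracted from a 32-bit instruction word in C; the struct and function names are illustrative.

#include <stdint.h>

typedef struct {
    uint32_t opcode, rs, rt, rd, shamt, funct;  /* R-type fields   */
    uint32_t immediate;                         /* I-type, bits 15:0 */
    uint32_t address;                           /* J-type, bits 25:0 */
} mips_fields_t;

/* Slice the fixed bit positions shown above out of one instruction word. */
mips_fields_t decode_fields(uint32_t inst) {
    mips_fields_t f;
    f.opcode    = (inst >> 26) & 0x3F;      /* bits 31:26 */
    f.rs        = (inst >> 21) & 0x1F;      /* bits 25:21 */
    f.rt        = (inst >> 16) & 0x1F;      /* bits 20:16 */
    f.rd        = (inst >> 11) & 0x1F;      /* bits 15:11 */
    f.shamt     = (inst >>  6) & 0x1F;      /* bits 10:6  */
    f.funct     =  inst        & 0x3F;      /* bits 5:0   */
    f.immediate =  inst        & 0xFFFF;    /* bits 15:0  */
    f.address   =  inst        & 0x3FFFFFF; /* bits 25:0  */
    return f;
}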
Datapath Design
• The datapath supports the instruction functionality
• The datapath consists of
– Instruction-related parts
• Program counter (PC)
• Instruction memory (IMEM)
• Instruction register (IR)
– Data-processing-related parts
• Register file (RF)
• Arithmetic and logic unit (ALU)
• Data memory (DMEM)
• Instruction processing flow
1. The processor reads the instruction indicated by the current PC
2. After accessing IMEM, the instruction is fetched into the IR
3. The instruction is decoded and operands are read from the RF
4. The ALU performs the operation on the operands
5. The result is stored to DMEM or the RF
Functional Datapath Diagram
• Functional datapath diagram
– Two separate memory elements
• Instruction (read only)
• Data (read/write)
– Program counter and instruction register
– Register file
– ALU
[Figure: functional datapath - Program Counter (PC), Instruction Memory (IMEM), Instruction Register (IR), Register File (RF), Data Memory (DMEM), ALU]
Architectural Datapath Diagram
• Architectural datapath diagram
– The memories have an address bus and input/output data buses
– Support for branch and jump instructions
– Two read, one write port (2R1W) register file
– Write mux for register file from ALU and DMEM
– ALU gets operand from instruction word – immediate operand
• Sign extension unit
[Figure: architectural datapath - Program Counter (PC) and Instruction Register (IR) added around the functional datapath]
ALU Operations
• ALU performs
– Arithmetic operation
• Addition and subtraction
• Multiplication and division
• Integer operation or floating-point operation
– A separate FP unit is used because of its long execution time
– Shift operation
• Shift left/right logical
• Shift right arithmetic (sign extension)
• Rotate left/right (truncated bits concatenated to other side)
– Logical operation
• NOT, AND, OR, NAND, NOR, XOR, ...
– Comparison
• ==, !=, <, <=, >, >=
• Integer or floating-point
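As a behavioural illustration, a 32-bit integer ALU covering the operation classes above might look like the following C sketch; the operation encoding is an assumption, and floating point is left to a separate unit as noted.

#include <stdint.h>

typedef enum { ALU_ADD, ALU_SUB, ALU_SLL, ALU_SRL, ALU_SRA,
               ALU_AND, ALU_OR, ALU_XOR, ALU_SLT } alu_op_t;

/* Behavioural 32-bit integer ALU. */
uint32_t alu(alu_op_t op, uint32_t a, uint32_t b) {
    switch (op) {
    case ALU_ADD: return a + b;
    case ALU_SUB: return a - b;
    case ALU_SLL: return a << (b & 31);
    case ALU_SRL: return a >> (b & 31);
    case ALU_SRA: return (uint32_t)((int32_t)a >> (b & 31)); /* sign-extending shift */
    case ALU_AND: return a & b;
    case ALU_OR:  return a | b;
    case ALU_XOR: return a ^ b;
    case ALU_SLT: return (int32_t)a < (int32_t)b;            /* signed compare */
    }
    return 0;
}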
Adders
• Adder, Subtractor (ref[3])
– Ripple carry adder
– Carry lookahead adder
– Carry skip adder
– Carry select adder
– Tree adder
Carry Lookahead adder
[Figure: 16-bit carry-lookahead adder organized as four 4-bit groups (inputs A/B 4:1 through 16:13, sums S, group propagate signals P4:1, P8:5, P12:9, P16:13, carries Cin, C4, C8, C12, Cout)]
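To illustrate the lookahead idea in software, the C sketch below computes per-bit generate (g = a AND b) and propagate (p = a XOR b) signals and evaluates the carry recurrence c[i+1] = g[i] OR (p[i] AND c[i]); it models the recurrence itself, not the exact 4-bit group structure of the figure.

#include <stdint.h>

/* Carry-lookahead formulation of 32-bit addition:
 * g[i] = a[i] & b[i], p[i] = a[i] ^ b[i],
 * c[i+1] = g[i] | (p[i] & c[i]), sum[i] = p[i] ^ c[i]. */
uint32_t cla_add32(uint32_t a, uint32_t b, unsigned cin) {
    uint32_t g = a & b, p = a ^ b;
    uint32_t c = cin & 1u;              /* c[0] */
    uint32_t carries = c;
    for (int i = 0; i < 31; i++) {
        c = ((g >> i) & 1u) | (((p >> i) & 1u) & c);  /* lookahead recurrence */
        carries |= c << (i + 1);
    }
    return p ^ carries;                 /* sum[i] = p[i] ^ c[i] */
}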
Carry skip adder
[Figure: 16-bit carry-skip adder - four 4-bit groups with skip multiplexers forwarding C4, C8, and C12 toward Cout]
Carry select adder
Brent-Kung tree adder
[Figure: carry-select adder and Brent-Kung prefix-tree adder - prefix computation over bit spans (1:0, 3:2, ..., 15:0) producing all carries]
Multipliers
• Multiplier
– Multi-cycle : shift, and accumulate
– Booth multiplier
• Generate partial products
• Reduce partial products : (3,2) counter, (4,2) compressor
• Fast carry propagate addition
– Array multiplier
• Regular structure
[Figure: 4x4 array multiplier - partial-product cells (inputs x0..x3, y0..y3) feeding a carry-save adder (CSA) array and a final carry-propagate adder (CPA) to produce p0..p7; the critical path runs through the array]
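A minimal C model of the multi-cycle shift-and-accumulate scheme mentioned above; the Booth recoding and the array structure shown in the figure are not modelled here.

#include <stdint.h>

/* Multi-cycle multiplication by repeated shift and accumulate:
 * examine one multiplier bit per "cycle" and add the shifted
 * multiplicand when that bit is 1. */
uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t addend  = multiplicand;    /* shifted left each cycle */
    for (int cycle = 0; cycle < 32; cycle++) {
        if (multiplier & 1u)
            product += addend;          /* accumulate partial product */
        addend   <<= 1;
        multiplier >>= 1;
    }
    return product;
}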
Instruction decoding
• Instruction decoding
– For a given instruction word, select the operation to perform (opcode)
– Read register address (operand 1, operand 2)
– Write register address (operand 3)
• Program counter
– increment (sequential process)
– jump (by jump instruction)
– branch (by branch instruction)
• Register file
– Read operands indicated by instruction
– Write enable and address indicated by instruction
• ALU
– Operation type selection (ADD, SUB, SLL, AND, ...)
– Signed/Unsigned operation selection
• Data memory
– For a load instruction, generate the read address and read enable, and save the result to the register file
– For a store instruction, generate the write address, write data, and write enable
– The address is typically generated by the ALU (displacement addressing mode)
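To make the decoder's role concrete, here is a hedged C sketch that maps a few assumed opcodes onto the control signals listed above; both the opcode set and the signal names are illustrative, not those of a particular ISA.

#include <stdbool.h>

typedef enum { OP_ADD, OP_LOAD, OP_STORE, OP_BRANCH } opcode_t;  /* assumed opcodes */

typedef struct {
    bool rf_write;    /* write enable for the register file           */
    bool dmem_read;   /* read enable for data memory (load)           */
    bool dmem_write;  /* write enable for data memory (store)         */
    bool branch;      /* PC may be redirected by the branch condition */
    bool alu_src_imm; /* ALU operand B comes from the immediate field */
} controls_t;

controls_t decode(opcode_t op) {
    controls_t c = {0};
    switch (op) {
    case OP_ADD:    c.rf_write = true;                            break;
    case OP_LOAD:   c.rf_write = true;  c.dmem_read = true;
                    c.alu_src_imm = true;                         break; /* addr = base + disp */
    case OP_STORE:  c.dmem_write = true; c.alu_src_imm = true;    break;
    case OP_BRANCH: c.branch = true;                              break;
    }
    return c;
}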
Multiplexer Control for Program Counter
• Multiplexer control for Program counter
– Incrementer (sequential execution)
– Adder or ALU (branch operation)
– Immediate (jump operation)
[Figure: single-cycle datapath (PC, IMEM, instruction decoder, RF, ALU, DMEM) with the decoder driving the multiplexer select signals and the read/write enables]
Multiplexer Control for Register File
• Multiplexer control for register file
– ALU
– Data memory
– Immediate
Multiplexer Control for ALU Operands
• Multiplexer control for ALU operands
– Register file
– Immediate (from instruction word)
– Data memory (CISC case)
– Program Counter
– Special registers
Multiplexer Control For Data Memory
• Multiplexer control for Data Memory
– Register file
– ALU
– Immediate
Finite State Machine
• Implementing Controller : Finite State Machine
– Input : opcode
– Output : mux selection, write/read enable for RF, DMEM
– State : pipeline enable/disable (or processor stall)
FSM Implementation
[Figure: ROM-based controller - the opcode and current state form the address into a microcode memory (ROM) whose data outputs provide the control signals and the next state]
Microprogramming
• Implementing FSM
– Logic gates
– PLA (Programmable Logic Array)
– ROM (Read Only Memory)
– Microprogramming
[Figure: PLA-based controller - the opcode and current state enter the AND plane; the OR plane produces the control signals and the next state]
Contents
1. Overview of Processor Architecture
2. Pipelining
3. Memory Systems
4. Advanced Topics
5. Summary
6. Exercises
7. References
Pipelining
• Pipelining
– Slices the unified datapath into multiple clock stages
– The processor works on several instructions at the same time
• Better instruction throughput
• Worse instruction latency
• Benefits
– High clock frequency (short critical path)
– Resource sharing
• Drawbacks
– Pipeline registers
– Hazard detection
Multi-Cycle Implementation
• From a single cycle to multiple cycles (ref [1])
– 5 cycle implementation
• IF : Instruction Fetch
• ID : Instruction Decode
• EX : Execution
• MEM : Memory Access
• WB : Write Back
[Figure: datapath divided into five stages - IF (PC, IMEM, IR), ID (instruction decoder, RF read), EX (ALU), MEM (DMEM), WB (register write)]
Pipelined Architecture
• For the multi-cycle architecture
– One instruction needs 5 cycles
– Two instructions need 10 cycles
• Because of the exclusive use of the datapath
• Pipelined architecture
– One instruction needs 5 cycles
– Two instructions need 6 cycles
• Each stage can be used by a different instruction
• Implementation
– Slice the datapath with pipeline registers
– Store the control signals for each instruction at the decoding stage
3-Stage Pipelined Architecture
• 3 stage pipelining
– IF : instruction fetch
– ID : instruction decode
– EX : execution (including memory access, write back)
[Figure: 3-stage pipeline - IMEM (IF stage), IF/ID register, register file and decode (ID stage), ID/EX register, ALU and data memory access (EX stage)]
5-Stage Pipelined Architecture
• 5 stage pipelining
– like 5 cycle implementation
• IF, ID, EX, MEM, WB
[Figure: 5-stage pipeline - IF (program counter, IMEM), ID (instruction register, register file read), EX (ALU), MEM (DMEM), WB (register file write), separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers]
Forwarding
• What about an instruction sequence with dependencies?
INST1 : ADD R1, R2, R3
INST2 : SLL R4, R2, R1
INST3 : CMPEQ R3, R4, R1
• Correct results (with forwarding)
           R1  R2  R3  R4
Before 1    5   1   3   5
Before 2    4   1   3   5
Before 3    4   1   3  16
After 3     4   1   0  16
• Without forwarding (5-stage case), the results will be wrong
           R1            R2  R3           R4
Before 1    5             1   3            5
Before 2    4 (CORRECT)   1   3            5
Before 3    4             1   3           32 (WRONG)
After 3     4             1   1 (WRONG)   32
Solution for Data Dependency
• What is the problem?
– It is caused by pipelining
– Intermediate results must be forwarded to the execution stage
– Forwarding from
• The execution stage result (EX/MEM pipeline register)
• The memory access stage result (MEM/WB pipeline register)
Modification for Forwarding
• Modification
– Adding datapaths
• EX/MEM stage (ALU output) → ALU operand
• MEM/WB stage (DMEM data) → ALU operand
– Adding controls
• ALU operand A selection
• ALU operand B selection
– Adding pipelining control register
• Write register address
– MEM stage
– WB stage
• Decoded control signals
– Adding hazard detection logic
• Compare execution stage operands with MEM stage or WB stage registers
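The comparisons performed by such a hazard-detection/forwarding unit can be sketched in C as below; the pipeline-register field names mirror the description above and are otherwise assumptions.

#include <stdbool.h>

typedef enum { FWD_NONE, FWD_FROM_EXMEM, FWD_FROM_MEMWB } fwd_t;

/* Decide where ALU operand A of the instruction in EX should come from.
 * rs_ex: source register read in EX; rd_exmem / rd_memwb: destination
 * registers of the two older instructions still in the pipeline. */
fwd_t forward_a(int rs_ex,
                int rd_exmem, bool regwrite_exmem,
                int rd_memwb, bool regwrite_memwb) {
    if (regwrite_exmem && rd_exmem != 0 && rd_exmem == rs_ex)
        return FWD_FROM_EXMEM;          /* newest value wins: EX/MEM first */
    if (regwrite_memwb && rd_memwb != 0 && rd_memwb == rs_ex)
        return FWD_FROM_MEMWB;
    return FWD_NONE;                    /* read the register file normally */
}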
Pipeline Hazards
• Pipelining introduces pipeline hazards, problems that do not arise in the simple single-cycle or multi-cycle implementations
• There are three kinds of hazards caused by pipelining
– Structural hazard
– Control hazard
– Data hazard
Pipeline Hazards
• Structural hazard
– Collision of functional unit or resource conflicts
– Hardware cannot support the instructions at the same time
– e.g.1. ALU (sharing IF stage and EX stage)
– e.g.2. Memory (sharing instruction and data memory)
• Control hazard
– Unexpected change of program counter
– Due to branch instruction and other instructions
• Data hazard
– A subsequent instruction needs the result of a previous instruction
– Solution : forwarding logic
Structural Hazards
• Structural hazard
– Arises when some functional unit is not fully pipelined
– Can be solved by duplicating the functional unit
• e.g. 1 : Conflict on the register file write port
– Some instructions finish in 4 cycles (add, shift, ...)
– A load instruction finishes in 5 cycles
– A load instruction followed by an add instruction causes a structural hazard
– This is due to having only one register write port
• e.g. 2 : Conflict on memory
– In some architectures, only one memory is provided
– Reading instructions and data both need to access the memory
– A memory access instruction (load/store) causes a memory conflict
• e.g. 3 : Conflict on the ALU
– In some architectures, program counter control needs the ALU for branch instructions
– The next program counter is calculated before the EX stage
– When the ALU is shared between PC calculation and instruction execution, a branch instruction causes a pipeline stall
Control Hazards
• Control hazard
– Control hazards are generated by branch or jump instructions
– When processing non-sequential control flow
• Instructions fetched after such an instruction must be invalidated
• e.g. 1 : Hazard from a branch instruction
– If the branch address is calculated in the execution stage
• Two instructions (in the ID and IF stages) must be invalidated
– If the branch address is calculated in the decoding stage
• One instruction (in the IF stage) must be invalidated
• e.g. 2 : Hazard from a jump instruction
– As with a branch, one or two instructions must be invalidated
– However, some techniques can reduce the pipeline stall
• e.g. the jump target is taken directly from the instruction word, so it is known without an address calculation
Types of Data Dependencies
• RAR (read after read)
• WAR (write after read)
• WAW (write after write)
• RAW (read after write)
Data Hazards
• RAR (read after read) is not actually a hazard
– ADD R3, R1, R2
– SLL R4, R1, R2 // pipelining does not affect the results at all
• WAR (write after read) is called antidependence
– ADD R3, R1, R2 // reading R1 must be done before writing R1
– SLL R1, R4, R5
– The pipelined structure here reads operands in the ID stage and writes results back in the WB stage, which keeps these in order, so correct behavior is guaranteed
• WAW (write after write) is called output dependence
– ADD R3, R1, R2
– SLL R3, R4, R5 // writing R3 must commit after the add instruction
– Register write-back must be committed through a single path or in program order
• Register renaming prevents WAR and WAW hazards
– These hazards are not true (logical) dependencies
– They are caused by the reuse of register names during register allocation
RAW Hazards and Forwarding
• Read-after-write hazards are the ones actually handled by forwarding
– ADD R3, R1, R2
– SLL R4, R3, R5
– This RAW hazard can be solved with forwarding
– The hazard detection unit tells the ALU operand multiplexer to select the forwarded path
– However, some instruction sequences must still be stalled
– LD R1, 4(R2)
– ADD R3, R1, R2
• The data from the load instruction only becomes available at the MEM/WB boundary
• The subsequent add instruction must be stalled before its EX stage
[Figure: dependence chain of the example - (R1 + R2) produces R3, which feeds the shift by R5 to produce R4]
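The load-use check described above can be sketched as follows (same assumed pipeline-register naming as the forwarding sketch); it flags the one case forwarding alone cannot fix.

#include <stdbool.h>

/* A load's data is only available after MEM, so an immediately following
 * instruction that reads the loaded register must be stalled for one cycle. */
bool load_use_stall(bool idex_is_load, int idex_rd, int ifid_rs, int ifid_rt) {
    return idex_is_load && idex_rd != 0 &&
           (idex_rd == ifid_rs || idex_rd == ifid_rt);
}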
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
Memory Systems
• Memory is the processor's working space, from which it reads instructions and reads/writes data
– Fast memory is required to reach higher clock frequencies
– Fast memory is usually more expensive than slow memory
– So the trade-off between speed and cost is the main topic in designing memory systems
• How to obtain speed against cost?
– Memory hierarchy
Memory Hierarchy
• Memory hierarchy
Registers
L1 Cache
L2 Cache
Main Memory (DRAM)
Secondary Storage (DISK)
Faster, but expensive
Slower and Larger
Cache
• A key characteristic of memory accesses : locality
• Temporal locality (locality in time)
– If an instruction/data element is referenced, it is likely to be referenced again soon
– Keep the most recently referenced elements close to the processor
• Spatial locality (locality in space)
– If an instruction/data element is referenced, its neighboring elements are likely to be referenced soon
– Because of the latency of larger memories (DRAM or disk), it helps to fetch the neighboring elements at the same time as the requested element
[Figure: probability of referencing vs. address space (spatial locality), and probability of referencing the same element vs. time duration (temporal locality)]
Average Memory Access Time
• The memory system design goal is to reduce the average memory access time
• Performance factors
– The probability that the data is found in the faster memory
– The latency of getting data from the slower memory
• Cache design
– Reduces the access frequency to main memory
– Average access time = cache hit time + cache miss rate * DRAM access time
• A cache memory includes
– Tag : the address of the cache line
– Data : data elements from DRAM
– Valid bit : indicates that the tag and data are valid
– Dirty bit : indicates that the data differs from DRAM
Average Access Time = Hit time + Miss rate * Miss penalty
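As an illustrative example with assumed numbers: a 1-cycle hit time, a 5% miss rate, and a 40-cycle miss penalty give an average access time of 1 + 0.05 * 40 = 3 cycles.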
Source of Cache Misses
• Compulsory miss
– Called cold miss or first-reference miss
– Unavoidable
• Invalidation miss
– From context switching
• Capacity miss
– Due to the cache being smaller than the working set
– Reduced by making the cache larger
– e.g. large matrix multiplication
• Conflict miss
– Due to the cache access pattern
– Some cache lines are loaded and discarded repeatedly
– Can be reduced with a higher-associativity cache
Cache Organization
• Three kind of cache organization
– Direct mapped cache
– Set associative cache
– Fully associative cache
Direct Mapped Cache
• Direct mapped cache organization
– 2^N-byte direct-mapped cache
– 2^M-byte line size
– Number of lines : 2^(N-M)
• Memory address
– M-bit offset
– (N-M)-bit index
– (32-N)-bit tag
• Simplest organization
• Highest conflict miss ratio
• e.g. 8 KB direct-mapped cache with 16-byte lines
– 512 lines
– 4-bit offset
– 9-bit index
– 19-bit tag
– Tag memory size : 19 bits * 512 = 9.5 Kbit
– With the valid bit and dirty bit, the total cache storage is
• 8 KB * 8 bits/byte + 9.5 Kbit + 512 bit + 512 bit = 74.5 Kbit = 9.3 KB
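For the 8 KB, 16-byte-line example above, the following C sketch splits a 32-bit address into offset, index, and tag with exactly the field widths listed; the function name and sample address are illustrative.

#include <stdint.h>
#include <stdio.h>

/* 8 KB direct-mapped cache, 16-byte lines: 4-bit offset, 9-bit index, 19-bit tag. */
void split_address(uint32_t addr) {
    unsigned offset = addr & 0xFu;          /* bits  3:0  */
    unsigned index  = (addr >> 4) & 0x1FFu; /* bits 12:4  */
    unsigned tag    = addr >> 13;           /* bits 31:13 */
    printf("addr=0x%08x -> tag=0x%x index=%u offset=%u\n",
           (unsigned)addr, tag, index, offset);
}

int main(void) {
    split_address(0x1234ABCD);              /* arbitrary sample address */
    return 0;
}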
Direct Mapped Cache
[Figure: direct-mapped cache - the address is split into tag (31:13), index (12:4), and offset (3:0); the index selects one of 512 lines, each holding a valid bit, dirty bit, tag, and the data words of one 16-byte line]
Set Associative Cache
• Set-associative cache organization
– 2^N-byte set-associative cache
– 2^M-byte line size
– 2^O-way set associativity
– Number of lines : 2^(N-M)
• Memory address
– M-bit offset
– (N-M-O)-bit index
– (32-N+O)-bit tag
• Needs a replacement algorithm
• e.g. 8 KB 4-way set-associative cache with 16-byte lines
– 512 lines (128 sets)
– 4-bit offset
– 7-bit index
– 21-bit tag
– Tag memory size : 21 bits * 512 = 10.5 Kbit
– With the valid bit and dirty bit, the total cache storage is
• 8 KB * 8 bits/byte + 10.5 Kbit + 512 bit + 512 bit = 75.5 Kbit = 9.4 KB
– A set-associative cache needs more multiplexers
Set Associative Cache
[Figure: 4-way set-associative cache - the address is split into tag (31:11), index (10:4), and offset (3:0); the index selects one of 128 sets, the four ways of the set are read in parallel, and their tags are compared to select the hit data]
Fully Associative Cache
• Fully associative cache organization
– 2^N-byte fully associative cache
– 2^M-byte line size
– Number of lines : 2^(N-M)
• Memory address
– M-bit offset
– (32-M)-bit tag
• No conflict misses (highest hit ratio)
• But needs many more comparators
• e.g. 8 KB fully associative cache with 16-byte lines
– 512 lines
– 4-bit offset
– 28-bit tag
– Tag memory size : 28 bits * 512 = 14 Kbit
– With the valid bit and dirty bit, the total cache storage is
• 8 KB * 8 bits/byte + 14 Kbit + 512 bit + 512 bit = 79 Kbit = 9.9 KB
– A fully associative cache is usually implemented with a CAM (Content Addressable Memory), which consists of RAM cells with built-in comparators
Fully Associative Cache
[Figure: fully associative cache - the address is split into tag (31:4) and offset (3:0); the tag is matched against all lines in a CAM, and the hit line's data words are selected through a multiplexer]
Virtual Memory
• Motivation for virtual memory (VM)
– Multiple processes
• Each process has its own address space
• The address spaces overlap
• Without VM, the programmer would have to change the addresses inside the code
– Memory protection
• Each process must access only its own memory region
• Without memory protection, a process can interfere with the memory region of another process, and one problematic process can crash the whole system
Address Translation
• Virtual memory
– A memory management technique that treats main memory as a cache for secondary storage
– Separates the programmer's view of the address space from the physical view
• Address space
– V = {0, 1, ..., N-1} // virtual address space
– P = {0, 1, ..., M-1} // physical address space
– N > M
• Address translation
– MAP : V → P ∪ {Φ}
– MAP(A) = A'
• the data at virtual address A is at physical address A'
– MAP(A) = Φ
• the data at virtual address A is not in physical memory
• invalid address, or stored on disk
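A minimal single-level page-table walk illustrating MAP; the 4 KB page size, table size, and field names are assumptions for the sketch.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12                  /* assumed 4 KB pages          */
#define NUM_PAGES 1024                /* assumed virtual space size  */

typedef struct { bool valid; uint32_t ppn; } pte_t;
pte_t page_table[NUM_PAGES];

/* MAP(A) = A' when the page is resident; MAP(A) = Phi (page fault) otherwise. */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                 /* page fault: the OS must bring the page in */
    *paddr = (page_table[vpn].ppn << PAGE_BITS) | offset;
    return true;
}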
Address Translation Process
• Translation process
– The processor requests the data at address A
– The address translator translates virtual address A using the page table, TLB, MMU, etc.
– On a translation hit (the data is in main memory), the data is read from main memory
– On a translation miss, a page fault exception occurs and the OS handler brings the page into main memory from secondary storage
[Figure: the processor sends virtual address A to the address translator (MMU, TLB, page table); on a hit it produces physical address A' for main memory, on a miss (Φ) a page fault is raised and the OS fault handler brings the page in from secondary storage]
Page Table
• Page table
– Stores physical page number indexed by virtual page number
– Valid bit indicates the requested page is in main memory
– Protection bits check whether the requested access is allowed
• Read only / writable
• Kernel mode only / user mode
Translation Lookaside Buffer
• Translation Lookaside Buffer (TLB)
– The address translation process itself needs one or more additional memory accesses
• The page table is also stored in main memory
• Multi-level translation needs even more memory accesses
– The TLB reduces the translation overhead, like a cache
• Part of the page table can be translated directly from the TLB
• On a TLB miss, the TLB is refilled with the physical page number from the page table
[Figure: the virtual address is split into a virtual page number and page offset; the TLB is looked up by the virtual page number (valid bit, tag, physical page number PPN), and the physical address is formed from the PPN and the page offset]
Cache and Address Translation
• Cache with address translation
– Cache with physical address
• Address translation first, then access cache
• Long access time to cache
– Cache with virtual address
• Access cache without address translation
• Parallel checking of TLB and cache
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
Advanced Topics
• Deeper pipelines allow higher clock frequencies, but
– pipeline hazards cause performance degradation
• Modern processors exploit more parallelism than ever before
– Instruction level parallelism
– Data level parallelism
– Thread level parallelism
– Multiprocessor
• More complex architectures and additional logic blocks are required for better performance
– Branch prediction
– Exploiting parallelism
Branch Prediction
• Branch instructions affect the pipeline state because the instructions following a branch are only determined after the branch resolves
– If the branch is taken, the instructions already fetched after the branch become bubbles
– This is called a control hazard
• However, a typical program has many branches related to loops
– e.g. for (i=0; i<10; i++) { c[i] = a[i] + b[i]; }
– // 9 branches are taken, but 1 branch is not taken
• Branch prediction exploits locality, like a cache
– Branch prediction resolves a branch hazard by assuming an outcome instead of waiting for the actual result
– Some branch instructions have a tendency to be taken or not taken
• e.g. an if statement that checks a corner case
– At some moments, branch instructions are taken more often
• e.g. a loop statement, which is mostly taken
Branch Prediction Schemes
• Branch prediction scheme (ref [5])
– Program-based predictors vs profile-based predictors
– Or static schemes vs dynamic schemes
• Static schemes
– Determine the direction at compile time
– Less hardware complexity
– Schemes
• Always taken
• Taken when backward, not taken when forward
• Profile-based decision : may be inaccurate for a different program
• Dynamic schemes
– The prediction changes as the program runs
– Hardware overhead : branch history table, branch target buffer, etc.
– 1-bit prediction scheme
– 2-bit prediction scheme
– Bimodal branch prediction scheme
– Local branch prediction scheme
– Global branch prediction scheme
Dynamic Branch Prediction
• 1-bit branch prediction
– One bit stores whether the previous branch was taken or not
– A loop branch is mispredicted twice per loop execution (at the exit and again on re-entry)
• 2-bit branch prediction
– A 2-bit counter changes state according to the previous branches
– Fewer mispredictions than the 1-bit predictor
[Figure: 2-bit predictor state diagram - Predict Taken (11), Predict Taken (10), Predict Not Taken (01), Predict Not Taken (00), with Taken/Not Taken transitions between the states]
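A behavioural C sketch of the 2-bit predictor above: the two upper states predict taken, and the transitions (including the weak-state jumps to the opposite strong state) follow the example trace below; the type and function names are illustrative.

#include <stdbool.h>

/* 2-bit branch predictor FSM: 11, 10 predict taken; 01, 00 predict not taken. */
typedef enum { STRONG_NT = 0, WEAK_NT = 1, WEAK_T = 2, STRONG_T = 3 } state_t;

bool predict(state_t s) { return s == WEAK_T || s == STRONG_T; }

state_t update(state_t s, bool taken) {
    switch (s) {
    case STRONG_T:  return taken ? STRONG_T : WEAK_T;
    case WEAK_T:    return taken ? STRONG_T : STRONG_NT;
    case WEAK_NT:   return taken ? STRONG_T : STRONG_NT;
    case STRONG_NT: return taken ? WEAK_NT  : STRONG_NT;
    }
    return s;
}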
Example of Branch Prediction
• Example) Doubly nested loop
– for (i=0; i<10; i++)
–   for (j=0; j<10; j++)
–     Loop Body;
– The first two predictions miss; afterwards, there is one miss per execution of the inner loop (at its exit)
j 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 ...
Prediction NT NT T T T T T T T T T T T T T T ...
State 00 01 11 11 11 11 11 11 11 11 11 10 11 11 11 11 ...
Action T T T T T T T T T T NT T T T T T ...
Advanced Branch Prediction
• Expansion of 2-bit branch prediction
– Bimodal branch prediction
– Local history prediction
– Global history prediction
• Bimodal branch prediction
– Table indexed by some bits of branch address
• Local branch prediction
– Store temporary local history
– Consists of two tables
• Table indexed by some bits of program counter
• Table indexed by local taken history
• Global branch prediction
– Store temporary global history
– Table indexed by global taken history
Advanced Branch Prediction
[Figure: bimodal branch prediction (the PC indexes a table of counters), local branch prediction (the PC indexes a history table whose shift-register contents index a table of counters), and global branch prediction (a global history shift register GR indexes a table of counters)]
Exploiting Parallelism
• Parallelism is a key factor in performance improvement
– More instructions, threads, and processes executed simultaneously
• There are several architectural approaches to exploiting parallelism
– VLIW
– Superscalar
– Simultaneous multithreading
– Vector processor
– Multiprocessor
Very Long Instruction Word Architecture
• Very Long Instruction Word (VLIW) architecture
– One instruction word contains two or more issue slots
– A number of operations are packed into one long instruction word
• Architectural support
– Register file
• Two or more write ports (depending on the number of issue slots)
• Four or more read ports
– Multiple ALUs
• Duplicated ALUs are used for arithmetic operations
• e.g. IA-64 architecture
– Explicitly Parallel Instruction Computing (EPIC) architecture
– 128-bit instruction bundle
• Three instructions (41 bits each)
• 5-bit template
Superscalar Architecture
• Superscalar architecture
– With hardware support, more than one instruction is issued at once
– Originally sequential instructions are dynamically grouped into wider issue packets
– Register renaming
– Out-of-order execution
• e.g. Two ALU superscalar machine
ADD R5,R1,R2
SUB R6,R3,R4
SLL R7,R1,R5
AND R8,R3,R6
cycle 0 : ADD R5,R1,R2 || SUB R6,R3,R4
cycle 1 : SLL R7,R1,R5 || AND R8,R3,R6
ADD/SUB : IF ID EX MEM WB
SLL/AND :    IF ID EX MEM WB
VLIW vs Superscalar
• Comparison of VLIW and superscalar architectures
– Compilation
• A VLIW machine needs recompilation of legacy code
• A superscalar machine can run existing single-issue binaries
– Scheduling window
• A VLIW machine has an (in principle) unlimited scheduling window
– Limited by compilation time to fewer than about a hundred instructions
• A superscalar machine has a limited window
– Dependent on the hardware implementation
– Hardware complexity
• A superscalar machine detects hazards dynamically
• A VLIW machine detects hazards statically at compile time
– Software complexity
• A VLIW machine needs a new compiler
Simultaneous Multithreading
• Simultaneous Multithreading (SMT) (ref [6])
– Combines wide-issue superscalar and multithreaded processors
– Instruction issue slots are filled from multiple threads
• Instructions from a single thread are not enough to fill all issue slots
– Register renaming distinguishes the different threads
– Per-thread hardware contexts (program counter, registers), instruction retirement, branch target buffer, translation lookaside buffer, etc.
Vector Processor
• Vector Processor
– A single vector instruction implies many operations
• Fewer instruction fetches
– Each result is independent of the other results
• Multiple operations can be executed in parallel
• A vectorizing compiler checks the dependencies
– Reduces branches and their side effects
– Vector instructions access memory with a known pattern
• Effective prefetching
• Memory interleaving for higher bandwidth
for (i=0; i<64; i++)
    Y[i] = X[i] * C;
VMUL V2, V1, R1
// V1, V2 are vector registers, each holding 64 elements
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
Summary
• We covered the pipeline structure and pipeline hazards.
• It is important to understand the organization and characteristics of the various cache types (direct mapped, set associative, fully associative) and the various causes of cache misses.
• We covered the need for virtual memory, page faults, and TLB operation.
• We covered various branch prediction methods for reducing the branch penalty, which matters even more in high-performance processors.
• We covered the operating principles of VLIW, superscalar, and SMT, which execute more than one instruction per clock cycle.
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
Exercises
1. Describe the types of pipeline hazards and briefly describe how each hazard is resolved.
2. List the types of caches and explain the organization of each.
3. Explain the various causes of cache misses.
4. Describe what you know about page faults.
5. Describe how branch prediction is performed and why it is used.
Contents
• Overview of Processor Architecture
• Pipelining
• Memory Systems
• Advanced Topics
• Summary
• Exercises
• References
References
• [1] David A. Patterson and John L. Hennessy, “Computer Organization and Design,” 3rd ed., Morgan Kaufmann, 2005
• [2] John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach,” 3rd ed., Morgan Kaufmann, 2003
• [3] Neil H. E. Weste and David Harris, “CMOS VLSI Design,” Addison Wesley, 2005
• [4] Steve Furber, “ARM System-on-Chip Architecture,” 2nd ed., Addison Wesley, 2000
• [5] Chih-Cheng Cheng, “The Schemes and Performances of Dynamic Branch Predictors”
• [6] Susan J. Eggers et al., “Simultaneous Multithreading: A Platform for Next-Generation Processors”, IEEE Micro, 1997