Out-of-Order Speculative Execution
Designing a Configurable Simulator for an OOO Microprocessor
By Mustafa Imran Ali, ID# 230203
COE 501 Presentation by Mustafa Imran Ali 2
Presentation Outline
Introduction
Examples - Representative Micro-architectures
Some Issues - Limitations and Other Approaches
Simulator Details
Out-of-order Speculative Execution – Maximizing ILP
In-order Execution
  Pipelining – exploiting temporal parallelism through overlap
  Superscalar – more parallelism by allowing multiple instructions to issue
Problem – Pipeline Stalls
  Data dependencies allow limited ILP
  Large latency functions cause structural hazards
  Data loads - Cache miss stalls
Out-of-order Speculative Execution
instructions execute as soon as possible and in parallel with other nondependent work
results in faster execution because critical-path computations start and complete quickly
the processor speculatively fetches and executes instructions even though it may not know immediately whether they will be on the final execution path
Multilevel branch prediction to avoid waiting for the outcome of multiple branches
OOO Speculative Execution - Benefits
Reduced reliance on compilers
  Compilers cannot examine runtime dependencies
No need for recompilation
  Source code access not always possible
  Binary compatibility with existing code
OOO Speculative Execution - Problems and Issues
Overcoming WAW and WAR hazards – Register Renaming
More branches/cycle – accurate branch prediction
Register Renaming – Dependency checking mechanism (large comparisons)
Data forwarding from producers to consumers – use of tagging and broadcast mechanism
Exceptions – Committing instructions in program order
Compaq Alpha 21264 (1998)
OOO superscalar with speculative execution
Fetches 4 instructions/cycle
Dynamically issues up to 6 instructions/cycle: 4 integer and 2 floating point
Can speculate through up to 20 branches
64 architectural registers
41 integer + 41 floating point rename registers
Up to 80 instructions in-flight + 32 in-flight loads + 32 in-flight stores
20-entry integer queue, issues 4 instructions
15-entry floating point queue, issues 2 instructions
Can retire at most 11 instructions/cycle, can sustain a rate of 8/cycle (over short periods)
Stages in Instruction Pipeline
Provides 4 instructions/cycle
Maps virtual registers to physical registers
Dynamically selects from up to 6 instructions – issue reordering takes place
All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers
Register Renaming Process
assigns a unique storage location to each write-reference to a register
speculatively allocates a register to each instruction with a register result
register only becomes part of the user-visible (architectural) register state when the instruction retires/commits
allows an instruction to issue speculatively and deposit its result into the register file before the instruction retires
Register Renaming Process (continued)
processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any)
register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register
register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs
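The renaming steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides describe a CAM-based mapper, while the sketch swaps in a simpler table indexed by architectural register plus a free list. The class name `RenameMap` and its fields are hypothetical, and the register counts are toy values.

```python
from dataclasses import dataclass, field


@dataclass
class RenameMap:
    """Maps architectural registers to internal (physical) registers."""
    num_arch: int
    num_phys: int
    table: list = field(init=False)      # arch reg -> current internal reg
    free_list: list = field(init=False)  # unallocated internal regs

    def __post_init__(self):
        # Assumption: arch reg i starts out in internal reg i; the rest are free.
        self.table = list(range(self.num_arch))
        self.free_list = list(range(self.num_arch, self.num_phys))

    def rename(self, srcs, dest):
        """Rename one instruction; returns (phys_srcs, phys_dest, old_phys_dest)."""
        phys_srcs = [self.table[s] for s in srcs]   # look up sources first
        if dest is None:
            return phys_srcs, None, None
        if not self.free_list:
            raise RuntimeError("no free internal register: rename must stall")
        old = self.table[dest]            # stale mapping, freed when we retire
        new = self.free_list.pop(0)       # fresh location for this write
        self.table[dest] = new
        return phys_srcs, new, old


rmap = RenameMap(num_arch=4, num_phys=8)
# r1 = r2 + r3 ; r1 = r1 + r2  (the second write to r1 gets a fresh register)
first = rmap.rename(srcs=[2, 3], dest=1)
second = rmap.rename(srcs=[1, 2], dest=1)
print(first)   # ([2, 3], 4, 1)
print(second)  # ([4, 2], 5, 4)
```

Because each write gets its own internal register, the WAW hazard between the two writes to r1 disappears, and the second instruction reads the first one's result from internal register 4.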
Map (register rename) and Queue Stages
The map stage renames programmer-visible register numbers to internal register numbers
The queue stage stores instructions until they are ready to issue
structures are duplicated for integer and floating point execution
Out-of-order Issue Queues
issue queue logic maintains 2 lists of pending instructions in separate integer and floating-point queues
scoreboards maintain status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions
the scoreboard unit notifies all instructions in the queue that require the register value when functional unit or load-data results become available
Out-of-order Execution
Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle
queues are collapsible: an entry becomes immediately available once the instruction issues or is squashed due to misspeculation
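A minimal sketch of oldest-first selection from a collapsing queue, assuming the queue is a Python list kept in age order. The function name, the dictionary fields, and the functional-unit names are hypothetical and chosen only for illustration.

```python
def select_oldest_ready(queue, ready_regs, free_units, width):
    """Pick up to `width` of the oldest issuable entries and collapse the queue.

    queue      : list of dicts, oldest first, each with 'srcs' and 'unit'
    ready_regs : set of internal registers whose values are available
    free_units : dict unit name -> number of free functional units this cycle
    """
    issued, remaining = [], []
    units = dict(free_units)  # local copy so the caller's dict is untouched
    for entry in queue:       # scan in age order
        operands_ready = all(s in ready_regs for s in entry["srcs"])
        unit_free = units.get(entry["unit"], 0) > 0
        if len(issued) < width and operands_ready and unit_free:
            units[entry["unit"]] -= 1
            issued.append(entry)
        else:
            remaining.append(entry)   # stays queued, keeps its relative age
    return issued, remaining


queue = [
    {"name": "i0", "srcs": [5], "unit": "alu"},     # waits on register 5
    {"name": "i1", "srcs": [1, 2], "unit": "alu"},
    {"name": "i2", "srcs": [3], "unit": "mul"},
    {"name": "i3", "srcs": [1], "unit": "alu"},     # ready, but the ALU is taken
]
issued, queue = select_oldest_ready(queue, {1, 2, 3}, {"alu": 1, "mul": 1}, width=4)
print([e["name"] for e in issued])  # ['i1', 'i2']
print([e["name"] for e in queue])   # ['i0', 'i3']
```

Returning the remaining entries as a new list models the collapse: issued slots vanish and younger entries move up without losing their order.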
Retire Mechanism
assigns each mapped instruction a slot in a circular in-flight window (in fetch order)
tracks the internal register usage for all in-flight instructions
each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction
this (stale) register can be freed for other use after the instruction retires
Exception Handling
an exception causes all younger instructions in the in-flight window to be squashed and removed from all queues in the system
register map is backed up to the state before the last squashed instruction using the saved map state
registers allocated by the squashed instructions become immediately available
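The retire and squash behaviour described on the last two slides can be sketched as follows. This is an assumption-laden illustration: the slides describe restoring the map from saved per-instruction state, whereas the sketch swaps in an undo walk over the stored old mappings, which ends in the same map. The class `InFlightWindow` and its method names are hypothetical.

```python
from collections import deque


class InFlightWindow:
    """Tracks renamed instructions in fetch order (illustrative names only)."""

    def __init__(self, rename_table, free_list):
        self.table = rename_table      # arch reg -> internal reg
        self.free = free_list          # free internal registers
        self.window = deque()          # entries: (arch_dest, new_reg, old_reg)

    def map_instruction(self, arch_dest):
        old = self.table[arch_dest]
        new = self.free.pop(0)
        self.table[arch_dest] = new
        self.window.append((arch_dest, new, old))
        return new

    def retire_oldest(self):
        # The old (stale) register can no longer be referenced: recycle it.
        _, _, old = self.window.popleft()
        self.free.append(old)

    def squash_younger_than(self, keep):
        # Undo mappings youngest-first until only `keep` entries remain.
        while len(self.window) > keep:
            arch_dest, new, old = self.window.pop()
            self.table[arch_dest] = old   # restore the earlier mapping
            self.free.insert(0, new)      # speculative register is free again


w = InFlightWindow(rename_table=[0, 1, 2], free_list=[3, 4, 5])
w.map_instruction(1)        # r1 -> p3
w.map_instruction(1)        # r1 -> p4
w.map_instruction(2)        # r2 -> p5
w.squash_younger_than(1)    # exception right after the first instruction
print(w.table, w.free)      # [0, 3, 2] [4, 5]
w.retire_oldest()
print(w.table, w.free)      # [0, 3, 2] [4, 5, 1]
```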
HP PA-RISC 8000
ROB Size Performance Effect
AMD K-5 ROB Entry
AMD K-5 Reservation Station Entry
Approaches for Billion Transistor Architectures
Advanced superscalar processors
  scale up from current designs to issue 16 or 32 instructions per cycle
Superspeculative processors
  enhance wide-issue superscalar performance by speculating aggressively at every point in the processor pipeline
SPARC64 V9
Pentium III and 4 Register Renaming and ROB
One Billion Transistors, One Uniprocessor, One Chip?
Superspeculative Architecture
Area Issues
Large circuitry is required to feed the processor with a continuous instruction stream
Dynamic execution requires a large number of comparisons for dependency checking
The sizes of the reorder buffer and the reservation stations/rename registers increase accordingly
Limitations
Wider-issue machines have high peak-to-sustained rate ratios – Intel Pentium Pro architecture approach
Beyond issue widths of 8, the inherently limited ILP of a single thread gives diminishing returns – more architectures are switching to Simultaneous Multithreading
Alternate Approaches
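Approach                | Issue Structure | Hazard detection | Scheduling               | Comment                                 | Examples
------------------------|-----------------|------------------|--------------------------|-----------------------------------------|-------------------------------
Speculative Superscalar | Dynamic         | Hardware         | Dynamic with Speculation | OOO with Speculation                    | Pentium II/III/IV, Alpha 21264
VLIW                    | Static          | Software         | Static                   | No hazard between issue packets         | MAJC
EPIC                    | Mostly static   | Mostly software  | Mostly static            | Explicit dependences marked by compiler | Itanium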
OOO Speculative Execution Processor - Simulator Design
Tracking all the activities of the pipelined machine in each clock cycle
Issue Unit design that solves structural and data hazards
Dependency checking mechanisms
Strategy for sending data from producers to consumers
Data Structures
Instruction Queue
Execution Tracking Hardware Structure
Register File
Producer Table
Reservation Stations
The Reorder Buffer
Functional Units State Structure
Service Functions
Issue
Dispatch
Completion
CDB Snooping
Retirement and Writeback
Overall Structure
Producer Table
Each register is extended by a tag and a valid flag
Valid=true iff the register contains appropriate data
Otherwise, the tag points to the instruction producing the data
Reservation Stations
Full bit is set if entry occupied
Tag points to ROB tag of the instruction
op1 and op2 hold the source references
The Reorder Buffer
Realized as a FIFO with ROBhead and ROBtail
New instructions are placed at ROBtail, and the instruction's RS entry is tagged with this ROB index.
Each cycle, the entry at ROBhead is checked (via its valid bit) for instruction completion
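A minimal sketch of the reorder buffer as a circular FIFO, assuming a fixed-size Python list with `head` and `tail` indices that wrap around. The class and method names are hypothetical; the slide only specifies ROBhead, ROBtail and the valid bit.

```python
class ReorderBuffer:
    """Circular FIFO; `head` is the oldest entry, `tail` the next free slot."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.head = 0
        self.tail = 0
        self.count = 0

    def is_full(self):
        return self.count == self.size

    def allocate(self, dest):
        """Append an instruction at the tail and return its tag (the slot index)."""
        if self.is_full():
            raise RuntimeError("ROB full: issue must stall")
        tag = self.tail
        self.entries[tag] = {"dest": dest, "valid": False, "data": None}
        self.tail = (self.tail + 1) % self.size   # wrap around
        self.count += 1
        return tag

    def complete(self, tag, data):
        self.entries[tag].update(valid=True, data=data)

    def retire(self):
        """Return the head entry if it has completed, else None (in-order retire)."""
        if self.count == 0 or not self.entries[self.head]["valid"]:
            return None
        entry = self.entries[self.head]
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return entry


rob = ReorderBuffer(size=4)
t0 = rob.allocate(dest=1)
t1 = rob.allocate(dest=2)
rob.complete(t1, data=99)     # the younger instruction finishes first
print(rob.retire())           # None
rob.complete(t0, data=7)
print(rob.retire()["data"], rob.retire()["data"])  # 7 99
```

Even though the younger instruction completes first, nothing leaves the buffer until the head entry is valid, which is what keeps commits in program order.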
Issue Protocol
if (there is a free RS and a free ROB entry) {
  RS.full := 1;
  RS.tag := ROBtail;
  for all operands x of Ii with address r
    if Rr.valid = 1
      RS.opx := Rr;
    else if CDB.tag = Rr.tag and CDB.valid
      RS.opx := CDB;
    else
      RS.opx := ROB[Rr.tag];
  if (Ii has a destination register r) {
    Rr.tag := ROBtail;
    Rr.valid := 0;
    ROB[ROBtail].dest := r;
  } else
    ROB[ROBtail].dest := none;
  ROBtail := ROBtail + 1;
}
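The issue protocol can be sketched in Python as below. The dictionary layouts, the `None` marker for a free ROB slot, the explicit reset of the new ROB entry's valid bit, and the modulo wrap of the tail are assumptions added for the sketch; the slide itself only gives the assignments.

```python
def issue(instr, regs, stations, rob, rob_tail, cdb):
    """One issue step following the slide's protocol; returns the new ROB tail.

    instr    : {"srcs": [reg indices], "dest": reg index or None}
    regs     : list of {"valid", "tag", "data"}  (register file + producer table)
    stations : list of {"full", "tag", "ops"}
    rob      : list of {"valid", "data", "dest"}, or None for a free slot
    cdb      : {"valid", "tag", "data"}
    """
    rs = next((s for s in stations if not s["full"]), None)
    if rs is None or rob[rob_tail] is not None:
        return rob_tail                      # stall: no free RS or ROB entry
    rs["full"], rs["tag"], rs["ops"] = True, rob_tail, []
    for r in instr["srcs"]:
        reg = regs[r]
        if reg["valid"]:                     # value is in the register file
            op = {"valid": True, "data": reg["data"], "tag": None}
        elif cdb["valid"] and cdb["tag"] == reg["tag"]:   # on the bus right now
            op = {"valid": True, "data": cdb["data"], "tag": None}
        else:                                # copy the ROB entry (may be pending)
            e = rob[reg["tag"]]
            op = {"valid": e["valid"], "data": e["data"], "tag": reg["tag"]}
        rs["ops"].append(op)
    rob[rob_tail] = {"valid": False, "data": None, "dest": instr["dest"]}
    if instr["dest"] is not None:            # this instruction is the new producer
        regs[instr["dest"]].update(valid=False, tag=rob_tail)
    return (rob_tail + 1) % len(rob)         # assumption: circular ROB


regs = [{"valid": True, "tag": None, "data": v} for v in (10, 20, 30)]
stations = [{"full": False, "tag": None, "ops": []} for _ in range(2)]
rob = [None] * 4
cdb = {"valid": False, "tag": None, "data": None}
tail = issue({"srcs": [0, 1], "dest": 2}, regs, stations, rob, 0, cdb)  # r2 = r0 op r1
tail = issue({"srcs": [2], "dest": 0}, regs, stations, rob, tail, cdb)  # r0 = f(r2)
print(tail)                     # 2
print(stations[1]["ops"][0])    # {'valid': False, 'data': None, 'tag': 0}
```

The second instruction's operand is not yet valid and carries tag 0, so it will wait for ROB entry 0 to be broadcast.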
Dispatch Protocol
if there is a RS with RS.opx.valid=1 for all operands x and the function unit is not stalled {
  Pass instruction, operands, and tag to FU;
  RS.full := 0;
}
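A minimal sketch of the dispatch step, assuming a single functional unit and reservation-station entries stored as dictionaries. The function name and the choice to dispatch the first ready entry found are illustrative assumptions, not something the slide prescribes.

```python
def dispatch(stations, fu_stalled):
    """Send at most one fully ready RS entry to the functional unit.

    Returns the (tag, operand values) handed to the FU, or None if nothing moved.
    """
    if fu_stalled:
        return None
    for rs in stations:
        if rs["full"] and all(op["valid"] for op in rs["ops"]):
            rs["full"] = False                       # entry is free for reuse
            return rs["tag"], [op["data"] for op in rs["ops"]]
    return None


stations = [
    {"full": True, "tag": 0, "ops": [{"valid": True, "data": 1},
                                     {"valid": False, "data": None}]},
    {"full": True, "tag": 1, "ops": [{"valid": True, "data": 2},
                                     {"valid": True, "data": 3}]},
]
print(dispatch(stations, fu_stalled=False))  # (1, [2, 3])
print(dispatch(stations, fu_stalled=False))  # None
```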
Completion Protocol
if FU has result and got CDB acknowledge {
  CDB.valid := 1;
  CDB.data := result from FU;
  CDB.tag := tag from FU;
  ROB[CDB.tag].valid := 1;
  ROB[CDB.tag].data := CDB.data;
}
CDB Snooping
For all operands x:
  if RS.full=1 and RS.opx.valid=0 and RS.opx.tag=CDB.tag {
    RS.opx := CDB;
  }
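Completion and CDB snooping together form the producer-to-consumer forwarding path. The sketch below assumes a single CDB and dictionary-based entries; the function names `complete` and `snoop` and the returned capture count are hypothetical additions for illustration.

```python
def complete(result, tag, rob, cdb):
    """The functional unit drives the CDB and marks its ROB entry complete."""
    cdb.update(valid=True, tag=tag, data=result)
    rob[tag].update(valid=True, data=result)


def snoop(stations, cdb):
    """Every full RS entry compares the tags of its pending operands with the CDB."""
    if not cdb["valid"]:
        return 0
    captured = 0
    for rs in stations:
        if not rs["full"]:
            continue
        for op in rs["ops"]:
            if not op["valid"] and op["tag"] == cdb["tag"]:
                op.update(valid=True, data=cdb["data"])   # capture the broadcast
                captured += 1
    return captured


rob = {3: {"valid": False, "data": None}}
cdb = {"valid": False, "tag": None, "data": None}
stations = [
    {"full": True, "ops": [{"valid": False, "tag": 3, "data": None},
                           {"valid": True, "tag": None, "data": 5}]},
    {"full": True, "ops": [{"valid": False, "tag": 7, "data": None}]},
    {"full": False, "ops": [{"valid": False, "tag": 3, "data": None}]},  # empty slot
]
complete(result=42, tag=3, rob=rob, cdb=cdb)
print(snoop(stations, cdb))             # 1
print(stations[0]["ops"][0]["data"])    # 42
```

The tag comparison in every occupied entry is the source of the large comparator count that the earlier slides list as an area cost.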
Retirement/Writeback Protocol
if ROB not empty and ROB[ROBhead].valid=1 {
  if instruction in ROB[ROBhead] requires writeback {
    x := ROB[ROBhead].dest;
    Rx.data := ROB[ROBhead].data;
    if ROBhead = Rx.tag
      Rx.valid := 1;
  }
  ROBhead := ROBhead + 1;
}
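A sketch of the retirement step, assuming the ROB is a Python list used circularly and the occupancy is tracked by an explicit `count` (both assumptions; the slide only names ROBhead). It shows why the tag comparison matters: a register is marked valid only if no younger instruction has since become its producer.

```python
def retire(rob, head, count, regs):
    """Retire the head entry if complete; returns the new (head, count)."""
    if count == 0 or not rob[head]["valid"]:
        return head, count                 # nothing to retire this cycle
    entry = rob[head]
    x = entry["dest"]
    if x is not None:                      # instruction requires writeback
        regs[x]["data"] = entry["data"]
        if regs[x]["tag"] == head:         # no younger producer has claimed Rx
            regs[x]["valid"] = True
    return (head + 1) % len(rob), count - 1


regs = [{"valid": False, "tag": 1, "data": 0}]      # R0 waits on ROB entry 1
rob = [
    {"valid": True, "dest": 0, "data": 11},          # older write to R0
    {"valid": True, "dest": 0, "data": 22},          # younger write to R0
]
head, count = retire(rob, 0, 2, regs)
print(regs[0])      # {'valid': False, 'tag': 1, 'data': 11}
head, count = retire(rob, head, count, regs)
print(regs[0])      # {'valid': True, 'tag': 1, 'data': 22}
```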
Configurable Parameters
Probability of memory misses
Probability of correct branch prediction
Branch mis-prediction penalty
Cache miss penalty
Window Size for instruction issue
Number of Issues/cycle
Number of Functional Units (FUs)
Pipeline Depth/Latency of each FU
Number of CDBs
Size of reservation stations/rename registers (RS)
Operand matching mechanism in each RS
Size of re-order buffer
Branch Prediction Mechanisms (optional)
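One way to carry these parameters through a simulator is a single configuration record. The sketch below is hypothetical: the field names, default values, and validation rules are illustrative assumptions and do not come from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class SimConfig:
    """Hypothetical parameter set; field names and defaults are illustrative."""
    mem_miss_prob: float = 0.05          # probability of a memory (cache) miss
    branch_correct_prob: float = 0.90    # probability of a correct prediction
    branch_mispredict_penalty: int = 7   # cycles
    cache_miss_penalty: int = 20         # cycles
    window_size: int = 16                # instructions considered for issue
    issue_width: int = 4                 # issues per cycle
    fu_latency: dict = field(default_factory=lambda: {"alu": 1, "mul": 3, "mem": 2})
    fu_count: dict = field(default_factory=lambda: {"alu": 2, "mul": 1, "mem": 1})
    num_cdbs: int = 2
    rs_entries: int = 8
    rob_entries: int = 32

    def validate(self):
        for name in ("mem_miss_prob", "branch_correct_prob"):
            p = getattr(self, name)
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"{name} must be a probability, got {p}")
        if set(self.fu_latency) != set(self.fu_count):
            raise ValueError("fu_latency and fu_count must list the same units")
        if self.issue_width > self.window_size:
            raise ValueError("issue_width cannot exceed window_size")
        return self


cfg = SimConfig(issue_width=6, rob_entries=80).validate()
print(cfg.issue_width, cfg.rob_entries, cfg.fu_count["alu"])  # 6 80 2
```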
Performance Metrics
Number of Clock cycles on an instruction trace
Number of Stalls (Various Types)
Effect on Hardware costs
Peak vs. Sustained Rates (actual issues vs. maximum possible)
Percentage Resource Utilization
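Several of these metrics can be derived from per-cycle counters. The sketch below assumes the simulator records the number of issues in each cycle, per-type stall counts, and busy cycles per functional-unit type; the function name, argument names, and input numbers are made up for illustration.

```python
def summarize(issued_per_cycle, issue_width, stall_counts, fu_busy_cycles, fu_count):
    """Derive headline metrics from per-cycle counters collected during a run."""
    cycles = len(issued_per_cycle)
    instructions = sum(issued_per_cycle)
    sustained = instructions / cycles                 # achieved issues per cycle
    return {
        "cycles": cycles,
        "sustained_issue_rate": sustained,
        # Peak is the configured issue width; guard against an idle trace.
        "peak_to_sustained": issue_width / sustained if sustained else float("inf"),
        "stalls": dict(stall_counts),
        "fu_utilization_pct": {
            unit: 100.0 * busy / (cycles * fu_count[unit])
            for unit, busy in fu_busy_cycles.items()
        },
    }


report = summarize(
    issued_per_cycle=[4, 2, 0, 2],          # made-up trace of issues per cycle
    issue_width=4,
    stall_counts={"rob_full": 1, "no_free_rs": 0},
    fu_busy_cycles={"alu": 6, "mul": 2},
    fu_count={"alu": 2, "mul": 1},
)
print(report["sustained_issue_rate"])   # 2.0
print(report["peak_to_sustained"])      # 2.0
print(report["fu_utilization_pct"])     # {'alu': 75.0, 'mul': 50.0}
```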
OOO Speculative Micro-architecture Simulators
SimpleScalar
  University of Wisconsin in Madison
  www.simplescalar.com
KScalar
  Universidad Autónoma de Barcelona
  www.caos.uab.es/kscalar
SimpleScalar v3.0
tool set includes sample simulators ranging from a fast functional simulator to a detailed, dynamically scheduled processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction
includes performance visualization tools, statistical analysis resources, and debug and verification infrastructure
includes a machine definition infrastructure that permits most architectural details to be separated from simulator implementations
KScalar
allows analyzing the performance behavior of a wide range of processor microarchitectures: from a very simple in-order, scalar pipeline, to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction
The simulator interprets executables for the Alpha AXP instruction set: from very short program fragments to large applications
The target program's execution may be simulated in varying levels of detail: either cycle-by-cycle, observing all the pipeline events that determine processor performance, or millions of cycles at once, collecting statistics on the main performance issues
Study Direction
Modeling and comparison of representative micro-architectures
  Parameters modeling a commercial micro-architecture's OOO speculative execution core
  SPEC benchmark instruction traces
  Analysis of relative importance of supporting assumptions
Study Direction (continued)
Modeling Resource Utilization of Simultaneous Multithreaded Workload
  Comparison of resource utilization and performance metrics of single-thread vs. SMT execution
  Use of instruction traces that model multi-thread workload (e.g. modeling Hyperthreading in Pentium 4)