Out-of-Order Speculative Execution Designing a Configurable Simulator for an OOO Microprocessor By Mustafa Imran Ali ID# 230203


Out-of-Order Speculative Execution

Designing a Configurable Simulator for an OOO Microprocessor

By Mustafa Imran Ali ID# 230203


COE 501 Presentation by Mustafa Imran Ali 2

Presentation Outline

Introduction
Examples - Representative Micro-architectures
Some Issues - Limitations and Other Approaches
Simulator Details


Out-of-order Speculative Execution – Maximizing ILP

In-order Execution
  Pipelining – exploiting temporal parallelism through overlap
  Superscalar – more parallelism by allowing multiple instructions to issue
Problem – Pipeline Stalls
  Data dependencies allow limited ILP
  Large latency functions cause structural hazards
  Data loads - Cache miss stalls


Out-of-order Speculative Execution

Instructions execute as soon as possible and in parallel with other nondependent work
Results in faster execution because critical-path computations start and complete quickly
The processor speculatively fetches and executes instructions even though it may not know immediately whether the instructions will be on the final execution path
Multilevel branch prediction avoids waiting for the outcome of multiple branches


OOO Speculative Execution - Benefits

Reduced reliance on compilers
  Compilers cannot examine runtime dependencies
No need for recompilation
  Source code access not always possible
  Binary compatibility with existing code


OOO Speculative Execution - Problems and Issues

Overcoming WAW and WAR hazards – Register Renaming
More branches/cycle – accurate branch prediction
Register Renaming – Dependency checking mechanism (large comparisons)
Data forwarding from producers to consumers – use of tagging and broadcast mechanism
Exceptions – Committing instructions in program order


Compaq Alpha 21264 (1998)

OOO superscalar with speculative execution
Fetches 4 instructions/cycle
Dynamically issues up to 6 instructions/cycle: 4 integer and 2 floating point
Can speculate through up to 20 branches
64 architectural registers
41 integer + 41 floating point rename registers
Up to 80 instructions in-flight + 32 in-flight loads + 32 in-flight stores
20-entry integer queue: issues 4 instructions
15-entry floating point queue: issues 2 instructions
Can retire at most 11 instructions/cycle, can sustain a rate of 8/cycle (over short periods)


Stages in Instruction Pipeline

Provides 4 instructions/cycle

Maps virtual registers to physical registers

Dynamically selects from up to 6 instructions – Issue reordering takes place

All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers


Register Renaming Process

assigns a unique storage location with each write-reference to a register

speculatively allocates a register to each instruction with a register result

register only becomes part of the user-visible (architectural) register state when the instruction retires/commits

allows instruction to speculatively issue and deposit its result into the register file before the instruction retires


Register Renaming Process (continued)

processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any)

register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register

register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs
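The renaming steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Alpha 21264 design: it uses a table indexed by architectural register instead of a CAM, and the names `RenameMap`, `rename`, `num_arch` and `num_phys` are hypothetical.

```python
from collections import deque


class RenameMap:
    """Hypothetical model: architectural -> physical register mapping."""

    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register i starts in physical register i.
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self, srcs, dest):
        """Return (physical sources, new physical dest, stale physical dest)."""
        # Sources are looked up before the destination is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        if dest is None:
            return phys_srcs, None, None
        if not self.free:
            raise RuntimeError("no free physical register: map stage must stall")
        stale = self.map[dest]      # old mapping, freed when the instruction retires
        new = self.free.popleft()   # speculative allocation for this write
        self.map[dest] = new
        return phys_srcs, new, stale


rm = RenameMap(num_arch=4, num_phys=8)
first = rm.rename(srcs=[2, 3], dest=1)    # r1 = r2 op r3
second = rm.rename(srcs=[1, 2], dest=1)   # r1 = r1 op r2
print(first)    # ([2, 3], 4, 1)
print(second)   # ([4, 2], 5, 4)
```

The second write to r1 receives a fresh physical register, so it no longer conflicts with the first write (WAW), and it reads the first result through physical register 4.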


Map (register rename) and Queue Stages

The map stage renames programmer-visible register numbers to internal register numbers

The queue stage stores instructions until they are ready to issue

structures are duplicated for integer and floating point execution


Out-of-order Issue Queues

issue queue logic maintains 2 lists of pending instructions in separate integer and floating-point queues

scoreboards maintain status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions

the scoreboard unit notifies all instructions in the queue that require the register value when functional unit or load-data results become available


Out-of-order Execution

Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle

queues are collapsible: an entry becomes immediately available once the instruction issues or is squashed due to misspeculation
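A rough sketch of oldest-first selection from a collapsible queue follows. The function `select` and the dictionary fields `id`, `srcs` and `unit` are assumptions made for illustration; real arbiters do this in parallel hardware, not in a loop.

```python
def select(queue, ready_regs, free_units, width):
    """Pick up to `width` oldest instructions whose operands and unit are ready.

    queue: list of dicts in age order (oldest first) with keys
           'id', 'srcs' (internal register numbers) and 'unit'.
    """
    issued = []
    units = dict(free_units)  # copy so the caller's counts are not mutated
    for instr in queue:
        if len(issued) == width:
            break
        operands_ready = all(s in ready_regs for s in instr["srcs"])
        if operands_ready and units.get(instr["unit"], 0) > 0:
            units[instr["unit"]] -= 1
            issued.append(instr)
    # Collapse: issued entries free their slots, the rest keep their age order.
    issued_ids = {i["id"] for i in issued}
    remaining = [i for i in queue if i["id"] not in issued_ids]
    return issued, remaining


queue = [
    {"id": 0, "srcs": [5], "unit": "alu"},     # operand 5 not ready yet
    {"id": 1, "srcs": [1, 2], "unit": "alu"},
    {"id": 2, "srcs": [3], "unit": "alu"},     # ready, but no ALU left
    {"id": 3, "srcs": [], "unit": "mul"},
]
issued, remaining = select(queue, ready_regs={1, 2, 3},
                           free_units={"alu": 1, "mul": 1}, width=4)
print([i["id"] for i in issued])     # [1, 3]
print([i["id"] for i in remaining])  # [0, 2]
```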


Retire Mechanism

assigns each mapped instruction a slot in a circular in-flight window (in fetch order)

tracks the internal register usage for all in-flight instructions

each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction

this (stale) register can be freed for other use after the instruction retires


Exception Handling

an exception causes all younger instructions in the in-flight window to be squashed and removed from all queues in the system

register map is backed up to the state before the last squashed instruction using the saved map state

registers allocated by the squashed instructions become immediately available
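The retire and squash behaviour described on the last two slides can be modelled as below. This is a simplified sketch under stated assumptions: every instruction writes one register, a full copy of the map is saved per instruction, and the class `InFlightWindow` with its methods is hypothetical.

```python
from collections import deque


class InFlightWindow:
    """Hypothetical in-flight window with per-instruction map snapshots."""

    def __init__(self, num_arch, num_phys):
        self.map = list(range(num_arch))              # architectural -> internal
        self.free = deque(range(num_arch, num_phys))  # free internal registers
        self.window = deque()                         # oldest entry on the left
        self.next_seq = 0

    def dispatch(self, dest):
        entry = {"seq": self.next_seq,
                 "saved_map": list(self.map),   # map state before this instruction
                 "new": self.free.popleft(),
                 "stale": self.map[dest]}       # old holder of `dest`
        self.map[dest] = entry["new"]
        self.window.append(entry)
        self.next_seq += 1
        return entry["seq"]

    def retire(self):
        entry = self.window.popleft()
        self.free.append(entry["stale"])        # stale register can be reused
        return entry["seq"]

    def squash_from(self, seq):
        """Squash instruction `seq` and everything younger than it."""
        while self.window and self.window[-1]["seq"] >= seq:
            entry = self.window.pop()
            self.free.append(entry["new"])      # allocation is undone
            self.map = entry["saved_map"]       # back up the register map


w = InFlightWindow(num_arch=4, num_phys=8)
a = w.dispatch(dest=1)
b = w.dispatch(dest=2)
c = w.dispatch(dest=1)
w.squash_from(b)   # b and c are squashed, a survives
w.retire()         # a retires and frees the stale copy of r1
print(w.map)         # [0, 4, 2, 3]
print(list(w.free))  # [7, 6, 5, 1]
```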


HP PA-RISC 8000


ROB Size Performance Effect


AMD K-5 ROB Entry


AMD K-5 Reservation Station Entry


Approaches for Billion Transistor Architectures

Advanced superscalar processors
  scale up from current designs to issue 16 or 32 instructions per cycle
Superspeculative processors
  enhance wide-issue superscalar performance by speculating aggressively at every point in the processor pipeline


SPARC64 V9


Pentium III and 4 Register Renaming and ROB


One Billion Transistors, One Uniprocessor, One Chip?


Superspeculative Architecture


Area Issues

Large circuitry is required to feed the processor with a continuous instruction stream

Dynamic execution requires a large number of comparisons for dependency checking

The sizes of the reorder buffer and the reservation stations/rename registers increase accordingly


Limitations

Larger-issue machines have high peak-to-sustained rate ratios – Intel Pentium Pro architecture approach

Beyond issue widths of 8, the inherently limited ILP of a single thread gives diminishing returns – more architectures are switching to Simultaneous Multithreading


Alternate Approaches

Approach                | Issue Structure | Hazard detection | Scheduling               | Comment                                 | Examples
Speculative Superscalar | Dynamic         | Hardware         | Dynamic with Speculation | OOO with Speculation                    | Pentium II/III/IV, Alpha 21264
VLIW                    | Static          | Software         | Static                   | No hazard between issue packets         | MAJC
EPIC                    | Mostly static   | Mostly software  | Mostly static            | Explicit dependences marked by compiler | Itanium


OOO Speculative Execution Processor - Simulator Design

Tracking all the activities of the pipelined machine in each clock cycle

Issue Unit design that solves structural and data hazards

Dependency checking mechanisms

Strategy for sending data from producers to consumers


Data Structures

Instruction Queue
Execution Tracking Hardware Structure
Register File
Producer Table
Reservation Stations
The Reorder Buffer
Functional Units State Structure


Service Functions

Issue
Dispatch
Completion
CDB Snooping
Retirement and Writeback


Overall Structure


Producer Table

Each register is extended by a tag and a valid flag
Valid=true iff the register contains appropriate data
Otherwise, the tag points to the instruction producing the data


Reservation Stations

Full bit is set if the entry is occupied
Tag points to the ROB tag of the instruction
op1 and op2 hold the source references


The Reorder Buffer

Realized as a FIFO with ROBhead and ROBtail

New instructions are put at ROBtail, and the instruction's reservation station is tagged with this ROB index.

Each cycle the ROBhead valid entry is checked for instruction completion


Issue Protocol

if (there is a free RS and a free ROB entry) {
    RS.full := 1;
    RS.tag := ROBtail;
    for all operands x of Ii with address r
        if Rr.valid = 1
            RS.opx := Rr;
        else if CDB.tag = Rr.tag and CDB.valid
            RS.opx := CDB;
        else
            RS.opx := ROB[Rr.tag];
    if (Ii has a destination register r) {
        Rr.tag := ROBtail;
        Rr.valid := 0;
        ROB[ROBtail].dest := r;
    } else
        ROB[ROBtail].dest := none;
    ROBtail := ROBtail + 1;
}
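One possible Python rendering of the issue protocol is sketched here. The state layout (`make_state`, dictionaries for registers, reservation stations, ROB entries and the CDB) and the `count` field used to detect a full ROB are assumptions for illustration, not part of the slides.

```python
def make_state(num_regs, num_rs, rob_size):
    """Hypothetical simulator state mirroring the slide's data structures."""
    return {
        "regs": [{"valid": True, "data": 0, "tag": None} for _ in range(num_regs)],
        "rs": [{"full": False, "tag": None, "ops": []} for _ in range(num_rs)],
        "rob": [{"valid": False, "data": None, "dest": None} for _ in range(rob_size)],
        "head": 0, "tail": 0, "count": 0,
        "cdb": {"valid": False, "tag": None, "data": None},
    }


def issue(state, srcs, dest):
    """Try to issue one instruction; return its ROB tag, or None on a stall."""
    rob, cdb = state["rob"], state["cdb"]
    free_rs = next((rs for rs in state["rs"] if not rs["full"]), None)
    if free_rs is None or state["count"] == len(rob):
        return None                                   # structural stall
    tag = state["tail"]
    ops = []
    for r in srcs:                                    # operands are read first
        reg = state["regs"][r]
        if reg["valid"]:                              # value is in the register file
            ops.append({"valid": True, "data": reg["data"], "tag": None})
        elif cdb["valid"] and cdb["tag"] == reg["tag"]:
            ops.append({"valid": True, "data": cdb["data"], "tag": None})
        else:                                         # take whatever the ROB holds
            entry = rob[reg["tag"]]
            ops.append({"valid": entry["valid"], "data": entry["data"],
                        "tag": reg["tag"]})
    free_rs.update(full=True, tag=tag, ops=ops)
    rob[tag] = {"valid": False, "data": None, "dest": dest}
    if dest is not None:                              # this instruction is now the producer
        state["regs"][dest].update(valid=False, tag=tag)
    state["tail"] = (tag + 1) % len(rob)              # ROB is a circular FIFO
    state["count"] += 1
    return tag


s = make_state(num_regs=4, num_rs=2, rob_size=4)
s["regs"][2]["data"], s["regs"][3]["data"] = 7, 5
t0 = issue(s, srcs=[2, 3], dest=1)   # both operands valid
t1 = issue(s, srcs=[1, 2], dest=1)   # first operand waits on ROB tag 0
t2 = issue(s, srcs=[], dest=None)    # no free reservation station
print(t0, t1, t2)  # 0 1 None
```

Reading the operands before updating the destination's tag matters for instructions that read and write the same register, as in the second issue above.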


Dispatch Protocol

if there is a RS with RS.opx.valid = 1 for all operands x
   and the function unit is not stalled {
    Pass instruction, operands, and tag to FU;
    RS.full := 0;
}


Completion Protocol

if FU has result and got CDB acknowledge {
    CDB.valid := 1;
    CDB.data := result from FU;
    CDB.tag := tag from FU;
    ROB[CDB.tag].valid := 1;
    ROB[CDB.tag].data := CDB.data;
}


CDB Snooping

For all operands x:
    if RS.full = 1 and RS.opx.valid = 0 and RS.opx.tag = CDB.tag {
        RS.opx := CDB;
    }
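Completion and CDB snooping together can be sketched as follows. The functions `complete` and `snoop` and the dictionary layout are hypothetical, and a single CDB with an always-granted acknowledge is assumed.

```python
def complete(rob, cdb, fu_tag, fu_result):
    """Put a functional-unit result on the CDB and record it in the ROB."""
    cdb.update(valid=True, tag=fu_tag, data=fu_result)
    rob[fu_tag]["valid"] = True
    rob[fu_tag]["data"] = fu_result


def snoop(stations, cdb):
    """Every waiting operand compares its tag against the tag on the CDB."""
    if not cdb["valid"]:
        return
    for rs in stations:
        if not rs["full"]:
            continue
        for op in rs["ops"]:
            if not op["valid"] and op["tag"] == cdb["tag"]:
                op.update(valid=True, data=cdb["data"])   # capture the broadcast


rob = [{"valid": False, "data": None, "dest": 1} for _ in range(2)]
cdb = {"valid": False, "tag": None, "data": None}
stations = [
    {"full": True, "tag": 1, "ops": [
        {"valid": False, "data": None, "tag": 0},   # waiting on ROB entry 0
        {"valid": True, "data": 7, "tag": None},
    ]},
    {"full": False, "tag": None, "ops": []},
]
complete(rob, cdb, fu_tag=0, fu_result=12)
snoop(stations, cdb)
print(stations[0]["ops"][0])  # {'valid': True, 'data': 12, 'tag': 0}
```

After the snoop, both operands of the waiting reservation station are valid, so it becomes eligible for dispatch.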


Retirement/Writeback Protocol

if ROB not empty and ROB[ROBhead].valid = 1 {
    if instruction in ROB[ROBhead] requires writeback {
        x := ROB[ROBhead].dest;
        Rx.data := ROB[ROBhead].data;
        if ROBhead = Rx.tag
            Rx.valid := 1;
    }
    ROBhead := ROBhead + 1;
}
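A sketch of in-order retirement in Python is given below. The `retire` function and the state dictionary (with a `count` field for ROB occupancy and wrap-around of the head pointer) are assumptions; the slide's protocol only states the core steps.

```python
def retire(state):
    """Retire the instruction at the ROB head if it has completed."""
    rob = state["rob"]
    head = state["head"]
    if state["count"] == 0 or not rob[head]["valid"]:
        return False                      # nothing to retire this cycle
    entry = rob[head]
    if entry["dest"] is not None:         # instruction requires writeback
        reg = state["regs"][entry["dest"]]
        reg["data"] = entry["data"]
        # Mark the register valid only if no younger instruction renamed it.
        if reg["tag"] == head:
            reg["valid"] = True
    entry["valid"] = False                # slot can be reused
    state["head"] = (head + 1) % len(rob)
    state["count"] -= 1
    return True


state = {
    "regs": [{"valid": True, "data": 0, "tag": None},
             {"valid": False, "data": 0, "tag": 1}],   # r1 last renamed by ROB entry 1
    "rob": [{"valid": True, "data": 12, "dest": 1},    # older write to r1, completed
            {"valid": False, "data": None, "dest": 1}, # younger write, still executing
            {"valid": False, "data": None, "dest": None},
            {"valid": False, "data": None, "dest": None}],
    "head": 0, "count": 2,
}
first = retire(state)     # older write retires, r1 stays invalid
second = retire(state)    # younger write has not completed yet
state["rob"][1].update(valid=True, data=19)
third = retire(state)     # now r1 becomes valid with the newest value
print(first, second, third, state["regs"][1])
```

The tag comparison prevents an older result from marking a register valid while a younger producer of the same register is still in flight.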


Configurable Parameters

Probability of memory misses
Probability of correct branch prediction
Branch mis-prediction penalty
Cache miss penalty
Window size for instruction issue
Number of issues/cycle
Number of Functional Units (FUs)
Pipeline depth/latency of each FU
Number of CDBs
Size of reservation stations/rename registers (RS)
Operand matching mechanism in each RS
Size of re-order buffer
Branch prediction mechanisms (optional)
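The parameters listed above could be gathered in a single configuration object, for instance as sketched here. The class `SimConfig`, its field names, and every default value are placeholders chosen for illustration; none of them come from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class SimConfig:
    """Hypothetical simulator configuration; all defaults are placeholders."""
    miss_probability: float = 0.05
    branch_prediction_accuracy: float = 0.90
    branch_mispredict_penalty: int = 5      # cycles
    cache_miss_penalty: int = 20            # cycles
    window_size: int = 16
    issue_width: int = 4
    fu_count: dict = field(default_factory=lambda: {"alu": 2, "mul": 1, "load": 1})
    fu_latency: dict = field(default_factory=lambda: {"alu": 1, "mul": 3, "load": 2})
    num_cdbs: int = 2
    num_reservation_stations: int = 8
    rob_size: int = 32

    def validate(self):
        """Raise ValueError if the configuration is inconsistent."""
        for name in ("miss_probability", "branch_prediction_accuracy"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be a probability in [0, 1]")
        for name in ("window_size", "issue_width", "num_cdbs",
                     "num_reservation_stations", "rob_size"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
        if set(self.fu_count) != set(self.fu_latency):
            raise ValueError("fu_count and fu_latency must name the same units")


cfg = SimConfig(issue_width=6, rob_size=80)
cfg.validate()
print(cfg.issue_width, cfg.rob_size, cfg.fu_latency["mul"])  # 6 80 3
```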


Performance Metrics

Number of Clock cycles on an instruction trace
Number of Stalls (Various Types)
Effect on Hardware costs
Peak vs. Sustained Rates (actual issues vs. maximum possible)
Percentage Resource Utilization
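Several of these metrics reduce to simple ratios over counters that a simulator would collect. The sketch below assumes such counters exist; the function `summarize` and its argument names are hypothetical, and the numbers in the example are made up.

```python
def summarize(cycles, issued, issue_width, stall_cycles, fu_busy_cycles):
    """Derive rate and utilization metrics from raw simulation counters."""
    sustained = issued / cycles                       # actual issues per cycle
    return {
        "cycles": cycles,
        "sustained_rate": sustained,
        "sustained_to_peak": sustained / issue_width, # peak = issue_width per cycle
        "stall_fraction": {k: v / cycles for k, v in stall_cycles.items()},
        "fu_utilization": {k: v / cycles for k, v in fu_busy_cycles.items()},
    }


report = summarize(cycles=100, issued=250, issue_width=4,
                   stall_cycles={"rob_full": 10, "rs_full": 5},
                   fu_busy_cycles={"alu": 80})
print(report["sustained_rate"], report["sustained_to_peak"])  # 2.5 0.625
```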


OOO Speculative Micro-architecture Simulators

Simple Scalar
  University of Wisconsin in Madison
  www.simplescalar.com

KScalar
  Universidad Autónoma de Barcelona
  www.caos.uab.es/kscalar


Simple Scalar v3.0

tool set includes sample simulators ranging from a fast functional simulator to a detailed, dynamically scheduled processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction

includes performance visualization tools, statistical analysis resources, and debug and verification infrastructure

includes a machine definition infrastructure that permits most architectural details to be separated from simulator implementations


KScalar

allows analyzing the performance behavior of a wide range of processor microarchitectures: from a very simple in-order, scalar pipeline, to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction

The simulator interprets executables for the Alpha AXP instruction set: from very short program fragments to large applications

The object program's execution may be simulated in varying levels of detail: either cycle-by-cycle, observing all the pipeline events that determine processor performance, or millions of cycles at once, taking statistics of the main performance issues


Study Direction

Modeling and comparison of representative Micro-architectures
  Parameters modeling commercial micro-architecture's OOO speculative execution core
  SPEC benchmarks instruction traces
  Analysis of relative importance of supporting assumptions


Study Direction (continued)

Modeling Resource Utilization of Simultaneous Multithreaded Workload
  Comparison of resource utilization and performance metrics of single-thread vs. SMT execution
  Use of instruction traces that model multi-thread workload (e.g. modeling Hyperthreading in Pentium 4)