Out-of-Order Speculative Execution

Designing a Configurable Simulator for an OOO Microprocessor

By Mustafa Imran Ali, ID# 230203

COE 501 Presentation by Mustafa Imran Ali

Presentation Outline

Introduction

Examples - Representative Micro-architectures

Some Issues - Limitations and Other Approaches

Simulator Details


Out-of-order Speculative Execution – Maximizing ILP

In-order Execution

Pipelining - exploiting temporal parallelism through overlap

Superscalar - more parallelism by allowing multiple instructions to issue

Problem - Pipeline Stalls

Data dependencies allow limited ILP

Large latency functions cause structural hazards

Data loads - Cache miss stalls


Out-of-order Speculative Execution

instructions execute as soon as possible and in parallel with other nondependent work

results in faster execution because critical-path computations start and complete quickly

the processor speculatively fetches and executes instructions even though it may not know immediately whether they will be on the final execution path

Multilevel Branch prediction to avoid waiting for the outcome of multiple branches


OOO Speculative Execution - Benefits

Reduced reliance on compilers

Compilers cannot examine runtime dependencies

No need for recompilation

Source code access not always possible

Binary compatibility with existing code


OOO Speculative Execution - Problems and Issues

Overcoming WAW and WAR hazards - Register Renaming

More branches/cycle - accurate branch prediction

Register Renaming - Dependency checking mechanism (large comparisons)

Data forwarding from producers to consumers - use of tagging and broadcast mechanism

Exceptions - Committing instructions in program order


Compaq Alpha 21264 (1998)

OOO superscalar with speculative execution

Fetches 4 instructions/cycle

Dynamically issues up to 6 instructions/cycle: 4 integer and 2 floating point

Can speculate through up to 20 branches

64 architectural registers

41 integer + 41 floating point rename registers

Up to 80 instructions in-flight + 32 in-flight loads + 32 in-flight stores

20-entry integer queue: issues 4 instructions

15-entry floating point queue: issues 2 instructions

Can retire at most 11 instructions/cycle; can sustain a rate of 8/cycle (over short periods)


Stages in Instruction Pipeline

Provides 4 instructions/cycle

Maps virtual registers to physical registers

Dynamically selects from up to 6 instructions - issue reordering takes place

All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers


Register Renaming Process

assigns a unique storage location with each write-reference to a register

speculatively allocates a register to each instruction with a register result

register only becomes part of the user-visible (architectural) register state when the instruction retires/commits

allows instruction to speculatively issue and deposit its result into the register file before the instruction retires


Register Renaming Process (continued)

processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any)

register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register

register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs
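
The renaming steps above can be sketched in Python. This is a minimal illustration, not the Alpha 21264 implementation: the class name `RenameMap`, its methods, and the use of a dictionary in place of a CAM are all assumptions made for the example. Each write gets a fresh internal register, and a saved copy of the map allows recovery after a misspeculation.

```python
from dataclasses import dataclass, field


@dataclass
class RenameMap:
    """Hypothetical rename table: architectural register -> internal register."""
    num_arch: int
    num_phys: int
    mapping: dict = field(init=False)
    free_list: list = field(init=False)

    def __post_init__(self):
        # Assumption: architectural register i starts in internal register i.
        self.mapping = {r: r for r in range(self.num_arch)}
        self.free_list = list(range(self.num_arch, self.num_phys))

    def rename(self, srcs, dest):
        """Rename one instruction.

        Returns (internal sources, internal destination, stale register, saved map).
        """
        snapshot = dict(self.mapping)                 # map state before this instruction
        phys_srcs = [self.mapping[s] for s in srcs]   # source lookup (a CAM in hardware)
        phys_dest = stale = None
        if dest is not None:
            if not self.free_list:
                raise RuntimeError("no free internal register: the map stage would stall")
            stale = self.mapping[dest]                # old contents of the destination
            phys_dest = self.free_list.pop(0)         # unique location for this write
            self.mapping[dest] = phys_dest
        return phys_srcs, phys_dest, stale, snapshot

    def restore(self, snapshot, allocated):
        """Misspeculation: restore the saved map and free the allocated registers."""
        self.mapping = dict(snapshot)
        self.free_list = sorted(self.free_list + list(allocated))


rm = RenameMap(num_arch=4, num_phys=8)
s1, d1, stale1, snap1 = rm.rename(srcs=[1, 2], dest=3)   # r3 <- r1 op r2
s2, d2, stale2, snap2 = rm.rename(srcs=[3], dest=3)      # r3 <- f(r3), gets a new register
rm.restore(snap2, [d2])                                  # squash the second instruction
```

Because the second write to r3 receives its own internal register, the WAW and WAR hazards between the two instructions disappear; only the true dependence through the first result remains.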


Map (register rename) and Queue Stages

The map stage renames programmer-visible register numbers to internal register numbers

The queue stage stores instructions until they are ready to issue

structures are duplicated for integer and floating point execution


Out-of-order Issue Queues

issue queue logic maintains 2 lists of pending instructions in separate integer and floating-point queues

scoreboards maintain status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions

the scoreboard unit notifies all instructions in the queue that require the register value when functional unit or load-data results become available


Out-of-order Execution

Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle

queues are collapsible: an entry becomes immediately available once the instruction issues or is squashed due to misspeculation
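
A small Python sketch of the oldest-first selection described above. The names `QueueEntry` and `select`, the unit labels, and the way readiness is represented (a set of ready internal registers and a count of free units per class) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    age: int        # fetch order; a smaller value means an older instruction
    srcs: tuple     # internal registers this instruction reads
    unit: str       # functional-unit class it needs (hypothetical labels)


def select(queue, ready_regs, free_units, width):
    """Pick up to `width` oldest entries whose operands and unit are available."""
    issued = []
    units = dict(free_units)                      # units still free in this cycle
    for entry in sorted(queue, key=lambda e: e.age):
        if len(issued) == width:
            break
        operands_ready = all(r in ready_regs for r in entry.srcs)
        if operands_ready and units.get(entry.unit, 0) > 0:
            units[entry.unit] -= 1
            issued.append(entry)
    # Collapse the queue: issued entries free their slots immediately.
    issued_ids = {id(e) for e in issued}
    remaining = [e for e in queue if id(e) not in issued_ids]
    return issued, remaining


queue = [
    QueueEntry(0, (1,), "alu"),
    QueueEntry(1, (9,), "alu"),   # operand 9 is not ready yet
    QueueEntry(2, (2,), "alu"),   # ready, but the only ALU goes to the older entry
    QueueEntry(3, (3,), "mul"),
]
issued, remaining = select(queue, ready_regs={1, 2, 3},
                           free_units={"alu": 1, "mul": 1}, width=4)
issued_ages = [e.age for e in issued]
remaining_ages = [e.age for e in remaining]
```

In this run the entry of age 3 issues ahead of the older entries 1 and 2, which is the reordering the queue exists to allow.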


Retire Mechanism

assigns each mapped instruction a slot in a circular in-flight window (in fetch order)

tracks the internal register usage for all in-flight instructions

each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction

this (stale) register can be freed for other use after the instruction retires


Exception Handling

an exception causes all younger instructions in the in-flight window to be squashed and removed from all queues in the system

register map is backed up to the state before the last squashed instruction using the saved map state

registers allocated by the squashed instructions become immediately available
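
The retire and exception steps can be sketched together in Python. This is a simplified model under stated assumptions: the class `InFlightWindow` and its methods are hypothetical, every instruction is assumed to write one register, and "the state before the last squashed instruction" is read as the map saved by the oldest squashed instruction.

```python
from collections import deque


class InFlightWindow:
    """Hypothetical in-flight window holding entries in fetch order."""

    def __init__(self, arch_map, free_regs):
        self.map = dict(arch_map)      # architectural -> internal register
        self.free = list(free_regs)    # internal registers available for allocation
        self.window = deque()          # oldest entry on the left

    def allocate(self, dest):
        """Map one instruction that writes architectural register `dest`."""
        new = self.free.pop(0)
        entry = {"dest": dest, "new": new, "stale": self.map[dest],
                 "saved_map": dict(self.map)}      # map state before this instruction
        self.map[dest] = new
        self.window.append(entry)
        return entry

    def retire(self):
        """Retire the oldest instruction; its stale register becomes free."""
        entry = self.window.popleft()
        self.free.append(entry["stale"])
        return entry

    def squash_younger_than(self, index):
        """Exception at window[index]: squash every younger instruction."""
        squashed = [self.window.pop() for _ in range(len(self.window) - index - 1)]
        if squashed:
            oldest_squashed = squashed[-1]
            self.map = dict(oldest_squashed["saved_map"])
            # Registers allocated by squashed instructions are available at once.
            self.free = [e["new"] for e in reversed(squashed)] + self.free
        return squashed


w = InFlightWindow(arch_map={0: 0, 1: 1}, free_regs=[2, 3, 4])
a = w.allocate(0)
b = w.allocate(1)
c = w.allocate(0)
squashed = w.squash_younger_than(0)      # exception on the oldest instruction
map_after_squash = dict(w.map)
free_after_squash = list(w.free)
retired = w.retire()                     # the excepting instruction's slot drains
```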


HP PA-RISC 8000


ROB Size Performance Effect


AMD K-5 ROB Entry


AMD K-5 Reservation Station Entry


Approaches for Billion Transistor Architectures

Advanced superscalar processors

scale up from current designs to issue 16 or 32 instructions per cycle

Superspeculative processors

enhance wide-issue superscalar performance by speculating aggressively at every point in the processor pipeline


SPARC64 V9


Pentium III and 4 Register Renaming and ROB


One Billion Transistors, One Uniprocessor, One Chip?


Superspeculative Architecture


Area Issues

Large circuitry is required to feed the processor with a continuous instruction stream

Dynamic execution requires a large number of comparisons for dependency checking

The sizes of the reorder buffer and the reservation stations/rename registers increase accordingly
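
A rough Python estimate of how the comparison count grows. The model is an assumption for illustration: every instruction has two sources, each source is compared against the destination of every older instruction issued in the same group, and every waiting operand compares its tag against every result bus. The function names are hypothetical.

```python
def rename_comparators(width, srcs_per_instr=2):
    """Source-vs-older-destination comparisons within one group of `width` instructions."""
    # Instruction i (0-based) has i older instructions in its group.
    return srcs_per_instr * width * (width - 1) // 2


def wakeup_comparators(num_rs, num_cdbs, srcs_per_instr=2):
    """Tag comparisons when every RS operand snoops every result bus."""
    return srcs_per_instr * num_rs * num_cdbs


rename_growth = {w: rename_comparators(w) for w in (4, 8, 16)}
wakeup_example = wakeup_comparators(num_rs=20, num_cdbs=4)
```

Under this model, doubling the issue width roughly quadruples the intra-group comparisons, which is one way to see why the area cost rises quickly with width.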


Limitations

Larger-issue machines have high peak-to-sustained rate ratios - Intel Pentium Pro architecture approach

Beyond issue widths of 8, the inherently limited ILP of a single thread gives diminishing returns - more architectures are switching to Simultaneous Multithreading


Alternate Approaches

Approach | Issue Structure | Hazard Detection | Scheduling | Comment | Examples

Speculative Superscalar | Dynamic | Hardware | Dynamic with Speculation | OOO with Speculation | Pentium II/III/IV, Alpha 21264

VLIW | Static | Software | Static | No hazard between issue packets | MAJC

EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium


OOO Speculative Execution Processor - Simulator Design

Tracking all the activities of the pipelined machine in each clock cycle

Issue Unit design that solves structural and data hazards

Dependency checking Mechanisms

Strategy for sending data from producers to consumers


Data Structures

Instruction Queue

Execution Tracking Hardware Structure

Register File

Producer Table

Reservation Stations

The Reorder Buffer

Functional Units State Structure


Service Functions

Issue

Dispatch

Completion

CDB Snooping

Retirement and Writeback


Overall Structure


Producer Table

Each register is extended by a tag and a valid flag

Valid=true iff the register contains appropriate data

Otherwise the tag points to the instruction producing the data


Reservation Stations

Full bit is set if entry occupied

Tag points to ROB tag of the instruction

op1 and op2 hold the source references


The Reorder Buffer

Realized as a FIFO with ROBhead and ROBtail

New instructions are put at ROBtail, and the instruction's RS entry is tagged with this ROB index.

Each cycle the ROBhead valid entry is checked for instruction completion


Issue Protocol

if (there is a free RS and a free ROB entry) {
    RS.full := 1;
    RS.tag := ROBtail;
    for all operands x of Ii with address r
        if Rr.valid = 1
            RS.opx := Rr;
        else if CDB.tag = Rr.tag and CDB.valid
            RS.opx := CDB;
        else
            RS.opx := ROB[Rr.tag];
    if (Ii has a destination register r) {
        Rr.tag := ROBtail;
        Rr.valid := 0;
        ROB[ROBtail].dest := r;
    } else
        ROB[ROBtail].dest := none;
    ROBtail := ROBtail + 1;
}
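
The issue protocol can be sketched as runnable Python. This is one possible reading, not the simulator's actual code: the state layout built by `make_state`, the `count` field used to detect a full ROB, and the wrap-around of ROBtail are assumptions added for the example.

```python
def make_state(num_regs, rob_size, num_rs):
    """Build hypothetical simulator state from plain dicts and lists."""
    return {
        "R": [{"valid": True, "tag": None, "data": 0} for _ in range(num_regs)],
        "ROB": [{"valid": False, "data": None, "dest": None} for _ in range(rob_size)],
        "RS": [{"full": False, "tag": None, "ops": []} for _ in range(num_rs)],
        "head": 0, "tail": 0, "count": 0,
        "CDB": {"valid": False, "tag": None, "data": None},
    }


def issue(state, srcs, dest):
    """Try to issue one instruction; return the RS used, or None on a stall."""
    rs = next((e for e in state["RS"] if not e["full"]), None)
    if rs is None or state["count"] == len(state["ROB"]):
        return None                                   # structural hazard: stall
    tail, cdb = state["tail"], state["CDB"]
    rs["full"], rs["tag"], rs["ops"] = True, tail, []
    for r in srcs:
        reg = state["R"][r]
        if reg["valid"]:                              # value is in the register file
            op = {"valid": True, "tag": None, "data": reg["data"]}
        elif cdb["valid"] and cdb["tag"] == reg["tag"]:
            op = {"valid": True, "tag": None, "data": cdb["data"]}   # on the bus now
        else:                                         # read the producer's ROB entry
            prod = state["ROB"][reg["tag"]]
            op = {"valid": prod["valid"], "tag": reg["tag"], "data": prod["data"]}
        rs["ops"].append(op)
    if dest is not None:                              # this instruction is the new producer
        state["R"][dest]["tag"], state["R"][dest]["valid"] = tail, False
    state["ROB"][tail] = {"valid": False, "data": None, "dest": dest}
    state["tail"] = (tail + 1) % len(state["ROB"])    # assumption: circular buffer
    state["count"] += 1
    return rs


st = make_state(num_regs=4, rob_size=4, num_rs=2)
st["R"][1]["data"] = 7
first = issue(st, srcs=[1], dest=2)    # r2 <- f(r1): operand read from the register file
second = issue(st, srcs=[2], dest=3)   # r3 <- g(r2): must wait for ROB tag 0
third = issue(st, srcs=[0], dest=None) # no free RS left, so issue stalls
```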


Dispatch Protocol

if (there is a RS with RS.opx.valid = 1 for all operands x and the function unit is not stalled) {
    pass instruction, operands, and tag to FU;
    RS.full := 0;
}
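
A Python sketch of the dispatch step. The dictionary layout of a reservation station and the choice to dispatch at most one instruction per call, scanning stations in index order, are assumptions for illustration.

```python
def dispatch(reservation_stations, fu_stalled):
    """Send at most one ready instruction to the functional unit.

    Returns (tag, operand values) for the dispatched instruction, or None.
    """
    if fu_stalled:
        return None
    for rs in reservation_stations:
        if rs["full"] and all(op["valid"] for op in rs["ops"]):
            rs["full"] = False                    # the entry is free for a new issue
            return rs["tag"], [op["data"] for op in rs["ops"]]
    return None


stations = [
    {"full": True, "tag": 0,
     "ops": [{"valid": False, "tag": 5, "data": None}]},       # still waiting on tag 5
    {"full": True, "tag": 1,
     "ops": [{"valid": True, "tag": None, "data": 3},
             {"valid": True, "tag": None, "data": 4}]},
]
while_stalled = dispatch(stations, fu_stalled=True)
sent = dispatch(stations, fu_stalled=False)
nothing_left = dispatch(stations, fu_stalled=False)
```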


Completion Protocol

if (FU has result and got CDB acknowledge) {
    CDB.valid := 1;
    CDB.data := result from FU;
    CDB.tag := tag from FU;
    ROB[CDB.tag].valid := 1;
    ROB[CDB.tag].data := CDB.data;
}


CDB Snooping

for all operands x:
    if RS.full = 1 and RS.opx.valid = 0 and RS.opx.tag = CDB.tag {
        RS.opx := CDB;
    }
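
Completion and CDB snooping can be sketched together in Python, since the second consumes what the first broadcasts. The dictionary layouts are assumptions; so are two details the pseudocode leaves implicit: the bus is marked invalid when nothing completes, and snooping is skipped when the bus is invalid.

```python
def complete(cdb, rob, fu_result, cdb_granted):
    """Completion: put an FU result on the CDB and record it in the ROB."""
    if fu_result is None or not cdb_granted:
        cdb["valid"] = False                  # assumption: idle bus is marked invalid
        return False
    tag, data = fu_result
    cdb.update(valid=True, tag=tag, data=data)
    rob[tag]["valid"], rob[tag]["data"] = True, data
    return True


def snoop(reservation_stations, cdb):
    """Snooping: waiting operands with a matching tag capture the broadcast value."""
    if not cdb["valid"]:
        return 0
    captured = 0
    for rs in reservation_stations:
        if not rs["full"]:
            continue
        for op in rs["ops"]:
            if not op["valid"] and op["tag"] == cdb["tag"]:
                op.update(valid=True, data=cdb["data"])
                captured += 1
    return captured


cdb = {"valid": False, "tag": None, "data": None}
rob = [{"valid": False, "data": None, "dest": 2},
       {"valid": False, "data": None, "dest": 3}]
stations = [{"full": True, "tag": 1,
             "ops": [{"valid": False, "tag": 0, "data": None}]}]
done = complete(cdb, rob, fu_result=(0, 42), cdb_granted=True)
captured = snoop(stations, cdb)
```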


Retirement/Writeback Protocol

if (ROB not empty and ROB[ROBhead].valid = 1) {
    if (instruction in ROB[ROBhead] requires writeback) {
        x := ROB[ROBhead].dest;
        Rx.data := ROB[ROBhead].data;
        if ROBhead = Rx.tag
            Rx.valid := 1;
    }
    ROBhead := ROBhead + 1;
}
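
A Python sketch of in-order retirement. The state layout, the `count` field used for the empty check, the wrap-around of ROBhead, and the clearing of the retired slot are assumptions for illustration.

```python
def retire(state):
    """Retire the instruction at ROBhead if it has completed; return True if one retired."""
    rob = state["ROB"]
    head = state["head"]
    if state["count"] == 0 or not rob[head]["valid"]:
        return False
    dest = rob[head]["dest"]
    if dest is not None:                     # the instruction requires writeback
        reg = state["R"][dest]
        reg["data"] = rob[head]["data"]
        if reg["tag"] == head:               # no younger instruction has renamed it
            reg["valid"] = True
    rob[head]["valid"] = False               # assumption: clear the slot for reuse
    state["head"] = (head + 1) % len(rob)    # assumption: circular buffer
    state["count"] -= 1
    return True


state = {
    "R": [{"valid": False, "tag": 0, "data": 0},
          {"valid": False, "tag": 2, "data": 0}],   # r1 is also written by ROB entry 2
    "ROB": [{"valid": True, "data": 5, "dest": 0},
            {"valid": True, "data": 9, "dest": 1},
            {"valid": False, "data": None, "dest": 1}],
    "head": 0, "count": 3,
}
r1 = retire(state)   # writes r0 and marks it valid
r2 = retire(state)   # writes r1 but leaves it invalid: a younger producer is pending
r3 = retire(state)   # head entry has not completed, so nothing retires
```

The second call shows why the tag comparison matters: the register stays invalid until the youngest producer of that register retires.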


Configurable Parameters

Probability of memory misses

Probability of correct branch prediction

Branch mis-prediction penalty

Cache miss penalty

Window Size for instruction issue

Number of Issues/cycle

Number of Functional Units (FUs)

Pipeline Depth/Latency of each FU

Number of CDBs

Size of reservation stations/rename registers (RS)

Operand matching mechanism in each RS

Size of re-order buffer

Branch Prediction Mechanisms (optional)
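
One way to hold these parameters is a single validated configuration object, sketched here in Python. The class `SimConfig`, every field name, and every default value are hypothetical placeholders, not values taken from the simulator; `load_latency` shows how a probabilistic miss parameter might be consumed.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SimConfig:
    """Hypothetical parameter set; all defaults are placeholders."""
    miss_probability: float = 0.05
    branch_correct_probability: float = 0.90
    branch_mispredict_penalty: int = 10        # cycles
    cache_miss_penalty: int = 20               # cycles
    window_size: int = 8
    issue_width: int = 4
    fu_count: dict = field(default_factory=lambda: {"alu": 2, "mul": 1, "mem": 1})
    fu_latency: dict = field(default_factory=lambda: {"alu": 1, "mul": 3, "mem": 2})
    num_cdbs: int = 1
    num_rs: int = 8
    rob_size: int = 16

    def __post_init__(self):
        for name in ("miss_probability", "branch_correct_probability"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1]")
        if self.issue_width > self.window_size:
            raise ValueError("issue_width cannot exceed window_size")


def load_latency(cfg, rng):
    """Latency of one load: the base memory latency plus the penalty on a miss."""
    miss = rng.random() < cfg.miss_probability
    return cfg.fu_latency["mem"] + (cfg.cache_miss_penalty if miss else 0)


rng = random.Random(0)                         # seeded so runs are repeatable
hit_latency = load_latency(SimConfig(miss_probability=0.0), rng)
miss_latency = load_latency(SimConfig(miss_probability=1.0), rng)
```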


Performance Metrics

Number of Clock cycles on an instruction trace

Number of Stalls (Various Types)

Effect on Hardware costs

Peak vs. Sustained Rates (actual issues vs. maximum possible)

Percentage Resource Utilization
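
A Python sketch of how some of these metrics might be derived from per-cycle counters. The function `summarize`, its inputs, and the sample numbers are hypothetical; hardware cost is left out because it is not a quantity a trace run produces.

```python
def summarize(issued_per_cycle, issue_width, fu_busy_cycles, stalls):
    """Derive summary metrics from per-cycle counters of one trace run."""
    cycles = len(issued_per_cycle)
    if cycles == 0:
        raise ValueError("the trace produced no cycles")
    sustained = sum(issued_per_cycle) / cycles        # actual issues per cycle
    return {
        "cycles": cycles,
        "sustained_rate": sustained,
        "peak_to_sustained": issue_width / sustained if sustained else float("inf"),
        "utilization_pct": {fu: 100.0 * busy / cycles
                            for fu, busy in fu_busy_cycles.items()},
        "stalls": dict(stalls),
    }


report = summarize(
    issued_per_cycle=[4, 2, 0, 2],                    # made-up sample data
    issue_width=4,
    fu_busy_cycles={"alu": 3, "mul": 1},
    stalls={"rob_full": 1, "rs_full": 0},
)
```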


OOO Speculative Micro-architecture Simulators

SimpleScalar - University of Wisconsin in Madison - www.simplescalar.com

KScalar - Universidad Autónoma de Barcelona - www.caos.uab.es/kscalar


SimpleScalar v3.0

tool set includes sample simulators ranging from a fast functional simulator to a detailed, dynamically scheduled processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction

includes performance visualization tools, statistical analysis resources, and debug and verification infrastructure

includes a machine definition infrastructure that permits most architectural details to be separated from simulator implementations


KScalar

allows analyzing the performance behavior of a wide range of processor microarchitectures: from a very simple in-order, scalar pipeline, to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction

The simulator interprets executables for the Alpha AXP instruction set: from very short program fragments to large applications

The object program's execution may be simulated at varying levels of detail: either cycle by cycle, observing all the pipeline events that determine processor performance, or millions of cycles at once, gathering statistics on the main performance issues


Study Direction

Modeling and comparison of representative Micro-architectures

Parameters modeling commercial micro-architecture's OOO speculative execution core

SPEC benchmarks instruction traces

analysis of relative importance of supporting assumptions


Study Direction (continued)

Modeling Resource Utilization of Simultaneous Multithreaded Workload

Comparison of resource utilization and performance metrics of single-thread vs. SMT execution

Use of instruction traces that model multi-thread workload (e.g. modeling Hyperthreading in Pentium 4)
