Hier wird Wissen Wirklichkeit Computer Architecture – Part 9 – page 1 of 84 – Prof. Dr. Uwe Brinkschulte, Prof. Dr. Klaus Waldschmidt
Part 9: Instruction Level Parallelism (ILP) - Concurrency
Computer Architecture
Slide Sets
WS 2011/2012
Prof. Dr. Uwe Brinkschulte
Prof. Dr. Klaus Waldschmidt
Concurrency
Classical pipelining allows the completion of up to one instruction per clock
cycle (scalar execution).
A concurrent execution of several instructions in one clock cycle requires
the availability of several independent functional units.
These functional units are more or less heterogeneous (that is, they
are designed and optimized for different functions).
Two major concepts of concurrency at the ILP level exist:
- Superscalar concurrency
- VLIW concurrency
These concepts can also be found in combination.
Concurrency - superscalar
The superscalar technique operates on a conventional sequential
instruction stream
The concurrent instruction issue is performed completely during runtime
by hardware.
This technique requires a lot of hardware resources.
It allows a very efficient dynamic issue of instructions at runtime.
On the downside, no long-running dependency analysis (as is possible,
e.g., in a compiler) can be performed.
Concurrency - superscalar
The superscalar technique is a pure microarchitecture technique, since
it is not visible at the architectural level (conventional sequential
instruction stream).
Thus, the hardware structure (e.g. the number of parallel execution units)
can be changed without changing the architectural specification
(e.g. the ISA).
Superscalar execution is usually combined with pipelining (superscalar
pipeline).
Concurrency - VLIW
The VLIW technique (Very Long Instruction Word) operates on a parallel
instruction stream.
The concurrent instruction issue is organized statically with the support of
the compiler.
The consequence is a lower amount of hardware resources.
Extensive compiler optimizations are possible to exploit parallelism.
On the downside, no dynamic effects can be considered (e.g. branch
prediction is difficult in VLIW).
Concurrency - VLIW
VLIW is an architectural technique, since the parallel instruction stream is
visible at the architectural level.
Therefore, a change in e.g. the level of parallelism leads to a change in the
architectural specification.
VLIW is usually combined with pipelining.
VLIW can also be combined with superscalar concepts, as is done e.g. in
EPIC (Explicitly Parallel Instruction Computing, Intel Itanium).
The main question in designing a concurrent computer architecture is:
How much instruction-level parallelism (ILP) exists in the code of an
application?
This question has been analyzed very extensively for the compilation
of sequential imperative programming languages into a RISC instruction
set.
The result of all these analyses is:
Programs exhibit a fine-grained parallelism degree of 5-7.
Degree of parallelism in ILP
Higher degrees of parallelism can be obtained only with code containing
long basic blocks (long instruction sequences without branches).
Numerical applications in combination with loop unrolling are an
application class with higher ILP.
A further application class is embedded system control.
A computer architecture for general-purpose applications offering more
parallelism than this ILP of 5-7 can suffer from decreasing efficiency
because many functional units sit idle.
Degree of parallelism in ILP
Superscalar technique
Components of a superscalar processor
[Figure: block diagram of a superscalar processor: Bus Interface Unit; I-cache with MMU; Branch Unit with BHT, BTAC, and RAS; Instruction Fetch Unit; Instruction Decode and Register Rename Unit; Instruction Buffer; Instruction Issue Unit; Reorder Buffer; Rename Registers; Retire Unit; General Purpose, Floating-Point, and Multi-media Registers; Integer, Floating-Point, Multi-media, and Load/Store Unit(s); D-cache with MMU.]
Superscalar technique
A superscalar pipeline:
• operates on a sequential instruction stream
• instructions are collected in an instruction window
• instruction issue to heterogeneous execution units is done by hardware
• the microprocessor has several, mostly heterogeneous, functional units in the execution stage of the instruction pipeline
• instruction processing can be done out of sequential instruction stream order
• the sequential instruction stream order is finally restored
[Figure: superscalar pipeline stages: Instruction Fetch, Instruction Decode and Rename, Instruction Window, Issue, Reservation Stations, Execution, Retire and Write Back.]
Superscalar technique
In-order and out-of-order sections in a superscalar pipeline
[Figure: the superscalar pipeline annotated by section: Instruction Fetch and Instruction Decode and Rename are in-order; the Instruction Window with Issue, Reservation Stations, and Execution is out-of-order; Retire and Write Back is in-order.]
Instruction fetch
• Loads several instructions (instruction block) from the nearest instruction memory (e.g. instruction cache) to an instruction buffer
• Usually, as many instructions are fetched per clock cycle as can be issued to the execution units (fetch bandwidth)
• Control flow conflicts are solved by branch prediction and branch target address cache
• The instruction buffer decouples instruction fetch from decode
Instruction fetch
• Harvard architecture at the cache level
• Self-modifying code cannot be implemented efficiently on today's superscalar processors
• The instruction cache (single port) is usually organized more simply than the data cache (multi port)
• In case of branches, instructions have to be fetched from different cache blocks
• Solutions to parallelize this: multi-channel caches, interleaved caches, multiple instruction fetch units, trace cache
Decode
• Decodes multiple instructions per clock cycle
• Decode bandwidth usually equal to fetch bandwidth
• A fixed-length instruction format simplifies decoding of several instructions per clock cycle
• Variable instruction length => multi-stage decoding
• first stage: determine instruction boundaries
• second stage: decode instructions and create one or more microinstructions
• complex CISC instructions are split into simpler RISC instructions
Register rename
• Goal of register renaming: remove false dependencies (output dependency, anti dependency)
• Renaming can be done:
• statically by the compiler
• dynamically by hardware
• Dynamic register renaming:
• architectural registers are mapped to physical registers
• each destination register specified in the instruction is mapped to a free physical register
• subsequent instructions that have the same architectural register as a source register are given the last assigned physical register as input operand by register renaming
=> false dependencies between register operands are removed
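The renaming scheme described above can be sketched in a few lines. This is an illustrative model, not the lecture's hardware design: the class name, the register names, and the simple free-list policy are invented for the example.

```python
# Sketch of dynamic register renaming (illustrative; not a real ISA).
# Each architectural destination is mapped to a fresh physical register;
# later readers of that architectural register get the last mapping.

class RenameTable:
    def __init__(self, num_phys):
        self.free = list(range(num_phys))   # free physical registers
        self.map = {}                       # architectural -> physical

    def rename(self, dst, srcs):
        """Return (physical dst, physical srcs) for one instruction."""
        # sources read the last assigned mapping (or stay unmapped)
        phys_srcs = [self.map.get(s, s) for s in srcs]
        # a fresh destination register removes WAW/WAR dependencies
        phys_dst = self.free.pop(0)
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs

rt = RenameTable(num_phys=8)
# I1: c = d - e    I2: x = c + w    I3: c = f - g  (WAW on c with I1)
d1, s1 = rt.rename("c", ["d", "e"])
d2, s2 = rt.rename("x", ["c", "w"])
d3, s3 = rt.rename("c", ["f", "g"])
```

I1 and I3 now write different physical registers, so their output dependency disappears, while I2 still reads exactly the physical register that I1 produced.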
Register rename
Two possible implementations:
• two different register sets are present
• architectural registers store the „valid“ values
• rename buffer registers store temporary results
• on renaming, architectural registers are assigned to buffer registers
• only one register set of so-called physical registers is present
• these store temporary and valid values
• architectural registers are mapped to physical registers
• architectural registers themselves are physically non-existent
• a mapping table defines which physical register currently operates as which architectural register for a given instruction
Register rename
Possible implementation:
[Figure: the logical destination registers index a mapping table that yields the physical destination registers; the logical source registers pass through the mapping table, a dependency check, and a multiplexer to yield the physical source registers. Mapping has to be done for multiple instructions simultaneously.]
Instruction window
• Decoded instructions are written to the instruction window
• The instruction window decouples fetch/decode from execution
• The instructions in the instruction window are
• free of control flow dependencies due to branch prediction
• free of false dependencies due to register renaming
• True dependencies and resource dependencies remain
• Instruction issue checks in each clock cycle which instructions from the instruction window can be issued to the execution units
• These are issued up to the maximum issue bandwidth (number of execution units)
• The original instruction sequence is stored in the reorder buffer
Instruction window and issue terminology
• issue means the assignment of instructions to execution units or to preceding reservation stations, if present (see e.g. the Tomasulo algorithm)
• if reservation stations are present, the assignment of instructions from reservation stations to the execution units is called dispatch
• the instruction issue policy describes the protocol used to select instructions for issuing
• depending on the processor, instructions can be issued in-order or out-of-order
• the lookahead capability determines how many instructions in the instruction window can be inspected to find the next issuable instructions
• the issuing logic determining executable instructions is often called the scheduler
In-order versus out-of-order issue
Example:
I1: w = a - b
I2: x = c + w   (RAW on w)
I3: y = d - e
I4: z = e + y   (RAW on y)

In-order issue:
clock n: I1
clock n+1: I2, I3
clock n+2: I4

Out-of-order issue:
clock n: I1, I3
clock n+1: I2, I4

• Using in-order issue, the scheduler has to wait after I1 (RAW), then I2 and I3 can be issued in parallel (no dependency), finally I4 can be issued (RAW)
• Using out-of-order issue, the scheduler can issue I1 and I3 in parallel (no dependency), followed by I2 and I4 => one clock cycle is saved
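The example can be reproduced with a small scheduling sketch. This is illustrative only: the greedy scheduler, its data structures, and the assumption that results are visible one cycle after issue are invented for this example.

```python
# Sketch of in-order vs. out-of-order issue with issue bandwidth 2.
# An instruction is issuable once all its source operands exist.

insts = {  # name: (destination, set of sources)
    "I1": ("w", {"a", "b"}),
    "I2": ("x", {"c", "w"}),
    "I3": ("y", {"d", "e"}),
    "I4": ("z", {"e", "y"}),
}

def schedule(in_order, width=2):
    ready = {"a", "b", "c", "d", "e"}   # initially available operands
    pending = list(insts)
    cycles = []
    while pending:
        issued = []
        for name in list(pending):
            if len(issued) == width:
                break
            dst, srcs = insts[name]
            if srcs <= ready:
                issued.append(name)
            elif in_order:
                break   # in-order: stop at the first stalled instruction
        for name in issued:
            pending.remove(name)
        cycles.append(issued)
        # results become visible for the next cycle
        ready |= {insts[n][0] for n in issued}
    return cycles

seq = schedule(in_order=True)
ooo = schedule(in_order=False)
```

The in-order run needs three cycles (I1; I2, I3; I4), the out-of-order run two (I1, I3; I2, I4), matching the slide.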
False dependencies and out-of-order issue
Example:
I1: w = a - b
I2: x = c + w   (RAW on w)
I3: c = d - e   (WAR: I2 must read the old c)
I4: z = e + c   (RAW on the new c)

Out-of-order issue:
clock n: I1, I3 (c = d - e)
clock n+1: I2 (x = c + w), I4 (z = e + c)
=> I2 and I4 both use the new c: different from sequential execution!

Out-of-order issue with register rename:
clock n: I1, I3 (c2 = d - e)
clock n+1: I2 (x = c1 + w), I4 (z = e + c2)
=> I2 uses the old c (c1), I4 uses the new c (c2): identical to sequential execution!

• Out-of-order issue makes false dependencies (WAR, WAW) critical
• Register renaming solves these issues
Scheduling techniques
There are several possible techniques to determine and
issue the next executable instructions, e.g.:
• Associative memory (centralized solution)
• Tomasulo algorithm (decentralized solution)
• Scoreboard (centralized solution)
Wake up with associative memory
• The instructions waiting in the instruction window are marked by so-called tags.
• The tags of the produced results are compared with the tags of the operands of the waiting instructions.
• For comparison, each window cell is equipped with comparators. All comparators work in parallel.
• This kind of memory is called an associative memory.
• A comparison hit is marked by a ready bit.
• If the ready bits of an instruction are complete, the instruction is issued.
• This solves the true dependencies.
Wake up with associative memory
[Figure: associative instruction window: each entry inst0 … instN-1 holds left and right operand tags (opd tagL, opd tagR) with ready bits (rdyL, rdyR); comparators match each broadcast result tag (tag1 … tagIW) against all entries in parallel, and hits are OR-ed into the ready bits.]
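The wake-up mechanism can be sketched as follows. This is an illustrative model of the tag-compare idea, not the lecture's circuit; the class, tag values, and instruction names are invented.

```python
# Sketch of associative wake-up: each window entry compares a broadcast
# result tag against both operand tags; a match sets the ready bit,
# and an entry with both ready bits set can be issued.

class WindowEntry:
    def __init__(self, name, tag_l, tag_r, rdy_l=False, rdy_r=False):
        self.name, self.tag_l, self.tag_r = name, tag_l, tag_r
        self.rdy_l, self.rdy_r = rdy_l, rdy_r

    def wakeup(self, result_tag):
        # in hardware, all these comparators operate in parallel
        if result_tag == self.tag_l:
            self.rdy_l = True
        if result_tag == self.tag_r:
            self.rdy_r = True

    def issuable(self):
        return self.rdy_l and self.rdy_r

window = [WindowEntry("I2", tag_l=7, tag_r=3, rdy_r=True),  # waits on tag 7
          WindowEntry("I3", tag_l=4, tag_r=5, rdy_l=True, rdy_r=True)]

for entry in window:          # broadcast of a result with tag 7
    entry.wakeup(7)
ready = [e.name for e in window if e.issuable()]
```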
Priority based issuing of instructions woken up
• If more instructions are determined for issuing than execution units are available (issue bandwidth), a priority selection logic is necessary
• This selection logic determines for each execution unit the instruction to issue from the woken-up instructions
• Therefore, each execution unit needs such a selection unit
• This solves the resource dependencies
• The hardware complexity of the issue unit rises with the size of the instruction window and the number of execution units
Selection logic for a single execution unit
[Figure: tree-structured selection logic for one execution unit: arbiter cells combine request signals (req0 … req3) through an OR gate and a priority encoder; anyreq signals propagate up to the root cell, while enable/grant signals (grant0 … grant3) propagate back down to the issue window entries and to the other subtrees.]
Tomasulo algorithm
• The most well-known principle for instruction parallelism in superscalar processors is the Tomasulo algorithm.
• This algorithm was first implemented in the IBM System/360 Model 91 by R. Tomasulo.
• The main assumption of the Tomasulo algorithm is that the semantics of a program are unchanged if the data dependencies are preserved when the sequence of instructions is modified.
• The Tomasulo algorithm is based on the dataflow principle.
• All waiting instructions in the instruction window can be ordered in a dataflow graph.
• As a consequence, all instructions in one level of the dataflow graph can be issued and executed in parallel, and all dependencies in the dataflow graph can be represented by pointers to the functional units.
Tomasulo algorithm
• Therefore the functional units are equipped with additional registers, so-called reservation stations, which store these pointers or the operands themselves.
• Assigning operands and pointers to the reservation stations (issue) solves the resource dependencies.
• As soon as all operands and pointers are available, the function is executed (dispatch).
• This solves the true data dependencies.
• If all operands are available immediately, issue and dispatch can be done in the same clock cycle, so dispatch usually is not a pipeline stage.
• In contrast to the associative memory approach, resource dependencies are solved before true data dependencies.
• For a better distinction of the reservation stations from the registers of the original register file, the registers of the register file are regarded as functional units with the identity operation.
Dataflow graph of instructions in the instruction window
[Figure: instructions in the window ordered as a dataflow graph with levels 0, 1, 2, …; each node is implemented by reservation stations feeding a functional unit, and registers appear as identity functional units.]
Simple microarchitecture for demonstrating Tomasulo algorithm
[Figure: functional units mul, div, add, and sub, each consisting of reservation stations and an execution unit, connected to a register unit with identity registers a, b, c, d, e, f, x, y, z.]
Simple microarchitecture for demonstrating Tomasulo algorithm
[Figure: execution of the example program
I1: x = a / b
I2: y = x + z
I3: z = c · d
I4: x = e - f
(RAW between I1 and I2, WAR between I2 and I3, WAW between I1 and I4) in three steps on the div, add, mul, and sub units.]
Execution of the program sequence on the microarchitecture
First step: instructions I1 - I4 and the available operands are issued to the corresponding reservation stations
• the result reservation stations are reserved for I1, I2 and I3
• the result reservation station for I4 cannot be reserved because it is still occupied by the result of I1
Second step: instructions I1 and I3 are dispatched because all operands and result space are available
• the result of I1 is transferred to the reservation station where I2 is waiting
• therefore, the result reservation station occupied by I1 so far becomes free and is now reserved for I4
Third step: instructions I2 and I4 are dispatched and the results are stored
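The pointer-passing idea of the reservation stations can be sketched for the I1/I2 pair of the example. This is an illustrative model with invented class names, tags, and operand values; it shows issue with a pointer, dispatch, and result broadcast, not the full microarchitecture.

```python
# Sketch of a Tomasulo-style reservation station: it holds either an
# operand value or a pointer (tag) to the unit that will produce it.
# When a unit finishes, its tag and result are broadcast and captured.

class RS:
    def __init__(self, op, v1, v2, t1=None, t2=None):
        self.op = op
        self.val = [v1, v2]     # operand values (None while waiting)
        self.tag = [t1, t2]     # producer tags (None when value present)

    def capture(self, tag, value):
        # broadcast snooping: replace a matching pointer by the value
        for i in range(2):
            if self.tag[i] == tag:
                self.val[i], self.tag[i] = value, None

    def dispatchable(self):
        return all(t is None for t in self.tag)

    def execute(self):
        a, b = self.val
        return {"add": a + b, "sub": a - b, "div": a / b}[self.op]

div = RS("div", 8, 2)               # I1: x = a / b, operands available
add = RS("add", None, 5, t1="div")  # I2: y = x + z, waits for div's result

assert div.dispatchable() and not add.dispatchable()
result = div.execute()              # dispatch I1
add.capture("div", result)          # broadcast of I1's result (tag "div")
```

After the broadcast, the add station holds both values and can be dispatched; the true RAW dependency was resolved purely by the tag pointer, without going through the register file.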
Scoreboard (Thornton algorithm)
• The true data dependencies in a superscalar processor can also be
solved solely via the register file.
• This is the basic idea of scoreboarding, and therefore the principle
is very simple.
• It is a central method within a microarchitecture for controlling the
instruction sequence according to the data dependencies.
• Registers which are in use are marked by a scoreboard bit. A register
is marked as in use if it is the destination of an instruction.
• Only free registers are available for read or write operations. This is
a very simple solution for solving data dependencies.
Scoreboard (Thornton algorithm)
[Figure: register file R0 R1 R2 … Ri … Rn with a parallel scoreboard bit vector (e.g. 0 0 1 … 0 …); the length of the scoreboard bit vector is the same as the length of the register file.]
• The scoreboard bit is set at the instruction issue point of the pipeline.
• It is set at the request for a destination register and is reset after the write back phase.
• Each instruction is checked for conflicts between its source operands or destination register and registers marked "in use".
• In case of a conflict, the instruction will be delayed until the scoreboard bit is reset. With this simple method, a RAW-conflict is solved.
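The scoreboard rule above amounts to a handful of bit operations. The following is an illustrative sketch (register count, function names, and the instruction used are invented):

```python
# Sketch of Thornton-style scoreboarding: one in-use bit per register.
# The bit is set when the register is claimed as a destination at issue
# and reset after write back; an instruction stalls while any of its
# registers is marked in use.

scoreboard = [0] * 8    # one scoreboard bit per register R0..R7

def can_issue(dst, srcs):
    return scoreboard[dst] == 0 and all(scoreboard[s] == 0 for s in srcs)

def issue(dst):
    scoreboard[dst] = 1     # claim the destination register

def write_back(dst):
    scoreboard[dst] = 0     # release after the write back phase

issue(2)                         # I1 writes R2
stalled = not can_issue(3, [2])  # I2 reads R2 -> RAW conflict, must wait
write_back(2)                    # I1 finishes
ok = can_issue(3, [2])           # now I2 may issue
```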
State graph of the scoreboard method
[Figure: state graph of the scoreboard bit for register Ri: state 0 (Ri free/unused) changes to 1 when Ri becomes the address of a destination operand; state 1 (Ri occupied/in use) changes back to 0 when write back to Ri is finished, unless Ri immediately becomes the address of another destination operand.]
Scoreboard logic
[Figure: scoreboard logic in the pipeline: the instruction word (OPC, R, S1, S2) drives the scoreboard logic, which sets scoreboard bit n for destination register R at issue; the RF READ stage delivers the source operands (S1, S2) to the EX stage, and the RF WRITE stage writes the result to R and resets scoreboard bit n.]
Instruction window organization
• Centralized window, single stage
• Decentralized windows, single stage
• Centralized or decentralized windows, two stages
Execution
• Out-of-order execution of the instructions in mostly parallel execution units
• Results are stored in the rename buffers or physical registers
• Execution units can be
• single cycle units (execution takes a single clock cycle), latency = throughput = 1
• multiple cycle units (execution takes multiple clock cycles), latency > 1
• with pipelining (e.g. arithmetic pipeline), throughput = 1
• without pipelining (e.g. load-/store-unit - possible cache misses), throughput = 1 / latency
Execution
Load-Store-Units
• Load and store instructions often take different paths inside the load-store unit (wait buffer for stores)
• Store instructions need the address (address calculation) and the value to store, while load instructions only need the address
• Therefore, load instructions are often moved ahead of store instructions, as long as the same address is not concerned
[Figure: load-store unit: the load path bypasses the write buffer in which stores (address and register content) wait.]
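The "same address" condition for moving loads ahead of stores can be sketched as an address check against the write buffer. This is illustrative only; the function name, buffer layout, and addresses are invented.

```python
# Sketch of load/store reordering: a pending load may bypass buffered
# stores only when its address matches none of the buffered store
# addresses; on a match it must wait for (or forward from) the store.

def load_may_bypass(load_addr, store_buffer):
    """store_buffer: list of (address, value) of not-yet-written stores."""
    return all(addr != load_addr for addr, _ in store_buffer)

buffered_stores = [(0x100, 42), (0x208, 7)]

hoist_ok = load_may_bypass(0x300, buffered_stores)   # disjoint address
must_wait = not load_may_bypass(0x208, buffered_stores)  # address conflict
```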
Execution
Load-Store-Units
• A load instruction is completed, as soon as the value to load is written to a buffer register
• A store instruction is completed, as soon as the value is written to the cache
• This cannot be undone!
• So store instructions on a speculative path (branch prediction) cannot be completed before the speculation is confirmed to be true
• Speculative load instructions are not a problem
Execution
Multimedia Units
• perform SIMD operations (subword parallelism)
• the same operation is performed in parallel on the subwords of a register
• graphic-oriented multimedia operations
• arithmetic or logic operations on packed datatypes like e.g. eight 8-bit, four 16-bit or two 32-bit partial words
• pack and unpack operations, mask, conversion and compare operations
• video-oriented multimedia operations
• two to four simultaneous 32-bit floating-point operations
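Subword parallelism on packed datatypes can be illustrated with a packed 8-bit add. This is a behavioral sketch, not a real SIMD instruction set; lane wrap-around (no carry between lanes) is an assumption typical of packed-integer adds.

```python
# Sketch of subword parallelism: one 32-bit "register" holds four packed
# 8-bit lanes; a single SIMD add operates on all lanes at once, with no
# carry propagation across lane boundaries.

def packed_add8(x, y):
    """Add the four 8-bit lanes of two 32-bit words independently."""
    result = 0
    for lane in range(4):
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * lane)   # wrap within the lane
    return result

r = packed_add8(0x01020304, 0x10203040)
```

A scalar add of the same words would let a lane overflow spill into its neighbor; the per-lane mask is exactly what the hardware lane boundaries provide.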
Retire and write back
Retire and write back is responsible for:
• committing or discarding the completed results from execution
• rolling back wrong speculation paths from branch prediction
• restoring the original sequential instruction order
• allowing precise interrupts or exceptions
Some terminology
Completion of an instruction:
• The execution unit has finished the execution of the instruction
• The results are written to temporary buffer registers and are available as operands for data-dependent instructions
• Completion is done out of order
• During completion, the position of the instruction in the original instruction sequence and the current completion state is stored in a reorder buffer
• The completion state might indicate a preceding interrupt/exception or a pending speculation for this instruction
Some terminology
Commitment of an instruction:
• Commitment is done in the original instruction order (in-order)
• A result of an instruction can be commited, if
• execution is completed
• the results of all instructions preceding this instruction in the original instruction order are committed or will be committed within the same clock cycle
• no interrupt/exception occurred before or during execution
• the execution does no longer depend on any speculation
• During commitment the results are written permanently to the architectural
registers
• Committed instructions are removed from the reorder buffer
Some terminology
Removal of an instruction:
• The instruction is removed from the reorder buffer without committing it
• All results of the instruction are discarded
• This is done e.g. in case of misspeculation or a preceding interrupt/exception
Retirement of an instruction:
• The instruction is removed from the reorder buffer with or without committing it (commitment or removal)
Interrupts and exceptions
On an interrupt or exception, the regular program flow is interrupted and an interrupt service routine (exception handler) is called
Classes of interrupts/exceptions:
Aborts: are very fatal and lead to processor shutdown.
Reasons: hardware failures like defective memory cells
Traps: are fatal and normally lead to program termination.
Reasons: arithmetic errors (overflow, underflow, division by 0), privilege violation, invalid opcode, …
Faults: cause the repetition of the last executed instruction after handling.
Reasons: virtual memory management errors like page faults
External interrupts: lead to interrupt handling.
Reasons: interrupts from external devices to indicate the presence of data, or timer events
Software interrupts: lead to interrupt handling.
Reasons: interrupt instruction in the program
Usually, exceptions like aborts, traps or faults have higher priorities than other
interrupts.
Program flow for interrupt/exception handling
[Figure: on an interrupt request, the main program is suspended, the status is saved and the interrupt mask is set; the interrupt routine runs, restores the status, and returns from interrupt to the main program.]
Interrupts and exceptions
An interrupt or exception is called precise if the processor state saved at the start of the interrupt routine is identical to that of a sequential in-order execution on a von Neumann architecture.
For out-of-order execution on a superscalar processor this means:
• all instructions preceding the interrupt causing instruction are
committed and therefore have modified the processor state
• all instructions succeeding the interrupt causing instruction are
removed and therefore have not influenced the processor state
• depending on the interrupt causing instruction, it is either committed
or removed
Precise interrupts and exceptions
The reorder buffer stores the sequential order of the issued instructions and therefore allows result serialization during retirement
The reorder bandwidth is usually identical to the issue bandwidth
Possible reorder buffer organization:
• contains instruction states only
• contains instruction states and results (combination of reorder buffer and rename buffer register)
Alternate reorder techniques:
• checkpoint repair
• history buffer
Reorder buffer
The reorder buffer can be implemented as a ring buffer.
Consecutive completed and non-speculative instructions at the head of the ring buffer can be committed.
Reorder buffer
[Figure: ring-buffer reorder buffer with entries I1 … I6 between head and tail; entries are marked as "instruction issued & result completed", "instruction issued & result completed, based on speculation", "instruction issued", or "empty slot". The completed, non-speculative entries at the head can be committed.]
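The head-of-buffer commit rule can be sketched with a small ring-buffer model. This is illustrative: the entry layout, instruction names, and completion flags are invented for the example.

```python
# Sketch of in-order commitment from a ring-buffer reorder buffer:
# entries retire from the head only while they are completed and
# non-speculative; the first blocking entry stops everything behind it.

from collections import deque

rob = deque([  # head first: (name, completed, speculative)
    ("I1", True,  False),
    ("I2", True,  False),
    ("I3", False, False),   # still executing
    ("I4", True,  True),    # completed, but on a speculative path
])

def commit(rob, bandwidth=4):
    committed = []
    while rob and len(committed) < bandwidth:
        name, completed, speculative = rob[0]
        if not completed or speculative:
            break           # in-order: cannot commit past this entry
        committed.append(name)
        rob.popleft()       # head advances around the ring
    return committed

done = commit(rob)
```

Note that I4 is finished but cannot commit: it sits behind the incomplete I3 and is itself still speculative, exactly the situation shown in the figure.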
• Serialization is done to maintain the sequential von Neumann principle at the architectural level
• Out-of-order commitment is not allowed on today's superscalar processors
• Single exception: moving load instructions ahead of store instructions is allowed on some processors
• From the outside, a superscalar processor looks like a simple von Neumann computer
• This is
– good for program verification
– bad for parallel processing
Why serialization during commitment?
Very Long Instruction Word (VLIW)architecture
In contrast to the superscalar technique, which is a microarchitectural technique, VLIW is an architectural technique.
While in the superscalar technique the parallelism is exploited by hardware, in VLIW this is done by software.
The compiler bundles a fixed set of simple independent instructions, which are stored in a very long instruction word
The processor executes all instructions of this very long instruction word in parallel
Basic principle of VLIW
[Figure: the compiler packs instructions into a very long instruction word (VLIW); the CPU executes the word's instructions on its functional units (FUs) in parallel.]
Some important features of pureVLIW
• Sequential stream of long instruction words
• Length of an instruction word usually between 128 and 1024 bits
• Static scheduling of instructions by the compiler (parallelization at compile time)
• The number of instructions in one VLIW word is fixed
• Instructions in one VLIW word must be independent and contain their own opcodes and operands. All dependencies have to be resolved by the compiler. This restricts the density of VLIW code.
• If the full width of the very long instruction word cannot be exploited, it must be filled with NOOPs
• Only in-order issue is supported, but more than one instruction can be executed in one clock cycle, according to the width of the very long instruction word
• The hardware complexity of the instruction window is very low. Scheduling at runtime is not necessary.
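The compile-time bundling with NOOP padding can be sketched as a greedy packer. This is illustrative only: the bundling function, the tuple encoding of instructions, and the simplified dependency check (destination/source name sets; parallel reads within a word assumed to happen before writes) are invented for the example.

```python
# Sketch of static VLIW scheduling: pack independent instructions into
# fixed-width words; unused slots are filled with NOOPs.

NOOP = ("noop", None, ())

def bundle(instructions, width=4):
    """Greedily pack (op, dst, srcs) tuples into independent words."""
    words, current, written = [], [], set()
    for op, dst, srcs in instructions:
        # dependent on an earlier write in this word (RAW/WAW)?
        depends = any(s in written for s in srcs) or dst in written
        if depends or len(current) == width:
            words.append(current + [NOOP] * (width - len(current)))
            current, written = [], set()
        current.append((op, dst, srcs))
        written.add(dst)
    if current:
        words.append(current + [NOOP] * (width - len(current)))
    return words

code = [("sub", "w", ("a", "b")),
        ("add", "x", ("c", "w")),   # depends on w -> starts a new word
        ("sub", "y", ("d", "e"))]
words = bundle(code)
```

The dependent add forces a word break, so the first word carries three NOOPs: this is the code-density cost mentioned in the bullet above.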
VLIW instruction vs. CISC and SIMD instruction
• Difference to a CISC instruction:
A CISC instruction can code several potentially sequential operations in one instruction, while VLIW contains independent parallel operations
• Difference to a SIMD instruction
SIMD instructions perform a single operation on multiple data elements, while VLIW instructions perform different operations on different data elements
Example of a VLIW machine instruction + execution hardware
Figure: one VLIW machine instruction holding an FP-ALU instruction, an I-ALU instruction, a LOAD/STORE instruction and an address instruction, issued to an FP-ALU, an ALU and the data memory via a multiport register file
Basic structure of a pure VLIW architecture
Figure: the memory unit delivers a stream of very long instruction words (I1 I2 I3 … In); the control unit (CU) issues the instructions to the function units (FU), which exchange operands and results with the register unit via an interconnection unit
Basic structure of a pure VLIW architecture
A VLIW processor contains a number of functional units which can execute machine instructions in parallel, synchronously to the clock cycle.
A VLIW instruction packet contains as many instructions as there are functional units
Ideally, the processor starts a VLIW instruction packet every clock cycle
The instructions of this packet are then fetched, decoded, issued and executed in parallel
All instructions of the packet must have the same execution time
Usually, pipelining is used for each instruction of the packet
=> n parallel pipelines in an n-way VLIW processor
Problems with pure VLIW
• VLIW is a real architecture approach. It is not scalable without recompilation: a new architecture means a new VLIW format, which means a new compilation
• VLIW suffers from branch instructions. Speculative branches cannot be handled by the hardware
• VLIW suffers from memory latencies. A cache miss leads to a stall of all subsequent pipeline stages
• VLIW cannot react to dynamic events. Again, a stall of all subsequent pipeline stages is the consequence
Pure VLIW has a strong 1 : 1 relation to the microarchitecture
Code morphing in VLIW
Code morphing has been introduced to VLIW with the Transmeta Crusoe
processors
This is a hardware-software hybrid solution
A software interpreter transforms sequential machine code to VLIW
instructions at runtime
E.g., ordinary x86 code is "morphed" at runtime to VLIW instructions
By changing the morphing software, any other machine code can be
adapted to the Transmeta Crusoe processors
Decoupling of hardware and software is improved
Execution of legacy code is simplified
Block diagram of VLIW with code morphing level
Figure: the compiler produces a sequential instruction stream for the visible ISA; the code morphing software translates it at runtime into a parallel VLIW instruction stream for the internal VLIW-ISA, executed by the CPU's functional units (FU). Code morphing is done by software.
Principle of Code Morphing
Translation of a "virtual" instruction stream to a "real" instruction stream
Code can be optimized during the translation process in several steps:
• The first translation is performed without optimization, in the so-called lowest execution mode
• Furthermore, the virtual instructions are instrumented to prepare a profile
of the timing behavior
• The prepared profile can initiate an optimization of the program path.
The binary translation is started again and a revised real VLIW instruction
stream is generated
• This procedure can be repeated several times, until an optimized VLIW
code is available at a high level of execution mode.
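The staged re-translation above can be sketched as a feedback loop. This is a hypothetical illustration of the principle only, not Transmeta's actual implementation; the `Morpher` class, the thresholds and the `translate` stand-in are assumptions.

```python
# Hypothetical sketch of staged dynamic binary translation ("code morphing"):
# translate quickly first, count executions as a simple profile, then
# re-translate hot code at a higher execution mode.

def translate(block, opt_level):
    """Stand-in for the binary translator: virtual code -> 'real' VLIW code."""
    return f"vliw({block}, opt={opt_level})"

class Morpher:
    MAX_LEVEL = 3   # highest execution mode (assumption)
    HOT = 10        # execution count per step before re-optimizing (assumption)

    def __init__(self):
        self.cache = {}    # block -> (translated code, opt level)
        self.counts = {}   # block -> execution count (the "profile")

    def execute(self, block):
        code, level = self.cache.get(block, (None, 0))
        if code is None:
            # first encounter: fast, unoptimized translation (lowest mode)
            code = translate(block, 0)
        self.counts[block] = self.counts.get(block, 0) + 1
        # profile-driven re-translation into a higher execution mode
        if self.counts[block] >= self.HOT * (level + 1) and level < self.MAX_LEVEL:
            level += 1
            code = translate(block, level)
        self.cache[block] = (code, level)
        return code

m = Morpher()
for _ in range(25):
    out = m.execute("loop_body")
print(out)  # after 25 runs the block has been re-translated twice
```

The same block is translated repeatedly, each time with more optimization effort, exactly as the bullet points describe.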
VLIW architecture with code morphing by Transmeta
Original goal of Transmeta:
- Fast CPUs with low power consumption on the basis of VLIW
and CMOS
- Reduction of hardware complexity by additional software shell.
The Crusoe architecture consists of a VLIW hardware core and a software shell.
- Code morphing software translates X86 instructions into VLIW
code at runtime.
Basic block diagram of a Crusoe microarchitecture
A very long instruction word (VLIW) is called a „molecule“.
•There are up to 4 atoms (instructions) in one molecule.
•The execution of molecules is in order.
•The issue of the X86 instructions is out-of-order. There exists a binary code translation
•Within the Crusoe processor family, different instruction sets are used.
Figure: a 128-bit molecule holding four atoms (FADD, ADD, LD, BRCC), issued to the floating-point unit, integer ALU #0, the load/store unit and the branch unit
Crusoe features
The processors are optimized for low power consumption.
Crusoe processors are VLIW architectures with an additional code morphing software. The
translation of code is done "on demand" and the result is stored in the cache.
On an instruction cache miss, new code is translated to VLIW code.
This code is then executed until the next cache miss
By separating the hardware from the application, a "virtual programming
environment" is created, which supports:
• regular VLIW code execution
• speculative load and store instructions
• prediction
• code instrumentation for optimization.
EPIC (Explicitly Parallel InstructionComputing)
• Improvement of VLIW by Intel and HP to the IA-64 architecture for 64
bit server processors
• Extended 3-instruction format, similar to 3-way VLIW
• Goal of EPIC: combine the simplicity and high clock frequency of a VLIW processor with the advantages of dynamic scheduling
• The EPIC format allows the compiler to inform the processor directly about instruction level parallelism
• Therefore, an EPIC processor ideally does not have to check for data and control flow dependencies
• This simplifies the microarchitecture compared to a superscalar processor while improving flexibility compared to a VLIW processor
EPIC (Explicitly Parallel Instruction Computing)
Figure: the compiler emits a stream of EPIC instruction bundles with stop markers showing the boundaries of parallel execution; the CPU's dispersal window routes the instructions to the functional units (FU)
EPIC (Explicitly Parallel Instruction Computing)
• An EPIC instruction bundle is 128 bits wide and consists of a compiler generated bundle of 3 IA-64 instructions and so-called template bits
• An IA-64 instruction is 41 bits wide and mainly consists of an opcode, a predicate field, two source and one destination register addresses
• 5 template bits carry information on instruction grouping
• There are no NOOP instructions as in VLIW. Instruction parallelism is given by the template bits. They define whether an instruction can be executed in parallel with the other instructions
• This refers to instructions within the same EPIC bundle and the following EPIC bundles
• Therefore, instructions with data or control flow dependencies can be bundled, improving flexibility compared to VLIW
EPIC (Explicitly Parallel Instruction Computing)
Figure: a sequence of bundles i, i+1, i+2 of IA-64 instructions; stop markers given by the template bits delimit the groups of instructions that can be executed in parallel, and such groups may span bundle boundaries
e.g. in a bundle:
add r1 = r2 + r3
sub r4 = r11 – r2 ;; stop marker
sub r5 = r1 – r10   (dependency on r1 from the first instruction)
Format of a bundle in IA-64
A 128-bit "bundle" consists of: template (5 bit) | instruction (41 bit) | instruction (41 bit) | instruction (41 bit)
The template classifies the instruction types and stop markers.
Example of a sequence of IA-64 bundle templates:

Template | 1. instruction | 2. instruction | 3. instruction
00000    | Memory         | Integer        | Integer
00001    | Memory         | Integer        | Integer ;;
00010    | Memory         | Integer ;;     | Integer
11101    | Memory         | FP             | Branch
11110    | Memory         | FP             | Branch ;;
…        | …              | …              | …
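Forming issue groups from template stop markers can be sketched as follows. The mapping from template bits to stop positions is heavily simplified here and does not reproduce the real IA-64 encoding; it only models where a stop marker falls.

```python
# Sketch: forming issue groups from an IA-64-like bundle stream.
# Each bundle is (template, [i0, i1, i2]); a template is modeled here only
# by the stop positions it implies (simplified assumption, not the real encoding).

# stop position k means: a stop marker (";;") follows slot k of the bundle
TEMPLATE_STOPS = {
    "00000": [],     # M I I    - no stop
    "00001": [2],    # M I I ;; - stop after the bundle
    "00010": [1],    # M I ;; I - stop inside the bundle
    "11110": [2],    # M F B ;;
}

def issue_groups(bundles):
    """Split a bundle stream into groups of instructions that may issue in parallel."""
    groups, current = [], []
    for template, instrs in bundles:
        stops = set(TEMPLATE_STOPS[template])
        for slot, instr in enumerate(instrs):
            current.append(instr)
            if slot in stops:          # stop marker: close the parallel group
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups

stream = [
    ("00010", ["ld", "add", "sub"]),   # stop after 'add'
    ("00001", ["st", "mul", "br"]),    # stop after 'br'
]
print(issue_groups(stream))
```

Note that the second group spans the bundle boundary, which is exactly the flexibility advantage over fixed VLIW words mentioned above.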
Itanium Processor
• Six-way EPIC processor with a ten stage pipeline
• Nine execution units: four ALU/MMX units, two floating point units and three branch units
• Itanium concatenates up to two bundles of independent instructions and executes these instructions in parallel in the pipeline
• Future EPIC processors are able to concatenate more than two bundles
=> in contrast to VLIW, scaling is possible
• Itanium 2 is nearly identical to Itanium, but removes some weaknesses (long cache latencies, faster bus, better X86 emulation)
• Variants of Itanium 2: McKinley (first Itanium 2), Madison (higher clock frequency than McKinley), Deerfield (low power version)
• Montecito is a multicore processor containing two Itanium 2 processor cores
Block diagram of the Itanium processor:
2 bundles (32 bytes) are fetched from the L1 cache; the instruction queue contains 24 IA-64 instructions
9 functional units are available; 6 IA-64 instructions can be issued per clock cycle over the issue ports
Itanium Processor
Functionality of the dispersal window in EPIC
Figure: the bundle stream from the I-cache (e.g. M F I | M I B followed by M I I | M I B) passes through the dispersal window two bundles (first bundle, second bundle) at a time; the dispersed instructions are routed to the functional units M0 M1 I0 I1 F0 F1 B0 B1 B2
Bundle stream from I-Cache
Depending on the available resources, 1 or 2 bundles are fetched from the I-Cache.
Example: if one bundle cannot be mapped completely, only one bundle is fetched from the I-Cache.
Figure: if all instructions of the two bundles at the head of the stream (e.g. M I I and M I B) can be mapped to free functional units (M0 M1 I0 I1 F0 F1 B0 B1 B2), both bundles are dispersed; if the second bundle cannot be mapped completely, only one bundle is dispersed
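The resource check behind this decision can be sketched as follows. The port counts per instruction type and the bundle examples are illustrative assumptions, not the exact Itanium port layout (in particular, the real dispersal rules for mapping instruction types to ports are more involved).

```python
# Sketch: dispersal decides whether a second bundle fits the remaining ports.
# Port counts per instruction type are illustrative assumptions.

PORTS = {"M": 2, "I": 2, "F": 2, "B": 3}

def fits(bundle, free):
    """Check whether every instruction of the bundle can get a free port of its type."""
    need = {}
    for t in bundle:
        need[t] = need.get(t, 0) + 1
    return all(free.get(t, 0) >= n for t, n in need.items())

def disperse(first, second):
    """Return how many bundles can be dispersed this cycle (1 or 2)."""
    free = dict(PORTS)
    for t in first:            # the first bundle is assumed to map completely
        free[t] -= 1
    return 2 if fits(second, free) else 1

print(disperse(["M", "F", "I"], ["M", "I", "B"]))  # second bundle fits -> 2
print(disperse(["M", "I", "I"], ["M", "I", "B"]))  # no I port left     -> 1
```

In the second call the first bundle consumes both I ports, so the second bundle cannot be mapped completely and only one bundle is dispersed, mirroring the example above.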
IA-64 instruction set architecture
The IA-64 instruction set architecture contains:
• A fully predicated instruction set
• Many registers:
• 128 integer registers
• 128 floating point registers
• 64 predicate registers
• 8 branch registers
• Speculative load instructions
Predication model
All instructions of the ISA can refer to one of the 64 predicate registers
Example:
p1, p2 <- cmp (x == y)
(p1) instr
(p2) instr
p2 is complementary to p1
Example for an "if-then-else" statement

a) Traditional architecture
The statement is partitioned into 4 basic blocks by the compiler (the if block, the then block, the else block, and the code after). These blocks have to be executed serially:

inst
inst
...
p1, p2 cmp (x==y)
jump if p2

inst1          ; then
inst2
...
jump

inst3          ; else
inst4
...

inst           ; after
inst
...

b) EPIC architecture
The "then" path will be executed if p1 is true, the "else" path if p2 is true:

inst
inst
...
p1, p2 cmp (x==y)
(p1) inst1
(p1) inst2
...
(p2) inst3
(p2) inst4
...
inst
inst
...

Consequence: the conditional branch is parallelized in a simple way.
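The effect of predication can be mimicked in software: both paths sit in one straight-line stream and each "instruction" commits only when its predicate is true. A minimal sketch, assuming a toy register file; this is an illustration of the idea, not IA-64 semantics.

```python
# Sketch: predicated execution of an if-then-else.
# Both paths appear in one straight-line stream; each instruction
# commits its result only if its guarding predicate is true.

def predicated_if(x, y, regs):
    p1 = (x == y)          # p1, p2 <- cmp (x == y)
    p2 = not p1            # p2 is complementary to p1
    # then-path, guarded by p1:
    if p1: regs["r1"] = regs["r2"] + regs["r3"]   # (p1) inst1
    if p1: regs["r4"] = regs["r1"] * 2            # (p1) inst2
    # else-path, guarded by p2:
    if p2: regs["r1"] = regs["r2"] - regs["r3"]   # (p2) inst3
    if p2: regs["r4"] = regs["r1"] + 1            # (p2) inst4
    return regs

print(predicated_if(1, 1, {"r2": 5, "r3": 2, "r1": 0, "r4": 0}))
```

No jump separates the paths: the hardware can fetch and issue all four guarded instructions as one stream, which is how the conditional branch is parallelized.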
Speculative load instructions
• Speculative load (hoisting) means a load instruction is speculatively executed in advance of a branch instruction (before the affiliated basic block)
• This allows a reduction of memory latency and therefore an increase of the ILP degree
• A new speculative load (ld.s) instruction is introduced, which initiates a speculative fetch from memory
• A check.s instruction is used to verify the speculation
Example

Traditional architecture:
inst
inst
...
jump        <- barrier
...
load
inst
...

EPIC architecture (IA-64):
ld.s        <- load hoisted above the barrier
inst
inst
...
jump
...
check.s
inst
...

In a traditional architecture, the load can be shifted only up to the barrier (the border of the basic block).
Another example

• without control speculation (cycles 1, 2, 3):
(p1) br.cond target1
ld4 r1=[r5] ;;
add r2=r1, r3

• with control speculation, the load is hoisted to cycle 1; branch, check and use follow at cycles n, n+1, n+2:
ld4.s r1=[r5] ;;
... maybe other instructions ...
(p1) br.cond target1
chk.s r1
add r2=r1, r3
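The division of labor between ld.s and chk.s can be sketched as deferred-exception handling. The NaT ("Not a Thing") marker is real IA-64 terminology, but this model and the recovery callback are simplified assumptions for illustration.

```python
# Sketch: speculative load with deferred exceptions (the ld.s / chk.s idea).
# A faulting speculative load does not trap; it marks the target register
# with a NaT token. chk.s later triggers recovery only if the value is used.

NAT = object()  # "Not a Thing" marker for a deferred fault

def ld_s(memory, addr):
    """Speculative load: defer the fault instead of raising it."""
    return memory.get(addr, NAT)

def chk_s(value, recover):
    """Check the speculation: run recovery code if the load had faulted."""
    return recover() if value is NAT else value

memory = {0x100: 42}

r1 = ld_s(memory, 0x100)        # hoisted above the branch
r1 = chk_s(r1, lambda: -1)      # on the taken path: speculation was fine
print(r1)                       # 42

r1 = ld_s(memory, 0x999)        # this load would have faulted
r1 = chk_s(r1, lambda: -1)      # chk.s hands control to the recovery code
print(r1)                       # -1
```

Because the fault is deferred, the load can be hoisted across the branch safely: if the branch takes the other path, the poisoned register is simply never checked.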
Comparing superscalar, VLIW and EPIC
• All three techniques aim to improve performance through concurrent execution units
• Ideally, as many instructions as there are execution units should be executed in one clock cycle
• Architecture versus microarchitecture approach:
• VLIW and EPIC are architecture approaches
• Superscalar is a microarchitecture approach
• Instruction scheduling and conflict avoidance:
• VLIW/EPIC: the compiler schedules the assignment of instructions to execution units and takes care to avoid conflicts
• In a superscalar processor, this is done by hardware
=> VLIW/EPIC puts higher demands on the compiler than superscalar
Comparing superscalar, VLIW and EPIC
• Compiler optimization:
• all three techniques require an optimizing compiler
• the VLIW and EPIC compiler additionally has to take memory access time into account
• superscalar memory access is managed by the load/store unit
• often the same optimization strategies can be used in all three cases
• Instruction ordering:
• a superscalar processor feeds its execution units from a single stream of simple instructions
• a VLIW processor uses an instruction stream of instruction packages (tuples of simple instructions)
• EPIC can bundle dependent instructions. Template bits have to be checked by the processor. Several bundles of independent instructions can be executed concurrently. Therefore, EPIC is a hybrid of superscalar and VLIW
Comparing superscalar, VLIW and EPIC
• Reaction to runtime events: VLIW is not as flexible as superscalar
• Memory organization: superscalar can support memory hierarchies much better than VLIW
• Branch prediction and speculation:
• dynamic branch prediction is a standard technique in current superscalar processors
• impossible in VLIW, hard to realize in EPIC
• Code density:
• VLIW has a fixed instruction format => code density is lower than in superscalar processors if the available instruction level parallelism is insufficient to fill the VLIW instruction package
• EPIC doesn't have this drawback, but the template bits produce some overhead
Comparing superscalar, VLIW and EPIC
• Reachable performance and fields of application:
• comparable performance of all three techniques under ideal conditions
• the simplicity of VLIW processors allows a higher clock frequency compared to superscalar
• VLIW is preferable for code with a high degree of parallelism, e.g. for signal processing
• general purpose applications, e.g. text processing, compilers or games, have a lower degree of parallelism and a higher degree of dynamics, thus favoring superscalar
• EPIC combines VLIW and superscalar, thus avoiding the inelasticity of VLIW and the issue complexity of superscalar