13
.1 Multiprocessor on a Chip & Simultaneous Multi- threads [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

1 Multiprocessor on a Chip & Simultaneous Multi-threads [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Embed Size (px)

Citation preview

.1

Multiprocessor on a Chip &

Simultaneous Multi-threads

[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

.2

Review: Multiprocessor Basics

# of Proc

Communication model

Message passing 8 to 2048

Shared address

NUMA 8 to 256

UMA 2 to 64

Physical connection

Network 8 to 256

Bus 2 to 36

• Q1 – How do they share data?• Q2 – How do they coordinate?• Q3 – How scalable is the architecture? How many

processors?

.3

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs.

Multi-core: Integrating several processors that (partially) share a memory system on the same chip

.4

Recall: Bypass network prevents stalls

Instead of bypass: Interleave threads on the pipeline to prevent stalls ...

.5

CMP: Multiprocessors On One Chip• By placing multiple processors, their memories and the

Interface all on one chip, the latencies of chip-to-chip communication are drastically reduced– ARM multi-chip core

Snoop Control Unit

CPUL1$s

CPUL1$s

CPUL1$s

CPUL1$s

Interrupt Distributor

CPUInterface

CPUInterface

CPUInterface

CPUInterface

Per-CPU aliased peripherals

Configurable between 1 & 4 symmetric CPUs

Private peripheral

bus

Configurable # of hardware intr

Primary AXI R/W 64-b bus Optional AXI R/W 64-b bus

Private IRQ

.6

Multithreading on A Chip• Find a way to “hide” true data dependency stalls, cache

miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions

• Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor– Processor must duplicate the state hardware for each thread – a

separate register file, PC, instruction buffer, and store buffer for each thread

– The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly)

– The memory can be shared through virtual memory mechanisms– Hardware must support efficient thread context switching

.7

Types of Multithreading• Fine-grain – switch threads on every instruction issue

– Round-robin thread interleaving (skipping stalled threads)– Processor must be able to switch threads on every clock cycle– Advantage – can hide throughput losses that come from both

short and long stalls– Disadvantage – slows down the execution of an individual

thread since a thread that is ready to execute without stalls is delayed by instructions from other threads

• Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses)– Advantages – thread switching doesn’t have to be essentially

free and much less likely to slow down the execution of an individual thread

– Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss

• Pipeline must be flushed and refilled on thread switches

.8

Multithreaded Example: Sun’s Niagara (UltraSparc T1)

• Eight fine grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction)

Ultra III Niagara

Data width 64-b 64-b

Clock rate 1.2 GHz 1.0 GHz

Cache (I/D/L2)

32K/64K/ (8M external)

16K/8K/3M

Issue rate 4 issue 1 issue

Pipe stages 14 stages 6 stages

BHT entries 16K x 2-b None

TLB entries 128I/512D 64I/64D

Memory BW 2.4 GB/s ~20GB/s

Transistors 29 million 200 million

Power (max) 53 W <60 W4-

way

MT

SP

AR

C p

ipe

4-w

ay M

T S

PA

RC

pip

e

4-w

ay M

T S

PA

RC

pip

e

4-w

ay M

T S

PA

RC

pip

e

4-w

ay M

T S

PA

RC

pip

e

4-w

ay M

T S

PA

RC

pip

e

4-w

ay M

T S

PA

RC

pip

e

4-w

ay M

T S

PA

RC

pip

e

Crossbar

4-way banked L2$

Memory controllers

I/Osharedfunct’s

.9

Niagara Integer Pipeline

• Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient

Fetch Thrd Sel Decode Execute Memory WB

I$

ITLB

Inst bufx4

PC logicx4

Decode

RegFilex4

Thread Select Logic

ALU Mul Shft Div

D$

DTLB Stbufx4

Thrd Sel Mux

Thrd Sel Mux

Crossbar Interface

Instr typeCache missesTraps & interruptsResource conflicts

From MPR, Vol. 18, #9, Sept. 2004

.10

Simultaneous Multithreading (SMT)

• A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (Super Scalar) to exploit both program ILP and thread-level parallelism (TLP)– Most Super Scalar processors have more machine level

parallelism than most programs can effectively use (i.e., than have ILP)

– With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them

• Need separate rename tables (ROBs) for each thread• Need the capability to commit from multiple threads (i.e.,

from multiple ROBs) in one cycle

• Intel’s Pentium 4 SMT called hyperthreading– Supports just two threads (doubles the architecture state)

.11

Threading on a 4-way SS Processor Example

Thread A Thread B

Thread C Thread D

Tim

e →

Issue slots →SMTFine MTCoarse MT

.12

Multicore Xbox360 – “Xenon” processor

• To provide game developers with a balanced and powerful platform– Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache– 165M transistors total– 3.2 Ghz Near-POWER ISA– 2-issue, 21 stage pipeline, with 128 128-bit registers– Weak branch prediction – supported by software hinting– In order instructions– Narrow cores – 2 INT units, 2 128-bit VMX (colloquial term for

SIMD) units, 1 of anything else

• A 500MZ GPU w/ 512MB of DDR3DRAM– 337M transistors, 10MB framebuffer– 48 pixel shader cores, each with 4 ALUs

.13

Xenon Diagram

Core 0

L1D L1I

Core 1

L1D L1I

Core 2

L1D L1I

1MB UL2

512MB DRAM

GPUBIU/IO Intf

3D Core

10MBEDRAM

VideoOut

MC

0M

C1

AnalogChip

XM

A D

ecS

MC

DVDHDD PortFront USBs (2)WirelessMU ports (2 USBs)Rear USB (1)EthernetIRAudio OutFlashSystems Control

Video Out