
Page 1: Multithreaded Architectures


Multithreaded Architectures

Uri Weiser and Yoav Etsion

Page 2: Multithreaded Architectures


Overview

- Multithreaded Software
- Multithreaded Architecture
- Multithreaded Micro-Architecture
- Conclusions

Page 3: Multithreaded Architectures

Multithreading Basics

- Process
  - Each process has its unique address space
  - Can consist of several threads
- Thread – each thread has its unique execution context (a minimal C sketch follows below)
  - A thread has its own PC (sequencer) + registers + stack
  - All threads (within a process) share the same address space
  - A private heap is optional
- Multithreaded apps: a process broken into threads
  - #1 example: transactions (databases, web servers)
- Increased concurrency
  - Partial blocking (your transaction blocks, mine doesn't have to)
  - Centralized (smarter) resource management (by process)
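To ground these definitions, here is a minimal C sketch (my illustration, not from the slides): two POSIX threads run inside one process, share its global data, and each keeps a private stack.

    #include <pthread.h>
    #include <stdio.h>

    int shared_counter = 0;              /* lives in the process's single address space */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        int local = 0;                   /* stack variable: private to this thread */
        for (int i = 0; i < 1000; i++) {
            local++;
            pthread_mutex_lock(&lock);
            shared_counter++;            /* visible to all threads in the process */
            pthread_mutex_unlock(&lock);
        }
        printf("thread %ld: local=%d\n", (long)arg, local);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1);
        pthread_create(&t2, NULL, worker, (void *)2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared_counter=%d\n", shared_counter);  /* 2000: both threads saw it */
        return 0;
    }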

Page 4: Multithreaded Architectures

Superscalar Execution – example: 5 stages, 4-wide machine

[Pipeline diagram: each cycle, four instructions are fetched together and flow through the F, D1, D2, EX/RET, and WB stages.]

Page 5: Multithreaded Architectures


Superscalar Execution

Generally increases throughput, but decreases utilization

[Diagram: issue slots over time; many slots remain unused.]

Page 6: Multithreaded Architectures

CMP – Chip Multiprocessor: multiple threads/tasks run on multiple processors/cores

[Diagram: issue slots over time on multiple cores.]

Low utilization / higher throughput

Page 7: Multithreaded Architectures


Multiple Hardware Threads

- A thread can be viewed as a stream of instructions
  - State is represented by the PC, stack pointer, and GP registers
- Equip multithreaded processors with multiple hardware contexts (threads)
  - The OS views the CPU as multiple logical processors (see the sketch below)
- Execution of instructions from different threads is interleaved in the pipeline
  - The interleaving policy is critical...
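As a small illustration (assuming a POSIX system; not part of the slides), the OS-visible effect is that each hardware context shows up as one logical CPU:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* The OS counts each hardware thread (context) as a logical processor. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("online logical processors: %ld\n", n);
        return 0;
    }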

Page 8: Multithreaded Architectures

Blocked Multithreading – multiple threads/tasks run on a single core

May increase utilization and throughput, but must switch when the current thread enters a low-utilization/throughput section (e.g. an L2 cache miss)

[Diagram: one core's issue slots over time, switching between threads.]

How is this different from OS scheduling?

Page 9: Multithreaded Architectures

Fine-Grained Multithreading – multiple threads/tasks run on a single core

Increases utilization/throughput by reducing the impact of dependences

[Diagram: issue slots over time.]

Page 10: Multithreaded Architectures

Simultaneous Multithreading (SMT) – multiple threads/tasks run on a single core

[Diagram: issue slots over time.]

Increases utilization/throughput

Page 11: Multithreaded Architectures

Blocked Multithreading (SoE-MT – Switch-on-Event MT, aka “Poor Man's MT”)

- Critical decision: when to switch threads
  - When the current thread's utilization/throughput is about to drop (e.g. an L2 cache miss)
- Requirements for throughput (a worked check follows this list):
  - (Thread switch) + (pipe-fill time) << blocking latency
  - Would like to get some work done before the other thread comes back
  - Fast thread switch: multiple register banks
  - Fast pipe fill: short pipe
- Advantage: small changes to existing hardware
- Drawback: high single-thread performance requires a long thread switch
- Examples
  - Macro-dataflow machine
  - MIT Alewife
  - IBM Northstar
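A back-of-the-envelope check of the throughput requirement above, with illustrative numbers of my own choosing:

    #include <stdio.h>

    int main(void) {
        /* Illustrative numbers: does switching pay off on an L2 miss? */
        int switch_cost  = 5;    /* cycles to swap register bank / redirect fetch */
        int pipe_fill    = 7;    /* cycles until the new thread fills the pipe    */
        int miss_latency = 200;  /* cycles the blocked thread waits on memory     */

        int overhead = switch_cost + pipe_fill;
        int useful   = miss_latency - overhead;   /* cycles of work we reclaim */
        printf("overhead %d << latency %d: reclaim %d cycles (%.0f%%)\n",
               overhead, miss_latency, useful, 100.0 * useful / miss_latency);
        return 0;
    }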

Page 12: Multithreaded Architectures

Interleaved (Fine-Grained) Multithreading

- Critical decision: none?
- Requirements for throughput:
  - Enough threads to hide data-access latencies and dependences
  - Increasing the number of threads reduces single-thread performance

[Diagram: five-stage (F D E M W) pipelines. In the superscalar pipeline, consecutive instructions I1, I2 of the same thread enter on consecutive cycles; in the multithreaded pipeline, fetch alternates threads, so I1 of T1 is followed by I1 of T2, then I2 of T1, and so on. A toy sketch of the interleave follows.]
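A toy round-robin interleave (my sketch, not the slides' design) mirrors the multithreaded pipeline above: each cycle, fetch moves to the next thread, so adjacent instructions in the pipe never belong to the same thread.

    #include <stdio.h>

    #define NTHREADS 4

    int main(void) {
        int pc[NTHREADS] = {0};   /* one sequencer (PC) per hardware thread */
        int next = 0;             /* round-robin pointer */
        for (int cycle = 0; cycle < 8; cycle++) {
            int t = next;
            next = (next + 1) % NTHREADS;
            printf("cycle %d: fetch I%d of T%d\n", cycle, pc[t], t);
            pc[t]++;              /* advance that thread's instruction stream */
        }
        return 0;
    }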

Page 13: Multithreaded Architectures

Interleaved (Fine-Grained) Multithreading (cont.)

- Advantages (with a flexible interleave):
  - Reasonable single-thread performance
  - High processor utilization (especially with many threads)
- Drawbacks:
  - Complicated hardware
  - Multiple contexts (states)
  - (With an inflexible interleave:) limits single-thread performance
- Examples:
  - HEP Denelcor: 8 threads (latencies were shorter then)
  - TERA: 128 threads
  - MicroUnity: 5 threads × 1 GHz = 200 MHz-like latency per thread
- Became attractive for GPUs and network processors

Page 14: Multithreaded Architectures

Simultaneous Multi-threading (SMT)

- Critical decision: fetch-interleaving policy
- Requirements for throughput:
  - Enough threads to utilize resources
  - Fewer than needed to stretch dependences
- Examples:
  - Compaq Alpha EV8 (cancelled)
  - Intel Pentium® 4 Hyper-Threading Technology

Page 15: Multithreaded Architectures

SMT Case Study: EV8

- 8-issue OOO processor (a wide machine)
- SMT support:
  - Multiple sequencers (PCs): 4
  - Alternate fetch from multiple threads
  - Separate register renaming for each thread
  - More (4×) physical registers
  - Thread tags on all shared resources: ROB, LSQ, BTB, etc.
    - Allow per-thread flush/trap/retirement
  - Process tags on all address-space resources: caches, TLBs, etc.
- Notice: none of these things are in the “core”
  - Instruction queues, scheduler, and wakeup are SMT-blind

Page 16: Multithreaded Architectures

Basic EV8

[Pipeline diagram: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire. A single PC drives the Icache; the Register Map feeds the register files and the Dcache. The pipeline is thread-blind.]

Page 17: Multithreaded Architectures

SMT EV8

[Pipeline diagram: the same stages — Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire — but with multiple PCs at Fetch and per-thread Register Maps. Thread-blind? Not really at those ends; the middle of the pipe remains thread-blind.]

Page 18: Multithreaded Architectures

Performance Scalability

[Charts: relative performance for 1T/2T/3T/4T configurations, y-axes 0–250%/0–300%. Multiprogrammed workload: SpecInt, SpecFP, mixed Int/FP. Decomposed SPEC95 applications: Turb3d, Swm256, Tomcatv. Multithreaded applications (1T/2T/4T): Barnes, Chess, Sort, TP.]

Page 19: Multithreaded Architectures

Multithread – “I'm afraid of this”

[Chart: SMT speedup over a 25-Mcycle run of an EV8 simulation (R. Espasa); the speedup fluctuates between roughly 0.4× and 1.6×, sometimes dropping below 1.]

Is CMP a solution?

Page 20: Multithreaded Architectures

Performance Gain – when?

- When threads do not “step” on each other's “predicted state”: caches ($), branch prediction, data prefetch, ...
- If ample resources are available
  - A slow resource can hamper performance

Page 21: Multithreaded Architectures

Scheduling in SMT

- What if one thread gets “stuck”?
  - With round-robin it will eventually fill up the machine (resource contention)
- ICOUNT: the thread with the fewest instructions in the pipe has priority (a sketch follows below)
  - A stuck thread doesn't get to fetch until it gets “unstuck”
- Other policies have been devised to get the best balance, fairness, and overall performance
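A minimal software sketch of the ICOUNT idea (my illustration; the real policy is fetch-stage hardware): each cycle, grant fetch to the unstuck thread with the fewest instructions in flight.

    #include <stdio.h>

    #define NTHREADS 4

    /* in_flight[t] = instructions thread t currently has in the pipe */
    int icount_pick(const int in_flight[NTHREADS], const int stuck[NTHREADS]) {
        int best = -1;
        for (int t = 0; t < NTHREADS; t++) {
            if (stuck[t]) continue;   /* e.g. waiting on an icache miss */
            if (best < 0 || in_flight[t] < in_flight[best])
                best = t;
        }
        return best;   /* thread granted fetch this cycle (-1: none ready) */
    }

    int main(void) {
        int in_flight[NTHREADS] = {12, 3, 30, 7};
        int stuck[NTHREADS]     = { 0, 0,  0, 1};
        printf("fetch from thread %d\n", icount_pick(in_flight, stuck)); /* 1 */
        return 0;
    }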

Page 22: Multithreaded Architectures

Scheduling in SMT

- Variation: what if one thread is spinning?
  - Not really stuck – it gets to keep fetching
  - Have to stop/slow it artificially (QUIESCE)
    - Sleep until a memory location changes
    - On the Pentium 4 – use the PAUSE instruction (see the sketch below)
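For instance, a spin-wait sketch using the x86 PAUSE hint (the `_mm_pause` intrinsic is real; the flag protocol around it is my assumption):

    #include <immintrin.h>   /* _mm_pause */
    #include <stdatomic.h>

    atomic_int flag = 0;     /* some other thread eventually sets this to 1 */

    void spin_wait(void) {
        /* PAUSE tells the core this is a spin loop, so an SMT sibling
           thread gets the shared pipeline resources while we wait. */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            _mm_pause();
    }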

Page 23: Multithreaded Architectures

MT Application Data Sharing

- Shared-memory apps
- Decoupled caches = private caches
  - Easier to implement – use an existing MP-like protocol
- Shared caches
  - Faster communication among threads
  - No coherence overhead
  - Flexibility in allocating resources

[Diagram: a shared cache organized as Tag/Data pairs across ways 0–7.]

Page 24: Multithreaded Architectures

SMT vs. CMP – how to use 2× the transistors?

- Better core + SMT, or CMP?
- SMT
  - Better single-thread performance
  - Better resource utilization
- CMP
  - Much easier to implement
  - Avoids wire-delay problems
  - Overall better throughput per area/power
  - More deterministic performance
  - BUT! the program must be multithreaded
- Rough numbers
  - SMT: ST/MT performance for 2× transistors = 1.3× / 1.5–1.7×
  - CMP: ST/MT performance for 2× transistors = 1× / 2×

[Diagram: Reference – Core A + 1MB L2; SMT – one SMT core + 2MB L2; CMP – Core A + Core B + 2MB L2.]

Page 25: Multithreaded Architectures

Intermediate summary

- Multithreaded Software
- Multithreaded Architecture
  - Advantageous cost/throughput ...
  - Blocked MT – long-latency tolerance
    - Good single-thread performance, good throughput
    - Needs a fast thread switch and a short pipe
  - Interleaved MT – ILP increase
    - Bad single-thread performance, good throughput
    - Needs many threads; several loaded contexts...
  - Simultaneous MT – ILP increase
    - OK MT performance
    - Good single-thread performance
    - Good utilization ... needs fewer threads

Page 26: Multithreaded Architectures

Example: Pentium® 4 Hyper-Threading

- Executes two threads simultaneously
  - Two different applications, or
  - Two threads of the same application
- The CPU maintains architecture state for two processors
  - Two logical processors per physical processor
- Implementation on the Intel® Xeon™ Processor
  - Two logical processors for < 5% additional die area
  - The processor appears as if it were 2 cores in an MP shared-memory system

Page 27: Multithreaded Architectures

uArchitecture impact

- Replicate some resources
  - All per-CPU architectural state
  - Instruction pointers, renaming logic
  - Some smaller resources – e.g. return stack predictor, ITLB, etc.
- Partition some resources (share by splitting in half per thread)
  - Re-order buffer
  - Load/store buffers
  - Queues, etc.
- Share most resources
  - Out-of-order execution engine
  - Caches

Page 28: Multithreaded Architectures

Hyper-Threading Performance

[Chart: throughput on IXP-MP prototype systems with Hyper-Threading enabled, relative to a 1.00 baseline with Hyper-Threading disabled — Microsoft* Active Directory, Microsoft* Exchange, Microsoft* SQL Server, and Microsoft* IIS show speedups between 1.18 and 1.30; a parallel GCC compile shows 1.18.]

Intel Xeon Processor MP platforms are prototype systems in 2-way configurations; applications are not tuned or optimized for Hyper-Threading Technology.
*All trademarks and brands are the property of their respective owners.

- Performance varies as expected with:
  - The number of parallel threads
  - Resource utilization
- Less aggressive MT than EV8
  - Less impressive scalability
  - The machine is optimized for single-thread execution

Page 29: Multithreaded Architectures

Multi-Thread Architecture (SoE)

[Timeline diagram: execution alternates with memory accesses; each memory access costs penalty P, and one execution phase plus one memory access makes up a Period.]

Page 30: Multithreaded Architectures

Multi-Thread Architecture (SoE)

Memory latency is shielded by executing multiple threads.

[Diagram: core C holds several threads' architectural states and connects to memory.]

Page 31: Multithreaded Architectures

Multi-Thread Architecture (SoE)

- CPI_i = ideal CPI of core C
- MI = memory accesses per instruction
- P = memory-access penalty (cycles)
- n = number of threads
- Instructions per memory access = 1/MI

[Diagram: core C, threads' architectural states, connection to memory.]

Page 32: Multithreaded Architectures

Multi-Thread Architecture (SoE)

Period = P + (1/MI) · CPI_i

CPI_1 = (P + (1/MI) · CPI_i) / (1/MI) = CPI_i + MI · P   [cycles/instruction]

(CPI_i = ideal CPI of core C; MI = memory accesses per instruction; P = memory-access penalty in cycles; n = number of threads)

[Diagram: core C, threads' architectural states, connection to memory.]
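To make the formula concrete, a worked example with illustrative numbers of my own (not from the slides): take CPI_i = 1, MI = 0.05 (one memory access per 20 instructions), and P = 200 cycles. Then

\[
CPI_1 = CPI_i + MI \cdot P = 1 + 0.05 \times 200 = 11 \ \text{cycles/instruction},
\]

so a single thread achieves IPC = 1/11 ≈ 0.09 instead of the ideal 1.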

Page 33: Multithreaded Architectures

Multi-Thread Architecture (SoE)

Total IPC with n threads:

IPC_T = n / CPI_1 = n / (CPI_i + MI · P)

(CPI_i = ideal CPI of core C; MI = memory accesses per instruction; P = memory-access penalty in cycles; n = number of threads)

[Diagram: core C, threads' architectural states, connection to memory.]

Page 34: Multithreaded Architectures

Multi-Thread Architecture (SoE)

Memory latency is shielded by executing multiple threads.

[Chart: performance (IPC) vs. number of threads n. What is the maximum performance?]

[Diagram: core C, threads' architectural states, connection to memory.]

Page 35: Multithreaded Architectures

Multi-Thread Architecture (SoE)

Memory latency is shielded by executing multiple threads.

[Chart: performance (IPC) vs. number of threads n; IPC rises with n and saturates at max performance = 1/CPI_i, reached at some thread count n1.]

[Diagram: core C, threads' architectural states, connection to memory.]
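The saturation point follows from the model above (my derivation from the slide's formulas): IPC_T = n / (CPI_i + MI · P) hits the 1/CPI_i ceiling when

\[
\frac{n_1}{CPI_i + MI \cdot P} = \frac{1}{CPI_i}
\quad\Rightarrow\quad
n_1 = 1 + \frac{MI \cdot P}{CPI_i}.
\]

With the illustrative numbers used earlier (CPI_i = 1, MI = 0.05, P = 200), n_1 = 11 threads suffice to hide the memory latency completely.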

Page 36: Multithreaded Architectures

The unified machine – the valley

Three regions: the cache region, the valley, and the MT region.

[Chart: performance vs. threads, showing the MC (cache) region, the valley, and the MT region, with a trend line.]

Perf = n / (CPI_i + MI · MR(n) · P)

where MR(n) is the cache miss rate as a function of the number of threads (a toy model follows).
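A small sketch of this model (the MR(n) curve below is an assumed toy miss-rate function, not from the slides) shows where the valley comes from: adding threads raises the numerator but also raises the miss rate once the working sets overflow the cache.

    #include <stdio.h>

    int main(void) {
        double CPIi = 1.0, MI = 0.05, P = 200.0;
        for (int n = 1; n <= 64; n *= 2) {
            /* Toy assumption: the cache captures the working sets of ~8
               threads; beyond that the miss rate climbs toward 1. */
            double MR = (n <= 8) ? 0.1 : 1.0 - 0.9 * 8.0 / n;
            double perf = n / (CPIi + MI * MR * P);
            printf("n=%2d  MR=%.2f  Perf=%.2f IPC\n", n, MR, perf);
        }
        return 0;
    }

Running it, performance climbs up to n=8 (cache region), dips around n=16 (the valley), then recovers as raw thread count dominates (MT region).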

Page 37: Multithreaded Architectures

Multi-Thread Architecture (SoE) – conclusions

- Applicable when many threads are available
- Requires saving per-thread state
- Can reach “high” performance (~1/CPI_i)
- Bandwidth to memory is high → high power

Page 38: Multithreaded Architectures


BACKUP SLIDES

Page 39: Multithreaded Architectures

The unified machine – parameter: cache size

Increasing the cache size improves the cache's ability to hide memory latency and favors the MC region.

[Chart: performance (GOPS) vs. number of threads (0–40,000) for different cache sizes — no cache, 16M, 32M, 64M, 128M, perfect $; performance grows with cache size. At this point: unlimited BW to memory.]

Page 40: Multithreaded Architectures

Cache Size Impact

[Chart: performance (GOPS) vs. number of threads (0–20,000) for different cache sizes — no $, 16M, 32M, 64M, 128M, perfect $ — with limited BW to memory.]

Increasing the cache size lets the cache suffice for more in-flight threads, extending the $ region.

...AND it is also valuable in the MT region: caches reduce off-chip bandwidth and delay the BW saturation point.

Page 41: Multithreaded Architectures

Memory Latency Impact

Increasing memory latency hinders the MT region and emphasizes the importance of caches.

[Chart: performance (GOPS) vs. number of threads (0–40,000) for different memory latencies — zero latency, 50 cycles, 100 cycles, 300 cycles, 1000 cycles; unlimited BW to memory.]