Bulldozer: An Approach to Multithreaded Compute Performance

By Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas

This paper appears in: Micro, IEEE, March/April 2011 (vol. 31 no. 2), pp. 6-15

Microprocessor Architecture, speaker: 박세준

Contents

1. Motivation
2. Introduction
3. Block diagram
4. Key features
5. Function block highlights
6. Bulldozer-based SoC

Motivation

AMD has been focusing on core count and highly parallel server workloads.

Two basic observations:

1. Future SoCs will support multiple execution threads.
• The smallest possible building module

2. The core would operate in a constrained power environment.
• Power reduction techniques: filtering, speculation reduction, data movement minimization

Performance per watt!

Introduction

Bulldozer is a new direction in microarchitecture

• Bulldozer is the first x86 design to share substantial hardware between multiple cores
• Bulldozer is a hierarchical design with sharing at nearly every level
• Bulldozer is a high-frequency-optimized CPU
• The focus is on increasing average performance rather than peak performance

Introduction

Major contribution

• Scaling the core structures
• Aggressive frequency goal
• Low gates per clock

Block diagram

Bulldozer combines two independent cores into a module
• Implementation of a shared level 2 cache
• Improved area and power efficiency

The module can fetch and decode up to four x86 instructions per clock.

Each core can service two loads per cycle.

Shared frontend
• Decoupled predict and fetch pipelines

Block diagram

• ALU performance: 33% decrease; FPU performance: 33% increase
• ALU performance: 33% increase; FPU performance: 33% increase

Key features

1. Multithreading microarchitecture
• Appropriate use of replication and shared hardware
• The main advantage comes from sharing the instruction cache and branch prediction
• Reinforced frontend (larger ROB, BTB)

2. Decoupled branch prediction from instruction fetch pipelines
• Enables instruction prefetch using the prediction queue
• Instruction control unit (reorder buffer) increased to 128 entries

3. Register renaming and operand delivery
• Scheduling and operand handling are the biggest power consumers in the integer execution unit
• PRF-based renaming microarchitecture for power efficiency
• Eliminates data replication

4. FMAC and media extension
• FMAC (floating-point multiply-accumulate) delivers significant peak execution bandwidth
• One FPU is provided per module and acts like a coprocessor
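The decoupling of branch prediction from instruction fetch (item 2) can be pictured as a producer/consumer pair: the predictor runs ahead and pushes predicted fetch addresses into a queue, and the fetch stage drains it. Addresses still waiting in the queue are natural prefetch candidates. The sketch below is only a toy model of that idea; the class name `PredictionQueue`, its methods, and the queue depth are illustrative assumptions, not details taken from the paper.

```python
from collections import deque


class PredictionQueue:
    """Toy bounded FIFO between a branch predictor and an instruction fetcher."""

    def __init__(self, depth=4):  # depth is an illustrative assumption
        self.depth = depth
        self.entries = deque()

    def push(self, fetch_addr):
        """Predictor side: enqueue a predicted fetch address; False means stall."""
        if len(self.entries) >= self.depth:
            return False
        self.entries.append(fetch_addr)
        return True

    def pop(self):
        """Fetch side: take the oldest predicted address, or None if empty."""
        return self.entries.popleft() if self.entries else None

    def prefetch_candidates(self):
        """Addresses queued but not yet fetched; these could be prefetched."""
        return list(self.entries)
```

With a depth of 2, two pushes succeed and a third returns `False`, which models the predictor stalling until fetch catches up.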

Function block highlights

Branch prediction
• Multilevel BTB

Instruction cache
• 64 Kbyte, two-way set-associative cache shared between both threads
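To make the instruction cache geometry concrete, the following sketch splits a byte address into tag, set index, and line offset for a 64 Kbyte, two-way set-associative cache. The 64-byte line size is an assumption (the slide does not state it), and `split_address` is a hypothetical helper used only for illustration.

```python
CACHE_BYTES = 64 * 1024   # from the slide: 64 Kbyte
WAYS = 2                  # from the slide: two-way set-associative
LINE_BYTES = 64           # assumption: line size is not given on the slide

# Each set holds WAYS lines, so the number of sets follows from the capacity.
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


def split_address(addr):
    """Return (tag, set_index, line_offset) for a byte address."""
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset
```

Under these assumptions the cache has 512 sets, and addresses 32 Kbyte apart (for example 0x0000 and 0x8000) map to the same set with different tags. With only two ways, two such lines can coexist, but a third conflicting line would have to evict one of them, which matters when two threads share the cache.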

Function block highlights

Decode
• Branch fusion (Intel: macro fusion), four x86 instructions per cycle

Bulldozer execution pipeline
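Branch fusion can be illustrated with a small decoder pass that merges a compare immediately followed by a conditional jump into a single operation, so the pair occupies one slot downstream. This is a minimal sketch, not the actual decoder logic: the tuple encoding of instructions, the `fuse_branches` helper, and the sets of fusible opcodes are all assumptions made for illustration.

```python
COMPARES = {"cmp", "test"}                # illustrative set of fusible compares
COND_JUMPS = {"je", "jne", "jl", "jg"}    # illustrative set of conditional jumps


def fuse_branches(instrs):
    """Merge each compare that is immediately followed by a conditional jump."""
    fused = []
    i = 0
    while i < len(instrs):
        op = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if nxt is not None and op[0] in COMPARES and nxt[0] in COND_JUMPS:
            # One fused op carries the compare operands and the jump target.
            fused.append((op[0] + "+" + nxt[0],) + op[1:] + nxt[1:])
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused
```

For example, `[("cmp", "rax", "rbx"), ("jne", "loop"), ("add", "rcx", "1")]` becomes two operations instead of three, with the first being `("cmp+jne", "rax", "rbx", "loop")`.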

Function block highlights

Integer scheduler and execution
• Renaming by PRF (physical register file)

Floating point
• The FPU is a coprocessor shared between the two integer cores

L2 cache
• The two cores share the unified L2 cache
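PRF-based renaming can be sketched as a map table from architectural to physical registers plus a free list. Values live in one physical register file and instructions carry only register numbers, which is how this style avoids copying data between structures. The code below is a generic textbook-style model, not Bulldozer's actual implementation; the `Renamer` class, its method names, and the register counts in the usage note are illustrative assumptions.

```python
class Renamer:
    """Toy PRF-style renamer: a map table plus a free list of physical registers."""

    def __init__(self, arch_regs, num_phys):
        # Each architectural register starts mapped to its own physical register.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dest, srcs):
        """Return (new_dest_phys, src_phys_list, old_dest_phys)."""
        src_phys = [self.map[s] for s in srcs]  # read sources before remapping dest
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new_phys = self.free.pop(0)
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        return new_phys, src_phys, old_phys

    def retire(self, old_phys):
        """When the renaming instruction retires, the previous mapping is recycled."""
        self.free.append(old_phys)
```

With two architectural registers and four physical registers, renaming `rax = rax + rbx` returns `(2, [0, 1], 0)`: the sources read the old mappings, the destination gets physical register 2, and register 0 is freed once the instruction retires.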

Bulldozer-based SoC

Summary

1. Single-thread peak performance is sacrificed to increase throughput
2. For single-threaded work, the FPU is more important
3. Server workloads need ALU performance

Bulldozer can deliver a significant performance improvement within the same power budget.

The end