by Michael Butler, Leslie Barnes, Debjit Das Sarma, Bob Gelinas
This paper appears in: IEEE Micro, March/April 2011 (vol. 31, no. 2), pp. 6-15
Bulldozer: An Approach to Multithreaded Compute Performance
Microprocessor Architecture (speaker: 박세준)
1. Motivation
2. Introduction
3. Block diagram
4. Key features
5. Function block highlights
6. Bulldozer-based SoC
Contents
AMD has been focusing on core count and highly parallel server workloads.
Two basic observations:
1. Future SoCs will support multiple execution threads
• A module is the smallest possible building block
2. Cores will operate in a constrained power environment
• Power reduction techniques: filtering, speculation reduction, data-movement minimization
Performance per watt!!
Motivation
Bulldozer is a new direction in microarchitecture:
• Bulldozer is the first x86 design to share substantial hardware between multiple cores
• Bulldozer is a hierarchical design with sharing at nearly every level
• Bulldozer is a high-frequency-optimized CPU
• Instead of peak performance, average (throughput) performance is increased
Introduction
Major contributions
• Scaling of the core structures
• Aggressive frequency goal
• Low gates per clock
Introduction
It combines two independent cores into a module
• Implementation of a shared level-2 cache
• Improved area and power efficiency
Block diagram
The module can fetch and decode up to four x86 instructions per clock.
Each core can service two loads per cycle.
Shared frontend
• Decoupled predict and fetch pipelines
• ALU performance decreases by 33%; FPU performance increases by 33%
Block diagram
1. Multithreaded microarchitecture
• Appropriate use of replicated and shared hardware
• The main advantage comes from sharing the instruction cache and branch predictor
• Reinforced frontend (larger ROB and BTB)
2. Branch prediction decoupled from the instruction-fetch pipeline
• Enables instruction prefetch using the prediction queue
• The instruction control unit (reorder buffer) grows to 128 entries
3. Register renaming and operand delivery
• The scheduler and operand handling are the biggest power consumers in the integer execution unit
• A PRF-based renaming microarchitecture improves power efficiency
• Eliminates data replication
4. FMAC and media extensions
• FMAC (floating-point multiply-accumulate) units deliver significant peak execution bandwidth
• There is one FPU per module, shared like a coprocessor
Key features
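The PRF-based renaming in item 3 can be sketched in code. This is a minimal illustrative model, not Bulldozer's actual design: the `renamer` structure, the register counts, and the stack-style free list are all assumptions. The point it shows is that values live only in the physical register file, so the scheduler and reorder buffer carry pointers instead of replicated data.

```c
/* Minimal sketch of PRF-based register renaming (illustrative model;
   sizes and structure are assumptions, not Bulldozer's design). */
#include <assert.h>

#define NUM_ARCH 16   /* architectural integer registers */
#define NUM_PHYS 96   /* assumed physical register file size */

typedef struct {
    int map[NUM_ARCH];        /* architectural reg -> physical reg */
    int free_list[NUM_PHYS];  /* stack of free physical registers */
    int free_top;
} renamer;

void renamer_init(renamer *r) {
    /* Identity-map the first NUM_ARCH physical registers,
       put the rest on the free list. */
    for (int a = 0; a < NUM_ARCH; a++) r->map[a] = a;
    r->free_top = 0;
    for (int p = NUM_ARCH; p < NUM_PHYS; p++)
        r->free_list[r->free_top++] = p;
}

/* Rename one instruction "dst = op(src)". Returns the new physical
   destination; reports the physical source and the old mapping of
   dst (freed when the instruction retires). */
int rename_inst(renamer *r, int dst, int src, int *phys_src, int *old_dst) {
    assert(r->free_top > 0);           /* would stall: no free phys reg */
    *phys_src = r->map[src];
    *old_dst  = r->map[dst];
    int p = r->free_list[--r->free_top];
    r->map[dst] = p;                   /* later readers see the new value */
    return p;
}
```

Because only the map entries move, no operand data is copied between pipeline structures, which is the power advantage the slide refers to.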
Branch prediction
Multilevel BTB
Instruction cache
64-Kbyte, two-way set-associative cache, shared between both threads
Function block highlights
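For a 64-Kbyte, two-way set-associative cache, the address decomposition can be worked through in a few lines. The 64-byte line size below is an assumption for illustration (the slide does not state it): 64 KB / (2 ways × 64 B) = 512 sets, giving 6 offset bits and 9 index bits.

```c
/* Address decomposition for a 64-KB, two-way set-associative cache.
   The 64-byte line size is an assumed parameter for illustration. */
#include <stdint.h>

#define LINE_BYTES   64u
#define WAYS         2u
#define CACHE_BYTES  (64u * 1024u)
#define NUM_SETS     (CACHE_BYTES / (WAYS * LINE_BYTES))  /* 512 sets */

#define OFFSET_BITS  6u   /* log2(LINE_BYTES) */
#define INDEX_BITS   9u   /* log2(NUM_SETS)   */

uint32_t cache_offset(uint64_t addr) { return (uint32_t)(addr & (LINE_BYTES - 1)); }
uint32_t cache_index(uint64_t addr)  { return (uint32_t)((addr >> OFFSET_BITS) & (NUM_SETS - 1)); }
uint64_t cache_tag(uint64_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```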
Decode
Branch fusion (Intel: macro-fusion); up to four x86 instructions per cycle
Bulldozer execution pipeline
Function block highlights
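The effect of branch fusion on decode bandwidth can be sketched as follows. This is an assumed toy model, not AMD's actual decoder: it treats a compare immediately followed by a conditional jump as one fused macro-op, so the pair consumes a single decode slot and more instructions fit into the four slots per cycle.

```c
/* Toy model of branch fusion at decode (assumed, for illustration):
   a CMP immediately followed by a conditional jump (JCC) fuses into
   one macro-op and uses one decode slot. */
#include <stddef.h>

typedef enum { OP_ALU, OP_CMP, OP_JCC } opcode;

/* Count the decode slots consumed by n instructions with fusion. */
size_t decode_slots(const opcode *insts, size_t n) {
    size_t slots = 0;
    for (size_t i = 0; i < n; i++) {
        if (insts[i] == OP_CMP && i + 1 < n && insts[i + 1] == OP_JCC)
            i++;        /* CMP+JCC pair fuses: skip the JCC */
        slots++;
    }
    return slots;
}
```

With four decode slots per cycle, each fused pair effectively raises the decode width by one instruction for that cycle.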
Integer scheduler and execution
Renaming via physical register files (PRF)
Floating point
The FPU is a coprocessor shared between the two integer cores
L2 cache
The two cores share a unified L2 cache
Function block highlights
Summary
1. Peak single-thread performance is sacrificed in exchange for higher throughput
2. For single-threaded workloads, FPU performance matters more
3. Server workloads need ALU (integer) performance
Bulldozer can deliver a significant performance improvement at the same power.
Bulldozer-based SoC
The end