1
Ordered Mesh Network Interconnect (OMNI) Design & Implementation of In-Network Coherence Suvinay Subramanian, Advisor: Li-Shiuan Peh Collaborators: Chia-Hsin Owen Chen, Bhavya Daya, Tushar Krishna, Woo-Cheol Kwon, Sunghyun Park 286 386 486 Pentium P2 Celeron P3 P4 Itanium Core Core Nehalem Nehalem Westmere Sandybridge Larrabee TeraFlops SCC Opteron Opteron Barcelona Bulldozer Bulldozer PowerPC Power3 Power4 Power5 Power6 Power7 Cell Niagara TILE64 TILE-GX Octeon III 1 2 4 8 16 32 64 128 1980 1985 1990 1995 2000 2005 2010 2015 Number of Cores Year Uniprocessor Bus Ring Mesh Compute + Communicate • Moore’s Law “The number of transistors incorporated in a chip will approximately double every 24 months” Increased compute density Shift to multi-core processors better performance vis-à-vis power efficiency WE NEED A scalable coherence mechanism to maintain “consistent” view of “shared memory” WE NEED A scalable communication fabric Specification 36 Power-ISA cores 32 KB L1; 128 KB L2 6x6 Mesh network: OMNI Why Snoopy Coherence? Efficient cache-to-cache transfers No directory overhead No indirection latency Snoopy COherent Research Processor with Interconnect Ordering Evaluations Introduction Walkthrough Example Channel Width Number of Virtual Networks Number of Virtual Channels 16 bytes 3 4+2+2 = 8 L2 + Cache Controller 40% Network Interface + Router 10% Core 50% Area Breakdown Average Power Critical Path 1.2 ns The SCORPIO Processor OMNI Design Traditional interconnects (eg: buses) route all messages to central ordering point All nodes see all messages in the same order OMNI: Network provides mechanism for global ordering No central ordering point Idea: Nodes agree on global order at synchronized time intervals Router Microarchitecture Network Interface Block Diagram 0 0.2 0.4 0.6 0.8 1 Normalized Runtime Normalized Runtime of OMNI against directory protocol Technology Node 45 nm commercial Lookahead bypassing provides effective single cycle path per hop when possible Reserved VC (rVC) provides escape path preventing deadlock Filter redundant broadcasts to save bandwidth and power Routers maintain & propagate sharing information

Ordered Mesh Network Interconnect (OMNI) Design & …people.csail.mit.edu/suvinay/pubs/2013.omni.masterworks.pdf · 2017-05-19 · Ordered Mesh Network Interconnect (OMNI) Design

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Ordered Mesh Network Interconnect (OMNI) Design & …people.csail.mit.edu/suvinay/pubs/2013.omni.masterworks.pdf · 2017-05-19 · Ordered Mesh Network Interconnect (OMNI) Design

Ordered Mesh Network Interconnect (OMNI)

Design & Implementation of In-Network Coherence Suvinay Subramanian, Advisor: Li-Shiuan Peh

Collaborators: Chia-Hsin Owen Chen, Bhavya Daya, Tushar Krishna, Woo-Cheol Kwon, Sunghyun Park

286 386 486 Pentium P2 Celeron P3

P4

Itanium

Core

Core Nehalem

Nehalem

Westmere

Sandybridge

Larrabee

TeraFlops

SCC

Opteron

Opteron

Barcelona

Bulldozer

Bulldozer

PowerPC

Power3 Power4

Power5

Power6

Power7

Cell Niagara

TILE64

TILE-GX

Octeon III

1

2

4

8

16

32

64

128

1980 1985 1990 1995 2000 2005 2010 2015

Nu

mb

er o

f C

ore

s

Year

Uniprocessor

Bus

Ring

Mesh

Compute +

Communicate

• Moore’s Law “The number of transistors

incorporated in a chip will approximately

double every 24 months” Increased

compute density

• Shift to multi-core processors better

performance vis-à-vis power efficiency

• WE NEED A scalable coherence

mechanism to maintain “consistent” view of

“shared memory”

• WE NEED A scalable communication fabric

Specification

• 36 Power-ISA cores

• 32 KB L1; 128 KB L2

• 6x6 Mesh network: OMNI

Why Snoopy Coherence?

• Efficient cache-to-cache transfers

• No directory overhead

• No indirection latency

Snoopy COherent Research Processor with Interconnect Ordering

Evaluations

Introduction

Walkthrough Example

Channel Width Number of

Virtual

Networks

Number of

Virtual

Channels

16 bytes 3 4+2+2 = 8

L2 + Cache Controller

40%

Network Interface + Router

10%

Core 50%

Area Breakdown

Average Power

Critical Path 1.2 ns

The SCORPIO Processor OMNI Design

Traditional interconnects

(eg: buses) route all

messages to central

ordering point

All nodes see all messages

in the same order

OMNI: Network provides

mechanism for global

ordering

No central ordering point

Idea: Nodes agree on global

order at synchronized time

intervals

Router Microarchitecture Network Interface Block Diagram

0

0.2

0.4

0.6

0.8

1

Norm

alize

d R

untim

e

Normalized Runtime of OMNI against

directory protocol

Technology Node

45 nm commercial

Lookahead

bypassing

provides effective

single cycle path

per hop when

possible

Reserved VC

(rVC) provides

escape path

preventing

deadlock

Filter redundant broadcasts

to save bandwidth and

power

Routers maintain &

propagate sharing

information