
Tendencias de Uso y Diseño de Redes de Interconexión en Computadores Paralelos (Usage and Design Trends of Interconnection Networks in Parallel Computers) - Prof. Ramón Beivide


Ramón Beivide

Universidad de Cantabria

Tendencias de Uso y Diseño de Redes de Interconexión en Computadores Paralelos

April 14, 2016

Universidad Complutense de Madrid


Outline

1. Introduction

2. Network Basics

3. System networks

4. On-chip networks (NoCs)

5. Some current research

1. Intro: MareNostrum


1. Intro: MareNostrum (BSC), InfiniBand FDR10 non-blocking Folded Clos (up to 40 racks)

[Figure: 40 iDataPlex racks (3360 dx360 M4 nodes) of 36-port FDR10 leaf switches feeding six 648-port Mellanox InfiniBand FDR core switches. Per rack, the leaf switches run 18+18+18+18+12 FDR10 uplinks (3 links from each of the first four leaves, 2 from the fifth, to every core), so each core switch terminates 560 links.]

Latency: 0.7 μs. Bandwidth: 40 Gb/s.
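Reading the link counts off the figure, a quick arithmetic check confirms why the fabric is non-blocking. This is a sketch: the per-rack breakdown of leaves and uplinks is my reading of the diagram, not BSC documentation.

```python
# Sanity check of the MareNostrum folded-Clos figures (numbers read off the
# slide; the per-rack interpretation is an assumption, not BSC documentation).

racks, nodes, cores = 40, 3360, 6
nodes_per_rack = nodes // racks                 # 84 dx360 M4 nodes per rack

# Five 36-port leaf switches per rack: four with 18 nodes + 18 uplinks
# (3 links to each of the 6 core switches) and one with 12 + 12
# (2 links to each core), matching the 18/18/18/18/12 pattern in the figure.
uplinks_per_rack = 4 * 18 + 12                  # = 84

# Non-blocking: every node link is matched by an uplink.
assert uplinks_per_rack == nodes_per_rack

# Links landing on each 648-port core switch -> the "560" in the figure.
print(racks * uplinks_per_rack // cores)        # 560
```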


1. Intro: InfiniBand core switches

1. Intro: Cost dominated by (optical) wires



1. Intro: Blades


1. Intro: Blades


1. Intro: Multicore E5-2670 Xeon Processor

1. Intro: A row of servers in a Google DataCenter, 2012.



1. Intro: WSC array: rack-mountable boards or blades + rack router

Figure 1.1: Sketch of the typical elements in warehouse-scale systems: 1U server (left), 7’ rack with Ethernet switch (middle), and diagram of a small cluster with a cluster-level Ethernet switch/router (right).


1. Intro: WSC hierarchy

1. Intro: Cray Cascade (XC30, XC40)


1. Intro: Cray Cascade (XC30, XC40)


1. Intro: An Architectural Model

[Figure: architectural model. CPU1…CPUn access memories M1…Mn through load/store (L/S) interfaces, address translation units (ATU) and send/receive (S/R) network interfaces attached to the interconnection network.]

1. Intro: What we need for one ExaFlop/s


Networks are pervasive and critical components in Supercomputers, Datacenters, Servers and Mobile Computers.

Complexity is moving from system networks towards on-chip networks: fewer nodes, but more complex ones.


Outline

1. Introduction

2. Network Basics

Crossbars & Routers

Direct vs Indirect Networks

3. System networks

4. On-chip networks (NoCs)

5. Some current research


2. Network Basics

All networks are based on crossbar switches

• Switch complexity increases quadratically with the number of crossbar input/output ports, N, i.e., grows as O(N²)

• Has the property of being non-blocking (all N! I/O permutations are realizable)
• Bidirectional for exploiting communication locality
• Minimize latency & maximize throughput

[Figure: an 8 × 8 crossbar connecting input ports 0–7 to output ports 0–7.]
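To make the quadratic growth concrete, here is a minimal sketch of how crossbar cost scales with port count (the crosspoint-counting model is the standard one; the function name is mine):

```python
def crosspoints(n):
    """An n x n crossbar dedicates one crosspoint to every input/output
    pair, so cost grows as O(n^2) with the port count."""
    return n * n

# Doubling the radix quadruples the silicon cost:
for n in (8, 16, 32, 64):
    print(f"{n:2d} ports -> {crosspoints(n):5d} crosspoints")
```

Because each input owns a dedicated crosspoint to every output, any of the N! input/output permutations can be set up without conflicts, which is exactly the non-blocking property listed above.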


2. Blocking vs. Non-blocking

• Reduced cost comes at the price of performance:
– Some networks have the property of being blocking (not all N! permutations are realizable)
– Contention is more likely to occur on network links: paths from different sources to different destinations may share one or more links

[Figure: a blocking topology, where two flows between ports 0–7 contend for a shared link (marked X), vs. a non-blocking topology.]
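The blocking behaviour is easy to reproduce. The slide does not name the blocking topology, so the sketch below assumes an 8-port omega-style multistage network with destination-tag routing (a standard blocking MIN); it exhibits two flows with distinct sources and distinct destinations that still collide on shared links:

```python
def omega_path(src, dst, n_bits=3):
    """Links (stage, position) used by a packet under destination-tag
    routing in a 2-ary omega network with N = 2**n_bits ports."""
    mask = (1 << n_bits) - 1
    addr, path = src, []
    for stage in range(n_bits):
        addr = ((addr << 1) | (addr >> (n_bits - 1))) & mask  # perfect shuffle
        bit = (dst >> (n_bits - 1 - stage)) & 1               # destination tag
        addr = (addr & ~1) | bit                              # 2x2 switch exit
        path.append((stage, addr))
    return path

# Distinct sources (0 and 4) and distinct destinations (0 and 1),
# yet the two packets need the same links: the topology blocks.
shared = set(omega_path(0, 0)) & set(omega_path(4, 1))
print(shared)   # {(0, 0), (1, 0)} -> contention on shared links
```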


2. Switch or Router Microarchitecture

[Figure: canonical switch microarchitecture. Incoming physical channels pass through link control and are demultiplexed into input buffers; a routing control unit looks up header flits in a forwarding table, an arbitration unit grants output ports and drives the crossbar control, and the crossbar feeds multiplexed output buffers and outgoing link controllers (Stages 1–5).]

Pipeline stages: IB (Input Buffering), RC (Route Computation), SA (Switch Arbitration), ST (Switch Traversal), OB (Output Buffering).

[Figure: pipeline timing diagram. The packet header advances through IB, RC, SA, ST and OB; the three payload fragments wait in input buffers and follow through ST and OB one cycle apart.]

Pipelined Switch Microarchitecture

The goal is to match the throughput of the internal switch datapath to the external link bandwidth.
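A small sketch of the timing implied by the diagram (single-cycle stages and one flit per cycle through the crossbar are assumptions; the stage names are the slide's):

```python
def pipeline(n_body=3):
    """Cycle-by-cycle stage occupancy for a header flit plus n_body payload
    fragments in the 5-stage router (IB, RC, SA, ST, OB). Body flits skip
    RC/SA, waiting in the input buffer until the crossbar frees up one
    cycle after the preceding flit's traversal."""
    rows = {"header": ["IB", "RC", "SA", "ST", "OB"]}
    for i in range(1, n_body + 1):
        enter, st = i, 3 + i          # header traverses the switch at cycle 3
        rows[f"payload{i}"] = ["--"] * enter + ["IB"] * (st - enter) + ["ST", "OB"]
    width = max(len(r) for r in rows.values())
    for name, r in rows.items():
        print(f"{name:9s}", " ".join(f"{s:>2s}" for s in r + ["--"] * (width - len(r))))

pipeline()
```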


2. Network Organization

Indirect (Centralized) and Direct (Distributed) Networks

2. Earlier Myrinet core switches (Indirect, Centralized)


2. IBM BG/Q (Direct, Distributed)


2. Network Organization

[Figure: a 64-node system with 8-port switches and concentration c = 4, vs. a 32-node system with 8-port switches.]

• As crossbars do not scale, they need to be interconnected to service an increasing number of endpoints.
• Direct (Distributed) vs Indirect (Centralized) Networks
• Concentration can be used to reduce network costs (see the sketch below):
– "c" end nodes connect to each switch
– Allows larger systems to be built from fewer switches and links
– Requires larger switch degree
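The trade-off is simple port arithmetic; a sketch matching the figure's two examples (the helper name and the 8-port budget split are mine):

```python
def concentrated(endpoints, c, switch_ports=8):
    """With concentration c, each switch dedicates c ports to end nodes
    and the remaining ports to the network, so only endpoints/c switches
    are needed (at the price of a larger effective switch degree)."""
    switches = endpoints // c
    network_ports = switch_ports - c
    return switches, network_ports

# 64-node system, 8-port switches, c = 4: 16 switches, 4 network ports each.
print(concentrated(64, 4))   # (16, 4)
# 32-node system, 8-port switches, c = 1: 32 switches for half the nodes.
print(concentrated(32, 1))   # (32, 7)
```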


Outline

1. Introduction

2. Network Basics

3. System networks

Folded Clos

Tori

Dragonflies

4. On-chip networks (NoCs)

5. Some current research

3. MareNostrum (BSC), InfiniBand FDR10 non-blocking Folded Clos (up to 40 racks)

[Figure: the same folded-Clos diagram shown in the introduction: 40 iDataPlex racks (3360 dx360 M4 nodes), 36-port FDR10 leaf switches, and six 648-port core switches terminating 560 links each. Latency: 0.7 μs. Bandwidth: 40 Gb/s.]


3. Network Topology

Centralized Switched (Indirect) Networks

16-port crossbar network

[Figure: a single 16 × 16 crossbar connecting ports 0–15.]


3. Network Topology

Centralized Switched (Indirect) Networks

16-port, 3-stage Clos network

[Figure: three stages of small crossbars connecting ports 0–15 on each side.]


3. Network Topology

Centralized Switched (Indirect) Networks

16-port, 5-stage Clos network

[Figure: five stages of small crossbars connecting ports 0–15 on each side.]


3. Network Topology

Centralized Switched (Indirect) Networks

16-port, 7-stage Clos network = Beneš topology

[Figure: seven stages of 2 × 2 switches connecting ports 0–15 on each side.]


3. Network Topology

• Bidirectional MINs
• Increase modularity
• Reduce hop count, d
• Folded Clos network:
– Nodes at tree leaves
– Switches at tree vertices
– Total link bandwidth is constant across all tree levels, with full bisection bandwidth

Folded Clos = folded Beneš ≠ fat-tree network!

Centralized Switched (Indirect) Networks

[Figure: folded Clos on 16 end nodes (ports 0–15), with the network bisection marked.]
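Counting switches makes the Clos progression concrete. A sketch, assuming 2 × 2 building blocks as in the 16-port, 7-stage Beneš just shown:

```python
from math import log2

def benes(n_ports, r=2):
    """An n-port Benes network built from r x r crossbars has
    2*log_r(n) - 1 stages of n/r switches each (n a power of r)."""
    stages = int(2 * (log2(n_ports) / log2(r)) - 1)
    return stages, n_ports // r, stages * (n_ports // r)

stages, per_stage, total = benes(16)
print(stages, per_stage, total)          # 7 stages x 8 switches = 56
# Folding around the middle stage leaves (stages + 1) / 2 = 4 columns of
# bidirectional switches: the folded Clos drawn on this slide.
print((stages + 1) // 2)                 # 4
```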


3. Other DIRECT System Network Topologies

Distributed Switched (Direct) Networks

[Figure: a 2D torus of 16 nodes, a hypercube of 16 nodes (16 = 2⁴, so n = 4), and a 2D mesh or grid of 16 nodes, with the network bisection marked.]

Bisection ≤ full bisection bandwidth!
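The bisection claim can be checked by direct counting. A sketch for the three 16-node examples (closed-form counts; the function names are mine):

```python
def mesh2d(k):       # k x k mesh: cut between the two middle columns
    return {"nodes": k * k, "diameter": 2 * (k - 1), "bisection_links": k}

def torus2d(k):      # k x k torus: wrap links double the cut, halve distances
    return {"nodes": k * k, "diameter": 2 * (k // 2), "bisection_links": 2 * k}

def hypercube(n):    # 2**n nodes: one dimension's links cross the cut
    return {"nodes": 2 ** n, "diameter": n, "bisection_links": 2 ** (n - 1)}

for name, t in (("2D mesh", mesh2d(4)), ("2D torus", torus2d(4)),
                ("hypercube", hypercube(4))):
    print(f"{name:9s} {t}")
```

For 16 nodes this gives bisections of 4, 8 and 8 links: the torus and hypercube reach the full-bisection figure of N/2 = 8, while the mesh falls short, illustrating the "≤" above.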


3. IBM BlueGene/L/P Network

Prismatic 32x32x64 Torus (mixed-radix networks)

BlueGene/P: 32x32x72 in maximum configuration

Mixed-radix prismatic tori are also used by Cray

3. IBM BG/Q


3. IBM BG/Q


3. BG Network Routing

[Figure: BG torus routing over X, Y and Z wires; Adaptive Bubble Routing (ATC-UC Research Group).]

3. Fujitsu Tofu Network


3. More Recent Network Topologies

Distributed Switched (Direct) Networks

• Fully-connected network: all nodes are directly connected to all other nodes using bidirectional dedicated links

[Figure: fully connected graph on 8 nodes (0–7).]

3. IBM PERCS

3. IBM PERCS


3. IBM PERCS


3. Dragonfly Interconnection Network

Organized as groups of routers.

Parameters:
• a: routers per group
• p: nodes per router
• h: global links per router
• Well-balanced dragonfly [1]: a = 2p = 2h

Intra-group: local links forming a complete graph.
Inter-group: global links forming a complete graph.
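Sizing follows directly from these parameters. A sketch of the standard bookkeeping from [1] (the helper function is mine):

```python
def dragonfly(h):
    """Well-balanced dragonfly [1]: a = 2p = 2h. Each group has a routers
    and a*h global links, so a complete graph among groups supports
    g = a*h + 1 groups and N = a*p*g end nodes."""
    a, p = 2 * h, h
    g = a * h + 1
    return a, p, g, a * p * g

for h in (1, 2, 4, 8):
    a, p, g, n = dragonfly(h)
    print(f"h={h}: a={a} routers/group, p={p} nodes/router, "
          f"{g} groups, {n} nodes")
```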

3. Dragonfly Interconnection Network

Minimal routing:
• Longest path 3 hops: local-global-local
• Good performance under uniform (UN) traffic

Adversarial traffic [1]:
• ADV+N: nodes in group i send traffic to group i+N
• Saturates the single global link between source group i and destination group i+N

[Figure: a source node in group i reaching a destination node in group i+N through the saturated global link.]

[1] J. Kim, W. Dally, S. Scott, and D. Abts, "Technology-driven, highly-scalable dragonfly topology," ISCA '08.

3. Dragonfly Interconnection Network

Valiant routing [2]:
• Randomly selects an intermediate group through which packets are misrouted
• Avoids the saturated channel
• Longest path 5 hops: local-global-local-global-local

[Figure: a packet detours from the source node through a random intermediate group before reaching the destination node.]

[2] L. Valiant, "A scheme for fast parallel communication," SIAM Journal on Computing, vol. 11, p. 350, 1982.
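A sketch contrasting the two policies at group granularity (the hop bookkeeping follows the slides; the group count and names are illustrative):

```python
import random

def hops(src_grp, dst_grp, n_groups, valiant=False):
    """Minimal dragonfly routing takes at most local-global-local (3 hops).
    Valiant routing [2] first misroutes to a random intermediate group,
    at most local-global-local-global-local (5 hops), trading latency
    for load balance under adversarial patterns such as ADV+N."""
    if not valiant:
        return ["local", "global", "local"]
    mid = random.choice([g for g in range(n_groups) if g not in (src_grp, dst_grp)])
    return ["local", f"global(to g{mid})", "local", f"global(to g{dst_grp})", "local"]

random.seed(0)
print(hops(0, 7, 33))                # minimal: 3 hops
print(hops(0, 7, 33, valiant=True))  # Valiant: 5 hops via a random group
```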


3. Cray Cascade, electrical supernode


3. Cray Cascade, system and routing


Outline

1. Introduction

2. Network Basics

3. System networks

4. On-chip networks (NoCs)

Rings

Meshes

5. Some current research

SEM photo of local-level interconnects

4. On-Chip local interconnects


Global-level interconnects

4. On-Chip global interconnects


4. Metal Layers


4. Bumps & Balls


Multi-chip integration with 3D stacking…

4. 3D (& 2.5D) Stacking & Silicon Photonics

[Image credits: "3M, IBM team to develop 3D IC adhesive," EETimes India; STMicroelectronics & CEA.]


4. Rings from ARM


4. Rings from Intel



4. Rings (Direct or Indirect?)

Folded ring: lower maximum physical link length.

• Bidirectional ring networks (folded):
– N switches (3 × 3) and N bidirectional network links
– Simultaneous packet transport over disjoint paths
– Packets must hop across intermediate nodes
– Shortest direction usually selected (N/4 hops, on average)
– Bisection bandwidth??? (see the sketch below)
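Both the N/4 average and the bisection question can be settled by counting; a sketch, assuming shortest-direction routing:

```python
def ring_stats(n):
    """Bidirectional ring with shortest-direction routing: average distance
    tends to N/4, while any bisection cuts only 2 ring positions, i.e.
    2 bidirectional links, regardless of N."""
    dists = [min(d, n - d) for d in range(1, n)]   # from any node, by symmetry
    return sum(dists) / (n - 1), 2

for n in (8, 16, 64):
    avg, bisection = ring_stats(n)
    print(f"N={n:2d}: avg hops = {avg:.2f} (N/4 = {n / 4}), "
          f"bisection = {bisection} bidirectional links")
```

So the answer to the slide's question: the bisection stays constant at 2 bidirectional links, which is what ultimately limits rings to modest node counts.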


4. Meshes and Tori

Distributed Switched (Direct) Networks

[Figure: a 2D torus of 16 nodes and a 2D mesh or grid of 16 nodes, with the network bisection marked.]

4. Meshes from Tilera

4. Mesh from Phytium Mars Architecture

These images were taken from the slides presented at Hot Chips 2015

• L1:
– Separate L1 Icache and L1 Dcache
– 32 KB Icache
– 32 KB Dcache
– 6 outstanding loads
– 4 cycles latency from load to use
• L2:
– 16 L2 banks of 4 MB
– 32 MB of shared L2
• L3:
– 8 L3 arrays of 16 MB
– 128 MB of L3
• Memory controllers:
– 16 DDR3-1600 channels
• 2x16-lane PCIe-3.0
• Directory-based cache coherency:
– 16 Directory Control Units (DCU)
– MOESI-like cache coherence protocol

4. Phytium Mars NoC

This image was taken from the slides presented at Hot Chips 2015

• Switches with 6 bidirectional ports
• 4 physical channels for cache coherence
• 3 cycles per hop
• 384 GB/s per cell

4. Meshes from Intel Knights Landing


Intel Knights Landing – 3 options

4. Intel Knights Landing


4. Intel Knights Landing


4. Intel Knights Landing


4. Intel Knights Landing



Outline

1. Introduction

2. Network Basics

3. System networks

4. On-chip networks (NoCs)

5. Some current research

5. Some research on NUCA-based CMP Models


5. Full-system simulation including concentration


GEM5 + BookSim full-system simulation platform parameters:

ISA: x86
Number of cores: 64
CPU model: out-of-order
CPU frequency: 2 GHz
Cache coherence protocol: MESI
L1 instruction cache: 32 KB
L1 data cache: 64 KB
Shared distributed L2: 256 KB per core
Memory controllers: 4
Network frequency: 1 GHz
Router pipeline stages: 4
Physical networks: 3
Buffer size: 10 flits
Link width: 64 bits
Topologies: 8x8 mesh, torus and FBFLY; 4x4 FBFLY with c = 4
Applications used: PARSEC benchmarks

5. Topology comparison

Three different topologies are considered:


Topology comparison, N = 16:

Metric                          2D Mesh    2D Torus   2D FBFLY
Degree (ports) ↓                4          4          6
Diameter (max. distance) ↓      6          4          2
Average distance ↓              2.67       2.13       1.60
Bisection Bandwidth (links) ↑   4          8          16

Advantages: the mesh has low degree and the shortest links; the torus has low degree, symmetry and better properties; the FBFLY has the best properties, symmetry and larger concentration.

Disadvantages: the mesh has the largest distances and the lowest bisection bandwidth; the torus requires folding and deadlock avoidance; the FBFLY has the highest cost and non-uniform link lengths.
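The metric rows can be recomputed for N = 16 routers by brute force; a sketch (BFS from every router; the FBFLY is modeled as a 4-ary 2-flat flattened butterfly, and bisection is omitted for brevity):

```python
from collections import deque
from itertools import product

def analyze(k, neighbors, name):
    """Degree, diameter and average distance of a k x k router array,
    computed by BFS from every router."""
    nodes = list(product(range(k), range(k)))
    degree = max(len(neighbors(v, k)) for v in nodes)
    dists = []
    for s in nodes:
        seen, q = {s: 0}, deque([s])
        while q:
            v = q.popleft()
            for w in neighbors(v, k):
                if w not in seen:
                    seen[w] = seen[v] + 1
                    q.append(w)
        dists += [d for u, d in seen.items() if u != s]
    print(f"{name:6s} degree={degree} diameter={max(dists)} "
          f"avg={sum(dists) / len(dists):.2f}")

def mesh(v, k):
    x, y = v
    return [(a, b) for a, b in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
            if 0 <= a < k and 0 <= b < k]

def torus(v, k):
    x, y = v
    return [((x - 1) % k, y), ((x + 1) % k, y), (x, (y - 1) % k), (x, (y + 1) % k)]

def fbfly(v, k):  # flattened butterfly: complete graph in each row and column
    x, y = v
    return [(a, y) for a in range(k) if a != x] + [(x, b) for b in range(k) if b != y]

for name, nb in (("mesh", mesh), ("torus", torus), ("FBFLY", fbfly)):
    analyze(4, nb, name)   # reproduces the degree/diameter/avg rows above
```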

5. Full-system simulation

Normalized execution times and network latencies:
• Average latency has an impact on AMAT.
• High latencies can degrade execution times if the affected data are critical.


5. Router Power and Area

Router leakage power and area evaluation:
• Buffers are the most power-consuming part of the router.
• Crossbars and allocators grow quadratically with the number of ports.
• The load in these simulations is low; hence, leakage power is the dominant component.


5. Router Power and Area

Network leakage power evaluation:
• FBFLY can manage higher concentrations because of its higher bisection bandwidth.


5. OmpSs vs. pThreads
