An STDM (Static Time Division Multiplexing) switch on a multi-FPGA system 2019.10.4 K.Azegami 1 , K.Musha 1 , K.Hironaka 1 , A.B.Ahmed 1 , M.Koibuchi 2 , Y.Hu 2 and H. Amano 1 1 Keio University, 2 National Institute of Informatics Special Thanks to Tomohiro Kudoh, Ryosei Takano, Kohei Itoh, Kensuke Iizuka and Yugo Yamauchi


Page 1

An STDM (Static Time Division Multiplexing) switch on a multi-FPGA system

2019.10.4

K. Azegami¹, K. Musha¹, K. Hironaka¹, A. B. Ahmed¹, M. Koibuchi², Y. Hu² and H. Amano¹

¹Keio University, ²National Institute of Informatics

Special Thanks to Tomohiro Kudoh, Ryosei Takano, Kohei Itoh, Kensuke Iizuka and Yugo Yamauchi

Page 2

MEC (Multi-access Edge Computing)

[Figure: a centralized cloud and MEC data centers serving a smart community over 5G links.]

• 5G enables < 0.5 msec latency
• Timing-critical jobs
• Low power but high performance

FPGA Computing

Page 3

FPGA computing in MEC data center

• Benefits:
  • Suitable for timing-critical applications.
    • Data are transferred directly from/to IoT devices by FPGA hardware.
    • Constant-time execution is achieved with the hardware engine.
  • Low-energy computation.
    • Much more energy-efficient than GPUs.
  • Scalability for multiple access.
    • Easy to increase the number of FPGAs.
    • An FPGA itself can be used as a switch.
  • The programming environment has improved:
    • OpenCL is widespread for computational usage.
    • Vivado HLS is popularly used for general usage.
  • High-performance FPGAs are available.
    • Intel's Stratix 10 with a lot of floating-point DSPs
    • Xilinx's UltraScale+ with UltraRAM

→ However, cost is an important issue in the MEC data center.

Page 4

Our Proposal: Virtual FPGA for MEC

[Figure: a host PC connected through PCIe to an array of FPGAs, which receive data directly from IoT devices.]

• A lot of cost-efficient middle-scale FPGAs are tightly connected with their own serial links.
• They can be treated as if they were a single FPGA at the HLS description level.
• Higher performance per cost than a conventional FPGA in the cloud.
• Practically unlimited resources can be used.
• The system is separated into a number of virtual FPGAs shared by multiple accesses.
• Flow-in-Cloud (FiC) is the first prototype.

Page 5

Flow-in-Cloud (FiC) overview

[Figure: FiC overview. Labels: FPGA boards, STDM switches, HLS modules, circuit switching network, FiC-SW, host CPU, I/O board (KCU1500), high-speed serial links.]

Today's talk focuses on the STDM switch.

Page 6

Flow-in-Cloud (FiC) SW Board

[Figure: the FiC SW board. Labels: FiC Network (8x4, 9.9Gbps); Ethernet; Control Network; Application Logic Area; SW control board (Raspberry Pi 3 Model B); FPGA (Xilinx Kintex UltraScale XCKU095; KU085/095); STDM switch; HLS modules; DDR-4 SDRAM 16Gb (x2).]

Here, we call each link a "channel" and a group of 4 channels a "bundle". A board has 8 bundles, each of which has 4 channels.

Page 7

Block Diagram of FiC

[Figure: block diagram. Labels: Xilinx Aurora (x2); 8.5Gbps x32 (4 chan. x 8 lane) (x2); 9.9Gbps (x2); STDM switches sw0, sw1, sw2, sw3; 170bit (x2); 100MHz (x2); HLS module; PR domain; static domain; Raspi3; Ethernet; DRAM.]

Page 8

STDM (Static Time Division Multiplexing)

[Figure: an example of a 4x4 switch with four slots. Input ports 1 to 4 and output ports 1 to 4 each have registers for slots S1 to S4.]

• Input data arriving at each port are registered cyclically.
• An input register is selected according to the pre-loaded table, and its data are transferred to the output register.
• Output data are sent cyclically to the output port.
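The table-driven selection described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the FiC hardware design: the function name `stdm_switch`, the table layout `table[slot][output]`, and the table entries for outputs 2 to 4 are assumptions made for the example. The entries for output port 1 follow the P2S1, P2S2, P1S3, P3S4 sequence shown on the following slide.

```python
def stdm_switch(inputs, table):
    """Model one frame of an N x N STDM switch.

    inputs[p][s] : word held in the input register of port p for slot s
    table[s][o]  : input port forwarded to output o in slot s
                   (None means output o is idle in that slot)
    Returns outputs[o][s].
    """
    n_ports = len(inputs)
    n_slots = len(table)
    outputs = [[None] * n_slots for _ in range(n_ports)]
    for s in range(n_slots):
        for o in range(n_ports):
            src = table[s][o]
            if src is not None:
                outputs[o][s] = inputs[src][s]
    return outputs


# Hypothetical pre-loaded table for a 4x4 switch with four slots
# (0-based port indices; columns are outputs 1..4).
TABLE = [
    [1, 0, 3, 2],        # slot S1
    [1, 2, 3, 0],        # slot S2
    [0, 1, None, 3],     # slot S3
    [2, None, 0, 1],     # slot S4
]

# Label each input word "P<port>S<slot>" with 1-based numbers.
inputs = [[f"P{p + 1}S{s + 1}" for s in range(4)] for p in range(4)]
outputs = stdm_switch(inputs, TABLE)
print(outputs[0])
```

Because the table is fixed before data flow starts, the model needs no arbitration or buffering logic; every slot is a plain lookup.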

Page 9

STDM (Static Time Division Multiplexing)

[Figure: the 4x4 switch with four slots. The stream at output port 1 repeats the sequence P2S1, P2S2, P1S3, P3S4, ...]

• A circuit is established between source and destination.
• Latency and bandwidth are guaranteed.
• Latency = 45 + 2 x (# of slots) clock cycles
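As a worked example of the latency formula, the short sketch below evaluates it for the slot counts that appear in the bandwidth plot (2 to 64 slots). The function names are made up for this example, and the conversion to nanoseconds assumes the switch runs at the 100 MHz clock labelled in the block diagram, which the slide itself does not state.

```python
def stdm_latency_cycles(n_slots: int) -> int:
    """Latency in clock cycles, as given on the slide: 45 + 2 x (# of slots)."""
    return 45 + 2 * n_slots


def cycles_to_ns(cycles: int, clock_mhz: float = 100.0) -> float:
    """Convert cycles to nanoseconds (100 MHz is an assumed clock)."""
    return cycles * 1000.0 / clock_mhz


for n in (2, 4, 8, 16, 32, 64):
    cycles = stdm_latency_cycles(n)
    print(f"{n:2d} slots: {cycles:3d} cycles, {cycles_to_ns(cycles):6.0f} ns")
```

The fixed term dominates for small slot counts, so adding slots raises the latency only gradually.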

Page 10

Multicast using the STDM

[Figure: the 4x4 switch with four slots; one part is marked "For internal usage".]

• Multicast is done efficiently.
• Multiple outputs can receive the same data in a specific slot.
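A minimal sketch of why multicast comes almost for free in this scheme, assuming each output port has its own selector driven by the pre-loaded table (an assumption; the slides do not show the selector structure). The helper `route_slot` and the data words are hypothetical.

```python
def route_slot(slot_inputs, selects):
    """Route one time slot: output o receives slot_inputs[selects[o]].

    A None entry in selects leaves that output idle.
    """
    return [None if src is None else slot_inputs[src] for src in selects]


slot_inputs = ["A", "B", "C", "D"]   # words at input ports 1..4 in this slot

# Unicast: every output selects a different input.
unicast = route_slot(slot_inputs, [1, 0, 3, 2])

# Multicast: outputs 1..3 all select input port 1; output 4 is idle.
multicast = route_slot(slot_inputs, [0, 0, 0, None])

print(unicast)
print(multicast)
```

In this model multicast is only a different table entry: several outputs name the same input, and no data are copied or queued.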

Page 11

The resource usage

[Table: FPGA resource usage (not recoverable from the extraction). GT: high-speed link.]

• Enough resources remain for the HLS design.
• 4 switches are provided for each channel.

Page 12

The current configuration: 4x6 torus

[Figure: 24 boards, fic00 to fic11 and m2fic00 to m2fic11, numbered 0 to 23 and connected as a 4x6 torus; the links carry labels 1 to 4.]
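To make the 4x6 torus concrete, the sketch below lists the four neighbours of a node. The row-major numbering (4 rows of 6 nodes, ids 0 to 23) and the direction names are assumptions for illustration; the actual mapping of board names to positions in the figure is not recoverable from the extracted text.

```python
ROWS, COLS = 4, 6   # assumed orientation of the 4x6 torus


def node_id(row: int, col: int) -> int:
    """Row-major node id with wrap-around in both dimensions."""
    return (row % ROWS) * COLS + (col % COLS)


def neighbours(n: int) -> dict:
    """Return the four torus neighbours of node n."""
    row, col = divmod(n, COLS)
    return {
        "north": node_id(row - 1, col),
        "south": node_id(row + 1, col),
        "west": node_id(row, col - 1),
        "east": node_id(row, col + 1),
    }


print(neighbours(0))
print(neighbours(23))
```

Every node has exactly four distinct neighbours in this layout, including the nodes on the edges, because the links wrap around.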

Page 13

The current FiC with 24 boards

Page 14

Round Trip Time (24 boards)

[Figure: the 4x6 torus of boards fic00 to fic11 and m2fic00 to m2fic11, with labels 1 to 4 on the links.]

Page 15

Round Trip Time (24 nodes HLS-HLS)

[Figure: plot. Labels: Slot number; Latency (clock cycles); Transferred Data (x160b x4).]

• 1.2 GB/sec HLS-HLS effective bandwidth
• 4.6 μsec max. latency

Page 16

Node-to-Node Bandwidth (HLS-HLS)

[Figure: plot of bandwidth (GB/sec, 0 to 1.4) against transferred data (x160b x4, 0 to 12000) for 2-slot, 4-slot, 8-slot, 16-slot, 32-slot and 64-slot configurations.]

Page 17

Comparison with another FPGA-connected multi-GPU system and MPI/InfiniBand

[Figure: plot of bandwidth (GB/sec, 0 to 4) against transferred data (Byte, 4 to 1048576) for GPUtoGPU (PEACH3), GPUtoGPU (PEACH2), GPUtoGPU (MPI/IB) and FiC 2slot.]

Kaneda, et al., "Performance Evaluation of PEACH3: FPGA switch for Tightly Coupled Accelerators," HEART 2017

Page 18

Round Trip Time vs. The number of Slots

[Figure: plot of round trip time (clocks, 0 to 1200) against the number of slots (0 to 70) for 2, 4 and 8 boards; the values 1 and 320 are marked on the plot.]

• Pass through time for a board is about 450 nsec with 2 slots.
• Slot synchronization delay increases the round trip time.

Page 19

Comparison with other systems

[Figure: plot of latency (μsec, 0 to 6) against transferred data (Byte, 8 to 512) for GPUtoGPU (PEACH3), GPUtoGPU (PEACH2), GPUtoGPU (MPI/IB), FiC 2slot (450ns) and K's Tofu (100ns).]

Ajima, et al., "The Tofu Interconnect," Hot Interconnects 2012

Page 20

[Figure: the 4x6 torus of boards fic00 to fic11 and m2fic00 to m2fic11, numbered 0 to 23, with labels 1 to 4 on the links and annotations ④ and ⑤.]

Broadcast can be done within the maximum hop count.
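The maximum hop count of a torus can be checked with a breadth-first search. The sketch below is an illustration under assumed row-major node numbering, not the routing used on FiC; it computes the hop distance from one node to all others for the array sizes that appear in the 1-to-all and all-to-all measurement.

```python
from collections import deque


def torus_hops(rows: int, cols: int, src: int = 0) -> dict:
    """Breadth-first search: hop count from src to every node of a torus."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        n = queue.popleft()
        r, c = divmod(n, cols)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            m = ((r + dr) % rows) * cols + (c + dc) % cols
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist


# Maximum hop count for each array size.
max_hops = {
    (rows, cols): max(torus_hops(rows, cols).values())
    for rows, cols in ((2, 2), (2, 4), (4, 4), (4, 6))
}
print(max_hops)
```

For the 4x6 torus the farthest node is 2 hops away in the short dimension and 3 in the long one, so a broadcast from any node needs to cover at most 5 hops.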

Page 21

1-to-all and All-to-all

[Figure: plot of latency (clocks, 0 to 500) against array size (0 to 30) for 1-to-all and all-to-all communication, with points for 2x2, 2x4, 4x4 and 4x6 arrays.]

All-to-all broadcast can be done efficiently.

Page 22

Reduction Computation

[Figure: the 4x6 torus of boards fic00 to fic11 and m2fic00 to m2fic11, numbered 0 to 23, with labels 1 to 4 on the links and six ① annotations.]

Page 23

Reduction Calculation (170bit integer data)

[Figure: plot of computation time (clocks, 0 to 1200) against the number of slots (0 to 20).]

Each node finishes its computation in 1 clock cycle.

Page 24

Summary

● 34GB/sec max. aggregated throughput with 32 links
  ○ vs. PEACH3: 31.4GB/sec max.
  ○ vs. Catapult-2: 10GB/sec max. (40Gbps = 5GB/sec x 2)
● 1.2GB/sec 1-to-1 HLS throughput
  ○ Advantageous for small data size
● 450nsec pass through time
  ○ Much smaller than PEACH3 and MPI/InfiniBand
● 4.6μsec maximum HLS-to-HLS latency
● Future work
  ○ Improving input/output delay of HLS part
  ○ Slot synchronization
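The throughput figures in the summary can be cross-checked with simple unit conversion. The sketch below assumes that the 34GB figure is the raw line rate of 32 links at the 8.5 Gbps labelled in the block diagram; the slides do not spell out this derivation, and the meaning of the "x 2" factor for Catapult-2 is likewise taken as given.

```python
def gbps_to_gbytes_per_sec(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


# FiC: 32 links at 8.5 Gbps each (assumed source of the 34GB figure).
fic_aggregate = gbps_to_gbytes_per_sec(8.5 * 32)

# Catapult-2, as written in the summary: 40 Gbps = 5 GB/sec, times 2.
catapult2 = gbps_to_gbytes_per_sec(40.0) * 2

print(fic_aggregate, catapult2)
```

Under these assumptions the arithmetic reproduces the 34GB and 10GB values quoted above.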