Hardware and Software Requirements for Implementing a High-Performance Superconductivity Circuits-Based Accelerator
Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan
KL, Malaysia

Page 1

Kyushu University

KL, Malaysia

Hardware and Software Requirements for Implementing a High-Performance Superconductivity Circuits-Based Accelerator

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami

Kyushu University, Japan

Page 2

CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum (SFQ) circuits

SFQ-LSRDP project groups:

  • Kyushu Univ. (K. Murakami, K. Inoue, H. Honda, F. Mehdipour, H. Kataoka): architecture, compiler, and applications
  • Superconducting Research Lab. (SRL) (S. Nagasawa et al.): SFQ process
  • Yokohama National Univ. (N. Yoshikawa et al.): SFQ-FPU chip, cell library
  • Nagoya Univ. (A. Fujimaki et al.): SFQ-RDP chip, cell library, and wiring
  • Nagoya Univ. (N. Takagi (Leader) et al.): CAD for logic design and arithmetic circuits

Our mission: architecture, compiler, and application development

Page 3

Outline of Large-Scale Reconfigurable Data-Path (LSRDP) Processor

[Figure: SFQ circuit, showing a Josephson junction, a superconducting loop, and a single flux quantum]

SFQ features:

  • High-speed switching and signal transmission
  • Low power consumption
  • Compact implementation (smaller area)
  • Suitable for pipeline processing

Page 4

How it works

    inst;
    inst;
    …
    conf_LSRDP();
    Loop:
        rearrange_input_data();
        set_IO_info();
        run_LSRDP();
        inst;
        …
        sync_lsrdp();
        rearrange_output_data();
    End_Loop
    inst;
    …

[Figure: step-by-step view of the code above across the GPP, memory controller, memory, buffers, and LSRDP. Labels in the figure: conf_LSRDP() with "conf. bit-stream"; rearrange_input_data() with "GPP"; set_IO_info() with "Memory Controller"; run_LSRDP(), inst, sync_lsrdp() with "GPP", "Waiting for the LSRDP", and "LSRDP terminating the operation"; rearrange_output_data() with "GPP".]
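The call sequence on this slide can be mimicked with a small host-side model. The sketch below is only an illustration under stated assumptions: `MockLSRDP`, its methods, the `offload` helper, and the doubling kernel are hypothetical stand-ins and not the project's API. They only mirror the order of conf_LSRDP, rearrange_input_data, set_IO_info, run_LSRDP, sync_lsrdp, and rearrange_output_data.

```python
from typing import Callable, List


class MockLSRDP:
    """Hypothetical stand-in for the accelerator; it records the call order."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.kernel: Callable[[float], float] = lambda x: x
        self.in_buf: List[float] = []

    def conf(self, kernel: Callable[[float], float]) -> None:
        # Stands for conf_LSRDP(): configure once, before the loop.
        self.kernel = kernel
        self.log.append("conf")

    def set_io_info(self, data: List[float]) -> None:
        # Stands for set_IO_info(): describe the data to stream in.
        self.in_buf = list(data)
        self.log.append("set_io")

    def run(self) -> None:
        # Stands for run_LSRDP(): start the data path and return at once.
        self.log.append("run")

    def sync(self) -> List[float]:
        # Stands for sync_lsrdp(): wait for the result (computed here eagerly).
        self.log.append("sync")
        return [self.kernel(x) for x in self.in_buf]


def offload(blocks: List[List[float]], dev: MockLSRDP) -> List[List[float]]:
    """Run every block through the mock device in the order shown on the slide."""
    dev.conf(lambda x: 2.0 * x)              # assumed example kernel
    results = []
    for block in blocks:
        arranged = sorted(block)             # placeholder for rearrange_input_data()
        dev.set_io_info(arranged)
        dev.run()
        # The GPP could execute unrelated instructions here before syncing.
        out = dev.sync()
        results.append(list(reversed(out)))  # placeholder for rearrange_output_data()
    return results
```

The configuration call sits outside the loop, as on the slide, so its cost is paid once, while the input and output rearrangement is repeated for every block.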

Page 5

Architecture Exploration

LSRDP Layouts

[Figure: three candidate PE-array layouts, each with an ORN between consecutive PE rows. Layout-I: PEs labeled ADD/SUB and MUL. Layout-II: PEs labeled ADD/SUB and PEs labeled MUL in alternation. Layout-III: separate groups of MUL PEs and ADD/SUB PEs.]

ORN structures

[Figure: ORN structures drawn for MCL = 1 and MCL = 2, annotated with:
  • Number of rows = 1.5×M, number of columns = 4×MCL
  • Number of rows = 2×M, number of columns = 6×MCL+2
  • Number of rows = 1.5×M, number of columns = 4×MCL+1]

PE structures

[Figure: PEs built from FU and TU blocks]

  • Basic PE arch.: 3 inputs/2 outputs
  • PE arch. I: 4 inputs/3 outputs
  • PE arch. II: 3 inputs/3 outputs

Page 6

LSRDP Tool Chain

Flow 1: assembly code generation for the GPP
  Application C code → modifying application code (inserting LSRDP instructions in the code) → modified application code → ISAcc or COINS compiler, with the LSRDP library file (function definitions & declarations) → binary code

Flow 2: configuration bit-stream generation for the LSRDP
  Application C code → DFG extraction → data flow graphs → Placing and Routing Tool, with the LSRDP architecture description → configuration file + various text & schematic reports

Simulator: performance evaluation

Page 7

Mapping DFGs onto LSRDP

Inputs: DFG, LSRDP architecture description

Steps:
  • Placing input nodes
  • Placing operational & output nodes
  • Routing nets
  • Routing IO nets

Output: final map

[Figure: mapping flow, with the longest connections highlighted]
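One plausible way to read the two placement steps is as a level assignment: input nodes go to row 0, and every other node goes one row below its deepest predecessor. The sketch below is an assumption made for illustration, not the authors' placement tool; `place_rows`, `longest_connection`, and the tiny `EXAMPLE_PREDS` graph are hypothetical. It also reports the longest row-to-row connection, the kind of quantity the "longest connections" annotation points at.

```python
from typing import Dict, List

# Hypothetical DFG: each node maps to the list of nodes that feed it.
EXAMPLE_PREDS: Dict[str, List[str]] = {
    "a": [], "b": [], "c": [],      # input nodes
    "add": ["a", "b"],              # operational nodes
    "mul": ["add", "c"],
    "out": ["mul"],                 # output node
}


def place_rows(preds: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign rows: inputs at row 0, others one row below their deepest predecessor."""
    rows: Dict[str, int] = {}

    def row_of(node: str) -> int:
        if node not in rows:
            parents = preds.get(node, [])
            rows[node] = 0 if not parents else 1 + max(row_of(p) for p in parents)
        return rows[node]

    for node in preds:
        row_of(node)
    return rows


def longest_connection(preds: Dict[str, List[str]], rows: Dict[str, int]) -> int:
    """Largest number of rows spanned by any single producer-to-consumer edge."""
    spans = [rows[n] - rows[p] for n, ps in preds.items() for p in ps]
    return max(spans, default=0)
```

In the example graph, input c feeds mul two rows further down, so that edge is the longest connection; a real mapper would also have to assign columns and respect the width of each row.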

Page 8

Global routing algorithms

Routing DFG connections between source and destination PEs

[Figure: two routing examples from a source (src) to a destination (dest); legend: vacant, fully-occupied]

  • Exhaustive search-based: very time consuming
  • Branch and bound algorithm: very fast
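The contrast between the two approaches can be illustrated on a toy grid. The sketch below is not the authors' router: the grid encoding ("." for vacant, "#" for fully occupied), the four-neighbor moves, and the Manhattan-distance bound are all assumptions chosen for the example. With `use_bound=False` the search enumerates every simple path; with `use_bound=True` it abandons any partial path that cannot beat the best route found so far.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]

# Hypothetical occupancy map: "." is vacant, "#" is fully occupied.
GRID = [
    "....",
    ".#..",
    ".#..",
    "....",
]


def route(grid: List[str], src: Cell, dest: Cell, use_bound: bool = True):
    """Return (shortest route length in steps, number of partial paths expanded)."""
    n_rows, n_cols = len(grid), len(grid[0])
    best: Optional[int] = None
    expanded = 0

    def lower_bound(cell: Cell) -> int:
        # Manhattan distance never overestimates the remaining steps.
        return abs(cell[0] - dest[0]) + abs(cell[1] - dest[1])

    def dfs(cell: Cell, length: int, visited: Set[Cell]) -> None:
        nonlocal best, expanded
        expanded += 1
        if cell == dest:
            if best is None or length < best:
                best = length
            return
        if use_bound and best is not None and length + lower_bound(cell) >= best:
            return  # bound: this branch cannot improve on the best route
        r, c = cell
        neighbors = [(r + 1, c), (r, c + 1), (r - 1, c), (r, c - 1)]
        if use_bound:
            neighbors.sort(key=lower_bound)  # try the most promising move first
        for nr, nc in neighbors:
            inside = 0 <= nr < n_rows and 0 <= nc < n_cols
            if inside and grid[nr][nc] == "." and (nr, nc) not in visited:
                visited.add((nr, nc))
                dfs((nr, nc), length + 1, visited)
                visited.remove((nr, nc))

    dfs(src, 0, {src})
    return best, expanded
```

Calling `route(GRID, (0, 0), (3, 3), use_bound=True)` and then the same with `use_bound=False` returns the same route length, while the second call expands far more partial paths.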

Page 9

Micro-Routing-Problem Definition

Inputs
  • LSRDP basic specifications
    – Layout, width (W), MCL, PE arch., etc.
    – List of connections between consecutive rows
  • ORN structure, including
    – The number of CBs and T2s in each row
    – The number of CB rows
    – Topology of connections among CBs

Output
  • Detailed routes via cross-bar switches
    – The list of CBs used for routing each connection
    – Configuration of CBs

[Figure: an ORN between the i-th row and the (i+1)-th row of PEs, each PE drawn as FU and T blocks]

A micro-routing algorithm has been implemented for the LSRDP with underlying layout II and PE arch. III.

Page 10

ORN Micro-routing

Example

[Figure: an ORN built from CB and ½CB switches that connects PE 1 to PE 4 with PE 5 to PE 8; ports are labeled 00, 01, 10, 11]

  • CB: 2-input/2-output
  • ½CB: 1-input/2-output

Micro-nets
  • PE1 → PE5
  • PE2 → PE5, PE6, PE7
  • PE3 → PE6, PE8
  • PE4 → PE7, PE8

Page 11

ORN Micro-Routing Example: Heat-8x2, ORN between 3rd and 4th Rows

[Figure: step-by-step micro-routing of the connections from the PEs in the 3rd row to the PEs in the 4th row]

Page 12

Specifications of Attempted DFGs

DFG             Total # of nodes   # of inputs   # of outputs   # of ops
Heat-8x1        34                 6             4              16
Heat-8x2        60                 8             4              32
Heat-16x2       172                16            12             96
Poisson-3x3     62                 18            1              33
Vibration-4x2   48                 8             4              24
Vibration-8x2   136                16            12             72
Vibration-8x4   168                16            8              96
ERI-1           76                 16            9              51
ERI-2           67                 19            1              47

Page 13

Example of a DFG Mapping: Vibration-8x2

[Figure: the Vibration-8x2 DFG mapped onto the LSRDP]

Page 14

Results of routing nets using the proposed algorithms

DFG             Avg. horizontal C.L.   Avg./max. vertical C.L.   # of global/micro nets to route   Time to map (sec)
Heat-8x1        0.35                   0.75/3                    36/64                             0.015
Heat-8x2        0.44                   1.32/5                    68/114                            1.75
Heat-16x2       0.47                   1.64/7                    204/343                           1.05
Poisson-3x3     0.68                   2.4/16                    67/120                            2074.5
Vibration-4x2   0.46                   1.58/9                    50/88                             0.34
Vibration-8x2   0.42                   2.15/10                   154/332                           2.20
Vibration-8x4   2.48                   3.72/16                   348/610                           6721.3
ERI-1           0.75                   2.21/9                    111/374                           53.61
ERI-2           0.78                   2.99/9                    95/332                            0.327

Page 15

Thank You for Your Attention!

Any Questions?

Page 16

10 TFLOPS SFQ-RDP computer

[Figure: system block diagram. A one-chip CMOS CPU and SMACs (streaming memory access controllers) sit between the memory modules and the SFQ-RDP chips; an SB block is also shown. Each RDP is an array of PEs with an operand routing network (ORN) between consecutive PE rows.]

  • FPU SFQ-RDP: 32 PEs × 32 chips, 2.5 GFLOPS/PE, at 4.2 K
  • 1024 FPUs per MCM (34 chips) × 4 MCMs
  • Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
  • 2 TB memory module (FB-DIMM [DDR3 @ 1333 MHz, 128 GB] × 16 modules)
  • SFQ 0.5 μm process
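The headline figures on this slide follow from simple products, which the short check below reproduces. It reads "32 PE × 32 chips" as 32 PEs on each of 32 RDP chips per MCM, which is an interpretation rather than something the slide spells out; the function name and the derived bytes-per-flop ratio are additions made for illustration.

```python
def system_figures() -> dict:
    """Recompute the slide's aggregate numbers from its per-unit values."""
    pes_per_chip = 32          # assumed reading of "32 PE x 32 chips"
    rdp_chips_per_mcm = 32
    gflops_per_pe = 2.5
    mcms = 4
    channel_gb_s = 16          # GB/s per memory channel
    channels = 16
    module_gb = 128            # GB per FB-DIMM module
    modules = 16

    fpus_per_mcm = pes_per_chip * rdp_chips_per_mcm         # 1024
    peak_gflops_per_mcm = fpus_per_mcm * gflops_per_pe      # 2560.0
    peak_tflops = peak_gflops_per_mcm * mcms / 1000.0       # about 10 TFLOPS
    bw_per_mcm_gb_s = channel_gb_s * channels               # 256
    memory_tb = module_gb * modules / 1024                  # 2.0
    # Derived figure, not on the slide: memory bytes available per peak flop.
    bytes_per_flop = bw_per_mcm_gb_s / peak_gflops_per_mcm

    return {
        "fpus_per_mcm": fpus_per_mcm,
        "peak_tflops": peak_tflops,
        "bw_per_mcm_gb_s": bw_per_mcm_gb_s,
        "memory_tb": memory_tb,
        "bytes_per_flop": bytes_per_flop,
    }
```

Under this reading the four MCMs together give slightly more than 10 TFLOPS peak, and each MCM has roughly a tenth of a byte of memory bandwidth per peak floating-point operation.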

Page 17

Development of RDP Architecture

Chip micro-architecture:
  • Two types of PEs: FPA and FPM
  • PE layout: checkered pattern
  • PE: three inputs (A, B, C) → three outputs (A(*B), B, C)
  • TU: data through

Three scales of RDP (small, medium, and large scale)

[Figure: PE (i, j), built from FU/FP and TU blocks, connected through the ORN to PEs (i, j+1), (i+1, j+1), (i+2, j+1), ..., (i+L, j+1) for MCL = L]

RDP parameters (optimized by total number of JJs)

        # Inputs   # Outputs   Width   Height   MCL   Total JJs (∝ RDP size)
RDP-S   19         12          22      14       4     19387K
RDP-M   19         12          24      17       5     27027K
RDP-L   38         24          41      34       6     96374K

Page 18

Development of RDP Compiler

Flow 1: assembly code generation for the GPP
  Application C code → modifying application code (manual: inserting LSRDP instructions in the code) → modified code → ISAcc or COINS compiler, with the RDP library file (function definitions & declarations) → .asm code for MIPS-based GPP

Flow 2: configuration bit-stream generation for the RDP
  Application C code → DFG extraction (semi-manual) → data flow graphs → Placement and Routing Tool, with the RDP architecture description → configuration file + various text and schematic reports

Simulator: performance evaluation

Page 19

Development of RDP Oriented Algorithms

  • One-dimensional heat and vibrational equations
  • Two-dimensional heat and FDTD equations
  • Two-electron repulsion integral calculation in quantum chemistry
  • Runge-Kutta calculation for ordinary differential equations

Performance Evaluation

Two-dimensional heat equation (1024x1024 mesh)

SFQ-RDP 1): 50.6 GFlop/s vs. GPU 2): 63.0 GFlop/s

1) Evaluation method:
   RDP:
     - Execution time model
     - DFG has 21 inputs, 9 outputs, and 63 operations
   GPP:
     - Cycle-accurate processor simulator
     - BW: 159.0 GB/s

2) T. Aoki and A. Nukada, "CUDA programming primer," Kougakusya, ISBN-10: 4777514773, 2009 (in Japanese).