Fast and Highly Parallel-Efficient Graph Search Considering the Computer Memory Hierarchy
Yuichiro Yasui, Center for Co-Evolutional Social Systems, Kyushu University / JST COI
[email protected] / [email protected]
Construction of Discrete Models Capable of Representing Physical Phenomena, June 11, 2016, 13:40 - 14:10
Fast computation that exploits machine characteristics
• Develop software that runs fast by taking the characteristics of the computer into account
– Superior computational methods
– Fast implementation techniques
Advance both together, guided by the machine's characteristics
• What matters for achieving real performance on actual machines:
– Know which operations the machine handles well
– Know the properties of the algorithms and data structures
• Theoretically fast but slow in practice: e.g., Fibonacci heaps
• Theoretically inefficient but fast in practice: e.g., Quicksort
• Graph500: fastest single-node performance on the 7th, 8th, 9th, 10th, and 11th lists
• Green Graph500: most power-efficient performance on the 1st through 6th lists
– Our results occupied ranks 1 through 8
Implementations that apply machine-aware acceleration
Ours: UV300 (32 sockets) … 219 GTEPS; BG/Q (512 nodes) … 172 GTEPS
⇒ Equal or better performance with 1/16 the number of sockets
Positioning of this research
• Graph algorithm: BFS
• Efficient NUMA-aware BFS algorithm
– Improves the locality of memory accesses
– Targets efficient implementations on many-socket machines (SGI UV 2000, SGI UV 300)
[Figure: a many-socket system, each socket pairing a CPU with its own RAM]
NUMA-aware algorithm design
Huge graph: Kronecker graph w/ SCALE 34, i.e., 17 billion vertices and 275 billion edges
SGI UV 300: 32 sockets of 18-core Xeon and 16 TB RAM
• NUMA / cc-NUMA architecture
[Figure: many-socket system; fast local access vs. slower remote access]
Graph processing for large-scale networks
• Large-scale graphs arise in various fields:
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
– Cyber-security: 15 billion log entries / day
– Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
• Fast and scalable graph processing by using HPC
[Figure: target network sizes on a log-log plot of the number of vertices log2(n) vs. the number of edges log2(m). Plotted instances: USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr, cit-Patents, soc-LiveJournal1, twitter-rv, the Human Brain Project network, and the Graph500 classes Toy, Mini, Small, Medium, Large, and Huge. Reference marks: 1 billion / 1 trillion vertices and edges; Graph500 SCALE 29 on a 4-way Xeon (64 cores); Graph500 SCALE 40 on BlueGene/Q (65,536 nodes) and the K computer (65,536 nodes); the size manageable on a single machine (~512 GB RAM).]
BFS on Twitter follow-ship network
• Follow-ship network
– #users (#vertices): 41,652,230
– follow-ships (#edges): 2,405,026,092

BFS result from user 21,804,357 (this network excludes unconnected users), illustrating the six degrees of separation:

Lv.        #users   ratio (%)   percentile (%)
 0              1        0.00         0.00
 1              7        0.00         0.00
 2          6,188        0.01         0.01
 3        510,515        1.23         1.24
 4     29,526,508       70.89        72.13
 5     11,314,238       27.16        99.29
 6        282,456        0.68        99.97
 7         11,536        0.03       100.00
 8            673        0.00       100.00
 9             68        0.00       100.00
10             19        0.00       100.00
11             10        0.00       100.00
12              5        0.00       100.00
13              2        0.00       100.00
14              2        0.00       100.00
15              2        0.00       100.00
Total  41,652,230      100.00          -

Our implementation computes a BFS on this network in 60 ms.
Betweenness centrality (BC)

C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}

where \sigma_{st} is the number of shortest (s, t)-paths, and \sigma_{st}(v) is the number of shortest (s, t)-paths passing through vertex v.
Osaka road network: 13,076 vertices and 40,528 edges
• BC requires #vertices BFS runs, because one BFS obtains the one-to-all shortest paths
• Computes an importance score for each vertex and edge using all-to-all shortest paths (breadth-first search), without vertex coordinates
[Figure: Osaka road network around Osaka station, colored by importance (low → high); high-importance edges correspond to highways and bridges]
Our software "NETAL" can solve BC for the Osaka road network within one second.
Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama, and M. Takamatsu: NETAL: High-performance Implementation of Network Analysis Library Considering Computer Memory Hierarchy, JORSJ, Vol. 54-4, 2011.
Graph500 and Green Graph500
• New benchmarks based on graph processing (breadth-first search)
• Measure the performance and energy efficiency of irregular memory access
– Graph500 benchmark: TEPS score (# of traversed edges per second), measuring the performance of irregular memory accesses
– Green Graph500 benchmark: TEPS-per-watt score, measuring power-efficient performance
Benchmark flow: 1. Generation → 2. Construction → 3. BFS × 64
– Input parameters: SCALE and edgefactor (= 16)
– Graph generation, then graph construction
– 64 iterations of BFS + validation
– Results: BFS time, traversed edges, TEPS; the reported score is the median TEPS of the 64 runs
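As a worked example of the score (our arithmetic, not a benchmark result): TEPS = traversed edges / BFS time, so a BFS that traverses 2 × 10^9 edges in 20 ms scores 2 × 10^9 / 0.02 s = 100 GTEPS.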
Kronecker graph with 2^SCALE vertices and 2^SCALE × edgefactor edges, generated by applying the recursive Kronecker product SCALE times
[Figure: successive Kronecker products G1, G2, G3, G4]
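To make the generation step concrete, here is a minimal C sketch of a Kronecker (R-MAT-style) edge generator; the initiator probabilities (0.57, 0.19, 0.19, 0.05) follow the Graph500 specification, while the function name kronecker_edge and the use of plain rand() are our simplifications:

#include <stdio.h>
#include <stdlib.h>

/* Draw one edge of a SCALE-level Kronecker graph: at each of the SCALE
 * recursion levels, pick one quadrant of the adjacency matrix with the
 * Graph500 initiator probabilities A=0.57, B=0.19, C=0.19, D=0.05. */
static void kronecker_edge(int scale, long *u, long *v) {
  long i = 0, j = 0;
  for (int level = 0; level < scale; ++level) {
    const double r = (double)rand() / RAND_MAX;
    i <<= 1;
    j <<= 1;
    if (r < 0.57) {            /* quadrant A: (0,0) */
    } else if (r < 0.76) {     /* quadrant B: (0,1) */
      j |= 1;
    } else if (r < 0.95) {     /* quadrant C: (1,0) */
      i |= 1;
    } else {                   /* quadrant D: (1,1) */
      i |= 1;
      j |= 1;
    }
  }
  *u = i;
  *v = j;
}

int main(void) {
  const int scale = 10, edgefactor = 16;
  const long m = (1L << scale) * edgefactor;  /* 2^SCALE x edgefactor edges */
  for (long k = 0; k < m; ++k) {
    long u, v;
    kronecker_edge(scale, &u, &v);
    printf("%ld %ld\n", u, v);
  }
  return 0;
}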
For the Green Graph500, the power consumption (in watts) is measured during the 64 BFS runs; the score is the median of the 64 TEPS values divided by the power consumption (TEPS per watt).
Level-synchronized parallel BFS (top-down)
• Starts from the source vertex and executes the following two phases at each level:
– Traversal phase: finds the neighbors QN of the current frontier QF
– Swap phase: exchanges the frontier QF and the neighbors QN for the next level
[Figure: expansion from the source (Level 0) through Levels 1-3; at each level the frontier QF is traversed to collect the unvisited neighbors into QN, a synchronization follows, and the queues are swapped.]
Since there is one parent for each vertex v ∈ V, the output represents a tree with the root vertex s ∈ V.

Algorithm 1 is a fundamental parallel algorithm for BFS. It requires a synchronization at each level, where a level is a certain number of hops away from the source; we call this the level-synchronized parallel BFS [7]. Each traversal explores all outgoing edges of the current frontier, which is the set of vertices discovered at this level, and finds their neighbors, which is the set of unvisited vertices at the next level. We can describe this algorithm using a frontier queue QF and a neighbor queue QN, because unvisited vertices w are appended to the neighbor queue QN for each frontier-queue vertex v ∈ QF in parallel, with exclusive control, at each level (Algorithm 1, lines 7-12), as follows:

Q_N \leftarrow \{\, w \in A(v) \mid w \notin \mathrm{visited},\ v \in Q_F \,\}. \quad (1)

Algorithm 1: Level-synchronized parallel BFS.
Input    : G = (V, A) : directed graph. s : source vertex.
Variables: QF : frontier queue. QN : neighbor queue. visited : vertices already visited.
Output   : π(v) : predecessor map of BFS tree.
 1  π(v) ← -1, ∀v ∈ V
 2  π(s) ← s
 3  visited ← {s}
 4  QF ← {s}
 5  QN ← ∅
 6  while QF ≠ ∅ do
 7    for v ∈ QF in parallel do
 8      for w ∈ A(v) do
 9        if w ∉ visited (atomic) then
10          π(w) ← v
11          visited ← visited ∪ {w}
12          QN ← QN ∪ {w}
13    QF ← QN
14    QN ← ∅

B. Hybrid BFS (direction-optimized BFS)

The main runtime bottleneck of the level-synchronized parallel BFS (Algorithm 1) is the exploration of all outgoing edges of the current frontier (lines 7-12). Beamer et al. [9], [10] proposed a hybrid BFS algorithm (Algorithm 2) that reduces the number of edges explored. This algorithm combines two different traversal kernels: top-down (Algorithm 3) and bottom-up (Algorithm 4). Like the level-synchronized parallel BFS, the top-down kernel traverses neighbors of the frontier. Conversely, the bottom-up kernel finds the frontier from vertices in candidate neighbors. In other words, a top-down method finds the children from the parent, whereas a bottom-up method finds the parent from the children. For a large frontier, the bottom-up approach reduces the number of edges explored, because its traversal kernel terminates once a single parent is found (Algorithm 4, lines 3-8).

Algorithm 2: Hybrid BFS algorithm of Beamer et al.
Input    : G = (V, AF, AB) : directed graph. s : source vertex.
Variables: QF : frontier queue. QN : neighbor queue. visited : vertices already visited.
Output   : π(v) : predecessor map of BFS tree.
 1  π(v) ← -1, ∀v ∈ V
 2  visited ← {s}
 3  QF ← {s}
 4  QN ← ∅
 5  while QF ≠ ∅ do
 6    if is_TopDown(QF, QN, visited) then
 7      QN ← Top-down(G, QF, visited, π)
 8    else
 9      QN ← Bottom-up(G, QF, visited, π)
10    QF ← QN

Algorithm 3: Top-down BFS.
Input : G = (V, AF) : directed graph. QF : frontier queue. visited : vertices already visited. π : predecessor map of BFS tree.
Output: QN : neighbor queue.
 1  QN ← ∅
 2  for v ∈ QF in parallel do
 3    for w ∈ AF(v) do
 4      if w ∉ visited (atomic) then
 5        π(w) ← v
 6        visited ← visited ∪ {w}
 7        QN ← QN ∪ {w}

Algorithm 4: Bottom-up BFS.
Input : G = (V, AB) : directed graph. QF : frontier queue. visited : vertices already visited. π : predecessor map of BFS tree.
Output: QN : neighbor queue.
 1  QN ← ∅
 2  for w ∈ V \ visited in parallel do
 3    for v ∈ AB(w) do
 4      if v ∈ QF then
 5        π(w) ← v
 6        visited ← visited ∪ {w}
 7        QN ← QN ∪ {w}
 8        break
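As a concrete illustration of Algorithm 1, the following is a minimal C/OpenMP sketch for a CSR graph (xadj[]/adjncy[] arrays); the simplified queue handling through one shared atomic counter is our assumption, not the paper's implementation (real codes use per-thread buffers):

#include <stdlib.h>

/* Level-synchronized parallel top-down BFS (Algorithm 1) on a CSR graph.
 * xadj[v]..xadj[v+1] indexes the adjacency list A(v) inside adjncy[]. */
void bfs_top_down(long n, const long *xadj, const long *adjncy,
                  long s, long *pi) {
  long *QF = malloc(n * sizeof(long));      /* frontier queue */
  long *QN = malloc(n * sizeof(long));      /* neighbor queue */
  unsigned char *visited = calloc(n, 1);    /* visited flags  */
  for (long v = 0; v < n; ++v) pi[v] = -1;
  pi[s] = s;
  visited[s] = 1;
  QF[0] = s;
  long qf_len = 1;

  while (qf_len > 0) {                      /* one iteration = one level */
    long qn_len = 0;
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < qf_len; ++i) {
      const long v = QF[i];
      for (long e = xadj[v]; e < xadj[v + 1]; ++e) {
        const long w = adjncy[e];
        /* "if w not in visited (atomic)": claim w exactly once */
        if (!visited[w] && !__sync_lock_test_and_set(&visited[w], 1)) {
          pi[w] = v;
          long pos;
          #pragma omp atomic capture
          pos = qn_len++;
          QN[pos] = w;
        }
      }
    }
    /* the implicit barrier of the parallel loop is the level sync */
    long *tmp = QF; QF = QN; QN = tmp;      /* swap QF and QN */
    qf_len = qn_len;
  }
  free(QF); free(QN); free(visited);
}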
[Figure: top-down algorithm vs. bottom-up algorithm; in both, the frontier is at Level k and the neighbors at Level k+1, and the search switches between the two directions]
[Beamer,SC12] S. Beamer et al.: Direction-optimizing breadth-first search. SC12.
• Top-down: from the frontier of the visited set, explores the adjacent unvisited vertices
– All adjacent vertices must be checked
– Exclusive control (atomics) is required
• Bottom-up: from the (not necessarily adjacent) unvisited vertices, searches for the frontier of the visited set
– The loop can terminate as soon as a single frontier vertex is found
– No exclusive control is required
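A corresponding C/OpenMP sketch of one bottom-up step (Algorithm 4) over bitmap queues; the bitmap layout and the atomic OR on the output queue are our assumptions:

#include <stdint.h>

/* One bottom-up level: every unvisited vertex w scans its incoming lists
 * AB(w) (CSR arrays xadj/adjncy) and adopts the first parent found in the
 * frontier bitmap QF. pi[] and visited[] need no atomics because each w is
 * written by exactly one thread; the output bitmap QN uses an atomic OR
 * because 64 vertices share one word. */
void bfs_bottom_up_step(long n, const long *xadj, const long *adjncy,
                        const uint64_t *QF, uint64_t *QN,
                        unsigned char *visited, long *pi) {
  #pragma omp parallel for schedule(dynamic, 1024)
  for (long w = 0; w < n; ++w) {
    if (visited[w]) continue;
    for (long e = xadj[w]; e < xadj[w + 1]; ++e) {
      const long v = adjncy[e];
      if (QF[v >> 6] & (1ULL << (v & 63))) {   /* is v in the frontier? */
        pi[w] = v;
        visited[w] = 1;
        __sync_fetch_and_or(&QN[w >> 6], 1ULL << (w & 63));
        break;                                 /* one parent is enough  */
      }
    }
  }
}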
Direction-optimizing BFS [Beamer, SC12]
[Figure: in the top-down direction the frontier at Level k is expanded to its neighbors at Level k+1; in the bottom-up direction the candidate neighbors at Level k+1 search for the frontier at Level k]
Direction-optimizing BFS [Beamer, SC12]
• Top-down direction: uses the outgoing edges
• Bottom-up direction: uses the incoming edges
Chooses either the top-down or the bottom-up direction at each level
Forward search (top-down) and backward search (bottom-up) for BFS; traversed edges per level:

Level         Top-down       Bottom-up       Hybrid
0                    2   2,103,840,895            2
1               66,206   1,766,587,029       66,206
2          346,918,235      52,677,691   52,677,691
3        1,727,195,615      12,820,854   12,820,854
4           29,557,400         103,184      103,184
5               82,357          21,467       21,467
6                  221          21,240          227
Total    2,103,820,036   3,936,072,360   65,689,631
Ratio          100.00%         187.09%        3.12%
[Figure: frontier size vs. distance from the source. The direction-optimizing BFS runs top-down on the small frontiers at the start and end, and bottom-up on the large frontier in between.]
DGEMM (double-precision matrix-matrix multiplication)
Product of n × n matrices: O(n^3) operations; the operation count is the same, yet performance differs by implementation (up to roughly 300×):

                                        n=1,000   n=2,000
Triple loop                              6.07 s   60.19 s
Classical tuning (loop blocking)         1.99 s   16.78 s
BLAS library GotoBLAS2 (1 thread)        0.03 s    1.38 s
BLAS library GotoBLAS2 (8 threads)                 0.20 s

@ Intel Xeon X5460 3.16 GHz / GCC 4.1.2

Matrix-multiplication code example (C):
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      C[i][j] += A[i][k] * B[k][j];

[Figure: access patterns of C (indices i, j), A (i, k), and B (k, j)]
Triple loop vs. blocking: with blocking, the access ranges of matrices A and B are restricted to BLOCK_SIZE-sized tiles so that the working set stays in cache. See the sketch below.
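A minimal sketch of the blocked variant described above; BLOCK_SIZE is a tunable assumption (it should be chosen so that three BLOCK_SIZE × BLOCK_SIZE tiles fit in cache):

#define BLOCK_SIZE 64  /* assumption: tune so three tiles fit in cache */

/* Cache-blocked triple loop: the inner three loops touch only one
 * BLOCK_SIZE x BLOCK_SIZE tile of C, A, and B at a time. */
void dgemm_blocked(int n, double **A, double **B, double **C) {
  for (int ib = 0; ib < n; ib += BLOCK_SIZE)
    for (int jb = 0; jb < n; jb += BLOCK_SIZE)
      for (int kb = 0; kb < n; kb += BLOCK_SIZE)
        for (int i = ib; i < ib + BLOCK_SIZE && i < n; i++)
          for (int j = jb; j < jb + BLOCK_SIZE && j < n; j++)
            for (int k = kb; k < kb + BLOCK_SIZE && k < n; k++)
              C[i][j] += A[i][k] * B[k][j];
}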
Computer memory hierarchy
• Trade-off between access speed and capacity
• Exploiting the upper levels of the hierarchy well is extremely important
• Algorithms bottlenecked by data access can potentially run hundreds of times faster
Memory hierarchy, with typical capacities and speeds of the Pentium 4 generation:

Level                          Capacity    Access speed (GB/s)
Registers                      < 1 KB      50-500
Cache (L1, L2, TLB, L3)        < 30 MB     5-20
RAM                            < 512 GB    2.5-10
NV-RAM (SSD), a new tier       (< 1 TB)    (6 GB/s)
Storage                        > 1 TB      0.05-0.5

J. L. Hennessy, D. A. Patterson: Computer Architecture, A Quantitative Approach.
What happens on a data access?
• All arithmetic is performed on registers
– The memory hierarchy is checked from the top down for the data
– A cache hit is very fast; a cache miss requires an access to main memory
[Figure: register → L1 → L2 → memory lookup; each cache level adds only a few times the cost on a hit, while a cache miss (main-memory access) costs hundreds of times more]
• Memory management uses virtual addresses
– A data access needs the physical address (for the main-memory access)
– The TLB is the translation table between virtual and physical addresses
– It overflows when accesses span a wide area or address computations are heavy
⇒ Techniques that raise the temporal and spatial locality of data accesses are highly effective (a small illustration follows).
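A small self-contained sketch of why locality matters; the exact ratio is machine-dependent (this is our example, not a measurement from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sums the same array twice: once sequentially (cache- and TLB-friendly)
 * and once with a page-sized stride (roughly one TLB/cache miss per load). */
int main(void) {
  const long n = 1L << 26;                    /* 2^26 doubles = 512 MB */
  const long stride = 4096 / sizeof(double);  /* one page per step     */
  double *a = calloc(n, sizeof(double));
  double s = 0.0;

  clock_t t0 = clock();
  for (long i = 0; i < n; ++i) s += a[i];     /* sequential walk   */
  clock_t t1 = clock();
  for (long j = 0; j < stride; ++j)           /* page-strided walk */
    for (long i = j; i < n; i += stride) s += a[i];
  clock_t t2 = clock();

  printf("sequential %.2f s, strided %.2f s (sum = %g)\n",
         (double)(t1 - t0) / CLOCKS_PER_SEC,
         (double)(t2 - t1) / CLOCKS_PER_SEC, s);
  free(a);
  return 0;
}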
NUMA system
• NUMA = non-uniform memory access
[Figure: four NUMA nodes, each an 8-core Xeon E5-4640 (per-core L2 caches, shared L3 cache) with its own RAM; local memory accesses vs. remote (non-local) memory accesses]
• 4-way Intel Xeon E5-4640 (Sandy Bridge-EP)
– 4 CPU sockets
– 8 physical cores per socket
– 2 threads per core
⇒ 4 × 8 × 2 = 64 threads max.
Memory bandwidth on a NUMA system

Measured bandwidth (GB/s), target NUMA node (rows) × source NUMA node (columns):

          src 0   src 1   src 2   src 3
node 0     24.2     3.4     3.0     3.4
node 1      3.3    23.9     3.5     3.0
node 2      3.0     3.4    24.3     3.4
node 3      3.5     3.0     3.4    24.2

Local access: ~24 GB/s; remote access: ~3 GB/s.
[Figure: NUMA system (4 CPUs); threads whose data sits on the same node get fast local access, while threads whose data sits on another node get slow non-local access over different distances]
(Example) 4-socket Xeon system
• 4 CPU sockets
• 8 physical cores per socket
• 2 threads per core
Our Graph500 implementation
• Efficient graph data structures
– NUMA-aware graph (BD13): a 1-D partitioned CSR graph A0-A3, each part bound to a NUMA node
– Vertex sorting (HPCS15) and adjacency-list sorting by outdegree (ISC14)
• Efficient BFS implementation (improving the algorithm of Beamer et al.)
– Top-down direction: Agarwal's socket-queue top-down (SC10), plus pruning of remote edges (HPGP16) to reduce remote traffic
– Bottom-up direction: NUMA-aware bottom-up (BD13); each NUMA node computes its output NQk from the input CQ and its local data VSk
Our implementations: performance progression in GTEPS

Venue         Technique                                        GTEPS   Speedup
2011          Reference code (level-synchronized top-down)     87 M    ×1
SC10          NUMA-aware top-down                              800 M   ×9
SC12          Direction-optimizing (top-down + bottom-up)      5 G     ×58
BigData13     NUMA-optimized (NUMA-aware bottom-up)            11 G    ×125
ISC14         NUMA-optimized + degree-aware                    29 G    ×334
G500, ISC14   NUMA-optimized + degree-aware + vertex sorting   42 G    ×489

Ours: NUMA-optimized + direction-optimizing BFS; our latest version is 489× faster than the reference code.
• Results obtained on 4 Xeon processors.
Ours: NUMA-aware 1-D partitioned graph [BD13]
• Divides the graph into subgraphs and assigns one to each NUMA node
[Figure: the adjacency matrix is divided into 1-D row blocks A0-A3; block Ak is assigned to the CPU and RAM of NUMA node k]
• Each subgraph is represented as a CSR graph
• The top-down direction uses the inverse of G (G is undirected)
• In the bottom-up direction (the bottleneck component), each NUMA node computes a partial output NQ using a locally copied frontier CQ and its locally assigned visited set VSk, so all accesses are node-local
• The top-down direction is a modified version of Agarwal's NUMA-aware BFS: the frontier CQk and visited data VSk are local, while writes to the output neighbors NQ may be remote
A data-structure sketch follows.
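A minimal data-structure sketch of one per-NUMA-node CSR subgraph (the names are ours; the actual implementation binds the arrays to the node's RAM, e.g. by allocating and first-touching them from a thread pinned to that node):

#include <stdlib.h>

/* NUMA node k owns the contiguous vertex range [lo, hi) and stores only
 * the adjacency lists of those vertices, i.e., one row block A_k of the
 * 1-D partitioned adjacency matrix. */
struct csr_part {
  long lo, hi;    /* owned vertex range [lo, hi)               */
  long *xadj;     /* hi - lo + 1 offsets into adjncy[]         */
  long *adjncy;   /* adjacency of owned vertices (nnz entries) */
};

struct csr_part csr_part_alloc(long lo, long hi, long nnz) {
  struct csr_part p = { lo, hi, NULL, NULL };
  /* call from a thread pinned to node k so first touch places the
   * arrays in node-local RAM */
  p.xadj   = malloc((hi - lo + 1) * sizeof(long));
  p.adjncy = malloc(nnz * sizeof(long));
  return p;
}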
Ours: Adjacency list sorting [ISC14]
• Reduces unnecessary edge traversals in the bottom-up direction (loop count τ): the bottom-up scan of an adjacency list breaks as soon as it finds a frontier vertex
[Figure: adjacency lists A(va), A(vb); the adjacent vertices before the break are traversed, the rest are skipped]
• Sorts the adjacency list of each vertex vi by the outdegree of the adjacent vertices, from high to low, so the break tends to happen early; see the sketch below
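A minimal C sketch of the sorting step (the CSR arrays and an outdegree[] table are assumed to exist; the file-scope pointer is only there to fit plain qsort):

#include <stdlib.h>

static const long *g_outdegree;  /* outdegree table used by the comparator */

/* Descending order of outdegree: high-degree adjacent vertices first. */
static int by_outdegree_desc(const void *a, const void *b) {
  const long da = g_outdegree[*(const long *)a];
  const long db = g_outdegree[*(const long *)b];
  return (da < db) - (da > db);
}

/* Sort each adjacency list A(v) = adjncy[xadj[v] .. xadj[v+1]) by the
 * outdegree of the adjacent vertices. */
void sort_adjacency_lists(long n, const long *xadj, long *adjncy,
                          const long *outdegree) {
  g_outdegree = outdegree;
  for (long v = 0; v < n; ++v)
    qsort(&adjncy[xadj[v]], xadj[v + 1] - xadj[v],
          sizeof(long), by_outdegree_desc);
}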
Ours: Vertex sorting [HPCS15]
• The number of traversals of a vertex equals the outdegree of the corresponding vertex
• Our vertex sorting reorders the vertex indices by outdegree: access frequency and outdegree are correlated
[Figure: degree distribution and access frequency with vertex sorting; the highest-outdegree vertices receive the smallest indices, so most accesses concentrate on small-index vertices]
NUMA-aware top-down BFS
• The original version was proposed by Agarwal [Agarwal-SC10]
• Reduces random remote accesses by using socket-queues
• On ℓ sockets, the frontier's edges split roughly as Local : Remote = 1 : ℓ (e.g., viewed from NUMA 2)
• Phase 1 (CQ ⇒ NQ or socket-queue): each node traverses its local CQ, appends unvisited local vertices to its NQ, and forwards remote edges to the socket-queue of the owning node; then synchronize
• Phase 2 (socket-queue ⇒ NQ): each node appends the unvisited vertices found in its socket-queue to its NQ; then synchronize, swap CQ and NQ, and proceed to the next level
NUMA-aware top-down BFS with pruning of remote edges
• Prunes remote edges to reduce remote accesses (e.g., focusing on the remote-edge traversal on NUMA 2)
• Without pruning (original, from Agarwal's SC10 paper): each NUMA node appends all remote edges (v, w) to the corresponding socket-queue (a vector queue)
• With pruning (this paper): each NUMA node appends a remote edge (v, w) to the corresponding socket-queue only if the bitmap F does not contain w, and then adds w to F
– F reuses the CQ bitmap of the bottom-up direction and is not reinitialized while the search direction does not change
⇒ Each vertex is searched only once; a sketch of the pruning test follows
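A minimal sketch of the pruning test against the bitmap F (our formulation of the rule above):

#include <stdint.h>
#include <stdbool.h>

/* Returns true if the remote edge (v, w) should be pruned: w's bit in F is
 * already set, so w was forwarded to its socket-queue before. Otherwise
 * marks w in F and returns false, i.e., forward the edge exactly once. */
static inline bool prune_remote_edge(uint64_t *F, long w) {
  const uint64_t mask = 1ULL << (w & 63);
  if (F[w >> 6] & mask)
    return true;                            /* already forwarded: prune */
  __sync_fetch_and_or(&F[w >> 6], mask);    /* mark w as forwarded      */
  return false;
}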
SGI UV 2000
• UV 2000
– Single OS: SUSE Linux 11 (x86_64)
– Hypercube interconnect (NUMAlink6, 6.7 GB/s × 4)
– Up to 2,560 cores and 64 TB RAM (= 128 UV 2000 chassis × 2 sockets × 10 cores)
• Hierarchical network topology: sockets, chassis (= 2 sockets), cubes (= 8 chassis), and inner/outer racks (rack = 32 nodes); the hierarchy beyond the NUMA node cannot be detected
• ISM has two full-spec UV 2000 systems (sites: ISM, Kyushu U.)
Weak scaling with SCALE 26 per NUMA node
• Fast, scalable, and energy-efficient; fastest single-node entry
[Figure: GTEPS vs. SCALE 26-33 (1-128 NUMA nodes); scaling continues up to 128 CPUs (1,280 threads)]
– 9th & 10th Graph500 lists: SCALE 33 (8.6 billion vertices, 137.4 billion edges), 174 GTEPS and 133 GTEPS
– 3rd, 4th & 5th Green Graph500 lists: most energy-efficient among commercial supercomputers
– Scalable up to 1,280 threads
SGI UV 300
• UV 300
– Single OS: SUSE Linux 11 (x86_64)
– All-to-all interconnect
– Up to 1,152 cores and 16 TB RAM (= 8 UV 300 chassis × 4 sockets × 18 cores × 2 SMT)
• UV 300 chassis
– 4-socket 18-core Intel Xeon E7-8867 (Haswell), HT enabled (2 SMT)
– 2 TB RAM (512 GB per NUMA node)
[Figure: the UV 300 rack at Kyushu U. consists of 8 chassis; within each chassis the 4 sockets are connected all-to-all]
Weak scaling performance: comparison of UV300 and UV2000
[Figure: GTEPS vs. number of NUMA nodes (CPU sockets), 1-64. Series: UV300 (HT, THP, local; SCALE 29 per socket), UV300 (HT, THP, remote; SCALE 29 per socket), UV300 (HT, remote; SCALE 29 and SCALE 27 per socket), UV2000 (SCALE 27 per socket). The best UV300 series reaches 18.7, 32.5, 64.7, 100.3, 161.5, and 219.4 GTEPS; the UV2000 series reaches 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, and 152.2 GTEPS.]
⇒ A new single-node performance record
New result and the Nov. 2015 list: updating the fastest single-node entry
• Ours, fastest single node: SGI UV300 (1 node / 576 cores; HT enabled, THP enabled, local-ref. mode), SCALE 34, 219 GTEPS
• SGI UV2000 (1,280 cores): SCALE 33, 174.7 GTEPS
• SGI UV2000 (1,280 cores): SCALE 33, 149.8 GTEPS
Graph500 performance (in TEPS): weak scaling
[Figure: GTEPS (log scale) vs. number of sockets (log scale). Series: SGI UV 2000 (SCALE 26 per CPU socket), IBM BG/Q (SCALE 24 per CPU socket), HP SuperdomeX (480 threads), SGI UV 300 (SCALE 29 per CPU socket; remote and local).]

System          #CPUs (#threads)   SCALE   HT   THP   GTEPS
SGI UV300       16 (576)           33      ✔    ✔     162
HP SuperdomeX   16 (480)           33      ✔    ✔     128

System          #CPUs   SCALE   GTEPS
SGI UV300       32      33      219
SGI UV2000      128     33      173
IBM BG/Q        512     33      172

Memory usage: 4 TB
UV300 and UV2000
• Bandwidth (GB/s) measured with STREAM TRIAD, plotted by thread placement × memory placement
• Local access is clearly faster than remote access
– UV 2000 (64 sockets): each chassis (2 sockets) is connected in a hypercube; local 33 GB/s, remote 3-7 GB/s
– UV 300 (32 sockets): each chassis (4 sockets) is connected all-to-all; local 56 GB/s, within-chassis 12-14 GB/s, remote 6 GB/s
STREAM benchmark with numactl
• STREAM is a popular benchmark for measuring memory bandwidth using vector arithmetic
• numactl is a Linux command for CPU and memory binding

Download and compile the source code:

  $ wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
  $ icc -O2 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 -o stream stream.c

Execute the STREAM benchmark with numactl:

  SOCKETS=`seq 0 31`
  THREADS=36
  for i in $SOCKETS; do
    for j in $SOCKETS; do
      OMP_NUM_THREADS=$THREADS \
        numactl --cpunodebind=$i --membind=$j ./stream
    done
  done

⇒ 36 threads on socket $i compute on the vector data placed on socket $j
Our achievements in Graph500
• K computer, a distributed massively parallel supercomputer, won the 8th, 9th, 10th, and 11th (current) Graph500 lists (#1)
• Our UV2000, a distributed shared-memory supercomputer, ranked fastest among single-node entries on the 7th, 8th, 9th, 10th, and 11th Graph500 lists
[Figure: GTEPS (log scale) across the 1st (Nov. 2010) through 11th (Nov. 2015) lists, comparing the top entry, the K computer, TSUBAME 2.5 (2.0), FX10, SGI UV2000, TSUBAME-KFC, and a 4-way Xeon server (CPU and GPU versions). The top entry grew from 7 to 38,621 GTEPS; our single-node results grew from 8.2 to 174.7 GTEPS on SGI UV2000, with the GPU version reaching 104.3 GTEPS.]
ULIBC development policy
• Provides APIs for obtaining hardware (topology) information and for placement control
– Usable with OpenMP, Pthreads, etc.
– For thread pinning alone, just link ULIBC and add ULIBC_init();
• Emphasizes generality, so it can be combined with existing frameworks
– Supported compilers: GCC, Intel Compiler, SunCC, XLC
– Supported tools: numactl, Intel thread affinity interface, PBS
NUMA (non-uniform memory access)
[Figure: two NUMA nodes, each an 8-core Xeon E5-4640 with per-core L2 caches, a shared L3 cache, and local RAM. NUMA-aware execution pins threads (T) and their data (D) to the same node so that each thread accesses local memory.]
[Figure: a NUMA node = CPU + local RAM; memory on other nodes is remote memory]
ULIBC affinity
1. Detects the entire topology (e.g., sockets 0-3, each with its RAM 0-3)
2. Detects the online topology, i.e., the subset actually available, e.g. when restricted by:
   numactl --cpunodebind=1,2 --membind=1,2
3. Constructs two types of affinities
Compact-type affinity assigns threads in positions close to one another (e.g., with 4 threads on 2 nodes: threads 0-1 on NUMA node 0 and threads 2-3 on NUMA node 1, each using its local RAM):
  export ULIBC_AFFINITY=compact:fine
  export OMP_NUM_THREADS=7

Scatter-type affinity distributes the threads as evenly as possible across the online processors (e.g., threads 0 and 2 on NUMA node 0; threads 1 and 3 on NUMA node 1):
  export ULIBC_AFFINITY=scatter:fine
  export OMP_NUM_THREADS=7
Building and installing ULIBC
• Obtain ULIBC:
  git clone https://bitbucket.org/yuichiro_yasui/ulibc.git
• Build and install on Linux:
  make && make install
– Compile-time option: make CC=icc                        # Intel compiler
– Install-time option: make PREFIX=${HOME}/local install  # install directory
• Check after installation:
  $ ls ~/local/*
  /home/yasui/local/include:
  omp_helpers.h ulibc.h          <- header file ulibc.h
  /home/yasui/local/lib:
  libulibc.a libulibc.so*        <- static (.a) and shared (.so) libraries
Example 1: where the threads are placed

  #include <stdio.h>
  #include <omp.h>
  #include <ulibc.h>

  int main(void) {
    /* initialize ULIBC variables */
    ULIBC_init();

    /* OpenMP region */
    _Pragma("omp parallel") {
      const int tid = ULIBC_get_thread_num();                    /* get thread ID     */
      const struct numainfo_t loc = ULIBC_get_numainfo( tid );   /* get NUMA location */
      printf("Thread: %2d, NUMA: node %d, core %d\n",
             loc.id, loc.node, loc.core);
    }
    return 0;
  }

  struct numainfo_t {
    int id;    /* Thread ID */
    int proc;  /* Processor ID */
    int node;  /* NUMA node ID */
    int core;  /* NUMA core ID */
    int lnp;   /* number of cores in the same node */
  };

  $ ULIBC_AFFINITY=scatter:fine \
    OMP_NUM_THREADS=4 ./minimal1
  Thread: 3, NUMA: node 3, core 0
  Thread: 1, NUMA: node 1, core 0
  Thread: 0, NUMA: node 0, core 0
  Thread: 2, NUMA: node 2, core 0

  $ ULIBC_AFFINITY=compact:fine \
    OMP_NUM_THREADS=4 ./minimal1
  Thread: 2, NUMA: node 0, core 2
  Thread: 1, NUMA: node 0, core 1
  Thread: 0, NUMA: node 0, core 0
  Thread: 3, NUMA: node 0, core 3
Example 2: which processors the threads are bound to

  #include <stdio.h>
  #include <omp.h>
  #include <ulibc.h>

  int main(void) {
    /* initialize ULIBC variables */
    ULIBC_init();

    /* OpenMP region */
    _Pragma("omp parallel") {
      const int tid = ULIBC_get_thread_num();                        /* get thread ID          */
      const struct numainfo_t loc = ULIBC_get_numainfo( tid );       /* get NUMA location      */
      const struct cpuinfo_t proc = ULIBC_get_cpuinfo( loc.proc );   /* get processor location */
      printf("Thread: %2d, NUMA: node %d, core %d "
             "(Proc: %2d, Socket: %2d, Core: %2d, SMT: %2d)\n",
             loc.id, loc.node, loc.core,
             proc.id, proc.node, proc.core, proc.smt);
    }
    return 0;
  }

  struct numainfo_t {
    int id;    /* Thread ID */
    int proc;  /* Processor ID */
    int node;  /* NUMA node ID */
    int core;  /* NUMA core ID */
    int lnp;   /* number of cores in the same node */
  };

  struct cpuinfo_t {
    int id;    /* Processor ID */
    int node;  /* Package ID */
    int core;  /* Core ID */
    int smt;   /* SMT ID */
  };

  $ ULIBC_AFFINITY=scatter:fine OMP_NUM_THREADS=4 ./minimal2
  Thread: 3, NUMA: node 0, core 3 (Proc: 24, Socket: 3, Core: 3, SMT: 0)
  Thread: 1, NUMA: node 0, core 1 (Proc:  8, Socket: 1, Core: 1, SMT: 0)
  Thread: 0, NUMA: node 0, core 0 (Proc:  0, Socket: 0, Core: 0, SMT: 0)
  Thread: 2, NUMA: node 0, core 2 (Proc: 16, Socket: 2, Core: 2, SMT: 0)
Exercise: the daxpyi operation
• A vector operation that adds alpha times x[] into y[]
• However, only the entries of y[] whose indices are stored in indx[] are accessed

  void daxpyi_naive(size_t n, double alpha,
                    const double *x, const int *indx, double *y) {
    size_t k;
    for (k = 0; k < n; ++k) {
      y[ indx[k] ] += alpha * x[k];
    }
  }

• Experimental environment
– 4-socket 8-core Xeon server (Hyper-Threading enabled)
– n = 10000, m = 1000000, x[n] = {0}, y[m] = {0}
– indx[n] = {0, 100, 200, 300, …}
• Is this code easy to speed up?
– Each iteration is independent, so it is trivially parallelizable ⇒ parallelizable
– Large data movement and no possibility of data reuse ⇒ can parallel computation actually improve performance?
OpenMP parallelization of daxpyi: speedup from thread parallelism
[Figure: speedup relative to MKL vs. number of threads (1-64) for mkl, seq, and omp. MKL and the naive sequential version perform almost identically; the OpenMP-parallel version degrades as the thread count grows.]

  void daxpyi_omp(size_t n, double alpha,
                  const double *x, const int *indx, double *y) {
    _Pragma("omp parallel for")
    for (size_t k = 0; k < n; ++k) {
      y[ indx[k] ] += alpha * x[k];
    }
  }
daxpyi w/ ULIBC on a 4-socket Xeon
• Sparse vector operation daxpyi … y[ indx[k] ] += alpha * x[k]
– x[10000], y[1000000], indx[10000] (indx[] stores element indices into y[])
[Figure: speedup vs. number of threads (1-64) for mkl, seq, and omp under compact-type affinity (threads placed close together) and scatter-type affinity (distributed placement).]
• With thread pinning, the OpenMP version's performance loss at high thread counts is mitigated
• Compact-type affinity shows the highest performance (up to ×6.47)
• Scatter-type affinity performs close to the default placement
How should we interpret the performance?
• Performance drops when threads cross sockets ⇒ thread and data placement is the problem
– Compare configurations as (number of sockets) × (threads per socket)
[Figure: intra-socket configurations 1×2, 1×4, 1×8, 1×16 (HT) vs. inter-socket configurations 2×1, 4×1, 4×2, 4×4, 4×8, 2×16 (HT), 4×16 (HT), under compact and scatter placement.]
• Do not let a computation straddle sockets; HT brings little benefit (best case ×6.47)
Fast computation that exploits machine characteristics
• Develop software that runs fast by taking the characteristics of the computer into account
– Superior computational methods
– Fast implementation techniques
Advance both together, guided by the machine's characteristics
• What matters for achieving real performance on actual machines:
– Know which operations the machine handles well
– Know the properties of the algorithms and data structures
• Theoretically fast but slow in practice: e.g., Fibonacci heaps
• Theoretically inefficient but fast in practice: e.g., Quicksort
• Graph500: fastest single-node performance on the 7th, 8th, 9th, 10th, and 11th lists
• Green Graph500: most power-efficient performance on the 1st through 6th lists
– Our results occupied ranks 1 through 8
Implementations that apply machine-aware acceleration
Ours: UV300 (32 sockets) … 219 GTEPS; BG/Q (512 nodes) … 172 GTEPS
⇒ Equal or better performance with 1/16 the number of sockets
References

• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a Single NUMA System, IEEE ISC'14, 2014.
• [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based Thread Parallel Breadth-first Search, HPCS 2015, ACM, IEEE, IFIP, 2015.
• [HPGP16] Y. Yasui, K. Fujisawa, Eng Lim Goh, J. Baron, A. Sugiura, and T. Uchiyama: NUMA-aware Scalable Graph Traversal on SGI UV Systems, HPGP'16, ACM, 2016.
• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of the Optimization in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015.

References (shared-memory machines)
References (Graph500 project)
When is memory actually allocated?
• Linux manages memory in page units (4 KB)
– Memory is assigned on first reference (first touch)

  long n = 1000000000;
  double *A = NULL;
  A = malloc(sizeof(double) * n);   /* reserves 10^9 x 8 bytes (7.45 GB) */

  /* initializing every element to 0.0 => pages are actually allocated */
  for (long i = 0; i < n; ++i) {
    A[i] = 0.0;
  }

  /* touching only the first element of each page is enough to allocate it */
  const long stride = getpagesize() / sizeof(double);
  for (long i = 0; i < n; i += stride) {
    A[i] = 0.0;
  }

• Some exceptions
– Small sizes: glibc just hands out memory it has already reserved
– Large sizes: large pages (transparent huge pages; 2 MB) are used
• posix_memalign and mmap behave essentially the same way
※ THP improves performance without any effort from the user
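Combined with thread pinning, first touch gives a simple NUMA placement recipe; a minimal OpenMP sketch (our example: the later compute loops must use the same static schedule for the pages to remain local):

#include <stdlib.h>

int main(void) {
  const long n = 1000000000L;
  double *A = malloc(sizeof(double) * n);

  /* Parallel first touch: each page is placed on the NUMA node of the
   * (pinned) thread that touches it first, so later loops with the same
   * static schedule see mostly node-local pages. */
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < n; ++i)
    A[i] = 0.0;

  /* ... compute loops using the identical schedule(static) partitioning ... */
  free(A);
  return 0;
}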
Virtual memory and huge pages (large pages)
• A data access requires a physical address
• Accessing a wide memory range in a short time causes TLB misses
• Trade-off between page size and convenience
– Larger pages reduce TLB misses
– Larger pages waste more space (allocating 2.00001 MB ⇒ 4 MB with 2 MB pages)

A[i] = 0.0;
(1) &A[i] is a virtual address; the physical address is needed
(2) Check whether the page is registered in the TLB
(3) On a TLB miss, consult RAM (the page table) to obtain the physical address

• The page size is changed automatically (4 KB ⇒ 2 MB)
– Recent Linux enables transparent huge pages (THP) by default
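When THP is not applied automatically, a huge-page hint can be requested per allocation; a minimal Linux sketch (madvise(MADV_HUGEPAGE) is only a hint and may be ignored by the kernel):

#include <stdlib.h>
#include <sys/mman.h>

/* Allocate n doubles aligned to the 2 MB huge-page size and hint the kernel
 * to back the region with transparent huge pages. */
double *alloc_thp(size_t n) {
  void *p = NULL;
  if (posix_memalign(&p, 2UL * 1024 * 1024, n * sizeof(double)) != 0)
    return NULL;
  madvise(p, n * sizeof(double), MADV_HUGEPAGE);  /* hint only */
  return (double *)p;
}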