Fast and Highly Parallel-Efficient Graph Search Considering the Computer Memory Hierarchy
Yuichiro Yasui, Center for Co-Evolutional Social Systems, Kyushu University / JST COI
[email protected] / [email protected]
Construction of Discrete Models Capable of Representing Physical Phenomena, June 11, 2016, 13:40 - 14:10
Fast computation that exploits machine characteristics
• Develop software that runs fast by taking the characteristics of the computer into account
– Superior computational methods
– Fast implementation techniques
Advance both together, guided by the machine's characteristics
• What matters for achieving real performance on actual machines:
– Know which operations the machine handles well
– Know the properties of the algorithms and data structures
• Theoretically fast but slow in practice: e.g., Fibonacci heaps
• Theoretically inefficient but fast in practice: e.g., Quicksort
• Graph500: fastest single-node performance on the 7th, 8th, 9th, 10th, and 11th lists
• Green Graph500: most power-efficient performance on the 1st through 6th lists
– Our results occupied ranks 1 through 8
Implementations that apply machine-aware acceleration
Ours: UV300 (32 sockets) … 219 GTEPS; BG/Q (512 nodes) … 172 GTEPS
⇒ Equal or better performance with 1/16 the number of sockets
Positioning of this research
• Graph algorithm: BFS
• Efficient NUMA-aware BFS algorithm
– Improves the locality of memory accesses
– Targets efficient implementations on many-socket machines (SGI UV 2000, SGI UV 300)
[Figure: a many-socket system, each socket pairing a CPU with its own RAM]
NUMA-aware algorithm design
Huge graph: Kronecker graph w/ SCALE 34, i.e., 17 billion vertices and 275 billion edges
SGI UV 300: 32 sockets of 18-core Xeon and 16 TB RAM
• NUMA / cc-NUMA architecture
[Figure: many-socket system; fast local access vs. slower remote access]
Graph processing for large-scale networks
• Large-scale graphs arise in various fields:
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
– Cyber-security: 15 billion log entries / day
– Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
• Fast and scalable graph processing by using HPC
[Figure: target network sizes on a log-log plot of the number of vertices log2(n) vs. the number of edges log2(m). Plotted instances: USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr, cit-Patents, soc-LiveJournal1, twitter-rv, the Human Brain Project network, and the Graph500 classes Toy, Mini, Small, Medium, Large, and Huge. Reference marks: 1 billion / 1 trillion vertices and edges; Graph500 SCALE 29 on a 4-way Xeon (64 cores); Graph500 SCALE 40 on BlueGene/Q (65,536 nodes) and the K computer (65,536 nodes); the size manageable on a single machine (~512 GB RAM).]
BFS on Twitter follow-ship network
• Follow-ship network
– #users (#vertices): 41,652,230
– follow-ships (#edges): 2,405,026,092

BFS result from user 21,804,357 (this network excludes unconnected users), illustrating the six degrees of separation:

Lv.        #users   ratio (%)   percentile (%)
 0              1        0.00         0.00
 1              7        0.00         0.00
 2          6,188        0.01         0.01
 3        510,515        1.23         1.24
 4     29,526,508       70.89        72.13
 5     11,314,238       27.16        99.29
 6        282,456        0.68        99.97
 7         11,536        0.03       100.00
 8            673        0.00       100.00
 9             68        0.00       100.00
10             19        0.00       100.00
11             10        0.00       100.00
12              5        0.00       100.00
13              2        0.00       100.00
14              2        0.00       100.00
15              2        0.00       100.00
Total  41,652,230      100.00          -

Our implementation computes a BFS on this network in 60 ms.
Betweenness centrality (BC)

C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}

where \sigma_{st} is the number of shortest (s, t)-paths, and \sigma_{st}(v) is the number of shortest (s, t)-paths passing through vertex v.
Osaka road network: 13,076 vertices and 40,528 edges
• BC requires #vertices BFS runs, because one BFS obtains the one-to-all shortest paths
• Computes an importance score for each vertex and edge using all-to-all shortest paths (breadth-first search), without vertex coordinates
[Figure: Osaka road network around Osaka station, colored by importance (low → high); high-importance edges correspond to highways and bridges]
Our software "NETAL" can solve BC for the Osaka road network within one second.
Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama, and M. Takamatsu: NETAL: High-performance Implementation of Network Analysis Library Considering Computer Memory Hierarchy, JORSJ, Vol. 54-4, 2011.
Graph500 and Green Graph500
• New benchmarks based on graph processing (breadth-first search)
• Measure the performance and energy efficiency of irregular memory access
– Graph500 benchmark: TEPS score (# of traversed edges per second), measuring the performance of irregular memory accesses
– Green Graph500 benchmark: TEPS-per-watt score, measuring power-efficient performance
Benchmark flow: 1. Generation → 2. Construction → 3. BFS × 64
– Input parameters: SCALE and edgefactor (= 16)
– Graph generation, then graph construction
– 64 iterations of BFS + validation
– Results: BFS time, traversed edges, TEPS; the reported score is the median TEPS of the 64 runs
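As a worked example of the score (our arithmetic, not a benchmark result): TEPS = traversed edges / BFS time, so a BFS that traverses 2 × 10^9 edges in 20 ms scores 2 × 10^9 / 0.02 s = 100 GTEPS.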
Kronecker graph with 2^SCALE vertices and 2^SCALE × edgefactor edges, generated by applying the recursive Kronecker product SCALE times
[Figure: successive Kronecker products G1, G2, G3, G4]
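To make the generation step concrete, here is a minimal C sketch of a Kronecker (R-MAT-style) edge generator; the initiator probabilities (0.57, 0.19, 0.19, 0.05) follow the Graph500 specification, while the function name kronecker_edge and the use of plain rand() are our simplifications:

#include <stdio.h>
#include <stdlib.h>

/* Draw one edge of a SCALE-level Kronecker graph: at each of the SCALE
 * recursion levels, pick one quadrant of the adjacency matrix with the
 * Graph500 initiator probabilities A=0.57, B=0.19, C=0.19, D=0.05. */
static void kronecker_edge(int scale, long *u, long *v) {
  long i = 0, j = 0;
  for (int level = 0; level < scale; ++level) {
    const double r = (double)rand() / RAND_MAX;
    i <<= 1;
    j <<= 1;
    if (r < 0.57) {            /* quadrant A: (0,0) */
    } else if (r < 0.76) {     /* quadrant B: (0,1) */
      j |= 1;
    } else if (r < 0.95) {     /* quadrant C: (1,0) */
      i |= 1;
    } else {                   /* quadrant D: (1,1) */
      i |= 1;
      j |= 1;
    }
  }
  *u = i;
  *v = j;
}

int main(void) {
  const int scale = 10, edgefactor = 16;
  const long m = (1L << scale) * edgefactor;  /* 2^SCALE x edgefactor edges */
  for (long k = 0; k < m; ++k) {
    long u, v;
    kronecker_edge(scale, &u, &v);
    printf("%ld %ld\n", u, v);
  }
  return 0;
}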
For the Green Graph500, the power consumption (in watts) is measured during the 64 BFS runs; the score is the median of the 64 TEPS values divided by the power consumption (TEPS per watt).
Level-synchronized parallel BFS (top-down)
• Starts from the source vertex and executes the following two phases at each level:
– Traversal phase: finds the neighbors QN of the current frontier QF
– Swap phase: exchanges the frontier QF and the neighbors QN for the next level
[Figure: expansion from the source (Level 0) through Levels 1-3; at each level the frontier QF is traversed to collect the unvisited neighbors into QN, a synchronization follows, and the queues are swapped.]
Since there is one parent for each vertex v ∈ V, the output represents a tree with the root vertex s ∈ V.

Algorithm 1 is a fundamental parallel algorithm for BFS. It requires a synchronization at each level, where a level is a certain number of hops away from the source; we call this the level-synchronized parallel BFS [7]. Each traversal explores all outgoing edges of the current frontier, which is the set of vertices discovered at this level, and finds their neighbors, which is the set of unvisited vertices at the next level. We can describe this algorithm using a frontier queue QF and a neighbor queue QN, because unvisited vertices w are appended to the neighbor queue QN for each frontier-queue vertex v ∈ QF in parallel, with exclusive control, at each level (Algorithm 1, lines 7-12), as follows:

Q_N \leftarrow \{\, w \in A(v) \mid w \notin \mathrm{visited},\ v \in Q_F \,\}. \quad (1)

Algorithm 1: Level-synchronized parallel BFS.
Input    : G = (V, A) : directed graph. s : source vertex.
Variables: QF : frontier queue. QN : neighbor queue. visited : vertices already visited.
Output   : π(v) : predecessor map of BFS tree.
 1  π(v) ← -1, ∀v ∈ V
 2  π(s) ← s
 3  visited ← {s}
 4  QF ← {s}
 5  QN ← ∅
 6  while QF ≠ ∅ do
 7    for v ∈ QF in parallel do
 8      for w ∈ A(v) do
 9        if w ∉ visited (atomic) then
10          π(w) ← v
11          visited ← visited ∪ {w}
12          QN ← QN ∪ {w}
13    QF ← QN
14    QN ← ∅

B. Hybrid BFS (direction-optimized BFS)

The main runtime bottleneck of the level-synchronized parallel BFS (Algorithm 1) is the exploration of all outgoing edges of the current frontier (lines 7-12). Beamer et al. [9], [10] proposed a hybrid BFS algorithm (Algorithm 2) that reduces the number of edges explored. This algorithm combines two different traversal kernels: top-down (Algorithm 3) and bottom-up (Algorithm 4). Like the level-synchronized parallel BFS, the top-down kernel traverses neighbors of the frontier. Conversely, the bottom-up kernel finds the frontier from vertices in candidate neighbors. In other words, a top-down method finds the children from the parent, whereas a bottom-up method finds the parent from the children. For a large frontier, the bottom-up approach reduces the number of edges explored, because its traversal kernel terminates once a single parent is found (Algorithm 4, lines 3-8).

Algorithm 2: Hybrid BFS algorithm of Beamer et al.
Input    : G = (V, AF, AB) : directed graph. s : source vertex.
Variables: QF : frontier queue. QN : neighbor queue. visited : vertices already visited.
Output   : π(v) : predecessor map of BFS tree.
 1  π(v) ← -1, ∀v ∈ V
 2  visited ← {s}
 3  QF ← {s}
 4  QN ← ∅
 5  while QF ≠ ∅ do
 6    if is_TopDown(QF, QN, visited) then
 7      QN ← Top-down(G, QF, visited, π)
 8    else
 9      QN ← Bottom-up(G, QF, visited, π)
10    QF ← QN

Algorithm 3: Top-down BFS.
Input : G = (V, AF) : directed graph. QF : frontier queue. visited : vertices already visited. π : predecessor map of BFS tree.
Output: QN : neighbor queue.
 1  QN ← ∅
 2  for v ∈ QF in parallel do
 3    for w ∈ AF(v) do
 4      if w ∉ visited (atomic) then
 5        π(w) ← v
 6        visited ← visited ∪ {w}
 7        QN ← QN ∪ {w}

Algorithm 4: Bottom-up BFS.
Input : G = (V, AB) : directed graph. QF : frontier queue. visited : vertices already visited. π : predecessor map of BFS tree.
Output: QN : neighbor queue.
 1  QN ← ∅
 2  for w ∈ V \ visited in parallel do
 3    for v ∈ AB(w) do
 4      if v ∈ QF then
 5        π(w) ← v
 6        visited ← visited ∪ {w}
 7        QN ← QN ∪ {w}
 8        break
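As a concrete illustration of Algorithm 1, the following is a minimal C/OpenMP sketch for a CSR graph (xadj[]/adjncy[] arrays); the simplified queue handling through one shared atomic counter is our assumption, not the paper's implementation (real codes use per-thread buffers):

#include <stdlib.h>

/* Level-synchronized parallel top-down BFS (Algorithm 1) on a CSR graph.
 * xadj[v]..xadj[v+1] indexes the adjacency list A(v) inside adjncy[]. */
void bfs_top_down(long n, const long *xadj, const long *adjncy,
                  long s, long *pi) {
  long *QF = malloc(n * sizeof(long));      /* frontier queue */
  long *QN = malloc(n * sizeof(long));      /* neighbor queue */
  unsigned char *visited = calloc(n, 1);    /* visited flags  */
  for (long v = 0; v < n; ++v) pi[v] = -1;
  pi[s] = s;
  visited[s] = 1;
  QF[0] = s;
  long qf_len = 1;

  while (qf_len > 0) {                      /* one iteration = one level */
    long qn_len = 0;
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < qf_len; ++i) {
      const long v = QF[i];
      for (long e = xadj[v]; e < xadj[v + 1]; ++e) {
        const long w = adjncy[e];
        /* "if w not in visited (atomic)": claim w exactly once */
        if (!visited[w] && !__sync_lock_test_and_set(&visited[w], 1)) {
          pi[w] = v;
          long pos;
          #pragma omp atomic capture
          pos = qn_len++;
          QN[pos] = w;
        }
      }
    }
    /* the implicit barrier of the parallel loop is the level sync */
    long *tmp = QF; QF = QN; QN = tmp;      /* swap QF and QN */
    qf_len = qn_len;
  }
  free(QF); free(QN); free(visited);
}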
[Figure: top-down algorithm vs. bottom-up algorithm; in both, the frontier is at Level k and the neighbors at Level k+1, and the search switches between the two directions]
[Beamer,SC12] S. Beamer et al.: Direction-optimizing breadth-first search. SC12.
• Top-down: from the frontier of the visited set, explores the adjacent unvisited vertices
– All adjacent vertices must be checked
– Exclusive control (atomics) is required
• Bottom-up: from the (not necessarily adjacent) unvisited vertices, searches for the frontier of the visited set
– The loop can terminate as soon as a single frontier vertex is found
– No exclusive control is required
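A corresponding C/OpenMP sketch of one bottom-up step (Algorithm 4) over bitmap queues; the bitmap layout and the atomic OR on the output queue are our assumptions:

#include <stdint.h>

/* One bottom-up level: every unvisited vertex w scans its incoming lists
 * AB(w) (CSR arrays xadj/adjncy) and adopts the first parent found in the
 * frontier bitmap QF. pi[] and visited[] need no atomics because each w is
 * written by exactly one thread; the output bitmap QN uses an atomic OR
 * because 64 vertices share one word. */
void bfs_bottom_up_step(long n, const long *xadj, const long *adjncy,
                        const uint64_t *QF, uint64_t *QN,
                        unsigned char *visited, long *pi) {
  #pragma omp parallel for schedule(dynamic, 1024)
  for (long w = 0; w < n; ++w) {
    if (visited[w]) continue;
    for (long e = xadj[w]; e < xadj[w + 1]; ++e) {
      const long v = adjncy[e];
      if (QF[v >> 6] & (1ULL << (v & 63))) {   /* is v in the frontier? */
        pi[w] = v;
        visited[w] = 1;
        __sync_fetch_and_or(&QN[w >> 6], 1ULL << (w & 63));
        break;                                 /* one parent is enough  */
      }
    }
  }
}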
Direction-optimizing BFS [Beamer, SC12]
[Figure: in the top-down direction the frontier at Level k is expanded to its neighbors at Level k+1; in the bottom-up direction the candidate neighbors at Level k+1 search for the frontier at Level k]
Direction-optimizing BFS [Beamer, SC12]
• Top-down direction: uses the outgoing edges
• Bottom-up direction: uses the incoming edges
Chooses either the top-down or the bottom-up direction at each level
Forward search (top-down) and backward search (bottom-up) for BFS; traversed edges per level:

Level         Top-down       Bottom-up       Hybrid
0                    2   2,103,840,895            2
1               66,206   1,766,587,029       66,206
2          346,918,235      52,677,691   52,677,691
3        1,727,195,615      12,820,854   12,820,854
4           29,557,400         103,184      103,184
5               82,357          21,467       21,467
6                  221          21,240          227
Total    2,103,820,036   3,936,072,360   65,689,631
Ratio          100.00%         187.09%        3.12%
[Figure: frontier size vs. distance from the source. The direction-optimizing BFS runs top-down on the small frontiers at the start and end, and bottom-up on the large frontier in between.]
DGEMM (double-precision matrix-matrix multiplication)
Product of n × n matrices: O(n^3) operations; the operation count is the same, yet performance differs by implementation (up to roughly 300×):

                                        n=1,000   n=2,000
Triple loop                              6.07 s   60.19 s
Classical tuning (loop blocking)         1.99 s   16.78 s
BLAS library GotoBLAS2 (1 thread)        0.03 s    1.38 s
BLAS library GotoBLAS2 (8 threads)                 0.20 s

@ Intel Xeon X5460 3.16 GHz / GCC 4.1.2

Matrix-multiplication code example (C):
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      C[i][j] += A[i][k] * B[k][j];

[Figure: access patterns of C (indices i, j), A (i, k), and B (k, j)]
Triple loop vs. blocking: with blocking, the access ranges of matrices A and B are restricted to BLOCK_SIZE-sized tiles so that the working set stays in cache. See the sketch below.
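A minimal sketch of the blocked variant described above; BLOCK_SIZE is a tunable assumption (it should be chosen so that three BLOCK_SIZE × BLOCK_SIZE tiles fit in cache):

#define BLOCK_SIZE 64  /* assumption: tune so three tiles fit in cache */

/* Cache-blocked triple loop: the inner three loops touch only one
 * BLOCK_SIZE x BLOCK_SIZE tile of C, A, and B at a time. */
void dgemm_blocked(int n, double **A, double **B, double **C) {
  for (int ib = 0; ib < n; ib += BLOCK_SIZE)
    for (int jb = 0; jb < n; jb += BLOCK_SIZE)
      for (int kb = 0; kb < n; kb += BLOCK_SIZE)
        for (int i = ib; i < ib + BLOCK_SIZE && i < n; i++)
          for (int j = jb; j < jb + BLOCK_SIZE && j < n; j++)
            for (int k = kb; k < kb + BLOCK_SIZE && k < n; k++)
              C[i][j] += A[i][k] * B[k][j];
}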
Computer memory hierarchy
• Trade-off between access speed and capacity
• Exploiting the upper levels of the hierarchy well is extremely important
• Algorithms bottlenecked by data access can potentially run hundreds of times faster
Memory hierarchy, with typical capacities and speeds of the Pentium 4 generation:

Level                          Capacity    Access speed (GB/s)
Registers                      < 1 KB      50-500
Cache (L1, L2, TLB, L3)        < 30 MB     5-20
RAM                            < 512 GB    2.5-10
NV-RAM (SSD), a new tier       (< 1 TB)    (6 GB/s)
Storage                        > 1 TB      0.05-0.5

J. L. Hennessy, D. A. Patterson: Computer Architecture, A Quantitative Approach.
What happens on a data access?
• All arithmetic is performed on registers
– The memory hierarchy is checked from the top down for the data
– A cache hit is very fast; a cache miss requires an access to main memory
[Figure: register → L1 → L2 → memory lookup; each cache level adds only a few times the cost on a hit, while a cache miss (main-memory access) costs hundreds of times more]
• Memory management uses virtual addresses
– A data access needs the physical address (for the main-memory access)
– The TLB is the translation table between virtual and physical addresses
– It overflows when accesses span a wide area or address computations are heavy
⇒ Techniques that raise the temporal and spatial locality of data accesses are highly effective (a small illustration follows).
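A small self-contained sketch of why locality matters; the exact ratio is machine-dependent (this is our example, not a measurement from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sums the same array twice: once sequentially (cache- and TLB-friendly)
 * and once with a page-sized stride (roughly one TLB/cache miss per load). */
int main(void) {
  const long n = 1L << 26;                    /* 2^26 doubles = 512 MB */
  const long stride = 4096 / sizeof(double);  /* one page per step     */
  double *a = calloc(n, sizeof(double));
  double s = 0.0;

  clock_t t0 = clock();
  for (long i = 0; i < n; ++i) s += a[i];     /* sequential walk   */
  clock_t t1 = clock();
  for (long j = 0; j < stride; ++j)           /* page-strided walk */
    for (long i = j; i < n; i += stride) s += a[i];
  clock_t t2 = clock();

  printf("sequential %.2f s, strided %.2f s (sum = %g)\n",
         (double)(t1 - t0) / CLOCKS_PER_SEC,
         (double)(t2 - t1) / CLOCKS_PER_SEC, s);
  free(a);
  return 0;
}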
NUMA system
• NUMA = non-uniform memory access
[Figure: four NUMA nodes, each an 8-core Xeon E5-4640 (per-core L2 caches, shared L3 cache) with its own RAM; local memory accesses vs. remote (non-local) memory accesses]
• 4-way Intel Xeon E5-4640 (Sandy Bridge-EP)
– 4 CPU sockets
– 8 physical cores per socket
– 2 threads per core
⇒ 4 × 8 × 2 = 64 threads max.
Memory bandwidth on a NUMA system

Measured bandwidth (GB/s), target NUMA node (rows) × source NUMA node (columns):

          src 0   src 1   src 2   src 3
node 0     24.2     3.4     3.0     3.4
node 1      3.3    23.9     3.5     3.0
node 2      3.0     3.4    24.3     3.4
node 3      3.5     3.0     3.4    24.2

Local access: ~24 GB/s; remote access: ~3 GB/s.
[Figure: NUMA system (4 CPUs); threads whose data sits on the same node get fast local access, while threads whose data sits on another node get slow non-local access over different distances]
(Example) 4-socket Xeon system
• 4 CPU sockets
• 8 physical cores per socket
• 2 threads per core
Our Graph500 implementation
• Efficient graph data structures
– NUMA-aware graph (BD13): a 1-D partitioned CSR graph A0-A3, each part bound to a NUMA node
– Vertex sorting (HPCS15) and adjacency-list sorting by outdegree (ISC14)
• Efficient BFS implementation (improving the algorithm of Beamer et al.)
– Top-down direction: Agarwal's socket-queue top-down (SC10), plus pruning of remote edges (HPGP16) to reduce remote traffic
– Bottom-up direction: NUMA-aware bottom-up (BD13); each NUMA node computes its output NQk from the input CQ and its local data VSk
Our implementations: performance progression in GTEPS

Venue         Technique                                        GTEPS   Speedup
2011          Reference code (level-synchronized top-down)     87 M    ×1
SC10          NUMA-aware top-down                              800 M   ×9
SC12          Direction-optimizing (top-down + bottom-up)      5 G     ×58
BigData13     NUMA-optimized (NUMA-aware bottom-up)            11 G    ×125
ISC14         NUMA-optimized + degree-aware                    29 G    ×334
G500, ISC14   NUMA-optimized + degree-aware + vertex sorting   42 G    ×489

Ours: NUMA-optimized + direction-optimizing BFS; our latest version is 489× faster than the reference code.
• Results obtained on 4 Xeon processors.
Ours: NUMA-aware 1-D partitioned graph [BD13]
• Divides the graph into subgraphs and assigns one to each NUMA node
[Figure: the adjacency matrix is divided into 1-D row blocks A0-A3; block Ak is assigned to the CPU and RAM of NUMA node k]
• Each subgraph is represented as a CSR graph
• The top-down direction uses the inverse of G (G is undirected)
• In the bottom-up direction (the bottleneck component), each NUMA node computes a partial output NQ using a locally copied frontier CQ and its locally assigned visited set VSk, so all accesses are node-local
• The top-down direction is a modified version of Agarwal's NUMA-aware BFS: the frontier CQk and visited data VSk are local, while writes to the output neighbors NQ may be remote
A data-structure sketch follows.
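A minimal data-structure sketch of one per-NUMA-node CSR subgraph (the names are ours; the actual implementation binds the arrays to the node's RAM, e.g. by allocating and first-touching them from a thread pinned to that node):

#include <stdlib.h>

/* NUMA node k owns the contiguous vertex range [lo, hi) and stores only
 * the adjacency lists of those vertices, i.e., one row block A_k of the
 * 1-D partitioned adjacency matrix. */
struct csr_part {
  long lo, hi;    /* owned vertex range [lo, hi)               */
  long *xadj;     /* hi - lo + 1 offsets into adjncy[]         */
  long *adjncy;   /* adjacency of owned vertices (nnz entries) */
};

struct csr_part csr_part_alloc(long lo, long hi, long nnz) {
  struct csr_part p = { lo, hi, NULL, NULL };
  /* call from a thread pinned to node k so first touch places the
   * arrays in node-local RAM */
  p.xadj   = malloc((hi - lo + 1) * sizeof(long));
  p.adjncy = malloc(nnz * sizeof(long));
  return p;
}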
Ours: Adjacency list sorting [ISC14]
• Reduces unnecessary edge traversals in the bottom-up direction (loop count τ): the bottom-up scan of an adjacency list breaks as soon as it finds a frontier vertex
[Figure: adjacency lists A(va), A(vb); the adjacent vertices before the break are traversed, the rest are skipped]
• Sorts the adjacency list of each vertex vi by the outdegree of the adjacent vertices, from high to low, so the break tends to happen early; see the sketch below
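A minimal C sketch of the sorting step (the CSR arrays and an outdegree[] table are assumed to exist; the file-scope pointer is only there to fit plain qsort):

#include <stdlib.h>

static const long *g_outdegree;  /* outdegree table used by the comparator */

/* Descending order of outdegree: high-degree adjacent vertices first. */
static int by_outdegree_desc(const void *a, const void *b) {
  const long da = g_outdegree[*(const long *)a];
  const long db = g_outdegree[*(const long *)b];
  return (da < db) - (da > db);
}

/* Sort each adjacency list A(v) = adjncy[xadj[v] .. xadj[v+1]) by the
 * outdegree of the adjacent vertices. */
void sort_adjacency_lists(long n, const long *xadj, long *adjncy,
                          const long *outdegree) {
  g_outdegree = outdegree;
  for (long v = 0; v < n; ++v)
    qsort(&adjncy[xadj[v]], xadj[v + 1] - xadj[v],
          sizeof(long), by_outdegree_desc);
}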
Ours: Vertex sorting [HPCS15]
• The number of traversals of a vertex equals the outdegree of the corresponding vertex
• Our vertex sorting reorders the vertex indices by outdegree: access frequency and outdegree are correlated
[Figure: degree distribution and access frequency with vertex sorting; the highest-outdegree vertices receive the smallest indices, so most accesses concentrate on small-index vertices]
NUMA-aware top-down BFS
• The original version was proposed by Agarwal [Agarwal-SC10]
• Reduces random remote accesses by using socket-queues
• On ℓ sockets, the frontier's edges split roughly as Local : Remote = 1 : ℓ (e.g., viewed from NUMA 2)
• Phase 1 (CQ ⇒ NQ or socket-queue): each node traverses its local CQ, appends unvisited local vertices to its NQ, and forwards remote edges to the socket-queue of the owning node; then synchronize
• Phase 2 (socket-queue ⇒ NQ): each node appends the unvisited vertices found in its socket-queue to its NQ; then synchronize, swap CQ and NQ, and proceed to the next level
NUMA-aware top-down BFS with pruning of remote edges
• Prunes remote edges to reduce remote accesses (e.g., focusing on the remote-edge traversal on NUMA 2)
• Without pruning (original, from Agarwal's SC10 paper): each NUMA node appends all remote edges (v, w) to the corresponding socket-queue (a vector queue)
• With pruning (this paper): each NUMA node appends a remote edge (v, w) to the corresponding socket-queue only if the bitmap F does not contain w, and then adds w to F
– F reuses the CQ bitmap of the bottom-up direction and is not reinitialized while the search direction does not change
⇒ Each vertex is searched only once; a sketch of the pruning test follows
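A minimal sketch of the pruning test against the bitmap F (our formulation of the rule above):

#include <stdint.h>
#include <stdbool.h>

/* Returns true if the remote edge (v, w) should be pruned: w's bit in F is
 * already set, so w was forwarded to its socket-queue before. Otherwise
 * marks w in F and returns false, i.e., forward the edge exactly once. */
static inline bool prune_remote_edge(uint64_t *F, long w) {
  const uint64_t mask = 1ULL << (w & 63);
  if (F[w >> 6] & mask)
    return true;                            /* already forwarded: prune */
  __sync_fetch_and_or(&F[w >> 6], mask);    /* mark w as forwarded      */
  return false;
}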
SGI UV 2000
• UV 2000
– Single OS: SUSE Linux 11 (x86_64)
– Hypercube interconnect (NUMAlink6, 6.7 GB/s × 4)
– Up to 2,560 cores and 64 TB RAM (= 128 UV 2000 chassis × 2 sockets × 10 cores)
• Hierarchical network topology: sockets, chassis (= 2 sockets), cubes (= 8 chassis), and inner/outer racks (rack = 32 nodes); the hierarchy beyond the NUMA node cannot be detected
• ISM has two full-spec UV 2000 systems (sites: ISM, Kyushu U.)
Weak scaling with SCALE 26 per NUMA node
• Fast, scalable, and energy-efficient; fastest single-node entry
[Figure: GTEPS vs. SCALE 26-33 (1-128 NUMA nodes); scaling continues up to 128 CPUs (1,280 threads)]
– 9th & 10th Graph500 lists: SCALE 33 (8.6 billion vertices, 137.4 billion edges), 174 GTEPS and 133 GTEPS
– 3rd, 4th & 5th Green Graph500 lists: most energy-efficient among commercial supercomputers
– Scalable up to 1,280 threads
SGI UV 300
• UV 300
– Single OS: SUSE Linux 11 (x86_64)
– All-to-all interconnect
– Up to 1,152 cores and 16 TB RAM (= 8 UV 300 chassis × 4 sockets × 18 cores × 2 SMT)
• UV 300 chassis
– 4-socket 18-core Intel Xeon E7-8867 (Haswell), HT enabled (2 SMT)
– 2 TB RAM (512 GB per NUMA node)
[Figure: the UV 300 rack at Kyushu U. consists of 8 chassis; within each chassis the 4 sockets are connected all-to-all]
Weak scaling performance: comparison of UV300 and UV2000
[Figure: GTEPS vs. number of NUMA nodes (CPU sockets), 1-64. Series: UV300 (HT, THP, local; SCALE 29 per socket), UV300 (HT, THP, remote; SCALE 29 per socket), UV300 (HT, remote; SCALE 29 and SCALE 27 per socket), UV2000 (SCALE 27 per socket). The best UV300 series reaches 18.7, 32.5, 64.7, 100.3, 161.5, and 219.4 GTEPS; the UV2000 series reaches 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, and 152.2 GTEPS.]
⇒ A new single-node performance record
New result and the Nov. 2015 list: updating the fastest single-node entry
• Ours, fastest single node: SGI UV300 (1 node / 576 cores; HT enabled, THP enabled, local-ref. mode), SCALE 34, 219 GTEPS
• SGI UV2000 (1,280 cores): SCALE 33, 174.7 GTEPS
• SGI UV2000 (1,280 cores): SCALE 33, 149.8 GTEPS
Graph500 performance (in TEPS): weak scaling
[Figure: GTEPS (log scale) vs. number of sockets (log scale). Series: SGI UV 2000 (SCALE 26 per CPU socket), IBM BG/Q (SCALE 24 per CPU socket), HP SuperdomeX (480 threads), SGI UV 300 (SCALE 29 per CPU socket; remote and local).]

System          #CPUs (#threads)   SCALE   HT   THP   GTEPS
SGI UV300       16 (576)           33      ✔    ✔     162
HP SuperdomeX   16 (480)           33      ✔    ✔     128

System          #CPUs   SCALE   GTEPS
SGI UV300       32      33      219
SGI UV2000      128     33      173
IBM BG/Q        512     33      172

Memory usage: 4 TB
UV300 and UV2000
• Bandwidth (GB/s) measured with STREAM TRIAD, plotted by thread placement × memory placement
• Local access is clearly faster than remote access
– UV 2000 (64 sockets): each chassis (2 sockets) is connected in a hypercube; local 33 GB/s, remote 3-7 GB/s
– UV 300 (32 sockets): each chassis (4 sockets) is connected all-to-all; local 56 GB/s, within-chassis 12-14 GB/s, remote 6 GB/s
STREAM benchmark with numactl
• STREAM is a popular benchmark for measuring memory bandwidth using vector arithmetic
• numactl is a Linux command for CPU and memory binding

Download and compile the source code:

  $ wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
  $ icc -O2 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 -o stream stream.c

Execute the STREAM benchmark with numactl:

  SOCKETS=`seq 0 31`
  THREADS=36
  for i in $SOCKETS; do
    for j in $SOCKETS; do
      OMP_NUM_THREADS=$THREADS \
        numactl --cpunodebind=$i --membind=$j ./stream
    done
  done

⇒ 36 threads on socket $i compute on the vector data placed on socket $j
Our achievements in Graph500
• K computer, a distributed massively parallel supercomputer, won the 8th, 9th, 10th, and 11th (current) Graph500 lists (#1)
• Our UV2000, a distributed shared-memory supercomputer, ranked fastest among single-node entries on the 7th, 8th, 9th, 10th, and 11th Graph500 lists
[Figure: GTEPS (log scale) across the 1st (Nov. 2010) through 11th (Nov. 2015) lists, comparing the top entry, the K computer, TSUBAME 2.5 (2.0), FX10, SGI UV2000, TSUBAME-KFC, and a 4-way Xeon server (CPU and GPU versions). The top entry grew from 7 to 38,621 GTEPS; our single-node results grew from 8.2 to 174.7 GTEPS on SGI UV2000, with the GPU version reaching 104.3 GTEPS.]
ULIBC development policy
• Provides APIs for obtaining hardware (topology) information and for placement control
– Usable with OpenMP, Pthreads, etc.
– For thread pinning alone, just link ULIBC and add ULIBC_init();
• Emphasizes generality, so it can be combined with existing frameworks
– Supported compilers: GCC, Intel Compiler, SunCC, XLC
– Supported tools: numactl, Intel thread affinity interface, PBS
NUMA (non-uniform memory access)
[Figure: two NUMA nodes, each an 8-core Xeon E5-4640 with per-core L2 caches, a shared L3 cache, and local RAM. NUMA-aware execution pins threads (T) and their data (D) to the same node so that each thread accesses local memory.]
[Figure: a NUMA node = CPU + local RAM; memory on other nodes is remote memory]
ULIBC affinity
1. Detects the entire topology (e.g., sockets 0-3, each with its RAM 0-3)
2. Detects the online topology, i.e., the subset actually available, e.g. when restricted by:
   numactl --cpunodebind=1,2 --membind=1,2
3. Constructs two types of affinities
Compact-type affinity assigns threads in positions close to one another (e.g., with 4 threads on 2 nodes: threads 0-1 on NUMA node 0 and threads 2-3 on NUMA node 1, each using its local RAM):
  export ULIBC_AFFINITY=compact:fine
  export OMP_NUM_THREADS=7

Scatter-type affinity distributes the threads as evenly as possible across the online processors (e.g., threads 0 and 2 on NUMA node 0; threads 1 and 3 on NUMA node 1):
  export ULIBC_AFFINITY=scatter:fine
  export OMP_NUM_THREADS=7
Building and installing ULIBC
• Obtain ULIBC:
  git clone https://bitbucket.org/yuichiro_yasui/ulibc.git
• Build and install on Linux:
  make && make install
– Compile-time option: make CC=icc                        # Intel compiler
– Install-time option: make PREFIX=${HOME}/local install  # install directory
• Check after installation:
  $ ls ~/local/*
  /home/yasui/local/include:
  omp_helpers.h ulibc.h          <- header file ulibc.h
  /home/yasui/local/lib:
  libulibc.a libulibc.so*        <- static (.a) and shared (.so) libraries
Example 1: where the threads are placed

  #include <stdio.h>
  #include <omp.h>
  #include <ulibc.h>

  int main(void) {
    /* initialize ULIBC variables */
    ULIBC_init();

    /* OpenMP region */
    _Pragma("omp parallel") {
      const int tid = ULIBC_get_thread_num();                    /* get thread ID     */
      const struct numainfo_t loc = ULIBC_get_numainfo( tid );   /* get NUMA location */
      printf("Thread: %2d, NUMA: node %d, core %d\n",
             loc.id, loc.node, loc.core);
    }
    return 0;
  }

  struct numainfo_t {
    int id;    /* Thread ID */
    int proc;  /* Processor ID */
    int node;  /* NUMA node ID */
    int core;  /* NUMA core ID */
    int lnp;   /* number of cores in the same node */
  };

  $ ULIBC_AFFINITY=scatter:fine \
    OMP_NUM_THREADS=4 ./minimal1
  Thread: 3, NUMA: node 3, core 0
  Thread: 1, NUMA: node 1, core 0
  Thread: 0, NUMA: node 0, core 0
  Thread: 2, NUMA: node 2, core 0

  $ ULIBC_AFFINITY=compact:fine \
    OMP_NUM_THREADS=4 ./minimal1
  Thread: 2, NUMA: node 0, core 2
  Thread: 1, NUMA: node 0, core 1
  Thread: 0, NUMA: node 0, core 0
  Thread: 3, NUMA: node 0, core 3
Example 2: which processors the threads are bound to

  #include <stdio.h>
  #include <omp.h>
  #include <ulibc.h>

  int main(void) {
    /* initialize ULIBC variables */
    ULIBC_init();

    /* OpenMP region */
    _Pragma("omp parallel") {
      const int tid = ULIBC_get_thread_num();                        /* get thread ID          */
      const struct numainfo_t loc = ULIBC_get_numainfo( tid );       /* get NUMA location      */
      const struct cpuinfo_t proc = ULIBC_get_cpuinfo( loc.proc );   /* get processor location */
      printf("Thread: %2d, NUMA: node %d, core %d "
             "(Proc: %2d, Socket: %2d, Core: %2d, SMT: %2d)\n",
             loc.id, loc.node, loc.core,
             proc.id, proc.node, proc.core, proc.smt);
    }
    return 0;
  }

  struct numainfo_t {
    int id;    /* Thread ID */
    int proc;  /* Processor ID */
    int node;  /* NUMA node ID */
    int core;  /* NUMA core ID */
    int lnp;   /* number of cores in the same node */
  };

  struct cpuinfo_t {
    int id;    /* Processor ID */
    int node;  /* Package ID */
    int core;  /* Core ID */
    int smt;   /* SMT ID */
  };

  $ ULIBC_AFFINITY=scatter:fine OMP_NUM_THREADS=4 ./minimal2
  Thread: 3, NUMA: node 0, core 3 (Proc: 24, Socket: 3, Core: 3, SMT: 0)
  Thread: 1, NUMA: node 0, core 1 (Proc:  8, Socket: 1, Core: 1, SMT: 0)
  Thread: 0, NUMA: node 0, core 0 (Proc:  0, Socket: 0, Core: 0, SMT: 0)
  Thread: 2, NUMA: node 0, core 2 (Proc: 16, Socket: 2, Core: 2, SMT: 0)
Exercise: the daxpyi operation
• A vector operation that adds alpha times x[] into y[]
• However, only the entries of y[] whose indices are stored in indx[] are accessed

  void daxpyi_naive(size_t n, double alpha,
                    const double *x, const int *indx, double *y) {
    size_t k;
    for (k = 0; k < n; ++k) {
      y[ indx[k] ] += alpha * x[k];
    }
  }

• Experimental environment
– 4-socket 8-core Xeon server (Hyper-Threading enabled)
– n = 10000, m = 1000000, x[n] = {0}, y[m] = {0}
– indx[n] = {0, 100, 200, 300, …}
• Is this code easy to speed up?
– Each iteration is independent, so it is trivially parallelizable ⇒ parallelizable
– Large data movement and no possibility of data reuse ⇒ can parallel computation actually improve performance?
OpenMP parallelization of daxpyi: speedup from thread parallelism
[Figure: speedup relative to MKL vs. number of threads (1-64) for mkl, seq, and omp. MKL and the naive sequential version perform almost identically; the OpenMP-parallel version degrades as the thread count grows.]

  void daxpyi_omp(size_t n, double alpha,
                  const double *x, const int *indx, double *y) {
    _Pragma("omp parallel for")
    for (size_t k = 0; k < n; ++k) {
      y[ indx[k] ] += alpha * x[k];
    }
  }
daxpyi w/ ULIBC on a 4-socket Xeon
• Sparse vector operation daxpyi … y[ indx[k] ] += alpha * x[k]
– x[10000], y[1000000], indx[10000] (indx[] stores element indices into y[])
[Figure: speedup vs. number of threads (1-64) for mkl, seq, and omp under compact-type affinity (threads placed close together) and scatter-type affinity (distributed placement).]
• With thread pinning, the OpenMP version's performance loss at high thread counts is mitigated
• Compact-type affinity shows the highest performance (up to ×6.47)
• Scatter-type affinity performs close to the default placement
How should we interpret the performance?
• Performance drops when threads cross sockets ⇒ thread and data placement is the problem
– Compare configurations as (number of sockets) × (threads per socket)
[Figure: intra-socket configurations 1×2, 1×4, 1×8, 1×16 (HT) vs. inter-socket configurations 2×1, 4×1, 4×2, 4×4, 4×8, 2×16 (HT), 4×16 (HT), under compact and scatter placement.]
• Do not let a computation straddle sockets; HT brings little benefit (best case ×6.47)
Fast computation that exploits machine characteristics
• Develop software that runs fast by taking the characteristics of the computer into account
– Superior computational methods
– Fast implementation techniques
Advance both together, guided by the machine's characteristics
• What matters for achieving real performance on actual machines:
– Know which operations the machine handles well
– Know the properties of the algorithms and data structures
• Theoretically fast but slow in practice: e.g., Fibonacci heaps
• Theoretically inefficient but fast in practice: e.g., Quicksort
• Graph500: fastest single-node performance on the 7th, 8th, 9th, 10th, and 11th lists
• Green Graph500: most power-efficient performance on the 1st through 6th lists
– Our results occupied ranks 1 through 8
Implementations that apply machine-aware acceleration
Ours: UV300 (32 sockets) … 219 GTEPS; BG/Q (512 nodes) … 172 GTEPS
⇒ Equal or better performance with 1/16 the number of sockets
References

• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a Single NUMA System, IEEE ISC'14, 2014.
• [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based Thread Parallel Breadth-first Search, HPCS 2015, ACM, IEEE, IFIP, 2015.
• [HPGP16] Y. Yasui, K. Fujisawa, Eng Lim Goh, J. Baron, A. Sugiura, and T. Uchiyama: NUMA-aware Scalable Graph Traversal on SGI UV Systems, HPGP'16, ACM, 2016.
• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of the Optimization in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015.

References (shared-memory machines)
References (Graph500 project)
When is memory actually allocated?
• Linux manages memory in page units (4 KB)
– Memory is assigned on first reference (first touch)

  long n = 1000000000;
  double *A = NULL;
  A = malloc(sizeof(double) * n);   /* reserves 10^9 x 8 bytes (7.45 GB) */

  /* initializing every element to 0.0 => pages are actually allocated */
  for (long i = 0; i < n; ++i) {
    A[i] = 0.0;
  }

  /* touching only the first element of each page is enough to allocate it */
  const long stride = getpagesize() / sizeof(double);
  for (long i = 0; i < n; i += stride) {
    A[i] = 0.0;
  }

• Some exceptions
– Small sizes: glibc just hands out memory it has already reserved
– Large sizes: large pages (transparent huge pages; 2 MB) are used
• posix_memalign and mmap behave essentially the same way
※ THP improves performance without any effort from the user
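Combined with thread pinning, first touch gives a simple NUMA placement recipe; a minimal OpenMP sketch (our example: the later compute loops must use the same static schedule for the pages to remain local):

#include <stdlib.h>

int main(void) {
  const long n = 1000000000L;
  double *A = malloc(sizeof(double) * n);

  /* Parallel first touch: each page is placed on the NUMA node of the
   * (pinned) thread that touches it first, so later loops with the same
   * static schedule see mostly node-local pages. */
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < n; ++i)
    A[i] = 0.0;

  /* ... compute loops using the identical schedule(static) partitioning ... */
  free(A);
  return 0;
}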
Virtual memory and huge pages (large pages)
• A data access requires a physical address
• Accessing a wide memory range in a short time causes TLB misses
• Trade-off between page size and convenience
– Larger pages reduce TLB misses
– Larger pages waste more space (allocating 2.00001 MB ⇒ 4 MB with 2 MB pages)

A[i] = 0.0;
(1) &A[i] is a virtual address; the physical address is needed
(2) Check whether the page is registered in the TLB
(3) On a TLB miss, consult RAM (the page table) to obtain the physical address

• The page size is changed automatically (4 KB ⇒ 2 MB)
– Recent Linux enables transparent huge pages (THP) by default
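When THP is not applied automatically, a huge-page hint can be requested per allocation; a minimal Linux sketch (madvise(MADV_HUGEPAGE) is only a hint and may be ignored by the kernel):

#include <stdlib.h>
#include <sys/mman.h>

/* Allocate n doubles aligned to the 2 MB huge-page size and hint the kernel
 * to back the region with transparent huge pages. */
double *alloc_thp(size_t n) {
  void *p = NULL;
  if (posix_memalign(&p, 2UL * 1024 * 1024, n * sizeof(double)) != 0)
    return NULL;
  madvise(p, n * sizeof(double), MADV_HUGEPAGE);  /* hint only */
  return (double *)p;
}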