Upload
hashim
View
43
Download
0
Embed Size (px)
DESCRIPTION
Automatic OpenCL Work-Group Size Selection for Multicore CPUs. Sangmin Seo 1 Jun Lee 2 Gangwon Jo 2 Jaejin Lee 2 1 ManyCoreSoft Co., Ltd. 2 Seoul National University http:// aces.snu.ac.kr September 11, 2013. Goal: Finding a Good Work-group Size. OpenCL kernel. Work-group size. - PowerPoint PPT Presentation
Citation preview
PACT 2013
PACT 2013
Automatic OpenCL Work-Group Size Selection for Multicore CPUs
Sangmin Seo1 Jun Lee2 Gangwon Jo2 Jaejin Lee2
1ManyCoreSoft Co., Ltd. 2Seoul National University
http://aces.snu.ac.kr
September 11, 2013
PACT 2013 2
Goal: Finding a Good Work-group Size
Work-group size
Manual selection Automatic selectionBy programmers
Time-consuming
Not portable
By tools
Fast
Portable
index space
work-group
OpenCL kernel
multicore CPUs
PACT 2013 3
Why OpenCL on CPUs?
• CPU is the basic processing unit in modern computing systems
• CPU is increasing the number of its cores– It can accelerate computations by parallelizing them
• Without hardware accelerators– CPU should execute OpenCL code for portability
• CPUs also have diverse architectures– Performance portability across CPUs is important
PACT 2013 4
Outline
• Motivation• OpenCL execution model • Effect of the work-group size• Work-group size selection• Selection framework• Evaluation• Related work• Conclusions and future work
PACT 2013
OpenCL Execution Model
• OpenCL program = a host program + kernels• Executes a kernel at each point in the N-dimensional index space
5
PE 1
Compute Unit 1
PE M
...
Compute Device
CU
CU
...
...
work-item work-group
2-dimensional index space
PACT 2013
Work-group Size
• A partition of the index space• An important factor for the performance of OpenCL applications• Influences utilization of the compute device and load balance
6
2-D Index Space
work-item work-group
Sy
SxGx
Gy
How to determine the work-group size (Sx, Sy)?
PACT 2013 7
Effect of the Work-group Size• With AMD OpenCL
– Kernel y_solve of SP in the OpenCL NAS Parallel Benchmarks
(1,1)
(1,2)
(2,50
)(4,
1)(4,
25)(5,
20)
(10,10
)(20
,5)
(20,50
)(25
,4)(50
,1)
(100,1
)0.7
0.8
0.9
1.0
1.1
1.2
Work-Group Size
Nor
mal
ized
Exe
cutio
n Ti
me
(1,1)
(1,2)
(2,50
)(4,
1)(4,
25)(5,
20)
(10,10
)(20
,5)
(20,50
)(25
,4)(50
,1)
(100,1
)0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6
Work-Group Size
Nor
mal
ized
Exe
cutio
n Ti
me
Intel Xeon X5680 AMD Opteron 6172
The best work-group sizes are different
PACT 2013 8
Effect of the Work-group Size• With SnuCL (open-source OpenCL framework)
– Kernel y_solve of SP in the OpenCL NAS Parallel Benchmarks
(1,1)
(1,2)
(2,50
)(4,
1)(4,
25)(5,
20)
(10,10
)(20
,5)
(20,50
)(25
,4)(50
,1)
(100,1
)0.0
0.5
1.0
1.5
2.0
2.5
3.0
Work-Group Size
Nor
mal
ized
Exe
cutio
n Ti
me
(1,1)
(1,2)
(2,50
)(4,
1)(4,
25)(5,
20)
(10,10
)(20
,5)
(20,50
)(25
,4)(50
,1)
(100,1
)0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
Work-Group Size
Nor
mal
ized
Exe
cutio
n Ti
me
Intel Xeon X5680 AMD Opteron 6172
The best work-group sizes are different according to devices and OpenCL frameworks
PACT 2013
Work-group Size Selection
• Determines a work-group size– Shows the best performance among all possible work-group sizes
• Given the index space and the target architecture
– An auto-tuning technique• To find the best parameter, i.e., work-group size
• Considers cache utilization and load balance between cores– Using polyhedron models
• Profile-based approach– Finds the best work-group size before the kernel execution– Exploits runtime information
9
PACT 2013
Work-group Size Selection (contd.)
10
valid valid valid invalid
index space work-item work-group
PACT 2013 11
Work-group Size Selection (contd.)
index space: (11, 11)
Valid work-group sizes?
(1, 1), (1, 11)
(11, 1), (11, 11)
11 is a prime number
PACT 2013
Virtually-extended Index Space (VIS)
12
Gy
virtual work-item
Ey
Vy
Gx Ex
Vx
work-item
PACT 2013 13
Virtually-extended Index Space (VIS)2
2
Work-group size selection with VIS
VIS enables us to select an arbitrary work-group size!
Invalid work-group size
index space: (11, 11) VIS: (12, 12)
PACT 2013 14
How to Determine a Work-group Size?
• Target OpenCL framework: SnuCL– An open-source software– Can understand its mechanism
• What factor is important?– Cache misses of a work-group
• A work-group is a scheduling unit in SnuCL• CPUs are our target architecture
• Finds the largest work-group size– Minimizes cache misses
PACT 2013
Code Generation in SnuCL
15
OpenCL kernel Compiler-generated C code__kernel void mat_mul_2d(__global float *A, __global float *B, __global float *C, int WX, int WY){ int i = get_global_id(1); int j = get_global_id(0); C[i*WX+j] = A[i*WX+j] + B[i*WX+j];}
void mat_mul_2d(__global float *A, __global float *B, __global float *C, int WX, int WY){ for(__j=0; __j<__local_size[1]; __j++) { for(__i=0; __i<__local_size[0]; __i++) { int i = get_global_id(1); int j = get_global_id(0); C[i*WX+j] = A[i*WX+j] + B[i*WX+j]; } }}
Executes all work-items in a work-group as a doubly-nested loop
PACT 2013
Working-set Estimation
16
Compiler-generated C codevoid mat_mul_2d(__global float *A, __global float *B, __global float *C, int WX, int WY){ for(__j=0; __j<__local_size[1]; __j++) { for(__i=0; __i<__local_size[0]; __i++) { int i = get_global_id(1); int j = get_global_id(0); C[i*WX+j] = A[i*WX+j] + B[i*WX+j]; } }}
Working-set of a work-group = a set of distinct cache lines accessed by a work-group during its execution
Use the polyhedral model
Cache lines
Working-set modeling
PACT 2013 17
Locality Enhancement
• Minimizes cache misses– For the L1 private data cache
• Capacity misses– Working-set of a work-group ≤ cache size
• Conflict misses– # of different tags of cache lines for every set ≤ associativity
PACT 2013
Work-group Size Selection Algorithm
18
for each wgs in all work-group sizes do find WGScp that does not incur capacity missesend for
for wgs from WGScp downto (1,1,1) do find WGScf that does not incur conflict missesend for
find WGSopt that make all CPU cores have almost the same number of work-groups
cache utilization
loadbalancing
PACT 2013
Selection Framework
19
kernel code Code generator
Search library
Selection framework
C++ compiler
Work-group size finder
Selection Framework
Work-group size finder
run-timeparameters work-group size
Selection Process
PACT 2013
Evaluation - Target Machines
20
Machine M1 M2 M3 M4
CPU XeonX5680
XeonE5310
Opteron6172
Opteron4184
Vendor Intel AMD
Clock freq. 3.3 GHz 1.6 GHz 2.1 GHz 2.8 GHz
# of CPUs 2 2 2 2
# of cores 12 8 24 12
# of threads 24 8 24 12
L1I cache 12 x 32KB 8 x 32KB 24 x 64KB 12 x 64KB
L1D cache 12 x 32KB 8 x 32KB 24 x 64KB 12 x 64KB
L2 cache 12 x 256KB 2 x 4MB 24 x 512KB 12 x 512KB
L3 cache 2 x 12MB 4 x 6MB 2 x 6MB
Memory 72GB 12GB 128GB 64GB
OS CentOS 6.3
PACT 2013 21
Evaluation Methodology
• Selection framework– Code generator
• Using a C front-end clang of LLVM
– Search library• Using a polyhedron library, called barvinok
• OpenCL framework– SnuCL
• Modifying to support the virtually-extended index space (VIS)
• Benchmark applications– OpenCL version of NAS Parallel Benchmarks– 31 kernels from BT, CG, EP, and SP
• Counterpart– Exhaustive search
• Looking for the best work-group size among all possible work-group sizes by executing one by one and comparing their kernel execution time
PACT 2013
Selection Accuracy
22
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 G0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8 ES-M AS AS-VIS
Nor
mal
ized
Exe
cutio
n Ti
me
Results on M1 (Intel Xeon X5680)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 G0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0
ES-M AS AS-VIS
Nor
mal
ized
Exe
cutio
n Ti
me
2.47
Results on M3 (AMD Opteron 6172)
PACT 2013 23
Selection Accuracy (contd.)
• Our approaches are quite effective and promising• The VIS enhances the effectiveness
– By increasing the number of possible work-group sizes
M1 M2 M3 M4
ES-M +18% +6% +30% +21%
AS +8% +3% +13% +6%
AS-VIS -2% -7% +5% -3%
Average performance (slowdown in the execution time)
PACT 2013
Cache, TLB Misses vs. Exec. Time
24
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Exec. TimeL1D MissL2D MissDTLB Miss
Work-Group Size (sorted by Exec. Time)
SP.compute_rhs2.B (19) on M3(high correlation between L1D misses and the kernel execution time)
PACT 2013
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 G0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8
2.0 ES-M AS AS-VIS
Norm
aliz
ed E
xecu
tion
Tim
eSP.compute_rhs2.B (19) on M3
25
Results on M3 (AMD Opteron 6172)
PACT 2013
Cache, TLB Misses vs. Exec. Time
26
CG.main_3.C (8) on M3(low correlation between L1D misses and the kernel execution time)
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Exec. TimeL1D MissL2D MissDTLB Miss
Work-Group Size (sorted by Exec. Time)
PACT 2013
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 G0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8
2.0 ES-M AS AS-VIS
Norm
aliz
ed E
xecu
tion
Tim
eCG.main_3.C (8) on M3
27
Results on M3 (AMD Opteron 6172)
PACT 2013 28
Selection Time
ES(average)
AS(average)
Speedup(geo. mean)
M1 5688.2 sec. 0.503 sec. 3720M2 23002.3 sec. 5.214 sec. 4289M3 6249.7 sec. 6.291 sec. 466M4 8001.3 sec. 4.982 sec. 809
Avg. 1566x faster than the exhaustive search on 4 machines
PACT 2013
Related Work• Execution of fine-grained SPMD-threaded code for CPUs
– [CGO’10, PACT’10, PACT’11]– e.g., OpenCL or CUDA– Mainly focus on how to correctly translate the code into CPU code– Do not provide work-group size selection methods
• Auto-tuning– [Software Automatic Tuning’10]– Finds a thread block size for CUDA kernels– Uses profiling but executes all possible block sizes
• Tile size selection– [PLDI’95, ICS’99, JS’04, CGO’10, CC’11]– Target is different– Complementary to the work-group size selection
• Working-set estimation or memory footprint analysis– [SIGMETRICS’05, PPoPP’11, PACT’11]– Analyze real address traces
29
PACT 2013
Conclusions and Future Work
• Proposes an automatic work-group size selection technique– For OpenCL kernels on multicore CPUs– Selection algorithm
• Integrates working-set estimation and cache misses analysis techniques• Implemented as a selection framework, a profiling-based tool
– Virtually-extended index space• Enhances the accuracy of our selection algorithm
– Evaluation results• Practical and promising• Applicable to a wide range of multicore CPU architectures
• Future work– To develop static techniques
• Find a work-group size without profiling
– To extend our approach to other compute devices• E.g., GPUs and Intel Xeon Phi coprocessors
30
PACT 2013 31
Thank you
• SnuCL is an open-source OpenCL framework
• If you are interested in SnuCL, please visit
http://snucl.snu.ac.kr