Parametric Flows: Automated Behavior Equivalencing for Symbolic Analysis of Races in CUDA Programs
Peng Li, Guodong Li, and Ganesh Gopalakrishnan
{peterlee, ligd, ganesh}@cs.utah.edu
School of Computing, University of Utah,
Salt Lake City, UT 84112, USA
GPU-based Computing
• Titan (AMD + NVIDIA Kepler) is ranked 1st in the latest Top 500!
• Various GPU programming models exist:
  CUDA (courtesy of NVIDIA), OpenCL (courtesy of AMD), C/C++ (courtesy of Intel), C++ AMP (courtesy of Microsoft)
CUDA programs harbor insidious bugs!
• Data Races: caused by unsynchronized accesses
  e.g., thread tid = 1 executes "… = a[tid]" (reads a[1]) while thread tid = 2 executes "a[tid-1] = …" (writes a[1])
  - Can produce unpredictable results
  - Compilers can misbehave if given code with races
• Deadlocks and other problems
CUDA Thread + Memory Organization
Thread → Warp → Block → Grid
Illustration of Race

__global__ void inc_gpu(int *A, int b, int N) {
  unsigned tid = threadIdx.x;
  A[tid] = A[(tid+1) % 64] + b;
}

With 64 threads (tid 0 .. 63), thread t0 writes A[0] while thread t63 reads A[(63+1) % 64] = A[0]: a RACE!
Illustration of Deadlock

if (tid % 2 == 0) {
  ...
  __syncthreads();   // only the "true" threads (t0, t2) reach this barrier
} else {
  ...                // the "false" threads (t1, t3) never reach it
}

When a __syncthreads() sits inside a thread-ID dependent branch, the threads on the true side wait forever for those on the false side: DEADLOCK!
Debugging CUDA Programs is hard!
Why Hard?

With threads t0 .. t4 interleaving, the number of possible executions E1, E2, …, En explodes exponentially, and a bug may surface only in the rare interleaving where one thread's Read(Addr=10) meets another thread's Write(Addr=10).
Why Hard?
• Traditional Methods: find bugs only w.r.t. the current platform + inputs + schedules
• Formal Methods: bugs analyzed w.r.t.
  - future / different platforms (PORTING ISSUE!)
  - all relevant inputs
  - all relevant schedules
Solution to relevant inputs: symbolic execution

Treat the input x as symbolic and fork at each branch:

  if (x < 3) ...        // Path 1: x < 3
  else if (x < 10) ...  // Path 2: x >= 3 & x < 10
  else ...              // Path 3: x >= 3 & x >= 10

A constraint solver turns each path condition into a concrete test case:
  Path 1: x < 3        → e.g., Test Case 1: x = 2
  Path 2: 3 <= x < 10  → e.g., Test Case 2: x = 3
  Path 3: x >= 10      → e.g., Test Case 3: x = 11
Solution to relevant schedules: representative interleaving

__device__ int d[64];
__global__ void foo(int *d) {
1:  __shared__ int a[64];
2:  int tid = threadIdx.x;
3:  a[tid] = d[tid];
4:  __syncthreads();
5:  a[tid]++;
6:  if (tid % 2 == 0) {
7:    a[tid] = a[tid]+2;
8:  } else {
9:    a[tid] = a[tid%32];
10: }
11: __syncthreads();
}
The two __syncthreads() calls (lines 4 and 11) delimit a Barrier Interval; race checking is performed over the accesses within each interval.
Within the barrier interval, the 32 threads of a warp (t0 t1 t2 … t29 t30 t31) execute in SIMD lockstep.
At the divergent branch (line 6), the warp serializes the two sides: first the then-side threads t0 t2 … t30 execute lines 7-8, then the else-side threads t1 t3 … t31 execute lines 9-10.
Solution to relevant schedules: representative interleaving

With 64 threads the kernel runs as two warps (t0 … t31 and t32 … t63). Scheduling each warp's divergent sides in turn (t0 t2 … t30, then t1 t3 … t31, then t32 t34 … t62, then t33 t35 … t63) yields the SIMD-Aware Canonical Schedule.
Race checking on the canonical schedule compares the read/write sets of every pair of accesses in a barrier interval: around 16K pairs for this 64-thread example.

Result in PPoPP'12: checking the canonical schedule is guaranteed to find races!!
Evolution of Formal Analysis Tools for CUDA in our group
• Previous tool: GKLEE [PPoPP'12]
  - complete
  - does not scale, because every thread (e.g., 20K or more) is explicitly modeled
• This paper [SC'12]: GKLEEp
  - complete (in practice)
  - scales to 20K threads or more
GKLEEp's Flow

C++ CUDA programs with symbolic variable declarations → LLVM-GCC → LLVM byte-code instructions → Symbolic Analyzer and Scheduler + Error Monitors

Outputs:
• Data races • Deadlocks • Bank conflicts • Warp divergences • Non-coalesced accesses
• Concrete test inputs / test cases that provide high coverage and can be run on HW
Key Contributions
• Parametric flows are the control-flow equivalence classes of threads that diverge in the same manner
• GKLEEp found bugs missed by GKLEE (GKLEEp scales!)
  - GKLEE: up to 2K threads
  - GKLEEp: well beyond 20K threads
  - GKLEEp finds all races (except in contrived programs)
Key Idea: Branching on TDC (Thread-ID Dependent Conditional)

__global__ void foo(int *d) {
1:  __shared__ int a[64];
2:  int tid = threadIdx.x;
3:  a[tid] = d[tid];
4:  __syncthreads();     // Barrier
5:  a[tid]++;
6:  if (tid % 2 == 0) {  // TDC
7:    a[tid] = a[tid]+2;
8:  } else {
9:    a[tid] = a[tid%32];
10: }
11: __syncthreads();     // Barrier
}
A Motivating Example

__shared__ unsigned b[2048];
__global__ void test(unsigned *a) {
1:  unsigned tid = threadIdx.x;
2:  int x, y;
3:  if (tid < 1024) {
4:    b[tid] = a[tid] + 1;
5:    if (tid % 2 != 0) {
6:      b[tid] = 2;
7:    } else {
8:      if (tid > 0) {
9:        b[tid] = b[tid-1]+1;
10:       if (x < y) …
11:     }
12:   }
13: } else {
14:   b[tid] = b[tid-1];
15: }
}
A Motivating Example

The conditionals at lines 3, 5, and 8 depend on the thread ID:

3:  if (tid < 1024) {   <<== TDC
5:  if (tid % 2 != 0) { <<== TDC
8:  if (tid > 0) {      <<== TDC
10: if (x < y) …        <<== Not TDC (does not depend on tid)
A Motivating Example: Parametric Flow Tree

Branching on each TDC splits the 2048 threads into equivalence classes:

tid < 1024
  false (tid >= 1024): b[tid] = b[tid-1];              Flow 4
  true:  b[tid] = a[tid] + 1;
    tid % 2 != 0
      true  (odd tids): b[tid] = 2;                    Flow 3
      false (tid % 2 == 0):
        tid > 0
          true  (even tids > 0): b[tid] = b[tid-1]+1;  Flow 2
          false (tid == 0):                            Flow 1

4 Parametric Flows
Correctness of GKLEEp
• No False Alarms: guaranteed, because of exact symbolic constraint solving!!
• No Omissions: "no omissions" holds true in practice
Details in paper!!
SDK Kernel Example: Symbolic race checking

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos = ((threadIdx.x & (~63)) >> 0) |
                        ((threadIdx.x & 15) << 2)   |
                        ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
       pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    unsigned data4 = d_Data[pos];
    addData64(s_Hist, threadPos, (data4 >> 2) & 0x3FU);
    ...
  }
  __syncthreads();
  ...
}

__device__ void addData64(unsigned char *s_Hist, int threadPos, unsigned int data) {
  s_Hist[threadPos + IMUL(data, THREAD_N)]++;
}

Two symbolic threads t1 and t2 each compute threadPos, extract data = (data4 >> 2) & 0x3FU, and increment s_Hist[threadPos + data*THREAD_N].
SDK Kernel Example: Symbolic race checking

RW set:
t1 writes s_Hist[(((t1 & ~63) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 2) & 0x3FU) * THREAD_N], …
t2 writes s_Hist[(((t2 & ~63) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 2) & 0x3FU) * THREAD_N], …

Query to the solver: do there exist t1, t2, d_Data with t1 ≠ t2 such that

(((t1 & ~63) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 2) & 0x3FU) * THREAD_N ==
(((t2 & ~63) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 2) & 0x3FU) * THREAD_N ?

Satisfiable! There is a race!!
SDK Kernel Example: race checking

GKLEEp indicates that the two addresses are equal when
t1 = 23, t2 = 31, d_Data[23] = 0xfcfcfcfc, and d_Data[31] = 0xf4f4f4f4,
indicating a Write-Write race.
Evaluation

(Charts: GKLEE times out at larger thread counts where GKLEEp completes.)
GKLEEp in practice
• Accepts host program with many kernel calls
• Each kernel can be ~1K LOC, e.g., eigenvalues
• Finds races as well as inputs causing them
Evaluation

Kernels            Race | #T = 32          | #T = 64          | #T = 256          | #T = 1,024        | #T = 2,048
                        | GKLEE    GKLEEp  | GKLEE    GKLEEp  | GKLEE    GKLEEp   | GKLEE    GKLEEp   | GKLEE    GKLEEp
Bitonic Sort            | T.O.     7.7/20  | T.O.     29.3/27 | T.O.     177.2/44 | T.O.     198/65   | T.O.     T.O.
Histogram64        WW   | 67.1/6   14.7/5  | T.O.     21.8/6  | T.O.     350.3/7  | T.O.     292.2/2  | T.O.     367.2/2
Scalar Product          | 0.7/1    16.1/1  | 0.6/1    4.3/1   | 0.8/1    0.8/1    | 1.3/1    0.9/1    | 2.6/1    1.2/1
Matrix Mult             | 0.2/1    4.5/1   | 0.4/1    4.0/1   | 2.0/1    3.2/1    | 19/1     2.8/1    | 362.1/1  3.4/1
Reduction0              | 0.02/1   0.07/1  | 0.1/1    0.03/1  | 0.3/1    0.2/1    | 2.9/1    0.3/1    | 10.5/1   0.4/1
Reduction1              | 0.01/1   0.1/1   | 0.1/1    0.1/1   | 0.8/1    0.2/1    | 8.1/1    0.3/1    | 24.0/1   0.5/1
Reduction2              | 0.02/1   0.1/1   | 0.03/1   0.1/1   | 0.2/1    0.1/1    | 2.9/1    0.3/1    | 10.2/1   0.4/1
Reduction3              | 0.01/1   0.1/1   | 0.03/1   0.1/1   | 0.3/1    0.1/1    | 2.7/1    0.3/1    | 10.0/1   0.4/1
Reduction4         RW   | 0.1/1    0.04/1  | 0.3/1    0.03/1  | 2.8/1    0.2/1    | 17.3/1   0.4/1    | 42.4/1   0.6/1
Reduction5         RW   | 0.1/1    0.04/1  | 0.3/1    0.03/1  | 2.8/1    0.2/1    | 11.4/1   0.4/1    | 21.3/1   0.5/1
Reduction6         RW   | 0.1/1    0.05/1  | 0.3/1    0.04/1  | 2.8/1    0.2/1    | 11.5/1   0.4/1    | 22.6/1   0.6/1
Scan Best               | 0.3/1    3.6/1   | 2.1/1    5.1/1   | 48.8/1   8.1/1    | 923.3/1  12.5/1   | T.O.     26.6/1
Scan Naive              | 0.04/1   0.2/1   | 0.2/1    0.4/1   | 3.4/1    0.5/1    | 66.0/1   0.9/1    | 291.8/1  15.2/1
Scan WorkEfficient      | 0.1/1    0.6/1   | 0.4/1    0.8/1   | 12.1/1   1.2/1    | 250.8/1  2.1/1    | T.O.     3.1/1
Scan Large              | 0.3/1    2.1/1   | 0.8/1    2.7/1   | 6.3/1    2.7/1    | 67.7/1   8.1/1    | 230.3/1  21.3/1

TABLE I: SDK 2.0 kernel results. We set 7200 seconds as the threshold for time out (abbreviated as T.O.). In each A/B entry, A is the tool runtime (in seconds) and B is the number of control flow paths.
Related formal-methods-based work: comparison with other formal tools
• [M. Zheng et al., PPoPP'11]: combination of static analysis and dynamic analysis
• [A. Leung et al., PLDI'12]: a single dynamic run can be used to learn much more information about a CUDA program's behavior
• [A. Betts et al., SPLASH'12]: two-thread abstraction; found errors in real SDK kernels
GKLEEp scales further and finds races in real kernels!
Conclusion
• New formal approach for analyzing CUDA kernels
• Employs a "parametric" reasoning style which capitalizes on thread symmetry
• Scales to over 10^5 threads on realistic CUDA programs
• Finds races missed by
  - Traditional testing
  - Previous formal approaches
• Tool will be released soon: check the website http://www.cs.utah.edu/fv/GKLEE
Thanks!
Questions?
Extra Slides
• How to pick symbolic inputs?
  - A taint analyzer is being developed
  - It helps pick the inputs that matter and make them symbolic
• Loop invariants
  - Static analysis to avoid loop unrolling
A Motivating Example

__global__ void test(unsigned *a) {
1:  unsigned bid = blockIdx.x;
2:  unsigned tid = threadIdx.x;
3:
4:  if (bid % 2 != 0) {
5:    if (tid < 1024) {
6:      unsigned idx = bid * blockDim.x + tid;
7:      b[tid] = a[idx] + 1;
8:      if (tid % 2 != 0) {
9:        b[tid] = 2;
10:     } else {
11:       if (tid > 0)
12:         b[tid] = b[tid-1]+1;
13:     }
14:   } else {
15:     b[tid] = b[tid-1];
16:   }
17: } else {
18:   unsigned idx = bid * blockDim.x + tid;
19:   b[tid] = a[idx] + 1;
20: }
}

GKLEEp: T1 = <1,0,0><511,0,0> and T2 = <1,0,0><512,0,0> incur the write-read race; needs 1.9 s.
GKLEE: T1 = <1,0,0><31,0,0> and T2 = <1,0,0><32,0,0> incur the write-read race; needs 50.5 s.
A Motivating Example

7:  b[tid] = a[idx] + 1;
8:  if (tid % 2 != 0) {
9:    b[tid] = 2;
10: } else {
11:   if (tid > 0)
12:     b[tid] = b[tid-1]+1;
13: }
14: }

• Constraint for race checking, the Precondition:
  - Configuration Constraint
  - TDC Constraint from the Parametric Flow Tree
  - Thread Relation Constraint
A Motivating Example

• Constraint for race checking:
  - Configuration Constraint
  - TDC Constraint from the Parametric Flow Tree
  - Thread Relation Constraint
  (together: the Precondition)
  - Race Constraint

GKLEEp: T1 = <1,0,0><511,0,0> and T2 = <1,0,0><512,0,0> incur the inter-warp write-read races.