
    Large Graph Algorithms for Massively Multithreaded Architectures

Pawan Harish, Vibhav Vineet and P. J. Narayanan
Center for Visual Information Technology
International Institute of Information Technology, Hyderabad, India
[email protected], [email protected], [email protected]

    Technical Report Number IIIT/TR/2009/74

Modern Graphics Processing Units (GPUs) provide high computation power at low costs and have been described as desktop supercomputers. Today, the GPUs expose a general, data-parallel programming model in the form of CUDA and CAL, which present the GPU as a massively multithreaded architecture. Several high-performance, general data processing algorithms, such as sorting and matrix multiplication, have been developed for the GPUs. In this paper, we present a set of general graph algorithms on the GPU using the CUDA programming model. We present implementations of breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow algorithms on commodity GPUs. Our implementations exhibit high performance, especially on large graphs. We experiment on random, scale-free, and real-life graphs of up to millions of vertices. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The approach there has been that of divide-and-conquer, where individual processing nodes solve smaller sub-problems followed by a combining step. The massively multithreaded model of the GPU makes it possible to adopt the data-parallel approach even for irregular algorithms like graph algorithms, using O(V) or O(E) simultaneous threads. The algorithms and the underlying techniques presented in this paper are likely to be applicable to many other irregular algorithms on such architectures.

    1. Introduction

Graphs are popular data representations in many computing, engineering, and scientific areas. Fundamental graph operations such as breadth-first search, st-connectivity, shortest paths, etc., are building blocks of many applications. Serial implementations of fundamental graph algorithms exist [36,14] with computing time of the order of the number of vertices and edges. Such implementations become impractical on very large graphs involving millions of vertices and edges, common in many domains like VLSI layout, phylogeny reconstruction, network analysis, etc. Parallel processing is essential to apply graph algorithms to large datasets. Parallel implementations of some graph algorithms on supercomputers have been reported, but are accessible only to a few owing to the high hardware costs [6,8,52]. CPU clusters have been used for distributed implementations; synchronization, however, becomes a bottleneck for them. Not all graph algorithms scale to parallel hardware models. For example, there does not exist an efficient PRAM solution to the DFS problem. A suitable mix of parallel and serial hardware is required for efficient implementation in such cases.

Modern Graphics Processing Units (GPUs) provide high computation power at low costs and have been described as desktop supercomputers. The GPUs have been used for many general purpose computations due to their low cost, high computing power, and high availability. The latest GPUs, for instance, can deliver close to 1 TFLOP of compute power at a cost of around $400. In the earlier GPGPU model, the stages of the graphics pipeline were exploited for parallelism, with the flow of execution handled serially using the pipeline. The GPUs today expose a general, data-parallel programming model in the form of CUDA and CAL. The recently adopted OpenCL standard [38] will provide a common computing model not only to all GPUs, but also to other platforms like multicore, manycore, and Cell/B.E. The Compute Unified Device Architecture (CUDA) from Nvidia presents a heterogeneous programming model in which the parallel hardware can be used in conjunction with the CPU.


This provides good control over the sequential flow of execution, which was absent from the earlier GPGPU model. CUDA can be used to imitate a parallel random access machine (PRAM) if global memory alone is used. In conjunction with a CPU, it can be used as bulk synchronous parallel (BSP) hardware, with the CPU deciding the barrier for synchronization.

CUDA presents the GPU as a massively threaded parallel architecture, allowing up to millions of threads to run in parallel over its processors, with each having access to a common global memory. Such a tight architecture is a departure from the supercomputers, which typically have a small number of powerful cores. The parallelizing approach on them was that of divide-and-conquer, where individual processing nodes solve smaller sub-problems followed by a combining step. The massively multithreaded model presented by the GPU makes it possible to adopt the data-parallel approach even for irregular algorithms like graph algorithms, using O(V) or O(E) simultaneous threads. The multicore and manycore processors of the future are likely to support a massively multithreaded model. Each core will be simple, with support for multiple threads in flight, as the number of cores grows beyond a few dozen into the hundreds, as has been proposed.

Several high-performance, general data processing algorithms such as sorting and matrix multiplication have been developed for the GPUs. In this paper, we present a set of general graph algorithms on the GPU using the CUDA programming model. Specifically, we present implementations of breadth-first search (BFS), st-connectivity (STCON), single-source shortest path (SSSP), all-pairs shortest path (APSP), minimum spanning tree (MST), and maximum flow (MF). Our implementations exhibit high performance, especially on large graphs. We experiment on random, scale-free, and real-life graphs of up to millions of vertices. Our algorithms do not just deliver high performance; they also provide strategies for exploiting massively multithreaded architectures on future manycore processors, for graph theory as well as for other problems.

Using a single graphics card, we perform BFS in about half a second on a 10M-vertex graph with 120M edges, and SSSP on it in 1.5 seconds. On the DIMACS full-USA graph of 24M vertices and 58M edges, our implementation takes less than 9 seconds to compute the minimum spanning tree. We study different approaches to APSP and show a speedup by a factor of 2-4 times over Katz and Kider [33]. A speedup of nearly 10-15 times over the CPU Boost Graph Library is achieved for all algorithms reported in this paper on general large graphs.

    2. Compute Unified Device Architecture

Programmability was introduced to the GPU with shader model 1.0 and has since been enhanced up to the current shader model 4.0 standard. A GPGPU solution poses an algorithm as a series of rendering passes following the graphics pipeline. Programmable shaders are used to interpret data in each pass and write the output to the frame buffer.

The amount of memory available on the GPU and its representation are restricting factors for GPGPU algorithms; a GPU cannot handle data greater than the largest texture it supports. The GPU memory layout is also optimized for graphics rendering, which restricts GPGPU solutions. Limited access to memory and processor anatomy makes it tricky to port general algorithms to this framework [40].

CUDA enhances the GPGPU programming model by not treating the GPU as a graphics pipeline but as a multicore co-processor. Further, it improves on the GPGPU model by removing memory restrictions for each processor. Data representation is also improved by providing programmer-friendly data structures. All memory available on the CUDA device can be accessed by all processors with no restriction on its representation, though the access times may vary for different types of memory.

At the hardware level, CUDA is a collection of multiprocessors, each consisting of a series of processors. Each multiprocessor contains a small shared memory, a set of 32-bit registers, and texture and constant memory caches common to all processors inside it.


Figure 1. The CUDA hardware model (top) and programming model (bottom), showing the block to multiprocessor mapping.

Each processor in the multiprocessor executes the same instruction on different data, which makes it a SIMD model. Communication between multiprocessors is through the device global memory, which is accessible to all processors of all multiprocessors.

As a software interface, the CUDA API is a set of library functions which can be coded as an extension of the C language. A compiler generates executable code for the CUDA device. To the programmer, CUDA is a collection of threads running in parallel. Each thread can use a number of private registers for its computation. A collection of threads (called a block) runs on a multiprocessor at a given time. The threads of each block have access to a small amount of common shared memory. Synchronization barriers are also available for all threads of a block. A group of blocks can be assigned to a single multiprocessor, but their execution is time-shared. The available shared memory and registers are split equally amongst all blocks that time-share a multiprocessor. A single execution on a device generates a number of blocks. Multiple groups of blocks are also time-shared on the multiprocessors for execution. The collection of all blocks in a single execution is called a grid (Figure 1).

Each thread executes a single instruction set called the kernel. Each thread and block is given a unique ID that can be accessed within the thread during its execution. These IDs can be used by each thread to perform the kernel task on its own part of the data, an SIMD execution. An algorithm may use multiple kernels, which share data through the global memory and synchronize their execution at the end of each kernel. Threads from multiple blocks can only synchronize at the end of the kernel execution by all threads.
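As an illustration of this ID-based execution, a minimal CUDA example follows; the kernel, the array and the launch configuration are our own and only demonstrate how thread and block IDs partition the data.

// Each thread computes its global ID from its block and thread indices and
// works on exactly one element, the SIMD-style execution described above.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleKernel(float* data, float factor, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID in the grid
    if (tid < n)
        data[tid] *= factor;                          // this thread's share of the data
}

int main() {
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // a grid of blocks, 256 threads each
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);                          // prints 2.0
    cudaFree(d);
    return 0;
}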

3. Representation and Algorithm Outline

Efficient data structures for graph representation have been studied in depth. Complex data structures like hash tables [32] have been used for efficiency on the CPU. The GPU memory layout is optimized for graphics rendering and cannot support user-defined data structures efficiently. Creating an efficient data structure under the GPGPU memory model is a challenging problem [35,28]. The CUDA model, however, treats memory as general arrays and can support more efficient data structures.

The adjacency matrix is a good choice for representing graphs on the GPU. However, it is not suitable for large graphs because of its O(V^2) space requirement, which restricts the size of graphs that can be handled by the GPU. The adjacency list is a more practical representation for large graphs, requiring O(V + E) space. We represent graphs using a compact adjacency list representation, with each vertex pointing to its starting index in a packed adjacency list of edges (Figure 2). Uneven array sizes can be supported on CUDA.


Figure 2. Graph representation is in terms of a vertex list that points to a packed edge list.

The vertex list Va points to the starting index of each vertex's edges in the edge list Ea. Each entry in the edge list Ea points to a vertex in the vertex list Va. Since we deal with undirected graphs, each edge results in one entry for each of its end vertices. Cache efficiency is hard to achieve using this representation, as the edge list can point to any vertex in Va and can cause random jumps in memory. The problem of laying out data in memory for efficient cache usage is a variant of the BFS problem itself.

We use this representation for all algorithms reported in this paper except one all-pairs shortest paths method. A block-divided adjacency matrix representation is used there to exploit better cache efficiency (explained in Section 7.2.1). The output of APSP requires O(V^2) space, and thus the adjacency matrix is a more suitable representation for it.
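To make the layout concrete, the following host-side sketch builds the compact Va/Ea arrays of Figure 2 from an undirected edge list; the helper function, the extra sentinel entry at the end of Va, and the small example graph are our own additions.

// Host-side construction of the compact adjacency list of Figure 2.
// Va[v] holds the index in Ea where the edges of vertex v begin;
// each undirected edge (u, v) appears twice in Ea (once per endpoint).
#include <cstdio>
#include <utility>
#include <vector>

struct CompactGraph {
    std::vector<int> Va;  // size V + 1, starting-edge pointers (last entry is a sentinel)
    std::vector<int> Ea;  // size 2E, packed neighbor indices
};

CompactGraph buildCompactGraph(int V, const std::vector<std::pair<int,int>>& edges) {
    CompactGraph g;
    std::vector<int> degree(V, 0);
    for (const auto& e : edges) { degree[e.first]++; degree[e.second]++; }

    g.Va.assign(V + 1, 0);                                 // prefix sum of degrees
    for (int v = 0; v < V; ++v) g.Va[v + 1] = g.Va[v] + degree[v];

    g.Ea.assign(g.Va[V], 0);
    std::vector<int> fill(g.Va.begin(), g.Va.end() - 1);   // next free slot per vertex
    for (const auto& e : edges) {
        g.Ea[fill[e.first]++]  = e.second;
        g.Ea[fill[e.second]++] = e.first;
    }
    return g;
}

int main() {
    // A small 4-vertex example; edge weights would go into a parallel Wa array.
    std::vector<std::pair<int,int>> edges = {{0,1},{0,2},{1,2},{2,3}};
    CompactGraph g = buildCompactGraph(4, edges);
    for (int v = 0; v < 4; ++v)
        printf("vertex %d: edges [%d, %d)\n", v, g.Va[v], g.Va[v + 1]);
    return 0;
}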

3.1. Algorithm Outline on CUDA

The CUDA hardware can be seen as a multicore/manycore co-processor operating in a bulk synchronous parallel mode when used in conjunction with the CPU. Synchronization of CUDA threads can be achieved with the CPU deciding the barrier for synchronization. Broadly, a bulk synchronous parallel machine follows three steps: (a) Concurrent computation: asynchronous computation takes place on each processing element (PE). (b) Communication: PEs exchange data with each other. (c) Barrier synchronization: each process waits for the other processes to finish. Concurrent computation takes place on the CUDA device in the form of program kernels, with communication through the global memory. Synchronization is achieved only at the end of each kernel. Algorithm 1 outlines the CPU code in this scenario. The skeleton code runs on the CPU while the kernels run on a CUDA device.

Algorithm 1 CPU SKELETON
1: Create and initialize working arrays on the CUDA device.
2: while NOT Terminate do
3:   Terminate ← true
4:   For each vertex/edge/color in parallel:
5:     Invoke Kernel1
6:   Synchronize
7:   For each vertex/edge/color in parallel:
8:     Invoke Kernel2
9:   Synchronize
10:  etc...
11:  For each vertex/edge/color in parallel:
12:    Invoke Kerneln and modify Terminate
13:  Synchronize
14:  Copy Terminate from GPU to CPU
15: end while

The termination of an operation depends on a consensus between threads. A logical OR operation needs to be performed over all active threads to decide termination. We use a single boolean variable (set to true at the start of each iteration) that is written to by all threads independently, typically by the last kernel during execution. Each non-terminating thread writes false to this location in global memory. If no thread modifies this value, the loop terminates. The variable needs to be copied from GPU to CPU after each iteration to check for termination (Algorithm 1, line 2).

The algorithms presented differ from each other in their kernel code and data structure requirements, but the CPU skeleton pseudo-code given above applies to all algorithms reported in this paper.
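A minimal CUDA sketch of this skeleton follows; the placeholder kernel, the integer work array and the launch parameters are assumptions of ours, but the per-iteration reset and copy-back of the Terminate flag mirror lines 3 and 14 of Algorithm 1.

// Host loop implementing the skeleton of Algorithm 1 with a device-side
// termination flag. kernel1 is a hypothetical placeholder for Kernel1..Kerneln.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernel1(bool* d_terminate, int* d_data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (d_data[tid] > 0) {          // some per-vertex work remains
        d_data[tid]--;
        *d_terminate = false;       // vote to continue; at least one write survives
    }
}

int main() {
    const int n = 1 << 20;
    int* d_data;  bool* d_terminate;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));
    cudaMalloc(&d_terminate, sizeof(bool));

    bool h_terminate = false;
    while (!h_terminate) {
        h_terminate = true;                                          // Terminate <- true
        cudaMemcpy(d_terminate, &h_terminate, sizeof(bool), cudaMemcpyHostToDevice);
        kernel1<<<(n + 255) / 256, 256>>>(d_terminate, d_data, n);   // one pass per vertex
        cudaDeviceSynchronize();                                     // barrier between kernels
        cudaMemcpy(&h_terminate, d_terminate, sizeof(bool), cudaMemcpyDeviceToHost);
    }
    printf("converged\n");
    cudaFree(d_data); cudaFree(d_terminate);
    return 0;
}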

3.2. Vertex List Compaction

We assign threads to an attribute of the graph (vertex, color, etc.) in most implementations. This leads to |V| threads executing in parallel. The number of active vertices, however, varies in each iteration of execution.


Active vertices are typically indicated in an activity mask, which holds a 1 for each active vertex. During execution, each vertex thread checks its status in the activity mask and continues execution only if active. This can lead to poor load balancing on the GPU, as CUDA blocks have to be scheduled even when all vertices of a block are inactive, leading to an unbalanced SIMD execution. Performance improves if we deploy only as many threads as there are active vertices, reducing the number of blocks and thus the time sharing on the CUDA device [46].

A scan operation on the activity mask can determine the number of active vertices as well as give each an ordinal number. This establishes a mapping between the original vertex index and its new index amongst the currently active vertices. Using the scan output, we compact all entries in the activity mask into a new active mask (Figure 3), creating a mapping from new thread IDs to old vertex IDs. When the active mask is used, each thread finds its vertex by looking it up in the active mask and thereafter executes normally.

Figure 3. Vertex compaction is used to reduce the number of threads needed when not all vertices are active.

There is a trade-off between the time taken by parallel thread execution and the time taken for the scan and compaction. For graphs where parallelism expands slowly, compaction makes the most sense, as many threads would otherwise be inactive in a single grid execution. For faster expanding graphs, compaction can become an overhead. We report experiments where vertex compaction gives better performance than the non-compacted version.
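A sketch of the compaction step using a prefix sum is shown below; the use of Thrust for the scan (rather than the authors' own scan implementation) and the small example mask are our assumptions.

// Compact the activity mask into a list of active vertex IDs using an
// exclusive prefix sum (scan), as in Figure 3. Thrust is used here for brevity.
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

__global__ void writeActiveIds(const int* activity, const int* scan,
                               int* activeMask, int V) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < V && activity[v])
        activeMask[scan[v]] = v;   // new thread ID -> old vertex ID
}

int main() {
    const int V = 8;
    int h_activity[V] = {0, 1, 0, 1, 1, 0, 0, 1};   // 1 marks an active vertex
    thrust::device_vector<int> activity(h_activity, h_activity + V);
    thrust::device_vector<int> scanOut(V);
    thrust::exclusive_scan(activity.begin(), activity.end(), scanOut.begin());

    int numActive = (int)scanOut[V - 1] + (int)activity[V - 1];   // total active vertices
    thrust::device_vector<int> activeMask(numActive);
    writeActiveIds<<<(V + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(activity.data()),
        thrust::raw_pointer_cast(scanOut.data()),
        thrust::raw_pointer_cast(activeMask.data()), V);
    cudaDeviceSynchronize();

    // Deploy only numActive threads; each maps back to its old vertex ID.
    for (int i = 0; i < numActive; ++i)
        printf("thread %d -> vertex %d\n", i, (int)activeMask[i]);
    return 0;
}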

    4. Breadth First Search (BFS)

The BFS problem is to find the minimum number of edges needed to reach every vertex in a graph G from a source vertex s. BFS is well studied in the serial setting, with the best time complexity reported as O(V + E). Parallel versions of the BFS algorithm also exist. A study of the BFS algorithm on the Cell/B.E. processor using the bulk synchronous parallel model appeared in [45]. Zhang et al. [53] gave a heuristic search for BFS using level synchronization. Bader et al. [6] implement BFS on the CRAY MTA-2 supercomputer and Yoo et al. [52] on the BlueGene/L.

We treat the GPU as a bulk synchronous device and use level synchronization to implement BFS. BFS traverses the graph in levels; once a level is visited, it is not visited again during execution. We use this as our barrier and synchronize threads at each level. A BFS frontier corresponds to all vertices at the current level (see Figure 4). Concurrent computation takes place at the BFS frontier, where each vertex updates the costs of its neighboring vertices by assigning cost values to their respective indices in the global memory.

We assign one thread to every vertex, eliminating the need for queues in our implementation. This decision further eliminates the need to change the grid configuration and reassign indices in the global memory with every kernel execution, which would incur additional overheads and slow down the execution.

GPU Implementation

We keep two boolean arrays, Fa and Xa, of size |V| for the frontier and visited vertices respectively. Initially, Xa is set to false and Fa contains the source vertex. In the first kernel (Algorithm 2), each thread looks at its entry in the frontier array Fa (Figure 4). If present, it updates the cost of its unvisited neighbors by writing its own cost plus one to each neighbor's index in the global cost array Ca. It is possible for many vertices to write a value to one location concurrently while executing this step, leading to clashes in the global memory. We do not lock memory for concurrent write operations because all frontier vertices write the same value at their neighbor's index location in Ca. CUDA guarantees that at least one of the writes will succeed, which is sufficient for our BFS cost propagation.


Figure 4. Parallel BFS: Vertices in the frontier list execute in parallel in each iteration. Execution stops when the frontier is empty.

    Algorithm 2 KERNEL1 BFS

1: tid ← getThreadID
2: if Fa[tid] then
3:   Fa[tid] ← false
4:   for all neighbors nid of tid do
5:     if NOT Xa[nid] then
6:       Ca[nid] ← Ca[tid] + 1
7:       Fua[nid] ← true
8:     end if
9:   end for
10: end if

Each thread removes its vertex from the frontier array Fa and adds its neighbors to an alternate updating frontier array Fua. This is needed because no synchronization is possible between all CUDA threads; modifying the frontier at the time of updating may result in read-after-write inconsistencies. A second kernel (Algorithm 3) copies the updated frontier Fua to the actual frontier Fa. It adds the vertices in Fua to the visited vertex array Xa and sets the termination flag to false. The process is repeated until the frontier array is empty and the while loop in Algorithm 1, line 2, terminates. In the worst case, the algorithm needs of the order of the diameter of the graph G(V, E) iterations.


    Algorithm 3 KERNEL2 BFS

1: tid ← getThreadID
2: if Fua[tid] then
3:   Fa[tid] ← true
4:   Xa[tid] ← true
5:   Fua[tid] ← false
6:   Terminate ← false
7: end if

Results for this implementation are summarized in Figure 11.
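For reference, a CUDA rendering of the two BFS kernels (Algorithms 2 and 3) might look as follows; the sentinel entry Va[V] marking the end of the edge list and the launch bounds are our choices.

// CUDA version of Algorithms 2 and 3. Va/Ea is the compact adjacency list of
// Figure 2; Fa is the frontier, Fua the updating frontier, Xa the visited
// flags and Ca the BFS cost (level) of each vertex.
#include <cuda_runtime.h>

__global__ void bfsKernel1(const int* Va, const int* Ea, bool* Fa, bool* Fua,
                           const bool* Xa, int* Ca, int V) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= V || !Fa[tid]) return;
    Fa[tid] = false;
    for (int e = Va[tid]; e < Va[tid + 1]; ++e) {   // Va[tid+1] assumes a sentinel entry
        int nid = Ea[e];
        if (!Xa[nid]) {
            Ca[nid] = Ca[tid] + 1;   // concurrent same-value writes are benign
            Fua[nid] = true;
        }
    }
}

__global__ void bfsKernel2(bool* Fa, bool* Fua, bool* Xa, bool* terminate, int V) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= V || !Fua[tid]) return;
    Fa[tid] = true;
    Xa[tid] = true;
    Fua[tid] = false;
    *terminate = false;   // frontier not empty, continue the host loop
}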

    5. ST-Connectivity (STCON)

The st-connectivity problem closely resembles the BFS problem. Given an unweighted directed graph G(V, E) and two vertices s and t, find a path, if one exists, from s to t. The problem is NL-complete, with a non-deterministic algorithm requiring logarithmic space. The undirected version of the problem falls in the SL category with the same space complexity. Trifonov [48] gives a theoretical deterministic algorithm requiring O(log n log log n) space for the undirected case. Reingold [44] provides the optimal deterministic O(log n)-space algorithm for the undirected case. These results, however, are not implemented in practice. Bader et al. [6] implement STCON by extending their BFS implementation; they find the smallest distance between s and t by keeping track of all expanded frontier vertices. We also modify BFS to find the smallest number of edges needed to reach t from s for the undirected case.

Our approach starts BFS concurrently from s and t, with an initial color assigned to each. In each iteration, colors are propagated to neighbors along with the BFS cost. Termination occurs when the two colors meet. Since both frontiers hold the smallest distance from their respective source vertices to the current vertices, the smallest path between s and t is found when the frontiers come in contact with each other. Figure 5 depicts the two termination conditions due to merging of frontiers, either at a vertex or at an edge.


Figure 5. Parallel st-connectivity with colors expanding from the s and t vertices.

GPU Implementation

Along with Va, Ea, Fa, and Ca we keep two boolean arrays Ra and Ga of size |V|, one for the red color and the other for the green, marking the vertices visited by the s and t frontiers respectively. Initially Ra and Ga are set to false and Fa contains the source and target vertices. To keep the state of the variables intact, alternate updating arrays Rua, Gua and Fua of size |V| are used in each iteration.

Each vertex, if present in Fa, reads its color from Ra and Ga and sets its own color to one of the two. This is exclusive, as a vertex can exist in only one of the two arrays; an overlap is a termination condition for the algorithm. Each vertex updates the cost of its unvisited neighbors by adding 1 to its own cost and writing it to the neighbor's index in Ca. Based on its color, the vertex also adds its neighbors to its own color's visited vertices by adding them to either Rua or Gua. The algorithm terminates if any unvisited neighbor of the vertex is of the opposite color. The vertex removes itself from the frontier array Fa and adds its neighbors to the updating frontier array Fua. Kernel1 (Algorithm 4) depicts these steps.

The second kernel (Algorithm 5) copies the updating arrays Fua, Rua, Gua to the actual arrays Fa, Ra and Ga for all newly visited vertices. It also checks the termination condition due to merging of frontiers and terminates the algorithm if the frontiers meet at any vertex. Figure 12 summarizes the results for this implementation.

    Algorithm 4 KERNEL1 STCON

1: tid ← getThreadID
2: if Fa[tid] then
3:   Fa[tid] ← false
4:   for all neighbors nid of tid do
5:     if (Ra[tid] & Ga[nid]) | (Ga[tid] & Ra[nid]) then Terminate
6:     if NOT (Ga[nid] | Ra[nid]) then
7:       if Ga[tid] then Gua[nid] ← true
8:       if Ra[tid] then Rua[nid] ← true
9:       Fua[nid] ← true
10:      Ca[nid] ← Ca[tid] + 1
11:    end if
12:  end for
13: end if

    Algorithm 5 KERNEL2 STCON

1: tid ← getThreadID
2: if Fua[tid] then
3:   if Gua[tid] & Rua[tid] then Terminate
4:   Fa[tid] ← true
5:   if Rua[tid] then Ra[tid] ← true
6:   if Gua[tid] then Ga[tid] ← true
7:   Fua[tid] ← false
8:   Rua[tid] ← false
9:   Gua[tid] ← false
10: end if
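A condensed CUDA sketch of the frontier-meeting test at the heart of Algorithms 4 and 5 is given below; the cost update and the color bookkeeping are omitted, the array names follow the text, and the found flag is our own name for the termination signal.

// Fragment of the STCON frontier kernel: a frontier vertex of one color that
// sees a neighbor already claimed by the other color signals termination.
#include <cuda_runtime.h>

__global__ void stconMeetCheck(const int* Va, const int* Ea, const bool* Fa,
                               const bool* Ra, const bool* Ga,
                               bool* found, int V) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= V || !Fa[tid]) return;
    for (int e = Va[tid]; e < Va[tid + 1]; ++e) {
        int nid = Ea[e];
        // frontiers meet: a red vertex touching a green one, or vice versa
        if ((Ra[tid] && Ga[nid]) || (Ga[tid] && Ra[nid]))
            *found = true;
    }
}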

    6. Single Source Shortest Path (SSSP)

The sequential solution to the single-source shortest path problem comes from Dijkstra [22]. Originally the algorithm required O(V^2) time, but it was later improved using a Fibonacci heap to O(V log V + E). A parallel version of Dijkstra's algorithm on a PRAM, given in [18], is an O(V^{1/3} log V)-time algorithm requiring O(V log V) work. Nepomniaschaya et al. [41] parallelized Dijkstra's algorithm for associative parallel processors. Narayanan [39] solves the SSSP problem for processor arrays. Although parallel implementations of Dijkstra's SSSP algorithm have been reported [17], an efficient PRAM algorithm does not exist.


Single-source shortest path does not traverse the graph in levels, as the cost of a visited vertex may change when a lower cost path is discovered later in the execution. Simultaneous updates are triggered by vertices undergoing a change in cost values. These vertices constitute an execution mask. The termination condition is reached at equilibrium, when there is no change in cost for any vertex.

We assign one thread to every vertex. Threads in the execution mask execute in parallel. Each vertex updates the costs of its neighbors and removes itself from the execution mask. Any vertex whose cost is updated is put into the execution mask for the next iteration of execution. This process is repeated until there is no change in cost for any vertex. Figure 6 shows the execution mask (shown as colors) and cost states for a simple case; costs are updated in each iteration, with vertices undergoing re-execution if their cost changes.

Figure 6. SSSP execution: In each iteration, vertices in the mask update the costs of their neighbors. A vertex whose cost changes is put in the mask for execution in the next iteration.

GPU Implementation

For our implementation (Algorithm 6 and Algorithm 7) we keep a boolean mask Ma and a cost array Ca of size |V|. Wa holds the weights of the edges, and an updating cost array Cua is used for intermediate cost values. Initially, the mask Ma contains only the source vertex. Each vertex looks at its entry in the mask Ma. If true, it fetches its own cost from Ca and updates the costs of its neighbors in the updating cost array Cua if they are greater than Ca(u) + Wa(u, v) (where u is the current vertex index and v is the neighboring vertex index). The alternate cost array Cua is used to resolve read-after-write inconsistencies in the global memory. An update of Cua(v) needs to lock its memory location while modifying the cost, as many threads may write different values to the same location concurrently. We use the atomicMin function supported on CUDA 1.1 hardware (lines 5-9, Algorithm 6) to resolve this.

    Algorithm 6 KERNEL1 SSSP

1: tid ← getThreadID
2: if Ma[tid] then
3:   Ma[tid] ← false
4:   for all neighbors nid of tid do
5:     Begin Atomic
6:       if Cua[nid] > Ca[tid] + Wa[nid] then
7:         Cua[nid] ← Ca[tid] + Wa[nid]
8:       end if
9:     End Atomic
10:  end for
11: end if

    Algorithm 7 KERNEL2 SSSP

1: tid ← getThreadID
2: if Ca[tid] > Cua[tid] then
3:   Ca[tid] ← Cua[tid]
4:   Ma[tid] ← true
5:   Terminate ← false
6: end if
7: Cua[tid] ← Ca[tid]

Atomic functions resolve concurrent writes by granting exclusive access to one thread at a time; the clashes are thus serialized in an unspecified order. The function compares the existing Cua(v) cost with Ca(u) + Wa(u, v) and updates the value if necessary. A second kernel (Algorithm 7) is used to reflect the updating cost array Cua into the cost array Ca. If Ca is greater than Cua for any vertex, that vertex is set for execution in the mask Ma and the termination flag is toggled to continue execution. This process is repeated until the mask is empty. Experimental results for this implementation are reported in Figure 13.
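A CUDA sketch of the two SSSP kernels using atomicMin follows; integer edge weights and per-packed-edge indexing of the weight array Wa are our assumptions about the data layout.

// CUDA rendering of the relaxation of Algorithm 6 using atomicMin, and the
// update/termination kernel of Algorithm 7. Integer costs are assumed because
// atomicMin operates on integers on the hardware described here.
#include <cuda_runtime.h>

__global__ void ssspKernel1(const int* Va, const int* Ea, const int* Wa,
                            bool* Ma, const int* Ca, int* Cua, int V) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= V || !Ma[tid]) return;
    Ma[tid] = false;
    for (int e = Va[tid]; e < Va[tid + 1]; ++e) {
        int nid = Ea[e];
        // atomicMin serializes clashing writes; only the smallest candidate survives
        atomicMin(&Cua[nid], Ca[tid] + Wa[e]);
    }
}

__global__ void ssspKernel2(int* Ca, int* Cua, bool* Ma, bool* terminate, int V) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= V) return;
    if (Ca[tid] > Cua[tid]) {      // a shorter path was found
        Ca[tid] = Cua[tid];
        Ma[tid] = true;            // re-examine this vertex in the next iteration
        *terminate = false;
    }
    Cua[tid] = Ca[tid];
}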


    7. All Pairs Shortest Paths (APSP)

Warshall defined the boolean transitive closure of matrices, which was later used to develop the Floyd-Warshall algorithm for the APSP problem. The algorithm has O(V^2) space complexity and O(V^3) time complexity. Numerous parallel versions of the APSP problem have been developed to date [47,43,29]. Micikevicius [37] reported a GPGPU implementation, but due to the O(V^2) space requirement he reported results only on small graphs.

The parallel Floyd-Warshall CREW PRAM algorithm (Algorithm 8) can easily be extended to CUDA if the graph is represented as an adjacency matrix. The kernel implements line 4 of Algorithm 8 while the rest of the code runs on the CPU. This approach, however, requires the entire matrix to be present on the CUDA device, and in practice it performs slower than the approaches outlined below. Please see [30] for a comparative study.

    Algorithm 8 Parallel-Floyd-Warshall

1: Create adjacency matrix A from G(V, E, W)
2: for k from 1 to V do
3:   for all elements of A, in parallel do
4:     A[i, j] ← min(A[i, j], A[i, k] + A[k, j])
5:   end for
6: end for
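For concreteness, a minimal CUDA kernel for line 4 of Algorithm 8 could look as follows; the row-major matrix layout and the large integer sentinel for missing edges are our assumptions, and the kernel is launched V times from the CPU, once per value of k.

// One k-step of the parallel Floyd-Warshall of Algorithm 8: every (i, j) pair
// is relaxed in parallel through the current intermediate vertex k.
// Missing edges are assumed to use a large sentinel (e.g. 0x3f3f3f3f) that
// cannot overflow when two such values are added.
#include <cuda_runtime.h>

__global__ void floydWarshallStep(int* A, int k, int V) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < V && j < V)
        A[i * V + j] = min(A[i * V + j], A[i * V + k] + A[k * V + j]);
}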

7.1. APSP using SSSP

Reducing space requirements on the CUDA device directly helps in handling larger graphs. A simple space-conserving solution to the APSP problem is to run SSSP from each vertex iteratively using the graph representation given in Figure 2. This implementation requires O(V + E) space on the GPU, with a vector of size O(V) copied back to the CPU memory in each iteration. For dense graphs, however, this approach proves inefficient. We implemented this approach for general graphs and found it to be a scalable solution for low-degree graphs. See the results in Figure 14.

    7.2. APSP as Matrix Multiplication

Katz and Kider [33] formulate a CUDA implementation for APSP on large graphs using a matrix block approach. They implement the Floyd-Warshall algorithm, based on transitive closure, with a cache-efficient blocking technique (an extension of the method proposed by Venkataraman [49]), in which the adjacency matrix (broken into blocks) residing in global memory is brought into the multiprocessor shared memory intelligently. They handle larger graphs using multiple CUDA devices by partitioning the problem across the devices. We take a different approach and stream data from the CPU to GPU memory to handle larger matrices. Our implementation uses a modified parallel matrix multiplication with a blocking approach. Our times are slightly slower than Katz and Kider's for fully connected small graphs. For general large graphs, however, we gain a 2-4 times speedup over the method proposed by Katz and Kider.

A simple modification of the matrix multiplication algorithm yields an APSP solution (Algorithm 9). Lines 4-11 form the general matrix multiplication algorithm with the multiplication and addition operations replaced by addition and minimum operations respectively (line 7). The outer loop (line 2) utilizes the transitive property of matrix multiplication and runs log V times.

    Algorithm 9 MATRIX APSP

1: D1 ← A
2: for m ← 1 to log V do
3:   for i ← 1 to V do
4:     for j ← 1 to V do
5:       Dm[i, j] ← ∞
6:       for k ← 1 to V do
7:         Dm[i, j] ← min(Dm[i, j], D(m−1)[i, k] + A[k, j])
8:       end for
9:     end for
10:  end for
11: end for

We modify the parallel version of matrix multiplication proposed by Volkov and Demmel [51] for our APSP solution.


7.2.1. Cache Efficient Graph Representation

For matrix multiplication based APSP, we use an adjacency matrix to represent the graph. Figure 7 depicts the cache-efficient, conflict-free blocking scheme used for matrix multiplication by Volkov and Demmel. We present two new ideas over the basic matrix multiplication scheme. The first is a modification to handle graphs larger than the device memory by streaming data as required from the CPU. The second is lazy evaluation of the minimum, which results in a huge boost in performance.

Figure 7. Blocks for matrix multiplication by Volkov and Demmel [51], modified to stream from CPU to GPU.

Streaming Blocks: To handle large graphs, the adjacency matrix present in the CPU memory is divided into rectangular row and column submatrices. These are streamed into the device global memory, and a matrix block Dm based on their values is computed. Let R be the row and C the column submatrix of the original matrix present in the device memory. For every row submatrix R we iterate through all column submatrices C of the original matrix. We assume the CPU memory is large enough to hold the adjacency matrix, though our method can be easily extended to secondary storage with slight modification.

Let the size of the available device memory be GPUmem. We divide the adjacency matrix into row submatrices R of size (B × V) and column submatrices C of size (V × B) such that

  size(R) + size(C) + size(Dm) = B·V + V·B + B·B ≤ GPUmem,

where B is the block size. A total of

  log V · (V^3/B + V^2) = O(log V · V^3/B)

elements are transferred between the CPU and GPU for a V × V adjacency matrix during our APSP computation, with (V^3 log V)/B reads and V^2 log V writes. The time taken for this data transfer is negligible compared to the computation time, and can be easily hidden using the asynchronous read and write operations supported on current generation CUDA hardware.

For example, for an 18K × 18K matrix with integer entries and 1 GB of device memory, a block size of B = 6K can be used. At a practical PCI-e ×16 transfer rate of 3.3 GB/s, the data transfer takes nearly 16 seconds. This time is negligible compared to the roughly 800 seconds of computation time taken on a Tesla for an 18K × 18K matrix without streaming (result taken from Table 4).
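A back-of-the-envelope sketch of this sizing, reproducing the ballpark of the 18K example above under the stated assumptions (4-byte entries, 1 GB of device memory, 3.3 GB/s transfer rate), is given below.

// Back-of-the-envelope sizing for the streaming scheme: pick the largest block
// size B such that the R (B x V), C (V x B) and Dm (B x B) blocks of 4-byte
// entries fit in device memory, then estimate the CPU-GPU transfer time.
#include <cmath>
#include <cstdio>

int main() {
    const double V = 18 * 1024;                   // 18K x 18K adjacency matrix
    const double gpuMem = 1024.0 * 1024 * 1024;   // 1 GB of device memory
    const double bytesPerEntry = 4.0;             // integer entries
    const double pcieRate = 3.3e9;                // practical PCI-e x16 rate, bytes/s

    // Solve B*V + V*B + B*B <= gpuMem/bytesPerEntry for the largest B.
    double cap = gpuMem / bytesPerEntry;
    double B = (-2.0 * V + std::sqrt(4.0 * V * V + 4.0 * cap)) / 2.0;

    // Roughly V^3 log(V) / B elements move across the bus during the whole APSP run.
    double elements = V * V * V * std::log2(V) / B;
    double seconds = elements * bytesPerEntry / pcieRate;

    printf("block size B ~ %.0f, transfer ~ %.0f s\n", B, seconds);
    return 0;
}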

Lazy Minimum Evaluation: The basic step of Floyd's algorithm is similar to matrix multiplication, with multiplication replaced by addition and addition by minimum finding. However, for sparse-degree graphs the connections are few and the other entries of the adjacency matrix are infinity. When an entry is infinity, additions involving it and the subsequent minimum finding can be skipped altogether without affecting correctness. We therefore evaluate the minimum in a lazy manner, skipping all paths involving a non-existent edge. This results in a speedup of 2 to 3 times over complete evaluation on most graphs. It also makes the running time degree-dependent.

GPU Implementation

Let R be the row and C the column submatrices of the adjacency matrix. Let Di denote a temporary matrix variable of size B × B used to hold intermediate values. In each iteration of the outer loop (Algorithm 9, line 2), Di is modified using C and R.


Lines 3-10 of Algorithm 9 are executed on the CUDA device while the rest of the code executes on the CPU.

    Algorithm 10 KERNEL APSP

1: d[1:16] ← 16 values of a 64×16 block of Di
2: for loop over (dimension of C)/16 times do
3:   c[1:16][1:16] ← next 16×16 block of C
4:   p ← 0
5:   for 1 to 4 do
6:     r[1:4] ← 4 values of the next 64×16 block of R
7:     for j ← 1 to 4 do
8:       for k ← 1 to 16 do
9:         if d[k] > r[j] + c[p][k] then
10:          d[k] ← r[j] + c[p][k]
11:        end if
12:      end for
13:      p ← p + 1
14:    end for
15:  end for
16: end for
17: Merge d[1:16] with the 64×16 block of Di

Our kernel modifies the matrix multiplication routine given by Volkov and Demmel [51] by replacing the multiplication and addition operations with addition and minimum operations. Shared memory is used as a user-managed cache to improve performance. Volkov and Demmel bring sections of the matrices R, C and Di into shared memory in blocks: R is brought in 64×4 blocks, C in 16×16 blocks and Di in 64×16 blocks. These values are selected to maximize the throughput of the CUDA device. During execution, each thread computes 16 values of a 64×16 block of Di (Algorithm 10, line 1). Algorithm 10 describes the modified matrix multiplication kernel. Please see [51] for full details of the matrix multiplication kernel.

Here d, r and c are shared memory variables. The above kernel produces a speedup of up to 2 times over the earlier parallel matrix multiplication kernel [42]. Lazy minimum evaluation goes further: if either entry of R or C is infinity due to a non-existent edge, we do not update the output matrix. We achieve much better performance on general graphs using lazy minimum evaluation. However, the computation time is then no longer independent of the degree of the graph.
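A simplified, non-blocked CUDA kernel for one min-plus step of Algorithm 9, with the lazy-minimum skip, is sketched below; it omits the shared-memory blocking of Algorithm 10 and assumes an integer adjacency matrix with a large INF sentinel for missing edges.

// One min-plus "multiplication" Dout <- min(D, D (+) A) over int matrices, with
// the lazy-minimum skip: terms involving a non-existent edge (INF) are ignored.
// This is a naive version; the paper's kernel adds shared-memory blocking in
// the style of Volkov and Demmel.
#include <cuda_runtime.h>

#define INF 0x3f3f3f3f   // large sentinel that cannot overflow when added once

__global__ void minPlusStep(const int* D, const int* A, int* Dout, int V) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= V || j >= V) return;

    int best = D[i * V + j];
    for (int k = 0; k < V; ++k) {
        int dik = D[i * V + k];
        int akj = A[k * V + j];
        if (dik == INF || akj == INF) continue;   // lazy minimum: skip missing edges
        int cand = dik + akj;
        if (cand < best) best = cand;
    }
    Dout[i * V + j] = best;
}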

7.3. Gaussian Elimination Based APSP

In parallel work that came to light very recently, Buluc et al. [11] formulate a fast recursive APSP algorithm based on Gaussian elimination. They cleverly extend the R-Kleene [20] algorithm for in-place APSP computation in global memory. They split each APSP step recursively into 2 APSPs involving graphs of half the size, 6 matrix multiplications and 2 matrix additions. The base case is when there are 16 or fewer vertices; Floyd's algorithm is applied in that case by modifying the CUDA matrix multiplication kernel proposed by Volkov and Demmel [51]. They also use the fast matrix multiplication for the other steps. Their implementation is degree independent and fast; they achieve a speedup of 5-10 times over the APSP implementation presented above.

While the approach of Buluc et al. is the fastest APSP implementation on the GPU so far, our key ideas can extend it further. We incorporated the lazy minimum evaluation into their code and obtained a speedup of 2 times over their native approach. Their approach is memory heavy and is best suited to cases where the adjacency matrix fits completely in the GPU device memory. Their approach involves several matrix multiplications and additions. Extending it to stream the data from the CPU to the GPU for matrix operations, in terms of blocks that fit in the device memory, would involve many more communications between the CPU and the GPU and many more computations. The CPU-to-GPU communication bandwidth has not at all kept pace with the increase in the number of cores or the computation power of the GPU. Thus, our non-matrix approach is likely to scale better to arbitrarily large graphs than the Gaussian elimination based approach of Buluc et al.

A comparison of the matrix multiplication approach with APSP using SSSP and the Gaussian elimination approach is summarized in Figure 14. A comparison of the matrix multiplication approach with Katz and Kider is given in Figure 15. The behavior of the matrix approach with varying degree is reported in Table 1.


    8. Minimum Spanning Tree (MST)

The best time complexity for a serial solution to the MST problem, proposed by Bernard Chazelle [13], is O(E α(E, V)), where α is the functional inverse of Ackermann's function. Boruvka's algorithm [10] is a popular solution to the MST problem; in a serial setting it takes O(E log V) time. Numerous parallel variations of this algorithm also exist [34]. Chong et al. [15] report an EREW PRAM algorithm requiring O(log V) time and O(V log V) work. Bader et al. [4] design a fast algorithm for symmetric multiprocessors with O((V + E)/p) lookups and local operations for a p-processor machine. Chung et al. [16] efficiently implement Boruvka's algorithm on an asynchronous distributed memory machine by reducing communication costs. Dehne and Gotz implement three variations of Boruvka's algorithm using the BSP model [21].

We implement a modified parallel Boruvka algorithm on CUDA. We create colored partial spanning trees from all vertices, grow individual trees, and merge colors when trees come in contact. Cycles are removed explicitly in each iteration. Connected components are found via color propagation, an approach similar to our SSSP implementation (Section 6).

We represent each supervertex of Boruvka's algorithm as a color. Each supervertex finds the minimum weighted edge to another supervertex and adds it to the output MST array. Each newly added edge in the MST edge list updates the colors of both its supervertices until there is no change in the color values of any supervertex. Cycles are removed from the newly created graph, and each vertex in a supervertex updates its color to the new color of the supervertex. This process is repeated, and the number of supervertices keeps decreasing. The algorithm terminates when exactly one supervertex remains.

GPU Implementation

We use a colors array Ca, a color index array Cia (holding, per vertex, the index of the color to which the vertex belongs), an active colors array Aca and a newly added MST edges array NMsta, each of size |V|. The output is a boolean edge array Msta that marks the edges belonging to the MST.

    Algorithm 11 Minimum Spanning Tree

1: Create Va, Ea, Wa from G(V, E, W)
2: Initialize Ca and Cia to the vertex id.
3: Initialize Msta to false
4: while more than 1 supervertex remains do
5:   Clear NMsta, Aca, Dega and Cya
6:   Kernel1 for each vertex: Finds the minimum weighted outgoing edge from each supervertex to the lowest outgoing color by working at each vertex of the supervertex; sets the edge in NMsta.
7:   Kernel2 for each supervertex: Each supervertex sets its added edge in NMsta as part of the output MST, Msta.
8:   Kernel3 for each supervertex: Each added edge in NMsta increments the degrees of both its supervertices in Dega, using color as the index. Old colors are saved in PrevCa.
9:   while no change in color values Ca do
10:    Kernel4 for each supervertex: Each edge in NMsta updates the colors of supervertices by propagating the lower color to the higher.
11:  end while
12:  while a 1-degree supervertex remains do
13:    Kernel5 for each supervertex: All 1-degree supervertices nullify their edge in NMsta, and decrement their own degree and the degree of their outgoing supervertex using the old colors from PrevCa.
14:  end while
15:  Kernel6 for each supervertex: Each remaining edge in NMsta adds itself to Cya using the new colors from Ca.
16:  Kernel7 for each supervertex: Each entry in Cya is removed from the output MST, Msta, resulting in a cycle-free MST.
17:  Kernel8 for each vertex: Each vertex updates its own color index to the new color of its supervertex.
18: end while
19: Copy Msta to CPU memory as output.


Algorithm 12 KERNEL1 MST (fragment)
11: if ... < Wa[NMsta[col]] then
12:   NMsta[col] ← Ieid
13: end if
14: End Atomic
15: Aca[col] ← true

    Algorithm 13 KERNEL2 MST

1: col ← getThreadID
2: if Aca[col] then
3:   Msta[NMsta[col]] ← true
4: end if

8.2. Finding and Removing Cycles

As C edges are added for C colors, at least one cycle is expected to form in the new graph of supervertices. Multiple cycles can also form for disjoint components of supervertices; Figure 9 shows such a case. It is easy to see that each such component can have at most one cycle, consisting of exactly 2 supervertices with both edges of the cycle having equal weights. Identifying these edges and removing one edge per cycle is crucial for a correct output.

In order to find these edges, we assign degrees to supervertices using the newly added edges NMsta. We then remove all 1-degree supervertices iteratively until no 1-degree supervertex is left, resulting in the supervertices that are part of cycles.

Each added edge increments the degree of both its supervertices, using the color of the supervertex as its index in Dega (Algorithm 14).


After color propagation, i.e., the merger of supervertices (Section 8.3), all 1-degree supervertices nullify their added edge in NMsta. They also decrement their own degree and the degree of their added edge's outgoing supervertex in Dega (Algorithm 16). This process is repeated until no 1-degree supervertex is left, resulting in supervertices whose edges form a cycle.

Incrementing the degree array needs to be done before propagating colors, as the old color is used as the index in Dega for each supervertex. Old colors are also needed after color propagation to identify supervertices while decrementing the degrees. To this end, we preserve the old colors in an alternate color array PrevCa before propagating the new colors (Algorithm 14).

After removing 1-degree supervertices, the resulting supervertices write their edge from NMsta to their new color location in Cya (Algorithm 17), after new colors have been assigned to the supervertices of each disjoint component using Algorithm 15. One of the two edges per disjoint component survives this step; since both edges have equal weights, there is no preference for either edge. Edges in Cya are then removed from the output MST array Msta (Algorithm 18), resulting in a cycle-free partial MST.

    Algorithm 14 KERNEL3 MST

1: col ← getThreadID
2: if Aca[col] then
3:   col2 ← Ca[Cia[Ea[NMsta[col]]]]
4:   Begin Atomic
5:     Dega[col] ← Dega[col] + 1
6:     Dega[col2] ← Dega[col2] + 1
7:   End Atomic
8: end if
9: PrevCa[col] ← Ca[col]

8.3. Merging Supervertices

Each added edge merges two supervertices. The lesser color of the two is propagated by assigning it to the higher colored supervertex. This process is repeated until there is no change in the color values Ca (Algorithm 11, lines 9-11).

Figure 9. Cycle formation among disjoint components of supervertices (colors shown in increasing order).

Algorithm 15 KERNEL4 MST (fragment)
6: if Ca[cid] > Ca[cid2] then
7:   Ca[cid] ← Ca[cid2]
8: end if
9: if Ca[cid2] > Ca[cid] then
10:   Ca[cid2] ← Ca[cid]
11: End Atomic
12: end if
13: end if

8.4. Assigning Colors to Vertices

Each vertex in a supervertex must know its color; the merging of colors in the previous step does not necessarily end with all vertices of a component being assigned the minimum color of that component. Rather, a link in color values is established during the previous step.


    Algorithm 16 KERNEL5 MST

1: cid ← getThreadID
2: col ← PrevCa[cid]
3: if Aca[col] & Dega[col] = 1 then
4:   col2 ← PrevCa[Cia[Ea[NMsta[cid]]]]
5:   Begin Atomic
6:     Dega[col] ← Dega[col] − 1
7:     Dega[col2] ← Dega[col2] − 1
8:   End Atomic
9:   NMsta[col] ← ∅
10: end if

    Algorithm 17 KERNEL6 MST

1: cid ← getThreadID
2: col ← PrevCa[cid]
3: if Aca[col] & NMsta[col] ≠ ∅ then
4:   newcol ← Ca[Cia[Ea[NMsta[col]]]]
5:   Cya[newcol] ← NMsta[col]
6: end if

    Algorithm 18 KERNEL7 MST

1: col ← getThreadID
2: if Cya[col] ≠ ∅ then
3:   Msta[Cya[col]] ← false
4: end if

    Algorithm 19 KERNEL8 MST

1: tid ← getThreadID
2: cid ← Cia[tid]
3: col ← Ca[cid]
4: while col ≠ cid do
5:   col ← Ca[cid]
6:   cid ← Ca[col]
7: end while
8: Cia[tid] ← cid
9: if col ≠ 0 then
10:  Terminate ← false
11: end if

This link must be traversed by each vertex to find the lowest color it should point to. Since the color is the same as the index initially, the color and the index must be the same for all active colors. This property is used while updating the color of each vertex. Each vertex in Kernel8 (Algorithm 19) finds its color index cid and traverses the colors array Ca until the color index equals the color.
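A direct CUDA transcription of Kernel8 (Algorithm 19) is shown below; the array names follow the text, while the launch bounds are our choice.

// CUDA version of Algorithm 19: each vertex walks the chain of color links in
// Ca until its color index settles, then records the resolved supervertex.
#include <cuda_runtime.h>

__global__ void mstKernel8(int* Cia, const int* Ca, bool* terminate, int V) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= V) return;
    int cid = Cia[tid];
    int col = Ca[cid];
    while (col != cid) {     // follow the link until color and index agree
        col = Ca[cid];
        cid = Ca[col];
    }
    Cia[tid] = cid;
    if (col != 0)            // more than one supervertex remains
        *terminate = false;
}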

Timings for the MST implementation on various types of graphs are reported in Figure 16.

    9. Maximum Flow (MF)/Min Cut

Maxflow finds the minimum weighted cut that separates a graph into two disjoint sets of vertices, one containing the source s and the other the target t. The fastest serial solution, due to Goldberg and Rao, takes O(E min(V^{2/3}, sqrt(E)) log(V^2/E) log(U)) time [26], where U is the maximum capacity of the graph.

Popular serial solutions to the max flow problem include the Ford-Fulkerson algorithm [25], later improved by Edmonds and Karp [23], and the push-relabel algorithm [27] of Goldberg and Tarjan. The Edmonds-Karp algorithm repeatedly computes augmenting paths from s to t using BFS, through which flows are pushed, until no augmenting path exists. The push-relabel algorithm works by pushing flow from s to t by increasing the heights of nodes farther away from t. Rather than examining the entire residual network to find an augmenting path, it works locally, looking at each vertex's neighbors in the residual graph.

Anderson and Setubal [3] first gave a parallel version of the push-relabel algorithm. Bader and Sachdeva implemented a parallel cache-efficient variation of the push-relabel algorithm on an SMP [9]. Alizadeh and Goldberg [2] implemented the same on the massively parallel Connection Machine CM-2. GPU implementations of the push-relabel algorithm have also been reported [31]. A CUDA implementation for grid graphs specific to vision applications is reported in [50]. We implement the parallel push-relabel algorithm using CUDA for general graphs.


    The Push-Relabel Algorithm

The push-relabel algorithm constructs and maintains a residual graph at all times. The residual graph Gf of the graph G has the same topology, but consists of the edges that can admit more flow. Each edge has a current capacity in Gf, called its residual capacity, which is the amount of flow it can currently admit. Each vertex in the graph maintains a reservoir of flow (excess flow) and a height. Based on its height and excess flow, either a push or a relabel operation is undertaken at each vertex. Initially the height of s is set to |V| and the height of t to 0. The height is at all times a conservative estimate of the vertex's distance from the sink.

Push: The push operation is applied at a vertex if its height is one more than that of any of its neighbors and it has excess flow in its reservoir. The result of a push is either the saturation of an edge in Gf or the saturation of the vertex, i.e., an empty reservoir.

Relabel: The relabel operation is applied to change heights. If a vertex has excess flow but cannot push it due to a height mismatch, the relabel operation makes its height one more than the minimum height of its neighbors.

Better estimates of the height values can be obtained using global or gap relabeling [9]. Global relabeling uses BFS to correctly assign distances from the target, whereas gap relabeling finds gaps using height mismatches over the entire graph. However, both are expensive operations, especially when executed on parallel hardware. The algorithm terminates when neither a push nor a relabel can be applied. The excess flows in the nodes are then pushed back to the source, and the saturated nodes of the final residual graph give the maximum flow / minimum cut.

GPU Implementation

We keep arrays ea and ha holding the excess flow and height of each vertex. An activity mask Ma holds one of three distinct states per vertex: 0 for the relabel state (ea(u) > 0 and ha(v) ≥ ha(u) for all neighbors v in Gf), 1 for the push state (ea(u) > 0 and ha(u) = ha(v) + 1 for any neighbor v in Gf), and 2 for saturation. Based on these values the push and relabel operations are undertaken. Initially the activity mask is set to 0 for all vertices. We use both local and global relabeling. We apply multiple pushes before applying the relabel operation. Multiple local relabels are also applied before a single global relabel step. Algorithm 20 describes this scenario. Backward BFS from the sink is used for global relabeling.

Figure 10. Parallel maxflow, showing push and relabel operations in the residual graph Gf based on ea and ha (ha(s) = number of nodes, ha(t) = 0).

    Algorithm 20 Max Flow Step

1: for 1 to k times do
2:   Apply m pushes
3:   Apply Local Relabel
4: end for
5: Apply Global Relabel

Relabel: Relabels are applied as given in Algorithm 20. The local relabel operation is applied at a vertex if it has positive excess flow but no push is possible to any neighbor due to a height mismatch. The height of the vertex is increased by setting it to one more than the minimum height of its neighboring nodes. For local relabeling, each vertex thus updates its height to one more than the minimum height of its neighbors in the residual graph.


Kernel1 (Algorithm 21) explains this operation. For global relabeling we use backward BFS from the sink, which propagates height values to each vertex of the residual graph based on its actual distance from the sink.

    Algorithm 21 KERNEL1 MAXFLOW

1: tid ← getThreadID
2: if Ma[tid] = 0 then
3:   for all neighbors nid of tid do
4:     if nid ∈ Gf & minh > ha[nid] then
5:       minh ← ha[nid]
6:     end if
7:   end for
8:   ha[tid] ← minh + 1
9:   Ma[tid] ← 1
10: end if

Push: The push operation can be applied at a vertex if it has excess flow and its height is equal to one more than that of one of its neighbors. After the push, either the vertex is saturated (i.e., its excess flow is zero) or the edge is saturated (i.e., the capacity of the edge is zero).

Each vertex looks at its activity mask Ma; if it is 1, the vertex pushes the excess flow along the edges present in the residual graph. It atomically subtracts the flow from its own reservoir and adds it to the neighbor's reservoir. For every edge (u, v) of u in the residual graph, it atomically subtracts the flow from the residual capacity of (u, v) and atomically adds it to the residual capacity of (v, u). Kernel2 (Algorithm 22) performs the push operation.

Algorithm 23 changes the state of each vertex. The activity mask is set to state 0, 1 or 2, reflecting the relabel, push and saturation states, based on the excess flow, residual edge capacities and height mismatches at each vertex. Each vertex sets the termination flag to false if its state undergoes a change.

The operations terminate when there is no change in the activity mask. This does not necessarily occur when all nodes have saturated. Due to the saturation of edges, some unsaturated nodes may get cut off from the sink; these nodes do not contribute any further and are not actively taking part in the process. Consequently their state does not change, which could lead to an infinite loop if termination were based on the saturation of nodes.

    Algorithm 22 KERNEL2 MAXFLOW

1: tid ← getThreadID
2: if Ma[tid] = 1 then
3:   for all neighbors nid of tid do
4:     if nid ∈ Gf & ha[tid] = ha[nid] + 1 then
5:       minflow ← min(ea[tid], Wa[nid])
6:       Begin Atomic
7:         ea[tid] ← ea[tid] − minflow
8:         ea[nid] ← ea[nid] + minflow
9:         Wa(tid, nid) ← Wa(tid, nid) − minflow
10:        Wa(nid, tid) ← Wa(nid, tid) + minflow
11:      End Atomic
12:    end if
13:  end for
14: end if
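A CUDA sketch of the push step of Algorithm 22 using integer atomics follows; the per-edge residual capacities in Wa and the precomputed reverse-edge index array rev are our assumptions about the data layout.

// CUDA sketch of the push step (Algorithm 22) using integer atomics. Wa holds
// residual capacities per directed edge; rev[e] is the position of the reverse
// edge (v,u) for edge e=(u,v), an auxiliary array assumed to be precomputed.
#include <cuda_runtime.h>

__global__ void pushKernel(const int* Va, const int* Ea, const int* rev,
                           int* Wa, int* ea, const int* ha,
                           const int* Ma, int V) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= V || Ma[u] != 1) return;                 // only vertices in the push state
    for (int e = Va[u]; e < Va[u + 1]; ++e) {
        int v = Ea[e];
        if (Wa[e] > 0 && ha[u] == ha[v] + 1) {        // edge admits flow, heights match
            int flow = min(ea[u], Wa[e]);
            atomicSub(&ea[u], flow);                  // drain own reservoir
            atomicAdd(&ea[v], flow);                  // fill neighbor's reservoir
            atomicSub(&Wa[e], flow);                  // shrink residual capacity (u,v)
            atomicAdd(&Wa[rev[e]], flow);             // grow residual capacity (v,u)
        }
    }
}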

    Algorithm 23 KERNEL3 MAXFLOW

1: tid ← getThreadID
2: for all neighbors nid of tid do
3:   if ea[tid] ≤ 0 OR Wa(tid, nid) ≤ 0 then
4:     state ← 2
5:   else
6:     if ea[tid] > 0 then
7:       if ha[tid] = ha[nid] + 1 then
8:         state ← 1
9:       else
10:        state ← 0
11:      end if
12:    end if
13:  end if
14: end for
15: if Ma[tid] ≠ state then
16:  Terminate ← false
17:  Ma[tid] ← state
18: end if

Results of this implementation are given in Figure 17. Figure 18 shows the behavior of our implementation with varying m and k, in accordance with Algorithm 20.


    10. Performance Analysis

We choose graphs representative of real-world problems. Our graph sizes vary from 1M to 10M vertices for all algorithms except APSP. The focus of our experiments is to show fast processing of general graphs. Scarpazza et al. [45] focus on improving the throughput of the Cell/B.E. for BFS. Bader and Madduri [6,8,4] use the CRAY MTA-2 for BFS, STCON and MST implementations. Dehne and Gotz [21] use CC48 to perform MST. Edmonds et al. [24] use the Parallel Boost graph library and Crobak et al. [19] use the CRAY MTA-2 for their SSSP implementations. Yoo et al. [52] use the BlueGene/L for a BFS implementation. Though our input sizes are not comparable with the ones used in those implementations, which are of the order of billions of vertices and edges, we show implementations on hardware several orders of magnitude less expensive. Because of the large difference in input sizes, we do not compare our results with those implementations directly. We do show a comparison of our APSP approach with Katz and Kider [33] and Buluc et al. [11] on similar graph sizes, as those implementations are directly comparable.

10.1. Types of Graphs
We tested our algorithms on various types of synthetic and real-world large graphs, including graphs from the ninth DIMACS challenge [1]. Primarily, three generative models were used for performance analysis, using the Georgia Tech graph generators [7]. These models approximate real-world datasets and are good representatives of graphs commonly used in real-world domains. We assume all graphs to be connected with positive weights. Graphs are also assumed to be undirected, with a complementary edge present for every edge in the graph. The models used for graph generation are:

Random Graphs: Random graphs have a short band of degrees in which all vertices lie, with a large number of vertices having similar degrees. A slight deviation from the average degree results in a drastic decrease in the number of such vertices in the graph.

R-MAT [12]/Scale Free/Power law: A large number of vertices have small degree, with a few vertices having large degree. This model best approximates large graphs found in the real world. Practical large-graph models, including Erdos-Renyi, power-law, and their derivatives, follow this generative model. Because most vertices have small degree and the degree distribution is uneven, these graphs expand slowly in each iteration and exhibit uneven load balancing. These graphs are therefore a worst-case scenario for our algorithms, as verified empirically. A sketch of the recursive edge generation behind this model is given after this list.

SSCA#2 [5]: These graphs are made up of random-sized cliques of vertices, with a hierarchical distribution of edges between cliques based on a distance metric.
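For reference, the R-MAT model above places each edge by recursively descending into one of the four quadrants of the adjacency matrix with skewed probabilities, which is what produces the few very-high-degree vertices. A minimal host-side sketch is given below; the quadrant probabilities shown are illustrative and not necessarily the GTgraph defaults.

#include <cstdlib>
#include <utility>

// Illustrative sketch of R-MAT edge generation [12]: each of the scale bits
// of the endpoint ids is chosen by picking one of four quadrants with skewed
// probabilities a, b, c (the remaining probability d falls to the last case).
std::pair<int, int> rmat_edge(int scale,           // the graph has 2^scale vertices
                              double a = 0.45, double b = 0.15, double c = 0.15)
{
    int u = 0, v = 0;
    for (int bit = 0; bit < scale; ++bit) {
        double r = rand() / (double)RAND_MAX;
        u <<= 1; v <<= 1;
        if (r < a)              { /* top-left quadrant: both bits stay 0 */ }
        else if (r < a + b)     { v |= 1; }        // top-right quadrant
        else if (r < a + b + c) { u |= 1; }        // bottom-left quadrant
        else                    { u |= 1; v |= 1; }// bottom-right quadrant (probability d)
    }
    return std::make_pair(u, v);                   // one directed edge (u, v)
}

Repeatedly drawing edges this way, and adding the complementary edge since we assume undirected graphs, yields the skewed degree distribution described above.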

10.2. Experimental Setup
Our testbed consisted of a single Nvidia GTX 280 graphics adapter with 1024MB of memory on board (30 multiprocessors, 240 stream processors), controlled by a quad-core Intel processor (Q6600 @ 2.4GHz) with 4GB RAM running Fedora Core 9. For CPU comparison we use the Boost C++ graph library (with the exception of BFS) compiled using gcc at optimization setting O4. We use our own BFS implementation on the CPU as it proved faster than Boost; it was implemented using STL and C++ and compiled with gcc at O4 optimization. A quarter of a Tesla S1070 1U was used for graphs larger than 6M in most cases; it has the same GPU as the GTX 280 but with 4096MB of memory.

10.3. Summary of Results
In our experiments, we observed lower performance on low-degree/linear graphs. This behavior is not surprising, as our implementations cannot exploit parallelism on such graphs; only two vertices are processed in parallel in each iteration for a linear graph. We also observed that vertex list compaction helps reduce the running time by a factor of 1.5 in such cases. Another hurdle for efficient parallel execution is large variation in degree, which slows down execution on a SIMD machine owing to uneven load per thread. This behavior is seen in all algorithms on R-MAT graphs.


Larger-degree graphs benefit more from our implementations, as the expansion per iteration is greater, resulting in more data-parallel work per iteration and thus better performance. We show the relative behavior of each algorithm in the following sections; detailed timings for the plots in each section are listed in Table 3 and Table 4 in the Appendix.

    10.4. Breadth First Search (BFS)

Figure 11. Breadth first search on varying number of vertices for synthetic graphs. (Log plot of time in milliseconds against the number of vertices in millions, average degree 12, weights in range 1–100; curves for Random, R-MAT, and SSCA#2 graphs on the GPU and CPU.)

Figure 11 summarizes the results for our BFS implementation using the compaction process. R-MAT graphs perform poorly on the GPU as the expansion of the frontier at every level is low owing to their linear nature. This prevents optimal utilization of the available computing resources, as discussed in Section 10.3. Further, because of the few high-degree vertices, the loop in Algorithm 2, line 4, results in varying loads on different threads. Loops of non-uniform length are inefficient on a SIMD architecture. Our implementation gains a speedup of nearly 5 times for R-MAT graphs and nearly 15 times for Random and SSCA#2 graphs over the CPU implementation.

Throughput, in millions of edges processed per second (ME/S), is used as a BFS performance measure in [45]. It is, however, not a good measure for our BFS implementation, as we exploit parallelism over vertices. Nevertheless, for a 1M-node graph with 12M edges we obtain a throughput of 315 ME/S, as opposed to the 101 ME/S obtained on the Cell/B.E.
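This figure is consistent with Table 3: BFS on the 1M-vertex random graph takes about 38 ms on the GTX 280, and 12 × 10^6 edges / 0.038 s ≈ 315 × 10^6 edges processed per second.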

    10.5. ST-Connectivity (STCON)

Figure 12. st-Connectivity for varying number of vertices for synthetic graph models. (Log plot of time in milliseconds against the number of vertices in millions, average degree 12, weights in range 1–100; curves for Random, R-MAT, and SSCA#2 graphs on the GPU and CPU.)

Figure 12 shows the results of our STCON implementation. In the worst case, our STCON implementation takes half the time of our BFS implementation, since the maximum distance between s and t can be the diameter of the graph and we start BFS from s and t concurrently for STCON. For a fair comparison we averaged the times over 100 iterations with randomly selected s and t vertices. Because of the linear nature of R-MAT graphs we see a slower expansion of frontier vertices, with uneven loops leading to load imbalance and thus poor timings.

10.6. Single Source Shortest Path (SSSP)
Single source shortest path results are summarized in Figure 13. Vertex list compaction was used to reduce the number of threads executed in each kernel call for R-MAT graphs. A 40% reduction in time was observed for R-MAT graphs using compaction over the non-compacted version. Even after compaction, however, R-MAT graphs perform badly on the GPU compared to other types of graphs, while on the CPU they perform better than the other graph models. We gain a 15–20 times speedup against Boost for our SSSP implementation.


Figure 13. Single source shortest path on varying number of vertices for synthetic graph models. (Log plot of time in milliseconds against the number of vertices in millions, average degree 12, weights in range 1–100; curves for Random, R-MAT, and SSCA#2 graphs on the GPU and CPU.)

    10.7. All Pairs Shortest Paths (APSP)

Figure 14. Comparing APSP using SSSP, APSP using matrix multiplications, and APSP using Gaussian elimination [11] approaches on a GTX 280 and Tesla. (Log plot of time in milliseconds against the number of vertices, from 256 to 30K, average degree 12, weights in range 1–100; curves for the SSSP-based approach on Random, R-MAT, and SSCA#2 graphs, the matrix-based approach on Random, R-MAT, SSCA#2, and fully connected graphs, and Gaussian elimination with and without lazy minimum evaluation.)

The SSSP-based, matrix multiplication-based, and Gaussian elimination-based APSP implementations are compared in Figure 14 on the GTX 280 and Tesla. The matrix multiplication APSP uses the graph representation outlined in Section 7.2.1. We stream data from the CPU to the GPU for graphs larger than 18K vertices with this approach. As seen from the experiments, APSP using SSSP performs poorly on all types of graphs, but it is a scalable solution for large, low-degree graphs.

Figure 15. Comparing our matrix-based APSP approach with Katz and Kider [33] on a Quadro FX 5600. (Log plot of time in milliseconds against the number of vertices, from 256 to 18K, average degree 12, weights in range 1–100; curves for the matrix-based approach on Random, R-MAT, SSCA#2, and fully connected graphs, and for Katz and Kider.)

For smaller graphs, the matrix approach proves much faster. We do not use lazy min for fully connected graphs, as it becomes an overhead for them. We are able to process a fully connected 25K-vertex graph, using streaming of matrix blocks, in nearly 75 minutes on a single unit of a Tesla S1070, which is equivalent in computing power to the GTX 280 but has 4 times the memory.

For a direct comparison with Katz and Kider [33], we also show results on a Quadro FX 5600. Figure 15 summarizes the results of these experiments. In the case of fully connected graphs we are 1.5 times slower than Katz and Kider up to the 10K-vertex graph. We achieve a speedup of 2–4 times over Katz and Kider for larger general graphs. The Gaussian elimination based APSP by Buluc et al. [11] is the fastest among the approaches. However, introducing lazy minimum evaluation into their approach provides a further speedup of 2–3 times, as can be seen from Figure 14 and Table 4.

10.8. Minimum Spanning Tree (MST)
Timings for the minimum spanning tree implementation are summarized in Figure 16 for synthetic graphs. Our MST algorithm is not affected by the linearity of the graph, as each supervertex is processed in parallel, independently of other supervertices, and there is no expansion of a frontier.


Figure 16. Minimum spanning tree results for varying number of vertices for synthetic graph models. (Log plot of time in milliseconds against the number of vertices in millions, average degree 12, weights in range 1–100; curves for Random, R-MAT, and SSCA#2 graphs on the GPU and CPU.)

However, for R-MAT graphs we see a slowdown due to uneven loops over vertices with high degree, which proves inefficient on a SIMD model. A speedup of 15 times is achieved for random and SSCA#2 graphs, and a speedup of nearly 5 times over the Boost C++ implementation for R-MAT graphs.

    10.9. Max Flow (MF)

Figure 17. Maxflow results for varying number of vertices for synthetic graph models. (Log plot of time in milliseconds against the number of vertices in millions, average degree 12, weights in range 1–100; curves for Random, R-MAT, and SSCA#2 graphs on the GPU and CPU.)

Maxflow timings for various synthetic graphs are shown in Figure 17. We average the maxflow timings over 5 iterations with randomly selected source s and sink t vertices. R-MAT timings on the GPU overshoot the CPU times because of the linear nature of these graphs.

We also study the behavior of our max flow implementation for varying m and k, which control the periodicity of the local and global relabeling steps as given in Algorithm 20. Figure 18 depicts this behavior for all three generative models on a 1M-vertex graph. Random and SSCA#2 graphs show similar behavior, with time increasing with the number of pushes for low or no local relabels and decreasing as more local relabels are applied. We found that for m = 3 and k = 7 the timings were optimal for random and SSCA#2 graphs. R-MAT graphs, however, exhibit very different behavior for varying m and k. For low local relabels the time increases with the increasing number of pushes, similar to random and SSCA#2; however, as local relabels are increased we see an increase in timings. This can be attributed to the fact that linearity causes slow convergence of the local relabels for R-MAT graphs.

10.10. Scalability
The scalability of our implementations over varying degree is summarized in Table 1. We show results for a 100K-vertex graph with varying degree; for the APSP matrix-based approach, results for a 4K-vertex graph are shown. As expected, the running time increases with increasing degree in all cases. However, the scaling factor for the GPU is much better than for the CPU in all implementations. This is because the number of edge scans grows with increasing degree and this work is distributed over threads running in parallel, resulting in a smaller increase in time for the GPU as compared to the CPU and thus better scalability.

Results on the ninth DIMACS challenge [1] dataset are summarized in Table 2. The GPU performs worse than the CPU in most implementations for these inputs. This behavior can be explained by the linearity of these graphs: parallel expansion is minimal, as their average degree is 2–3. Minimum spanning tree, however, performs much faster than its CPU counterpart on these graphs owing to its insensitivity to linearity.


    Figure 18. Maxflow behavior on a 1M vertex, 12M edge graph with varying m and k, Algorithm 20.

    11. Conclusions

In this paper, we presented massively multithreaded algorithms on large graphs for the GPU using the CUDA model. Each operation is typically broken down, using a BSP-like model, into several kernels executing on the GPU that are synchronized by the CPU. Using vertex list compaction we reduce the number of threads to be executed on the device, and hence reduce the context switching of multiple blocks for iterative frontier-based algorithms. The divide-and-conquer approach scales well to massively multithreaded architectures for non-frontier-based algorithms like MST, where the problem can be divided into its simplest form at the lowest (vertex) level and conquered recursively further up the hierarchy. We present results on medium and large graphs, as well as on graphs that are random, scale-free, and inspired by real-life examples. Most of our implementations can process graphs with millions of vertices in 1–2 seconds on a commodity GPU. This makes GPUs attractive co-processors to the CPU for several scientific and engineering tasks that are modeled as graph algorithms. In addition to the performance, we believe the massively multithreaded approach we present will be applicable to the multicore and manycore architectures that are in the pipeline from different manufacturers.

Acknowledgments:
We would like to thank the Nvidia Corporation for their generous support, especially for providing the hardware used in this work. We would also like to acknowledge Georgia Tech for their graph generating software. We would like to thank Yokesh Kumar for his help and discussions on various algorithms given as part of this work.

    REFERENCES

1. The Ninth DIMACS Implementation Challenge on Shortest Paths. http://www.dis.uniroma1.it/~challenge9/.

2. F. Alizadeh and A. V. Goldberg. Experiments with the push-relabel method for the maximum flow problem on a connection machine. In DIMACS Implementation Challenge Workshop: Network Flows and Matching, Tech. Rep., volume 92, pages 56–71, 1991.

3. R. J. Anderson and J. C. Setubal. On the parallel implementation of Goldberg's maximum flow algorithm. In SPAA '92: Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 168–177, 1992.

4. D. A. Bader and G. Cong. A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). J. Parallel Distrib. Comput., 65(9):994–1006, 2005.


Table 1
Scalability with varying degree on a 100K-vertex graph (4K for APSP), weights in range 1–100. Times in milliseconds, GPU/CPU.

            BFS GPU/CPU                          STCON GPU/CPU                           SSSP GPU/CPU
Degree   Random      R-MAT       SSCA#2      Random     R-MAT       SSCA#2      Random        R-MAT       SSCA#2
100      15/420      91/280      7/160       0.8/1.1    1.47/14.3   1.04/5.8    169/260       305/190     120/170
200      48/800      122/460     13/290      1.5/1.1    2.36/18.9   1.11/8.7    375/380       400/250     237/220
400      125/1520    163/770     24/510      2.8/1.2    3.93/27.9   1.53/8.9    898/710       504/360     474/320
600      177/2300    182/1050    38/730      4.1/1.3    5.26/41.9   2.81/9.1    1449/-        587/430     683/410
800      253/3060    210/1280    67/980      5.5/1.5    6.85/55.4   2.589/9.8   1909/-        691/540     1042/520
1000     364/-       -/-         -/-         6.8/-      -/-         -/-         2157/-        -/-         -/-

            MST GPU/CPU                          Max Flow GPU/CPU                        APSP Matrix GPU/CPU
Degree   Random      R-MAT       SSCA#2      Random        R-MAT        SSCA#2      Random        R-MAT        SSCA#2
100      302/12150   461/10290   122/7470    808/6751      3637/4950    345/3750    6111/30880    5450/19470   3400/16390
200      369/25960   638/22180   218/16700   2976/13430    6308/9230    615/7670    8100/53480    5875/27370   5253/25860
400      1149/-      849/-       347/-       10842/32900   8502/16570   1267/17360  9034/92700    6202/38070   7078/41580
600      1908/-      1103/-      499/-       14722/-       11238/-      6018/-      9123/102383   6317/41273   7483/48729
800      2484/-      1178/-      883/-       22489/-       14598/-      8033/-      9231/126830   6391/45950   7888/57460
1000     3338/-      -/-         -/-         32748/-       -/-          -/-         9613/167630   6608/54540   8309/68580

Table 2
Results on the ninth DIMACS challenge [1] graphs (distances as weights), weights in range 1–300K. Times in milliseconds, GPU/CPU.

Graph                    Vertices    Edges       BFS           STCON        SSSP            MST          Max Flow
New York                 264346      733846      147/20        1.25/8.8     448/190         76/780       657/420
San Francisco Bay        321270      800172      199/20        2.2/11.3     623/230         85/870       1941/740
Colorado                 435666      1057066     414/30        2.36/15.9    1738/340        116/1280     3021/2770
Florida                  1070376     2712798     1241/80       5.02/37.7    4805/810        261/3840     6415/2810
Northwest USA            1207945     2840208     1588/100      7.8/48.3     8071/1030       299/4290     11018/3720
Northeast USA            1524453     3897636     2077/140      8.8/66.5     8563/1560       383/6050     18722/4100
California and Nevada    1890815     4657742     2762/180      9.4/100      11664/1770      435/7750     19327/4270
Great Lakes              2758119     6885658     5704/240      19.8/114.7   32905/2730      671/12300    21915/6360
Eastern USA              3598623     8778114     7666/400      24.4/183.8   41315/4140      1222/16280   70147/16920
Western USA              6262104     15248146    14065/800     58/379.8     82247/8500      1178/32050   184477/25360
Central USA              14081816    34292496    37936/3580    200/1691     215087/34560    3768/-       238151/-
Full USA                 23947347    58333344    102302/-      860/-        672542/-        8348/-       -/-

Results taken on Tesla. Max Flow results at m = 3 and k = 7.

5. D. A. Bader and K. Madduri. Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors. In HiPC, volume 3769 of Lecture Notes in Computer Science, pages 465–476, 2005.

6. D. A. Bader and K. Madduri. Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2. In ICPP, pages 523–530, 2006.

7. D. A. Bader and K. Madduri. GTgraph: A Synthetic Graph Generator Suite. Technical report, 2006.

8. D. A. Bader and K. Madduri. Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks. In ICPP '06: Proceedings of the 2006 International Conference on Parallel Processing, pages 539–550, 2006.

9. D. A. Bader and V. Sachdeva. A Cache-Aware Parallel Implementation of the Push-Relabel Network Flow Algorithm and Experimental Evaluation of the Gap Relabeling Heuristic. In ISCA PDCS, pages 41–48, 2005.

10. O. Boruvka. O Jistém Problému Minimálním (About a Certain Minimal Problem) (in Czech, German summary). Práce Mor. Přírodověd. Spol. v Brně III, 3, 1926.

11. A. Buluc, J. R. Gilbert, and C. Budak. Gaussian Elimination Based Algorithms on the GPU. Technical report, November 2008.


12. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In SIAM International Conference on Data Mining, 2004.

13. B. Chazelle. A minimum spanning tree algorithm with inverse-Ackermann type complexity. J. ACM, 47(6):1028–1047, 2000.

14. J. D. Cho, S. Raje, and M. Sarrafzadeh. Fast Approximation Algorithms on Maxcut, k-Coloring, and k-Color Ordering for VLSI Applications. IEEE Transactions on Computers, 47(11):1253–1266, 1998.

15. K. W. Chong, Y. Han, and T. W. Lam. Concurrent threads and optimal parallel minimum spanning trees algorithm. J. ACM, 48(2):297–323, 2001.

16. S. Chung and A. Condon. Parallel Implementation of Boruvka's Minimum Spanning Tree Algorithm. In IPPS '96: Proceedings of the 10th International Parallel Processing Symposium, pages 302–308, 1996.

17. A. Crauser, K. Mehlhorn, and U. Meyer. A parallelization of Dijkstra's shortest path algorithm. In Proc. 23rd MFCS '98, Lecture Notes in Computer Science, pages 722–731. Springer, 1998.

18. A. Crauser, K. Mehlhorn, and U. Meyer. A parallelization of Dijkstra's shortest path algorithm. In Proc. 23rd MFCS '98, Lecture Notes in Computer Science, pages 722–731, 1998.

19. J. R. Crobak, J. W. Berry, K. Madduri, and D. A. Bader. Advanced shortest paths algorithms on a massively-multithreaded architecture. In Parallel and Distributed Processing Symposium, International, page 497, 2007.

20. P. D'Alberto and A. Nicolau. R-Kleene: A high-performance divide-and-conquer algorithm for the all-pair shortest path for densely connected networks. Algorithmica, 47(2):203–213, 2007.

21. F. Dehne and S. Gotz. Practical Parallel Algorithms for Minimum Spanning Trees. In SRDS '98: Proceedings of the 17th IEEE Symposium on Reliable Distributed Systems, page 366, 1998.

22. E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.

23. J. Edmonds and R. M. Karp. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM, 19(2):248–264, 1972.

24. N. Edmonds, A. Breuer, D. Gregor, and A. Lumsdaine. Single-Source Shortest Paths with the Parallel Boost Graph Library. In The Ninth DIMACS Implementation Challenge: The Shortest Path Problem, November 2006.

25. L. R. Ford and D. R. Fulkerson. Maximal Flow through a Network. Canadian Journal of Mathematics, 8(3):399–404, 1956.

26. A. V. Goldberg and S. Rao. Beyond the flow decomposition barrier. J. ACM, 45(5):783–797, 1998.

27. A. V. Goldberg and R. E. Tarjan. A new approach to the maximum flow problem. In STOC '86: Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, pages 136–146, 1986.

28. N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A memory model for scientific algorithms on graphics processors. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 89, 2006.

29. Y. Han, V. Y. Pan, and J. H. Reif. Efficient Parallel Algorithms for Computing all Pair Shortest Paths in Directed Graphs. In ACM Symposium on Parallel Algorithms and Architectures, pages 353–362, 1992.

30. P. Harish and P. J. Narayanan. Accelerating Large Graph Algorithms on the GPU Using CUDA. In HiPC, volume 4873 of Lecture Notes in Computer Science, pages 197–208, 2007.

31. M. Hussein, A. Varshney, and L. Davis. On Implementing Graph Cuts on CUDA. In First Workshop on General Purpose Processing on Graphics Processing Units, 2007.

32. J. Hyvonen, J. Saramaki, and K. Kaski. Efficient data structures for sparse network representation. Int. J. Comput. Math., 85(8):1219–1233, 2008.


33. G. J. Katz and J. T. Kider, Jr. All-pairs shortest-paths for large graphs on the GPU. In GH '08: Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 47–55, 2008.

34. V. King, C. K. Poon, V. Ramachandran, and S. Sinha. An optimal EREW PRAM algorithm for minimum spanning tree verification. Inf. Process. Lett., 62(3):153–159, 1997.

35. A. Lefohn, J. M. Kniss, R. Strzodka, S. Sengupta, and J. D. Owens. Glift: Generic, efficient, random-access GPU data structures. ACM Transactions on Graphics, 25(1):60–99, Jan. 2006.

36. T. Lengauer and R. E. Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Trans. Program. Lang. Syst., 1(1):121–141, 1979.

37. P. Micikevicius. General Parallel Computation on Commodity Graphics Hardware: Case Study with the All-Pairs Shortest Paths Problem. In International Conference on Parallel and Distributed Processing Techniques and Applications, pages 1359–1365, 2004.

38. A. Munshi. OpenCL: Parallel computing on the GPU and CPU. In SIGGRAPH, Tutorial, 2008.

39. P. J. Narayanan. Single Source Shortest Path Problem on Processor Arrays. In Proceedings of the Fourth IEEE Symposium on the Frontiers of Massively Parallel Computing, pages 553–556, 1992.

40. P. J. Narayanan. Processor Autonomy on SIMD Architectures. In International Conference on Supercomputing, pages 127–136, 1993.

41. A. S. Nepomniaschaya and M. A. Dvoskina. A simple implementation of Dijkstra's shortest path algorithm on associative parallel processors. Fundam. Inf., 43(1-4):227–243, 2000.

42. NVIDIA. NVIDIA CUDA Programming Guide 2.0. http://www.nvidia.com/object/cuda develop.html/.

43. M. Pal and G. P. Bhattacharjee. An optimal parallel algorithm for all-pairs shortest paths on unweighted interval graphs. Nordic J. of Computing, 4(4):342–356, 1997.

44. O. Reingold. Undirected ST-connectivity in log-space. In STOC '05: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 376–385, 2005.

45. D. P. Scarpazza, O. Villa, and F. Petrini. Efficient Breadth-First Search on the Cell/BE Processor. IEEE Trans. Parallel Distrib. Syst., 19(10):1381–1395, 2008.

46. S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In GH '07: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 97–106, 2007.

47. T. Takaoka. An Efficient Parallel Algorithm for the All Pairs Shortest Path Problem. In WG '88: Proceedings of the 14th International Workshop on Graph-Theoretic Concepts in Computer Science, pages 276–287, 1989.

48. V. Trifonov. An O(log n log log n) space algorithm for undirected st-connectivity. In STOC '05: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 626–633, 2005.

49. G. Venkataraman, S. Sahni, and S. Mukhopadhyaya. A Blocked All-Pairs Shortest-Paths Algorithm. Journal of Experimental Algorithmics, 8, 2003.

50. V. Vineet and P. Narayanan. CUDA cuts: Fast graph cuts on the GPU. In CVGPU08, pages 1–8, 2008.

51. V. Volkov and J. Demmel. LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs. Technical report, EECS Department, University of California, Berkeley, May 2008.

52. A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 25, 2005.

53. Y. Zhang and E. Hansen. Parallel Breadth-First Heuristic Search on a Shared-Memory Architecture. In AAAI-06 Workshop on Heuristic Search, Memory-Based Heuristics and Their Applications, 2006.


    APPENDIX

Table 3
Summary of results for synthetic graphs. Times in milliseconds.

                        Number of vertices, average degree 12, weights varying from 1 to 100
Algo    Graph Type      1M      2M      3M      4M      5M      6M      7M      8M      9M      10M
BFS     Random GPU      38      82      132     184     251     338     416     541     635     678
        Random CPU      530     1230    2000    2710    3480    4290    5040    5800    -       -
        R-MAT GPU       244     433     778     944     1429    1526    1969    2194    2339    3349
        R-MAT CPU       340     760     1230    1680    2270    2760    3220    3620    -       -
        SSCA#2 GPU      30      62      95      142     178     233     294     360     433     564
        SSCA#2 CPU      420     930     1460    2010    2550    3150    3710    4310    -       -
STCON   Random GPU      1.42    3.06    4.28    5.34    6.62    7.37    9.96    10.8    11.15   -
        Random CPU      68      164     286     310     416     536     692     -       -       -
        R-MAT GPU       19.2    32.37   172.1   347.4   408.3   579.1   626     1029    -       -
        R-MAT CPU       160     358     501     638     926     1055    1288    -       -       -
        SSCA#2 GPU      1.96    3.76    5.33    5.44    7.23    8.08    9.1     12.33   -       -
        SSCA#2 CPU      78      176     286     422     552     595     665     -       -       -
SSSP    Random GPU      116     247     393     547     698     920     947     1140    1247    1535
        Random CPU      2330    5430    10420   18130   -       -       -       -       -       -
        R-MAT GPU       576     1025    1584    1842    2561    3575    11334   -       -       -
        R-MAT CPU       1950    4200    6700    11680   -       -       -       -       -       -
        SSCA#2 GPU      145     295     488     632     701     980     1187    1282    1583    2198
        SSCA#2 CPU      2110    4490    6970    9550    -       -       -       -       -       -
MST     Random GPU      770     1526    2452    3498    4654    6424    8670    11125   -       -
        Random CPU      12160   26040   -       -       -       -       -       -       -       -
        R-MAT GPU       2076    4391    5995    9102    10875   12852   15619   21278   -       -
        R-MAT CPU       10230   22340   -       -       -       -       -       -       -       -
        SSCA#2 GPU      551     1174    1772    2970    4173    4879    7806    9993    -       -
        SSCA#2 CPU      7540    15980   25230   -       -       -       -       -       -       -
MF      Random GPU      598     3013    5083    7179    7323    16871   30201   34253   -       -
        Random CPU      15390   33290   -       -       -       -       -       -       -       -
        R-MAT GPU       30743   55514   74767   148627  232789  311267  -       -       -       -
        R-MAT CPU       8560    18770   -       -       -       -       -       -       -       -
        SSCA#2 GPU      459     2548    2943    7388    8606    12742   -       -       -       -
        SSCA#2 CPU      9760    20960   -       -       -       -       -       -       -       -

Table 4
Summary of results for synthetic graphs, APSP approaches. Times in milliseconds.

                                    Number of vertices, average degree 12, weights in range 1–100
APSP approach         Graph Type    256    512    1024    2048    4096     9216     10240    11264    18K       25K       30K
Using SSSP            Random        499    1277   3239    7851    18420    56713    65375    77265    160608    316078    556313
(GTX 280)             R-MAT         489    1531   4145    12442   38812    143991   170121   211277   465037    1028275   1362119
                      SSCA#2        469    1300   3893    7677    17450    50498    58980    67794    163081    353166    461901
Matrix                Random        2.77   11.3   55.7    330.4   2240.8   41150    58889    72881    244264    1724970   3072443
(GTX 280)             R-MAT         2.54   10.9   66.2    478     3756     32263    56906    71035    339188    1152989   4032675
                      SSCA#2        2.55   8.6    42.9    263.6   2063.7   43045    62517    76399    220868    1360469   1872394
                      Fully Conn.   2.9    15.2   112     959     8363     110658   151820   202118   799035    4467455   -
Matrix                Random        3.25   21.3   136.3   827.9   5548     65552    87043    113993   1048598   -         -
(FX 5600)             R-MAT         2.99   22     174.8   1307.6  10534    94373    115294   137854   1487025   -         -
                      SSCA#2        2.91   15.78  103.2   635.3   4751     62555    82708    109744   1001212   -         -
                      Fully Conn.   2.9    25.1   221.5   1941    16904    268757   368397   490157   4300447   -         -
GE Based Lazy Min     Random        1.8    4.4    12.1    44.9    230      1505     -        -        -         -         -
(GTX 280)             R-MAT         1.8    4.5    12.1    45      230      1505     -        -        -         -         -
                      SSCA#2        1.8    4.5    12.1    44      230      1497     -        -        -         -         -
GE Based Buluc [11]   Random        2.18   6      19.64   96.5    639      3959     -        -        -         -         -
(GTX 280)             R-MAT         1.86   5.9    19.1    96      638      3959     -        -        -         -         -
                      SSCA#2        2.18   5.9    19.67   96      638      3965     -        -        -         -         -
Katz [33]             -             7.7    34.9   230.1   1735.6  13720    158690   216400   1015700  -         -         -

CPU implementation is ours.