Study of Bandwidth Partitioning for Co-executing GPU Kernelsuu.diva-portal.org/smash/get/diva2:1154188/FULLTEXT01.pdf · Graphics Processing Units (GPU) provide substantial raw computing

IT 17 032

Examensarbete 15 hpJuli 2017

Study of Bandwidth Partitioning for Co-executing GPU Kernels

Erik Melander

Institutionen för informationsteknologiDepartment of Information Technology

Teknisk- naturvetenskaplig fakultet UTH-enheten Besöksadress: Ångströmlaboratoriet Lägerhyddsvägen 1 Hus 4, Plan 0 Postadress: Box 536 751 21 Uppsala Telefon: 018 – 471 30 03 Telefax: 018 – 471 30 00 Hemsida: http://www.teknat.uu.se/student

Abstract

Study of Bandwidth Partitioning for Co-executingGPU Kernels

Erik Melander

Co-executing GPU kernels on a partitioned GPU has been shown to improveutilization efficiency of poorly scaling tasks. While kernels can be executed in parallel,data transfers to the GPU are serial which can negatively impact parallelism andpredictability of the kernels.In this work we implement a fairness-based approach to memory transfers bychunking data sets and transferring them interleaved and evaluate the overhead of thisapproach. Then we develop a model to predict when kernels will start using thisimplementation. We found that chunked transfers in a single CUDA stream have onlya small overhead compared to serial transfers, while event synchronized transfers inseveral streams have larger overhead particularly for chunk sizes less than 500 KB.The prediction models accurately estimate kernel starting times and return transfertimes with less than 2.7% relative error.

Tryckt av: Reprocentralen ITCIT 17 032Examinator: Olle GällmoÄmnesgranskare: Andra-Ecaterina HugoHandledare: Johan Janzén

Acknowledgements

I would like to thank my supervisor Johan Janzén and reviewer Andra-Ecaterina Hugo for thepatient help and guidance they have provided on every step of this thesis project.

Contents

1 Introduction 1

2 Background 12.1 High Performance Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2 Co-executing Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.4 Memory transfers with CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3 Implementation 33.1 MAGMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.2 Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3.2.1 Sequential memory transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . 43.2.2 Synchronization of memory transfers in several streams . . . . . . . . . . . 43.2.3 Synchronization of memory transfers in one stream . . . . . . . . . . . . . . 7

3.3 Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

4 Evaluation 84.1 Experimental platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84.2 Performance impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84.3 Predictability evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

5 Results 95.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105.2 Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

6 Discussion 31

7 Related works 31

8 Conclusions 32

9 Future work 32

1 Introduction

Graphics Processing Units (GPU) provide substantial raw computing power making them of in-terest for High Performance Computing (HPC) applications. In order to take full advantage ofGPUs, applications must scale well with the large number of threads of the GPU. However, manycomputational kernels in HPC applications lack this degree of scalability, resulting in lesser per-formance than hoped for. Previous work has shown a significant improvement by co-schedulingpoorly scaling tasks on partitions of the GPU to increase the kernel efficiency [1]. However, beforea kernel can start, the dataset on which it should perform computations must be transferred tothe local memory of the device. Datasets for HPC applications are regularly huge in size andavailable bandwidth for transferring data from the CPU to the GPU and back is limited. Differentstrategies in partitioning the bandwidth have different trade offs. A strategy may improve thestarting execution of kernels but also increase the overall time a computation of many tasks takeby blocking critical tasks access to bandwidth. We implement a strategy that provides fair accessto the bandwidth for each task and predictable transfer times.

2 Background

There is currently a common trend to use GPUs in High Performance Computing, and there isa lot of effort made to make optimizations for the applications running on these systems. Thissection presents a technique that uses co-executing kernels to improve efficiency and shows howmemory transfers to the GPU are a bottleneck for this technique.

2.1 High Performance ComputingGPUs have gone from primarily being used in graphics processing to be able to function as moregeneral computational engines. Several versions exist whose only purpose is to function as GeneralPurpose GPUs, and several of the world’s fastest super computers have GPUs as components. InNovember 2016, the Swiss National Supercomputing Center’s Piaz Daint supercomputer was ableto retain its number 8 place on the list of the worlds most powerful computers by upgrading itscomputational power with 3.5 petaflops by installing NVIDIA P100 Tesla GPUs. NVIDIA’s ownDGX SATURNV system, also using P100 Teslas, took the number one spot for most power efficientsupercomputer with 9.46 gigaflops/watt.[2] Supercomputing is not the only use for GPUs, latencycritical real time systems, such as autonomous driving, are also examples of the kind of data-parallel and compute-intensive applications that can benefit from GPUs.[3] GPU programmingmodels, such as CUDA, short for Compute Unified Device Architecture, used by NVIDIA and theOpen Computing Language (OpenCL) provide frameworks for writing programs that execute onGPUs in the case of CUDA or heterogeneous platforms consisting of various devices in the case ofOpenCL.

MAGMA is a collection of dense linear algebra libraries for heterogeneous architectures used inHPC applications as BLAS (Basic Linear Algebra Subroutines). These applications are generallydescribed as a graph of multiple tasks that are scheduled on multiple CPUs and GPUs for loadbalancing and performance improvement.[4]

2.2 Co-executing KernelsPrograms written in CUDA require parallelism to be explicitly programmed and perform theresource allocation at runtime, with the aim of providing the optimum performance that thehardware allows. In reality this is not always the case. Pai et al.[5] found that CUDA programsperforming programs from the Parboil2 benchmark suite utilized only between 20-70% of resourceson average. Janzén et al.[1] found that for certain benchmarks in the MAGMA suite, kernelefficiency was only 58% when using the the full GPU, but higher when using less of the GPUs.Janzén et al.[1] listed two cases where providing the GPU runtime with as large tasks as possibledid not lead to better performance:

• Limited GPU scalability: Tasks that did not scale well across the whole GPU, such as theMAGMA DTRSM (solving triangular matrix with multiple right hand sides) benchmarkwhich had a 58% kernel efficiency on 13 SMs, but 87% when running on only 5 SMs.

1

• Limited task-based parallelism: When an application is made up of a set of tasks and ex-ecuting only one task at a time can block the critical path through the applications’ taskgraph.

A way to address both of these cases is by making computations on the GPU execute concurrentlyand to be able to schedule the execution of tasks in an optimal order. However, simply executingconcurrently often does not provide the benefits expected compared to serial execution due to alack of control over the resources allocated to each kernel.[5]

CUDA Streams allow CUDA operations to execute asynchronously both by not blocking theHost and by executing operations in different streams concurrently on the Device. The operationscan include data transfers, kernel launches etc. To perform co-executing kernels, several streamsmust be used.[6] NVIDIA does not publish the policies used by the runtime to schedule operations,but Pai et al.[5] identified that CUDA uses a co-scheduling policy called "left-over" scheduling.This policy will only co-execute kernels if they do not fill the GPU. In practice this means thatkernels will often not co-execute as they are usually launched with more blocks than the hardwarecan handle at once, except at the end of a kernel when the remaining blocks may not fill the GPU.[5]Several methods to overcome this limitation by partitioning the GPU has been developed[5][7][8],but the one we will continue to build on is the method used by Janzén et al.[1] in which theydevised a software partitioning of GPUs. GPU architecture is built around the concept of StreamingMultiprocessors (SM). Each SM in a GPU supports the concurrent execution of a large numberof threads and a GPU usually consists of several SMs. Blocks are are distributed among SMs forexecution and once a block is scheduled to an SM, it will execute only on that SM.[6] Janzén etal.[1] devised a technique to remap blocks to SMs which isolate kernels to particular sets of SMs,thereby using the "left-over" policy to effectively partitioning the GPU.

2.3 Problem descriptionWhether an application is used by supercomputers or by real time systems, co-executing kernelsare only a piece of the puzzle when determining the optimal scheduling order of tasks. Before anapplication can launch a kernel, the data on which the kernel will operate on must be transferedto the GPU (Device) memory from the CPU (Host) memory. When executing the DGEMMbenchmark1 for 4096 by 4096 matrices running on a NVIDIA K20 GPU, data transfers to andfrom the GPU take 26% of the total time it takes to compute and return the results. To accuratelyschedule multiple tasks on a partitioned GPU a scheduler need to be able to predict the time ahost-to-device transfer takes and when the kernel will start to execute. It also need to predict thetime the device-to-host transfer takes to find the total time the entire operation will take. If datatransfers are not explicitly managed the CUDA runtime will schedule transfers on its own leadingto hard to predict results. And while the GPU can co-execute kernels, the same can not be said fordata transfers. A transfer will use the entire bandwidth in a direction, this means that sequentialdata transfers will negatively impact the parallelism of the kernels by reducing their overlap.

We provide a solution to both of the issue of predictability and of fairly partitioning the availablebandwidth by implementing and analyzing a software transfer method.

We make the following contributions:

• A fairness based data transfer implementation that ensures equal access to the serial bus andstarvation-free execution of the kernels.

• An analysis of the implementation’s performance compared to the performance of unparti-tioned kernels and to co-executing kernels using sequential data transfers.

• A model for predicting how long transfers to and from the GPU will take with the imple-mentation.

• Analysis of how well the model predicts a number of realistic scenarios.

2.4 Memory transfers with CUDAMemory management in CUDA is similar to C programming in that the programmer must explicitlyallocate memory on both the host (CPU) and the Device (GPU). In addition, the programmer

1MAGMA library implementation of Matrix multiplication

2

must explicitly handle the transfer of data to and from the device. The CPU and GPU memoryis connected via a bus, a PCI-e 3.0 bus allows 16 Gb/s serial data transfer with full duplex. Thiscan be compared to the theoretical peak bandwidth between the GPU and its on onboard GDDR5memory which is 144 GB/sec for a Fermi C2050 GPU.[6]. While the PCI-e architecture only allowsa single memory transfer to be ongoing in one direction at a time, the C2050 architecture has twocopy engines which can utilize the full duplex of PCI-e to transfer memory to and from the GPUin parallel[9].

Figure 1: C2050 dual copy engines. Adapted from Nvidia website[9].

This means that access to the bus in one direction is a resource that only can be utilized byone process at a time.

Memory transfers launched synchronous with cudaMemcpy is placed in the default stream andblock the CPU while each transfer is performed. As a result the transfers will happen in the orderthey are issued. CUDA support asynchronous transfers using the cudaMemcpyAsync function callwhich will not block the CPU. Calls with cudaMemcpyAsync will be placed in a stream specifiedas a parameter to the function. To be able to use asynchronous transfers the memory allocatedmust be pinned to prevent it from being moved by the host virtual memory to different physicallocations.[6]

3 Implementation

This section describes the MAGMA kernels and the definition and implementation of a fairness-based approach to memory transfers to co-executing MAGMA DGEMM kernels.

3.1 MAGMAWe use the Open Source MAGMA (Matrix Algebra on GPU and Multicore Architectures) libraryfor benchmarking and testing our implementation. Janzén et al.[1] adapted four tasks from theMAGMA collection to run on partitioned GPUs, DSYRK, DTRSM, DGEMM and DPOTRF.Together they make up the tasks necessary to perform Cholesky factorization. MAGMA improveperformance by scheduling different task on different processing units of the heterogeneous system.Tasks and their algorithms can be represented as Directed Acyclic Graphs. Tasks in the graph areeither executed on the GPU or the CPU in a multicore and GPU architecture.[4] We only use theDGEMM kernel for our evaluation as it has a straight forward implementation and well knownbehavior. DGEMM means Double precision General Matrix Multiplication and is part of the BasicLinear Algebra Subroutines (BLAS). It does not use any heterogeneous execution, running only onthe GPU.[10] It is a kernel that does not suffer badly from scalability issues (Janzén et al. foundit to have a kernel efficiency of 92% on 13 SMs), meaning it does not gain a lot by running on apartitioned GPU. In our case this is more of a benefit than a downside as we are not interested inthe kernels performance, but only in the performance of our data transfer method.

3

3.2 FairnessThe objective of this project is to implement a strategy that ensures an efficient use of the PCI-busbandwidth. A way of achieving this is by implementing a method that ensures that tasks haveequal access to the PCI-bus bandwidth, regardless of the size of the task and its memory. We referto this approach as "fairness-based" memory transfers.

In CPU scheduling, round-robin is one way of achieving equal access to a resource. Round-robinis implemented by defining a small unit of time, called a time quantum. Tasks are placed in aqueue and the CPU scheduler goes through each task in the queue and lets it execute for one timequantum. Once the end of the queue is reached, the scheduler begins at the beginning of the queueagain.[11] Some aspects of CPU scheduling are not applicable to our situation. A direct memoryaccess (DMA) transfer like the ones we are doing over the PCI bus is non-preemptive. Once atransfer starts, it cannot be interrupted. Since bandwidth is the only resource under considerationwe can get around this problem by equating the time quantum concept of CPU scheduling withchunks of the data we want to transfer. Each set of data is chunked into pieces that will takesome time quanta to transfer. We then interleave transfers of chunks from different tasks in around-robin fashion.

3.2.1 Sequential memory transfer

The simplest way of transferring memory to a GPU is to use sequential transfer. If we do that forco-executing kernels we get the results in figure 2.

Figure 2: Sequential data transfers to co-executing kernels.

The host-to-device transfer blocks the CPU from starting the kernels resulting in both kernelsstarting their execution at the same time. The first kernel could have started half way throughthe memory transfer as that is when all of its data was available on the GPU. The second kernelon the other hand has to wait until the first kernels device-to-host transfer has completed beforeit can make its transfer back to the host.

Sequential memory transfer has often been used in previous work to eliminate memory transferas a parameter when evaluating kernels.[1][12] Asynchronous memory transfers negatively impactthe parallelism of the co-executing kernels as there will be less overlap in the execution. Sequentialtransfer also make kernel launches unpredictable unless explicitly synchronized which is a significantissue for a scheduler. In our work the sequential transfer is used to represent the worst case scenariofor each co-executing kernel as unless streams are synchronized it is not guaranteed which kernelwill execute last.

3.2.2 Synchronization of memory transfers in several streams

Our fairness-based implementation is built on the idea that the bus is equally utilized by all activetasks for any sufficiently large time period. Sufficiently large depending on the time quanta andnumber of tasks. An example of this principle can be seen in figure 3.

Figure 3: Fair chunked memory transfer.

Memory transfers placed in a stream will preserve the order they were issued when executingon the copy engine. However, the CUDA runtime will not preserve the order the CPU issued thecalls between different streams. The result is that the order of a cudaMemcpyAsync call placed instream 1 before acudaMemcpyAsync call placed in stream 2 is not guaranteed to be in the orderthe calls were issued by the CPU. This leads to unpredictable results when chunking the matriceswe are transferring. As seen in Figure 4, the cuda runtime sometimes transfers two chunks in the

4

same stream directly after each other and sometimes not resulting in unpredictable timing whenthe kernels launch.

Figure 4: Memory transfers in four streams without synchronization.

Cuda have different methods to allow synchronization between streams and between CPU andGPU. Events can be recorded using cudaEventRecord at points in a stream. The execution indifferent streams can be synchronized by waiting until the relevant event has been recorded withcudaStreamWaitEvent.

event stream eventsfor s in stream:

if s = first streamcudaStreamWaitEvent(s, event[last stream])

elsecudaStreamWaitEvent(s,event[s�1])

cudaMemcpyAsync()cudaEventRecord(event[s],s)

end

This allows fine grained control over the order that the chunks are transferred for all streams,see Figure 5

Figure 5: Memory transfers in four streams with synchronization.

Unfortunately, this reveals a significant problem, namely that the kernels still do not executein the order expected with the kernel in stream 16 first. The reason is in the design of the functionthat launches the kernel. To be able to partition the GPU, the function that will launch the kernelhas to send a set of parameters to the GPU before it launches. This 60 bytes transfer is alsotransferred by a call to cudaMemcpyAsync and need to use the PCI-e bus. The CUDA runtime willschedule it as it sees fit, and in Figure 5 it has decided that the transfer for the kernel in stream16 should be the last transfer.

Figure 6: Tiny 60 bytes parameter transfers.

Even though the CPU is not used for computations with the co-executing version of DGEMM,our implementation of memory transfers and kernel launches is generalized to be able to accountfor kernels that distribute their tasks between both the Host and Device. To do this, each kernelis launched from its own CPU thread as done in previous work by Janzén et al. (Figure 7).

5

Figure 7: Memory transfers operations are called from one CPU thread, kernels launched fromtheir own.

There is no way using only CUDA to synchronize the launch of the kernels if they are to belaunched in separate CPU threads. It is instead necessary to manage the memory transfers in aCPU thread that is running concurrently with the threads that launch the kernels and synchronizethe launches using pthread functions. For this purpose, a thread safe FIFO queue data structureis needed. Every kernel will have an associated queue where its’ chunks are placed. A managingfunction running in a separate thread will take chunks from the queues and transfer them. Oncea queue is empty, the function wakes the kernel thread and goes to sleep. Once the kernel threadhas issued the commands to transfer the parameters and to launch the kernel, it wakes the queuemanagement function to allow it to continue scheduling transfers.

6

Figure 8: Solution using pthreads to synchronize kernel launches.

3.2.3 Synchronization of memory transfers in one stream

Instead of using the same streams that the kernels execute in, all chunks can be scheduled using asingle stream. Since every transfer in a stream will execute in the order it was issued, this eliminatesthe need to use events to enforce the order that the chunks are transferred. There is still need forsynchronization though, since the kernels will execute in different streams than the one that is usedto transfer the chunks it must be delayed until the transfer of all chunks belonging to that kernelhave completed. This can be achieved by recording an event after a kernel’s last chunk’s order hasbeen issued. The thread launching the kernel must include a cudaStreamWaitEvent call beforethe kernel launch. The same must be done for Device to Host transfers. An event recorded afterthe kernel launch command, and a cudaStreamWaitEvent or cudaEventQuery placed before thetransfers. The issue with the parameter transfers and kernel launches are solved using the samemethod as for several streams, thread safe queues.

Figure 9: Chunked transfers in a single stream.

3.3 PredictabilityTo be able to schedule tasks on different partitions of the GPU correctly, the scheduler needs to beable to predict how long the memory transfer methods will take to transfer the data to each kerneland when the kernel starts under different circumstances. Kernel start time depend on severalfactors, the total size of the memory that is transferred, the choice of chunk size, and if a kernelhas full or share access to the bus for parts or the entire transfer. The model must be able topredict transfer times for several scenarios and variations of parameters.

Fujii et al.[3] found that host-to-device and device-to-host had very different performance char-acteristics for Direct Memory Access (DMA) transfers. Device-to-host transfers were faster which

7

they attributed to hardware capabilities not documented publicly. This requires us to considerhost-to-device and device-to-host transfers as two separate cases for our models.

In order to predict the start and end of a chunked transfer we use linear regression with respectto size of the transfered chunks and the time it takes to transfer to build the predictive models.Only a single linear regression is needed to model chunked transfer in single stream as it doesnot do any synchronization between chunks regardless of the number of partitions. We need tobuild a spline-based model for chunked event synchronized transfers.[6] There is a difference insynchronization overhead if a single partition is transferring memory, where no synchronizationwill take place, or if there are several partitions in which case chunks are synchronized betweenthe different streams.

4 Evaluation

We evaluate the implementations by partitioning the GPU into two equal parts, each consistingof 6 SMs. This results in the 13th SM of the GPU being left without any work to do. All kernelsrun the MAGMA DGEMM benchmark. For each kernel we transfer three matrices (A, B and C)to the device, A and B are multiplied and the result placed in C. C is returned to the host. Thetransfers use our fairness-based implementations:

• Chunked event synchronized transfer

• Chunked transfer in single stream

4.1 Experimental platformMeasurements were done on a 2-socket 8-core Inter(R) Xeon(R) E5-2680 CPU with a base frequencyof 2.7 GHz and 64 GB of RAM, with an NVIDIA Tesla K20 graphics card. The K20 has 13 SMs,each with 192 cores and 5 GB of GDDR5 memory. We use Intel MKL 11.3.0, MAGMA 1.7.0 andCUDA 7.5 on Linux 2.6. When collecting baselines, ANTT and Total time measurements, eachconfiguration we ran each configuration 5 times. When collecting scenario measurements, we raneach configuration 100 times.

4.2 Performance impactMatrices were chunk in 7 different sizes, ranging from 128 by 128 elements (1̃31 Kb) to 1024 by1024 elements (8̃Mb), and the size of the matrices ranged from 128 by 128 elements to 4700 by 4700elements. In addition to the two fairness-based methods, we also perform the transfers chunked intwo streams but without any enforced synchronization (Chunked unsynchronized transfer). Thisgives us a comparison to how the CUDA runtime performs. We use three metrics to compareperformance to sequential transfers.

Overhead

Overhead is measured by measuring the fairness-based methods and normalizing the results withregular CUDA memcpy of the same size.

Kernel starting time

Starting time measures the time from when the first transfer start to when each kernel starts.This was normalized with the same measurement run on a GPU with the same size and numberof partitions, but which used sequential data transfers.

Total time

The total time measures the time i ms from when the first Host to Device transfer starts to whenthe last Device to Host transfer finishes. Its purpose is to measure the total time for all kernelsto perform their calculations and return the results. Total time is a lower-is-better metric. It isnormalized in the same way as the Starting time metric.

8

Average Normalized Turnaround Time

Average Normalized Turnaround time (ANTT) quantifies the slowdown during multi-task task ex-ecution relative to single-task execution. The normalized turnaround time (NTT ) is defined as:

NTTi =T

MPi

T

SPi

(1)

Where T

MPi is the execution time when running multiple tasks, and T

SPi is the execution time

when running a single task. In our case, TMPi is the time from starting to transmit data Host to

Device until the results are returned for a co-executing kernel. T

SPi is the same when the GPU is

a single partition. We then calculate the average:

ANTT =1

n

nX

i=1

T

MPi

T

SPi

(2)

ANTT is a lower-is-better metric.[13]

4.3 Predictability evaluationWe use the data collected for the performance impact evaluation (4.2) to build the models forpredicting the data transfers, both host-to-device transfers and device-to-host. The models wereevaluated by comparing the calculated results to scenarios run 100 times each and calculating thepercentage difference between the result of the model and the arithmetic mean of the tests. P

sizei

is the total size of the memory transfer used by partition i, ti is the time when memory for thetransfer belonging to partition i can begin transmission.

We chose three scenarios that usually happen in a task based application:

• P

size1 = P

size2 , t1 < t2

• P

size1 < P

size2 , t1 < t2

• P

size1 > P

size2 , t1 < t2

Host-to-device measurements are done the same way for all three scenarios. We measure fromtime when the first chunk starts to transfer to when each kernel begin its computation. Device-to-host measurements are made slightly differently depending on the scenario. For the first scenariotwo measurements are made, the first from when kernel 1 finishes to when its data set has beentransferred. The second from when kernel 1 finishes to when the second kernels data set has beentransferred. The reason for the design of the second measurement is that due to the differencea chunk of equal size takes to transfer host-to-device versus device-to-host. Since the latter isfaster, Kernel 2 often finishes as a transfer is already ongoing from the result of the first dataset.This makes accurately measuring and modeling the transfer time of kernel 2’s data set difficultas there will be a gap from when the kernel finishes to when the first chunk can start transfer.Instead the second measurement is to measure and model the total time it takes both data setsto be transferred. For scenario two and three this is not an issue as in both cases, device-to-hosttransfers will begin immediately after the kernel finishes. This is due to the difference in the timewhen the kernels finish is so large that the smaller device-to-host transfer will finish before thelarger begins.

We also compare the time of the host-to-device transfers of the configurations to the timesequential transfers take for the same configuration to determine the difference in starting time ofthe kernels.

5 Results

The results of the evaluation of the fairness-based methods captures the overhead of the fairness-based methods and the precision of the prediction models. The results show that the chunkedtransfer in a single stream method has little to no overhead over letting the CUDA runtimetransfer the chunks unsynchronized while the Chunked event synchronized transfer method hasmore overhead, particularly for small chunks size and matrices. Both methods can be predictedby the models with less than 2.7% relative error.

9

5.1 PerformanceOverhead

Figure 10 show the overhead for host-to-device transfers normalized using standard CUDA memcpy.Both methods show significant overhead for smaller chunk sizes. The chunked event synchronizedmethod show overhead for larger sizes as well, whereas the overhead is very small for the chunkedtransfer in single stream method.

Figure 10: Overhead with regards to chunk size.

Kernel start time

The default case is when the total size of the memory transfer for the partitions and the time thetransmission can begin are both equal: P

size1 = P

size2 , t1 = t2.

Figure 11 show the result when host-to-device transfer is split into 24 chunks.

10

Figure 11: Kernel starting time when host-to-device transfer is split into 24 chunks, default case.

Kernel starting time for chunked event synchronized transfers show the significant overhead ofthe method for small matrices in particular, but both kernels start executing close in time. Chunkedtransfer without synchronization show the CUDA runtime essentially sequentially transferring thesmaller matrices, but varying which it transfers first. For larger matrices, CUDA interleaves thechunks but still varies which kernel start to execute first. Chunked transfer in single stream hasless overhead than the event synchronized mehtod and kernel start time is close in all cases.

Total time

The tests are run using two partitions, each assigned 6 SMs and the DGEMM kernel. Matrix sidesize, the measurement used on the x-axis, refers to the size of the matrices being multiplied. Thechunks are labels with the size they would have if they were matrices, i.e. a chunk labeled 128 hasthe same size as a 128 by 128 matrix.

11

Figure 12: Comparison of interleaved transfer methods.

The results show the high cost of synchronizing several streams using events when using smallchunk sizes. The first graph has considerable overhead over a sequential memcpy when chunkingin blocks of 128 ⇥ 128 element sized chunks. As the size of the matrices being multiplied increases,the overhead decreases as a result of the time the kernel computations takes increases faster thanthe increase in memory transfer time.

The second graph shows a significant improvement in performance when allowing the runtimeto decide the order over the synchronized version. The CUDA runtime seems to favor orderingsmall chunks so that they create larger aggregated transfers in each stream.

Figure 13: Unsynchronized transfers.

As the chunks become larger relative to the total size of the transfer, the runtime will interleave

12

transfers more evenly between the two streams.The third graph is quite similar to the second, but with less variation as can be seen in the

smoother curve as well as smaller error bars than both previous versions. The similar performancecan be attributed to not needing event synchronization to interleave chunks to the two kernels.And the smoother curve can be attributed to the more predictable time before the kernels willlaunch as a result of the enforced interleaving.

Average Normalized Turnaround Time

The ANTT metric was normalized using sequential memory transfers and a single 12 SM partition.

Figure 14: Comparison of interleaved transfer methods using ANTT metric.

5.2 PredictabilityWe use the performance data collected to calculate an average transfer time for different chunksizes. The host-to-device time was measured from when the transfer of the first chunk begins towhen the final chunk has been transferred and divided by the number of chunks. The data wasplotted based on chunk size. In addition to the three previous transfer methods described, thechunked event synchronized method was used for transfers to a single partition (single partitionsynced stream). This is needed to model the behaviors of our chunked event synchronized methodwhen one partition gets full access to the bus. Figure 15 show the linear regression model forhost-to-device transfers and figure 16 the model for device-to-host transfers. We then show threescenarios for tasks executing on a GPU with two partitions and how the model predicts them.

13

Method function Model R-squaredSynchronized streams fsync 1.673e�7 · x+ 1.520e�2 0.999Single stream fstream 1.616e�7 · x+ 5.268e�3 0.999Single partition synchronized fpart1 1.608e�7 · x+ 4.548e�3 0.999

Table 1: 1st order linear regression, device-to-host.

Figure 15: Average transfer time related to chunk size, host-to-device.

The result show a fairly direct relationship between chunk size and transfer time. First orderlinear regression of the models (excluding non synchronized transfer) shows a good fit.

The same was performed for device-to-host transfers. These turn out to run faster than host-to-device transfers, and there is little difference between the method with event synchronizedstreams and methods only using a single stream. Unsynchronized transfers are still faster, butshow significant variability.

14

Method function Model R-squaredSynchronized streams fsync 1.501e�7 · x+ 0.014e�2 0.993Single stream fstream 1.51e�7 · x+ 0.005e�3 0.997Single partition synchronized fpart1 1.497e�7 · x+ 0.004e�3 0.998

Table 2: 1st order linear regression, Device to Host

Figure 16: Average transfer time related to chunksize Device to Host.

Scenario 1

Two pair of equally sized matrices are to be multiplied using two equally sized partitions. Chunksbelonging to the second pair will not start transmitting until 5 chunks of the first pair has beentransmitted. The model was run 100 times. This simulates the behavior when one task is runningand a second task is added a short while later.

Figure 17: Host to Device transfer first scenario.

Two formulas are used:kernel_1_time = m ⇤ fpart1(c) + n ⇤ fsync(c)kernel_2_time = m ⇤ fpart1(c) + n ⇤ fsync(c) + k ⇤ fpart1(c)

where m is the number of chunks transmitted in sequence in in stream 1, n is the number ofinterleaved chunks, and k is the number of chunks transmitted in sequence in stream 2. The model

15

attempts to calculate the time from the point when transfers to the first kernel begin to when eachkernel starts.

The blue dot is the models prediction. The box plot are the measured values, the whiskersrepresenting 1.5⇥interquartile range and the small dots are outliers.

Figure 18: Scenario 1: Host to device transfers in chunked event synchronized streams vs model(blue dot).

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 7.083 7.236 0.153 2.11Kernel 2 size 819200 8.143 8.069 -0.073 -0.90Kernel 1 size 2097152 17.320 17.451 0.130 0.74Kernel 2 size 2097152 19.684 19.526 -0.159 -0.81Kernel 1 size 3075200 25.025 25.269 0.243 0.96Kernel 2 size 3075200 28.309 28.293 -0.015 -0.05Kernel 1 size 7372800 59.222 59.622 0.400 0.67Kernel 2 size 7372800 66.673 66.820 0.146 0.21

Table 3: Scenario 1: Host to Device in two event synchronized streams

16

The model’s precision increases with larger chunks. The model overestimates kernel 1 consis-tently, while no bias in either direction can be found on kernel 2. Relative predictive error is < 1%with the exception of Kernel 1 when using the smallest chunk size.

The same scenario was also evaluated for when all chunked transfers are done in a single stream.The model is then:

kernel_1_time = m ⇤ fstream(c) + n ⇤ fstream(c)kernel_2_time = m ⇤ fstream(c) + n ⇤ fstream(c) + k ⇤ fstream(c)

Figure 19: Scenario 1: Host to Device transfers in chunked single stream vs model (blue dot).

17

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 6.878 6.744 -0.133 -1.98Kernel 2 size 819200 7.423 7.433 0.009 0.13Kernel 1 size 2097152 16.645 16.863 0.217 1.29Kernel 2 size 2097152 18.414 18.584 0.169 0.91Kernel 1 size 3075200 24.456 24.607 0.150 0.61Kernel 2 size 3075200 27.155 27.118 -0.037 -0.13Kernel 1 size 7372800 57.878 58.634 0.755 1.28Kernel 2 size 7372800 64.746 64.617 -0.129 -0.19

Table 4: Scenario 1: Host to Device in one stream

Single stream transfer shows no tendency for the model consistently over or under estimating thetransfer times. The relative predictive error is more equal regardless of chunk size, but somewhatbetter for larger sizes and consistenly <2%. Transfer times are also 1-2 seconds faster than withthe chunked event synchronized method.

Figure 20: Scenario 1: Device to Host transfers in chunked event synchronized streams vs model(blue dot)

18

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 1.416 1.446 0.029 2.05Kernel 2 size 819200 2.296 2.345 0.049 2.09Kernel 1 size 2097152 3.519 3.553 0.033 0.94Kernel 2 size 2097152 5.727 5.791 0.064 1.10Kernel 1 size 3075200 5.137 5.165 0.028 0.55Kernel 2 size 3075200 8.385 8.429 0.044 0.52Kernel 1 size 7372800 12.198 12.251 0.052 0.42Kernel 2 size 7372800 19.919 20.020 0.101 0.50

Table 5: Scenario 1: Device to Host in two streams

The device-to-host transfer show similar characteristics as the host-to-device with relative pre-dictive error decreasing with chunk size. A small tendency for the model to overestimate the timeis present in all cases.

Figure 21: Scenario 1: Device to Host transfers in chunked single stream vs model (blue dot).

19


Table 6: Scenario 1: Host to Device in one stream

The results are again similar to the host-to-device results for the same model with some im-provement when chunk size is increasing but less than for the event synchronized model. Again atendency for the model to slightly overestimate the transfer time.

Figure 22: Time to start each kernel in scenario 1.

Chunked event synchronized transfers do worse in performance than chunked transfers in asingle stream which is clearly visible when looking at the estimated starting time in figure 22.The result show that predictability is fairly good regardless of transfer method with less than 2%relative predictive error for everything but the smallest chunk size tested.

20

Scenario 2

Scenario 2 consist of a five chunk delay on the second stream, and that the data set in the firststream is smaller than the set in the second. This simulates a large second task being added ashort while after the first task.

Figure 23: Host to Device transfer for the second scenario.

Figure 24: Scenario 2: Host to Device transfers in in chunked event synchronized streams vs model(blue dot)

21

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 7.080 7.236 0.156 2.15Kernel 2 size 819200 10.996 10.931 -0.066 -0.60Kernel 1 size 2097152 17.138 17.451 0.313 1.79Kernel 2 size 2097152 26.657 26.702 0.045 0.17Kernel 1 size 2097152 25.044 25.269 0.225 0.89Kernel 2 size 2097152 38.860 38.772 -0.088 -0.23Kernel 1 size 2097152 59.256 59.622 0.367 0.61Kernel 2 size 2097152 91.862 91.809 -0.052 -0.06

Table 7: Scenario 2: Host to Device transfers in in chunked event synchronized streams vs model

The results are quite similar to scenario 1 with less relative error for larger chunks. Kernel 1 isconsistently overestimated by the model.


22

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 6.875 6.745 -0.130 -1.93Kernel 2 size 819200 10.235 10.324 0.088 0.86Kernel 1 size 2097152 16.704 16.863 0.159 0.95Kernel 2 size 2097152 25.645 25.811 0.166 0.64Kernel 1 size 3075200 24.378 24.607 0.229 0.93Kernel 2 size 3075200 37.525 37.664 0.139 0.37Kernel 1 size 7372800 58.130 58.635 0.505 0.86Kernel 2 size 7372800 90.129 89.747 -0.382 -0.43

Table 8: Scenario 2: Host to Device transfers in chunked single stream vs model

The model has less than 1% relative error for every measurement other than kernel 1 the lowestchunk size. Most but not all calculated values slightly overestimates the time.


23

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 1.156 1.143 -0.012 -1.08Kernel 2 size 819200 2.019 2.032 0.014 0.67Kernel 1 size 2097152 2.876 2.865 -0.011 -0.38Kernel 2 size 2097152 5.081 5.094 0.012 0.24Kernel 1 size 2097152 4.193 4.183 -0.010 -0.24Kernel 2 size 2097152 7.444 7.436 -0.008 -0.11Kernel 1 size 2097152 9.972 9.973 0.002 0.02Kernel 2 size 2097152 17.684 17.731 0.046 0.26

Table 9: Scenario 2: Device to Host transfers in chunked event synchronized streams vs model

The model show no clear trend to either over- or underestimate the measured values. Therelative error is -1.08 % for kernel 1 with lowest chunksize and significantly lower for the rest.


24


Table 10: Scenario 2: Device to Host transfers in chunked single stream vs model

The model consistently overestimated the time. Most results have a relative predictive errorbetween 1-2% which is higher than previously seen.


The result show that predictability has <2.15% predictive error regardless of transfer method.Chunked event synchronized transfers do worse in performance than Chunked transfers in a singlestream as seen in figure 28. The first, smaller, kernel has a significantly improved kernel startingtime over sequential transfers where, in a worst case situation, it is could end up executing last.

25

Scenario 3

Scenario 3 consist of a five chunk delay on the second stream, and that the data set in the firststream is larger than the set in the second. This simulates a second small task being added aftera large task has begun.

Figure 29: Host to Device transfer for the third scenario.

Figure 30: Scenario 3: Host to Device transfers in chunked event synchronized streams vs model(blue dot)

26

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 11.114 11.092 -0.021 -0.19Kernel 2 size 819200 8.692 8.912 0.220 2.47Kernel 1 size 2097152 26.839 26.946 0.107 0.40Kernel 2 size 2097152 21.073 21.478 0.405 1.89Kernel 1 size 3075200 39.060 39.079 0.019 0.05Kernel 2 size 3075200 30.771 31.095 0.324 1.04Kernel 1 size 7372800 92.308 92.392 0.084 0.09Kernel 2 size 7372800 72.849 73.352 0.504 0.69

Table 11: Scenario 3: Host to Device transfers in chunked event synchronized streams vs model

The larger task has a significantly smaller predictive error even though its model is morecomplicated than the smaller task. Size of the error decreases with chunk size for the smaller taskbut no clear trend for the larger. Model overestimate all except for one value, but frequently byvery little.


27

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 10.213 10.324 0.111 1.07Kernel 2 size 819200 8.337 8.121 -0.216 -2.66Kernel 1 size 2097152 25.848 25.811 -0.036 -0.14Kernel 2 size 2097152 20.633 20.305 -0.328 -1.62Kernel 1 size 3075200 37.548 37.664 0.116 0.31Kernel 2 size 3075200 29.899 29.629 -0.270 -0.91Kernel 1 size 7372800 90.821 89.747 -1.074 -1.20Kernel 2 size 7372800 71.887 70.601 -1.287 -1.82

Table 12: Scenario 3: Host to Device transfers in chunked single stream vs model

The model has no clear trend with regards to over- or underestimation. Predictive error issimilar to chunked event syncrhonized transfer for the smaller task, while worse for the larger task.


28

Partition Average(ms) Calculated(ms) Diff(ms) Diff(%)Kernel 1 size 819200 2.022 2.032 0.010 0.51Kernel 2 size 819200 1.154 1.143 -0.011 -0.98Kernel 1 size 2097152 5.073 5.094 0.020 0.40Kernel 2 size 2097152 2.877 2.865 -0.012 -0.42Kernel 1 size 3075200 7.466 7.436 -0.030 -0.40Kernel 2 size 3075200 4.202 4.183 -0.019 -0.45Kernel 1 size 7372800 17.662 17.731 0.069 0.39Kernel 2 size 7372800 9.978 9.973 -0.004 -0.04

Table 13: Scenario 3: Device to Host transfers in chunked event synchronized streams vs model

Most but not all calculated values underestimate the actual values. predictive error is less than1% for all cases.


29


Table 14: Scenario 3: Device to Host transfers in chunked single stream vs model

The model overestimates the time for all cases. The predictive error is worse for the largerkernel which is opposite to the results for the chunked event synchronized method.


The result show that predictability is similar regardless of transfer method for the third scenarioas well. Scenario 3, where a small task is added after a large, show the worst relative predictiveerror of all tests for the smallest chunksize with 2.66% relative error (figure 31). Chunked eventsynchronized transfers do worse in performance than Chunked transfers in a single stream withregards to starting time (figure 34. The smaller kernel once again has a significantly improvedkernel starting time over sequential transfers where, in a worst case situation, it could end up

30

executing last.

6 Discussion

The fairness-based memory transfer implementation aims to provide tasks executing on differentpartitions of the GPU with equal access to the PCI-bus and to do this with predictable results.Chunked event synchronized transfers have a significant overhead compared to chunked transferin a single stream. This does not come as a surprise as the impact of event synchronization wasexpected. The effect is largest for small chunk sizes which can be attributed to the overhead beinglarge relative to the overall time the transfer takes (see Section 4.2). As total data sizes becomebigger, kernel execution time increases faster than memory transfer times resulting in the overheadhaving less impact as well. Chunk sizes larger than 500 kbytes show a significant performanceincrease when they are also relatively large compared to the total size of data being transferred.This is due to the first kernel beginning its execution in parallel with the last chunk to the secondbeing transferred. As total size increases, the total overhead will increase because more chunksare transferred and the relative performance increase becomes smaller since the chunk transfertime becomes smaller compared to total time. This leads to performance evening out. Overall theoverhead evaluation suggests that chunks should not be set too small and small data sets shouldeven be aggregated to achieve this.

The models accurately predicts transfers times in both directions with less than 2.7% relativepredictive error. Larger chunk sizes lead to relative prediction being better. There is no obviouspattern to whether the predictions overestimate or underestimate the times. The previous per-formance evaluation suggest that smaller chunk sizes should be avoided to avoid negative impacton performance, or even to improve it. Since our model is based on data for chunks as small as131 Kb it is reasonable to assume that if a lower floor for chunk size was set, the model couldbe improved using a range with that as a lower limit. The difference in transfer time betweenhost-to-device and device-to-host transfers noted by Fuji et al.[3] is found in our results as well,which must be considered when scheduling tasks. If the time delta between time two kernels finishis not evenly divisible with the time a chunk takes to transfer, the second kernel will get a negativeperformance impact. To avoid this, device-to-host transfers can be set to a larger size, a size whosetime to transfer evenly divide the kernel delta. If the GPU is split into only two parts such as inour scenarios, it is likely to be even more efficient to aggregate enough chunks into a single chunkwhose time to transfer is equal to the kernel delta. This is equally true for scenario 2 and 3 whereno device-to-host transfers are interleaved. These findings indicate that a chunking strategy shoulddynamically select chunk size to optimize host-to-device and device-to-host transfers.

7 Related works

Guevara et al.[14] tried to solve the issue with small kernels that does not occupy the full GPUby merging small kernels. They then use a scheduler to decide whether to run the original kernelsin sequence or the merged kernel based on the number of blocks needed and the sum of memoryrequests compared to available memory. They consider the impact on application latency notingthat since the GPU does not return results until the entire kernel has completed, latency will benegatively impacted for a low latency kernel that is merged with a high latency kernel as the resultsof the previous will be part of the merged kernels transfer from device to host. The initial transferfrom host to device is not brought up for consideration as they primarily focus on kernel executiontime.

Pai et al.[5] implemented elastic kernels that could run using physical grid and thread blockdimensions that were different from the ones defined by the programmer. This was done in order tocontrol the resource usage of concurrently executing kernels with respect to the resource constraintsof the system they ran on. Several different policies were tested. They included changing the kernelresource use to median and maximum estimates based on profiling, giving each kernel an equalshare of available resources and a queue based approach where each waiting kernel is examinedand modified before being launched. In addition to this they also implemented time-slicing of thekernels, allowing them to restrict the kernel’s execution to only a certain range of the originalgrid. After the range completed, the kernel is relaunched with the next range. In addition totime-slicing the kernels, memory transfers were also sliced with a 4MB chunk size. This did not

31

have any significant effect on performance, which was attributed to over 97% of dynamic transfersbeing less than 1MB in size for their benchmarks.

Chen et al.[12] proposed a task-based dynamic load-balancing solution for single- and multi-GPU systems using queues and implemented it in CUDA. A task was chosen to be of a size thatit should execute on a single tread block and all thread blocks and kernels are persistent untilthere are no more tasks to complete. The host enqueue tasks onto one or more queues and thekernels dequeue and compute the tasks. Index variables for the queues need to be duplicated onboth host and device, while the queues themselves reside on the device. CUDA events were usedto ensure correctness of the enqueue and dequeue actions by alerting the host to dequeue eventsby the devices. Atomic functions on the GPU were used to update index variables residing onthe GPU without using locks. In order to avoid data races, the host could on enqueue tasks untila queue is empty. This could handles by using several queues, even if there was a single device,to allow overlapping enqueue and dequeue operations. In the multi-GPU implementation, a hostthread was spawned to control each GPU. Each GPU had two queues that could hold 20 taskseach. When a queue became empty, the host thread would try to fetch as much as 20 tasks fromthe task pool at a time to enqueue in a single operation. Unfortunately, for our sake, as Chenet al. was primarily concerned with load balancing, the input data was transferred to the GPUbeforehand to ensure that all performance differences were due to load balancing.

Röderinger et al.[15] developed a distributed query engine based on Remote direct memory ac-cess (RDMA) over Infiniband. Infiniband uses credit-based link-level flow control to avoid blockingof ethernet switches due to full receive buffers. This does not prevent switch contention, whereseveral of the switchs’ input ports transmit data to the same output port. They addressed this byimplementing a round-robing style network scheduling. A server sends eight 512 KB messages toa single target, after which all servers synchronize before sending to the next target.

8 Conclusions

Co-executing GPU kernels on a partitioned GPU has been shown to improve utilization efficiencyof poorly scaling tasks in High Performance Computing applications. While kernels can co-executeon a partitioned GPU, memory transfers on the PCI-bus cannot be done concurrently resultingin poor parallelism and predictability if transfers are done sequentially. We implemented fairness-based memory transfer methods for DMA transfers in a multithreaded environment to and from apartitioned GPU and models to predict the starting time of kernels.

The implementations use chunking to provide each task with fair access to the PCI-bus band-width. Two methods were compared, transferring chunks in a single stream and transferring chunksin several streams using event synchronization to ensure fair bandwidth use. The results show thatchunked transfer in a single stream has less overhead over that chunked event synchronized transfersand the overhead has the most performance impact for smaller transfers.

We predict the transfer time for the implementations using linear regression on the size of thechunks being transferred. The models accurately predict transfer times for several scenarios to lessthan 2.7% relative predictive error for smaller chunk sizes and decreasing for larger sizes. Usingchunk size to predict transfer times is a viable way forward for future work on task scheduling. Weverified previous observations that device-to-host transfers are faster than host-to-device, with theimplications that this needs to be taken into consideration in future work on scheduling.

9 Future work

The conclusions regarding chunk size indicate that tasks with small amount of data would benefitfrom being aggregated into larger chunks to avoid the cost of overhead and improve predictability.Combining chunks into bigger chunks would be achieved by copying the datasets into a coherentblock of memory on the host and then transferring the block as a single chunk to the correspondingmemory block on the device. The kernel can then be launched with a pointer to the kernel’s datasetin the device memory block as a parameter as usual. More fine grained testing should be done todecide for which data sizes this would be beneficial and to adjust the prediction models to takeadvantage of the lower limit to chunk size.

The next step is to implement the fairness-based transfer method into a scheduler and evaluateits efficiency in a more general setting. The ability to dynamically adjust chunk size would be useful

32

both depending on the size of the task’s data set and if the scheduler can dynamically change thepartitions of the GPU. A method that can aggregate small transfers would also need have specificscheduling strategies to determine when to aggregate or when not to.

References

[1] J. Janzén, D. Black-Schaffer, and A. Hugo, “Partitioning gpus for improved scalability.” IEEE,2016, pp. 42–49.

[2] Top500.org. Retrieved 2017-05-31. [Online]. Available: https://www.top500.org/lists/2016/11/

[3] Y. Fujii, T. Azumi, N. Nishio, S. Kato, and M. Edahiro, “Data transfer matters for gpucomputing,” in Parallel and Distributed Systems (ICPADS), 2013 International Conferenceon. IEEE, 2013, pp. 275–282.

[4] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid gpu accel-erated manycore systems,” Parallel Computing, vol. 36, no. 5, pp. 232–240, 2010.

[5] S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan, “Improving gpgpu concurrency withelastic kernels,” SIGPLAN Not., vol. 48, no. 4, pp. 407–418, Mar. 2013. [Online]. Available:http://doi.acm.org/10.1145/2499368.2451160

[6] J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming, ser. EBL-Schweitzer. Wiley, 2014.

[7] B. Wu, G. Chen, D. Li, X. Shen, and J. Vetter, “Enabling and exploiting flexible task assign-ment on gpu through sm-centric program transformations,” in Proceedings of the 29th ACMon International Conference on Supercomputing. ACM, 2015, pp. 119–130.

[8] Y. Ukidave, C. Kalra, D. Kaeli, P. Mistry, and D. Schaa, “Runtime support for adaptivespatial partitioning and inter-kernel communication on gpus,” in Computer Architecture andHigh Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on.IEEE, 2014, pp. 168–175.

[9] M. Harris. How to overlap data transfers in cuda c/c++. Retrieved 2017-05-07. [Online].Available: https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/

[10] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki, “Ac-celerating numerical dense linear algebra calculations with gpus,” in Numerical Computationswith GPUs. Springer, 2014, pp. 3–28.

[11] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating system concepts, 9th ed. Hoboken,NJ: Wiley, 2014.

[12] L. Chen, O. Villa, S. Krishnamoorthy, and G. R. Gao, “Dynamic load balancing on single-and multi-gpu systems,” 2010, pp. 1–12.

[13] L. Eeckhout, “Computer architecture performance evaluation methods,” Synthesis Lectures onComputer Architecture, vol. 5, no. 1, pp. 1–145, 2010.

[14] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron, “Enabling task parallelism in the cudascheduler,” in Workshop on Programming Models for Emerging Architectures, vol. 9, 2009.

[15] W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann, “High-speed query processing overhigh-speed networks,” Proceedings of the VLDB Endowment, vol. 9, no. 4, pp. 228–239, 2015.

33

https://www.top500.org/lists/2016/11/

https://www.top500.org/lists/2016/11/

http://doi.acm.org/10.1145/2499368.2451160

https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/

Documents

Study of Bandwidth Partitioning for Co-executing GPU Kernelsuu.diva-portal.org/smash/get/diva2:1154188/FULLTEXT01.pdf · Graphics Processing Units (GPU) provide substantial raw computing