Image Convolution Processing: a GPU versus FPGA Comparison
Lucas M. Russo, Emerson C. Pedrino, Edilson Kato
Federal University of São Carlos - DC
Rodovia Washington Luís, km 235 - SP-310, 13565-905 São Carlos - São Paulo - Brazil
firstname.lastname@example.org; emerson, email@example.com
Valentin Obac Roda
Federal University of Rio Grande do Norte - DEE
Campus Universitário Lagoa Nova, 59072-970 Natal - Rio Grande do Norte - Brazil
Abstract- Convolution is one of the most important operators used in image processing. The constant need for higher performance in high-end applications, together with the rise and popularity of parallel architectures such as GPUs and those implemented in FPGAs, makes it necessary to compare these architectures in order to determine which of them performs better and in which scenario. In this article, convolution was implemented on each of the aforementioned architectures with the following languages: CUDA for GPUs and Verilog for FPGAs. In addition, the same algorithms were implemented in Matlab, using predefined operations, and in C, running on a regular x86 quad-core processor. Comparative performance measures, considering the execution time and the clock ratio, were taken and are discussed in the paper. Overall, it was possible to achieve a CUDA speedup of roughly 200x in comparison to C, 70x in comparison to Matlab and 20x in comparison to FPGA.
Keywords- Image processing; Convolution; GPU; CUDA; FPGA
I. INTRODUCTION
In 2006, Nvidia Corporation announced a new general-purpose parallel computing architecture based on the GPGPU paradigm (General-Purpose Computing on Graphics Processing Units): CUDA (Compute Unified Device Architecture). CUDA is classified as a GPGPU architecture and falls under the SPMD (single process, multiple data; or single program, multiple data) category of parallel programming. The model is based on the execution of the same program by different processors, each supplied with different input data, without the strict coordination among them that the SIMD (single instruction, multiple data) model imposes. Central to the model are the so-called kernels: C-style functions that are executed in parallel by multiple threads and that, when called from the application, dynamically allocate a hierarchical processing structure specified by the user. Portions of sequential code are usually interleaved with the kernel executions in a CUDA program flow. For this reason, CUDA constitutes a heterogeneous programming model.
The CUDA model was conceived to implement so-called transparent scalability effectively, i.e., the ability of the programming model to adapt itself to the available hardware in such a way that more processors can be used without altering the algorithm, while at the same time reducing the development time of parallel or heterogeneous solutions. All the aforementioned model abstractions are particularly suitable for, and easily adapted to, the field of digital image processing, given that many applications in this area operate independently on a pixel-by-pixel or pixel-window basis.
Many years before the advent of the CUDA architecture, in 1985, Xilinx brought the first FPGA chip to market. The FPGA is basically a highly customizable integrated circuit that has been used in a variety of fields, such as digital signal processing, voice recognition, bioinformatics, computer vision, digital image processing and other applications that require high performance, such as real-time systems and high-performance computing.
The comparison between CUDA and FPGAs has been documented in various works across different application domains. Asano et al. compared the use of CUDA and FPGAs in image processing applications, namely two-dimensional filters, stereo vision and k-means clustering; Che et al. compared their use in three application algorithms: Gaussian Elimination, Data Encryption Standard (DES), and Needleman-Wunsch; Kestur et al. developed a comparison for BLAS (Basic Linear Algebra Subroutines); Park et al. analyzed the performance of integer and floating-point algorithms; and Weber et al. compared the architectures using a Quantum Monte Carlo application.
In this work, CUDA and a dedicated FPGA architecture are used and compared in the implementation of convolution, an operation often used in image processing.
II. METHODOLOGY
All CPU (i.e., Matlab and C) and GPU (i.e., CUDA) execution times were obtained with the following configuration:
Processor: Intel Core i5 750 (8 MB L2 cache); Motherboard: ASUS P7P55DE-PRO; RAM: 2 x 2 GB Corsair (DDR2-800); Graphics board: XFX Nvidia GTX 295, 896 MB
Software: Windows 7 Professional 64-bit; Visual Studio 2008 SP1
Drivers: Nvidia video driver version 190.38; Nvidia CUDA toolkit version 2.3
For the FPGA implementation of the algorithms, the following were used: Cyclone II EP2C35F672 on a Terasic DE2 board; Quartus II 10.1 software with SOPC Builder, NIOS II EDS 10.1 and the ModelSim 6.6d simulation tool.
Sponsors: FAPESP grants number 2010/04675-4 and 2009/17736-4; DC/UFSCAR; DEE UFRN
978-1-4673-0186-2/12/$31.00 2012 IEEE
The main comparison parameters presented in this article are the execution time and the number of clock cycles of the implemented algorithms. To obtain them, a different measurement approach was used for each architecture profiled.
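In C, the Windows performance counters were used through the functions QueryPerformanceCounter() and QueryPerformanceFrequency(). The former returns the value of the counter at the moment of the function call; the latter returns the counter frequency in counts per second, so the elapsed time is the difference between two counter readings divided by that frequency.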
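The measurement pattern described above can be sketched as follows. This is a minimal illustration, not the code used in the paper: the names now_seconds, busy_work and time_busy_work are hypothetical, and the non-Windows branch swaps in the standard clock() function (CPU time rather than wall-clock time) purely so that the sketch also compiles off Windows.

```c
#include <assert.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

/* Returns a timestamp in seconds (hypothetical helper). */
double now_seconds(void)
{
#ifdef _WIN32
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);  /* counter ticks per second */
    QueryPerformanceCounter(&count);   /* current counter value */
    return (double)count.QuadPart / (double)freq.QuadPart;
#else
    /* Assumption: portable fallback, measures CPU time instead. */
    return (double)clock() / (double)CLOCKS_PER_SEC;
#endif
}

/* Hypothetical workload standing in for the code being profiled. */
double busy_work(int n)
{
    volatile double acc = 0.0;
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        acc += (double)i * 0.5;
    return acc;
}

/* Elapsed time, in seconds, of one call to busy_work(n). */
double time_busy_work(int n)
{
    double t0 = now_seconds();
    busy_work(n);
    double t1 = now_seconds();
    return t1 - t0;
}
```

Calling time_busy_work(n) from a main function and printing the result with printf gives the elapsed time of the profiled region in seconds.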
In CUDA, the event management API provides functions to create, destroy and record events. It is therefore possible to measure the time taken to execute a specific portion of code, such as a kernel call, in the manner described in . For the clock cycles, the clock() function was called within the kernel to obtain the measurement.
In Matlab, a simple approach is provided by the built-in stopwatch, which is controlled with the tic and toc commands. The first starts the timer and the second stops it, displaying the time, in seconds, taken to execute the statements between tic and toc. The number of clock cycles was not measured for Matlab, since no simple way to do so was found.
Finally, on the FPGA, the execution time can be inferred directly from the implemented architecture. Given the clock rate, explicitly defined by the designer, and the number of clock cycles taken to process the input data, extracted from the waveforms or from the architecture itself, the following expression can be used:
execution time = number of clock cycles / clock frequency
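As a small illustration of this expression, the helper below divides a cycle count by a clock frequency. The function name fpga_exec_time_s is hypothetical, and the figures in the usage note are assumptions chosen for the arithmetic, not measurements from this work.

```c
#include <assert.h>

/* Execution time in seconds = number of clock cycles / clock frequency (Hz). */
double fpga_exec_time_s(unsigned long long cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}
```

For example, assuming a design that consumes one pixel per clock cycle, a 512x512 image (262144 cycles) and a 50 MHz clock, fpga_exec_time_s(262144ULL, 50e6) evaluates to about 5.24 ms.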
III. CONVOLUTION
Mathematically, convolution can be expressed as a linear combination or sum of products of the mask coefficients with the input function.
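In the usual textbook notation, and assuming a mask w of 2a+1 coefficients indexed from -a to a (the symbol a is an assumption introduced here), equation (2) can be written as:

```latex
(w * f)(x) = \sum_{s=-a}^{a} w(s)\, f(x - s) \qquad (2)
```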
Here, f denotes the input function and w the mask. Equation (2) is implicitly applied at every point of the input function.
The convolution operation can be extended to two dimensions as follows:
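Assuming a mask w(s,t) indexed from -a to a in the first coordinate and from -b to b in the second, so that the mask size is (2a+1) x (2b+1), equation (3) takes the standard form:

```latex
(w * f)(x, y) = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w(s, t)\, f(x - s,\, y - t) \qquad (3)
```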
Convolution has a limitation at the boundaries of an input image: when the mask is centered near an edge, some of the mask values do not overlap with the input image. Two approaches are commonly used in image processing to handle this: padding the edges of the input image with zeros, or clamping the edges of the input image to the closest border pixel. In this work the first option is used, as in Gonzalez.
For an image of MxN pixels and a mask of size SxT, multiplication is the most costly operation. Since ST multiplications are performed for each of the MN pixels, (MN)(ST) operations are performed in total and, consequently, the algorithm belongs to O(MNST).
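The direct form of the 2-D convolution with zero padding can be sketched in C as below. This is an illustrative sketch rather than the code used in this work: the function names, the row-major float layout and the half-sizes a and b (so that S = 2a+1 and T = 2b+1) are assumptions. The four nested loops make the O(MNST) cost visible.

```c
#include <assert.h>

/* Direct 2-D convolution with zero padding (illustrative sketch).
 * in, out : rows x cols images stored row-major.
 * mask    : (2a+1) x (2b+1) coefficients stored row-major,
 *           element (s, t) at index (s + a) * (2b + 1) + (t + b). */
void conv2d_zero_pad(const float *in, float *out, int rows, int cols,
                     const float *mask, int a, int b)
{
    int mask_cols = 2 * b + 1;
    assert(rows > 0 && cols > 0 && a >= 0 && b >= 0);

    for (int x = 0; x < rows; ++x) {
        for (int y = 0; y < cols; ++y) {
            float acc = 0.0f;
            for (int s = -a; s <= a; ++s) {
                for (int t = -b; t <= b; ++t) {
                    int xs = x - s;
                    int yt = y - t;
                    /* Zero padding: pixels outside the image contribute 0. */
                    if (xs < 0 || xs >= rows || yt < 0 || yt >= cols)
                        continue;
                    acc += mask[(s + a) * mask_cols + (t + b)]
                         * in[xs * cols + yt];
                }
            }
            out[x * cols + y] = acc;
        }
    }
}

/* Usage example: 3x3 image of ones convolved with a 3x3 mask of ones.
 * Returns the output value at (x, y). */
float demo_box_sum(int x, int y)
{
    float img[9], out[9], mask[9];
    for (int i = 0; i < 9; ++i) {
        img[i] = 1.0f;
        mask[i] = 1.0f;
    }
    conv2d_zero_pad(img, out, 3, 3, mask, 1, 1);
    return out[x * 3 + y];
}
```

In the usage example, the center pixel sees all nine image pixels, while a corner pixel overlaps only four of them because the remaining mask positions fall on the zero padding.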
If a mask w(x,y) can be decomposed into w1(x) and w2(y) in such a way that w(x,y) = w1(x) w2(y), where w1(x) is a vector of size (Sx1) and w2(y) is a vector of size (1xT), the 2D convolution can be performed as two 1D convolutions. In this case the convolution is said to be separable, and the algorithmic complexity drops from O(MNST) to O(MN(S+T)), allowing for a more flexible implementation. Hence, the separable convolution formula can be expressed as in equation 4.
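Assuming w1 is indexed from -a to a and w2 from -b to b, so that S = 2a+1 and T = 2b+1 (these index ranges are an assumption), substituting w(s,t) = w1(s) w2(t) into the 2-D sum and factoring out w1 gives a form such as:

```latex
(w * f)(x, y)
  = \sum_{s=-a}^{a} \sum_{t=-b}^{b} w_1(s)\, w_2(t)\, f(x - s,\, y - t)
  = \sum_{s=-a}^{a} w_1(s) \left[ \sum_{t=-b}^{b} w_2(t)\, f(x - s,\, y - t) \right] \qquad (4)
```

The inner sum is a 1-D convolution with w2 that costs T multiplications per pixel, and the outer sum is a 1-D convolution with w1 that costs S multiplications per pixel, which is where the MN(S+T) operation count comes from.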
IV. IMPLEMENTATION
The separable convolution was implemented in C, CUDA and Matlab (built-in function), and the regular convolution [Eq. 3] was implemented in the FPGA. The regular convolution was chosen for the FPGA because of performance limitations. The separable algorithm, although it reduces the total number of operations performed [Eq. 4], requires the image data stream to be processed twice, once for rows and once for columns. Consequently, the column filter alone would take as much time as the regular convolution to process the entire image. This is due to the time required to fill the shift register and to the streaming interface, which can transmit only one pixel per clock cycle.
A. C Implementation
The C implementation of convolution was based on [Eq. 4] and is fairly straightforward. The sequential separable algorithm that was implemented is outlined below.
The image was first loaded into memory using the OpenCV C library. Then, for each input pixel, the column convolution (with mask w2 and size equal to 2*b+1) was applied to it.
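A two-pass separable convolution of this kind can be sketched as follows. This is not the authors' listing: the function names, the float row-major buffers, the zero-padding boundary handling and the pass order (w2 first, then w1) are assumptions, and image loading with OpenCV is left out so that the sketch stays self-contained.

```c
#include <assert.h>
#include <stdlib.h>

/* Separable 2-D convolution with zero padding (illustrative sketch).
 * Pass 1 convolves along y with w2 (length 2b+1) into a temporary buffer;
 * pass 2 convolves that buffer along x with w1 (length 2a+1).
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int conv2d_separable(const float *in, float *out, int rows, int cols,
                     const float *w1, int a, const float *w2, int b)
{
    float *tmp;
    assert(rows > 0 && cols > 0 && a >= 0 && b >= 0);

    tmp = malloc((size_t)rows * (size_t)cols * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* Pass 1: 1-D convolution with w2 along the second coordinate. */
    for (int x = 0; x < rows; ++x) {
        for (int y = 0; y < cols; ++y) {
            float acc = 0.0f;
            for (int t = -b; t <= b; ++t) {
                int yt = y - t;
                if (yt >= 0 && yt < cols)  /* zero padding otherwise */
                    acc += w2[t + b] * in[x * cols + yt];
            }
            tmp[x * cols + y] = acc;
        }
    }

    /* Pass 2: 1-D convolution with w1 along the first coordinate. */
    for (int x = 0; x < rows; ++x) {
        for (int y = 0; y < cols; ++y) {
            float acc = 0.0f;
            for (int s = -a; s <= a; ++s) {
                int xs = x - s;
                if (xs >= 0 && xs < rows)  /* zero padding otherwise */
                    acc += w1[s + a] * tmp[xs * cols + y];
            }
            out[x * cols + y] = acc;
        }
    }

    free(tmp);
    return 0;
}

/* Usage example: 3x3 image of ones with w1 = w2 = {1, 1, 1}.
 * Returns the output value at (x, y), or -1 on allocation failure. */
float demo_separable_sum(int x, int y)
{
    float img[9], out[9];
    const float w[3] = { 1.0f, 1.0f, 1.0f };
    for (int i = 0; i < 9; ++i)
        img[i] = 1.0f;
    if (conv2d_separable(img, out, 3, 3, w, 1, w, 1) != 0)
        return -1.0f;
    return out[x * 3 + y];
}
```

With zero padding, the two passes produce the same result as the direct 2-D sum with the mask w(s,t) = w1(s) w2(t), while performing S+T instead of ST multiplications per pixel.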
B. Matlab Implementation
For Matlab, the