Image Convolution Processing: a GPU versus FPGA Comparison

Lucas M. Russo, Emerson C. Pedrino, Edilson Kato
Federal University of Sao Carlos - DC
Rodovia Washington Luís, km 235 - SP-310, 13565-905, São Carlos - São Paulo - Brazil
[email protected]; emerson, [email protected]

Valentin Obac Roda
Federal University of Rio Grande do Norte - DEE
Campus Universitário Lagoa Nova, 59072-970, Natal - Rio Grande do Norte - Brazil
[email protected]

Abstract—Convolution is one of the most important operators used in image processing. With the constant need to increase performance in high-end applications and the rising popularity of parallel architectures, such as GPUs and the ones implemented in FPGAs, comes the necessity to compare these architectures in order to determine which of them performs better and in which scenarios. In this article, convolution was implemented in each of the aforementioned architectures with the following languages: CUDA for GPUs and Verilog for FPGAs. In addition, the same algorithms were also implemented in MATLAB, using predefined operations, and in C using a regular x86 quad-core processor. Comparative performance measures, considering the execution time and the clock ratio, were taken and are commented on in the paper. Overall, it was possible to achieve a CUDA speedup of roughly 200x in comparison to C, 70x in comparison to Matlab and 20x in comparison to the FPGA.

Keywords- Image processing; Convolution; GPU; CUDA; FPGA

I. INTRODUCTION

In 2006 Nvidia Corporation announced a new general purpose parallel computing architecture based on the GPGPU (General-Purpose Computing on Graphics Processing Units) paradigm: CUDA (Compute Unified Device Architecture) [1]. CUDA is a GPGPU architecture and belongs to the SPMD (single process, multiple data; or single program, multiple data) category of parallel programming. This model is based on the execution of the same program by different processors, supplied with different input data, without the strict coordination requirement among them that the SIMD (single instruction, multiple data) model imposes. Central to the model are the so-called kernels: C-style functions that are executed in parallel by multiple threads and that, when called from the application, dynamically allocate a hierarchical processing structure specified by the user. Interleaved with the execution of the kernels, portions of sequential code are usually inserted in a CUDA program flow; for this reason, it constitutes a heterogeneous programming model.

The CUDA model was conceived to implement the so-called transparent scalability effectively, i.e., the ability of the programming model to adapt itself to the available hardware, so that programs scale to more processors without altering the algorithm, and, at the same time, to reduce the development time of parallel or heterogeneous solutions. All the aforementioned model abstractions are particularly suitable for, and easily adapted to, the field of digital image processing, given that many applications in this area operate independently pixel by pixel or on pixel windows.

Many years before the advent of the CUDA architecture, Xilinx made available to the market, in 1985, the first FPGA chip [2]. The FPGA is, basically, a highly customizable integrated circuit that has been used in a variety of scientific fields, such as digital signal processing, voice recognition, bioinformatics, computer vision and digital image processing, and in other applications that require high performance: real-time systems and high performance computing.

The comparison between CUDA and FPGAs has been documented in various works in different application domains. Asano et al. [3] compared the use of CUDA and FPGAs in image processing applications, namely two-dimensional filters, stereo vision and k-means clustering; Che et al. [4] compared their use in three application algorithms: Gaussian elimination, Data Encryption Standard (DES) and Needleman-Wunsch; Kestur et al. [5] developed a comparison for BLAS (Basic Linear Algebra Subroutines); Park et al. [6] analyzed the performance of integer and floating-point algorithms; and Weber et al. [7] compared the architectures using a Quantum Monte Carlo application.

In this work, CUDA and a dedicated FPGA architecture are used and compared on the implementation of convolution, an operation often used in image processing.

II. METHODOLOGY

All CPU (i.e., Matlab and C) and GPU (i.e., CUDA) execution times were obtained with the following configuration:

Component   Description
Hardware    Processor: Intel Core i5 750 (8MB cache L2); Motherboard: ASUS P7P55DE-PRO; RAM memory: 2 x 2 GB Corsair (DDR2-800); Graphics board: XFX Nvidia GTX 295, 896MB
Software    Windows 7 Professional 64-bit; Visual Studio 2008 SP1
Drivers     Nvidia video driver version 190.38; Nvidia CUDA toolkit version 2.3
FPGA        Cyclone II EP2C35F672 on Terasic DE2 board; Quartus II 10.1 software with SOPC Builder, NIOS II EDS 10.1 and ModelSim 6.6d simulation tool, for the implementation of the algorithms

Sponsors: FAPESP grants number 2010/04675-4 and 2009/17736-4; DC/UFSCAR; DEE UFRN
978-1-4673-0186-2/12/$31.00 ©2012 IEEE



The main comparison parameters presented in this article are the execution time and the number of clock cycles of the implemented algorithms. To obtain them, different approaches were used according to the architecture profiled.

On C, the Windows Performance Counters were used through the functions QueryPerformanceCounter() and QueryPerformanceFrequency(). The former extracts the current value of the high-resolution counter at the moment of the call; the latter returns the counter frequency, from which the elapsed time between two counter readings is computed.

On CUDA, the Event Management API provides functionality to create, destroy and record events. Hence, it is possible to measure the time taken to execute a specific part of the code, such as a kernel call, in the manner described in [1]. Concerning the clock cycles, the clock() function was used within the kernel to obtain the measurement.

On Matlab, a simple approach is provided through a built-in stopwatch, controlled with the tic and toc syntax. The first starts the timer and the second stops it, displaying the time, in seconds, taken to execute the statements between tic and toc. The Matlab number of clock cycles was not measured, since no simple way to do it was found.

At last, on the FPGA, it is possible to infer the execution time directly from the implemented architecture. With the knowledge of the clock rate, explicitly defined by the designer, and of the number of clock cycles taken to process the input data, extracted from the waveforms or from the architecture itself, the following expression can be used:

execution time = number of clock cycles / clock frequency    (1)

III. CONVOLUTION

Mathematically, convolution can be expressed as a linear combination, or sum of products, of the mask coefficients with the input function:

g(x) = sum_{s=-a}^{a} w(s) f(x - s)    (2)

where f denotes the input function and w the mask. It is implicit that equation (2) is applied for every point in the input function.

It is possible to extend the convolution operation to two dimensions as follows:

g(x, y) = sum_{s=-a}^{a} sum_{t=-b}^{b} w(s, t) f(x - s, y - t)    (3)

Convolution has a limitation at the boundaries of an input image, since the mask may be positioned in such a way that some of its values do not overlap the input image. Two approaches are commonly used in the context of image processing: padding the edges of the input image with zeros or clamping the edges with the closest border pixel. In this work the first approach is used, as in Gonzalez [8].

Considering an image of size MxN pixels and a mask of size SxT, multiplication is the most costly operation. (MN)(ST) multiplications are performed and, consequently, the algorithm belongs to O(MNST).

If a mask w(x, y) can be decomposed into w1(x) and w2(y) in such a way that w(x, y) = w1(x) w2(y), where w1(x) is a vector of size (Sx1) and w2(y) is a vector of size (1xT), the 2-D convolution can be performed as two 1-D convolutions. In this case, the convolution is said to be separable and the algorithmic complexity decays from O(MNST) to O(MN(S+T)), allowing a more flexible implementation. Hence, the separable convolution can be expressed as in equation 4:

g(x, y) = sum_{s=-a}^{a} w1(s) sum_{t=-b}^{b} w2(t) f(x - s, y - t)    (4)

IV. IMPLEMENTATION

The separable convolution was implemented in C, CUDA and Matlab (built-in function), and the regular convolution [Eq. 3] was implemented in the FPGA. The reason to implement the regular convolution in the FPGA was performance limitations: the separable algorithm, although reducing the total number of operations performed [Eq. 4], requires the image data stream to be processed twice, once for lines and once for columns. Consequently, the column filter alone would take as much time as the regular convolution to process the entire image, because of the time required to fill the shift register and because the streaming interface can transmit only one pixel per clock cycle.

A. C Implementation

The C implementation of the convolution was based on [Eq. 4] and is fairly straightforward; the sequential separable algorithm is listed below. The image was first loaded into memory with the OpenCV C library. Then, for each input pixel, the column convolution (with mask w2, of size 2*b+1) was applied to it.

/* Line convolution */
for i ← 0 to number of lines - 1
  for j ← 0 to number of columns - 1
    g(i, j) ← 0
    for l ← -b to b
      if j - l >= 0 and j - l < number of columns
        g(i, j) ← g(i, j) + f(i, j - l) * w2(b + l)
      end-if
    end-for
  end-for
end-for

/* Column convolution */
for i ← 0 to number of lines - 1
  for j ← 0 to number of columns - 1
    o(i, j) ← 0
    for k ← -a to a
      if i - k >= 0 and i - k < number of lines
        o(i, j) ← o(i, j) + g(i - k, j) * w1(a + k)
      end-if
    end-for
  end-for
end-for

B. Matlab Implementation

For Matlab, the conv2() built-in function was used to perform the convolution.

C. CUDA Implementation

On CUDA, the algorithm is implemented through two different kernels: the first part is implemented through the line kernel and the second one through the column kernel. The development of the convolution was based on the algorithm of Podlozhnyuk [9], extended to support input images of any size.

Line Convolution Kernel

The threads were grouped in 2-D blocks of size 4x16, that is, 4 lines by 16 columns, and, in turn, the blocks were grouped in a grid whose size depends on the dimensions of the input image. Each thread in the block is responsible for fetching six pixels from the input image to per-block shared memory. By doing this, the accesses to the Global Device Memory are reduced, as long as the memory coalescing restrictions are guaranteed.

Figs. 1 and 2 illustrate the general idea with a block size of 4x4 instead of its real dimensions (4x16), for display purposes. The first line of the 2-D block (Fig. 1) is mapped to the first line of the input image (Fig. 2). In this way, thread0,0 is mapped to the pixels S0,0 (left apron region), S0,0+block_size, S0,0+2*block_size, S0,0+3*block_size, S0,0+4*block_size and S0,0+5*block_size (right apron region), and so on. Concerning the actual block size, 384 pixels, i.e., 16x6 (pixels per line) x 4 (number of block lines), are loaded to shared memory.

Figure 1. Example of a 2-D block of column size 16 (i.e., 16 threads) for line and column kernels.

After the loading stage, all threads within a block must synchronize their execution since, in the next stage, threads are going to access elements that were loaded by other ones. In order to do that, a call to the CUDA API function __syncthreads() is issued and the program can proceed correctly.

In the final stage, each thread is assigned the task to calculate four output pixels, which are in the same positions as the ones the thread was mapped to in the main region (Fig. 2).

Concerning the flexibility of the convolution filter, some images are not multiples of (BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_ROW_X is the number of columns per block and NUM_LOADS_PER_THREAD is the number of pixels fetched from the main region. In order to solve this, the kernel is launched again with the following offset from the beginning of the image:

rowOffset = width - BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD    (5)

By doing this, every column between rowOffset and the last will be calculated, even if some of them were previously calculated. Hence, it is not necessary to determine exactly which was the last calculated column in order to construct the rowOffset, which would increase the verification overhead. Additionally, the memory coalescing requirements are automatically satisfied for NUM_LOADS_PER_THREAD equal to four and BLOCK_SIZE_ROW_X equal to 16.

Column Convolution Kernel

The threads, in the column filter, are divided in 2-D blocks of size 8x16, or 8 lines by 16 columns. As in the line kernel, the grid size depends on the input image and each thread is responsible for fetching six pixels from the input image to shared memory.

The decision for 16 columns in this block size was made considering the memory coalescing requirements (i.e., half-warp access to contiguous memory positions). Conversely, the 8 lines in the block size tend to reduce the number of apron pixels loaded to shared memory, reduce the ratio (number of apron pixels/number of output pixels) and increase memory reuse.

In the same way as the line kernel, the column kernel is launched again for images not multiples of (BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_COLUMN_Y is the number of lines per block and NUM_LOADS_PER_THREAD is the number of fetches per thread from the image main region. Thus, the following offset was used:

columnOffset = height - BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD    (6)

D. FPGA Implementation

For the FPGA implementation, the architecture depicted in Fig. 3 was developed with the assistance of SOPC Builder and Verilog HDL coding.

The architecture is responsible for the following functions. A grayscale or binary image in JPEG format, stored on Flash Memory, is converted to RAW format, that is, to a matrix of integer values between 0 (i.e., black) and 255 (i.e., white).

Such conversion was performed by the NIOS II Fast Core [10], the C library libjpeg [11] and the wrapper function for decompressing JPEG images available in [12]. Therefore, this constitutes the application software layer and is controlled by the NIOS II EDS.

Figure 2. Example of an image region with 96 pixels mapped to 16 threads of the line kernel. Pixels with the same color are mapped to the same thread (see Fig. 1).

Figure 3. Pipeline Architecture for Image Processing.

After that, the decompressed image is written to the pixel buffer (SDRAM chip lower addresses) and, in this way, the DMA (Pixel Buffer DMA Controller) is able to access it without interrupting the processor and transmit the pixels to the remainder of the pipeline (Image Processing Pipeline). Then, the image is processed through various streaming components, whose interface is called Avalon Streaming Interface [13], constituting the pipeline (Image Processing Pipeline, Fig. 3). Firstly, the image is processed by the User Streaming Component. Following, each pixel (8-bit grayscale) is converted to 30-bit RGB (RGB Resampler). Then, there is a dual-clock queue (Clock Crossing Bridge) acting as a bridge between two clock domains (100MHz, the general design clock, and 25MHz, the VGA clock). And, lastly, a VGA controller is used to display the processed image.

The implemented convolution module is interfaced to the rest of the architecture by means of the Avalon Streaming Interface.

This module is based on a finite state machine with four states: DATA_FILL_BUFFER_STATE, DATA_PROCESSING_STATE_1, DATA_PROCESSING_STATE_2 and DATA_END_PROCESSING_STATE.

Firstly, upon the reset signal, every position of the shift register is initialized with the value 0. The first state (DATA_FILL_BUFFER_STATE) consists of reading the input interface at each clock rise and storing the value read in the shift register (Fig. 4). This state lasts until floor(KS/2)*(IW) pixels have been read or, in other words, until the first valid pixel is positioned in the center gray region coordinate of Fig. 4.

Figure 4. Layout of the convolution module shift register. KS (Kernel Size) denotes the size, in one dimension, of the used kernel; that means, for a 3x3 kernel, KS = 3. IW (Image Width) denotes the width of the input image; that means, for a 640x480 image, IW = 640. The grey area indicates the pixels used in the convolution calculation.

Next, the DATA_PROCESSING_STATE_1 state is entered. The first moment of this state is depicted in Fig. 5.

Figure 5. Example of an image with pixel values ranging from 0 to 255 (left) and the values associated with the shift register, with KS = 3, in the first moment of the DATA_PROCESSING_STATE_1 state. The x symbol indicates a value not considered.

From this state until the end of the state machine (i.e., DATA_END_PROCESSING_STATE), the convolution sum (Eq. 3) is applied to the gray area and, at every clock cycle, an output pixel becomes available at the output interface (i.e., Avalon Streaming Source Interface). It is important to mention that this calculation is performed by a parallel combinational circuit sub-module (Convolution Operation, Fig. 6).

  • readequa

    Twithvaluand DATperfoneigpixethe oconsbotto

    LstatestateregiscaseDAT

    Iimagmodelemkernarch

    TloadAfteclockinter

    TexecmaskCUDpres

    d, or more spals the number

    This state is sih the exceptioue zero will be

    similarly TA_PROCESSformed, since hborhood) wls or to valuesone depicted sidered are theom-right side)

    Lastly, after te is modifiede. This last sster and the co). Next, the

    TA_FILL_BU

    It is importantges up to 640xdule. This is ments limitatinel and image hitecture itself.

    Figure

    The number oding period (ier this state, uk cycle, one orface.

    V.The graphs focution time fok size of 15, DA speedup fented in figur

    pecifically, whr of pixels of t

    imilar to DATn that, as thee forced into t

    to the fiSSING_ STAT

    the border pwill have its v

    s 0. The last min Fig. 5, wite ones located).

    the calculationd to DATA_state is respoounters to its

    state machiUFFER_STA

    t to highlight x480 are supp

    due the DEion. Thereforsizes were es.

    e 6. Convolution

    f clock cycles.e., state DAuntil the end output pixel is

    . RESULTS Aor the convol

    or various grayas well as th

    for Matlab, C es 7, 8 and 9.

    hen the numbthe input imag

    TA_PROCESre are no pixethe shift regis

    first input TE_1 state, a bpixels (i.e., wvalues convo

    moment of thisth the exceptid near the end

    n of the last _END_PROConsible for rdefault value

    ine returns toATE.

    that only kernported in the FE2 board mre, results instimated consi

    module block dia

    s were obtaineTA_FILL_BUof the state m

    s available at

    AND COMPARI

    lution operatiy scale imagehe number ofand the FPG

    ber of pixels ge.

    SSING_ STATels to be readter. By doing pixels of

    border treatmewithout a com

    luted to neigs state is similon that the vaof the image

    pixel, the preCESSING_STAresetting the s (i.e., zero ino its initial

    nels up to 5x5FPGA convolu

    memory and lnvolving diffidering the mo

    agram

    ed disregardinUFFER_STAmachine, at ethe output mo

    ISON on comparings resolutions

    f clock cyclesGA architectur

    read

    TE_1 d, the g this,

    the ent is

    mplete ghbor lar to alues (i.e.,

    esent TATE

    shift n this state

    5 and ution logic

    ferent odule

    ng the ATE). every odule

    g the for a

    s and re are

    Fthat behatendincrfor relat

    Fig

    Tspeeimpexceimagdue and Mat

    Tservotheand imp

    Twithparaboarparathaninter(Fig512xrelatCUD

    Amulperflargincrthe C

    For the convothe execution

    aves as a expod to maintain teases. The eximage resoluted to cache p

    gure 7. Average e

    The speedup gedup considelementations eption to this ge samples. Tthe large amohigh resource

    tlab and FPGAThe C imple

    ved as a coners parallel alg

    number of cllementations. The FPGA imh the parallelallelism apprord (DE2) rallelism calcun a true parresting to not

    g. 13) was smx512. Besidetivity low (i.DA (i.e., 1242Another limittiplier block

    formance for er FPGA witease the perfoCUDA implem

    olution applicn time graph (Fonential (note the same grow

    xception to thiutions 3300x2erformance.

    xecution times ofand various im

    graph (Fig. 9) ering CUDAand various

    is the Matlab The reason foount of arithme utilization oA. ementation ditrol implemegorithms. Beclock cycles pe

    mplementationconvolution

    oach because resource limilation the exerallel implemtice that the F

    mall, even lesses, the clock .e., 100MHz 2 MHz for the ation of the u

    ks that woulthe convolut

    th more resouormance of thementation.

    cation, it is pFig. 7) for all the log scale

    wth rate, as this is the Matl2400 and 409

    f the convolutionmage resolutions.

    ) shows an appA in regard

    image resolimplementati

    or this good Cmetic operationof the GPU in

    id not exploientation for ccause of that, erformed wor

    n explored socalculation,

    of the FPGAmitations. The

    ecution timesmentation (i.eFPGA numbe

    s than CUDArate used inin this desi

    e processor cloused FPGA w

    uld improve tion operationurces and a fahe FPGA and p

    ossible to obthree architecon the y-axis

    he image resolab implement96x4096, pos

    n with mask size o

    proximately std to the lutions. Againion for the twoCUDA speedns, high granun comparison

    it parallelismcomparison toits execution

    rse than the o

    me parallelismbut lacked a

    A and developerefore, due s were only we., CUDA). er of clock c

    A for image sin the FPGA gn) comparin

    ock). was the absen

    significantlyn. Hence, usifaster clock shpossibly overc

    serve ctures s) and lution tation ssibly

    of 15

    teady other n, an o last

    dup is ularity

    to C,

    m and o the

    n time others

    m, as a true pment

    the worse It is

    cycles ize of

    was ng to

    nce of y the ing a hould come

  • Figu

    Figu

    Ain awhilcerta

    IC, MBasethe cyclimplin imexplresomultband

    re 8. Number of c

    ure 9. Speedup of

    A positive poia small boardle the GPU bainly much lar

    In this paper, Matlab and FPed on results pbest performes and speedlemented FPGmage resolutioore better mlution imagetiple pipelinesdwidth [14] .

    clock cycles of thvarious imag

    f the Convolution resol

    int in favor of d, with the pboard needs arger and more

    VI. CONwe presented

    PGA for the copresented, it i

    mance in execdup in comp

    GA architecturon. That is duemassive amos, based on s, high theoret

    he convolution wige resolutions.

    with mask size olutions.

    f the FPGA is eripheral inte

    a PC to be coe power consu

    NCLUSIONS a comparison

    onvolution of s inferable thacution time, parison to C,re and increasee to the fact thounts of data

    its inherent tical peak of G

    ith mask size of 1

    of 15 and various

    that in can operfaces integronnected whicuming.

    n between CUf grayscale imaat CUDA prenumber of c

    , Matlab andes with the gr

    hat CUDA tena, such as features suc

    GFLOPS and

    15 and

    image

    perate rated, ch is

    UDA, ages. sents clock d the rowth nds to

    high ch as

    high

Regarding the FPGA architecture, it can be seen from the graphs that it performed well, although worse than CUDA, and kept a steady growth in execution time and number of clock cycles. It must be noticed that there are more dense FPGAs available that can operate at higher clock rates, which will certainly increase its performance and could even surpass the GPU.

Finally, it is possible to improve the performance of the FPGA algorithms even more. Dividing the input image into various squares and providing that each region can be transmitted to parallel convolution modules through different data streams can improve the performance roughly by the number of convolution modules. However, by doing this, more FPGA die area is consumed, which could possibly make it impractical.
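The image-splitting scheme suggested above can be sketched in C: each square region is extended by a halo of r = k/2 pixels so that an independent convolution module can process it without needing neighboring data. The tile size, the struct layout and all names here are our illustrative assumptions, not the paper's implementation:

```c
/* One square region of the input image, plus the clamped halo
 * needed to convolve it independently with a k x k mask. */
typedef struct {
    int x0, y0;    /* top-left corner of the region proper        */
    int w, h;      /* region size (without halo)                  */
    int hx0, hy0;  /* top-left corner including halo, clamped     */
    int hw, hh;    /* size including halo, clamped to the image   */
} Tile;

/* Split a width x height image into tile x tile squares for a k x k
 * mask. Returns the number of tiles written; `tiles` must hold at
 * least ceil(width/tile) * ceil(height/tile) entries. */
int split_tiles(int width, int height, int tile, int k, Tile *tiles)
{
    int r = k / 2, n = 0;
    for (int y = 0; y < height; y += tile) {
        for (int x = 0; x < width; x += tile) {
            Tile t;
            t.x0 = x;  t.y0 = y;
            t.w  = (x + tile <= width)  ? tile : width  - x;
            t.h  = (y + tile <= height) ? tile : height - y;
            t.hx0 = (x - r > 0) ? x - r : 0;
            t.hy0 = (y - r > 0) ? y - r : 0;
            int hx1 = (x + t.w + r < width)  ? x + t.w + r : width;
            int hy1 = (y + t.h + r < height) ? y + t.h + r : height;
            t.hw = hx1 - t.hx0;
            t.hh = hy1 - t.hy0;
            tiles[n++] = t;
        }
    }
    return n;
}
```

Each tile (with its halo) could then be streamed to a separate convolution module; the die-area cost grows with the number of modules, as noted above.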

ACKNOWLEDGMENTS

The authors are grateful to FAPESP, grants number 2010/04675-4 and 2009/17736-4, to the Department of Computer Science, Federal University of Sao Carlos, and to the Department of Electrical Engineering, Federal University of Rio Grande do Norte for the support throughout this work.

REFERENCES

[1] Nvidia Corporation. (2009). Nvidia CUDA Programming Guide. [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
[2] Xilinx, C. (2010). Our History. [Online]. Available: www.xilinx.com/company/history
[3] S. Asano, T. Maruyama and Y. Yamaguchi, Performance Comparison of FPGA, GPU and CPU in Image Processing, in International Conference on Field Programmable Logic and Applications - FPL 2009, Prague, 2009, pp. 126-131.
[4] S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach, Accelerating Compute-Intensive Applications with GPUs and FPGAs, in Symposium on Application Specific Processors - SASP 2008, Anaheim, 2008, pp. 101-107.
[5] S. Kestur, J. D. Davis and O. Williams, BLAS Comparison on FPGA, CPU and GPU, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Lixouri, Kefalonia, 2010, pp. 288-293.
[6] S. J. Park, D. R. Shires and B. J. Henz, Coprocessor Computing with FPGA and GPU, in DoD HPCMP Users Group Conference, Seattle, 2008, pp. 366-370.
[7] R. Weber, A. Gothandaraman, R. J. Hinde and G. D. Peterson, Comparing Hardware Accelerators in Scientific Applications: A Case Study, in IEEE Transactions on Parallel and Distributed Systems, 2011, Vol. 22, no. 1, pp. 58-68.
[8] R. C. Gonzales and R. E. Woods, Image Enhancement in the Spatial Domain, in Digital Image Processing, 3rd ed. Prentice Hall, 2008.
[9] V. Podlozhnyuk. (2007, Jun.). Image Convolution with CUDA. [Online]. Available: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_64_website/projects/convolutionSeparable/doc/convolutionSeparable.pdf
[10] Altera CO. NIOS II Processor. [Online]. Available: www.altera.com/devices/processor/nios2/ni2-index.html
[11] Independent JPEG Group. libjpeg. [Online]. Available: www.ijg.org
[12] Altera CO. Nios II System Architect Design. [Online]. Available: www.altera.com/support/examples/nios2/exm-system-architect.html
[13] Altera CO. (2011). Avalon Streaming Interface, Chap. 5. [Online]. Available: www.altera.com/literature/manual/mnl_avalon_spec.pdf
[14] D. B. Kirk and W. W. Hwu, Introduction, in Programming Massively Parallel Processors: A Hands-on Approach, 1st ed. Morgan Kaufmann, 2010.
