Image Convolution Processing: a GPU versus FPGA Comparison

Lucas M. Russo, Emerson C. Pedrino, Edilson Kato
Federal University of Sao Carlos - DC
Rodovia Washington Luís, km 235 - SP-310, 13565-905, São Carlos - São Paulo - Brazil
[email protected]; emerson, [email protected]

Valentin Obac Roda
Federal University of Rio Grande do Norte - DEE
Campus Universitário Lagoa Nova, 59072-970, Natal - Rio Grande do Norte - Brazil
[email protected]

Abstract—Convolution is one of the most important operators used in image processing. With the constant need to increase performance in high-end applications and the rising popularity of parallel architectures, such as GPUs and the ones implemented in FPGAs, comes the necessity to compare these architectures in order to determine which of them performs better and in which scenarios. In this article, convolution was implemented in each of the aforementioned architectures with the following languages: CUDA for GPUs and Verilog for FPGAs. In addition, the same algorithms were also implemented in MATLAB, using predefined operations, and in C using a regular x86 quad-core processor. Comparative performance measures, considering the execution time and the clock ratio, were taken and are commented on in the paper. Overall, it was possible to achieve a CUDA speedup of roughly 200x in comparison to C, 70x in comparison to Matlab and 20x in comparison to the FPGA.

Keywords- Image processing; Convolution; GPU; CUDA; FPGA

I. INTRODUCTION

In 2006 Nvidia Corporation announced a new general purpose parallel computing architecture based on the GPGPU (General-Purpose Computing on Graphics Processing Units) paradigm: CUDA (Compute Unified Device Architecture) [1]. CUDA is a GPGPU architecture and belongs to the SPMD (single process, multiple data; or single program, multiple data) category of parallel programming. This model is based on the execution of the same program by different processors, supplied with different input data, without the strict coordination requirement among them that the SIMD (single instruction, multiple data) model imposes. Central to the model are the so-called kernels: C-style functions that are executed in parallel by multiple threads and that, when called from the application, dynamically allocate a hierarchical processing structure specified by the user. Interleaved with the execution of the kernels, portions of sequential code are usually inserted in a CUDA program flow; for this reason, it constitutes a heterogeneous programming model.

The CUDA model was conceived to implement the so-called transparent scalability effectively, i.e., the ability of the programming model to adapt itself to the available hardware, so that programs scale to more processors without altering the algorithm, and, at the same time, to reduce the development time of parallel or heterogeneous solutions. All the aforementioned model abstractions are particularly suitable for, and easily adapted to, the field of digital image processing, given that many applications in this area operate independently pixel by pixel or on pixel windows.

Many years before the advent of the CUDA architecture, Xilinx made available to the market, in 1985, the first FPGA chip [2]. The FPGA is, basically, a highly customizable integrated circuit that has been used in a variety of scientific fields, such as digital signal processing, voice recognition, bioinformatics, computer vision and digital image processing, and in other applications that require high performance: real-time systems and high performance computing.

The comparison between CUDA and FPGAs has been documented in various works in different application domains. Asano et al. [3] compared the use of CUDA and FPGAs in image processing applications, namely two-dimensional filters, stereo vision and k-means clustering; Che et al. [4] compared their use in three application algorithms: Gaussian elimination, Data Encryption Standard (DES) and Needleman-Wunsch; Kestur et al. [5] developed a comparison for BLAS (Basic Linear Algebra Subroutines); Park et al. [6] analyzed the performance of integer and floating-point algorithms; and Weber et al. [7] compared the architectures using a Quantum Monte Carlo application.

In this work, CUDA and a dedicated FPGA architecture are used and compared on the implementation of convolution, an operation often used in image processing.

II. METHODOLOGY

All CPU (i.e., Matlab and C) and GPU (i.e., CUDA) execution times were obtained with the following configuration:

Component   Description
Hardware    Processor: Intel Core i5 750 (8MB cache L2); Motherboard: ASUS P7P55DE-PRO; RAM memory: 2 x 2 GB Corsair (DDR2-800); Graphics board: XFX Nvidia GTX 295, 896MB
Software    Windows 7 Professional 64-bit; Visual Studio 2008 SP1
Drivers     Nvidia video driver version 190.38; Nvidia CUDA toolkit version 2.3
FPGA        Cyclone II EP2C35F672 on Terasic DE2 board; Quartus II 10.1 software with SOPC Builder, NIOS II EDS 10.1 and ModelSim 6.6d simulation tool, for the implementation of the algorithms

Sponsors: FAPESP grants number 2010/04675-4 and 2009/17736-4; DC/UFSCAR; DEE UFRN
978-1-4673-0186-2/12/$31.00 ©2012 IEEE



The main comparison parameters presented in this article are the execution time and the number of clock cycles of the implemented algorithms. To obtain them, different approaches were used according to the architecture profiled.

On C, the Windows Performance Counters were used through the functions QueryPerformanceCounter() and QueryPerformanceFrequency(). The former extracts the current value of the high-resolution counter at the moment of the call; the latter returns the counter frequency, from which the elapsed time between two counter readings is computed.

On CUDA, the Event Management API provides functionality to create, destroy and record events. Hence, it is possible to measure the time taken to execute a specific part of the code, such as a kernel call, in the manner described in [1]. Concerning the clock cycles, the clock() function was used within the kernel to obtain the measurement.

On Matlab, a simple approach is provided through a built-in stopwatch, controlled with the tic and toc syntax. The first starts the timer and the second stops it, displaying the time, in seconds, taken to execute the statements between tic and toc. The Matlab number of clock cycles was not measured, since no simple way to do it was found.

At last, on the FPGA, it is possible to infer the execution time directly from the implemented architecture. With the knowledge of the clock rate, explicitly defined by the designer, and of the number of clock cycles taken to process the input data, extracted from the waveforms or from the architecture itself, the following expression can be used:

execution time = number of clock cycles / clock frequency    (1)

III. CONVOLUTION

Mathematically, convolution can be expressed as a linear combination, or sum of products, of the mask coefficients with the input function:

g(x) = sum_{s=-a}^{a} w(s) f(x - s)    (2)

where f denotes the input function and w the mask. It is implicit that equation (2) is applied for every point in the input function.

It is possible to extend the convolution operation to two dimensions as follows:

g(x, y) = sum_{s=-a}^{a} sum_{t=-b}^{b} w(s, t) f(x - s, y - t)    (3)

Convolution has a limitation at the boundaries of an input image, since the mask may be positioned in such a way that some of its values do not overlap the input image. Two approaches are commonly used in the context of image processing: padding the edges of the input image with zeros or clamping the edges with the closest border pixel. In this work the first approach is used, as in Gonzalez [8].

Considering an image of size MxN pixels and a mask of size SxT, multiplication is the most costly operation. (MN)(ST) multiplications are performed and, consequently, the algorithm belongs to O(MNST).

If a mask w(x, y) can be decomposed into w1(x) and w2(y) in such a way that w(x, y) = w1(x) w2(y), where w1(x) is a vector of size (Sx1) and w2(y) is a vector of size (1xT), the 2-D convolution can be performed as two 1-D convolutions. In this case, the convolution is said to be separable and the algorithmic complexity decays from O(MNST) to O(MN(S+T)), allowing a more flexible implementation. Hence, the separable convolution can be expressed as in equation 4:

g(x, y) = sum_{s=-a}^{a} w1(s) sum_{t=-b}^{b} w2(t) f(x - s, y - t)    (4)

IV. IMPLEMENTATION

The separable convolution was implemented in C, CUDA and Matlab (built-in function), and the regular convolution [Eq. 3] was implemented in the FPGA. The reason to implement the regular convolution in the FPGA was performance limitations: the separable algorithm, although reducing the total number of operations performed [Eq. 4], requires the image data stream to be processed twice, once for lines and once for columns. Consequently, the column filter alone would take as much time as the regular convolution to process the entire image, because of the time required to fill the shift register and because the streaming interface can transmit only one pixel per clock cycle.

A. C Implementation

The C implementation of the convolution was based on [Eq. 4] and is fairly straightforward; the sequential separable algorithm is listed below. The image was first loaded into memory with the OpenCV C library. Then, for each input pixel, the column convolution (with mask w2, of size 2*b+1) was applied to it.

/* Line convolution */
for i ← 0 to number of lines - 1
  for j ← 0 to number of columns - 1
    g(i, j) ← 0
    for l ← -b to b
      if j - l >= 0 and j - l < number of columns
        g(i, j) ← g(i, j) + f(i, j - l) * w2(b + l)
      end-if
    end-for
  end-for
end-for

/* Column convolution */
for i ← 0 to number of lines - 1
  for j ← 0 to number of columns - 1
    o(i, j) ← 0
    for k ← -a to a
      if i - k >= 0 and i - k < number of lines
        o(i, j) ← o(i, j) + g(i - k, j) * w1(a + k)
      end-if
    end-for
  end-for
end-for

B. Matlab Implementation

For Matlab, the conv2() built-in function was used to perform the convolution.

C. CUDA Implementation

On CUDA, the algorithm is implemented through two different kernels: the first part is implemented through the line kernel and the second one through the column kernel. The development of the convolution was based on the algorithm of Podlozhnyuk [9], extended to support input images of any size.

Line Convolution Kernel

The threads were grouped in 2-D blocks of size 4x16, that is, 4 lines by 16 columns, and, in turn, the blocks were grouped in a grid whose size depends on the dimensions of the input image. Each thread in the block is responsible for fetching six pixels from the input image to per-block shared memory. By doing this, the accesses to the Global Device Memory are reduced, as long as the memory coalescing restrictions are guaranteed.

Figs. 1 and 2 illustrate the general idea with a block size of 4x4 instead of its real dimensions (4x16), for display purposes. The first line of the 2-D block (Fig. 1) is mapped to the first line of the input image (Fig. 2). In this way, thread0,0 is mapped to the pixels S0,0 (left apron region), S0,0+block_size, S0,0+2*block_size, S0,0+3*block_size, S0,0+4*block_size and S0,0+5*block_size (right apron region), and so on. Concerning the actual block size, 384 pixels, i.e., 16x6 (pixels per line) x 4 (number of block lines), are loaded to shared memory.

Figure 1. Example of a 2-D block of column size 16 (i.e., 16 threads) for line and column kernels.

After the loading stage, all threads within a block must synchronize their execution since, in the next stage, threads are going to access elements that were loaded by other ones. In order to do that, a call to the CUDA API function __syncthreads() is issued and the program can proceed correctly.

In the final stage, each thread is assigned the task to calculate four output pixels, which are in the same positions as the ones the thread was mapped to in the main region (Fig. 2).

Concerning the flexibility of the convolution filter, some images are not multiples of (BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_ROW_X is the number of columns per block and NUM_LOADS_PER_THREAD is the number of pixels fetched from the main region. In order to solve this, the kernel is launched again with the following offset from the beginning of the image:

rowOffset = width - BLOCK_SIZE_ROW_X * NUM_LOADS_PER_THREAD    (5)

By doing this, every column between rowOffset and the last will be calculated, even if some of them were previously calculated. Hence, it is not necessary to determine exactly which was the last calculated column in order to construct the rowOffset, which would increase the verification overhead. Additionally, the memory coalescing requirements are automatically satisfied for NUM_LOADS_PER_THREAD equal to four and BLOCK_SIZE_ROW_X equal to 16.

Column Convolution Kernel

The threads, in the column filter, are divided in 2-D blocks of size 8x16, or 8 lines by 16 columns. As in the line kernel, the grid size depends on the input image and each thread is responsible for fetching six pixels from the input image to shared memory.

The decision for 16 columns in this block size was made considering the memory coalescing requirements (i.e., half-warp access to contiguous memory positions). Conversely, the 8 lines in the block size tend to reduce the number of apron pixels loaded to shared memory, reduce the ratio (number of apron pixels/number of output pixels) and increase memory reuse.

In the same way as the line kernel, the column kernel is launched again for images not multiples of (BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD), in which BLOCK_SIZE_COLUMN_Y is the number of lines per block and NUM_LOADS_PER_THREAD is the number of fetches per thread from the image main region. Thus, the following offset was used:

columnOffset = height - BLOCK_SIZE_COLUMN_Y * NUM_LOADS_PER_THREAD    (6)

D. FPGA Implementation

For the FPGA implementation, the architecture depicted in Fig. 3 was developed with the assistance of SOPC Builder and Verilog HDL coding.

The architecture is responsible for the following functions. A grayscale or binary image in JPEG format, stored on Flash Memory, is converted to RAW format, that is, to a matrix of integer values between 0 (i.e., black) and 255 (i.e., white).

Such conversion was performed by the NIOS II Fast Core [10], the C library libjpeg [11] and the wrapper function for decompressing JPEG images available in [12]. Therefore, this constitutes the application software layer and is controlled by the NIOS II EDS.

Figure 2. Example of an image region with 96 pixels mapped to 16 threads of the line kernel. Pixels with the same color are mapped to the same thread (see Fig. 1).

Figure 3. Pipeline Architecture for Image Processing.

After that, the decompressed image is written to the pixel buffer (SDRAM chip lower addresses) and, in this way, the DMA (Pixel Buffer DMA Controller) is able to access it without interrupting the processor and transmit the pixels to the remainder of the pipeline (Image Processing Pipeline). Then, the image is processed through various streaming components, whose interface is called Avalon Streaming Interface [13], constituting the pipeline (Image Processing Pipeline, Fig. 3). Firstly, the image is processed by the User Streaming Component. Following, each pixel (8-bit grayscale) is converted to 30-bit RGB (RGB Resampler). Then, there is a dual-clock queue (Clock Crossing Bridge) acting as a bridge between two clock domains (100MHz, the general design clock, and 25MHz, the VGA clock). And, lastly, a VGA controller is used to display the processed image.

The implemented convolution module is interfaced to the rest of the architecture by means of the Avalon Streaming Interface.

This module is based on a finite state machine with four states: DATA_FILL_BUFFER_STATE, DATA_PROCESSING_STATE_1, DATA_PROCESSING_STATE_2 and DATA_END_PROCESSING_STATE.

Firstly, upon the reset signal, every position of the shift register is initialized with the value 0. The first state (DATA_FILL_BUFFER_STATE) consists of reading the input interface at each clock rise and storing the value read in the shift register (Fig. 4). This state lasts until floor(KS/2)*(IW) pixels have been read or, in other words, until the first valid pixel is positioned in the center gray region coordinate of Fig. 4.

Figure 4. Layout of the convolution module shift register. KS (Kernel Size) denotes the size, in one dimension, of the used kernel; that means, for a 3x3 kernel, KS = 3. IW (Image Width) denotes the width of the input image; that means, for a 640x480 image, IW = 640. The grey area indicates the pixels used in the convolution calculation.

Next, the DATA_PROCESSING_STATE_1 state is entered. The first moment of this state is depicted in Fig. 5.

Figure 5. Example of an image with pixel values ranging from 0 to 255 (left) and the values associated with the shift register, with KS = 3, in the first moment of the DATA_PROCESSING_STATE_1 state. The x symbol indicates a value not considered.

From this state until the end of the state machine (i.e., DATA_END_PROCESSING_STATE), the convolution sum (Eq. 3) is applied to the gray area and, at every clock cycle, an output pixel becomes available at the output interface (i.e., Avalon Streaming Source Interface). It is important to mention that this calculation is performed by a parallel combinational circuit sub-module (Convolution Operation, Fig. 6).

  • readequa

    Twithvaluand DATperfoneigpixethe oconsbotto

    LstatestateregiscaseDAT

    Iimagmodelemkernarch

    TloadAfteclockinter

    TexecmaskCUDpres

    d, or more spals the number

    This state is sih the exceptioue zero will be

    similarly TA_PROCESSformed, since hborhood) wls or to valuesone depicted sidered are theom-right side)

    Lastly, after te is modifiede. This last sster and the co). Next, the

    TA_FILL_BU

    It is importantges up to 640xdule. This is ments limitatinel and image hitecture itself.

    Figure

    The number oding period (ier this state, uk cycle, one orface.

    V.The graphs focution time fok size of 15, DA speedup fented in figur

    pecifically, whr of pixels of t

    imilar to DATn that, as thee forced into t

    to the fiSSING_ STAT

    the border pwill have its v

    s 0. The last min Fig. 5, wite ones located).

    the calculationd to DATA_state is respoounters to its

    state machiUFFER_STA

    t to highlight x480 are supp

    due the DEion. Thereforsizes were es.

    e 6. Convolution

    f clock cycles.e., state DAuntil the end output pixel is

    . RESULTS Aor the convol

    or various grayas well as th

    for Matlab, C es 7, 8 and 9.

    hen the numbthe input imag

    TA_PROCESre are no pixethe shift regis

    first input TE_1 state, a bpixels (i.e., wvalues convo

    moment of thisth the exceptid near the end

    n of the last _END_PROConsible for rdefault value

    ine returns toATE.

    that only kernported in the FE2 board mre, results instimated consi

    module block dia

    s were obtaineTA_FILL_BUof the state m

    s available at

    AND COMPARI

    lution operatiy scale imagehe number ofand the FPG

    ber of pixels ge.

    SSING_ STATels to be readter. By doing pixels of

    border treatmewithout a com

    luted to neigs state is similon that the vaof the image

    pixel, the preCESSING_STAresetting the s (i.e., zero ino its initial

    nels up to 5x5FPGA convolu

    memory and lnvolving diffidering the mo

    agram

    ed disregardinUFFER_STAmachine, at ethe output mo

    ISON on comparings resolutions

    f clock cyclesGA architectur

    read

    TE_1 d, the g this,

    the ent is

    mplete ghbor lar to alues (i.e.,

    esent TATE

    shift n this state

    5 and ution logic

    ferent odule

    ng the ATE). every odule

    g the for a

    s and re are

    Fthat behatendincrfor relat

    Fig

    Tspeeimpexceimagdue and Mat

    Tservotheand imp

    Twithparaboarparathaninter(Fig512xrelatCUD

    Amulperflargincrthe C

    For the convothe execution

    aves as a expod to maintain teases. The eximage resoluted to cache p

    gure 7. Average e

    The speedup gedup considelementations eption to this ge samples. Tthe large amohigh resource

    tlab and FPGAThe C imple

    ved as a coners parallel alg

    number of cllementations. The FPGA imh the parallelallelism apprord (DE2) rallelism calcun a true parresting to not

    g. 13) was smx512. Besidetivity low (i.DA (i.e., 1242Another limittiplier block

    formance for er FPGA witease the perfoCUDA implem

    olution applicn time graph (Fonential (note the same grow

    xception to thiutions 3300x2erformance.

    xecution times ofand various im

    graph (Fig. 9) ering CUDAand various

    is the Matlab The reason foount of arithme utilization oA. ementation ditrol implemegorithms. Beclock cycles pe

    mplementationconvolution

    oach because resource limilation the exerallel implemtice that the F

    mall, even lesses, the clock .e., 100MHz 2 MHz for the ation of the u

    ks that woulthe convolut

    th more resouormance of thementation.

    cation, it is pFig. 7) for all the log scale

    wth rate, as this is the Matl2400 and 409

    f the convolutionmage resolutions.

    ) shows an appA in regard

    image resolimplementati

    or this good Cmetic operationof the GPU in

    id not exploientation for ccause of that, erformed wor

    n explored socalculation,

    of the FPGAmitations. The

    ecution timesmentation (i.eFPGA numbe

    s than CUDArate used inin this desi

    e processor cloused FPGA w

    uld improve tion operationurces and a fahe FPGA and p

    ossible to obthree architecon the y-axis

    he image resolab implement96x4096, pos

    n with mask size o

    proximately std to the lutions. Againion for the twoCUDA speedns, high granun comparison

    it parallelismcomparison toits execution

    rse than the o

    me parallelismbut lacked a

    A and developerefore, due s were only we., CUDA). er of clock c

    A for image sin the FPGA gn) comparin

    ock). was the absen

    significantlyn. Hence, usifaster clock shpossibly overc

    serve ctures s) and lution tation ssibly

    of 15

    teady other n, an o last

    dup is ularity

    to C,

    m and o the

    n time others

    m, as a true pment

    the worse It is

    cycles ize of

    was ng to

    nce of y the ing a hould come

  • Figu

    Figu

    Ain awhilcerta

    IC, MBasethe cyclimplin imexplresomultband

    re 8. Number of c

    ure 9. Speedup of

    A positive poia small boardle the GPU bainly much lar

    In this paper, Matlab and FPed on results pbest performes and speedlemented FPGmage resolutioore better mlution imagetiple pipelinesdwidth [14] .

    clock cycles of thvarious imag

    f the Convolution resol

    int in favor of d, with the pboard needs arger and more

    VI. CONwe presented

    PGA for the copresented, it i

    mance in execdup in comp

    GA architecturon. That is duemassive amos, based on s, high theoret

    he convolution wige resolutions.

    with mask size olutions.

    f the FPGA is eripheral inte

    a PC to be coe power consu

    NCLUSIONS a comparison

    onvolution of s inferable thacution time, parison to C,re and increasee to the fact thounts of data

    its inherent tical peak of G

    ith mask size of 1

    of 15 and various

    that in can operfaces integronnected whicuming.

    n between CUf grayscale imaat CUDA prenumber of c

    , Matlab andes with the gr

    hat CUDA tena, such as features suc

    GFLOPS and

    15 and

    image

    perate rated, ch is

    UDA, ages. sents clock d the rowth nds to

    high ch as

    high

Regarding the FPGA architecture, it can be seen from the graphs that it performed well, although worse than CUDA, and kept a steady growth in execution time and number of clock cycles. It must be noticed that there are more dense FPGAs available that can operate at higher clock rates, which will certainly increase its performance and could even surpass the GPU.

Finally, it is possible to improve the performance of the FPGA algorithms even more. Dividing the input image into various squares and providing that each region can be transmitted to parallel convolution modules through different data streams can improve the performance roughly by the number of convolution modules. However, by doing this, more FPGA die area is consumed, which could possibly make it impractical.
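The image-splitting scheme suggested above can be sketched in C: each square region is extended by a halo of r = k/2 pixels so that an independent convolution module can process it without needing neighboring data. The tile size, the struct layout and all names here are our illustrative assumptions, not the paper's implementation:

```c
/* One square region of the input image, plus the clamped halo
 * needed to convolve it independently with a k x k mask. */
typedef struct {
    int x0, y0;    /* top-left corner of the region proper        */
    int w, h;      /* region size (without halo)                  */
    int hx0, hy0;  /* top-left corner including halo, clamped     */
    int hw, hh;    /* size including halo, clamped to the image   */
} Tile;

/* Split a width x height image into tile x tile squares for a k x k
 * mask. Returns the number of tiles written; `tiles` must hold at
 * least ceil(width/tile) * ceil(height/tile) entries. */
int split_tiles(int width, int height, int tile, int k, Tile *tiles)
{
    int r = k / 2, n = 0;
    for (int y = 0; y < height; y += tile) {
        for (int x = 0; x < width; x += tile) {
            Tile t;
            t.x0 = x;  t.y0 = y;
            t.w  = (x + tile <= width)  ? tile : width  - x;
            t.h  = (y + tile <= height) ? tile : height - y;
            t.hx0 = (x - r > 0) ? x - r : 0;
            t.hy0 = (y - r > 0) ? y - r : 0;
            int hx1 = (x + t.w + r < width)  ? x + t.w + r : width;
            int hy1 = (y + t.h + r < height) ? y + t.h + r : height;
            t.hw = hx1 - t.hx0;
            t.hh = hy1 - t.hy0;
            tiles[n++] = t;
        }
    }
    return n;
}
```

Each tile (with its halo) could then be streamed to a separate convolution module; the die-area cost grows with the number of modules, as noted above.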

ACKNOWLEDGMENTS

The authors are grateful to FAPESP, grants number 2010/04675-4 and 2009/17736-4, to the Department of Computer Science, Federal University of Sao Carlos, and to the Department of Electrical Engineering, Federal University of Rio Grande do Norte for the support throughout this work.

REFERENCES

[1] Nvidia Corporation. (2009). Nvidia CUDA Programming Guide. [Online]. Available: http://developer.nvidia.com/object/cuda_2_3_downloads.html
[2] Xilinx, C. (2010). Our History. [Online]. Available: www.xilinx.com/company/history
[3] S. Asano, T. Maruyama and Y. Yamaguchi, Performance Comparison of FPGA, GPU and CPU in Image Processing, in International Conference on Field Programmable Logic and Applications - FPL 2009, Prague, 2009, pp. 126-131.
[4] S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach, Accelerating Compute-Intensive Applications with GPUs and FPGAs, in Symposium on Application Specific Processors - SASP 2008, Anaheim, 2008, pp. 101-107.
[5] S. Kestur, J. D. Davis and O. Williams, BLAS Comparison on FPGA, CPU and GPU, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Lixouri, Kefalonia, 2010, pp. 288-293.
[6] S. J. Park, D. R. Shires and B. J. Henz, Coprocessor Computing with FPGA and GPU, in DoD HPCMP Users Group Conference, Seattle, 2008, pp. 366-370.
[7] R. Weber, A. Gothandaraman, R. J. Hinde and G. D. Peterson, Comparing Hardware Accelerators in Scientific Applications: A Case Study, in IEEE Transactions on Parallel and Distributed Systems, 2011, Vol. 22, no. 1, pp. 58-68.
[8] R. C. Gonzales and R. E. Woods, Image Enhancement in the Spatial Domain, in Digital Image Processing, 3rd ed. Prentice Hall, 2008.
[9] V. Podlozhnyuk. (2007, Jun.). Image Convolution with CUDA. [Online]. Available: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_64_website/projects/convolutionSeparable/doc/convolutionSeparable.pdf
[10] Altera CO. NIOS II Processor. [Online]. Available: www.altera.com/devices/processor/nios2/ni2-index.html
[11] Independent JPEG Group. libjpeg. [Online]. Available: www.ijg.org
[12] Altera CO. Nios II System Architect Design. [Online]. Available: www.altera.com/support/examples/nios2/exm-system-architect.html
[13] Altera CO. (2011). Avalon Streaming Interface, Chap. 5. [Online]. Available: www.altera.com/literature/manual/mnl_avalon_spec.pdf
[14] D. B. Kirk and W. W. Hwu, Introduction, in Programming Massively Parallel Processors: A Hands-on Approach, 1st ed. Morgan Kaufmann, 2010.
