56
Reconfigurable Computing Reconfigurable Computing Applications Applications Chapter 9 Chapter 9 Prof. Dr. Prof. Dr. - - Ing. Jürgen Teich Ing. Jürgen Teich Lehrstuhl für Hardware Lehrstuhl für Hardware - - Software Software - - Co Co - - Design Design Reconfigurable Computing

Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Embed Size (px)

Citation preview

Page 1: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable ComputingReconfigurable Computing

ApplicationsApplications

Chapter 9Chapter 9

Prof. Dr.Prof. Dr.--Ing. Jürgen TeichIng. Jürgen TeichLehrstuhl für HardwareLehrstuhl für Hardware--SoftwareSoftware--CoCo--DesignDesign

Reconfigurable Computing

Page 2: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Overview Overview

Reconfigurable Computing 2

FPGAs have been used in the past mostly in

Rapid prototyping

Non-frequent reconfigurable systems

Hardware implementation, sometimes specific for the FPGA architectureThe most important application areas are:

Searching (text, genetic database, etc.)

Image processing

Mechanical control

Etc.

Page 3: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Searching Searching –– pattern matchingpattern matching

Reconfigurable Computing 3

Pattern matching is the basis of search engines

The purpose is to find and (count) the occurrence of a given pattern in a given text

Useful in:

DictionariesDocument collection indexingDocument filtering and classificationSpam avoidanceContent surveillance

Page 4: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Searching Searching –– pattern matching pattern matching –– sliding windowssliding windows

Reconfigurable Computing 4

Sliding windows (Cockscot & Foulk )

Keywords are kept in register. One character / ByteA set of comparators are used. One comparator / ByteHit signal is set whenever the text-segment matches the corresponding wordAdvantage:

Easy to replace old patternsDrawbacks:

Not flexible: Fixed length of registersRedundancy: more comparators than necessary for word with same prefix

Page 5: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Searching Searching –– pattern matching pattern matching -- sliding windowssliding windows

Reconfigurable Computing 5

Avoid redundancyUse only one comparator for common characters in different words

Data folding (Foulk)

Fold the data in the circuitConsider the bit-representation of each characterGenerate a comparator circuit for each character in the words to be searched for

Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit Bit

8-bit comparator

Bit Bit Bit Bit Bit Bit Bit Bit

01001110-Comparator

Page 6: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Searching Searching –– pattern matching pattern matching -- FSMFSM--BasedBased

Reconfigurable Computing 6

FSM-Based pattern matcherEach regular grammar can be recognized by an FSMIn pattern matching, the target words define the regular grammarThe target words are compiled in the automatonEach word defines a unique path from the start state to an end stateWhen scanning a text, the automaton changes its state with the appearance of charactersReaching a final state correspondsto the appearance of a wordRedundancy is avoided by implementing common prefix

FSM-Recognizer and correspondingstate transition table for the word conte

Page 7: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Searching Searching –– pattern matching pattern matching -- FSMFSM--BasedBased

Reconfigurable Computing 7

FSM-Based pattern matcherRAM-implementation

One RAM or ROM for storing the state transition tableOne state registerOne character registerA hit detectorThe Input character and the state register are used to determine the nextstateThe hit detector checks if the current state is equal to a hit state and sets a hit for the corresponding wordAdvantage:

Simple to implementDrawback:

Expensive in terms of flip flops

Char reg

RAMROM

State Reg Nextstate

Hit detect

Characterstream

RAM/ROM implementation of the word recognizer

Page 8: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Searching Searching –– pattern matching pattern matching -- FSMFSM--BasedBased

Reconfigurable Computing 8

FSM-Based pattern matcherOne-hot implementation

Each state is coded in one flip flopThe D-input of the flip flop is obtained by an AND of the output of the previous flip flop with the result of the comparatorThe comparator is character-specificOnly n FF are used to implement a word of length nAdvantage:

Low costReflects the structure of the grammar

Drawback: Not easy to buildRedundancy in the comparators

o n t e

c

Character-specific comparators

Page 9: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Searching Searching –– pattern matching pattern matching -- FSMFSM--BasedBased

Reconfigurable Computing 9

FSM-Based pattern matcherExploiting common prefix

For words with common prefix, only one common starting path corresponding to the length of the common prefix is used.Redundancy of comparators can be avoided by implementing only one comparator for each character.The result of the comparison will then be provided to all gates using them Words with common prefix

and the corresponding FSM

Page 10: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Searching Searching –– pattern matching pattern matching -- FSMFSM--BasedBased

Reconfigurable Computing 10

FSM-Based pattern matcherOptimized architecture

Implement the common prefix Redundancy of comparators is removed: Each character in the set is implemented in a position vector: pos(i) = 1 iff character i is detected Block diagram of the optimal

pattern matcher

Detailed structure of the optimal pattern matcher

Page 11: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable Computing 11

Searching Searching –– pattern matching pattern matching –– use of use of reconfigurationreconfiguration

Bit Bit Bit Bit Bit Bit Bit Bit

FSM-Based pattern matcherUse of reconfiguration

Replace the character comparatorsReplace the FSM for a set of words

Rec

onfig

urat

ion

New character comparator

New set of wordsReconfiguration

Page 12: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable Computing 12

Signal processing Signal processing –– distributed arithmetic distributed arithmetic --MotivationMotivation

Signal processing applications (FFT, Convolution, Filter algorithms) are characterized by MAC-intensive computations Signal processing functions are usually implemented on special processors

DSPsASICs

FPGAs provide the advantage of reconfigurability, but MAC-intensive applications are expensiveHowever, for MAC computations involving one constant vector, FPGAs present one of the best alternatives to DSPs

Page 13: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- BasicsBasics

Reconfigurable Computing 13

iji XA ∗∑Because the Ai are constant, there exist 2n possible values for

We can pre-compute the possible values and store them in a LUT (DALUT) and retrieve them on demand at run-time

( ) ( )ijijj

iji XA2=2XA=XA=Z ∗∗∗∗∗ ∑∑∑∑

With the binary representation for Xi: j

iji 2X=X ∑

( )∑ ∗∗ ii XA=XA=ZA constant row vector, X column vectorSolution of the following equation:

( )ijij XA2=Z ∗∗∑∑ is the classical form of distributed arithmetic

FPGA Advantage: Computation is memory-based (use of LUTs)

Page 14: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- BasicsBasics

Reconfigurable Computing 14

To better understand, we spread the DA equation

Z=[ ] 02( ) ( ) nn01n01n AX+AX ∗∗ −−.........................220110 AX+AX ∗∗

+ [ ] 12( ) ( ) nn11n11n AX+AX ∗∗ −−221111 AX+AX ∗∗ .........................

2W21W1 AX+AX ∗∗ ......................... ( ) ( ) nnW1nW1n AX+AX ∗∗ −− ] W2

. .........

.........

........

+ [

The bits of the variables will be used to address the memory andretrieve the required values in a bit-serial way.

The DA-datapath implementation is straightforward

Page 15: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- DatapathDatapath

Reconfigurable Computing 15

( )1W1W1 XX − ................

......

1011XX( )1W2W2 XX − ................ 2021XX

( )1WnnWXX − ................ n0n1XX ...

1A0

2A

21 A+A3A

13 A+A23 A+A

123 A+A+A4A

Z+/-

DA-LUT AddressDA-LUT

Parallel bit-serial input j-shift

Page 16: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- DatapathDatapath

Reconfigurable Computing 16

k-parallel( )1W1W1 XX − ................

......

1011 XX

( )1W2W2 XX − ................ 2021 XX

( )1WnnW XX − ................ n0n1 XX

DA-LUT 1 DA-LUT 2 DA-LUT k

ACC 1 ACC 2 ACCk

Z Adder tree

Page 17: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processingSignal processing-- distributed arithmetic distributed arithmetic -- ExampleExample

Reconfigurable Computing 17

Recursive convolution of time domain simulation of opticalmultimode intra/system interconnects

( ) ( ) 35322415041n0n xf+xf+xfxf+tyf=ty ∗∗∗−∗∗ −

Recursive formula to be implemented on 3 intervals

Comparison of different implementations Virtex 2000E implementation on

the Celoxica RC1000-PP board

Page 18: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– Fast Fourier Transform (FFT)Fast Fourier Transform (FFT)

Reconfigurable Computing 18

Fourier Series developed by the French Mathematician Joseph Fourier (1807)

Application of the initial idea in the field of heat diffusion

The advent of digital computation and “discovery” of Fast Fourier Transform (FFT) in the 50's revolutionized the field of signal processing

Practical processing and meaningful interpretation of signals with exceptional importance for human and industry

Medical monitors and scanners

Modern electronic communications

Image processing

etc.

Page 19: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT -- BasicsBasics

Reconfigurable Computing 19

Fourier transform F(u) of a continuous function f(x):dxe)x(f)u(F )ux2j( π−

∞−∫=

Given F(u), we can obtain f(x) by means of inverse Fourier transform

Extension in 2-D, F(u,v)

Corresponding inverse in 2-D

due)u(F)x(f )ux2j( π−∞

∞−∫=

due)v,u(F)y,x(f ))vyux(2j( +−∞

∞−∫= π

dxe)y,x(f)v,u(F ))vyux(2j( +−∞

∞−

∞−∫∫= π

Page 20: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT -- BasicsBasics

Reconfigurable Computing 20

Fourier transform of a discrete function of one variable F(u)

Given F(u), we can obtain, f(x) by means of inverse Fourier transform

1M,...,0u,1M,...,0x,e)x(fM1)u(F )M/ux2j(

1M

0x−=−== −

=∑ π

1M−1M,...,0u,1M,...,0x,e)u(F)x(f )M/ux2j(

0x−=−== −

=∑ π

The Brute force computation of the Fourier transform requires M2 multiplications and additions

The performance can be improved to (M log M) using the successive doubling method

Page 21: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT -- BasicsBasics

Reconfigurable Computing 21

For notational convenience, we replace the previous equation with:

)M/2j(M

)ux(M

1M

0xeW,W)x(f

M1)u(F π−

=== ∑

We assume M to be of the form

With n and K being positive integers each, we have:

K22M n ==

]WW)1x2(fK1W)x2(f

K1[

21)u(F )u(

K2)ux(

K

1K

0x

))x2(u(K2

1K

0x++= ∑∑

=

=

)ux(K2

1M

0xW)x(f

K21)u(F ∑

==

Fodd(u)Feven(u)

Page 22: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT -- BasicsBasics

Reconfigurable Computing 22

Since and , we have

]W)u(F)u(F[21)u(F u

K2oddeven +=

The FFT is a fast implementation of the Discrete Fourier Transform (DFT).

Based on a divide-and-conquer model,M log2M computations

]W)u(F)u(F[21)Ku(F u

K2oddeven −=+

uM

)Mu(M WW =+ u

M2)Mu(

M2 WW −=+

Page 23: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT -- AlgorithmAlgorithm

Reconfigurable Computing 23

The N-point DFT computation can be divided into two N/2-point DFT computation. These N/2-point DFT computations can be divided into two N/4-point DFT computations, and so on.

There are log2(N) stages

After the division and the DFT computation, a merging process is performed, in which the transforms are reassembled. Merging process

Page 24: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT –– Algorithm Algorithm –– butterfly unitbutterfly unit

Reconfigurable Computing 24

The butterfly unitThe reassembling is done by using the complex elements say, g and h from the previous stage. Current element = g - h*WN

k and g + h*WNk

The twiddle factorThe twiddle factor terms (of the WN

k = exp(j 2* π*k/N) must be availableThe real and complex parts of these factors are stored in a ROM.The factors correspond to a sine (imaginary part) and cosine (real part) functions, in a set of N/2 equally-spaced angles in an interval from 0 to (N-2)* π/N. Therefore, only a N/2-position memory is needed.

g

h * Wk

N-

+

Page 25: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT –– FPGA implementationFPGA implementation

Reconfigurable Computing 25

16 points FFTPipelined READ, EXECUTE and OUTPUT stage

Read one complex input (32-bits) in every cycle16 point input is read in 16 cycles

Output one complex result (32-bits) in every cycle

EXECUTE stage also takes 16 cycles

Performance Latency: 17 cyclesThroughput: 1 transform per 16 cycles

READ EXEC OUTPUT

READ EXEC OUTPUT

Page 26: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT –– FPGA implementationFPGA implementation

Reconfigurable Computing 26

Completing the execution stage in 16 cycles results in large fan-outs for the internal registers: very poor timing characteristics

By trial and error, divide the execution stage into 4 pipelined sub-stagesThat necessitates 5x internal storage, but 2560 bits of storage is no big deal for XC2V2000.Final performance :

Latency : 65 cyclesThroughput : 1 transform per 16 cycles (same as before)

R E1 E2 E3 E4 O

R E1 E2 E3 E4 O

Page 27: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Signal processing Signal processing –– FFT FFT –– FPGA implementationFPGA implementation

Reconfigurable Computing 27

Resource requirements:

1432 out of 21504 flip-flops (6%)1037 out of 21504 LUT (4%)65 out of 624 I/O pins (10%)

TimingMinimum clock period: 5.899ns (Maximum frequency: 169.520MHz)

Power estimation 440 mW

Page 28: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Basic operatorsBasic operators

Reconfigurable Computing 28

Image processing algorithms usually process an image (set of points with a given characteristics such as color, gray level, luminance, etc.) point by point. The resulting pixels depend only on the pixels in the original picture. A sequential processor needs quadratic run-time to process a complete image. By using parallelism, each pixel can be computed independently.Many image processing system are based on the following operators:

Median filteringBasic Morphological operationsConvolutionEdge detection

Algorithms are often based on the moving window operator

Page 29: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Moving windowMoving window

Reconfigurable Computing 29

The moving or sliding window algorithm usually processes one pixel of the image at a timeThe value of the pixel is changed by a function of a local pixel region covered by the windowThe operator moves over the image to cover all pixelsFor a pipelined implementation, all the pixel of the windows must be accessed at the same time for each clockFPGA implementation uses FIFO buffers

FIFO1

FIFO2

RAM

W11 W12 W13

W21 W22 W23

W31 W32 W33

Disposed

FIFO1

FIFO2

W11 W12 W13

W21 W22 W23

W31 W32 W33

DisposedW14 W15

W24 W25

W34 W35

W41 W42 W43

W51 W52 W53

W44 W45

W54 W55RAM

FIFO3

FIFO4

3x3 and 5x5 moving windows

Page 30: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Moving window Moving window -- FPGAFPGA

Reconfigurable Computing 30

FIFO ImplementationFIFOs are implemented using circular buffers constructed from Multi-ported RAMs (available in, e.g., Virtex FPGA)Indexes keep track of the front and tail items in the bufferBLOCK RAMs are readable and writable in one clock-cycle. This allows a throughput of one pixel per cycle.

3x3 windows2 buffers of size W-3 (W = image width) are usedThe two FIFO buffers must be full to access all the window pixels in one cycleIn each clock cycle, a pixel is read from the memory and placed into the bottom left cornerThe content of the window is shifted to the right with the right most member being added to the tail of the FIFOThe top right pixel is disposed after computation, since it is not used in the future computation

Page 31: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Median filteringMedian filtering

Reconfigurable Computing 31

BasicsAn impulse noise (or salt and pepper noise) in an image has a gray level with higher low different from the neighbor point.Linear filters have no ability to remove this type of noiseMedian filters share remarkable advantages on removing this type of noiseOften used in digital signal and image/video applications

ImplementationUse a sliding window of odd size (e.g., 3x3) over an imageAt each window position, the median of the sample values is taken to replace the value at the center of the windowHigh computational cost O(N log N) even using most efficient sorting algorithmsGeneral purpose processors are not a good solutions for real time implementation. This justifies the use of FPGAs.

Page 32: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Median filteringMedian filtering

Reconfigurable Computing 32

Sequential implementation(pseudo code)For x=1 to # rows

For y = 1 to # colsBuild Windows arraypixel(x,y) = Median(window array)

EndEnd

Complexity O(#rows X #cols x NlogN) (N=3)

Hardware sorting implementation

10 5 2014 3 1115 25 2

11

2 3 5 10 11 14 20 25

Median

Page 33: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Median filtering Median filtering -- resultresult

Reconfigurable Computing 33

Original image Filtered image

Page 34: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Basic Morphological OperatorsBasic Morphological Operators

Reconfigurable Computing 34

Morphology in image processing studies the appearance of objects. Useful for example in:

Skeletonization Edge detectionRestoration

ProcessingThe image is processed pixel-by-pixel using a structuring element (the sliding windows)The window may fit or not to the image

Most basic building blocks:Erosion (shrinks or erodes an object in the image)Dilation (grows the image)Operations like opening and closing of an image can be derived by performing erosion and dilation in different order

Page 35: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Basic Morphological OperatorsBasic Morphological Operators

Reconfigurable Computing 35

Erosion Replaces the center pixel in the sliding window by the smallest pixel value in the window arrayThe bright area of the image shrinks, or erodes

Dilation Replaces the center pixel in the sliding window by the greatest pixel value in the window arrayThe bright area of the image grows

AlgorithmSame as the medianInstead of selecting the median element,the minimum is selected for erosion and the maximum is selected for dilation

Page 36: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Median filtering Median filtering -- resultresult

Reconfigurable Computing 36

DilationOriginal image Erosion

Page 37: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Convolution Convolution -- BasicsBasics

Reconfigurable Computing 37

Convolution multiplies two arrays of numbers with different sizes and produces a third array of numbersIn image processing, convolution implements operators whose output pixels are computed as linear combinations of certain input pixels values.1-D Convolution

Formally convolution takes two input functions f(x) and g(x) andgenerates h(x) = f(x)*g(x) where g(x) is referred to as the filter:

2-D ConvolutionMost important in modern image processingA finite size window (convolution mask) is scanned over the imageThe output pixel value is the weighted sum of the input pixels within the windowThe weight is the value of the filter assigned to each pixel in the window

( ) ( ) ( )dττxgτf=xh −∫∞

∞−

Page 38: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Convolution Convolution -- BasicsBasics

Reconfigurable Computing 38

2-D Convolution

Mathematically represented by the following equation:

where x is the input image, h is the filter and y is the output image

Supports a virtual infinite variety of masks, each with its own feature

3x3 convolutions are most commonly used and operate only on a pixel and its directly adjacent neighbours

( ) ( ) ( )jni,mxji,h=nm,yheightimg

0=i

widthimg

0=j−−∑ ∑

− −

P1 P2 P3

P4 P5 P6

P7 P8 P9

W1 W2 W3

W4 W5 W6

W7 W8 W9

P*∑

∑9

1=ii

i9

1=ii

W

PW=P

Page 39: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Convolution Convolution –– Gaussian filtersGaussian filters

Reconfigurable Computing 39

Gaussian convolution filters1-D 2-D

The idea is to use the 2-D distribution as a point spread function. This is achieved by convolutionA discrete approximation of the Gauss function is required to perform the convolutionIn theory, Gauss distribution is zero anywhere. Therefore an infinite large convolution kernel may be requiredBut in practice, the convolution kernel is truncated as shown in the pictures

( ) 2

2

σ2x

eπσ21=xG

( )( )

2

22

σ2y+x

eπσ21=yx,G

21 31 21

31 48 31

21 31 21

1256

2 4 5

4 9 12

5 12 15

4 2

9 4

12 5

4 9 12

2 4 5

9 4

4 2

1115

3x3 Gaussian smooth filter 5x5 Gaussian smooth filter

1.4

Page 40: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Convolution Convolution –– Gaussian filtersGaussian filters

Reconfigurable Computing 40

ConvolutionOriginal image

Page 41: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Edge detection Edge detection -- BasicsBasics

Reconfigurable Computing 41

Edges Placed in image with strong intensity contrastOften occurs at image location representing boundaries

Edge detectionExtensively used in image segmentation, i.e., dividing an image into areas corresponding to different objectsRepresenting an image by its edges significantly reduces the amount of dataSince edges correspond to strong illumination gradients, the derivatives of the image are used to compute the edgesOperators often used are

Laplace operatorSoebel operatorCanny edge detection algorithm

Page 42: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Edge detection Edge detection -- operatorsoperators

Reconfigurable Computing 42

LaplaceGradient operatorIntensity difference are enhanced. The edges are more pronouncedFor each pixel, the gray value of its four neighbours (top, left, bottom, right) pixel value are subtracted from its own value

Soebel operatorCombination of two 1-D operators

One for detecting horizontal edgesOne for detecting vertical edges

-1

-1 4 -1

-1

Laplace operator

-1 1

-2 2

-1 1

Soebel-x

1 2 1

-1 -2 -1

Soebel-y

Page 43: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Convolution Convolution –– Gaussian filtersGaussian filters

Reconfigurable Computing 43

Edge detectionOriginal image

Page 44: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Image processing Image processing –– Use of reconfigurationUse of reconfiguration

Reconfigurable Computing 44

Intelligent image processing systemAccording to input image and other conditions,

Some operations are done to improve the image

Filtering (the correct filter is chosen)Smoothing Segmentation (Edge detection)Skeletonization

Some adjustments are done on the image input hardware

CalibrationFocussing

Everything is done while the system keeps runningFixed parts of the system will run continuouslyReconfigurable must be replaced at run-time

Page 45: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Mechanical Control Mechanical Control –– BasicsBasics

Reconfigurable Computing 45

Controller task is to influence the dynamic behavior of a plantInputs values for the plant depends on plant's outputs (Feedbacks)A plant is modeled as a linear time invariant (LTI)-SystemController is modeled as LTI-System Time discretization

Scaling to fix-pointk, k+1, k+2 …sample pointsT… sample periodtc… calculation time of controller

Plant

Controller

D/A-C

D/A-C

referencevalue

TTkk kk+1+1 kk+2+2

tt

ttcc

Page 46: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Mechanical Control Mechanical Control –– BasicsBasics

Reconfigurable Computing 46

kur

kyr

kkk

kk1kuDxCyuBxAxrrr

rrr

+=

+=+

⎥⎥⎥

⎢⎢⎢

⎡==

...vMvM

vMz row2

row1v

v

rrcombine A,B,C,D to M,combine A,B,C,D to M,represent computation asrepresent computation asset of scalar products ofset of scalar products ofeach row of M witheach row of M with

u: Controller input, x: State, y: OutputA, B, C, D: Constant coefficient matrices

vr

Page 47: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Mechanical Control Mechanical Control –– DADA--ImplementationImplementation

Reconfigurable Computing 47

∑=

==n

1iiiaxxaz (1)

rr

∑−

=

−=1w

0j

jiji 2xx (2)

∑ ∑

∑ ∑

= =

=

=

=

=

1w

0j

n

1iiij

j

n

1i

1w

0j

jiji

ax2

2xaz (3)

Scalar product:const. vector and var. vector

xi as w-bit fix-point(here x just unsigned in [0,2[ )

replace (2) in (1),swap the sums

since xij is in {0,1}right sum can have just 2n values

pre-compute it and store it in a 2n x w ROM as LUT

20 2-1 2-2 2-3 2-(w-1)…

ar x

v

Page 48: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Mechanical Control Mechanical Control –– DADA--Implementation (Parallel)Implementation (Parallel)

Reconfigurable Computing 48

Coeffic ient Look - Up T able(DA LUT )

Inpu t R eg is te r

x 1> > c

x 2> > c

DA L UT1x [1..n],w-1

DA L UT2x[1..n],w-2

Data P ath ( DP )( c + 1 Input A dder )

Res ult ( z )

> > c

DA L UTcx n> > c x[1..n],w-c

.. .

...

......

w w w

w

n

n

> > 0> > c -1 > > c -2∑−

=

−=1w

0jj]n..1[

j )x(DALUT2z

c-Bit at a time Architecture

Page 49: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Mechanical Control Mechanical Control –– Multi controller systemMulti controller system

Reconfigurable Computing 49

Many controller modules optimized for different operating regimes

Controllers have different structures not only different coefficients

Supervisor observes plant and determines best controller module

Multiplexer switches controller outputs

Page 50: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Mechanical Control Mechanical Control –– Multi controller systemMulti controller system

Reconfigurable Computing 50

Periodic execution of task graphConditional branching to controller modules (CM)CMs implement various area/time trade-offs possible

tex

area

Page 51: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable Computing 51

Mechanical Control Mechanical Control –– Multi controller system Multi controller system --Use of reconfigurationUse of reconfiguration

One slot solution

Page 52: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable Computing 52

Mechanical Control Mechanical Control –– Multi controller system Multi controller system --Use of reconfigurationUse of reconfiguration

Two slots solution

Page 53: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable Computing 53

Mechanical Control Mechanical Control –– Multi controller system Multi controller system --Use of reconfigurationUse of reconfiguration

)(:timeexecution

CMCMex Aft =

)2/()(22

)2()2(min

1)2(

AftTtfAA

CMex

CMex

CM

==== −

)()()(

)1()1()1(min

1)1(

AgAfttTtfAA

CMrec

CMex

CMex

CM

+=+=== −

CM

CMCMreconfig

ArAgt

×≈= )(

: timereconfig.

Page 54: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable Computing 54

Mechanical Control Mechanical Control –– Multi controller system Multi controller system --FPGA Implementation FPGA Implementation -- inverse penduluminverse pendulum

Synthesis results of controller:

FPGA: Virtex 800Area: 1003 slices (ca. 10%) Clock Rate: > 70 Mhz

Computation Time: tc < 1 µs

The Raptor 2000 Board

Host Simulates PlantHost configurations

RAPTOR 2000Comm. ResourcesConfiguration Manager

FPGAController ModuleSupervisorCommunication

Controller characteristics:Dimensions: p=3, n=2, q=3Word-width: Input: 16 Bit, Intern: 32 Bit

Page 55: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable Computing 55

Mechanical Control Mechanical Control –– Multi controller system Multi controller system --FPGAFPGA--Implementation Implementation -- ArchitectureArchitecture

Two-slot implementationOne slot implementation

CM1

reconfiguring…

CM1

CM1

CM2

CM2

reconfiguring…

CM2

CM3

Page 56: Reconfigurable Computing Applications Chapter 9 - … Computing Applications Chapter 9 ... The result of the comparison ... Parallel bit-serial input j-shift

Reconfigurable Computing 56

Mechanical Control Mechanical Control –– Multi controller system Multi controller system --FPGAFPGA--Implementation Implementation -- ArchitectureArchitecture

ReconfigurableModule Slot

(Controller Module)

Fix Module Slot(Supervisor)

Bus

Mac

ros