05958631

8/3/2019 05958631

Copyright (c) 2011 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

FOR IEEE TRANSACTIONS ON SIGNAL PROCESSING 1

Memory-Efficient Architecture for 3-D DWT Using Overlapped Grouping of Frames

Basant K. Mohanty, Member, IEEE and Pramod K. Meher, Senior Member, IEEE

Abstract

In this paper, we present a memory-efficient architecture for 3-D DWT using overlapped grouping of frames. The proposed structure does not involve any line-buffer or frame-buffer for 1-level 3-D DWT. It involves only a frame-buffer of size O(MN) to compute multilevel 3-D DWT, unlike the existing folded structures, which involve a frame-buffer of size O(MNR). The saving of line-buffer and frame-buffer by the proposed structure for the implementation of first-level DWT is of substantial advantage, since the frame-size is very often as large as 1920 × 1080 and the frame-rate varies from 15 to 60 fps. The proposed structure has a small cycle period and offers small output latency compared to the existing structures. Compared with the best of the available designs, the proposed design involves significantly fewer memory words. For frame-size 176 × 144 and frame-rate 60 fps, the proposed structure involves 7.96 times fewer memory words and 12.3% less average computation time (ACT) than the best of the existing folded designs. It involves 4.28 times fewer memory words than the recently proposed parallel design. The synthesis result for frame-size 176 × 144 and frame-rate 60 fps for the FPGA device 6VLX760FF1760-2 shows that the proposed structure involves 9.6 times fewer BRAMs and offers 2 times higher throughput than the folded design. It involves 1.9 times fewer BRAMs than the parallel design and offers nearly the same throughput rate. The proposed structure has significantly less slice-delay-product (SDP) than the existing structures. Owing to its lower memory complexity, the proposed structure dissipates significantly less dynamic power than the existing structures.

Index Terms

Discrete wavelet transform, 3-dimensional DWT, Overlapping frames, parallel and pipeline architecture, VLSI

I. INTRODUCTION

THREE-dimensional (3-D) discrete wavelet transform (DWT) is applied in video compression, compression of 3-D and 4-D medical images, volumetric image compression, video watermarking and many other applications [1]-[6]. The generic structure for the computation of multilevel 3-D DWT based on the popularly used separable approach is shown in Fig. 1, where the intra-frame DWT is performed row-wise and then column-wise by the row-processor and the column-processor, respectively, and the inter-frame computation is performed by the temporal-processor. As shown in the figure, the 3-D DWT structure comprises two types of hardware components: (i) a combinational component and (ii) a memory/storage component. The combinational component consists mainly of arithmetic circuits and multiplexors, and the memory component consists of a frame-memory, temporal-memory, registers and transposition-memory. Frame-memory is usually external to the chip, while temporal-memory may be either on-chip or external. The on-chip transposition-memory stores the intermediate values resulting from the row processing, while the temporal-memory stores the intermediate values resulting from the column processing of a set of successive frames. The frame-memory is used for storing the low-low-low (LLL) subband to compute the multilevel 3-D DWT level-by-level [7]. The size of the frame-memory is MNR, the size of the temporal-memory is KMN, and the size of the transposition-memory is KN, where M and N are the height and width of each video frame, R is the number of frames in a group of frames (GOF) and K is the order of the wavelet filter.

Fig. 1. Generic structure of multi-level 3-D DWT computation

Manuscript submitted on January 26, revised on 23 May and 6 July 2011. This paper was recommended by Associate Editor Tong Zhang.
B. K. Mohanty is with the Dept. of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226 (email: [email protected]).
P. K. Meher is with the Department of Embedded Systems, Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632 (Email: [email protected] star.edu.sg, URL: http://www.ntu.edu.sg/home/aspkmeher/).

In general, the complexity of the combinational component of the 3-D DWT structure depends on the filter order, which is usually small, while the complexity of the memory component depends on the frame-size. Since the frame-size for practical video applications may vary from 176 × 144 (for low-end mobile phones) to 1920 × 1080 (for HDTV), the computation of 3-D DWT is highly memory-intensive, and it is a challenging task to implement the complete 3-D DWT in a single chip. On the other hand, communication with external memory heavily degrades the speed and power performance of the whole system.
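To give a feel for the scale involved, the storage sizes of the generic structure (frame-memory MNR, temporal-memory KMN, transposition-memory KN) can be evaluated for the two frame-sizes mentioned above. The following is only a minimal sketch: the function name `memory_words`, the group size R = 8 and the filter order K = 4 are assumptions chosen for illustration and are not taken from the paper.

```python
def memory_words(M, N, R, K):
    """Storage sizes (in words) of the generic separable 3-D DWT structure."""
    frame = M * N * R        # frame-memory: MNR words
    temporal = K * M * N     # temporal-memory: KMN words
    transposition = K * N    # transposition-memory: KN words
    return frame, temporal, transposition


# Assumed example parameters: R = 8 frames per group, K = 4 (a 4-tap filter).
# M is the frame height and N the frame width.
for width, height in [(176, 144), (1920, 1080)]:
    frame, temporal, transposition = memory_words(M=height, N=width, R=8, K=4)
    print(f"{width}x{height}: frame={frame}, temporal={temporal}, "
          f"transposition={transposition}")
```

Under these assumed parameters, the frame- and temporal-memory grow with the frame area, while the transposition-memory grows only with the frame width, which is why the former two dominate for large frames.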

A few computing schemes and architectures have been suggested to reduce the memory requirement of 3-D DWT [8]-[10], [13], [15], [16]. A more detailed discussion of these designs is given in [16]. The existing computation schemes are still found to involve very large memory. Also, it is observed that the existing block-by-block methods degrade the PSNR quality due to blocking artifacts, since the transformation of an infinite sequence of video frames into independent GOFs introduces noise at the boundaries, which results in loss of PSNR. To avoid this loss, the DWT needs to be performed continuously on the infinite video sequence (called the running 3-D DWT) [11]. Keeping this in view, Das et al. [12]-[14] have suggested a scan-based architecture for running 3-D DWT of infinite GOFs. But the memory requirement of this structure is significantly higher than that of [15].

The transform coder should decompose the 3-D signals in multiple levels to achieve a higher compression ratio. But most of the existing designs [12]-[14] compute only 1-level running 3-D DWT of the infinite video sequence. The structure of [15], however, computes the multilevel 3-D DWT by a level-by-level approach (similar to the folded scheme proposed by Wu et al. [7]) using an external frame-buffer of size (MNR)/8. Other existing structures, which transform the 3-D signal into independent GOFs, can compute the multilevel 3-D DWT in time-multiplexed form in a folded structure. But the multilevel 3-D DWT of infinite GOFs cannot be implemented by such a folded method using a limited frame-buffer. Since the sizes of the transposition-memory and temporal-memory remain unchanged with the input-block size, they could be utilized efficiently by calculating more DWT coefficients concurrently. The frame-buffer size could be reduced by using separate computing blocks for different decomposition levels. To take care of this issue, we have recently proposed a parallel architecture for multilevel 3-D DWT [16], which computes multilevel running 3-D DWT on infinite GOFs and overcomes the limitations of other existing structures. However, it has some inherent problems associated with the selection of the input-block size (Q) for a given frame. The input-block size needs to be an integer multiple of the frame width (N), and to achieve 100% hardware utilization efficiency (HUE), the minimum block size for J-level DWT is $2^{3J-2}$. For example, the HUE of the pipeline structure for J = 3 is 100% for Q = 128, 98.64% for Q = 64, and 91.25% for Q = 16. From this observation we can infer that the higher the input block-size, the better the resource utilization of the structure. But the structure demands more device resources and I/O for higher block-sizes. If the application is resource-constrained, it can, however, be implemented for lower block-sizes with less than 100% HUE.

To overcome the above difficulties, we propose here an alternative approach to compute multilevel running 3-D DWT on infinite GOFs. Using the proposed scheme, we have derived a pipeline architecture which involves less on-chip and off-chip memory than the existing structures. Interestingly, the input block-size of the proposed structure is independent of the DWT levels. The key ideas we have used in our current approach are:

• While grouping the frames, some boundary frames of consecutive groups of frames are overlapped, in order to avoid temporal memory, which would otherwise have been needed by the temporal processor.

• Row-column processing of each input frame is scheduled suitably, such that the row-processor generates the required intermediate results to be consumed by the column-processor without transposition.

• Processing of overlapping frames involves some redundant computation. The number of frames to be overlapped at





by (K − 2) frames. As shown in Fig. 2(b), frame overlapping necessitates some redundant row and column DWT computation for the overlapping frames. The transposition- and temporal-memory of 1-level 3-D DWT can be avoided completely at the cost of (K − 2) extra pairs of row and column processors to process the overlapping frames. In spite of the additional row- and column-processors, overlapped grouping of frames has the potential to provide substantial saving in hardware by eliminating the transposition-memory of O(N) and the temporal-memory of O(MN), since the size of the wavelet filters is usually very small compared to the frame size of all practical videos.

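TABLE I

INPUT AND OUTPUT FRAMES TO ELIMINATE TEMPORAL-MEMORY

| One output frame: IF | One output frame: OF | Group of output frames: IF | Group of output frames: OF |
|---|---|---|---|
| $IF_1$ to $IF_K$ | $OF_1$ | $IF_1$ to $IF_{K+2}$ | $OF_1$ and $OF_2$ |
| $IF_3$ to $IF_{K+2}$ | $OF_2$ | $IF_1$ to $IF_{K+4}$ | $OF_1$ to $OF_3$ |
| $IF_5$ to $IF_{K+4}$ | $OF_3$ | $IF_1$ to $IF_{K+6}$ | $OF_1$ to $OF_4$ |
| : | : | : | : |
| $IF_n$ to $IF_{n+K-1}$ | $OF_K$ | $IF_1$ to $IF_{3K-2}$ | $OF_1$ to $OF_K$ |

LEGEND: IF: input-frame, OF: output-frame, n = 2K − 1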

TABLE II

NUMBER OF OVERLAPPING INPUT FRAMES FOR PRODUCING OUTPUT FRAMES WITH DIFFERENT OVERLAPPING SIZE

| Number of input frames | Number of output frames | Overlapping input frames | Overlapping output frames |
|---|---|---|---|
| K | 1 | K − 2 | 0 |
| 3K − 2 | K | K − 2 | 0 |
| | | K | 1 |
| | | K + 2 | 2 |
| | | : | : |
| | | 3K − 6 | K − 2 |

The input frames to be processed in parallel for temporal-memory-free computation are given in Table I. From Table I, we find that successive groups of input frames are required to be overlapped by K − 2 frames. Note that successive groups of output frames in this case do not overlap. To produce overlapped groups of output frames, the number of overlapping frames of the input GOFs needs to be increased. Table II shows the number of input frames that must be overlapped to produce the overlapped GOFs. Using Table I and Table II, one can find the number of frames in a GOF to be processed in parallel to avoid temporal-memory in multilevel 3-D DWT computation. The number of overlapping frames for decomposition level-1 and level-2 is shown in Table III. The overlapping input frames of level-1 DWT are shown in grey color in Fig. 2(b) and those of level-2 DWT in Fig. 3.
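The frame bookkeeping summarized in Tables I to III can be sketched in a few lines. This is an illustrative reading of the table patterns only, not code from the paper; the function names `input_window`, `group_inputs` and `input_overlap` are hypothetical, and frames are numbered from 1 as in Table I.

```python
def input_window(j, K):
    """First and last input frame needed for output frame j (Table I pattern)."""
    first = 2 * j - 1            # temporal down-sampling by 2
    return first, first + K - 1  # a K-tap filter spans K input frames


def group_inputs(P, K):
    """Input frames needed to produce P output frames in parallel."""
    return K + 2 * (P - 1)


def input_overlap(K, output_overlap=0):
    """Input frames shared by successive groups (Table II pattern)."""
    return (K - 2) + 2 * output_overlap


K = 4  # Daub-4, used here only as an example
print(input_window(1, K), input_window(2, K))   # windows of OF1 and OF2
print(group_inputs(K, K))                       # level-1 group size, 3K - 2
print(input_overlap(K, output_overlap=K - 2))   # level-1 overlap, 3K - 6
```

With K = 4, the level-1 group of 3K − 2 = 10 input frames overlaps its neighbour by 3K − 6 = 6 frames so that the K output frames it produces overlap by K − 2 = 2 frames, which is what the level-2 temporal filter needs.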





TABLE III

OVERLAPPING INPUT FRAMES FOR TWO LEVEL DECOMPOSITION

| Decomposition level | Number of input frames | Number of overlapping frames | Number of output frames |
|---|---|---|---|
| level-1 | 3K − 2 | 3K − 6 | K |
| level-2 | K | K − 2 | 1 |

TABLE IV

COMPARISON OF MEMORY SAVING AND HARDWARE COST INVOLVED BY THE 3-D DWT COMPUTING STRUCTURE USING THE PROPOSED OVERLAPPING GROUP OF FRAMES FOR DIFFERENT SIZE WAVELET FILTERS

| Filter | Level-1 GOF (K) | Level-1 OVF (δ = K − 2) | Level-1 memory, KN(M + 1) | Level-1 multipliers (4Kδ) | Level-1 adders (4Kδ) | Level-2 GOF (3K − 2) | Level-2 OVF (3δ) | Level-2 memory, KN(5M − 2K + 6)/4 | Level-2 multipliers (18Kδ) | Level-2 adders (18Kδ) |
|---|---|---|---|---|---|---|---|---|---|---|
| Haar | 2 | 0 | 2N(M + 1) | 0 | 0 | 4 | 0 | N(5M + 2)/2 | 0 | 0 |
| Daub-4 | 4 | 2 | 4N(M + 1) | 32 | 32 | 10 | 6 | N(5M − 2) | 144 | 144 |
| Daub-6 | 6 | 4 | 6N(M + 1) | 96 | 96 | 16 | 12 | 3N(5M − 6)/2 | 432 | 432 |
| Daub-8 | 8 | 6 | 8N(M + 1) | 192 | 192 | 22 | 18 | 2N(5M − 10) | 864 | 864 |
| Daub-10 | 10 | 8 | 10N(M + 1) | 320 | 320 | 28 | 24 | 5N(5M − 14)/2 | 1440 | 1440 |
| 5/3 | 5 | 3 | 5N(M + 1) | 30 | 48 | 13 | 9 | 5N(5M − 4)/4 | 135 | 216 |
| 9/7 | 9 | 7 | 9N(M + 1) | 126 | 224 | 25 | 21 | 9N(5M − 12)/4 | 567 | 1008 |

LEGEND: GOF: group of frames, OVF: overlapping frames. It is assumed that the symmetric property of the 5/3 and 9/7 filter coefficients is used and that these filters are implemented using the convolution method. For the 5/3 and 9/7 filters, 2K equals 5 and 9, where δ equals 3 and 7. Since the row processor does not involve any data registers and the column processor involves only (K − 1) data registers, we have excluded these registers while estimating the hardware cost, as the complexity of data registers is very small compared to the multiplier and adder complexity.

As shown in Fig. 2(b), K pairs of row and column processors (the K − 2 overlapping frames are shown in grey color) perform the necessary computation of K frames to avoid the temporal-memory (of size KMN) of level-1 DWT. Similarly, from Fig. 3 we find that (4K − 2) row-processors (RPs), (4K − 2) column-processors (CPs) and (K + 1) temporal-processors (TPs) perform the necessary computation of the (3K − 2) frames of an overlapped GOF to avoid the temporal-memory of level-1 and level-2. The row-column processing of decomposition level-1 is scheduled in such a way that the row-processor generates the required intermediate results to be consumed by the column-processor without transposition. Using overlapped grouping of frames and the row-column processing scheme, a memory space of KN(5M − 2K + 6)/4 words of 2-level DWT can be avoided at the cost of 4(K − 2) RPs, 4(K − 2) CPs and (K − 2) TPs.
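The memory and hardware figures quoted above can be cross-checked against the Daubechies rows of Table IV with a short calculation. The sketch below is an illustration under stated assumptions: each RP, CP and TP is taken to cost 2K multipliers and 2K adders (as assumed in the paper for its estimate), the symmetric 5/3 and 9/7 filters are not covered, and the function names `level1_cost` and `level2_cost` are hypothetical.

```python
from fractions import Fraction


def level1_cost(K, M, N):
    """Memory saved (words) and extra multipliers/adders for 1-level DWT."""
    memory = K * N * (M + 1)
    extra_units = 2 * (K - 2)              # (K-2) extra RPs + (K-2) extra CPs
    multipliers = adders = 2 * K * extra_units
    return memory, multipliers, adders


def level2_cost(K, M, N):
    """Memory saved (words) and extra multipliers/adders for 2-level DWT."""
    memory = Fraction(K * N * (5 * M - 2 * K + 6), 4)
    extra_units = 9 * (K - 2)              # 4(K-2) RPs + 4(K-2) CPs + (K-2) TPs
    multipliers = adders = 2 * K * extra_units
    return memory, multipliers, adders


# Daub-4 (K = 4) on an assumed 176 x 144 frame (N = 176, M = 144).
print(level1_cost(4, 144, 176))
print(level2_cost(4, 144, 176))
```

For K = 4 this gives 32 and 144 extra multipliers for the two levels, and the 2-level memory expression reduces to N(5M − 2), in agreement with the Daub-4 row of Table IV.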

To study the efficiency of the proposed frame-overlapping computing scheme, we have estimated the memory (in words) that could be saved and the extra hardware cost (in terms of multipliers and adders) involved in computing the row and column transformation of the overlapping frames. We have assumed the hardware complexity of each RP, CP and TP to be 2K multipliers and 2K adders, where K is the filter order. The estimated memory that can be saved by using the proposed scheme over the conventional method (frame-by-frame without frame overlapping) and the extra hardware cost involved for different wavelet filters are listed in Table IV. Since the frame-size of a practical video could be as high as 1920 × 1080 (screen size of HDTV), a significant amount of chip area could be saved by eliminating the transposition- and temporal-memory of the 3-D structure using the proposed scheme. It can be found from Table IV that the amount of memory saving offered by the proposed scheme for 2-level DWT is nearly 25% more than that of 1-level DWT, but 2-level DWT involves nearly 4.5 times the extra hardware cost of 1-level DWT. This is mainly due to the reduction in frame-size by a factor of 4 after every decomposition level, while the number of overlapping frames required to avoid the temporal-memory increases steadily by (2K − 4) frames for every higher level. The size of the group of frames also increases by (2j1 K) times after every higher level of DWT, where j is the DWT level.

TABLE V

COMPARISON OF SAVING TO OVERHEAD RATIO (SOR) INVOLVED BY THE 3-D DWT COMPUTING STRUCTURE USING OVERLAPPING GROUP OF FRAMES OF DIFFERENT FRAME-SIZES

| Filter | DWT level | SOR for frame-size 176 × 144 | SOR for frame-size 640 × 480 | SOR for frame-size 1920 × 1080 |
|---|---|---|---|---|
| Haar | 1 | | | |
| Haar | 2 | | | |
| Daub-4 | 1 | 127.5 | 1540.2 | 10385.6 |
| Daub-4 | 2 | 33.6 | 408 | 2755.8 |
| Daub-6 | 1 | 63.7 | 770 | 5192.4 |
| Daub-6 | 2 | 16.73 | 203.8 | 1377.3 |
| Daub-8 | 1 | 42.5 | 513.3 | 3461.6 |
| Daub-8 | 2 | 11.3 | 138.2 | 935.2 |
| Daub-10 | 1 | 31.8 | 385 | 2596.2 |
| Daub-10 | 2 | 8.8 | 107.9 | 730.7 |
| 5/3 | 1 | 151.3 | 1826.7 | 12317.5 |
| 5/3 | 2 | 39.9 | 486 | 3284 |
| 9/7 | 1 | 62.7 | 758 | 5111.6 |
| 9/7 | 2 | 16.45 | 201.4 | 1363.4 |

SOR is defined as the saving in memory words divided by the combinational overhead cost, where the combinational overhead cost represents the sum of the multipliers and adders required to process the overlapping frames. Memory words and hardware cost are measured in terms of transistor counts.

To measure the memory saving against the combinational overhead of the proposed scheme, we define the saving-to-overhead ratio (SOR). Since the complexities of memory and arithmetic components are widely different, we have estimated both the memory saving and the combinational overhead cost in terms of transistor counts; SOR is defined as the ratio of the memory saved to the combinational overhead cost. For the 3-D DWT structure, we assume that the input pixels are 8-bit and all the intermediate and final output signals are 12-bit. A few of the multipliers of the 1-level and 2-level structures are of 8-bit size, while all other components are of 12-bit size. Out of the 4Kδ multipliers of level-1, where δ = K − 2 is the number of overlapping frames, 2Kδ are of 8-bit size. Similarly, out of the 18Kδ multipliers of the 2-level structure, 6Kδ are of 8-bit size. The transistor counts of an 8-bit multiplier, a 12-bit multiplier, a 12-bit adder and a 12-bit SRAM word are taken to be 1178, 1674, 372 and 72, respectively. Using these values, we have estimated the SOR for different wavelet filters and frame-sizes. The values are listed in Table V. It can be found from Table V that SOR is maximum for the Haar wavelet (K = 2), as in this case the entire transposition- and temporal-memory could be eliminated from the 3-D structure without any redundant computation. For higher frame-sizes and low-order filters like Daub-4 and 5/3, SOR is significantly higher. We find that the SOR of the 2-level DWT is nearly 73% less than that of the 1-level DWT on average for different wavelet filters and frame-sizes. Keeping these facts in mind, we outline the proposed method to derive a memory-efficient hardware structure to compute multilevel running 3-D DWT:

• Low-order wavelet filters like Haar, Daub-4, Daub-6 and 5/3 should be preferred for 3-D DWT if they meet the desired SNR specification of the target application.

• The proposed frame-overlapping processing scheme should be applied to eliminate the temporal-memory of 1-level only, to get the maximum advantage of the scheme.

• Computation of higher DWT levels may be partitioned and appropriately scheduled to utilize the resources effectively.

Although the extra hardware cost is marginal (less than 2% of the memory saved) if we apply redundant computation to 1-level only and use the wavelet filters Daub-4, Daub-6, 5/3 and 9/7, the hardware cost could be reduced further by implementing the multipliers using some low-complexity design method. In this work, we have considered the Daubechies 4-tap (Daub-4) wavelet filter as an example to derive the proposed structure. However, similar structures could be derived for other wavelet filters as well. We also suggest an efficient design for the implementation of Daubechies wavelet filters for K = 4.
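The SOR estimate described above can be reproduced approximately for the Daub-4, 1-level case. The sketch below uses the transistor counts quoted in the text, the level-1 memory expression KN(M + 1) and the 32 multipliers and 32 adders listed for Daub-4 in Table IV, with half of the multipliers taken as 8-bit. Reading the frame-size "176 × 144" as M = 176 and N = 144 is an assumption, as is the function name `sor_level1_daub4`; the results land close to, but not exactly on, the Table V entries.

```python
# Transistor counts quoted in the text.
T_MULT8, T_MULT12, T_ADD12, T_SRAM_WORD = 1178, 1674, 372, 72


def sor_level1_daub4(M, N):
    """Saving-to-overhead ratio for Daub-4 (K = 4), first-level DWT only."""
    K = 4
    saved_words = K * N * (M + 1)       # Table IV, level-1 memory
    multipliers, adders = 32, 32        # Table IV, Daub-4, level-1
    mult8 = multipliers // 2            # half of the multipliers are 8-bit
    mult12 = multipliers - mult8        # the rest are 12-bit
    overhead = mult8 * T_MULT8 + mult12 * T_MULT12 + adders * T_ADD12
    return saved_words * T_SRAM_WORD / overhead


for M, N in [(176, 144), (640, 480), (1920, 1080)]:
    print(f"{M} x {N}: SOR = {sor_level1_daub4(M, N):.1f}")
```

Under these assumptions the overhead is well below 2% of the memory saved for all three frame-sizes, consistent with the remark above.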

III. PROPOSED ARCHITECTURES FOR 3-LEVEL 3-D DWT

The proposed structure for the implementation of 3-level 3-D DWT is shown in Fig. 4. It consists of three processing units (PUs). PU-1 performs the computation of first-level DWT, while PU-2 computes only the row and column DWT of the second level. PU-3 computes the temporal DWT of the second level and the entire computation of the third level in time-multiplexed form.

PU-1 receives four input blocks of four successive parallel frames in every cycle from the input buffer. The input blocks are fed to the structure in the order shown in Fig. 5. As shown in Fig. 5, each input block contains 6 consecutive samples of a particular row. The input block $I(m_1, m_2, n_3)$ corresponds to the $m_2$-th row of the $m_1$-th frame and contains the samples $\{x(m_1, m_2, 4n_3+5), x(m_1, m_2, 4n_3+4), x(m_1, m_2, 4n_3+3), x(m_1, m_2, 4n_3+2), x(m_1, m_2, 4n_3+1), x(m_1, m_2, 4n_3)\}$, for $0 \le m_2 \le M-1$, $0 \le n_3 \le (N/4)-1$ and $m_1 = 0, 1, 2, 3, \ldots$. The adjacent input blocks of a particular row are overlapped by 2 samples. If the first input block of the first row of a frame is fed in the first cycle, then the first input block of the second row is fed in the second cycle, such that the first input blocks of all the M rows of a particular frame are fed in M cycles, and in the next set of M cycles, the second input blocks of all the M rows are fed to the structure. All the MN/4 input blocks of a particular frame are thus fed to the structure in MN/4 cycles. Input blocks of four successive frames are fed to the structure in parallel during these MN/4 cycles. The successive groups of frames are overlapped by 2 frames. During the first set of MN/4 cycles, input blocks of the first GOF (F1, F2, F3, F4) are fed to the structure, and in the next set of MN/4 cycles, input blocks from the GOF (F3, F4, F5, F6) are fed. In this manner, input blocks of an infinite sequence of GOFs are fed to PU-1 continuously to compute the first-level 3-D DWT.
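The input ordering described above can be sketched as three small index generators. This is an illustrative model only: the function names are hypothetical, frames are numbered from 1 as in the text (F1, F2, ...), and the handling of the last block of a row, whose indices 4n3 + 4 and 4n3 + 5 fall beyond the frame width, is not modelled here.

```python
def block_sample_indices(n3):
    """Column indices of the 6 samples in block n3, highest index first."""
    return [4 * n3 + k for k in range(5, -1, -1)]


def frame_schedule(M, N):
    """Yield (cycle, m2, n3): all M rows of block-column n3, then the next n3."""
    cycle = 0
    for n3 in range(N // 4):
        for m2 in range(M):
            yield cycle, m2, n3
            cycle += 1


def gof_frames(g):
    """Frames of the g-th group (g = 0, 1, ...): 4 frames, overlap of 2."""
    return [2 * g + i for i in range(1, 5)]


print(block_sample_indices(0), block_sample_indices(1))
print(list(frame_schedule(M=2, N=8)))
print(gof_frames(0), gof_frames(1))
```

Successive blocks of a row share two sample indices, a frame of M × N samples takes MN/4 cycles, and successive groups share two frames, matching the description in the text.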

Fig. 4. Proposed structure for computation of 3-level 3-D DWT. $z^1_{lh}$, $z^1_{hl}$, $z^1_{hh}$, respectively, represent ($z^1_{lhl}$, $z^1_{lhh}$), ($z^1_{hll}$, $z^1_{hlh}$) and ($z^1_{hhl}$, $z^1_{hhh}$). $v^2_l$ and $v^2_h$, respectively, represent ($v^2_{ll}$, $v^2_{hl}$) and ($v^2_{lh}$, $v^2_{hh}$). Output represents ($z^2_{llh}$, $z^2_{lhl}$, $z^2_{lhh}$), ($z^2_{hll}$, $z^2_{hlh}$, $z^2_{hhl}$, $z^2_{hhh}$) or ($z^3_{lll}$, $z^3_{llh}$, $z^3_{lhl}$, $z^3_{lhh}$), ($z^3_{hll}$, $z^3_{hlh}$, $z^3_{hhl}$, $z^3_{hhh}$).

[Figure panels: sample blocks of frames F1 to F6 grouped as GOF-1 and GOF-2, fed at 6 samples/cycle during the first and second sets of MN/4 cycles.]

Fig. 5. Data input format of the proposed structure. Grey color boxes represent the overlap area of the adjacent blocks, while the overlapping frames are shown in violet color.





2), $x(2n_1 - i, m_2, 4n_3+1)$, $x(2n_1 - i, m_2, 4n_3)$} and computes a pair of intermediate coefficients $u_l(2n_1 - i, m_2, 2n_3 - 1)$ and $u_h(2n_1 - i, m_2, 2n_3 - 1)$. During the same period, the first subcell-1 receives the last four samples {$x(2n_1 - i, m_2, 4n_3+5)$, $x(2n_1 - i, m_2, 4n_3+4)$, $x(2n_1 - i, m_2, 4n_3+3)$, $x(2n_1 - i, m_2, 4n_3+2)$} of the input block and computes the intermediate coefficients $u_l(2n_1 - i, m_2, 2n_3)$ and $u_h(2n_1 - i, m_2, 2n_3)$. Note that the successive output samples of subcell-1 belong to the same column, and these intermediate coefficients can be processed directly by subcell-2 for column DWT. In each cycle, subcell-2 receives a pair of intermediate coefficients from the corresponding subcell-1 and computes the column-DWT in time-multiplexed form to take advantage of the down-sampled filter computation. The structure of subcell-2 is similar to the structure of subcell-2 of [16] (see Fig. 4 and Fig. 5 of [16]), except that each shift-register (SR) in this case is replaced with a register (R).
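The way two subcells share one 6-sample block can be illustrated with a software model of the Daub-4 row DWT. This is a behavioural sketch, not the hardware design of the paper: the function names are hypothetical, the filter is applied as a plain inner product over a 4-sample window (the ordering convention and output indexing of the paper are not reproduced), and boundary extension is ignored. It checks that processing overlapped 6-sample blocks (stride 4) gives the same coefficient pairs as direct down-sampled filtering (stride 2).

```python
import math

S3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Daubechies 4-tap low-pass coefficients and the matching high-pass filter.
H = [(1 + S3) / NORM, (3 + S3) / NORM, (3 - S3) / NORM, (1 - S3) / NORM]
G = [H[3], -H[2], H[1], -H[0]]


def subcell1(window):
    """One (low, high) coefficient pair from a window of 4 samples."""
    low = sum(h * x for h, x in zip(H, window))
    high = sum(g * x for g, x in zip(G, window))
    return low, high


def row_dwt_direct(row):
    """Down-sampled filtering: one window of 4 samples every 2 samples."""
    return [subcell1(row[k:k + 4]) for k in range(0, len(row) - 3, 2)]


def row_dwt_blocked(row):
    """Same result from 6-sample blocks that overlap by 2 samples."""
    out = []
    for start in range(0, len(row) - 5, 4):
        block = row[start:start + 6]
        out.append(subcell1(block[0:4]))   # first four samples of the block
        out.append(subcell1(block[2:6]))   # last four samples of the block
    return out


row = [float((7 * i) % 11) for i in range(18)]
print(row_dwt_direct(row) == row_dwt_blocked(row))
```

Because each block yields two adjacent coefficient pairs, a pair of subcells working on the same block in the same cycle covers four input samples per cycle without any intermediate buffering in this model.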

After a latency of 3 cycles, each subcell-2 produces a pair of subband components ($v_{ll}$/$v_{hl}$) and ($v_{lh}$/$v_{hh}$) in each cycle, such that if it produces one component each of the subbands $v_{ll}$ and $v_{lh}$ during the even-numbered cycles, then it produces components of the other two subbands ($v_{hl}$ and $v_{hh}$) during the odd-numbered cycles. Subcell-1 and subcell-2 work in separate pipeline stages and perform their DWT computations concurrently. Each PE calculates the DWT components of a pair of columns of each of the four subbands of a given frame in M cycles, where the components of the subbands ($v_{ll}$, $v_{lh}$) and ($v_{hl}$, $v_{hh}$) are obtained in time-multiplexed form. The (i + 1)-th PE, therefore, completes the first-level decomposition of the $(2n_1 - i)$-th frame of size (M × N) in MN/4 cycles with an initial latency of 3 cycles.

The adjacent PEs of PU-1 generate DWT components corresponding to two successive frames. DWT components of four successive frames are obtained from the four PEs, such that 2 columns of DWT components of a pair of subbands ($v^1_{ll}$, $v^1_{lh}$) or ($v^1_{hl}$, $v^1_{hh}$) of four successive parallel frames are obtained from the four PEs. Down-sampled filter computations are performed on each of the subband coefficients generated by the PEs for the temporal (inter-frame) DWT. Temporal DWT computations can be performed using subcell-1. PU-1, therefore, uses four subcell-1 units (see Fig. 6) to calculate the temporal DWT of the columns of the subbands ($v^1_{ll}$, $v^1_{lh}$) or ($v^1_{hl}$, $v^1_{hh}$) of four successive frames concurrently. Out of these four subcells, the first and third subcell-1, respectively, calculate the temporal DWT of the even- and odd-numbered columns of the subbands ($v^1_{ll}$ or $v^1_{hl}$), while the second and fourth subcell-1, respectively, calculate the temporal DWT of the even- and odd-numbered columns of the subbands ($v^1_{lh}$ or $v^1_{hh}$) in time-multiplexed form. Each subcell-1 calculates a pair of components corresponding to two subbands of the 3-D transform in every cycle, such that in every cycle a pair of components of two adjacent columns of four subbands ($z^1_{lll}$, $z^1_{llh}$, $z^1_{lhl}$, $z^1_{lhh}$) or ($z^1_{hll}$, $z^1_{hlh}$, $z^1_{hhl}$, $z^1_{hhh}$) is obtained from the four subcells. A pair of components of two adjacent columns of all the eight oriented selective subbands of 1-level 3-D DWT is obtained in a couple of cycles. Two columns of each of the eight subbands are obtained from PU-1 in M cycles, and the entire coefficient matrix of 1-level 3-D DWT of the input frames of size (M × N) can be obtained in MN/4 cycles with an initial latency of 4 cycles.

Fig. 8. Structure of PU-2. Output-1 represents ($v^2_{ll}(n_1, m_2, n_3)$ or $v^2_{hl}(n_1, m_2, n_3)$) and output-2 represents ($v^2_{lh}(n_1, m_2, n_3)$ or $v^2_{hh}(n_1, m_2, n_3)$), where $0 \le m_2 \le (M/4)-1$, $0 \le n_3 \le (N/4)-1$ and $0 \le n_2 \le (M/2)-1$.

Components of the subband $z^1_{lll}$ are sent to PU-2 to calculate the DWT components of the second level. PU-2 receives a pair of components from PU-1, corresponding to a pair of adjacent columns of $z^1_{lll}$, after a gap of one cycle, i.e., in alternate cycles. The structure of PU-2 is shown in Fig. 8. It consists of one input-delay-unit (IDU) and one subcell-1. Subcell-1 in this case performs the row and column computations pertaining to 2-level DWT in time-multiplexed form. The components of $z^1_{lll}$ of a particular frame are fed to subcell-1 through the IDU of PU-2 block-by-block, similar to the 1-level processing. The input block $I(n_1, n_2, n_3)$ in this case contains four consecutive samples $\{z^1_{lll}(n_1, n_2, 2n_3+3), z^1_{lll}(n_1, n_2, 2n_3+2), z^1_{lll}(n_1, n_2, 2n_3+1), z^1_{lll}(n_1, n_2, 2n_3)\}$ for $0 \le n_2 \le M/2 - 1$, $0 \le n_3 \le N/4 - 1$. Input blocks are fed to subcell-1 (see Fig. 8) column-wise in alternate cycles, such that one column of input blocks of a particular frame of $z^1_{lll}$ is fed to subcell-1 in M cycles and the input blocks of one complete frame in MN/4 cycles. One column of input blocks is derived from four successive columns of $z^1_{lll}$. Since PU-2 receives components of two adjacent columns of $z^1_{lll}$ from PU-1 at a time, the two previous columns of $z^1_{lll}$ must be stored to derive the required input blocks. The IDU, therefore, contains 2 shift-registers (SRs) of size M/2 words each. The SRs also help to perform the down-sampled filter computation along the row direction. A pair of DWT components $u^2_l(n_1, n_2, n_3)$ and $u^2_h(n_1, n_2, n_3)$ is obtained from the subcell in every alternate cycle. Note that the successive output samples $u^2_l(n_1, n_2, n_3)$ and $u^2_h(n_1, n_2, n_3)$ correspond to successive columns of the intermediate coefficient matrices $[u^2_l]$ and $[u^2_h]$. The column-DWT can be performed on the components of $[u^2_l]$ and $[u^2_h]$ immediately in the next cycle. Since subcell-1 of PU-2 receives the input blocks of $z^1_{lll}$ only during alternate cycles, it remains idle for one cycle after every input cycle of $I(n_1, n_2, n_3)$. The idle cycles of subcell-1 can be utilized by assigning to them the down-sampled filter computation of $[u^2_l]$ and $[u^2_h]$ in time-multiplexed form. The samples of $u^2_l$ and $u^2_h$ are passed through separate delay-paths to provide the column delay necessary for the filter computation. All the registers and shift-registers of the IDU are clocked by CLK2, whose frequency is half that of CLK1 used by PU-1. The 4 multiplexors (MUXs) of the IDU select the delayed samples of $u^2_l$ and $u^2_h$ alternately and feed them to the subcell during its idle cycles. A pair of DWT components of two subbands ($v^2_{ll}$, $v^2_{lh}$) or ($v^2_{hl}$, $v^2_{hh}$) of a particular frame of $z^1_{lll}$ is obtained from PU-2 after every couple of cycles, and one component each of the four subbands in every four cycles. One column of each of the four subbands is obtained in M cycles and the subband components of a complete frame in MN/4 cycles. The components of the subbands ($v^2_{ll}$, $v^2_{lh}$) or ($v^2_{hl}$, $v^2_{hh}$) are sent to PU-3 for the inter-frame DWT computation of the 2-level decomposition.
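The role of the two shift-registers in the IDU can be illustrated with a small behavioural model. This is a sketch under assumptions, not the circuit of Fig. 8: the subband is held as a list of rows `z[n2][column]`, PU-1 is assumed to deliver the pair of columns (2p, 2p + 1) row by row, and the names `idu_blocks`, `sr1` and `sr2` are hypothetical. Each shift-register is as deep as the number of rows (M/2 in the text), so the sample leaving it belongs to the same row of the previous column pair.

```python
from collections import deque


def idu_blocks(z):
    """Form 4-sample row blocks from column pairs using two shift-registers."""
    rows, cols = len(z), len(z[0])
    sr1 = deque([None] * rows)   # delays column 2p by one column pair
    sr2 = deque([None] * rows)   # delays column 2p + 1 by one column pair
    blocks = []
    for p in range(cols // 2):
        for n2 in range(rows):
            a, b = z[n2][2 * p], z[n2][2 * p + 1]   # pair arriving from PU-1
            old_a, old_b = sr1.popleft(), sr2.popleft()
            sr1.append(a)
            sr2.append(b)
            if old_a is not None:                    # registers are filled
                # Block n3 = p - 1 holds samples 2*n3 .. 2*n3 + 3 of row n2.
                blocks.append((n2, p - 1, (old_a, old_b, a, b)))
    return blocks


z = [[10 * r + c for c in range(8)] for r in range(3)]
for n2, n3, samples in idu_blocks(z)[:3]:
    print(n2, n3, samples)
```

In this model, successive blocks of a row advance by two samples, so the stored pair of columns provides exactly the overlap needed for the down-sampled row filter.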

Fig. 9. Structure of frame-buffer-1

To calculate the temporal DWT of 2-level, subbands of 3 successive frames are stored in frame-buffer-1. The structure of frame-buffer-1 is shown in Fig. 9. It consists of 7 memory-blocks (MBs) and four 2-to-1 line MUXs. Each MB is of size MN/8 words. Components of a pair of subbands ($v^2_{ll}$, $v^2_{hl}$) or ($v^2_{lh}$, $v^2_{hh}$) corresponding to a particular frame are stored in alternate MBs, such that the components of ($v^2_{ll}$, $v^2_{hl}$) are stored in even-numbered MBs and those of ($v^2_{lh}$, $v^2_{hh}$) are stored in odd-numbered MBs. One extra MB is used to store one extra frame of ($v^2_{lh}$, $v^2_{hh}$) to provide one complete frame-delay





Fig. 10. Structure of PU-3. Input $v^2_l$ and $v^2_h$, respectively, represent ($v^2_{ll}$, $v^2_{hl}$) and ($v^2_{lh}$, $v^2_{hh}$). Intermediate results $v^3_l$ and $v^3_h$, respectively, represent ($v^3_{ll}$, $v^3_{hl}$) and ($v^3_{lh}$, $v^3_{hh}$). Output $z^2_l$ and $z^2_h$, respectively, represent ($z^2_{hll}$, $z^2_{lhl}$, $z^2_{hhl}$) and ($z^2_{llh}$, $z^2_{lhh}$, $z^2_{hlh}$, $z^2_{hhh}$). Similarly, output $z^3_l$ and $z^3_h$, respectively, represent ($z^3_{lll}$, $z^3_{hll}$, $z^3_{lhl}$, $z^3_{hhl}$) and ($z^3_{llh}$, $z^3_{lhh}$, $z^3_{hlh}$, $z^3_{hhh}$).

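TABLE VI

TIMING SCHEDULE FOR MULTIPLEXING DWT COMPUTATION IN THE SUBCELL OF PU-3

| DWT | SB | Clock cycles |
|---|---|---|
| 2-level temporal | $v^2_{ll}$ | $(2m_1\alpha + 2n + 1 + \delta_1)$ |
| 2-level temporal | $v^2_{hl}$ | $(2m_1\alpha + 2n + 3 + \delta_1)$ |
| 2-level temporal | $v^2_{lh}$ | $((2m_1 + 1)\alpha + 2n + 1)$ |
| 2-level temporal | $v^2_{hh}$ | $((2m_1 + 1)\alpha + 2n + 3 + \delta_1)$ |
| 3-level column | $z^2_{lll}$ | $(2m_1\alpha + 4n + 2 + \delta_2)$ |
| 3-level column | $z^2_{lll}$ | $(2m_1\alpha + 4n + 2 + \delta_2)$ |
| 3-level row | $u^3_l$ | $(2m_1\alpha + 2Mm_2 + 8n + 4 + \delta_3)$ |
| 3-level row | $u^3_h$ | $(2m_1\alpha + (2m_2 + 1)M + 8n + 4 + \delta_3)$ |
| 3-level temporal | $v^3_{ll}$ | $(4m_1\alpha + 2Mm_2 + 8n + 6 + \delta_4)$ |
| 3-level temporal | $v^3_{hl}$ | $(4m_1\alpha + M(2m_2 + 1) + 8n + 6 + \delta_4)$ |
| 3-level temporal | $v^3_{lh}$ | $((4m_1 + 2)\alpha + 2Mm_2 + 8n + 6 + \delta_4)$ |
| 3-level temporal | $v^3_{hh}$ | $((4m_1 + 2)\alpha + M(2m_2 + 1) + 8n + 6 + \delta_4)$ |

LEGEND: SB: subband, $\alpha = MN/8$, $n = 0, 1, 2, \ldots, M-1$, $m_2 = 0, 1, 2, \ldots, (N/8)-1$, $m_1 = 0, 1, 2, 3, \ldots$, $\delta_1 = 3MN/8$ cycles delay to fill frame-buffer-1, $\delta_2 = 12$ cycles delay to fill the delay-path of $z^2_{lll}$, $\delta_3 = 3M$ cycles delay to fill the shift-registers corresponding to $u^3_l$ or $u^3_h$, $\delta_4 = 6MN/8$ cycles delay to fill frame-buffer-2.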

with respect to the subband components of ($v^2_{ll}$, $v^2_{hl}$). The four MUXs of frame-buffer-1 select the frames of ($v^2_{ll}$, $v^2_{hl}$) from the even-numbered MBs and the current frame during each even-numbered set of MN/8 cycles, while they select the frames of ($v^2_{lh}$, $v^2_{hh}$) from the odd-numbered MBs during each odd-numbered set of MN/8 cycles. The subcell of PU-3 (shown in Fig. 10) receives a block of 4 samples from frame-buffer-1 through the MUXs during every alternate cycle and calculates the inter-frame DWT of ($v^2_{ll}$, $v^2_{hl}$) and ($v^2_{lh}$, $v^2_{hh}$) in alternate periods of MN/8 cycles. The structure of this subcell is identical to the structure of subcell-1 of PU-1. Components of the four subbands ($z^2_{lll}$, $z^2_{llh}$) and ($z^2_{hll}$, $z^2_{hlh}$) are obtained in time-multiplexed form during the even-numbered sets of MN/8 cycles. Similarly, during the odd-numbered sets of MN/8 cycles, components of the other





TABLE VII

INPUT-OUTPUT DATA FLOW OF THE SUBCELL OF PU-3

| clock cycle | Input | output-1 | output-2 | clock cycle | Input | output-1 | output-2 |
|---|---|---|---|---|---|---|---|
| 1 | $v^2_{ll}(0,0,0)$ | $z^2_{lll}(0,0,0)$ | $z^2_{llh}(0,0,0)$ | $\Delta + 1$ | $v^2_{lh}(0,0,0)$ | $z^2_{lhl}(0,0,0)$ | $z^2_{lhh}(0,0,0)$ |
| 2 | $z^2_{lll}(0,0,0)$ | $u^3_l(0,0,0)$ | $u^3_h(0,0,0)$ | $\Delta + 2$ | | | |
| 3 | $v^2_{hl}(0,0,0)$ | $z^2_{hll}(0,0,0)$ | $z^2_{hlh}(0,0,0)$ | $\Delta + 3$ | $v^2_{hh}(0,0,0)$ | $z^2_{hhl}(0,0,0)$ | $z^2_{hhh}(0,0,0)$ |
| : | : | : | : | : | : | : | : |
| $M + 1$ | $v^2_{ll}(0,0,1)$ | $z^2_{lll}(0,0,1)$ | $z^2_{llh}(0,0,1)$ | $\Delta + M + 1$ | $v^2_{lh}(0,0,1)$ | $z^2_{lhl}(0,0,1)$ | $z^2_{lhh}(0,0,1)$ |
| $M + 2$ | $z^2_{lll}(0,0,1)$ | $u^3_l(0,0,1)$ | $u^3_h(0,0,1)$ | $\Delta + M + 2$ | | | |
| $M + 3$ | $v^2_{hl}(0,0,1)$ | $z^2_{hll}(0,0,1)$ | $z^2_{hlh}(0,0,1)$ | $\Delta + M + 3$ | $v^2_{hh}(0,0,1)$ | $z^2_{hhl}(0,0,1)$ | $z^2_{hhh}(0,0,1)$ |
| : | : | : | : | : | : | : | : |
| $2\Delta + 1$ | $v^2_{ll}(2,0,0)$ | $z^2_{lll}(2,0,0)$ | $z^2_{llh}(2,0,0)$ | $3\Delta + 1$ | $v^2_{lh}(2,0,0)$ | $z^2_{lhl}(2,0,0)$ | $z^2_{lhh}(2,0,0)$ |
| $2\Delta + 2$ | $z^2_{lll}(2,0,0)$ | $u^3_l(2,0,0)$ | $u^3_h(2,0,0)$ | $3\Delta + 2$ | | | |
| $2\Delta + 3$ | $v^2_{hl}(2,0,0)$ | $z^2_{hll}(2,0,0)$ | $z^2_{hlh}(2,0,0)$ | $3\Delta + 3$ | $v^2_{hh}(2,0,0)$ | $z^2_{hhl}(2,0,0)$ | $z^2_{hhh}(2,0,0)$ |
| : | : | : | : | : | : | : | : |
| $2\Delta + M + 1$ | $v^2_{ll}(2,0,1)$ | $z^2_{lll}(2,0,1)$ | $z^2_{llh}(2,0,1)$ | $3\Delta + M + 1$ | $v^2_{lh}(2,0,1)$ | $z^2_{lhl}(2,0,1)$ | $z^2_{lhh}(2,0,1)$ |
| $2\Delta + M + 2$ | $z^2_{lll}(2,0,1)$ | $u^3_l(2,0,1)$ | $u^3_h(2,0,1)$ | $3\Delta + M + 2$ | | | |
| $2\Delta + M + 3$ | $v^2_{hl}(2,0,1)$ | $z^2_{hll}(2,0,1)$ | $z^2_{hlh}(2,0,1)$ | $3\Delta + M + 3$ | $v^2_{hh}(2,0,1)$ | $z^2_{hhl}(2,0,1)$ | $z^2_{hhh}(2,0,1)$ |
| : | : | : | : | : | : | : | : |

$\Delta = MN/8$. Only the input sample corresponding to the filter output is shown in the input column. The clock cycles required to fill the delay registers, shift-registers and memory-blocks are not counted.

four subbands $(z^2_{lhl}, z^2_{lhh})$ and $(z^2_{hhl}, z^2_{hhh})$ are obtained in time-multiplexed form. One component of $z^2_{lll}$ is obtained from subcell-1 (of PU-3) after every 4 cycles, and the successive components belong to a column. Successive columns of $z^2_{lll}$ are obtained from subcell-1 during alternate periods of MN/8 cycles. Subband $z^2_{lll}$ is further transformed to generate the DWT coefficients of level-3.

Since subcell-1 of PU-3 receives the components of $(v^2_{ll}, v^2_{hl})$ or $(v^2_{lh}, v^2_{hh})$ during alternate cycles, it remains idle for one cycle after every input cycle of $(v^2_{ll}, v^2_{hl})$ or $(v^2_{lh}, v^2_{hh})$. If it computed the temporal-DWT of the second level alone, the hardware utilization of the subcell would be only 50%. The DWT of $z^2_{lll}$ can be computed by the subcell during the idle cycles without any data overlapping, since the amount of computation required to process $z^2_{lll}$ is (3/8)-th of that of the second-level temporal-DWT. The DWT computation of $z^2_{lll}$ is therefore scheduled in the idle cycles of subcell-1 of PU-3. The processing of the intermediate coefficients $u^3_l$ and $u^3_h$ is time-multiplexed column-wise to take advantage of down-sampling. Similarly, the temporal-DWT of the pair of subbands $(v^3_{ll}, v^3_{hl})$ and $(v^3_{lh}, v^3_{hh})$ is time-multiplexed to take advantage of down-sampling along the temporal direction. The schedule for multiplexing the computation of the third-level DWT and the second-level temporal-DWT in subcell-1 is given in Table VI. The input-output data-flow of subcell-1 of Fig. 10 is derived for a few cycles using the schedule of Table VI and shown in





Table VII. The registers of Fig. 10 provide the required delay in column-wise processing, while the shift-registers provide the necessary delay in row-wise processing of the intermediate coefficients $u^3_l$ and $u^3_h$. The extra shift-register provides one additional row-delay for time-multiplexing the processing of $u^3_l$ and $u^3_h$. Similarly, frame-buffer-2 provides the necessary frame-delay for the multiplexed computation of the temporal-DWT of $(v^3_{ll}, v^3_{hl})$ and $(v^3_{lh}, v^3_{hh})$. The structure of frame-buffer-2 is similar to that of frame-buffer-1 (see Fig. 9), except that in this case each MB is of size MN/32 words. Each shift-register of PU-3 is of size M/8 words (equal to half of the frame height of $z^2_{lll}$) and is clocked by CLK4, which is 8 times slower than CLK1. Each register of the delay-path is clocked by a separate clock CLK3, which is 4 times slower than CLK1. PU-3 uses separate multiplexers for multiplexing the computations. Four MUX1s multiplex the computation of $u^3_l$ and $u^3_h$, while four MUX2s multiplex the row and column processing of $z^2_{lll}$. Similarly, four MUX3s multiplex the temporal DWT computation with the row and column processing of $z^2_{lll}$. Four MUX4s multiplex the computation of $z^2_{lll}$ with the temporal DWT computation of the second level. Each PU works in a separate pipeline stage, and the PUs compute the multilevel 3-D DWT concurrently. The proposed structure can compute the 3-level running DWT of a video stream of frame size (M × N) and frame rate R in MNR/8 cycles with an initial latency of $(11 + 2M + \delta_1 + \delta_2 + \delta_3 + \delta_4)$ cycles, where a delay of (6 + 2M) cycles is introduced to fill the registers and shift-registers of PU-2, and delays of $\delta_1$, $\delta_2$, $\delta_3$ and $\delta_4$ cycles are introduced to fill, respectively, the MBs of frame-buffer-1, the registers, the shift-registers and the MBs of frame-buffer-2 of PU-3.

IV. IMPLEMENTATION OF SUBCELLS

To obtain a reduced-hardware structure, subcell-1 and subcell-2 of the PUs can be implemented by multiple constant multiplication (MCM) methods using the CSD representation of the filter coefficients [19], or by a memory-based technique for multiplication using look-up-tables and adders [20]. Apart from that, the interrelations and symmetries between the coefficients of the wavelet filter bases can be utilized to derive efficient structures of the subcells. We discuss here an area-time efficient implementation of the subcells for the Daubechies wavelet filters for K = 4. The transfer functions of the low-pass and the high-pass filters corresponding to the Daubechies 4-tap wavelet transform can be expressed as [21]:

$$H(z) = a + bz^{-1} + cz^{-2} + dz^{-3} \tag{1a}$$

$$G(z) = d - cz^{-1} + bz^{-2} - az^{-3} \tag{1b}$$

where

$$a = \frac{1+\sqrt{3}}{4\sqrt{2}}, \quad b = \frac{3+\sqrt{3}}{4\sqrt{2}}, \quad c = \frac{3-\sqrt{3}}{4\sqrt{2}}, \quad d = \frac{1-\sqrt{3}}{4\sqrt{2}}.$$
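As a quick numerical check of the coefficient definitions above, the following minimal Python sketch evaluates a, b, c and d and confirms three standard properties of the Daubechies 4-tap filter pair: the low-pass taps sum to √2, they have unit energy, and the high-pass taps of (1b) sum to zero. The helper name `daub4_coefficients` is introduced here only for illustration and is not part of the proposed design.

```python
import math

def daub4_coefficients():
    """Return (a, b, c, d), the taps of the Daubechies 4-tap low-pass filter H(z)."""
    s3 = math.sqrt(3.0)
    norm = 4.0 * math.sqrt(2.0)
    return ((1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm)

a, b, c, d = daub4_coefficients()
low_pass = [a, b, c, d]       # taps of H(z) in (1a)
high_pass = [d, -c, b, -a]    # taps of G(z) in (1b)

print(sum(low_pass))                   # close to sqrt(2)
print(sum(h * h for h in low_pass))    # close to 1 (unit energy)
print(sum(high_pass))                  # close to 0 (no DC response)
```

The same taps, with the common factor 4√2 dropped, are what the adder-based forms of the subcells operate on.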





TABLE VIII

COMPARISON OF HARDWARE- AND TIME-COMPLEXITIES OF THE PROPOSED STRUCTURE AND THE EXISTING STRUCTURES FOR 3-LEVEL 3-D DWT USING DAUBECHIES 4-TAP WAVELET FILTER. M: IMAGE HEIGHT, N: IMAGE WIDTH, R: FRAME-RATE

| Structures | MULT | ADD | REG | shift-register | frame-buffer | cycle period | ACT | Latency |
|---|---|---|---|---|---|---|---|---|
| Weeks et al [8] (3DW-I) | 24 | 18 | 24 | $2MN$ | $\tfrac{1}{8}MNR + 2MNR$ | $T_M + T_A$ | $\tfrac{4}{7}MNRx$ | $O(MNR)$ |
| Weeks et al [8] (3DW-II) | 8 | 6 | 110 | 0 | $MNR$ | $T_M + T_A$ | $4MNRx$ | $O(MNR)$ |
| Das et al [13] | 24 | 18 | 8 | $2(2M + 1)N$ | $\tfrac{1}{8}MNR$ | $T_M + T_A$ | $\tfrac{4}{7}MNRx$ | $O(MNR)$ |
| Dai et al [15] | 96 | 72 | 32 | $4(N + 2)R$ | $\tfrac{1}{8}MNR$ | $T_M + T_A$ | $\tfrac{1}{7}MNRx$ | $O(MNR)$ |
| Mohanty et al [16] | $(219/16)Q$ | $(657/64)Q$ | $5.25N$ | $147MN/32$ | $5MN/32$ | $T_M + 2T_A$ | $MNR/Q$ | $O(MN)$ |
| Proposed Structure | 44 | 276 | 239 | $15M/8$ | $35MN/32$ | $\max(T_M, 2T_A)$ | $MNR/8$ | $O(MN)$ |

Legend: MULT: multiplier, ADD: adder, REG: data/pipeline register. Shift-register and frame-buffer sizes are given in words and ACT in cycles. $x = 511/512$; Q: input block-size.

Ignoring the fixed factor $4\sqrt{2}$ in the denominators of the filter coefficients, the low-pass and high-pass outputs corresponding to the input sequence $(x_0, x_1, x_2, x_3)$ may be expressed in the alternative form:

$$u_l = (p_1 + 2p_2 + p_3) + \sqrt{3}\,(p_1 - p_3) \tag{2a}$$

$$u_h = (q_1 + 2q_2 + q_3) + \sqrt{3}\,(q_3 - q_1) \tag{2b}$$

where $p_1 = x_0 + x_1$, $p_2 = x_1 + x_2$, $p_3 = x_2 + x_3$, $q_1 = x_0 - x_1$, $q_2 = x_2 - x_1$ and $q_3 = x_2 - x_3$.
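The equivalence between the adder-based form (2) and the direct 4-tap filter can be checked numerically. The sketch below is only an illustration in Python, not the hardware description: the function names and the sample values are hypothetical, and the direct form uses the taps of (1a) and (1b) with the factor 4√2 dropped.

```python
import math

SQRT3 = math.sqrt(3.0)

def subcell1_outputs(x0, x1, x2, x3):
    """Evaluate (2a) and (2b): two multiplications by sqrt(3), the rest additions."""
    p1, p2, p3 = x0 + x1, x1 + x2, x2 + x3
    q1, q2, q3 = x0 - x1, x2 - x1, x2 - x3
    ul = (p1 + 2 * p2 + p3) + SQRT3 * (p1 - p3)   # 2*p2 is a left shift in hardware
    uh = (q1 + 2 * q2 + q3) + SQRT3 * (q3 - q1)
    return ul, uh

def direct_outputs(x0, x1, x2, x3):
    """Direct 4-tap evaluation with the common factor 4*sqrt(2) dropped."""
    a, b, c, d = 1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3
    ul = a * x0 + b * x1 + c * x2 + d * x3
    uh = d * x0 - c * x1 + b * x2 - a * x3
    return ul, uh

sample = (12.0, -7.0, 3.5, 20.0)    # arbitrary test samples (hypothetical data)
print(subcell1_outputs(*sample))
print(direct_outputs(*sample))      # agrees with the line above up to rounding
```

The direct form needs eight multiplications per output pair, whereas the form of (2) needs two, which is the saving the subcell structure exploits.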

Unlike subcell-1, subcell-2 performs down-sampled filter computation on two signals, $u_l(n)$ and $u_h(n)$, in time-multiplexed form. The low-pass and high-pass filter outputs corresponding to $u_l(n)$ and $u_h(n)$ can in this case be expressed in the alternative form:

$$v_l = (2 + r)\,s(n) + (4 + r)\,t(n-1) + (2 - r)\,s(n-2) - r\,t(n-3) \tag{3a}$$

$$v_h = r\,t(n) + (r - 2)\,s(n-1) + (r + 4)\,t(n-2) - (2 + r)\,s(n-3) \tag{3b}$$

where $r = \sqrt{3} - 1$, and $s(n)$ and $t(n)$, respectively, represent the outputs of the LC (see Fig. 5 of [16]) for inputs $X_1 = u_l(n)$ and $X_2 = u_h(n)$. $v_l$ represents $v_{ll}$ and $v_{hl}$ when $s(n)$ equals $u_l$ and $u_h$, respectively. Similarly, $v_h$ represents $v_{hh}$ and $v_{lh}$ when $t(n)$ equals $u_h$ and $u_l$, respectively.

Equation (3) may further be expressed in the z-domain as:

$$V_l(z) = S(z)\left(2 + r + (2 - r)z^{-2}\right) + T(z)\left((4 + r)z^{-1} - rz^{-3}\right) \tag{4a}$$

$$V_h(z) = T(z)\left(r + (r + 4)z^{-2}\right) + S(z)\left((r - 2)z^{-1} - (2 + r)z^{-3}\right) \tag{4b}$$

where $S(z)$ and $T(z)$ represent the z-transforms of $s(n)$ and $t(n)$, respectively.
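One way to read (3) is that every coefficient is a small integer plus or minus r, so the only true multiplications are r·s(k) and r·t(k), while the factors 2 and 4 are shifts. The following Python sketch illustrates that reading under our own assumptions: the function names and test sequences are hypothetical, and the lists `rs` and `rt` stand in for products that a pipelined structure would compute once per sample and reuse through delays.

```python
import math

R = math.sqrt(3.0) - 1.0   # r in (3a) and (3b)

def subcell2_direct(s, t, n):
    """Evaluate (3a) and (3b) literally, with eight coefficient multiplications."""
    vl = (2 + R) * s[n] + (4 + R) * t[n - 1] + (2 - R) * s[n - 2] - R * t[n - 3]
    vh = R * t[n] + (R - 2) * s[n - 1] + (R + 4) * t[n - 2] - (2 + R) * s[n - 3]
    return vl, vh

def subcell2_shared(s, t, n):
    """Same outputs using only the products r*s(k) and r*t(k); x2 and x4 are shifts."""
    rs = [R * v for v in s]   # one product per s sample, reused at delays 0, 1, 2, 3
    rt = [R * v for v in t]   # one product per t sample, reused at delays 0, 1, 2, 3
    vl = (2 * s[n] + rs[n]) + (4 * t[n - 1] + rt[n - 1]) + (2 * s[n - 2] - rs[n - 2]) - rt[n - 3]
    vh = rt[n] + (rs[n - 1] - 2 * s[n - 1]) + (rt[n - 2] + 4 * t[n - 2]) - (2 * s[n - 3] + rs[n - 3])
    return vl, vh

s = [1.0, -2.0, 4.5, 0.25, 3.0, -1.5]   # hypothetical s(n) samples
t = [0.5, 2.0, -3.0, 1.0, 6.0, 2.5]     # hypothetical t(n) samples
for n in range(3, len(s)):              # n >= 3 so that all delayed samples exist
    print(subcell2_direct(s, t, n), subcell2_shared(s, t, n))
```

Both functions return the same pair for every n, which is consistent with the count of two multiplications per output pair.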





Each pair of filter outputs given by (2) and (4) involves only two multiplications. Using (2) and (4), the subcells can, therefore, be implemented as fully-pipelined structures. The proposed structures of the subcells are shown in Fig. 11 and Fig. 12. The structure of subcell-1, shown in Fig. 11, computes a pair of filter outputs during every cycle period according to (2). It consists of seven adder units (AUs) and two multipliers. Each AU performs a pair of additions. AU-1, AU-2 and AU-3 compute $(p_1, q_1)$, $(p_2, q_2)$ and $(p_3, q_3)$, while AU-4 and AU-5 compute $(2p_2 + p_3, 2q_2 + q_3)$ and $(p_1 - p_3, q_3 - q_1)$, respectively. AU-6 computes $(p_1 + 2p_2 + p_3, q_1 + 2q_2 + q_3)$. The entire computation of subcell-1 is performed in three pipeline stages. A pair of filter outputs is given out by AU-7 after a latency of 2 cycles, where the duration of a cycle period is $T = \max(T_M, 2T_A)$, and $T_M$ and $T_A$ are, respectively, the times required for one multiplication and one addition in the subcells. The structure of subcell-2 has two multipliers, three shifters and 10 adders. It yields a pair of filter outputs in every cycle period after an initial latency of 4 cycles.

[Figure: block diagram of subcell-1. Inputs $x_0$ to $x_3$ feed AU1 to AU3, which produce $(p_1, q_1)$, $(p_2, q_2)$ and $(p_3, q_3)$; AU4 to AU7 and two multipliers by $\sqrt{3}$ produce the outputs $y_l$ and $y_h$.]

Fig. 11. Structure of subcell-1 for the Daubechies 4-tap wavelet filter.





V. HARDWARE COMPLEXITY AND PERFORMANCE CONSIDERATION

In this section we discuss the details of the hardware and time complexities of the proposed structure and compare those

with the existing designs.

A. Hardware Complexity

The proposed structure is comprised of three PUs. PU-1 has four PEs and four subcells (subcell-1), where each PE is comprised of two pairs of subcell-1 and subcell-2. Each subcell-1 has 2 multipliers, 14 adders and 10 pipeline registers, while each subcell-2 requires 2 multipliers, 10 adders and 10 pipeline registers. Each PE requires 8 multipliers, 48 adders and 40 pipeline registers. PU-1, therefore, involves 40 multipliers, 248 adders and 208 pipeline registers. PU-2 is comprised of one subcell and one IDU. The IDU consists of two shift-registers of size M/2 words each, 8 registers, 4 MUXs and 2 DMUXs. PU-2, therefore, involves 2 multipliers, 14 adders, (16 + M) registers, 4 MUXs and 2 DMUXs. PU-3 is comprised of one subcell-1, one frame-buffer-1, one frame-buffer-2, 7 shift-registers of size M/8 words each, 3 registers, 16 2-to-1 line MUXs and one DMUX array. Frame-buffer-1 is comprised of 7 MBs of size MN/8 words each and 4 2-to-1 line MUXs. Similarly, frame-buffer-2 is comprised of 7 MBs of size MN/32 words each and 4 2-to-1 line MUXs. The DMUX array is comprised of 14 1-to-2 line DMUXs. The proposed structure, therefore, involves 44 multipliers, 276 adders, 239 data/pipeline registers, 15M/8 shift-register words, (35/32)MN frame-buffer words, 28 MUXs and 16 DMUXs.
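The totals quoted above can be re-derived from the per-unit counts with a short tally. The Python sketch below is only a bookkeeping aid under the figures stated in the text; the helper names `total` and `times` are ours, and the register count is not re-derived here because the text quotes it per PU rather than per subcell.

```python
from fractions import Fraction

# (multipliers, adders) per unit, as quoted in the text
SUBCELL1 = (2, 14)
SUBCELL2 = (2, 10)

def total(*units):
    """Add resource tuples element-wise."""
    return tuple(sum(column) for column in zip(*units))

def times(count, unit):
    """Replicate a resource tuple `count` times."""
    return tuple(count * value for value in unit)

pe = times(2, total(SUBCELL1, SUBCELL2))       # two (subcell-1, subcell-2) pairs
pu1 = total(times(4, pe), times(4, SUBCELL1))  # four PEs plus four subcell-1
pu2 = (2, 14)                                  # one subcell, as quoted for PU-2
pu3 = SUBCELL1                                 # one subcell-1
multipliers, adders = total(pu1, pu2, pu3)

shift_words_per_M = 2 * Fraction(1, 2) + 7 * Fraction(1, 8)     # PU-2 IDU + PU-3
frame_words_per_MN = 7 * Fraction(1, 8) + 7 * Fraction(1, 32)   # frame-buffer-1 + frame-buffer-2
muxes = 4 + 16 + 4 + 4     # PU-2 IDU, PU-3, frame-buffer-1, frame-buffer-2
dmuxes = 2 + 14            # PU-2 IDU, DMUX array of PU-3

print(pe, pu1)                                  # (8, 48) (40, 248)
print(multipliers, adders)                      # 44 276
print(shift_words_per_M, frame_words_per_MN)    # 15/8 35/32
print(muxes, dmuxes)                            # 28 16
```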

B. Time Complexity

The proposed structure calculates the DWT coefficients of four samples of each frame in every cycle, and two frames are transformed in parallel. The average computation time (ACT) to calculate the 3-level DWT of a video stream of frame size (M × N) and frame rate R is MNR/8 cycles. The structure has an initial latency of (23 + 5M + 7MN/4) cycles. Out of this, (12 + 5M + 7MN/4) cycles of delay are introduced to fill the registers, shift-registers and MBs of the frame-buffers. The duration of one cycle period is $T = \max(T_M, 2T_A)$, where $T_M$ and $T_A$ are the times required to perform one multiplication and one addition in a subcell.
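For concrete frame sizes, the two expressions above are easy to evaluate. The following Python sketch simply encodes the ACT and latency formulas as stated; the function names are illustrative, and M × N is assumed to be divisible by 8 so that integer division is exact.

```python
def act_cycles(M, N, R):
    """Average computation time in cycles: MNR/8 (8 samples processed per cycle)."""
    return M * N * R // 8

def initial_latency_cycles(M, N):
    """Initial latency in cycles: 23 + 5M + 7MN/4, as stated in the text."""
    return 23 + 5 * M + 7 * M * N // 4

print(act_cycles(176, 144, 15))            # 47520
print(act_cycles(176, 144, 60))            # 190080
print(act_cycles(640, 480, 60))            # 2304000
print(initial_latency_cycles(176, 144))    # 45255
```

The ACT values agree with the entries for the proposed structure in Table IX.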





C. Performance Comparison

The hardware and time complexities of the proposed structure and the existing structures of [8], [13], [15], [16] are listed in Table VIII in terms of cycle period; ACT¹ and latency in clock cycles; register, shift-register and frame-buffer sizes in words; and the numbers of multipliers and adders.

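TABLE IX

COMPARISON OF MEMORY AND TIME COMPLEXITY OF THE PROPOSED STRUCTURE AND STRUCTURES OF [15] AND [16] FOR DWT LEVEL J = 3

| Structure | Frame-size | FPS | Memory | ACT |
|---|---|---|---|---|
| Dai et al [15] | 176 × 144 | 15 | 56312 | 54203 |
| Dai et al [15] | 176 × 144 | 30 | 112592 | 108405 |
| Dai et al [15] | 176 × 144 | 60 | 225152 | 216810 |
| Mohanty et al [16] | 176 × 144 | 15 | 121140 | 23760 |
| Mohanty et al [16] | 176 × 144 | 30 | 121140 | 47520 |
| Mohanty et al [16] | 176 × 144 | 60 | 121140 | 95040 |
| Proposed | 176 × 144 | 15 | 28289 | 47520 |
| Proposed | 176 × 144 | 30 | 28289 | 95040 |
| Proposed | 176 × 144 | 60 | 28289 | 190080 |
| Dai et al [15] | 640 × 480 | 15 | 604952 | 657000 |
| Dai et al [15] | 640 × 480 | 30 | 1609872 | 1314000 |
| Dai et al [15] | 640 × 480 | 60 | 2419712 | 2628000 |
| Mohanty et al [16] | 640 × 480 | 15 | 1461720 | 288000 |
| Mohanty et al [16] | 640 × 480 | 30 | 1461720 | 576000 |
| Mohanty et al [16] | 640 × 480 | 60 | 1461720 | 1152000 |
| Proposed | 640 × 480 | 15 | 337439 | 576000 |
| Proposed | 640 × 480 | 30 | 337439 | 1152000 |
| Proposed | 640 × 480 | 60 | 337439 | 2304000 |

Memory is in units of words and ACT in units of cycles.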

The structures of [8], [13], [15] are of folded type and compute the multilevel 3-D DWT level-by-level, while the structure of [16] computes the multilevel 3-D DWT concurrently. As shown in Table VIII, the structure of [15] is the most efficient among the existing folded structures. Compared with [15], the proposed structure involves 2.18 times fewer multipliers, nearly 23/6 times more adders, (32NR/15M) times less on-chip memory (sum of data/pipeline register and shift-register words) and 4R/35 times less frame-buffer. Besides, it has 8x/7 times less ACT than [15], where x = 511/512, which amounts to about 12.3% less ACT. Compared with the structure of [16], the proposed one involves 0.311Q times fewer multipliers, (26.88/Q) times more adders, nearly 2.45N times less on-chip memory and 7 times more frame-buffer. It involves Q/8 times more ACT than the structure of [16]. The proposed structure has a small cycle period compared to the existing structures. It is interesting to note that the on-chip memory of the proposed structure

¹ ACT is the number of cycles required for the computation of all the J levels of 3-D DWT after the initial latency. The ACT of the structures of [13] and [15] is calculated as the sum of the ACTs of the individual levels, because they compute the 3-D DWT of different levels sequentially. In the case of the proposed structure and the structure of [16], ACT is calculated by dividing the total number of 3-D DWT coefficients by the throughput per cycle.





varies with M, while in the case of [15] it varies with NR and in the case of [16] it varies with MN. This results in a significant saving in memory, since the frame-size and frame-rate of video applications are usually very large.

We have estimated the memory complexity and ACT of the proposed structure and the existing structures of [15], [16] for some frame-sizes and frame-rates used in practice. The memory complexity of a structure represents the sum of the data/pipeline registers and the memory words required by the shift-registers and frame-buffers. We have assumed input-block size Q = 16 for [16] and estimated the memory complexity and ACT of [15], [16] and the proposed structure for 3-level DWT. The estimated values are listed in Table IX for comparison. It can be found from Table IX that the memory complexities of the proposed structure and that of [16] are independent of the frame-rate, while in the case of [15] the memory complexity increases proportionately with frame-rate. Compared with [15], the proposed structure for frame-size 176 × 144 and frame-rates 15, 30 and 60 requires, respectively, 1.99 times, 3.98 times and 7.96 times less memory words and involves 12.3% less ACT. It involves 4.28 times less memory words than [16] for the same frame-size and frame-rates, and involves 2 times more ACT. For frame-size 640 × 480 and frame-rates 15, 30 and 60, the proposed structure involves 1.79 times, 4.77 times and 7.17 times less memory words than [15] and calculates the 3-level DWT in 12.3% less time than [15]. Compared with [16], the proposed one involves 4.33 times less memory words.
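The memory figures for frame-size 176 × 144 can be reproduced from the word counts of Table VIII (data/pipeline registers plus shift-register words plus frame-buffer words). The Python sketch below is a minimal illustration under two assumptions of ours: the function names are hypothetical, and M and N are taken as the first and second dimension of the quoted frame-size, which is the assignment that reproduces the tabulated values.

```python
def memory_dai(M, N, R):
    """[15]: 32 registers + 4(N + 2)R shift-register words + MNR/8 frame-buffer words."""
    return 32 + 4 * (N + 2) * R + M * N * R // 8

def memory_mohanty(M, N):
    """[16]: 5.25N registers + 147MN/32 shift-register words + 5MN/32 frame-buffer words."""
    return 21 * N // 4 + 147 * M * N // 32 + 5 * M * N // 32

def memory_proposed(M, N):
    """Proposed: 239 registers + 15M/8 shift-register words + 35MN/32 frame-buffer words."""
    return 239 + 15 * M // 8 + 35 * M * N // 32

M, N = 176, 144
print([memory_dai(M, N, R) for R in (15, 30, 60)])   # [56312, 112592, 225152]
print(memory_mohanty(M, N))                          # 121140
print(memory_proposed(M, N))                         # 28289
print(round(memory_dai(M, N, 60) / memory_proposed(M, N), 2))   # 7.96
print(round(memory_mohanty(M, N) / memory_proposed(M, N), 2))   # 4.28
```

Only the expression for [15] depends on the frame-rate R, which is why its memory grows with frame-rate while the other two stay fixed.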

D. Numerical Error Consideration

To validate the proposed design, we have coded it in MATLAB 7.1 and VHDL for decomposition level 1 for floating-point and fixed-point implementations, respectively. For the fixed-point implementation, we have taken 8-bit pixel values and 12-bit precision for all the intermediate signals. We have used an 11-bit Baugh-Wooley multiplier for the RP and 12-bit multipliers for the CP and TP. We have processed four successive frames of two video sequences, Foreman and Xylophone, to generate all 8 subbands of the 3-D DWT, and estimated the absolute errors of the fixed-point implementation as the difference between the MATLAB simulation and test-bench results. The average and maximum errors obtained for all the subbands of the Foreman and Xylophone video sequences are shown in Table X. We find that the average error in the LLL subband is 0.49% of its average value in the case of Foreman and 0.87% in the case of Xylophone.





TABLE X

NUMERICAL ERROR OF 1-LEVEL 3-D DWT

| Subband | Foreman Avg. Error | Foreman Max. Error | Xylophone Avg. Error | Xylophone Max. Error |
|---|---|---|---|---|
| LLL | 1.9797 | 3.0987 | 1.8376 | 2.7030 |
| HLL | 2.2756 | 3.6129 | 1.7068 | 2.8422 |
| LLH | 0.6019 | 1.2487 | 0.7493 | 1.5044 |
| HLH | 0.6468 | 1.1638 | 0.6285 | 1.4232 |
| LHL | 1.3203 | 2.426 | 0.8831 | 2.8559 |
| HHL | 1.138 | 1.8319 | 0.9561 | 1.6683 |
| LHH | 0.4495 | 1.1286 | 0.4783 | 1.0656 |
| HHH | 0.4866 | 0.9639 | 0.5528 | 2.0059 |

E. Synthesis Result

We have coded the proposed design in VHDL for frame-sizes 176 × 144 and 640 × 480 and frame-rates 15, 30 and 60, and synthesized it using Xilinx ISE 12.1i tools along with the best of the existing designs [15] and [16]. We have considered the Daub-4 wavelet filter for all the designs and coded the structures for 3-level DWT. We have used single-port block RAM (BRAM) for implementing the temporal-buffer of all the designs. The frame-buffer of [15] and the input-buffer of [16] are also implemented using single-port BRAM. We have implemented all the registers, shift-registers and transposition-buffers using delay-type (D) flip-flops. All the designs are synthesized for the FPGA device 6VLX760FF1760-2, and the results obtained from the synthesis reports are listed in Table XI. We have estimated the slice-delay-product (SDP) to measure the area-time complexity of the designs on the FPGA platform. SDP is defined as the product of the number of slices required by a design and the computation time (CT), where CT = (number of cycles required for computation) / (maximum usable frequency (MUF)).
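The SDP definition above can be illustrated with a small Python sketch. The function names are ours; the inputs are the slice count and MUF reported in Table XI and the cycle count reported in Table IX for the proposed structure at frame-size 176 × 144 and 60 fps.

```python
def computation_time_s(cycles, muf_mhz):
    """CT in seconds: number of cycles divided by the maximum usable frequency."""
    return cycles / (muf_mhz * 1e6)

def slice_delay_product(slices, cycles, muf_mhz):
    """SDP: number of slices multiplied by the computation time."""
    return slices * computation_time_s(cycles, muf_mhz)

# Proposed structure, 176 x 144 at 60 fps: 13495 slices, 190080 cycles, 88.096 MHz
ct_ms = 1e3 * computation_time_s(190080, 88.096)
sdp = slice_delay_product(13495, 190080, 88.096)
print(ct_ms)   # about 2.16 ms, against 2.15 ms in Table XI
print(sdp)     # about 29.1, against 29.09 in Table XI
```

The small differences from the tabulated values are rounding effects; the sketch is meant to show how the metric is formed, not to replace the synthesis report.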

The proposed design has the lowest clock period and involves less memory (in terms of BRAMs), as expected from the theoretical estimates shown in Table VIII and Table IX. Although the proposed design involves less than half of the multipliers of [15], it involves 25% more slices than [15] due to adder complexity, pipeline registers and data selectors (MUXs, DMUXs). As shown in Table XI, the hardware complexity of the proposed structure and the structure of [16] is independent of the frame-rate, while in the case of [15] the memory complexity increases almost proportionately with frame-rate. The proposed structure for frame-size 176 × 144 and frame-rates 15, 30 and 60 involves, respectively, 2.56 times, 5.12 times and 9.6 times less BRAMs than [15] and offers 2 times higher throughput rate than [15]. Compared with the structure of [16] the proposed one





TABLE XI

COMPARISON OF SYNTHESIS RESULTS OF THE PROPOSED STRUCTURE AND STRUCTURES OF [15], [16] FOR FPGA DEVICE 6VLX760FF1760-2

| Structure | Frame-size | FPS | Slices | BR | MUF (MHz) | ACT (ms) | SDP (s) |
|---|---|---|---|---|---|---|---|
| Dai [15] | 176 × 144 | 15 | 10751 | 64 | 50.077 | 1.08 | 11.61 |
| Dai [15] | 176 × 144 | 30 | 10467 | 128 | 52.578 | 2.06 | 21.56 |
| Dai [15] | 176 × 144 | 60 | 10607 | 240 | 50.121 | 4.32 | 45.82 |
| Mohanty [16] | 176 × 144 | 15 | 28581 | 48 | 40.21 | 0.59 | 16.862 |
| Mohanty [16] | 176 × 144 | 30 | 28581 | 48 | 40.21 | 1.18 | 33.725 |
| Mohanty [16] | 176 × 144 | 60 | 28581 | 48 | 40.21 | 2.36 | 67.45 |
| Proposed | 176 × 144 | 15 | 13495 | 25 | 88.096 | 0.539 | 7.27 |
| Proposed | 176 × 144 | 30 | 13495 | 25 | 88.096 | 1.07 | 14.54 |
| Proposed | 176 × 144 | 60 | 13495 | 25 | 88.096 | 2.15 | 29.09 |
| Dai [15] | 640 × 480 | 15 | 14035 | 544 | 50.036 | 13.13 | 184.27 |
| Proposed | 640 × 480 | 15 | 13662 | 384 | 88.096 | 6.53 | 89.21 |
| Proposed | 640 × 480 | 30 | 13662 | 384 | 88.096 | 13.07 | 178.42 |
| Proposed | 640 × 480 | 60 | 13662 | 384 | 88.096 | 26.15 | 356.85 |

Legend: BR: block RAM, SDP: slice-delay-product, FPS: frames per second, MUF: maximum usable clock frequency. SDP is defined as the product of the number of slices required by each design and the computation time (CT), where CT = (number of cycles required for computation) / MUF.

involves 2.11 times fewer slices and 1.92 times less BRAMs than [16], and offers nearly the same throughput rate for the same frame-size and frame-rates. The proposed design has 1.59 times less SDP than the structure of [15] and 2.3 times less SDP than that of [16]. For frame-size 640 × 480 and frame-rate 15, the proposed structure involves 1.41 times less BRAMs and 2.06 times less SDP than [15].

F. Comparison of Power Consumption

We have estimated the power consumption of the proposed design and the designs of [15] and [16] using Xilinx XPower tools by implementing them on the FPGA device 6VLX760FF1760-2. The XPower analyzer report for 40 MHz clock frequency is listed in Table XII for comparison. As shown in Table XII, for frame-size 176 × 144 and frame-rates 15, 30 and 60, the proposed structure dissipates, respectively, 9.4%, 31.68% and 74.7% less dynamic power than the structure of [15]. Compared with [16], the proposed one dissipates 71.9% less dynamic power. It dissipates 44.6% less dynamic power than [15] for frame-size 640 × 480 and frame-rate 15. This is mainly because the proposed structure uses fewer BRAMs than the others.

VI. CONCLUSIONS

We have shown that, by using overlapped grouping of frames and appropriate scheduling of the computation of different levels, the memory requirement of multilevel 3-D DWT structures can be drastically reduced. Based on this observation, we have





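TABLE XII

COMPARISON OF POWER CONSUMPTION

| Structure | Frame-size | FPS | Static power (watt) | Dynamic power (watt) |
|---|---|---|---|---|
| Dai [15] | 176 × 144 | 15 | 3.213 | 0.202 |
| Dai [15] | 176 × 144 | 30 | 3.214 | 0.247 |
| Dai [15] | 176 × 144 | 60 | 3.217 | 0.334 |
| Mohanty [16] | 176 × 144 | 15, 30, 60 | 3.229 | 0.652 |
| Proposed | 176 × 144 | 15, 30, 60 | 3.212 | 0.183 |
| Dai [15] | 640 × 480 | 15 | 3.234 | 0.776 |
| Proposed | 640 × 480 | 15, 30, 60 | 3.221 | 0.430 |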

suggested a computing scheme to reduce the memory complexity of 3-D DWT implementation. The remarkable feature of the proposed structure is that it does not involve any line-buffer or frame-buffer for level-1.

Compared with the best of the existing folded structures [15], the proposed design involves significantly less on-chip memory, less frame-buffer and less ACT. Compared with [16], which could be taken as the best among all the existing designs, the proposed one involves 0.311Q times fewer multipliers, (26.88/Q) times more adders and Q/8 times more ACT, where Q is the input block-size. However, it involves 2.45N times less on-chip memory than [16]. For frame-size 176 × 144 and frame-rate 60, the proposed structure involves 7.96 times less memory and 12.3% less ACT than [15]. Compared with [16] for input-block size Q = 16, the proposed structure involves 4.28 times less memory and double the ACT for the same frame-size and frame-rates. The synthesis result for the FPGA device 6VLX760FF1760-2 shows that the proposed structure, for frame-size 176 × 144 and frame-rate 60, involves 9.6 times less BRAMs than [15] and offers 2 times higher throughput rate than [15]. Compared with the structure of [16], the proposed one involves 1.9 times less BRAMs and offers nearly the same throughput rate. The proposed structure has significantly less SDP than the existing structures. Due to its lower memory complexity, the proposed structure consumes less dynamic power than the best of the existing structures. It can compute multilevel running 3-D DWT on an unbounded sequence of GOFs and involves much less memory and fewer resources than the existing designs. It could, therefore, be used for high-performance video processing applications.

REFERENCES

[1] G. Minami, Z. Xiong, A. Wang, and S. Mehrotra, "3-D wavelet coding of video with arbitrary regions of support," IEEE Trans. Circuit Syst. Video Technol., vol. 11, pp. 1063-1068, Sept. 2001.

[2] A. M. Baskurt, H. Benoit-Cattin, and C. Odet, "3D medical image coding method using a separable 3D wavelet transform," SPIE Proceedings on Medical Imaging 1995: Image Display, vol. 2431, pp. 173-183, Apr. 1995.

[3] V. Sanchez, P. Nasiopoulos, and R. Abugharbieh, "Lossless compression of 4D medical images using H.264/AVC," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), vol. II, May 2006, pp. 1116-1119.

[4] J. Wei, P. Saipetch, R. K. Panwar, D. Chen, and B. K. Ho, "Volumetric image compression by 3D discrete wavelet transform (DWT)," SPIE Proceedings on Medical Imaging 1995: Image Display, vol. 2431, pp. 184-194, Apr. 1995.




[5] L. Anqiang and L. Jing, "A novel scheme for robust video watermark in the 3D-DWT domain," in International Symposium on Data, Privacy, and E-Commerce (ISDPE 2007), Nov. 2007, pp. 514-516.

[6] J.-R. Ohm, M. van der Schaar, and J. W. Woods, "Interframe wavelet coding: Motion picture representation for universal scalability," J. Signal Process. Image Commun., vol. 19, no. 9, pp. 877-908, Oct. 2004.

[7] P.-C. Wu and L.-G. Chen, "An efficient architecture for two-dimensional discrete wavelet transform," IEEE Trans. Circuit and System for Video Technology, vol. 11, no. 4, pp. 536-545, Apr. 2001.

[8] M. Weeks and M. A. Bayoumi, "Three-dimensional discrete wavelet transform architecture," IEEE Trans. on Signal Processing, vol. 50, no. 8, pp. 2050-2063, Aug. 2002.

[9] M. Weeks and M. A. Bayoumi, "Wavelet transform: architecture, design and performance issues," Journal of VLSI Signal Processing, vol. 35, issue 2, pp. 155-178, Sept. 2003.

[10] W. Badawy, M. Talley, G. Zhang, M. Weeks, and M. A. Bayoumi, "Low power very large scale integration prototype for three-dimensional discrete wavelet transform processor with medical application," Journal of Electronic Imaging, vol. 12, no. 2, pp. 270-277, April 2003.

[11] B. Das, A. Hazra, and S. Banerjee, "An efficient architecture for 3-D discrete wavelet transform," IEEE Trans. Circuit and Syst., Video Techno., vol. 20, no. 2, pp. 286-296, Feb. 2010.

[12] B. Das and S. Banerjee, "Low power architecture of running 3-D wavelet transform for medical imaging application," in Proc. Eng. Med. Biol. Soc./Biomed. Eng. Soc. Conf., vol. 2, 2002, pp. 1062-1063.

[13] B. Das and S. Banerjee, "A memory efficient 3D DWT architecture," Proceeding of 16th International Conference on VLSI Design, IEEE Computer Society, Aug. 2003.

[14] B. Das and S. Banerjee, "Data-folded architecture for running 3-D DWT using 4-tap Daubechies filters," IEE Proc. Circuits Devices Syst., vol. 152, no. 1, pp. 17-24, Feb. 2005.

[15] Q. Dai, X. Chen, and C. Lin, "A novel VLSI architecture for multidimensional discrete wavelet transform," IEEE Trans. Circuit and Syst., Video Techno., vol. 14, no. 8, pp. 1105-1110, Aug. 2004.

[16] B. K. Mohanty and P. K. Meher, "Parallel and pipeline architecture for high-throughput computation of multilevel 3-D DWT," IEEE Trans. Circuit and Syst., Video Techno., vol. 20, no. 9, pp. 1200-1209, Sept. 2010.

[17] Z. Taghavi and S. Kasaei, "A memory efficient algorithm for multidimensional wavelet transform based on lifting," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 6, 2003, pp. 401-404.

[18] P. K. Meher, B. K. Mohanty, and J. C. Patra, "Hardware-efficient systolic-like modular design for two-dimensional discrete wavelet transform," IEEE Trans. on Circuits and Syst. II, Express Brief, vol. 55, no. 2, pp. 151-154, Feb. 2008.

[19] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Trans. Circuits and Syst. II: Analog and Digital Signal Processing, vol. 43, no. 10, pp. 677-688, Oct. 1996.

[20] H.-R. Lee, C.-W. Jen, and C.-M. Liu, "On the design automation of the memory-based VLSI architectures for FIR filters," IEEE Trans. Consumer Electronics, vol. 39, no. 3, pp. 619-629, Aug. 1993.

[21] I. Daubechies and W. Sweldens, "Orthonormal bases of compactly supported wavelets," Comm. Pure Appl. Math., vol. 41, pp. 909-996, 1988.

Basant K Mohanty (M'06) received the B.Sc. and M.Sc. degrees (both with first-class honors) in Physics from Sambalpur University, Orissa, in 1987 and 1989, respectively. He received the Ph.D. degree in the field of VLSI for Digital Signal Processing from Berhampur University, Orissa, in 2000.

In 1992 he was selected by OPSC (Orissa Public Service Commission) and joined as a faculty member in the Department of Physics, SKCG College Paralakhemundi, Orissa. In 2001 he joined as a Lecturer in the EEE Department, BITS Pilani, Rajasthan. Then he joined as an Assistant Professor in the Department of ECE, Mody Institute of Education Research (Deemed University), Rajasthan. In 2003 he joined Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, where he became Associate Professor in 2005 and full Professor in 2007. His research interests include design and implementation of re-configurable VLSI architectures for resource-constrained digital signal processing applications. He has published nearly 30 technical papers. Currently he serves as a reviewer for IEEE Transactions on Circuits and Systems-II: Express Briefs, IEEE Transactions on Circuits and Systems for Video Technology, and IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

Dr. Mohanty is a lifetime member of The Institution of Electronics and Telecommunication Engineering, New Delhi, India.

Pramod Kumar Meher (SM'03) received the B.Sc. and M.Sc. degrees in Physics and the Ph.D. degree in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively.

He has a wide scientific and technical background covering Physics, Electronics, and Computer Engineering. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment he was a visiting faculty with the School of Computer Engineering, Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, Bhubaneswar, India from 1997 to 2002, a Reader in Electronics with Berhampur University, Berhampur, India from 1993 to 1997, and a Lecturer in Physics with various Government Colleges in India from 1981 to 1993. His research interests include design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 170 technical papers to various reputed journals and conference proceedings.

Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India and a Fellow of the Institution of Engineering and Technology, UK. He is serving as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society, and as Associate Editor for the IEEE Transactions on Circuits and Systems-II: Express Briefs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, and Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.
