
    Copyright (c) 2011 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

    FOR IEEE TRANSACTIONS ON SIGNAL PROCESSING 1

    Memory-Efficient Architecture for 3-D DWT Using Overlapped Grouping of Frames

    Basant K. Mohanty, Member, IEEE and Pramod K. Meher, Senior Member, IEEE

    Abstract

    In this paper, we present a memory-efficient architecture for 3-D DWT using overlapped grouping of frames. The proposed structure does not involve any line-buffer or frame-buffer for 1-level 3-D DWT. It involves only a frame-buffer of size $O(MN)$ to compute multilevel 3-D DWT, unlike the existing folded structures, which involve a frame-buffer of size $O(MNR)$. The saving of line-buffer and frame-buffer by the proposed structure for the implementation of first-level DWT is of substantial advantage, since the frame-size is very often as large as $1920 \times 1080$ and the frame-rate varies from 15 to 60 fps. The proposed structure has a small cycle period and offers small output latency compared to the existing structures. Compared with the best of the available designs, the proposed design involves significantly fewer memory words. For frame-size $176 \times 144$ and frame-rate 60 fps, the proposed structure involves 7.96 times fewer memory words and 12.3% less average computation time (ACT) than the best of the existing folded designs. It involves 4.28 times fewer memory words than the recently proposed parallel design. The synthesis result for frame-size $176 \times 144$ and frame-rate 60 fps for the FPGA device 6VLX760FF1760-2 shows that the proposed structure involves 9.6 times fewer BRAMs and offers 2 times higher throughput than the folded design. It involves 1.9 times fewer BRAMs than the parallel design and offers nearly the same throughput rate. The proposed structure has a significantly lower slice-delay-product (SDP) than the existing structures. Due to its lower memory complexity, the proposed structure dissipates significantly less dynamic power than the existing structures.

    Index Terms

    Discrete wavelet transform, 3-dimensional DWT, Overlapping frames, parallel and pipeline architecture, VLSI

    I. INTRODUCTION

    THREE-dimensional (3-D) discrete wavelet transform (DWT) is applied in video compression, compression of 3-D and 4-D medical images, volumetric image compression, video watermarking and many other applications [1]–[6]. The generic structure for the computation of multilevel 3-D DWT based on the popularly used separable approach is shown in Fig. 1, where the intra-frame DWT is performed row-wise and then column-wise by the row-processor and the column-processor, respectively, and the inter-frame computation is performed by the temporal-processor. As shown in the figure, the 3-D DWT structure comprises two types of hardware components: (i) a combinational component and (ii) a memory/storage component. The combinational component consists mainly of arithmetic circuits and multiplexors, and the memory component consists of a frame-memory, a temporal-memory, registers and a transposition-memory. The frame-memory is usually external to the chip, while the temporal-memory may be either on-chip or external. The on-chip transposition-memory stores the intermediate values resulting from the row processing, while the temporal-memory stores the intermediate values resulting from the column processing of a set of successive frames. The frame-memory is used for storing the low-low-low (LLL) subband to compute the

    Manuscript submitted on January 26, revised on 23 May and 6 July 2011. This paper was recommended by Associate Editor Tong Zhang.

    B. K. Mohanty is with the Dept. of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226 (email: [email protected]).

    P. K. Meher is with the Department of Embedded Systems, Institute for Infocomm Research, 1 Fusionopolis Way, Singapore-138632 (email: [email protected]star.edu.sg, URL: http://www.ntu.edu.sg/home/aspkmeher/).


    [Figure 1: block diagram. Recoverable block labels: row processor, transposition memory, column processor, frame memory, low-low-low subband.]

    Fig. 1. Generic structure of multi-level 3-D DWT computation

    multilevel 3-D DWT level-by-level [7]. The size of the frame-memory is $MNR$, while the size of the temporal-memory is $KMN$ and that of the transposition-memory is $KN$, where $M$ and $N$ are the height and width of each video frame, $R$ is the number of frames in a group of frames (GOF) and $K$ is the order of the wavelet filter.

    In general, the complexity of the combinational component of the 3-D DWT structure depends on the filter order which

    is usually small, while the complexity of memory component depends on the frame-size. Since the frame-size for practical

    video applications may vary from $176 \times 144$ (for low-end mobile phones) to $1920 \times 1080$ (for HDTV), the computation of

    3-D DWT is highly memory-intensive and it is a challenging task to implement the complete 3-D DWT in a single chip. But,

    on the other hand the communication with external memory heavily degrades the speed and power performance of the whole

    system.

    A few computing schemes and architectures have been suggested to reduce the memory requirement of 3-D DWT [8]–[10], [13], [15], [16]. A more detailed discussion of these designs is given in [16]. The existing computation schemes are still found to involve very large memory. Also, it is observed that the existing block-by-block methods degrade the PSNR quality due to blocking artifacts, since the transformation of an infinite sequence of video frames into independent GOFs introduces noise at the boundaries, which results in a loss of PSNR. To avoid this loss, the DWT needs to be performed continuously on the infinite video sequence (called the running 3-D DWT) [11]. Keeping this in view, Das et al. [12]–[14] have suggested a scan-based architecture for running 3-D DWT of infinite GOFs. But the memory requirement of this structure is significantly higher than that of [15].

    The transform coder should decompose the 3-D signal into multiple levels to achieve a higher compression ratio. But most of the existing designs [12]–[14] compute only 1-level running 3-D DWT of the infinite video sequence. The structure of


    [15], however, computes the multilevel 3-D DWT by a level-by-level approach (similar to the folded scheme proposed by Wu et al. [7]) using an external frame-buffer of size $(MNR)/8$. Other existing structures, which transform the 3-D signal into independent GOFs, can compute the multilevel 3-D DWT in time-multiplexed form in a folded structure. But the multilevel 3-D DWT of infinite GOFs cannot be implemented by such a folded method using a limited frame-buffer. Since the sizes of the transposition-memory and temporal-memory remain unchanged with the input-block size, they could be utilized efficiently by calculating more DWT coefficients concurrently. The frame-buffer size could be reduced by using separate computing blocks for different decomposition levels. To take care of this issue, we have recently proposed a parallel architecture for multilevel 3-D DWT [16], which computes multilevel running 3-D DWT on infinite GOFs and overcomes the limitations of other existing structures. However, it has some inherent problems associated with the selection of the input-block size ($Q$) for a given frame. The input-block size needs to be an integer multiple of the frame width ($N$), and to achieve 100% hardware utilization efficiency (HUE), the minimum block size for $J$-level DWT is $2^{3J-2}$. For example, the HUE of the pipeline structure for $J = 3$ is 100% for $Q = 128$, 98.64% for $Q = 64$, and 91.25% for $Q = 16$. From this observation we can infer that the larger the input block-size, the better the resource utilization of the structure. But the structure demands more device resources and I/O for larger block-sizes. If the application is resource-constrained, it can, however, be implemented for smaller block-sizes with less than 100% HUE. To overcome the above difficulties, we propose here an alternative approach to compute multilevel running 3-D DWT on infinite GOFs. We have derived a pipeline architecture using the proposed scheme, which involves less on-chip and off-chip memory than the existing structures. Interestingly, the input block-size of the proposed structure is independent of the DWT levels. The key ideas we have used in our current approach are:

    • While grouping the frames, some boundary frames of consecutive groups of frames are overlapped in order to avoid the temporal memory, which would otherwise have been needed by the temporal processor.

    • Row-column processing of each input frame is scheduled suitably, such that the row-processor generates the required intermediate results to be consumed by the column-processor without transposition.

    • Processing of overlapping frames involves some redundant computation. The number of frames to be overlapped at


    by $(K-2)$ frames. As shown in Fig. 2(b), due to frame overlapping, the row and column DWT computation corresponding to the overlapping frames, however, necessitates some redundant computation. The transposition- and temporal-memory of 1-level 3-D DWT can be avoided completely at the cost of $(K-2)$ extra pairs of row and column processors to process the overlapping frames. In spite of using additional row- and column-processors, overlapped grouping of frames has the potential to provide substantial saving in hardware by eliminating the transposition-memory of $O(N)$ and temporal-memory of $O(MN)$, since the size of the wavelet filters is usually very small compared to the frame size of all practical videos.

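    TABLE I
    INPUT AND OUTPUT FRAMES TO ELIMINATE TEMPORAL-MEMORY

    | One output frame: IF | One output frame: OF | Group of output frames: IF | Group of output frames: OF |
    |---|---|---|---|
    | $IF_1$ to $IF_K$ | $OF_1$ | $IF_1$ to $IF_{K+2}$ | $OF_1$ and $OF_2$ |
    | $IF_3$ to $IF_{K+2}$ | $OF_2$ | $IF_1$ to $IF_{K+4}$ | $OF_1$ to $OF_3$ |
    | $IF_5$ to $IF_{K+4}$ | $OF_3$ | $IF_1$ to $IF_{K+6}$ | $OF_1$ to $OF_4$ |
    | : | : | : | : |
    | $IF_n$ to $IF_{n+K-1}$ | $OF_K$ | $IF_1$ to $IF_{3K-2}$ | $OF_1$ to $OF_K$ |

    LEGEND: IF: input-frame, OF: output-frame, $n = 2K-1$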

    TABLE II
    NUMBER OF OVERLAPPING INPUT FRAMES FOR PRODUCING OUTPUT FRAMES WITH DIFFERENT OVERLAPPING SIZE

    | Number of input frames | Number of output frames | Overlapping input frames | Overlapping output frames |
    |---|---|---|---|
    | $K$ | 1 | $K-2$ | 0 |
    | $3K-2$ | $K$ | $K-2$ | 0 |
    | | | $K$ | 1 |
    | | | $K+2$ | 2 |
    | | | : | : |
    | | | $3K-6$ | $K-2$ |

    The input frames to be processed in parallel for temporal-memory-free computation are given in Table I. From Table I, we can find that the groups of input frames are required to be overlapped by $K-2$ frames. Note that successive groups of output frames in this case do not overlap. To produce overlapped groups of output frames, the number of overlapping frames of the input GOFs needs to be increased. Table II shows the required number of input frames to be overlapped to produce the overlapped GOFs. Using Table I and Table II, one can find the number of frames in a GOF to be processed in parallel to avoid temporal-memory in multilevel 3-D DWT computation. The number of overlapping frames for decomposition level-1 and level-2 is shown in Table III. The overlapping input frames of level-1 DWT are shown in grey color in Fig. 2(b) and those of level-2 DWT in Fig. 3.
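    The frame bookkeeping of Tables I to III can be checked with a small sketch. The function names below are hypothetical and are not part of the paper; the index formulas are read off the tables (output frame $k$ uses input frames $2k-1$ to $2k+K-2$, and each overlapping output frame costs two extra overlapping input frames).

    ```python
    def input_frames_for_output(k, K):
        """1-based indices of the K input frames that feed output frame k (Table I, left half)."""
        first = 2 * k - 1
        return list(range(first, first + K))


    def input_group_for_outputs(num_outputs, K):
        """Input frames IF_1 .. IF_{K + 2(num_outputs - 1)} needed to produce
        output frames OF_1 .. OF_num_outputs in parallel (Table I, right half)."""
        last = K + 2 * (num_outputs - 1)
        return list(range(1, last + 1))


    def input_overlap(output_overlap, K):
        """Number of overlapping input frames between successive groups when the
        output groups overlap by `output_overlap` frames (Table II)."""
        return K - 2 + 2 * output_overlap


    if __name__ == "__main__":
        K = 4  # Daub-4, used here only as an example
        print(input_frames_for_output(1, K), input_frames_for_output(2, K))
        print(len(input_group_for_outputs(K, K)), input_overlap(K - 2, K))
    ```

    For $K = 4$ this gives a level-1 group of $3K-2 = 10$ input frames overlapped by $3K-6 = 6$ frames, which is the level-1 row of Table III.
    
    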


    TABLE III
    OVERLAPPING INPUT FRAMES FOR TWO LEVEL DECOMPOSITION

    | Decomposition level | Number of input frames | Number of overlapping frames | Number of output frames |
    |---|---|---|---|
    | level-1 | $3K-2$ | $3K-6$ | $K$ |
    | level-2 | $K$ | $K-2$ | 1 |

    TABLE IV
    COMPARISON OF MEMORY SAVING AND HARDWARE COST INVOLVED BY THE 3-D DWT COMPUTING STRUCTURE USING THE PROPOSED OVERLAPPING GROUP OF FRAMES FOR DIFFERENT SIZE WAVELET FILTERS

    | Filter ($K$) | Level-1 GOF | Level-1 OVF | Level-1 Memory | Level-1 Multiplier | Level-1 Adder | Level-2 GOF | Level-2 OVF | Level-2 Memory | Level-2 Multiplier | Level-2 Adder |
    |---|---|---|---|---|---|---|---|---|---|---|
    | General ($K$) | $K$ | $K-2$ | $KN(M+1)$ | $4K(K-2)$ | $4K(K-2)$ | $3K-2$ | $3(K-2)$ | $KN(5M-2K+6)/4$ | $18K(K-2)$ | $18K(K-2)$ |
    | Haar | 2 | 0 | $2N(M+1)$ | 0 | 0 | 4 | 0 | $N(5M+2)/2$ | 0 | 0 |
    | Daub-4 | 4 | 2 | $4N(M+1)$ | 32 | 32 | 10 | 6 | $N(5M-2)$ | 144 | 144 |
    | Daub-6 | 6 | 4 | $6N(M+1)$ | 96 | 96 | 16 | 12 | $3N(5M-6)/2$ | 432 | 432 |
    | Daub-8 | 8 | 6 | $8N(M+1)$ | 192 | 192 | 22 | 18 | $2N(5M-10)$ | 864 | 864 |
    | Daub-10 | 10 | 8 | $10N(M+1)$ | 320 | 320 | 28 | 24 | $5N(5M-14)/2$ | 1440 | 1440 |
    | 5/3 | 5 | 3 | $5N(M+1)$ | 30 | 48 | 13 | 9 | $5N(5M-4)/4$ | 135 | 216 |
    | 9/7 | 9 | 7 | $9N(M+1)$ | 126 | 224 | 25 | 21 | $9N(5M-12)/4$ | 567 | 1008 |

    LEGEND: GOF: group of frames, OVF: overlapping frames. It is assumed that the symmetric property of the 5/3 and 9/7 filter coefficients is used and that these filters are implemented using the convolution method. For the 5/3 and 9/7 filters, $K$ equals 5 and 9, and the number of overlapping frames $(K-2)$ equals 3 and 7, respectively. Since the row processor does not involve any data registers and the column processor involves only $(K-1)$ data registers, we have excluded these registers while estimating the hardware cost, as the complexity of data registers is very small compared to multiplier and adder complexity.

    As shown in Fig. 2(b), $K$ pairs of row and column processors ($K-2$ overlapping frames shown in grey color) perform the necessary computation of $K$ frames to avoid the temporal-memory (of size $KMN$) of level-1 DWT. Similarly, from Fig. 3 we can find that $(4K-2)$ RPs, $(4K-2)$ CPs and $(K+1)$ TPs perform the necessary computation of $(3K-2)$ overlapped GOFs to avoid the temporal-memory of level-1 and level-2. The row-column processing of decomposition level-1 is scheduled in such a way that the row-processor generates the required intermediate results to be consumed by the column-processor without transposition. Using overlapped grouping of frames and the row-column processing scheme, a memory space of $KN(5M-2K+6)/4$ words of 2-level DWT can be avoided at the cost of $4(K-2)$ RPs, $4(K-2)$ CPs and $(K-2)$ TPs.

    To study the efficiency of the proposed frame-overlapping computing scheme, we have estimated the memory (in words) that could be saved and the extra hardware cost (in terms of multipliers and adders) involved to compute the row and column transformation of the overlapping frames. We have assumed the hardware complexity of each RP, CP and TP to be $2K$ multipliers and $2K$ adders, where $K$ is the filter order. The estimated memory that can be saved by using the proposed scheme over the conventional method (frame-by-frame without frame overlapping) and the extra hardware cost involved for different


    TABLE V
    COMPARISON OF SAVING TO OVERHEAD RATIO (SOR) INVOLVED BY THE 3-D DWT COMPUTING STRUCTURE USING OVERLAPPING GROUP OF FRAMES OF DIFFERENT FRAME-SIZES

    | Filter | DWT level | SOR for $176 \times 144$ | SOR for $640 \times 480$ | SOR for $1920 \times 1080$ |
    |---|---|---|---|---|
    | Haar | 1 | | | |
    | Haar | 2 | | | |
    | Daub-4 | 1 | 127.5 | 1540.2 | 10385.6 |
    | Daub-4 | 2 | 33.6 | 408 | 2755.8 |
    | Daub-6 | 1 | 63.7 | 770 | 5192.4 |
    | Daub-6 | 2 | 16.73 | 203.8 | 1377.3 |
    | Daub-8 | 1 | 42.5 | 513.3 | 3461.6 |
    | Daub-8 | 2 | 11.3 | 138.2 | 935.2 |
    | Daub-10 | 1 | 31.8 | 385 | 2596.2 |
    | Daub-10 | 2 | 8.8 | 107.9 | 730.7 |
    | 5/3 | 1 | 151.3 | 1826.7 | 12317.5 |
    | 5/3 | 2 | 39.9 | 486 | 3284 |
    | 9/7 | 1 | 62.7 | 758 | 5111.6 |
    | 9/7 | 2 | 16.45 | 201.4 | 1363.4 |

    SOR is defined as (saving in memory words) / (combinational overhead cost), where the combinational overhead cost represents the sum of the multipliers and adders required to process overlapping frames. Memory words and hardware cost are measured in terms of transistor counts.

    wavelet filters are listed in Table IV. Since the frame-size of a practical video could be as high as $1920 \times 1080$ (screen size of HDTV), a significant amount of chip area could be saved by eliminating the transposition- and temporal-memory of the 3-D structure using the proposed scheme. It can be found from Table IV that the amount of memory saving offered by the proposed scheme for 2-level DWT is nearly 25% more than that of 1-level DWT, but 2-level DWT involves nearly 4.5 times the extra hardware cost of 1-level DWT. This is mainly due to the reduction in frame-size by a factor of 4 after every decomposition level, while the number of overlapping frames required to avoid the temporal-memory increases steadily by $(2K-4)$ frames for every higher level. The size of the group of frames also increases by $(2^{j-1}K)$ times after every higher level of DWT, where $j$ is the DWT level.
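    The "nearly 25% more memory saving for 4.5 times the hardware" observation follows directly from the general-row formulas of Table IV. The sketch below evaluates them for non-symmetric filters (Haar and the Daubechies family); the function names are hypothetical, and the 5/3 and 9/7 rows, which exploit coefficient symmetry, are deliberately not covered.

    ```python
    from fractions import Fraction


    def level1_saving_and_cost(K, M, N):
        """Memory words saved and extra multipliers (equal to extra adders)
        for 1-level DWT, using the general-row formulas of Table IV."""
        memory_words = K * N * (M + 1)
        extra_units = 4 * K * (K - 2)
        return memory_words, extra_units


    def level2_saving_and_cost(K, M, N):
        """Same quantities for 2-level DWT (cumulative over both levels)."""
        memory_words = Fraction(K * N * (5 * M - 2 * K + 6), 4)
        extra_units = 18 * K * (K - 2)
        return memory_words, extra_units


    if __name__ == "__main__":
        K, M, N = 4, 144, 176  # Daub-4 on a 176 x 144 frame, as an example
        mem1, cost1 = level1_saving_and_cost(K, M, N)
        mem2, cost2 = level2_saving_and_cost(K, M, N)
        print(float(mem2 / mem1), cost2 / cost1)
    ```

    For large $M$ the memory ratio tends to $5/4$, while the cost ratio is exactly $18/4 = 4.5$ for any $K > 2$.
    
    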

    To measure the memory saving against the combinational overhead of the proposed scheme, we define the saving-to-overhead ratio (SOR). Since the complexities of memory and arithmetic components are widely different, we have estimated the saving of memory and the combinational overhead cost in terms of transistor counts to measure the SOR, which is defined as the ratio of the memory saved to the combinational overhead cost. For the 3-D DWT structure, we assume that the input pixels are 8-bit and all the intermediate and final output signals are 12-bit. A few of the multipliers of the 1-level and 2-level structures are of 8-bit size, whereas all other components are of 12-bit size. Out of the $4K(K-2)$ multipliers of level-1, $2K(K-2)$ multipliers are of 8-bit size. Similarly,


    out of the $18K(K-2)$ multipliers of the 2-level structure, $6K(K-2)$ are of 8-bit size. The transistor counts of an 8-bit multiplier, a 12-bit multiplier, a 12-bit adder and a 12-bit SRAM word are taken to be 1178, 1674, 372 and 72 transistors, respectively. Using these values, we have estimated the SOR for different wavelet filters and frame-sizes. The values are listed in Table V. It can be found from Table V that the SOR is maximum for the Haar wavelet ($K = 2$), as in this case the entire transposition- and temporal-memory could be eliminated from the 3-D structure without any redundant computation. For larger frame-sizes and low-order filters like Daub-4 and 5/3, the SOR is significantly higher. We find that the SOR of the 2-level DWT is nearly 73% less than that of the 1-level DWT on average for different wavelet filters and frame-sizes. Keeping these facts in mind, we outline the proposed method to derive a memory-efficient hardware structure to compute multilevel running 3-D DWT:

    • Low-order wavelet filters like Haar, Daub-4, Daub-6 and 5/3 should be preferred for 3-D DWT if they meet the desired SNR specification of the target application.

    • The proposed frame-overlapping processing scheme should be applied to eliminate the temporal-memory of level-1 only, to get the maximum advantage of the scheme.

    • Computation of higher DWT levels may be partitioned and appropriately scheduled to utilize the resources effectively.

    Although the hardware cost is marginal (less than 2% of the memory saved) if we apply redundant computation to level-1 only and use the wavelet filters Daub-4, Daub-6, 5/3 and 9/7, the hardware cost could be reduced further by implementing the multipliers using some low-complexity design method. In this work, we have considered the Daubechies 4-tap (Daub-4) wavelet filter as an example to derive the proposed structure. However, similar structures could be derived for other wavelet filters as well. We also suggest an efficient design for the implementation of Daubechies wavelet filters for $K = 4$.
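    The SOR values discussed above can be approximated from the stated transistor counts and the Table IV formulas. The sketch below is a reconstruction under assumptions, not the authors' script: the function name is hypothetical, only non-symmetric filters are handled, the first number of the frame size is taken as $M$ (this choice gave the closest agreement with Table V in a spot check), and a zero overhead is reported as infinity by convention. Small differences from the tabulated values should be expected.

    ```python
    # Transistor counts stated in the text.
    T_MUL8, T_MUL12, T_ADD12, T_SRAM12 = 1178, 1674, 372, 72


    def sor(K, M, N, level):
        """Saving-to-overhead ratio for a K-tap non-symmetric filter.

        Memory saving and unit counts follow the general row of Table IV;
        the split between 8-bit and 12-bit multipliers follows the text.
        """
        ovf = K - 2  # overlapping frames at level-1
        if level == 1:
            words = K * N * (M + 1)
            mul8, mul12, adders = 2 * K * ovf, 2 * K * ovf, 4 * K * ovf
        elif level == 2:
            words = K * N * (5 * M - 2 * K + 6) / 4
            mul8, mul12, adders = 6 * K * ovf, 12 * K * ovf, 18 * K * ovf
        else:
            raise ValueError("only levels 1 and 2 are modelled")
        overhead = mul8 * T_MUL8 + mul12 * T_MUL12 + adders * T_ADD12
        if overhead == 0:
            return float("inf")  # Haar: no redundant computation at all
        return words * T_SRAM12 / overhead


    if __name__ == "__main__":
        for level in (1, 2):
            print(level, round(sor(4, 176, 144, level), 1))
    ```

    With these assumptions, Daub-4 at frame-size $176 \times 144$ evaluates to roughly 127.6 for level-1 and 33.6 for level-2, close to the corresponding Table V entries, and the level-2 value is about 74% lower than the level-1 value.
    
    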

    III. PROPOSED ARCHITECTURES FOR 3-LEVEL 3-D DWT

    The proposed structure for the implementation of 3-level 3-D DWT is shown in Fig.4. It consists of three processing units

    (PUs). PU-1 performs the computation of first-level DWT, while PU-2 computes only row and column DWT of the second-level.

    PU-3 computes temporal DWT of the second-level and the entire computation of third-level in time-multiplexed form.

    PU-1 receives four input blocks of four successive parallel frames in every cycle from the input buffer. The input blocks


    are fed to the structure as per the order shown in Fig. 5. As shown in Fig. 5, each input block contains 6 consecutive samples of a particular row. The input block $I(m_1, m_2, n_3)$ corresponds to the $m_2$-th row of the $m_1$-th frame and contains the samples $\{x(m_1, m_2, 4n_3+5), x(m_1, m_2, 4n_3+4), x(m_1, m_2, 4n_3+3), x(m_1, m_2, 4n_3+2), x(m_1, m_2, 4n_3+1), x(m_1, m_2, 4n_3)\}$, for $0 \le m_2 \le M-1$, $0 \le n_3 \le (N/4)-1$ and $m_1 = 0, 1, 2, 3, \ldots$. The adjacent input blocks of a particular row are overlapped by 2 samples. Suppose that in the first cycle the first input block of the first row of a frame is fed; then during the second cycle, the first input block of the second row is fed to the structure, such that the first input blocks of all the $M$ rows of a particular frame are fed in $M$ cycles, and in the next set of $M$ cycles, the second input blocks of all the $M$ rows are fed to the structure. The entire set of $MN/4$ input blocks of a particular frame is fed to the structure in $MN/4$ cycles. Input blocks of four successive frames are fed to the structure in parallel in $MN/4$ cycles. Successive groups of frames are overlapped by 2 frames. During the first set of $MN/4$ cycles, input blocks of the first GOF $(F_1, F_2, F_3, F_4)$ are fed to the structure, and in the next set of $MN/4$ cycles, input blocks from the GOF $(F_3, F_4, F_5, F_6)$ are fed. In this manner, input blocks of infinite GOFs are fed to PU-1 continuously to compute first-level 3-D DWT.
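    The feeding order described above can be sketched as an index generator. The function names are hypothetical, and the sketch only produces indices: the last block of each row reaches two columns past the row end, and how the structure handles that boundary is not specified in this passage, so it is left untouched here.

    ```python
    def block_sample_indices(n3):
        """Column indices of the 6 samples of block n3, newest first:
        4*n3 + 5 down to 4*n3."""
        return [4 * n3 + j for j in range(5, -1, -1)]


    def block_schedule(M, N):
        """Yield (cycle, row m2, block n3) in feeding order for one frame:
        block n3 of every row is fed before block n3 + 1 of any row."""
        cycle = 0
        for n3 in range(N // 4):
            for m2 in range(M):
                yield cycle, m2, n3
                cycle += 1


    def gof_frames(g):
        """1-based frame numbers of group g (g = 0, 1, ...): groups of four
        frames, successive groups overlapped by two frames."""
        first = 2 * g + 1
        return [first, first + 1, first + 2, first + 3]


    if __name__ == "__main__":
        print(block_sample_indices(0), block_sample_indices(1))
        print(list(block_schedule(2, 8)))
        print(gof_frames(0), gof_frames(1))
    ```

    Adjacent blocks share the two column indices $4n_3+4$ and $4n_3+5$, and one frame takes $MN/4$ cycles, matching the description.
    
    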

    [Figure 4: block diagram. Recoverable labels: input blocks from the frame buffer (6 samples/cycle), PU-1, PU-2, PU-3, output.]

    Fig. 4. Proposed structure for computation of 3-level 3-D DWT. $z^1_{lh}$, $z^1_{hl}$ and $z^1_{hh}$, respectively, represent $(z^1_{lhl}, z^1_{lhh})$, $(z^1_{hll}, z^1_{hlh})$ and $(z^1_{hhl}, z^1_{hhh})$. $v^2_l$ and $v^2_h$, respectively, represent $(v^2_{ll}, v^2_{hl})$ and $(v^2_{lh}, v^2_{hh})$. Output represents $(z^2_{llh}, z^2_{lhl}, z^2_{lhh})$, $(z^2_{hll}, z^2_{hlh}, z^2_{hhl}, z^2_{hhh})$ or $(z^3_{lll}, z^3_{llh}, z^3_{lhl}, z^3_{lhh})$, $(z^3_{hll}, z^3_{hlh}, z^3_{hhl}, z^3_{hhh})$.

    [Figure 5: input data format. Recoverable labels: frames F1 to F6 arranged as GOF-1 and GOF-2, sample rows x00 ... x07 through x70 ... x77, 6 samples/cycle, first set of MN/4 cycles, second set of MN/4 cycles.]

    Fig. 5. Data input format of the proposed structure. Grey color boxes represent the overlap area of the adjacent blocks, while the overlapping frames are shown in violet color.


    2), $x(2n_1-i, m_2, 4n_3+1)$, $x(2n_1-i, m_2, 4n_3)$} and computes a pair of intermediate coefficients $u_l(2n_1-i, m_2, 2n_3-1)$ and $u_h(2n_1-i, m_2, 2n_3-1)$. During the same period, the first subcell-1 receives the last four samples $\{x(2n_1-i, m_2, 4n_3+5), x(2n_1-i, m_2, 4n_3+4), x(2n_1-i, m_2, 4n_3+3), x(2n_1-i, m_2, 4n_3+2)\}$ of the input block and computes the intermediate coefficients $u_l(2n_1-i, m_2, 2n_3)$ and $u_h(2n_1-i, m_2, 2n_3)$. Note that the successive output samples of subcell-1 belong to the same column, and these intermediate coefficients can be processed directly by subcell-2 for column DWT. In each cycle, subcell-2 receives a pair of intermediate coefficients from the corresponding subcell-1 and computes the column-DWT in time-multiplexed form to take advantage of the down-sampled filter computation. The structure of subcell-2 is similar to the structure of subcell-2 of [16] (see Fig. 4 and Fig. 5 of [16]), except that each shift-register (SR) in this case is replaced with a register (R).

    After a latency of 3 cycles, each subcell-2 produces a pair of subband components $(v_{ll}/v_{hl})$ and $(v_{lh}/v_{hh})$ in each cycle, such that if it produces one component each of the pair of subbands $v_{ll}$ and $v_{lh}$ during the even-numbered cycles, then it produces components of the other two subbands ($v_{hl}$ and $v_{hh}$) during the odd-numbered cycles. Both subcell-1 and subcell-2 work in separate pipeline stages and perform the DWT computation concurrently. Each PE calculates the DWT components of a pair of columns of each of the four subband components of a given frame in $M$ cycles, where the components of the subbands $(v_{ll}, v_{lh})$ and $(v_{hl}, v_{hh})$ are obtained in time-multiplexed form. The $(i+1)$-th PE, therefore, completes the first-level decomposition of the $(2n_1-i)$-th frame of size $(M \times N)$ in $MN/4$ cycles with an initial latency of 3 cycles.
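    As a functional illustration of what a pair of subcell-1 units computes on one 6-sample block, the sketch below applies the standard Daub-4 low-pass and high-pass taps to the two overlapping 4-sample windows. This is a behavioural model only, not the hardware design: the function names are hypothetical, and the tap-to-sample alignment (newest sample paired with the first tap) is an assumption, since the part of the paper defining it is not in this passage.

    ```python
    import math

    _S3 = math.sqrt(3.0)
    _D = 4.0 * math.sqrt(2.0)
    # Standard Daubechies 4-tap low-pass coefficients.
    H = [(1 + _S3) / _D, (3 + _S3) / _D, (3 - _S3) / _D, (1 - _S3) / _D]
    # High-pass taps by the quadrature-mirror relation g[k] = (-1)^k h[3 - k].
    G = [(-1) ** k * H[3 - k] for k in range(4)]


    def subcell1(window):
        """Return (u_l, u_h) for one window of 4 samples (assumed newest first)."""
        if len(window) != 4:
            raise ValueError("subcell-1 model expects exactly 4 samples")
        u_l = sum(h * x for h, x in zip(H, window))
        u_h = sum(g * x for g, x in zip(G, window))
        return u_l, u_h


    def process_block(block):
        """block = [x(4n3+5), ..., x(4n3)], newest first.

        Returns the coefficient pairs with row index 2*n3 - 1 (from the four
        oldest samples) and 2*n3 (from the four newest samples)."""
        if len(block) != 6:
            raise ValueError("input block must hold 6 samples")
        older = subcell1(block[2:6])
        newer = subcell1(block[0:4])
        return older, newer


    if __name__ == "__main__":
        print(process_block([6.0, 5.0, 4.0, 3.0, 2.0, 1.0]))
    ```

    The two windows are offset by two samples, which is the down-sampling by 2, and the 2-sample overlap between adjacent blocks is what lets each block be processed without a line buffer.
    
    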

    The adjacent PEs of PU-1 generate DWT components corresponding to two successive frames. DWT components of four successive frames are obtained from four PEs, such that 2 columns of DWT components of a pair of subbands $(v^1_{ll}, v^1_{lh})$ or $(v^1_{hl}, v^1_{hh})$ of four successive frames are obtained in parallel from the four PEs. Down-sampled filter computations are performed on each of the subband coefficients generated by the PEs for temporal (inter-frame) DWT. Temporal DWT computations can be performed using subcell-1. PU-1, therefore, uses four subcell-1 units (see Fig. 6) to calculate the temporal DWT of the columns of the subbands $(v^1_{ll}, v^1_{lh})$ or $(v^1_{hl}, v^1_{hh})$ of four successive frames concurrently. Out of these four subcells, the first and third subcell-1, respectively, calculate the temporal DWT of the even- and odd-numbered columns of the subbands ($v^1_{ll}$ or $v^1_{hl}$), while the second and the fourth subcell-1, respectively, calculate the temporal DWT of the even- and odd-numbered columns of the subbands ($v^1_{lh}$ or $v^1_{hh}$)


    in time-multiplexed form. Each subcell-1 calculates a pair of components corresponding to two subbands of the 3-D transform in every cycle, such that in every cycle a pair of components of two adjacent columns of four subbands $(z^1_{lll}, z^1_{llh}, z^1_{lhl}, z^1_{lhh})$ or $(z^1_{hll}, z^1_{hlh}, z^1_{hhl}, z^1_{hhh})$ is obtained from the four subcells. A pair of components of two adjacent columns of all the eight oriented selective subbands of 1-level 3-D DWT is obtained in a couple of cycles. Two columns of each of the eight subbands are obtained from PU-1 in $M$ cycles, and the entire coefficient matrix of 1-level 3-D DWT of the input frames of size $(M \times N)$ can be obtained in $MN/4$ cycles with an initial latency of 4 cycles.
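    A back-of-the-envelope estimate of the clock rate implied by this cycle count can be sketched as follows. This is not a figure from the paper: the function names are hypothetical, the estimate ignores the initial latency and any constraint imposed by PU-2 and PU-3, and it assumes that each group of four frames advances the sequence by two new frames (groups overlapped by 2, as in the Daub-4 case described earlier).

    ```python
    def level1_cycles_per_group(M, N):
        """Cycles PU-1 needs for one group of frames: MN/4."""
        if N % 4:
            raise ValueError("N is assumed to be a multiple of 4")
        return M * N // 4


    def min_clock_hz(M, N, fps, new_frames_per_group=2):
        """Lowest clock frequency that keeps up with the input frame rate,
        assuming each group contributes `new_frames_per_group` new frames."""
        groups_per_second = fps / new_frames_per_group
        return groups_per_second * level1_cycles_per_group(M, N)


    if __name__ == "__main__":
        # A 176 x 144 frame at 60 fps, with M = 144 rows and N = 176 columns.
        print(level1_cycles_per_group(144, 176), min_clock_hz(144, 176, 60))
    ```

    The very low figure for small frames only says that the cycle budget is not the bottleneck at this frame size; achievable throughput depends on the cycle period of the synthesized design.
    
    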

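    [Figure 8: block diagram. Recoverable labels: inputs $z^1_{lll}(n_1, n_2, 2n_3)$ and $z^1_{lll}(n_1, n_2, 2n_3-1)$, input-delay-unit (IDU) with shift-registers SR1 and SR2 and registers R, subcell-1 producing $u^2_l(n_1, n_2, n_3)$ and $u^2_h(n_1, n_2, n_3)$, two DMUXes, Output-1, Output-2.]

    Fig. 8. Structure of PU-2. Output-1 represents $v^2_{ll}(n_1, m_2, n_3)$ or $v^2_{hl}(n_1, m_2, n_3)$, and Output-2 represents $v^2_{lh}(n_1, m_2, n_3)$ or $v^2_{hh}(n_1, m_2, n_3)$, where $0 \le m_2 \le (M/4)-1$, $0 \le n_3 \le (N/4)-1$ and $0 \le n_2 \le (M/2)-1$.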

    Components of the subband $z^1_{lll}$ are sent to PU-2 to calculate the DWT components of the second level. PU-2 receives a pair of components from PU-1 corresponding to a pair of adjacent columns of $z^1_{lll}$ in every cycle after a gap of one cycle. The structure of PU-2 is shown in Fig. 8. It consists of one input-delay-unit (IDU) and one subcell-1. Subcell-1 in this case performs the row and column computations pertaining to 2-level DWT in time-multiplexed form. The components of $z^1_{lll}$ of a particular frame are fed to subcell-1 through the IDU of PU-2 block-by-block, similar to 1-level processing. The input block $I(n_1, n_2, n_3)$ in this case contains four consecutive samples $\{z^1_{lll}(n_1, n_2, 2n_3+3), z^1_{lll}(n_1, n_2, 2n_3+2), z^1_{lll}(n_1, n_2, 2n_3+1), z^1_{lll}(n_1, n_2, 2n_3)\}$ for $0 \le n_2 \le M/2-1$, $0 \le n_3 \le N/4-1$. Input blocks are fed to subcell-1 (see Fig. 8) column-wise after a gap of one cycle, such that one column of input blocks of a particular frame of $z^1_{lll}$ is fed to subcell-1 in $M$ cycles and the input blocks of one complete frame in $MN/4$ cycles. One column of input blocks is derived from four successive columns of $z^1_{lll}$. Since PU-2 receives components of two adjacent columns of $z^1_{lll}$ from PU-1, the two previous columns of $z^1_{lll}$ are required to be stored to derive the required

  • 8/3/2019 05958631

    14/26

    Copyright (c) 2011 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

    FOR IEEE TRANSACTIONS ON SIGNAL PROCESSING 14

    input blocks. The IDU, therefore, contains 2 shift-registers (SRs) (of size M/2 words each). The SRs also help to calculate

    downsampled filter computation along the row direction. A pair of DWT components u2l (n1, n2, n3), and u2h(n1, n2, n3) are

    obtained from the subcell in every alternate cycle. Note that the successive output samples (u2l (n1, n2, n3), and u2h(n1, n2, n3))

    are corresponding to successive columns of intermediate coefficient matrix ( [u2l ] and [u

    2h]). The column-DWT can be performed

    on the components of ([u2l ] and [u2h]) immediately in the next cycle. Since subcell-1 of PU-2 receives the input blocks of z

    1lll

    only during alternate cycles, it remains idle for one cycle after every input cycle of I(n1, n2, n3). The idle cycles of subcell-1

    can be utilized by assigning down-sampled filter computation of ([u2l ] and [u2h]) in time-multiplexed form. The samples of

    u2l , and u2h are passed through separate delay-path to provide the column delay necessary for the filter computation. All the

    registers and shift registers of IDU are clocked by CLK2 whose frequency is half of the frequency of CLK1 used by PU-1.

    The 4 multiplexors (MUX) of IDU select the delayed samples ofu2l , and u2h alternately and fed them to the subcell during its

    idle cycles. A pair of DWT components of two subbands (v2ll, v2lh) or (v

    2hl, v

    2hh) of a particular frame ofz

    1lll are obtained from

    PU-2 after every couple of cycles and one component each of four subbands in every four cycles. One column of each of the

    four subbands are obtained in M cycles and subband components of a complete frame in NM/4 cycles. The components of

    subbands (v2ll, v2lh) or (v

    2hl, v

    2hh) are sent to PU-3 to calculate inter-frame DWT computation of 2-level decomposition.
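The block formation described above can be illustrated in software. The following is only a minimal sketch: the function name `input_blocks` is hypothetical, and the periodic wrap-around at the end of a row is an assumption, since the text does not state how the boundary is handled. It shows that successive blocks of four samples advance by two positions, which is what makes two stored columns sufficient.

```python
def input_blocks(samples):
    """Yield 4-sample blocks taken with a stride of 2 over one row of samples.

    Block n3 holds samples 2*n3 .. 2*n3+3, listed newest-first as in the text.
    Indices past the end of the row wrap around (periodic extension; this
    boundary rule is an assumption, not taken from the paper).
    """
    length = len(samples)
    for n3 in range(length // 2):
        yield tuple(samples[(2 * n3 + k) % length] for k in (3, 2, 1, 0))


# One row of a first-level LLL subband with 8 samples gives 4 blocks.
row = list(range(8))
blocks = list(input_blocks(row))
```

Each block shares its two oldest samples with the previous block, so only two new samples are needed per block.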

    Fig. 9. Structure of frame-buffer-1.

    To calculate the temporal DWT of the second level, subbands of 3 successive frames are stored in frame-buffer-1. The structure of frame-buffer-1 is shown in Fig. 9. It consists of 7 memory-blocks (MBs) and four 2-to-1 line MUXs. Each MB is of size $MN/8$ words. Components of a pair of subbands ($v^2_{ll}$, $v^2_{hl}$) or ($v^2_{lh}$, $v^2_{hh}$) corresponding to a particular frame are stored in alternate MBs, such that the components of ($v^2_{ll}$, $v^2_{hl}$) are stored in even-numbered MBs and those of ($v^2_{lh}$, $v^2_{hh}$) are stored in odd-numbered MBs. One extra MB is used to store one extra frame of ($v^2_{lh}$, $v^2_{hh}$) to provide one complete frame-delay


    Fig. 10. Structure of PU-3. Inputs $v^2_l$ and $v^2_h$, respectively, represent ($v^2_{ll}$, $v^2_{hl}$) and ($v^2_{lh}$, $v^2_{hh}$). Intermediate results $v^3_l$ and $v^3_h$, respectively, represent ($v^3_{ll}$, $v^3_{hl}$) and ($v^3_{lh}$, $v^3_{hh}$). Outputs $z^2_l$ and $z^2_h$, respectively, represent ($z^2_{hll}$, $z^2_{lhl}$, $z^2_{hhl}$) and ($z^2_{llh}$, $z^2_{lhh}$, $z^2_{hlh}$, $z^2_{hhh}$). Similarly, outputs $z^3_l$ and $z^3_h$, respectively, represent ($z^3_{lll}$, $z^3_{hll}$, $z^3_{lhl}$, $z^3_{hhl}$) and ($z^3_{llh}$, $z^3_{lhh}$, $z^3_{hlh}$, $z^3_{hhh}$).

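    TABLE VI
    TIMING SCHEDULE FOR MULTIPLEXING DWT COMPUTATION IN THE SUBCELL OF PU-3

    | DWT | SB | Clock cycles |
    |---|---|---|
    | 2-level temporal | $v^2_{ll}$ | $2\alpha m_1 + 2n + 1 + \delta_1$ |
    | 2-level temporal | $v^2_{hl}$ | $2\alpha m_1 + 2n + 3 + \delta_1$ |
    | 2-level temporal | $v^2_{lh}$ | $(2m_1 + 1)\alpha + 2n + 1 + \delta_1$ |
    | 2-level temporal | $v^2_{hh}$ | $(2m_1 + 1)\alpha + 2n + 3 + \delta_1$ |
    | 3-level column | $z^2_{lll}$ | $2\alpha m_1 + 4n + 2 + \delta_2$ |
    | 3-level column | $z^2_{lll}$ | $2\alpha m_1 + 4n + 2 + \delta_2$ |
    | 3-level row | $u^3_l$ | $2\alpha m_1 + 2Mm_2 + 8n + 4 + \delta_3$ |
    | 3-level row | $u^3_h$ | $2\alpha m_1 + (2m_2 + 1)M + 8n + 4 + \delta_3$ |
    | 3-level temporal | $v^3_{ll}$ | $4\alpha m_1 + 2Mm_2 + 8n + 6 + \delta_4$ |
    | 3-level temporal | $v^3_{hl}$ | $4\alpha m_1 + M(2m_2 + 1) + 8n + 6 + \delta_4$ |
    | 3-level temporal | $v^3_{lh}$ | $(4m_1 + 2)\alpha + 2Mm_2 + 8n + 6 + \delta_4$ |
    | 3-level temporal | $v^3_{hh}$ | $(4m_1 + 2)\alpha + M(2m_2 + 1) + 8n + 6 + \delta_4$ |

    LEGEND: SB: subband; $\alpha = MN/8$; $n = 0, 1, 2, \ldots, M - 1$; $m_2 = 0, 1, 2, \ldots, (N/8) - 1$; $m_1 = 0, 1, 2, 3, \ldots$; $\delta_1 = 3MN/8$ cycles of delay to fill frame-buffer-1; $\delta_2 = 12$ cycles of delay to fill the delay-path of $z^2_{lll}$; $\delta_3 = 3M$ cycles of delay to fill the shift-registers corresponding to $u^3_l$ or $u^3_h$; $\delta_4 = 6MN/8$ cycles of delay to fill frame-buffer-2.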

    with respect to the subband components of ($v^2_{ll}$, $v^2_{hl}$). The four MUXs of frame-buffer-1 select the frames of ($v^2_{ll}$, $v^2_{hl}$) from the even-numbered MBs and the current frame during each even-numbered set of $MN/8$ cycles, while they select the frames of ($v^2_{lh}$, $v^2_{hh}$) from the odd-numbered MBs during each odd-numbered set of $MN/8$ cycles. The subcell of PU-3 (as shown in Fig. 10) receives a block of 4 samples from frame-buffer-1 through the MUXs during every alternate cycle and calculates the inter-frame DWT of ($v^2_{ll}$, $v^2_{hl}$) and ($v^2_{lh}$, $v^2_{hh}$) in alternate periods of $MN/8$ cycles. The structure of this subcell is identical to the structure of subcell-1 of PU-1. Components of the four subbands ($z^2_{lll}$, $z^2_{llh}$) and ($z^2_{hll}$, $z^2_{hlh}$) are obtained in time-multiplexed form during even-numbered sets of $MN/8$ cycles. Similarly, during the odd-numbered periods of $MN/8$ cycles, components of the other


    TABLE VII
    INPUT-OUTPUT DATA FLOW OF THE SUBCELL OF PU-3

    | Clock cycle | Input | Output-1 | Output-2 | Clock cycle | Input | Output-1 | Output-2 |
    |---|---|---|---|---|---|---|---|
    | $1$ | $v^2_{ll}(0,0,0)$ | $z^2_{lll}(0,0,0)$ | $z^2_{llh}(0,0,0)$ | $\alpha + 1$ | $v^2_{lh}(0,0,0)$ | $z^2_{lhl}(0,0,0)$ | $z^2_{lhh}(0,0,0)$ |
    | $2$ | $z^2_{lll}(0,0,0)$ | $u^3_l(0,0,0)$ | $u^3_h(0,0,0)$ | $\alpha + 2$ | | | |
    | $3$ | $v^2_{hl}(0,0,0)$ | $z^2_{hll}(0,0,0)$ | $z^2_{hlh}(0,0,0)$ | $\alpha + 3$ | $v^2_{hh}(0,0,0)$ | $z^2_{hhl}(0,0,0)$ | $z^2_{hhh}(0,0,0)$ |
    | : | : | : | : | : | : | : | : |
    | $M + 1$ | $v^2_{ll}(0,0,1)$ | $z^2_{lll}(0,0,1)$ | $z^2_{llh}(0,0,1)$ | $\alpha + M + 1$ | $v^2_{lh}(0,0,1)$ | $z^2_{lhl}(0,0,1)$ | $z^2_{lhh}(0,0,1)$ |
    | $M + 2$ | $z^2_{lll}(0,0,1)$ | $u^3_l(0,0,1)$ | $u^3_h(0,0,1)$ | $\alpha + M + 2$ | | | |
    | $M + 3$ | $v^2_{hl}(0,0,1)$ | $z^2_{hll}(0,0,1)$ | $z^2_{hlh}(0,0,1)$ | $\alpha + M + 3$ | $v^2_{hh}(0,0,1)$ | $z^2_{hhl}(0,0,1)$ | $z^2_{hhh}(0,0,1)$ |
    | : | : | : | : | : | : | : | : |
    | $2\alpha + 1$ | $v^2_{ll}(2,0,0)$ | $z^2_{lll}(2,0,0)$ | $z^2_{llh}(2,0,0)$ | $3\alpha + 1$ | $v^2_{lh}(2,0,0)$ | $z^2_{lhl}(2,0,0)$ | $z^2_{lhh}(2,0,0)$ |
    | $2\alpha + 2$ | $z^2_{lll}(2,0,0)$ | $u^3_l(2,0,0)$ | $u^3_h(2,0,0)$ | $3\alpha + 2$ | | | |
    | $2\alpha + 3$ | $v^2_{hl}(2,0,0)$ | $z^2_{hll}(2,0,0)$ | $z^2_{hlh}(2,0,0)$ | $3\alpha + 3$ | $v^2_{hh}(2,0,0)$ | $z^2_{hhl}(2,0,0)$ | $z^2_{hhh}(2,0,0)$ |
    | : | : | : | : | : | : | : | : |
    | $2\alpha + M + 1$ | $v^2_{ll}(2,0,1)$ | $z^2_{lll}(2,0,1)$ | $z^2_{llh}(2,0,1)$ | $3\alpha + M + 1$ | $v^2_{lh}(2,0,1)$ | $z^2_{lhl}(2,0,1)$ | $z^2_{lhh}(2,0,1)$ |
    | $2\alpha + M + 2$ | $z^2_{lll}(2,0,1)$ | $u^3_l(2,0,1)$ | $u^3_h(2,0,1)$ | $3\alpha + M + 2$ | | | |
    | $2\alpha + M + 3$ | $v^2_{hl}(2,0,1)$ | $z^2_{hll}(2,0,1)$ | $z^2_{hlh}(2,0,1)$ | $3\alpha + M + 3$ | $v^2_{hh}(2,0,1)$ | $z^2_{hhl}(2,0,1)$ | $z^2_{hhh}(2,0,1)$ |
    | : | : | : | : | : | : | : | : |

    $\alpha = MN/8$. Only the input sample corresponding to the filter output is shown in the input column. The clock cycles needed to fill the delay registers, shift-registers and memory-blocks are not counted.

    four subbands ($z^2_{lhl}$, $z^2_{lhh}$) and ($z^2_{hhl}$, $z^2_{hhh}$) are obtained in time-multiplexed form. One component of $z^2_{lll}$ is obtained from subcell-1 (of PU-3) after every 4 cycles, and the successive components belong to a column. Successive columns of $z^2_{lll}$ are obtained from subcell-1 during alternate periods of $MN/8$ cycles. Subband $z^2_{lll}$ is further transformed to generate the DWT coefficients of level 3.

    Since subcell-1 of PU-3 receives the components of ($v^2_{ll}$, $v^2_{hl}$) or ($v^2_{lh}$, $v^2_{hh}$) during alternate cycles, it remains idle for one cycle after every input cycle of ($v^2_{ll}$, $v^2_{hl}$) or ($v^2_{lh}$, $v^2_{hh}$). By computing the temporal DWT of the second level alone, the hardware utilization of the subcell is only 50%. The DWT of $z^2_{lll}$ can be computed by the subcell during the idle cycles without any data overlapping, since the amount of computation required to process $z^2_{lll}$ is (3/8)-th of the amount of the temporal DWT of the second level. The DWT computation of $z^2_{lll}$ is therefore scheduled in the idle cycles of subcell-1 of PU-3. The processing of the intermediate coefficients $u^3_l$ and $u^3_h$ is time-multiplexed column-wise to take advantage of down-sampling. Similarly, the temporal DWT of the pairs of subbands ($v^3_{ll}$, $v^3_{hl}$) and ($v^3_{lh}$, $v^3_{hh}$) is time-multiplexed to take advantage of down-sampling along the temporal direction.

    The schedule for multiplexing the computation of the third-level DWT and the second-level temporal DWT in subcell-1 is given in Table VI. The input-output data-flow of subcell-1 of Fig. 10 is derived for a few cycles using the schedule of Table VI and is shown in Table VII. The registers of Fig. 10 provide the required delay in column-wise processing, while the shift-registers provide the necessary delay in row-wise processing of the intermediate coefficients $u^3_l$ and $u^3_h$. The extra shift-register provides one additional row-delay for time-multiplexing the processing of $u^3_l$ and $u^3_h$. Similarly, frame-buffer-2 provides the necessary frame-delay for the multiplexed computation of the temporal DWT of ($v^3_{ll}$, $v^3_{hl}$) and ($v^3_{lh}$, $v^3_{hh}$). The structure of frame-buffer-2 is similar to that of frame-buffer-1 (see Fig. 9), except that in this case each MB is of size $MN/32$ words. Each shift-register of PU-3 is of size $M/8$ words (equal to half of the frame height of $z^2_{lll}$) and is clocked by CLK4, which is 8 times slower than CLK1. Each register of the delay-path is clocked by a separate clock CLK3, which is 4 times slower than CLK1. PU-3 uses separate multiplexors for multiplexing the computations. Four MUX1s multiplex the computation of $u^3_l$ and $u^3_h$, while four MUX2s multiplex the row and column processing of $z^2_{lll}$. Similarly, four MUX3s multiplex the temporal DWT computation with the row and column processing of $z^2_{lll}$. Four MUX4s multiplex the computation of $z^2_{lll}$ with the temporal DWT computation of the second level. Each PU works in a separate pipeline stage, and the PUs compute the multilevel 3-D DWT concurrently. The proposed structure can compute the 3-level running DWT of a video stream of frame size ($M \times N$) and frame rate $R$ in $MNR/8$ cycles with an initial latency of $(11 + 2M + \delta_1 + \delta_2 + \delta_3 + \delta_4)$ cycles, where a delay of $(6 + 2M)$ cycles is introduced to fill the registers and shift-registers of PU-2, and a delay of $(\delta_1 + \delta_2 + \delta_3 + \delta_4)$ cycles is introduced to fill, respectively, the MBs of frame-buffer-1, the registers, the shift-registers and the MBs of frame-buffer-2 of PU-3.

    IV. IMPLEMENTATION OF SUBCELLS

    To have a reduced-hardware structure, subcell-1 and subcell-2 of the PUs can be implemented by multiple constant multiplication (MCM) methods using the CSD representation of the filter coefficients [19], or by a memory-based technique for multiplication using look-up-tables and adders [20]. Apart from that, the interrelation and symmetries between the coefficients of wavelet filter bases can be utilized to derive efficient structures of the subcells. We discuss here an optimal area-time efficient implementation of the subcells for the Daubechies wavelet filters for $K = 4$. The transfer functions of the low-pass and the high-pass filters corresponding to the Daubechies 4-tap wavelet transform can be expressed as [21]:

    $$H(z) = a + bz^{-1} + cz^{-2} + dz^{-3} \tag{1a}$$

    $$G(z) = d - cz^{-1} + bz^{-2} - az^{-3} \tag{1b}$$

    where

    $$a = \frac{1 + \sqrt{3}}{4\sqrt{2}}, \quad b = \frac{3 + \sqrt{3}}{4\sqrt{2}}, \quad c = \frac{3 - \sqrt{3}}{4\sqrt{2}}, \quad d = \frac{1 - \sqrt{3}}{4\sqrt{2}}.$$
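As a quick numerical sanity check of the coefficients in (1), the sketch below evaluates the four taps and tests the standard properties of an orthonormal 4-tap Daubechies pair. The variable names are illustrative only and are not taken from the paper.

```python
import math

SQRT3 = math.sqrt(3.0)
SCALE = 4.0 * math.sqrt(2.0)

# Low-pass taps (a, b, c, d) of H(z), as defined after (1b).
a = (1 + SQRT3) / SCALE
b = (3 + SQRT3) / SCALE
c = (3 - SQRT3) / SCALE
d = (1 - SQRT3) / SCALE

low_pass = [a, b, c, d]
# High-pass taps of G(z) in (1b): reversed low-pass taps with alternating signs.
high_pass = [d, -c, b, -a]

dc_gain = sum(low_pass)                                   # sqrt(2) for this filter
energy = sum(h * h for h in low_pass)                     # unit energy
high_pass_dc = sum(high_pass)                             # high-pass rejects DC
cross = sum(h * g for h, g in zip(low_pass, high_pass))   # filters are orthogonal
```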


    TABLE VIII
    COMPARISON OF HARDWARE- AND TIME-COMPLEXITIES OF THE PROPOSED STRUCTURE AND THE EXISTING STRUCTURES FOR 3-LEVEL 3-D DWT USING DAUBECHIES 4-TAP WAVELET FILTER. M: IMAGE HEIGHT, N: IMAGE WIDTH, R: FRAME-RATE

    | Structure | MULT | ADD | REG | Shift-register | Frame-buffer | Cycle period | ACT | Latency |
    |---|---|---|---|---|---|---|---|---|
    | Weeks et al. [8] (3DW-I) | 24 | 18 | 24 | $2MN + 2MNR$ | $\frac{1}{8}MNR$ | $T_M + T_A$ | $\frac{4}{7}MNRx$ | $O(MNR)$ |
    | Weeks et al. [8] (3DW-II) | 8 | 6 | 110 | 0 | $MNR$ | $T_M + T_A$ | $4MNRx$ | $O(MNR)$ |
    | Das et al. [13] | 24 | 18 | 8 | $2(2M + 1)N$ | $\frac{1}{8}MNR$ | $T_M + T_A$ | $\frac{4}{7}MNRx$ | $O(MNR)$ |
    | Dai et al. [15] | 96 | 72 | 32 | $4(N + 2)R$ | $\frac{1}{8}MNR$ | $T_M + T_A$ | $\frac{1}{7}MNRx$ | $O(MNR)$ |
    | Mohanty et al. [16] | $(219/16)Q$ | $(657/64)Q$ | $5.25N$ | $147MN/32$ | $5MN/32$ | $T_M + 2T_A$ | $MNR/Q$ | $O(MN)$ |
    | Proposed structure | 44 | 276 | 239 | $15M/8$ | $35MN/32$ | $\max(T_M, 2T_A)$ | $MNR/8$ | $O(MN)$ |

    Legend: MULT: multiplier; ADD: adder; REG: data/pipeline register. Shift-register and frame-buffer sizes are given in words and ACT in cycles. $x = 511/512$; $Q$: input block-size.

    Ignoring the fixed factor $4\sqrt{2}$ in the denominators of the filter coefficients, the low-pass and high-pass outputs corresponding to the input sequence $(x_0, x_1, x_2, x_3)$ may be expressed in the alternative form given by (2):

    $$u_l = (p_1 + 2p_2 + p_3) + \sqrt{3}\,(p_1 - p_3) \tag{2a}$$

    $$u_h = (q_1 + 2q_2 + q_3) + \sqrt{3}\,(q_3 - q_1) \tag{2b}$$

    where $p_1 = (x_0 + x_1)$; $p_2 = (x_1 + x_2)$; $p_3 = (x_2 + x_3)$; $q_1 = (x_0 - x_1)$; $q_2 = (x_2 - x_1)$ and $q_3 = (x_2 - x_3)$.

    Unlike subcell-1, subcell-2 performs down-sampled filter computation on two signals ($u_l(n)$ and $u_h(n)$) in time-multiplexed form. The low-pass and high-pass filter outputs corresponding to $u_l(n)$ and $u_h(n)$ in this case can be expressed in the alternative form:

    $$v_l = (2 + r)\,s(n) + (4 + r)\,t(n - 1) + (2 - r)\,s(n - 2) - r\,t(n - 3) \tag{3a}$$

    $$v_h = -r\,t(n) + (r - 2)\,s(n - 1) + (r + 4)\,t(n - 2) - (2 + r)\,s(n - 3) \tag{3b}$$

    where $r = \sqrt{3} - 1$, and $s(n)$ and $t(n)$, respectively, represent the outputs of the LC (see Fig. 5 of [16]) for inputs $X_1 = u_l(n)$ and $X_2 = u_h(n)$. $v_l$ represents $v_{ll}$ and $v_{hl}$ when $s(n)$ is equal to $u_l$ and $u_h$, respectively. Similarly, $v_h$ represents $v_{hh}$ and $v_{lh}$ when $t(n)$ is equal to $u_h$ and $u_l$, respectively.

    Equation (3) may further be expressed in the z-domain as:

    $$V_l(z) = S(z)\left(2 + r + (2 - r)z^{-2}\right) + T(z)\left((4 + r)z^{-1} - rz^{-3}\right) \tag{4a}$$

    $$V_h(z) = T(z)\left(-r + (r + 4)z^{-2}\right) + S(z)\left((r - 2)z^{-1} - (2 + r)z^{-3}\right) \tag{4b}$$

    where $S(z)$ and $T(z)$ represent the z-transforms of $s(n)$ and $t(n)$, respectively.
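The coefficients in (3) are the taps of (1) rewritten in terms of r, with the common factor dropped. This can be checked numerically with a short sketch. The helper `subcell2_outputs` is a hypothetical name, and the high-pass taps are taken with the signs implied by (1b).

```python
import math

SQRT3 = math.sqrt(3.0)
r = SQRT3 - 1.0

# Taps of H(z) and G(z) from (1), with the common factor 1/(4*sqrt(2)) dropped.
a, b, c, d = 1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3
low_direct = [a, b, c, d]
high_direct = [d, -c, b, -a]

# The same taps written in terms of r, in the order of delays 0, 1, 2, 3.
low_in_r = [2 + r, 4 + r, 2 - r, -r]
high_in_r = [-r, r - 2, r + 4, -(2 + r)]


def subcell2_outputs(s, t, n):
    """Evaluate v_l and v_h of (3) at index n for sample lists s and t."""
    v_l = (low_in_r[0] * s[n] + low_in_r[1] * t[n - 1]
           + low_in_r[2] * s[n - 2] + low_in_r[3] * t[n - 3])
    v_h = (high_in_r[0] * t[n] + high_in_r[1] * s[n - 1]
           + high_in_r[2] * t[n - 2] + high_in_r[3] * s[n - 3])
    return v_l, v_h


taps_match = all(
    math.isclose(x, y, abs_tol=1e-12)
    for x, y in zip(low_direct + high_direct, low_in_r + high_in_r)
)

# A constant input gives the DC gain (8 before scaling) on the low-pass
# branch and zero on the high-pass branch.
v_l_const, v_h_const = subcell2_outputs([1.0] * 4, [1.0] * 4, 3)
```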


    Each of the pairs of filter outputs given by (2) and (4) involves only two multiplications. Using (2) and (4), the subcells can, therefore, be implemented in fully-pipelined structures. The proposed structures of the subcells are shown in Fig. 11 and Fig. 12. The structure of subcell-1, as shown in Fig. 11, computes a pair of filter outputs during every cycle period according to (2). It consists of seven adder units (AUs) and two multipliers. Each AU performs a pair of additions. AU-1, AU-2 and AU-3 compute $(p_1, q_1)$, $(p_2, q_2)$ and $(p_3, q_3)$, while AU-4 and AU-5 compute $(2p_2 + p_3,\ 2q_2 + q_3)$ and $(p_1 - p_3,\ q_3 - q_1)$, respectively. AU-6 computes $(p_1 + 2p_2 + p_3,\ q_1 + 2q_2 + q_3)$. The entire computation of subcell-1 is performed in three pipeline stages. A pair of filter outputs is given out by AU-7 after a latency of 2 cycles, where the duration of a cycle period is $T = \max(T_M, 2T_A)$, and $T_M$ and $T_A$ are, respectively, the times required for one multiplication and one addition in the subcells. The structure of subcell-2 has two multipliers, three shifters and 10 adders. It yields a pair of filter outputs in every cycle period, after an initial latency of 4 cycles.
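The data flow through the adder units of subcell-1 can be mirrored in software. The sketch below is an illustration only: it follows the AU roles named in the text, but it does not model the pipeline registers or timing, and the function names `subcell1` and `direct_filter` are hypothetical. It compares the AU-based result with a direct evaluation of the 4-tap filters (both scaled by $4\sqrt{2}$).

```python
import math

SQRT3 = math.sqrt(3.0)


def subcell1(x0, x1, x2, x3):
    """Return (u_l, u_h) of (2) for one input block, following the AU roles."""
    # AU-1, AU-2, AU-3: pairwise sums p_i and differences q_i.
    p1, q1 = x0 + x1, x0 - x1
    p2, q2 = x1 + x2, x2 - x1
    p3, q3 = x2 + x3, x2 - x3
    # AU-4: (2*p2 + p3, 2*q2 + q3); doubling is a shift in hardware.
    s4_p, s4_q = 2 * p2 + p3, 2 * q2 + q3
    # AU-5: (p1 - p3, q3 - q1), the terms that get multiplied by sqrt(3).
    s5_p, s5_q = p1 - p3, q3 - q1
    # AU-6: (p1 + 2*p2 + p3, q1 + 2*q2 + q3).
    s6_p, s6_q = p1 + s4_p, q1 + s4_q
    # Two multipliers scale the AU-5 outputs; AU-7 forms the final sums.
    u_l = s6_p + SQRT3 * s5_p
    u_h = s6_q + SQRT3 * s5_q
    return u_l, u_h


def direct_filter(x0, x1, x2, x3):
    """Reference: taps of (1) applied directly, common factor dropped."""
    a, b, c, d = 1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3
    return (a * x0 + b * x1 + c * x2 + d * x3,
            d * x0 - c * x1 + b * x2 - a * x3)


block = (1.0, 2.0, 3.0, 4.0)
from_aus = subcell1(*block)
from_taps = direct_filter(*block)
```

The direct form needs eight multiplications per output pair, while the AU form needs two, at the cost of additional adders.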

    Fig. 11. Structure of subcell-1 for the Daubechies 4-tap wavelet filter.


    V. HARDWARE COMPLEXITY AND PERFORMANCE CONSIDERATION

    In this section we discuss the details of the hardware and time complexities of the proposed structure and compare those

    with the existing designs.

    A. Hardware Complexity

    The proposed structure is comprised of three PUs. PU-1 has four PEs and four subcells (subcell-1), where each PE is comprised of two pairs of subcell-1 and subcell-2. Each subcell-1 has 2 multipliers, 14 adders and 10 pipeline registers, while each subcell-2 requires 2 multipliers, 10 adders and 10 pipeline registers. Each PE requires 8 multipliers, 48 adders and 40 pipeline registers. PU-1, therefore, involves 40 multipliers, 248 adders and 208 pipeline registers. PU-2 is comprised of one subcell and one IDU. The IDU consists of two shift-registers of size $M/2$ words each, 8 registers, 4 MUXs and 2 DMUXs. PU-2, therefore, involves 2 multipliers, 14 adders, $(16 + M)$ registers, 4 MUXs and 2 DMUXs. PU-3 is comprised of one subcell-1, one frame-buffer-1, one frame-buffer-2, 7 shift-registers of size $M/8$ words, 3 registers, 16 2-to-1 line MUXs and one DMUX array. Frame-buffer-1 is comprised of 7 MBs of size $MN/8$ words each and four 2-to-1 line MUXs. Similarly, frame-buffer-2 is comprised of 7 MBs of size $MN/32$ words each and four 2-to-1 line MUXs. The DMUX array is comprised of 14 1-to-2 line DMUXs. The proposed structure, therefore, involves 44 multipliers, 276 adders, 239 data/pipeline registers, $15M/8$ shift-register words, $(35/32)MN$ frame-buffer words, 28 MUXs and 16 DMUXs.
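The multiplier, adder, memory and selector totals quoted above follow from the per-unit counts by simple bookkeeping. The sketch below reproduces that arithmetic. The function name `hardware_tally` and the dictionary keys are illustrative, and register counts are left out because they depend on details of the IDU and delay-paths that are only summarized in the text.

```python
from fractions import Fraction

# Per-subcell resources as stated in the text.
SUBCELL1 = {"mult": 2, "add": 14}
SUBCELL2 = {"mult": 2, "add": 10}


def hardware_tally(M, N):
    """Tally multipliers, adders, memory words and selectors for frame size M x N."""
    keys = ("mult", "add")
    # One PE holds two (subcell-1, subcell-2) pairs.
    pe = {k: 2 * (SUBCELL1[k] + SUBCELL2[k]) for k in keys}
    # PU-1: four PEs plus four extra subcell-1 units.
    pu1 = {k: 4 * pe[k] + 4 * SUBCELL1[k] for k in keys}
    # PU-2 and PU-3 each contain a single subcell-1.
    total = {k: pu1[k] + 2 * SUBCELL1[k] for k in keys}
    # Shift-registers: 2 of M/2 words (PU-2) and 7 of M/8 words (PU-3).
    total["shift_register_words"] = 2 * Fraction(M, 2) + 7 * Fraction(M, 8)
    # Frame-buffers: 7 MBs of MN/8 words and 7 MBs of MN/32 words.
    total["frame_buffer_words"] = 7 * Fraction(M * N, 8) + 7 * Fraction(M * N, 32)
    # MUXs: IDU (4), PU-3 (16), frame-buffer-1 (4), frame-buffer-2 (4).
    total["mux"] = 4 + 16 + 4 + 4
    # DMUXs: IDU (2) plus the DMUX array of PU-3 (14).
    total["dmux"] = 2 + 14
    return total


tally = hardware_tally(176, 144)
```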

    B. Time Complexity

    The proposed structure calculates the DWT coefficients of four samples of each frame in every cycle, and two frames are transformed in parallel. The average computation time (ACT) to calculate the 3-level DWT of a video stream of frame size ($M \times N$) and frame rate $R$ is $MNR/8$ cycles. The structure has an initial latency of $(23 + 5M + 7MN/4)$ cycles. Out of this, $(12 + 5M + 7MN/4)$ cycles of delay are introduced to fill the registers, shift-registers and MBs of the frame-buffers. The duration of one cycle period is $T = \max(T_M, 2T_A)$, where $T_M$ and $T_A$ are the times required to perform one multiplication and one addition in a subcell.
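For a concrete sense of scale, the two expressions above can be evaluated for a given frame size and frame rate. This is a minimal sketch with hypothetical function names; it assumes frame dimensions for which the divisions are exact.

```python
def act_cycles(M, N, R):
    """Average computation time in cycles for R frames of size M x N: MNR/8."""
    total = M * N * R
    assert total % 8 == 0, "sketch assumes MNR is divisible by 8"
    return total // 8


def latency_cycles(M, N):
    """Initial latency in cycles: 23 + 5M + 7MN/4."""
    assert (7 * M * N) % 4 == 0, "sketch assumes 7MN is divisible by 4"
    return 23 + 5 * M + (7 * M * N) // 4


act_176x144_15fps = act_cycles(176, 144, 15)
latency_176x144 = latency_cycles(176, 144)
```

For the 176 × 144 frame at 15 frames per second, the initial latency is slightly less than one second's worth of computation cycles.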


    C. Performance Comparison

    The hardware and time complexities of the proposed structure and the existing structures of [8], [13], [15], [16] are listed in Table VIII in terms of cycle period, ACT¹ and latency in clock cycles, register, shift-register and frame-buffer sizes in words, and the number of multipliers and adders, for comparison.

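    TABLE IX
    COMPARISON OF MEMORY AND TIME COMPLEXITY OF THE PROPOSED STRUCTURE AND STRUCTURES OF [15] AND [16] FOR DWT LEVEL J = 3

    | Structure | Frame-size | FPS | Memory (words) | ACT (cycles) |
    |---|---|---|---|---|
    | Dai et al. [15] | 176 × 144 | 15 | 56312 | 54203 |
    | Dai et al. [15] | 176 × 144 | 30 | 112592 | 108405 |
    | Dai et al. [15] | 176 × 144 | 60 | 225152 | 216810 |
    | Mohanty et al. [16] | 176 × 144 | 15 | 121140 | 23760 |
    | Mohanty et al. [16] | 176 × 144 | 30 | 121140 | 47520 |
    | Mohanty et al. [16] | 176 × 144 | 60 | 121140 | 95040 |
    | Proposed | 176 × 144 | 15 | 28289 | 47520 |
    | Proposed | 176 × 144 | 30 | 28289 | 95040 |
    | Proposed | 176 × 144 | 60 | 28289 | 190080 |
    | Dai et al. [15] | 640 × 480 | 15 | 604952 | 657000 |
    | Dai et al. [15] | 640 × 480 | 30 | 1609872 | 1314000 |
    | Dai et al. [15] | 640 × 480 | 60 | 2419712 | 2628000 |
    | Mohanty et al. [16] | 640 × 480 | 15 | 1461720 | 288000 |
    | Mohanty et al. [16] | 640 × 480 | 30 | 1461720 | 576000 |
    | Mohanty et al. [16] | 640 × 480 | 60 | 1461720 | 1152000 |
    | Proposed | 640 × 480 | 15 | 337439 | 576000 |
    | Proposed | 640 × 480 | 30 | 337439 | 1152000 |
    | Proposed | 640 × 480 | 60 | 337439 | 2304000 |

    Memory is given in words and ACT in cycles.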

    The structures of [8], [13], [15] are of folded type, which compute the multilevel 3-D DWT level-by-level, while the structure of [16] computes the multilevel 3-D DWT concurrently. As shown in Table VIII, the structure of [15] is the most efficient among the existing folded structures. Compared with [15], the proposed structure involves 2.18 times fewer multipliers, nearly 23/6 times more adders, $(32NR/15M)$ times less on-chip memory (sum of data/pipeline register and shift-register words) and $4R/35$ times less frame-buffer. Besides, it has $8x/7$ times less ACT than [15] (the ratio of $\frac{1}{7}MNRx$ to $MNR/8$, with $x = 511/512$). Compared with the structure of [16], the proposed one involves $0.311Q$ times fewer multipliers, $(26.88/Q)$ times more adders, nearly $2.45N$ times less on-chip memory and 7 times more frame-buffer. It involves $Q/8$ times more ACT than the structure of [16]. The proposed structure has a small cycle period compared to the existing structures. It is interesting to note that the on-chip memory of the proposed structure varies with $M$, while in the case of [15] it varies with $NR$ and in the case of [16] it varies with $MN$. This results in a significant saving in memory, since the frame-size and frame-rate of video applications are usually very large.

    ¹ ACT is the number of cycles required for the computation of all the J levels of 3-D DWT after the initial latency. The ACT of the structures of [13] and [15] is calculated as the sum of the ACTs of the individual levels, because they compute the 3-D DWT of different levels sequentially. In the case of the proposed structure and the structure of [16], ACT is calculated by dividing the total number of 3-D DWT coefficients by the throughput per cycle.

    We have estimated the memory complexity and ACT of the proposed structure and the existing structures [15], [16] for some practically used video frame-sizes and frame-rates. The memory complexity of a structure represents the sum of the data/pipeline registers and the memory words required by the shift-registers and the frame-buffer. We have assumed input-block size $Q = 16$ for [16] and estimated the memory complexity and ACT of [15], [16] and the proposed structure for 3-level DWT. The estimated values are listed in Table IX for comparison. It can be found from Table IX that the memory complexities of the proposed structure and that of [16] are independent of the frame-rate, while in the case of [15] the memory complexity increases proportionately with the frame-rate. Compared with [15], the proposed structure for frame-size 176 × 144 and frame-rates 15, 30 and 60, respectively, requires 1.99 times, 3.98 times and 7.96 times fewer memory words and involves 12.3% less ACT. It involves 4.28 times fewer memory words than [16] for the same frame-size and frame-rates, and involves 2 times more ACT than [16]. For frame-size 640 × 480 and frame-rates 15, 30 and 60, the proposed structure involves 1.79 times, 4.77 times and 7.17 times fewer memory words than [15] and calculates the 3-level DWT in 12.3% less time than [15]. Compared with [16], the proposed one involves 4.33 times fewer memory words.
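The memory figures of Table IX can be recomputed from the register, shift-register and frame-buffer expressions of Table VIII. The sketch below does this with exact rational arithmetic. The function names are hypothetical, and the expressions are the ones read from Table VIII. Note that for 640 × 480 at 30 frames per second the expression for [15] evaluates to 1209872 words (a ratio of about 3.59 to the proposed structure), which differs from the 1609872 listed in Table IX.

```python
from fractions import Fraction


def memory_proposed(M, N, R):
    """Registers + shift-register words + frame-buffer words (independent of R)."""
    return 239 + Fraction(15 * M, 8) + Fraction(35 * M * N, 32)


def memory_dai(M, N, R):
    """Structure of [15]: 32 + 4(N + 2)R + MNR/8 words."""
    return 32 + 4 * (N + 2) * R + Fraction(M * N * R, 8)


def memory_mohanty(M, N, R):
    """Structure of [16]: 5.25N + 147MN/32 + 5MN/32 words (independent of R)."""
    return Fraction(21 * N, 4) + Fraction(147 * M * N, 32) + Fraction(5 * M * N, 32)


qcif = (176, 144)
ratio_vs_dai_60 = float(memory_dai(*qcif, 60) / memory_proposed(*qcif, 60))
ratio_vs_mohanty = float(memory_mohanty(*qcif, 60) / memory_proposed(*qcif, 60))
```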

    D. Numerical Error Consideration

    To validate the proposed design, we have coded it in MATLAB 7.1 and VHDL for decomposition level 1, for floating-point and fixed-point implementations, respectively. For the fixed-point implementation, we have taken 8-bit pixel values and 12-bit precision for all the intermediate signals. We have used an 11-bit Baugh-Wooley multiplier for the RP and a 12-bit multiplier for the CP and TP. We have processed four successive frames of two video sequences, Foreman and Xylophone, to generate all the 8 subbands of 3-D DWT, and estimated the absolute errors of the fixed-point implementation as the difference between the MATLAB simulation and the test-bench results. The average and maximum errors obtained for all the subbands of the Foreman and Xylophone video sequences are shown in Table X. We find that the average error in the LLL subband is 0.49% of its average value in the case of Foreman and 0.87% in the case of Xylophone.


    TABLE X
    NUMERICAL ERROR OF 1-LEVEL 3-D DWT

    | Subband | Foreman: Avg. Error | Foreman: Max. Error | Xylophone: Avg. Error | Xylophone: Max. Error |
    |---|---|---|---|---|
    | LLL | 1.9797 | 3.0987 | 1.8376 | 2.7030 |
    | HLL | 2.2756 | 3.6129 | 1.7068 | 2.8422 |
    | LLH | 0.6019 | 1.2487 | 0.7493 | 1.5044 |
    | HLH | 0.6468 | 1.1638 | 0.6285 | 1.4232 |
    | LHL | 1.3203 | 2.426 | 0.8831 | 2.8559 |
    | HHL | 1.138 | 1.8319 | 0.9561 | 1.6683 |
    | LHH | 0.4495 | 1.1286 | 0.4783 | 1.0656 |
    | HHH | 0.4866 | 0.9639 | 0.5528 | 2.0059 |

    E. Synthesis Result

    We have coded the proposed design in VHDL for frame-sizes 176 × 144 and 640 × 480 and frame rates 15, 30 and 60, and synthesized it using Xilinx ISE 12.1i tools along with the best of the existing designs [15] and [16]. We have considered the Daub-4 wavelet filter for all the designs and coded the structures for 3-level DWT. We have used single-port block RAM (BRAM) for implementing the temporal-buffer of all the designs. The frame-buffer of [15] and the input-buffer of [16] are also implemented using single-port BRAM. We have implemented all the registers, shift-registers and transposition-buffers using delay-type flip-flops. All the designs are synthesized for the FPGA device 6VLX760FF1760-2, and the results obtained from the synthesis report are listed in Table XI. We have estimated the slice-delay-product (SDP) to measure the area-time complexity of the designs on the FPGA platform. SDP is defined as the product of the number of slices required by each design and the computation time (CT), where CT = (number of cycles required for computation) / (maximum usable frequency (MUF)).

    The proposed design has the lowest clock period and involves less memory (in terms of BRAMs), as expected from the theoretical estimates in Table VIII and Table IX. Although the proposed design involves less than half of the multipliers of [15], it involves 25% more slices than [15] due to its adder complexity, pipeline registers and data selectors (MUX, DMUX). As shown in Table XI, the hardware complexity of the proposed structure and the structure of [16] is independent of the frame-rate, while in the case of [15] the memory complexity increases almost proportionately with the frame-rate. The proposed structure for frame-size 176 × 144 and frame-rates 15, 30 and 60, respectively, involves 2.56 times, 5.12 times and 9.6 times fewer BRAMs than [15] and offers 2 times higher throughput rate than [15]. Compared with the structure of [16], the proposed one


    TABLE XI
    COMPARISON OF SYNTHESIS RESULTS OF THE PROPOSED STRUCTURE AND STRUCTURES OF [15], [16] FOR FPGA DEVICE 6VLX760FF1760-2

    | Structure | Frame-size | FPS | Slices | BR | MUF (MHz) | ACT (ms) | SDP (s) |
    |---|---|---|---|---|---|---|---|
    | Dai [15] | 176 × 144 | 15 | 10751 | 64 | 50.077 | 1.08 | 11.61 |
    | Dai [15] | 176 × 144 | 30 | 10467 | 128 | 52.578 | 2.06 | 21.56 |
    | Dai [15] | 176 × 144 | 60 | 10607 | 240 | 50.121 | 4.32 | 45.82 |
    | Mohanty [16] | 176 × 144 | 15 | 28581 | 48 | 40.21 | 0.59 | 16.862 |
    | Mohanty [16] | 176 × 144 | 30 | 28581 | 48 | 40.21 | 1.18 | 33.725 |
    | Mohanty [16] | 176 × 144 | 60 | 28581 | 48 | 40.21 | 2.36 | 67.45 |
    | Proposed | 176 × 144 | 15 | 13495 | 25 | 88.096 | 0.539 | 7.27 |
    | Proposed | 176 × 144 | 30 | 13495 | 25 | 88.096 | 1.07 | 14.54 |
    | Proposed | 176 × 144 | 60 | 13495 | 25 | 88.096 | 2.15 | 29.09 |
    | Dai [15] | 640 × 480 | 15 | 14035 | 544 | 50.036 | 13.13 | 184.27 |
    | Proposed | 640 × 480 | 15 | 13662 | 384 | 88.096 | 6.53 | 89.21 |
    | Proposed | 640 × 480 | 30 | 13662 | 384 | 88.096 | 13.07 | 178.42 |
    | Proposed | 640 × 480 | 60 | 13662 | 384 | 88.096 | 26.15 | 356.85 |

    Legend: BR: block RAM; SDP: slice-delay-product; FPS: frames per second; MUF: maximum usable clock frequency. SDP is defined as the product of the number of slices required by each design and the computation time (CT), where CT = (number of cycles required for computation) / (maximum usable frequency (MUF)).

    involves 2.11 times fewer slices and 1.92 times fewer BRAMs than that of [16], and offers nearly the same throughput rate for the same frame-size and frame-rates. The proposed one has 1.59 times less SDP than the structure of [15] and 2.3 times less SDP than that of [16]. For frame-size 640 × 480 and frame-rate 15, the proposed structure involves 1.41 times fewer BRAMs and 2.06 times less SDP than that of [15].
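
    The slice-delay product defined in the legend of Table XI, and the ratios quoted above, can be re-derived from the tabulated figures. The following is a minimal sketch, assuming the 15 fps rows for frame-size 176 × 144; the names DESIGNS, sdp and ratio are illustrative and do not come from the paper.

```python
# Synthesis figures transcribed from Table XI (frame-size 176 x 144, 15 fps).
# Each entry: (slices, BRAMs, computation time in ms).
DESIGNS = {
    "Dai [15]": (10751, 64, 1.08),
    "Mohanty [16]": (28581, 48, 0.59),
    "Proposed": (13495, 25, 0.539),
}


def sdp(slices, time_ms):
    """Slice-delay product in slice-seconds: slices x computation time."""
    return slices * time_ms * 1e-3


def ratio(reference, proposed):
    """How many times larger the reference figure is than the proposed one."""
    return reference / proposed


p_slices, p_brams, p_ms = DESIGNS["Proposed"]
for name in ("Dai [15]", "Mohanty [16]"):
    slices, brams, ms = DESIGNS[name]
    print(name)
    print(f"  slice ratio: {ratio(slices, p_slices):.2f}")
    print(f"  BRAM ratio:  {ratio(brams, p_brams):.2f}")
    print(f"  SDP ratio:   {ratio(sdp(slices, ms), sdp(p_slices, p_ms)):.2f}")
```

    Because the tabulated times are rounded, the recomputed ratios agree with the quoted ones only to about two significant digits.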

    F. Comparison of Power Consumption

    We have estimated the power consumption of the proposed design and the designs of [15] and [16] using the Xilinx XPower tools by implementing them on the FPGA device 6VLX760FF1760-2. The XPower Analyzer report for a 40 MHz clock frequency is listed in Table XII for comparison. As shown in Table XII, for frame-size 176 × 144 and frame-rates 15, 30 and 60, the proposed structure dissipates, respectively, 9.4%, 31.68% and 74.7% less dynamic power than the structure of [15]. Compared with [16], the proposed one dissipates 71.9% less dynamic power. It dissipates 44.6% less dynamic power than that of [15] for frame-size 640 × 480 and frame-rate 15. This is mainly because the proposed structure uses fewer BRAMs than the others.
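
    A relative saving of this kind is the difference in dynamic power divided by the reference design's dynamic power. The sketch below applies that to three of the Table XII comparisons; the helper name percent_saving and the choice of rows are assumptions made for illustration only.

```python
def percent_saving(reference_w, proposed_w):
    """Percentage by which proposed_w is lower than reference_w."""
    return 100.0 * (reference_w - proposed_w) / reference_w


# (label, reference dynamic power in W, proposed dynamic power in W),
# transcribed from Table XII.
COMPARISONS = [
    ("[15], 176 x 144, 15 fps", 0.202, 0.183),
    ("[16], 176 x 144", 0.652, 0.183),
    ("[15], 640 x 480, 15 fps", 0.776, 0.430),
]

for label, reference_w, proposed_w in COMPARISONS:
    saving = percent_saving(reference_w, proposed_w)
    print(f"vs {label}: {saving:.1f}% less dynamic power")
```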

    VI. CONCLUSIONS

    We have shown that using overlapped grouping of frames and by appropriate scheduling of computation of different levels,

    the memory requirement of multilevel 3-D DWT structures could be drastically reduced. Based on this observation, we have

    TABLE XII

    COMPARISON OF POWER CONSUMPTION

    Structure      Frame-size   FPS          Static power (W)   Dynamic power (W)
    Dai [15]       176 × 144    15           3.213              0.202
    Dai [15]       176 × 144    30           3.214              0.247
    Dai [15]       176 × 144    60           3.217              0.334
    Mohanty [16]   176 × 144    15, 30, 60   3.229              0.652
    Proposed       176 × 144    15, 30, 60   3.212              0.183
    Dai [15]       640 × 480    15           3.234              0.776
    Proposed       640 × 480    15, 30, 60   3.221              0.430

    suggested a computing scheme to reduce the memory complexity of 3-D DWT implementation. A notable feature of the proposed structure is that it does not involve any line-buffer or frame-buffer for level-1.

    Compared with the best of the existing folded structures [15], the proposed design involves significantly less on-chip memory, a smaller frame-buffer and less ACT. Compared with [16], which could be taken as the best among all the existing designs, the proposed one involves 0.311Q times fewer multipliers, (26.88/Q) times more adders and Q/8 times more ACT, where Q is the input block-size. However, it involves 2.45N times less on-chip memory than [16]. For frame-size 176 × 144 and frame-rate 60, the proposed structure involves 7.96 times less memory and 12.3% less ACT than [15]. Compared with [16] for input block-size Q = 16, the proposed structure involves 4.28 times less memory and double the ACT for the same frame-size and frame-rates. The synthesis result for the FPGA device 6VLX760FF1760-2 shows that the proposed structure, for frame-size 176 × 144 and frame-rate 60, involves 9.6 times fewer BRAMs than that of [15] and offers 2 times higher throughput rate than [15]. Compared with the structure of [16], the proposed one involves 1.9 times fewer BRAMs and offers nearly the same throughput rate. The proposed structure has significantly less SDP than the existing structures. Owing to its lower memory complexity, the proposed structure consumes less dynamic power than the best of the existing structures. It can compute multilevel running 3-D DWT on an unbounded number of GOFs and involves much less memory and resource than the existing designs. It could, therefore, be used for high-performance video processing applications.
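
    The block-size-dependent factors quoted above for the comparison with [16] can be evaluated for a given Q. This is a small sketch that only restates those three expressions; the function name tradeoff_vs_16 and the dictionary keys are hypothetical.

```python
def tradeoff_vs_16(q):
    """Evaluate the quoted factors for input block-size q."""
    return {
        "times_fewer_multipliers": 0.311 * q,  # 0.311Q
        "times_more_adders": 26.88 / q,        # 26.88/Q
        "times_more_act": q / 8,               # Q/8
    }


# Q = 16 is the block-size used in the comparison above.
for name, value in tradeoff_vs_16(16).items():
    print(f"{name}: {value:.3f}")
```

    For Q = 16 the last factor evaluates to 2, which is the "double the ACT" figure stated in the text.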

    REFERENCES

    [1] G. Minami, Z. Xiong, A. Wang, and S. Mehrotra, "3-D wavelet coding of video with arbitrary regions of support," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 1063-1068, Sept. 2001.

    [2] A. M. Baskurt, H. Benoit-Cattin, and C. Odet, "3D medical image coding method using a separable 3D wavelet transform," SPIE Proceedings on Medical Imaging 1995: Image Display, vol. 2431, pp. 173-183, Apr. 1995.

    [3] V. Sanchez, P. Nasiopoulos, and R. Abugharbieh, "Lossless compression of 4D medical images using H.264/AVC," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), vol. II, May 2006, pp. 1116-1119.

    [4] J. Wei, P. Saipetch, R. K. Panwar, D. Chen, and B. K. Ho, "Volumetric image compression by 3D discrete wavelet transform (DWT)," SPIE Proceedings on Medical Imaging 1995: Image Display, vol. 2431, pp. 184-194, Apr. 1995.

    [5] L. Anqiang and L. Jing, "A novel scheme for robust video watermark in the 3D-DWT domain," in International Symposium on Data, Privacy, and E-Commerce (ISDPE 2007), Nov. 2007, pp. 514-516.

    [6] J.-R. Ohm, M. van der Schaar, and J. W. Woods, "Interframe wavelet coding: Motion picture representation for universal scalability," J. Signal Process. Image Commun., vol. 19, no. 9, pp. 877-908, Oct. 2004.

    [7] P.-C. Wu and L.-G. Chen, "An efficient architecture for two-dimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 4, pp. 536-545, Apr. 2001.

    [8] M. Weeks and M. A. Bayoumi, "Three-dimensional discrete wavelet transform architecture," IEEE Trans. Signal Process., vol. 50, no. 8, pp. 2050-2063, Aug. 2002.

    [9] M. Weeks and M. A. Bayoumi, "Wavelet transform: architecture, design and performance issues," Journal of VLSI Signal Processing, vol. 35, no. 2, pp. 155-178, Sept. 2003.

    [10] W. Badawy, M. Talley, G. Zhang, M. Weeks, and M. A. Bayoumi, "Low power very large scale integration prototype for three-dimensional discrete wavelet transform processor with medical application," Journal of Electronic Imaging, vol. 12, no. 2, pp. 270-277, Apr. 2003.

    [11] B. Das, A. Hazra, and S. Banerjee, "An efficient architecture for 3-D discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 2, pp. 286-296, Feb. 2010.

    [12] B. Das and S. Banerjee, "Low power architecture of running 3-D wavelet transform for medical imaging application," in Proc. Eng. Med. Biol. Soc./Biomed. Eng. Soc. Conf., vol. 2, 2002, pp. 1062-1063.

    [13] B. Das and S. Banerjee, "A memory efficient 3D DWT architecture," in Proceedings of the 16th International Conference on VLSI Design, IEEE Computer Society, Aug. 2003.

    [14] B. Das and S. Banerjee, "Data-folded architecture for running 3-D DWT using 4-tap Daubechies filters," IEE Proc. Circuits Devices Syst., vol. 152, no. 1, pp. 17-24, Feb. 2005.

    [15] Q. Dai, X. Chen, and C. Lin, "A novel VLSI architecture for multidimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 8, pp. 1105-1110, Aug. 2004.

    [16] B. K. Mohanty and P. K. Meher, "Parallel and pipeline architecture for high-throughput computation of multilevel 3-D DWT," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 9, pp. 1200-1209, Sept. 2010.

    [17] Z. Taghavi and S. Kasaei, "A memory efficient algorithm for multidimensional wavelet transform based on lifting," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 6, 2003, pp. 401-404.

    [18] P. K. Meher, B. K. Mohanty, and J. C. Patra, "Hardware-efficient systolic-like modular design for two-dimensional discrete wavelet transform," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 55, no. 2, pp. 151-154, Feb. 2008.

    [19] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Trans. Circuits Syst. II: Analog and Digital Signal Processing, vol. 43, no. 10, pp. 677-688, Oct. 1996.

    [20] H.-R. Lee, C.-W. Jen, and C.-M. Liu, "On the design automation of the memory-based VLSI architectures for FIR filters," IEEE Trans. Consumer Electronics, vol. 39, no. 3, pp. 619-629, Aug. 1993.

    [21] I. Daubechies and W. Sweldens, "Orthonormal bases of compactly supported wavelets," Comm. Pure Appl. Math., vol. 41, pp. 909-996, 1988.

    Basant K. Mohanty (M'06) received the B.Sc. and M.Sc. degrees (both with first-class honors) in Physics from Sambalpur University, Orissa, in 1987 and 1989, respectively. He received the Ph.D. degree in the field of VLSI for Digital Signal Processing from Berhampur University, Orissa, in 2000.

    In 1992 he was selected by the OPSC (Orissa Public Service Commission) and joined as a faculty member in the Department of Physics, SKCG College Paralakhemundi, Orissa. In 2001 he joined as a Lecturer in the EEE Department, BITS Pilani, Rajasthan. Then he joined as an Assistant Professor in the Department of ECE, Mody Institute of Education Research (Deemed University), Rajasthan. In 2003 he joined Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, where he became Associate Professor in 2005 and full Professor in 2007. His research interests include the design and implementation of re-configurable VLSI architectures for resource-constrained digital signal processing applications. He has published nearly 30 technical papers. Currently he serves as a reviewer for IEEE Transactions on Circuits and Systems-II: Express Briefs, IEEE Transactions on Circuits and Systems for Video Technology, and IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

    Dr. Mohanty is a life-time member of the Institution of Electronics and Telecommunication Engineers, New Delhi, India.

    Pramod Kumar Meher (SM'03) received the B.Sc. and M.Sc. degrees in Physics and the Ph.D. degree in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively.

    He has a wide scientific and technical background covering Physics, Electronics, and Computer Engineering. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment he was a visiting faculty with the School of Computer Engineering, Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in Electronics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lecturer in Physics with various Government Colleges in India from 1981 to 1993. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 170 technical papers to various reputed journals and conference proceedings.

    Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India, and a Fellow of the Institution of Engineering and Technology, UK. He is serving as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society, and as Associate Editor for the IEEE Transactions on Circuits and Systems-II: Express Briefs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, and Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.