1
Case Study: DCT
賴瑞欣(Larry) Slides are prepared by Prof. Shao-Yi Chien,
Information Theory and Coding Technique
DSP in VLSI Design Shao-Yi Chien 2
Outline
DCT Algorithm
1-D DCT : Row-Column Method
Direct 2-D Architecture
DSP in VLSI Design Shao-Yi Chien 9
DCT-Based Coding
Optimal transform is KLT, but
KLT is image dependent
High computing complexity
DCT-based coding,
Image independent, unlike KLT for highly
correlated image data
DCT compaction efficiency is close to KLT
Computations of DCT can be performed with fast
algorithms which can be easily implemented on
parallel architectures.
DSP in VLSI Design Shao-Yi Chien 10
Roles in Video Encoder
DCT Q
IQ
IDCT
Frame(s)
Memory
Motion
Estimation
Entropy
Coding
Rate Control
Compressed
Data T
Motion Vector
Q-Step
Original
SequenceM
u
l
t
i
p
l
e
x
e
r
Bitstream
Motion
Coding
Compressed
Data M
DSP in VLSI Design Shao-Yi Chien 11
Roles in Video Decoder
IQ IDCT
Frame(s)
Memory
Entropy
Decoding
Compressed
Data T
Motion Vector
Q-StepD
e
m
u
l
t
i
p
l
e
x
e
r
Bitstream
Motion
Decoding
Reconstructed
Sequence
Compressed
Data M
DSP in VLSI Design Shao-Yi Chien 12
Hardware/Software Trade-Off
For low-end applications, using software
approach is powerful enough
For high-end applications, must use hardware
approach
For middle-end applications, either software or
hardware approach is possible, depending on
the target design platform
DSP in VLSI Design Shao-Yi Chien 13
Basic Transformation Forms
2-D forward transforms
2-D inverse transforms
1
0
1
0),,,(),(),(
N
x
N
yvuyxfyxgvuT
1
0
1
0),,,(),(),(
N
u
N
vvuyxivuTyxg
DSP in VLSI Design Shao-Yi Chien 15
1D 8-Point DCT Basis Functions
k = 0
k = 1
k = 5
k = 3
k = 4 k = 7
k = 2
k = 6
DSP in VLSI Design Shao-Yi Chien 16
2-D Discrete Cosine Transform
Forward DCT
Backward DCT
otherwise , 1
0,for , 2/1)(),( where
nmnumu
Block size = N x N
X(m,n) : DCT coefficients
x(i,j) : image samples in block
DSP in VLSI Design Shao-Yi Chien 17
2-D 8x8 DCT Basis Functions
The DCT represents each block of image
samples as a weighted sum of 2-D cosine
functions (basis functions)
DSP in VLSI Design Shao-Yi Chien 18
DCT Coefficients
dc Low
frequencies
Medium
frequencies
High
frequencies
dc Vertical edges
High
frequencies
DSP in VLSI Design Shao-Yi Chien 20
An Example of DCT
DC: F(0,0)= (1/8) SS f(m,n)
related to the average value of the block
52 55 61 66 71 61 64 7363 59 66 90 109 585 69 7262 59 68 113 144 104 66 7363 58 71 122 154 106 70 6967 61 68 104 126 88 68 7079 65 60 70 77 68 58 7585 71 64 59 55 61 65 8387 79 69 58 65 76 78 94
609 -29 -62 25 55 -20 -1 37 -21 -62 9 11 -7 -6 6
-46 8 77 -25 -30 10 7 -5-50 13 35 -15 -9 6 0 311 -8 -13 -2 -1 1 -4 1-10 1 3 -3 -1 0 2 -1-4 -1 2 -1 2 -3 1 -2-1 -1 -1 -2 -1 -1 0 -1
DCT IDCT
Pixel Domain
Frequency Domain
DC
AC
DSP in VLSI Design Shao-Yi Chien 21
DCT Algorithm Classification
Direct 2-D Method The 2-D transforms, DCT and IDCT, to be applied directly on
the N x N input data items
Row-Column Method The 2-D transform can be carried out with two passes of 1-D
transforms
The separability property of 2-D DCT/IDCT allows the transform to be applied on one dimension (row) then on the other (column)
Requires 2N instances of N-point 1-D DCT to implement an N x N 2-D DCT
DSP in VLSI Design Shao-Yi Chien 22
Row-Column Decomposition
1D DCT
UNIT
1D DCT
UNIT
TRANSPOSE
MEMORY
X Z
(Y)
Y=AX Z=YAT
Separable,
row-column decomposition
TAXAZ 0for 1)( and 2
1)0(
1,...,1,0,
]4
)12(2cos[)(
2),(
kkcc
Nnk
N
knkc
Nnka
DSP in VLSI Design Shao-Yi Chien 23
Straightforward Approach
Carry out the computation as full matrix-vector
multiplications
1-D transform requires N*N multiplications and N* (N-1)
additions
2-D transform requires N*N*N*N multiplications and N * N
*(N * N-1) additions
Although requiring the most number of operations, this
method is very regular
Most suitable for vector processors or deeply pipelined
architectures for high PE utilization
1-D fast algorithms => O(N*logN)
2-D fast algorithms => O(N*N*logN)
DSP in VLSI Design Shao-Yi Chien 35
Row-Column Method(1/2)
Basic concept:
2-D DCT = 1-D DCT(row) → 1-D DCT(column)
1-D DCT 1-D DCTMEMORY
DCT for row DCT for column
Input (X) Output (Z)
(Y)
DSP in VLSI Design Shao-Yi Chien 36
1-D DCTTRANSPOSE
MEMORY
Row-Column Method(2/2)
Use transpose memory
Input (X) Output
(Y)
(Z)
Z = AXAT
(A)
37
Row-Column
Method Example
A. Madisetti and A. N. Willson Jr., “A 100 MHz 2-D
8x8 DCT/IDCT processor for HDTV applications,”
IEEE Transactions on Circuits and Systems for Video
Technology, vol. 5, no. 2, April 1995
DSP in VLSI Design Shao-Yi Chien 38
Matrix Decomposition
Reduce an 8x8 matrix computation to two
4x4 matrix computation
DCT
IDCT
DSP in VLSI Design Shao-Yi Chien 40
Implementation(2/9)
Use symmetrical property of DCT coefficients:
DCT:
IDCT:
DSP in VLSI Design Shao-Yi Chien 42
Overall Architecture (2/2)
BDEG MATRIX VECTOR MULTIPLIER
TRANSPOSEMEMORY
IDRUDRU
ACF MATRIX VECTOR MULTIPLIER
DSP in VLSI Design Shao-Yi Chien 43
X
Y
MUX
MUX LIFO
Data Reorder Unit (1/3)
DRU architecture
add sub even odd
DCT IDCT
DSP in VLSI Design Shao-Yi Chien 44
Y7Y6Y5Y4
X3X2X1X0Y3Y2Y1Y0 Y0Y1Y2Y3
X3X2X1X0
Y5Y4
X1X0Y3Y2Y1Y0 Y2Y3
X1X0Y1Y0
Y6Y5Y4
X2X1X0Y3Y2Y1Y0 Y1Y2Y3
X2X1X0Y0
Y0
Y0
Y2Y1Y0
Y2Y1Y0
Y1Y0
Y1Y0
Y3Y2Y1Y0
Y3Y2Y1Y0
Y4
X0Y3Y2Y1Y0 Y3
X0Y2Y1Y0
X
Y
MUX
MUX LIFO
Data Reorder Unit (2/3)
DRU_1 Data flow:
DSP in VLSI Design Shao-Yi Chien 45
Data Reorder Unit (3/3)
DRU_2 Data flow:
add sub even odd
Y0Y1Y2Y3
Y7Y6Y5Y4
Y0+ Y7
Y1+ Y6
Y2+ Y5
Y3+ Y4
Y0 - Y7
Y1 - Y6
Y2 - Y5
Y3 - Y4
Y0
Y6
Y2
Y4
Y7
Y1
Y5
Y3
DSP in VLSI Design Shao-Yi Chien 48
Matrix multiplier (3/4)
Hardwired multipliers
Use Sign Digit Code(SDC) number system
DSP in VLSI Design Shao-Yi Chien 50
Transpose Memory (1/2)
Pin-pong mode
Using RAMs ADDR
0 1 2 3... 7 8 9 10 11...15 16 17...62 63
0 8 16 24...56 1 9 17 25...57 2 10...55 63
0 1 2 3... 7 8 9 10 11...15 16 17...62 63
...
TIME
DSP in VLSI Design Shao-Yi Chien 51
Transpose Memory (2/2)
Using registers
(0,7) (0,6) (0,0)
(6,7)
(7,7) (7,6)
(6,6)
(7,0)
(6,0)
IN
. . .
. . .
. . .
. . .
. . .
. . .
MUX OUT
DSP in VLSI Design Shao-Yi Chien 52
Scheduling
i: block index
R: row
C: column
X
XiR0
XiR7
Y
Y
YiC7YiC0
Z
ZiC7ZiC0
YiR0
YiR7
DSP in VLSI Design Shao-Yi Chien 53
Scheduling
DRU
ACF/
BDEG
IDRU
X0R0
X0R0 >
Y0R0
Y0R0
X0R1
X0R1 >
Y0R1
Y0R1
X0R2
X0R2 >
Y0R2
Y0R2
X0R3
X0R3 >
Y0R3
Y0R3
X0R4
X0R4 >
Y0R4
Y0R4
X0R5
X0R5 >
Y0R5
Y0R5
X0R6
X0R6 >
Y0R6
8 cycles
DSP in VLSI Design Shao-Yi Chien 54
Scheduling
DRU
ACF/
BDEG
IDRUY0R5
X0R6
X0R6 >
Y0R6
Y0R6
X0R7
X0R7 >
Y0R7
Y0R7
8 cycles
X1R0
X1R0 >
Y1R0
Y1R0
X1R1
X1R1 >
Y1R1
Y1R1
X1R2
X1R2 >
Y1R2
Y1R2
X1R3
X1R3 >
Y1R3
Y1R3
X1R4
X1R4 >
Y1R4
Y0C0
Y0C0 >
Z0C0
Z0C0
Y0C1
Y0C1 >
Z0C1
Z0C1
Y0C2
Y0C2 >
Z0C2
Z0C2
Y0C3
Y0C3 >
Z0C3
Z0C3
Y0C4
100% hardware utilization!
DSP in VLSI Design Shao-Yi Chien 56
Direct 2-D DCT Method
Computing the transform directly from the N x
N input numbers
Derive fast DCT algorithms from the signal flow
graph (like FFT)
Based on 1-D DCT
Larger flow graph
Global routing
More temporal storage
Larger datapath