Ch5 Dictionary Coding

  • 8/10/2019 Ch5 Dictionary Coding

    1/56

    Chapter 5 DICTIONARY CODING

    Dictionary Coding

    Dictionary coding is different from Huffman coding and arithmetic coding.

    Both Huffman and arithmetic coding techniques are based on a statistical model (e.g., occurrence probabilities).

    In dictionary-based data compression techniques, a symbol or a string of symbols generated from a source alphabet is represented by an index to a dictionary constructed from the source alphabet.

    Dictionary Coding

    A dictionary is a list of symbols and strings of symbols.

    There are many examples of this in our daily lives.

    the string September vs. 9,

    a social security number vs. a person in the U.S.

    Dictionary coding is widely used in text coding.

    Dictionary Coding

    Consider English text coding. The source alphabet includes 26 English letters in both lower and upper cases, numbers, various punctuation marks, and the space bar.

    Huffman or arithmetic coding treats each symbol based on its occurrence probability. That is, the source is modeled as a memoryless source.

    It is well known, however, that this is not true in many applications.

    In text coding, structure or context plays a significant role. It is very likely that the letter u appears after the letter q.

    Likewise, it is likely that the word "concerned" will appear after "As far as the weather is".

    Dictionary Coding

    The strategy of dictionary coding is to build a dictionary that contains frequently occurring symbols and strings of symbols.

    When a symbol or a string is encountered and it is contained in the dictionary, it is encoded with an index to the dictionary.

    Otherwise, if not in the dictionary, the symbol or the string of symbols is encoded in a less efficient manner.

    Dictionary Coding

    All the dictionary schemes have an equivalent statistical scheme, which achieves exactly the same compression.

    A statistical scheme using high-order context models may outperform dictionary coding (with high computational complexity).

    But dictionary coding is now superior for fast speed and economy of memory. In the future, however, statistical schemes may be preferred [bell 1990].

    Formulation of Dictionary Coding

    Define dictionary coding in a precise manner [bell 1990].

    We denote a source alphabet by S.

    A dictionary consisting of two elements is defined as D = (P, C), where P is a finite set of phrases generated from the S, and C is a coding function mapping P onto a set of codewords.

    Formulation of Dictionary Coding

    The set P is said to be complete if any input string can be represented by a series of phrases chosen from the P.

    The coding function C is said to obey the prefix property if there is no codeword that is a prefix of any other codeword.

    For practical usage, i.e., for reversible compression of any input text, the phrase set P must be complete and the coding function C must satisfy the prefix property.
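The prefix property can be checked mechanically. The sketch below (names are mine, not from the text) tests a set of codewords by sorting them, since a prefix sorts immediately before any codeword that extends it:

```python
# A small sketch: check the prefix property of a set of codewords.

def has_prefix_property(codewords):
    """True iff no codeword is a prefix of any other codeword."""
    words = sorted(codewords)        # a prefix sorts immediately before
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

print(has_prefix_property(['10', '011', '010']))   # True
print(has_prefix_property(['01', '010']))          # False: 01 prefixes 010
```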

    Categorization of Dictionary-

    Based Coding Techniques

    The heart of dictionary coding is the formulation of the dictionary.

    A successfully built dictionary results in data compression; the opposite case may lead to data expansion.

    According to the ways in which dictionaries are constructed, dictionary coding techniques can be classified as static or adaptive.

    Static Dictionary Coding

    A fixed dictionary:
      Produced before the coding process
      Used at both the transmitting and receiving ends

    It is possible when the knowledge about the source alphabet and the related strings of symbols, also known as phrases, is sufficient.

    Merit of the static approach: its simplicity.

    Its drawbacks lie in:
      Relatively lower coding efficiency
      Less flexibility compared with adaptive dictionary techniques

    Static Dictionary Coding

    An example of a static algorithm is digram coding, a simple and fast coding technique.

    The dictionary contains all source symbols and some frequently used pairs of symbols.

    In encoding, two symbols are checked at once to see if they are in the dictionary. If so, they are replaced by the index of the pair in the dictionary, and the next pair of symbols is encoded in the next step.

    Static Dictionary Coding

    If not, then the index of the first symbol is used to encode the first symbol. The second symbol is combined with the third symbol to form a new pair, which is encoded in the next step.

    Digram coding can be straightforwardly extended to n-grams. In the extension, the size of the dictionary increases and so does its coding efficiency.
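A minimal sketch of digram coding, using a small hypothetical dictionary (the symbols and pairs below are illustrative, not from the text): the encoder tries a pair first and falls back to a single symbol, exactly as described above.

```python
# A sketch of digram coding with a hypothetical static dictionary.
# The dictionary holds every single symbol plus a few frequent pairs;
# encoding tries a pair first, and falls back to a single symbol.

def digram_encode(text, dictionary):
    """Return the list of dictionary indices encoding `text`."""
    indices = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in dictionary:
            indices.append(dictionary[pair])     # pair found: consume 2 symbols
            i += 2
        else:
            indices.append(dictionary[text[i]])  # fall back to a single symbol
            i += 1
    return indices

# Hypothetical dictionary: all source symbols, then frequent pairs.
D = {'a': 0, 'b': 1, 'c': 2, 'ab': 3, 'ba': 4}
print(digram_encode('abbac', D))   # [3, 4, 2]
```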

    Adaptive Dictionary Coding

    A completely defined dictionary does not exist prior to the encoding process, and the dictionary is not fixed.

    At the beginning of coding, only an initial dictionary exists. It adapts itself to the input during the coding process.

    All adaptive dictionary coding algorithms can be traced back to two different original works by Ziv and Lempel [ziv 1977, ziv 1978].

    The algorithms based on [ziv 1977] are referred to as the LZ77 algorithms.

    Those based on [ziv 1978] are the LZ78 algorithms.

    Parsing Strategy

    Once we have a dictionary, we need to examine the input text and find a string of symbols that matches an item in the dictionary. Then the index of the item in the dictionary is encoded.

    This process of segmenting the input text into disjoint strings (whose union equals the input text) for coding is referred to as parsing.

    Obviously, the way to segment the input text into strings is not unique.

    Parsing Strategy

    In terms of the highest coding efficiency, optimal parsing is essentially a shortest-path problem [bell 1990].

    In practice, however, a method called greedy parsing is used in all the LZ77 and LZ78 algorithms.

    With greedy parsing, the encoder searches for the longest string of symbols in the input that matches an item in the dictionary at each coding step.

    Greedy parsing may not be optimal, but it is simple to implement.

    Parsing Strategy

    Example 6.1

    Consider a dictionary, D, whose phrase set is P = {a, b, ab, ba, bb, aab, bbb}. The codewords assigned to these strings are C(a) = 10, C(b) = 011, C(ab) = 010, C(ba) = 0101, C(bb) = 01, C(aab) = 11, and C(bbb) = 0110. Now the input text: abbaab.

    Using greedy parsing, we then encode the text as C(ab).C(ba).C(ab), which is the 10-bit string 010.0101.010.

    In the above representation, the periods are used to indicate the division of segments in the parsing.

    This, however, is not an optimum solution. Obviously, the following parsing will be more efficient: C(a).C(bb).C(aab), which is the 6-bit string 10.01.11.
    Sliding Window (LZ77) Algorithms

    Introduction

    The dictionary used is actually a portion of the input text, which has been recently encoded.

    The text that needs to be encoded is compared with the strings of symbols in the dictionary.

    The longest matched string in the dictionary is characterized by a pointer (sometimes called a token), which is represented by a triple of data items.

    Note that this triple functions as an index to the dictionary.

    In this way, a variable-length string of symbols is mapped to a fixed-length pointer.

    Introduction

    There is a sliding window in the LZ77 algorithms. The window consists of two parts: a search buffer and a look-ahead buffer.

    The search buffer contains the portion of the text stream that has recently been encoded --- the dictionary.

    The look-ahead buffer contains the text to be encoded next.

    The window slides through the input text stream from beginning to end during the entire encoding process.

    The size of the search buffer is much larger than that of the look-ahead buffer.

    The sliding window: usually on the order of a few thousand symbols.

    The look-ahead buffer: on the order of several tens to one hundred symbols.

    Encoding and Decoding

    Example 6.2

    Encoding:

    Encoding and Decoding

    Decoding: Now let us see how the decoder recovers the string baccbaccgi from these three triples.

    In Figure 6.5, for each part, the last previously encoded symbol c prior to the receiving of the three triples is shaded.

    Summary of the LZ77 Approach

    The first symbol in the look-ahead buffer is the symbol, or the beginning of a string of symbols, to be encoded at the moment. Let us call it the symbol s.

    In encoding, the search pointer moves to the left, away from the symbol s, to find a match of the symbol s in the search buffer. When there are multiple matches, the match that produces the longest matched string is chosen. The match is denoted by a triple (i, j, k).

    Summary of the LZ77 Approach

    The first item in the triple, i, is the offset, which is the distance between the pointer pointing to the symbol giving the maximum match and the symbol s.

    The second item, j, is the length of the matched string.

    The third item, k, is the codeword of the symbol following the matched string in the look-ahead buffer.

    At the very beginning, the content of the search buffer can be arbitrarily selected. For instance, the symbols in the search buffer may all be the space symbol.

    Summary of the LZ77 Approach

    Denote the size of the search buffer by SB, the size of the look-ahead buffer by L, and the size of the source alphabet by A. Assume that the natural binary code is used. Then we see that the LZ77 approach encodes variable-length strings of symbols with fixed-length codewords.

    Specifically, the offset i is of coding length ⌈log2(SB)⌉, the length of the matched string j is of coding length ⌈log2(SB + L)⌉, and the codeword k is of coding length ⌈log2(A)⌉, where the sign ⌈a⌉ denotes the smallest integer larger than a.

    Summary of the LZ77 Approach

    The length of the matched string is equal tobecause the search for the maximum matchingcan enter into the look-ahead buffer.

    The decoding process is simpler than theencoding process since there is no comparisoninvolved in the decoding.

    The most recently encoded symbols in the search

    buffer serve as the dictionary used in the LZ77approach. The merit of doing so is that thedictionary is adapted to the input text well.

    )(log 2 LSB


    LZ78 Algorithms

    Introduction

    Limitations with LZ77:

    If the distance between two repeated patterns is larger than the size of the search buffer, the LZ77 algorithms cannot work efficiently.

    The fixed size of both buffers implies that the matched string cannot be longer than the sum of the sizes of the two buffers, meaning another limitation on coding efficiency.

    Increasing the sizes of the search buffer and the look-ahead buffer seemingly will resolve the problems. A close look, however, reveals that it also leads to increases in the number of bits required to encode the offset and matched-string length, as well as an increase in processing complexity.

    Introduction

    LZ78: No use of the sliding window.

    Use the encoded text as a dictionary which, potentially, does not have a fixed size.

    Each time a pointer (token) is issued, the encoded string is included in the dictionary.

    Once a preset limit to the dictionary size has been reached, either the dictionary is fixed for the future (if the coding efficiency is good), or it is reset to zero, i.e., it must be restarted.

    Instead of the triples used in the LZ77, only pairs are used in the LZ78.

    Specifically, only the position of the pointer to the matched string and the symbol following the matched string need to be encoded.

    Encoding and Decoding

    Example 6.3 Encoding:

    Consider the text stream: baccbaccacbcabccbbacc.

    In Table 6.4, in general, as the encoding proceeds, the entries in the dictionary become longer and longer. First, entries with single symbols come out; later, more and more entries with two symbols show up; after that, more and more entries with three symbols appear. This means that coding efficiency is increasing.

    Decoding: Since the decoder knows the rule applied in the encoding, it can reconstruct the dictionary and decode the input text stream from the received doubles.

    Encoding and Decoding

    Index Doubles Encoded symbols

    1 < 0, C(b) > b

    2 < 0, C(a) > a

    3 < 0, C(c) > c

    4 < 3, 1 > cb

    5 < 2, 3 > ac

    6 < 3, 2 > ca

    7 < 4, 3 > cbc

    8 < 2, 1 > ab

    9 < 3, 3 > cc

    10 < 1, 1 > bb

    11 < 5, 3 > acc

    Table 6.4 An encoding example using the LZ78 algorithm

    LZW Algorithm

    Both the LZ77 and LZ78 approaches, when published in 1977 and 1978, respectively, were theory-oriented.

    The effective and practical improvement over the LZ78 in [welch 1984] brought much attention to the LZ dictionary coding techniques. The resulting algorithm is referred to as the LZW algorithm.

    It removed the second item in the double (the index of the symbol following the longest matched string) and hence enhanced coding efficiency. In other words, the LZW only sends the indexes of the dictionary to the decoder.

    LZW Algorithm

    Example 6.4

    Consider the following input text stream: accbadaccbaccbacc. We see that the source alphabet is S = {a, b, c, d}.

    Encoding:

    LZW Algorithm

    Index   Entry   Input symbols   Encoded index

    1       a       (initial dictionary)
    2       b
    3       c
    4       d

    5       ac      a               1
    6       cc      c               3
    7       cb      c               3
    8       ba      b               2
    9       ad      a               1
    10      da      d               4
    11      acc     a,c             5
    12      cba     c,b             7
    13      accb    a,c,c           11
    14      bac     b,a             8
    --      --      c,c             6

    Table 6.5 An example of the dictionary coding using the LZW algorithm

    LZW Algorithm

    Decoding: Initially, the decoder has the same dictionary (the top four rows in Table 6.5) as that in the encoder.

    Once the first index 1 comes, the decoder decodes a symbol a.

    The second index is 3, which indicates that the next symbol is c. From the rule applied in encoding, the decoder knows further that a new entry ac has been added to the dictionary with an index 5.

    The next index is 3. It is known that the next symbol is also c. It is also known that the string cc has been added into the dictionary as the sixth entry.

    In this way, the decoder reconstructs the dictionary and decodes the input text stream.
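Example 6.4 can be reproduced with a short sketch of LZW encoding and decoding; the initial dictionary is S = {a, b, c, d}, indexed from 1 as in Table 6.5. The decoder also handles the classic corner case where an index refers to the entry about to be created, which does not occur in this example.

```python
# A sketch of LZW encoding/decoding for Example 6.4, with the initial
# dictionary built from the alphabet and indexed from 1 as on the slide.

def lzw_encode(text, alphabet):
    dictionary = {s: i for i, s in enumerate(alphabet, start=1)}
    out = []
    w = ''
    for ch in text:
        if w + ch in dictionary:
            w += ch                                  # extend the current match
        else:
            out.append(dictionary[w])                # emit index of longest match
            dictionary[w + ch] = len(dictionary) + 1 # add a new entry
            w = ch
    out.append(dictionary[w])                        # flush the final match
    return out

def lzw_decode(codes, alphabet):
    dictionary = {i: s for i, s in enumerate(alphabet, start=1)}
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        # k may name the entry about to be created (classic LZW corner case)
        entry = dictionary[k] if k in dictionary else w + w[0]
        out.append(entry)
        dictionary[len(dictionary) + 1] = w + entry[0]
        w = entry
    return ''.join(out)

codes = lzw_encode('accbadaccbaccbacc', 'abcd')
print(codes)                       # [1, 3, 3, 2, 1, 4, 5, 7, 11, 8, 6]
print(lzw_decode(codes, 'abcd'))   # accbadaccbaccbacc
```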

    LZ78 Compression Algorithm (cont'd)

    Dictionary ← empty; Prefix ← empty; DictionaryIndex ← 1;

    while (characterStream is not empty)
    {
        Char ← next character in characterStream;
        if (Prefix + Char exists in the Dictionary)
            Prefix ← Prefix + Char;
        else
        {
            if (Prefix is empty)
                CodeWordForPrefix ← 0;
            else
                CodeWordForPrefix ← DictionaryIndex for Prefix;
            Output: (CodeWordForPrefix, Char);
            insertInDictionary((DictionaryIndex, Prefix + Char));
            DictionaryIndex++;
            Prefix ← empty;
        }
    }
    if (Prefix is not empty)
    {
        CodeWordForPrefix ← DictionaryIndex for Prefix;
        Output: (CodeWordForPrefix, );
    }
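The pseudocode above translates almost line for line into Python. This is a sketch, not the book's code; pairs are (codeword for the longest matched prefix, next character), and a match left over at the end of the stream is flushed as a pair with an empty character:

```python
# A Python sketch of the LZ78 compression pseudocode above.

def lz78_encode(text):
    dictionary = {}              # phrase -> dictionary index
    output = []
    prefix = ''
    for char in text:
        if prefix + char in dictionary:
            prefix += char                       # keep extending the match
        else:
            output.append((dictionary.get(prefix, 0), char))
            dictionary[prefix + char] = len(dictionary) + 1
            prefix = ''
    if prefix:                                   # leftover prefix at the end
        output.append((dictionary[prefix], ''))
    return output

print(lz78_encode('ABBCBCABABCAABCAAB'))
# (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B), as in Example 1
```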

    Exercises

    1. Encode (i.e., compress) the following string using the LZ77 algorithm.

    Window size: 5
    ABBABCABBBBC

    2. Encode (i.e., compress) the following strings

    using the LZ78 algorithm.

    1.ABBCBCABABCAABCAAB

    2.BABAABRRRA

    3.AAAAAAAAA

    LZ77 - Example

    Window size : 5

    ABBABCABBBBC

    NextSeq   Code
    A         (0,0,A)
    B         (0,0,B)
    BA        (1,1,A)
    BC        (3,1,C)
    ABB       (3,2,B)
    BBC       (2,2,C)
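The triples in this table can be reproduced with a small sketch of the LZ77 encoder (my implementation, with a search buffer of 5 symbols). A match may extend into the look-ahead buffer, and scanning larger offsets first reproduces the tie-breaking shown in the table:

```python
# A sketch of LZ77 encoding with a search buffer of 5 symbols.
# Offsets are counted back from the current position; a match may run
# into the look-ahead buffer (overlap), and among equally long matches
# the largest offset is kept.

def lz77_encode(text, search_size=5):
    pos = 0
    triples = []
    while pos < len(text):
        best_off, best_len = 0, 0
        for off in range(min(pos, search_size), 0, -1):
            length = 0
            while (pos + length < len(text) - 1
                   and text[pos + length - off] == text[pos + length]):
                length += 1
            if length > best_len:                # keep the longest match
                best_off, best_len = off, length
        nxt = text[pos + best_len]               # symbol after the match
        triples.append((best_off, best_len, nxt))
        pos += best_len + 1
    return triples

print(lz77_encode('ABBABCABBBBC'))
# [(0,0,'A'), (0,0,'B'), (1,1,'A'), (3,1,'C'), (3,2,'B'), (2,2,'C')]
```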

    Example 1: LZ78 Compression

    Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the LZ78 algorithm.

    The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B)

    Note: the above is just a representation; the commas and parentheses are not transmitted. We will discuss the actual form of the compressed message later on.

    Example 1: LZ78 Compression (cont'd)

    1. A is not in the Dictionary; insert it.
    2. B is not in the Dictionary; insert it.
    3. B is in the Dictionary.
       BC is not in the Dictionary; insert it.
    4. B is in the Dictionary.
       BC is in the Dictionary.
       BCA is not in the Dictionary; insert it.
    5. B is in the Dictionary.
       BA is not in the Dictionary; insert it.
    6. B is in the Dictionary.
       BC is in the Dictionary.
       BCA is in the Dictionary.
       BCAA is not in the Dictionary; insert it.
    7. B is in the Dictionary.
       BC is in the Dictionary.
       BCA is in the Dictionary.
       BCAA is in the Dictionary.
       BCAAB is not in the Dictionary; insert it.

    Example 2: LZ78 Compression


    Encode (i.e., compress) the string BABAABRRRA using the LZ78 algorithm.

    The compressed message is: (0,B)(0,A)(1,A)(2,B)(0,R)(5,R)(2, )

    Example 2: LZ78 Compression (cont'd)

    1. B is not in the Dictionary; insert it.
    2. A is not in the Dictionary; insert it.
    3. B is in the Dictionary.
       BA is not in the Dictionary; insert it.
    4. A is in the Dictionary.
       AB is not in the Dictionary; insert it.
    5. R is not in the Dictionary; insert it.
    6. R is in the Dictionary.
       RR is not in the Dictionary; insert it.
    7. A is in the Dictionary and it is the last input character; output a pair containing its index: (2, )

    Example 3: LZ78 Compression

    Encode (i.e., compress) the string AAAAAAAAA using the LZ78 algorithm.

    1. A is not in the Dictionary; insert it.
    2. A is in the Dictionary.
       AA is not in the Dictionary; insert it.
    3. A is in the Dictionary.
       AA is in the Dictionary.
       AAA is not in the Dictionary; insert it.
    4. A is in the Dictionary.
       AA is in the Dictionary.
       AAA is in the Dictionary and it is the last pattern; output a pair containing its index: (3, )

    LZ78 Compression: Number of bits transmitted

    Example: Uncompressed String: ABBCBCABABCAABCAAB

    Number of bits = Total number of characters * 8

    = 18 * 8

    = 144 bits

    Suppose the codewords are indexed starting from 1:

    Compressed string (codewords): (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B)

    Codeword index: 1 2 3 4 5 6 7

    Each codeword consists of an integer and a character:

    The character is represented by 8 bits.

    The number of bits n required to represent the integer part of the codeword with index i is given by n = ⌈log2(i)⌉, taking n = 1 for i = 1.

    Alternatively, the number of bits required to represent the integer part of the codeword with index i is the number of significant bits required to represent the integer i - 1.

    LZ78 Compression: Number of bits transmitted (contd)

    Codeword:  (0, A)  (0, B)  (2, C)  (3, A)  (2, A)  (4, A)  (6, B)
    Index:     1       2       3       4       5       6       7

    Bits: (1 + 8) + (1 + 8) + (2 + 8) + (2 + 8) + (3 + 8) + (3 + 8) + (3 + 8) = 71 bits

    The actual compressed message is: 0A0B10C11A010A100A110B

    where each character is replaced by its binary 8-bit ASCII code.
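The 71-bit total can be computed directly from the rule above: the integer part of the codeword with index i takes ⌈log2(i)⌉ bits (1 bit for i = 1), plus 8 bits per character. A small sketch:

```python
# Bit count for a sequence of LZ78 codewords: the integer in the codeword
# with index i is written with ceil(log2(i)) bits (1 bit for i = 1),
# and each character costs 8 bits.

from math import ceil, log2

def lz78_bit_count(num_codewords, bits_per_char=8):
    total = 0
    for i in range(1, num_codewords + 1):
        int_bits = max(1, ceil(log2(i)))     # bits for the integer part
        total += int_bits + bits_per_char    # plus the 8-bit character
    return total

print(lz78_bit_count(7))   # the 7 codewords of Example 1: 71 bits
```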

    LZ78 Decompression Algorithm

    Dictionary ← empty; DictionaryIndex ← 1;

    while (there are more (CodeWord, Char) pairs in codestream)
    {
        CodeWord ← next CodeWord in codestream;
        Char ← character corresponding to CodeWord;
        if (CodeWord == 0)
            String ← empty;
        else
            String ← string at index CodeWord in Dictionary;
        Output: String + Char;
        insertInDictionary((DictionaryIndex, String + Char));
        DictionaryIndex++;
    }

    Summary:
    input: (CW, character) pairs
    output:
        if (CW == 0)
            output: currentCharacter
        else
            output: stringAtIndex CW + currentCharacter
    insert: current output in dictionary
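The decompression pseudocode above, sketched in Python (my translation, not the book's code). Codeword 0 means an empty prefix, and the final pair may carry an empty character, as in Example 2:

```python
# A Python sketch of the LZ78 decompression pseudocode above.

def lz78_decode(pairs):
    dictionary = {}              # index -> phrase
    output = []
    for codeword, char in pairs:
        string = dictionary[codeword] if codeword else ''
        output.append(string + char)                # emit prefix + character
        dictionary[len(dictionary) + 1] = string + char  # rebuild dictionary
    return ''.join(output)

print(lz78_decode([(0, 'B'), (0, 'A'), (1, 'A'), (2, 'B'),
                   (0, 'R'), (5, 'R'), (2, '')]))   # BABAABRRRA
```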

    Example 1: LZ78 Decompression


    Decode (i.e., decompress) the sequence (0, A) (0, B) (2, C) (3, A) (2, A) (4, A) (6, B)

    The decompressed message is: ABBCBCABABCAABCAAB

    Example 2: LZ78 Decompression

    Decode (i.e., decompress) the sequence (0, B) (0, A) (1, A) (2, B) (0, R) (5, R) (2, )

    The decompressed message is: BABAABRRRA

    Example 3: LZ78 Decompression


    Decode (i.e., decompress) the sequence (0, A) (1, A) (2, A) (3, )

    The decompressed message is: AAAAAAAAA

    Applications

    The CCITT Recommendation V.42 bis is a data compression standard used in modems that connect computers with remote users via the GSTN. In the compressed mode, the LZW algorithm is recommended for data compression.

    In image compression, the LZW finds its application as well. Specifically, it is utilized in the graphic interchange format (GIF), which was created to encode graphical images. GIF is now also used to encode natural images, though it is not very efficient in this regard. For more information, readers are referred to [sayood 1996].

    The LZW algorithm is also used in the Unix Compress command.

    International Standards for

    Lossless Still Image Compression

    Lossless Bilevel Still Image Compression

    Algorithms

    As mentioned above, there are four different international standard algorithms falling into this category.

    MH (Modified Huffman coding): This algorithm is defined in CCITT Recommendation T.4 for facsimile coding. It uses the 1-D RLC technique followed by the modified Huffman coding technique.

    Algorithms

    MR (Modified READ (Relative Element Address Designate) coding): Defined in CCITT Recommendation T.4 for facsimile coding. It uses the 2-D RLC technique followed by the modified Huffman coding technique.

    MMR (Modified Modified READ coding): Defined in CCITT Recommendation T.6. It is based on MR, but is modified to maximize compression.

    JBIG (Joint Bilevel Image experts Group coding): Defined in CCITT Recommendation T.82. It uses an adaptive 2-D coding model, followed by an adaptive arithmetic coding technique.

    Performance Comparison

    The JBIG test image set was used to compare the performance of the above-mentioned algorithms. The set contains scanned business documents with different densities, graphic images, digital halftones, and mixed (document and halftone) images.

    Note that digital halftones, also named (digital) halftone images, are generated by using only binary devices.

    Some small black units are imposed on a white background. The units may assume different shapes: a circle, a square, and so on.

    The denser the black units in a spot of an image, the darker the spot appears.

    The digital halftoning method has been used for printing gray-level images in newspapers and books.

    Performance Comparison

    For bilevel images excluding digital halftones, the compression ratio achieved by these techniques ranges from 3 to 100. The compression ratio increases monotonically in the order of the following standard algorithms: MH, MR, MMR, JBIG.

    For digital halftones, MH, MR, and MMR result in data expansion, while JBIG achieves compression ratios in the range of 5 to 20.

    This demonstrates that among the techniques, JBIG is the only one suitable for the compression of digital halftones.

    Lossless Multilevel Still Image

    Compression

    Algorithms

    There are two international standards for multilevel still image compression:

    JBIG (Joint Bilevel Image experts Group coding): Defined in CCITT Recommendation T.82.

    It uses an adaptive arithmetic coding technique. To encode multilevel images, the JBIG decomposes multilevel images into bit-planes, then compresses these bit-planes using its bilevel image compression technique. To further enhance the compression ratio, it uses Gray coding to represent pixel amplitudes instead of weighted binary coding.

    Algorithms

    JPEG (Joint Photographic (image) Experts Group coding): Defined in CCITT Recommendation T.81.

    For lossless coding (Lossless JPEG), it uses the differential coding technique. The predictive error is encoded using either Huffman coding or adaptive arithmetic coding techniques.

    Performance Comparison

    A set of color test images from the JPEG standards committee was used for performance comparison.

    The luminance component (Y) is of resolution 720 × 576 pixels, while the chrominance components (U and V) are of 360 × 576 pixels.

    The compression ratios calculated are the combined results for all three components. The following observations have been reported.

    Performance Comparison

    When quantized in 8 bits/pixel, the compression ratios vary much less for multilevel images than for bilevel images, and are roughly equal to 2.

    When quantized with 5 bits/pixel down to 2 bits/pixel, compared with the lossless JPEG, the JBIG achieves an increasingly higher compression ratio, up to a maximum of 29%.

    When quantized with 6 bits/pixel, JBIG and lossless JPEG achieve similar compression ratios.

    When quantized with 7 bits/pixel to 8 bits/pixel, the lossless JPEG achieves a 2.4% to 2.6% higher compression ratio than JBIG.