January 5, 2016 1 [email protected]

Lecture 7 sequitur

Contents

– Introduction
– Context Free Grammar
– Sequitur Principles
– Context-Free Grammar Example

Introduction

Sequitur (or the Nevill-Manning algorithm) is a recursive algorithm, developed by Craig Nevill-Manning and Ian H. Witten in 1997, that infers a hierarchical structure (a context-free grammar) from a sequence of discrete symbols. The algorithm operates in linear space and time, and it can be used in data compression applications.

Sequitur is based on the concept of context-free grammars, so we start with a short review of this field.

Sequitur (from the Latin for “it follows”) is based on the concept of context-free grammars. It considers the input stream a valid sequence in some formal language.

It reads the input symbol by symbol and uses repeated phrases in the input data to build a set of context-free production rules.

A (natural) language starts with a small number of building blocks (letters and punctuation marks) and uses them to construct words and sentences. A sentence is a finite sequence (a string) of symbols that obeys certain grammar rules.

Similarly, a formal language uses a small number of symbols (called terminal symbols) from which valid sequences can be constructed. The rules of the language, called production rules, can be used to construct valid sequences and also to determine whether a given sequence is valid.

A production rule consists of a nonterminal symbol on the left and a string of terminal and nonterminal symbols on the right.

– terminals: b, e

– non-terminals: S, A

– Production Rules:

– S is the start symbol

The nonterminal symbol on the left becomes the name of the string on the right. In general, the right-hand side may contain several alternative strings, but the rules generated by sequitur have just a single string.

The BNF notation, used to describe the syntax of programming languages, is based on the concept of production rules. BNF is an acronym for “Backus Naur Form”.

We use lowercase letters to denote terminal symbols and uppercase letters for the nonterminals.

Context Free Grammar

Suppose that the following production rules are given:

A → ab, B → Ac, C → BdA.

Using these rules we can generate the valid strings ab (an application of the nonterminal A), abc (an application of B), abcdab (an application of C), as well as many others. As an exercise, verify that the string abcdab is valid.

It is clear that the production rules reduce the redundancy of the original sequence, so they can serve as the basis of a compression method.
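The three rules A → ab, B → Ac, C → BdA can be expanded mechanically, which is one way to check that abcdab is valid. The following is a minimal sketch in Python; the names RULES and expand are illustrative choices, not part of the lecture, and the sketch relies on the convention that lowercase letters are terminals and uppercase letters are nonterminals.

```python
# Production rules from the slide: nonterminal -> right-hand side.
RULES = {"A": "ab", "B": "Ac", "C": "BdA"}


def expand(symbol: str) -> str:
    """Expand a symbol into the string of terminals it stands for."""
    if symbol.islower():
        # Lowercase letters are terminals and expand to themselves.
        return symbol
    # A nonterminal expands to the concatenation of its expanded symbols.
    return "".join(expand(s) for s in RULES[symbol])


for name in RULES:
    print(name, "->", expand(name))
```

Expanding C walks through B and A in turn, so the six-symbol string is represented by rules whose right-hand sides hold only two or three symbols each.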

Each repetition results in a rule and is replaced by the name of the rule (a nonterminal symbol), thereby resulting in a shorter representation.

Generally, a set of production rules can be used to generate many valid sequences, but the production rules produced by sequitur are not general. They can be used only to reconstruct the original data.

The production rules themselves are not much smaller than the original data, so sequitur has to go through one more step, where it compresses the production rules.

The compressed production rules become the compressed stream, and the sequitur decoder uses the rules (after decompressing them) to reconstruct the original data.

If the input is a typical text in a natural language, the top-level rule becomes very long, typically 10–20% of the size of the input, and the other rules are short, with typically 2–3 symbols each.

Sequitur Principles

• Digram Uniqueness:
– No pair of adjacent symbols (digram) appears more than once in the grammar.
• Rule Utility:
– Every production rule is used more than once.
• These two principles are maintained as an invariant while inferring a grammar for the input string.

Sequitur constructs its grammars by observing two principles (or enforcing two constraints) that we denote by p1 and p2.

Constraint p1 is: no pair of adjacent symbols appears more than once in the grammar (this can be rephrased as: every digram in the grammar is unique).

Constraint p2 is: every rule should be used more than once. This ensures that rules are useful. A rule that occurs just once is useless and should be deleted.
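Both constraints can be checked directly on a small grammar. The sketch below is an illustration only: the function names repeated_digrams and underused_rules are hypothetical, a grammar is assumed to be a dictionary mapping single uppercase letters to right-hand-side strings with S as the start rule, and overlapping occurrences of a digram (as in aaa) are assumed not to count as a repetition.

```python
from collections import Counter


def repeated_digrams(grammar: dict) -> list:
    """Return the digrams that break p1 (they occur more than once)."""
    counts = Counter()
    for body in grammar.values():
        last = {}  # index of the last counted occurrence within this body
        for i in range(len(body) - 1):
            digram = body[i:i + 2]
            if last.get(digram) == i - 1:
                continue  # overlaps the previous occurrence, e.g. "aaa"
            last[digram] = i
            counts[digram] += 1
    return sorted(d for d, n in counts.items() if n > 1)


def underused_rules(grammar: dict, start: str = "S") -> list:
    """Return the rules that break p2 (they are used fewer than two times)."""
    uses = Counter(sym for body in grammar.values() for sym in body)
    return sorted(r for r in grammar if r != start and uses[r] < 2)


print(repeated_digrams({"S": "abcdbc"}))           # digram bc occurs twice
print(repeated_digrams({"S": "aAdA", "A": "bc"}))  # no repeated digram
```

A grammar satisfies both constraints when both functions return an empty list.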

The result is a two-rule grammar, where the first rule is the input sequence with its redundancy removed, and the second rule is short, replacing the digram bc with the single nonterminal symbol A.

The input S is considered a one-rule grammar. It has redundancy, so each occurrence of abcdbc is replaced with A. Rule A still has redundancy because of a repetition of the phrase bc, which justifies the introduction of a second rule B.

The figure above shows how the two constraints can be violated. The first grammar in the figure contains two occurrences of bc, thereby violating p1. The second grammar contains rule B, which is used just once. It is easy to see how removing B reduces the size of the grammar. The resulting, shorter grammar is shown in the following figure. It is one rule and one symbol shorter.
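Removing a rule that is used only once can be sketched as a simple substitution. The figure itself is not reproduced in this text, so the grammar below (S → CC, A → bc, B → aA, C → BdA) is an assumed reconstruction chosen to match the description: rule B is used once, and removing it saves one rule and one symbol. The helper names inline_rule and size are hypothetical.

```python
def inline_rule(grammar: dict, name: str) -> dict:
    """Delete rule `name` and substitute its body wherever it was used."""
    body = grammar[name]
    return {k: v.replace(name, body) for k, v in grammar.items() if k != name}


def size(grammar: dict) -> tuple:
    """Return (number of rules, number of right-hand-side symbols)."""
    return len(grammar), sum(len(body) for body in grammar.values())


# Assumed grammar in which rule B is used just once (inside rule C).
before = {"S": "CC", "A": "bc", "B": "aA", "C": "BdA"}
after = inline_rule(before, "B")

print(after)                      # rule C now holds aAdA directly
print(size(before), size(after))  # rule and symbol counts before and after
```

The substitution leaves the string generated by S unchanged; only the bookkeeping for the underused rule disappears.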

The sequitur encoder constructs the grammar rules while enforcing the two constraints at all times. If constraint p1 is violated, the encoder generates a new production rule. When p2 is violated, the useless rule is deleted.

The encoder starts by setting rule S to the first input symbol. It then goes into a loop where new symbols are input and appended to S. Each time a new symbol is appended to S, the symbol and its predecessor become the current digram.

If the current digram already occurs in the grammar, then p1 has been violated, and the encoder generates a new rule with the current digram on the right-hand side and with a new nonterminal symbol on the left. The two occurrences of the digram are replaced by this nonterminal.
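The encoder loop described above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the linear-time implementation: it rescans the whole grammar after every change instead of using linked lists and a digram index, it represents terminals as characters and nonterminals as integers (0 for S), and it reuses an existing rule when the repeated digram is already the complete body of that rule. All function names are hypothetical.

```python
from itertools import count

START = 0  # integer id of the start rule S; terminals are characters


def find_repeat(rules):
    """Return two non-overlapping locations of a repeated digram, or None."""
    seen = {}
    for name, body in rules.items():
        for i in range(len(body) - 1):
            digram = (body[i], body[i + 1])
            first = seen.setdefault(digram, (name, i))
            if first == (name, i):
                continue  # first time this digram is seen
            if first[0] == name and i - first[1] < 2:
                continue  # overlapping occurrence, as in "aaa"
            return first, (name, i)
    return None


def fix_digram(rules, fresh):
    """Restore p1 for one repeated digram; return True if anything changed."""
    hit = find_repeat(rules)
    if hit is None:
        return False
    (r1, i1), _ = hit
    digram = rules[r1][i1:i1 + 2]
    # If one occurrence is the whole body of an existing rule, reuse that rule.
    for (r, _), (other_r, other_i) in (hit, hit[::-1]):
        if r != START and len(rules[r]) == 2:
            rules[other_r][other_i:other_i + 2] = [r]
            return True
    new = next(fresh)
    # Replace the later occurrence first so the earlier index stays valid.
    for r, i in sorted(hit, key=lambda loc: loc[1], reverse=True):
        rules[r][i:i + 2] = [new]
    rules[new] = digram
    return True


def fix_utility(rules):
    """Restore p2 by inlining one rule that is used only once."""
    uses = {}
    for body in rules.values():
        for sym in body:
            if isinstance(sym, int):
                uses[sym] = uses.get(sym, 0) + 1
    for name, n in uses.items():
        if n == 1:
            holder = next(b for b in rules.values() if name in b)
            i = holder.index(name)
            holder[i:i + 1] = rules.pop(name)
            return True
    return False


def sequitur(text):
    """Build a grammar for `text`, enforcing p1 and p2 after every symbol."""
    rules = {START: []}
    fresh = count(1)  # supplies ids for new nonterminals
    for ch in text:
        rules[START].append(ch)
        while fix_digram(rules, fresh) or fix_utility(rules):
            pass
    return rules


def expand(rules, sym=START):
    """Reconstruct the original string from the grammar."""
    if isinstance(sym, str):
        return sym
    return "".join(expand(rules, s) for s in rules[sym])


grammar = sequitur("abcdbcabcdbc")
print(grammar)
print(expand(grammar))
```

For the input abcdbcabcdbc the sketch ends with three rules: a start rule made of two copies of one nonterminal, a rule for a B d B, and a rule for bc. The rescanning makes it quadratic at best, so it is suitable only for studying the behavior of the two constraints on short inputs.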

Notice that generating rule C has made rule B underused (i.e., used just once), which is why it was removed in the previous figure.

One more detail, namely rule utilization, still needs to be discussed. When a new rule X is generated, the encoder also generates a counter associated with X, and initializes the counter to the number of times X is used (a new rule is normally used twice when it is first generated). Each time X is used in another rule Y, the encoder increments X’s counter by 1. When Y is deleted, the counter for X is decremented by 1. If X’s counter reaches 1, rule X is deleted.
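The counter bookkeeping can be sketched as a small class. This is only an illustration of the counting scheme described above; the class name RuleCounters and its method names are hypothetical, and the actual deletion of an underused rule is left to the caller.

```python
class RuleCounters:
    """Track how many times each rule is used elsewhere in the grammar."""

    def __init__(self):
        self.count = {}

    def create(self, rule, uses=2):
        # A new rule is normally used twice when it is first generated.
        self.count[rule] = uses

    def used(self, rule):
        # Rule is referenced from one more place.
        self.count[rule] += 1

    def released(self, rule):
        """Drop one reference; return True if the rule is now underused."""
        self.count[rule] -= 1
        return self.count[rule] == 1


counters = RuleCounters()
counters.create("B")            # B starts with two uses
if counters.released("B"):      # one use disappears
    print("rule B is used just once and should be deleted")
```

Keeping a counter per rule avoids rescanning the grammar to find out how often a rule is used.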

As an example, we show the information sent to the decoder for the input string abcdbcabcdbc (see the figure above). Rule S consists of two copies of rule A. The first time rule A is encountered, its contents aBdB are sent. This involves sending rule B twice. The first time rule B is sent, its contents bc are sent (and the decoder does not know that the string bc it is receiving is the contents of a rule). The second time rule B is sent, the pair (1, 2) is sent (offset 1, count 2).

The decoder identifies the pair and uses it to set up the rule 1 → bc. Sending the first copy of rule A therefore amounts to sending abcd(1, 2).

The second copy of rule A is sent as the pair (0, 4), since A starts at offset 0 in S and its length is 4. The decoder identifies this pair and uses it to set up the rule 2 → a 1 d 1. The final result is therefore abcd(1, 2)(0, 4).
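The decoder side of this example can be sketched as follows. The sketch handles only the two kinds of items that appear in the example, plain terminals and (offset, count) pairs, and it assumes that rules are numbered 1, 2, ... in the order the pairs arrive; the function name decode and the list-based stream format are illustrative choices.

```python
def decode(stream):
    """Rebuild the original string from terminals and (offset, count) pairs."""
    s = []      # the top-level rule S as received so far
    rules = {}  # rule number -> list of symbols

    for item in stream:
        if isinstance(item, tuple):
            offset, length = item
            name = len(rules) + 1
            # The symbols already in S at this offset become a new rule ...
            rules[name] = s[offset:offset + length]
            # ... whose first occurrence is replaced by the rule number,
            s[offset:offset + length] = [name]
            # and the pair itself stands for the second occurrence.
            s.append(name)
        else:
            s.append(item)

    def expand(sym):
        if isinstance(sym, int):
            return "".join(expand(x) for x in rules[sym])
        return sym

    return "".join(expand(x) for x in s)


print(decode(["a", "b", "c", "d", (1, 2), (0, 4)]))
```

While processing this stream the sketch sets up rule 1 as b c and rule 2 as a 1 d 1, matching the rules named in the text.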

Context-Free Grammar Example

Arithmetic Expressions

Sequitur Example

The Hierarchy
