Chapter 6. Statistical Inference: n-gram Models over Sparse Data
2005. 1. 13
이 동 훈 [email protected]
Foundations of Statistical Natural Language Processing
Table of Contents
Introduction
Bins: Forming Equivalence Classes
- Reliability vs. Discrimination
- N-gram models
Statistical Estimators
- Maximum Likelihood Estimation (MLE)
- Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
- Held out estimation
- Cross-validation (deleted estimation)
- Good-Turing estimation
Combining Estimators
- Simple linear interpolation
- Katz’s backing-off
- General linear interpolation
Conclusions
Introduction
Object of Statistical NLP: perform statistical inference for the field of natural language.
Statistical inference in general consists of taking some data generated by an unknown probability distribution and making inferences about that distribution.
The problem divides into three areas: dividing the training data into equivalence classes, finding a good statistical estimator for each equivalence class, and combining multiple estimators.
Bins: Forming Equivalence Classes[1/2]
Reliability vs. Discrimination
“large green ___________”
- tree? mountain? frog? car?
“swallowed the large green ________”
- pill? broccoli?
Larger n: more information about the context of the specific instance (greater discrimination)
Smaller n: more instances in the training data, better statistical estimates (more reliability)
Bins: Forming Equivalence Classes[2/2]
N-gram models: an “n-gram” is a sequence of n words. Predicting the next word rests on the Markov assumption:
- Only the prior local context (the last few words) affects the next word:

P(w_n \mid w_1, \dots, w_{n-1})

Selecting an n (vocabulary size = 20,000 words):
- n = 2 (bigrams): 400,000,000 bins
- n = 3 (trigrams): 8,000,000,000,000 bins
- n = 4 (4-grams): 1.6 × 10^17 bins
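As a quick check, the bin counts above are simply V^n for a vocabulary of V = 20,000 word types; a minimal sketch:

```python
# Number of parameters (bins) an n-gram model must estimate over a
# vocabulary of V word types: one bin per possible n-word sequence.
V = 20_000

for n in (2, 3, 4):
    print(f"{n}-grams: {V ** n:.1e} bins")
# 2-grams: 4.0e+08, 3-grams: 8.0e+12, 4-grams: 1.6e+17
```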
Statistical Estimators[1/3]
Given the observed training data, how do you develop a model (probability distribution) to predict future events?
- From the training data, derive probability estimates of the target feature.
Estimating the unknown probability distribution of n-grams.
P(w_n \mid w_1 \dots w_{n-1}) = \frac{P(w_1 \dots w_n)}{P(w_1 \dots w_{n-1})}
Statistical Estimators[2/3]
Notation for the statistical estimation chapter:
- N: number of training instances
- B: number of bins the training instances are divided into
- w_1n: an n-gram w_1 … w_n in the training text
- C(w_1 … w_n): frequency of the n-gram w_1 … w_n in the training text
- r: frequency of an n-gram
- f(·): frequency estimate of a model
- N_r: number of bins that have r training instances in them
- T_r: total count of n-grams of frequency r in further data
- h: ‘history’ of preceding words
Statistical Estimators[3/3]
Example - Instances in the training corpus:
“inferior to ________”
Maximum Likelihood Estimation (MLE)[1/2]
Definition: use the relative frequency as the probability estimate.
Example: the corpus contains 10 training instances of “comes across”. 8 times it was followed by “as”: P(as) = 0.8. Once each by “more” and “a”: P(more) = 0.1, P(a) = 0.1. Any word x not among the above three: P(x) = 0.0.
Formula
P_{MLE}(w_1 \dots w_n) = \frac{C(w_1 \dots w_n)}{N}

P_{MLE}(w_n \mid w_1 \dots w_{n-1}) = \frac{C(w_1 \dots w_n)}{C(w_1 \dots w_{n-1})}
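A minimal sketch of the conditional MLE estimate, using hypothetical counts matching the “comes across” example above:

```python
from collections import Counter

# Hypothetical counts: "comes across" seen 10 times, followed 8 times by "as"
# and once each by "more" and "a".
history_counts = Counter({("comes", "across"): 10})
ngram_counts = Counter({
    ("comes", "across", "as"): 8,
    ("comes", "across", "more"): 1,
    ("comes", "across", "a"): 1,
})

def p_mle(word, history):
    """P_MLE(w | history) = C(history, w) / C(history)."""
    return ngram_counts[history + (word,)] / history_counts[history]

print(p_mle("as", ("comes", "across")))   # 0.8
print(p_mle("a", ("comes", "across")))    # 0.1
print(p_mle("the", ("comes", "across")))  # 0.0 -- unseen events get zero probability
```

The zero estimate for unseen continuations is exactly the sparse-data problem that the smoothing methods below address.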
Laplace’s law, Lidstone’s law and the Jeffreys-Perks law[1/2]
Laplace’s law
Add a little bit of probability space to unseen events
P_{LAP}(w_1 \dots w_n) = \frac{C(w_1 \dots w_n) + 1}{N + B}
Laplace’s law, Lidstone’s law and the Jeffreys-Perks law[2/2]
Lidstone’s law and the Jeffreys-Perks law
Lidstone’s Law
- Add some positive value λ to each count.
Jeffreys-Perks Law
- λ = 0.5
- Also called ELE (Expected Likelihood Estimation).

P_{Lid}(w_1 \dots w_n) = \frac{C(w_1 \dots w_n) + \lambda}{N + \lambda B}
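A minimal sketch of add-λ (Lidstone) smoothing over B bins; λ = 1 gives Laplace’s law and λ = 0.5 gives the Jeffreys-Perks (ELE) estimate. The counts and bin count here are hypothetical toy values:

```python
from collections import Counter

def p_lidstone(ngram, counts, N, B, lam):
    """P_Lid(w_1..w_n) = (C(w_1..w_n) + lambda) / (N + lambda * B)."""
    return (counts[ngram] + lam) / (N + lam * B)

# Toy data: 10 observed bigram tokens over B = 20000**2 possible bigrams.
counts = Counter({("comes", "across"): 10})
N = 10
B = 20_000 ** 2

print(p_lidstone(("comes", "across"), counts, N, B, lam=1.0))  # Laplace (add-one)
print(p_lidstone(("comes", "across"), counts, N, B, lam=0.5))  # Jeffreys-Perks / ELE
print(p_lidstone(("never", "seen"), counts, N, B, lam=0.5))    # unseen bigram gets mass > 0
```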
Held out estimation
Validate by holding out part of the training data.
- C_1(w_1 … w_n) = frequency of w_1 … w_n in the training data
- C_2(w_1 … w_n) = frequency of w_1 … w_n in the held-out data
- T = number of tokens in the held-out data
T_r = \sum_{\{w_1 \dots w_n :\ C_1(w_1 \dots w_n) = r\}} C_2(w_1 \dots w_n)

P_{ho}(w_1 \dots w_n) = \frac{T_r}{N_r \, T}, \quad \text{where } C(w_1 \dots w_n) = r
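A minimal sketch of held-out estimation under these definitions; the two toy bigram streams are hypothetical:

```python
from collections import Counter

def held_out_probs(train_ngrams, heldout_ngrams):
    """Return {r: P_ho for an n-gram seen r times in the training data}."""
    c1 = Counter(train_ngrams)       # C_1: counts in training data
    c2 = Counter(heldout_ngrams)     # C_2: counts in held-out data
    T = len(heldout_ngrams)          # tokens in held-out data
    n_r = Counter(c1.values())       # N_r: bins with r training instances
    t_r = Counter()                  # T_r: held-out mass of those bins
    for ngram, r in c1.items():
        t_r[r] += c2[ngram]
    return {r: t_r[r] / (n_r[r] * T) for r in n_r}

train = [("a", "b")] * 3 + [("b", "c")]
heldout = [("a", "b")] * 2 + [("b", "c")] * 2
print(held_out_probs(train, heldout))  # {3: 0.5, 1: 0.5}
```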
Cross-validation (deleted estimation)[1/2]
Use data for both training and validation.
Divide the training data into two parts, A and B: train on A and validate on B (Model 1), then train on B and validate on A (Model 2).
Combine the two models: Model 1 + Model 2 → final model.
Cross-validation (deleted estimation)[2/2]
Cross-validation: the training data is used both as initial training data and as held-out data.
On large training corpora, deleted estimation works better than held-out estimation
Split the training data into parts 0 and 1. Let N_r^a be the number of n-grams occurring r times in part a, and T_r^{ab} the total occurrences in part b of the n-grams occurring r times in part a.

P_{ho}(w_1 \dots w_n) = \frac{T_r^{01}}{N_r^0 \, N} \ \text{ or } \ \frac{T_r^{10}}{N_r^1 \, N}, \quad \text{where } C(w_1 \dots w_n) = r

P_{del}(w_1 \dots w_n) = \frac{T_r^{01} + T_r^{10}}{N \,(N_r^0 + N_r^1)}, \quad \text{where } C(w_1 \dots w_n) = r
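A minimal sketch of deleted estimation under these definitions, using hypothetical toy halves of a training corpus:

```python
from collections import Counter

def deleted_estimates(part0, part1):
    """Return {r: P_del for an n-gram seen r times}, combining both directions."""
    c0, c1 = Counter(part0), Counter(part1)
    N = len(part0) + len(part1)          # total training tokens
    n_r0 = Counter(c0.values())          # N_r^0
    n_r1 = Counter(c1.values())          # N_r^1
    t_r01, t_r10 = Counter(), Counter()
    for g, r in c0.items():              # T_r^{01}: part-1 occurrences of n-grams
        t_r01[r] += c1[g]                #   that occurred r times in part 0
    for g, r in c1.items():              # T_r^{10}: the reverse direction
        t_r10[r] += c0[g]
    rs = set(n_r0) | set(n_r1)
    return {r: (t_r01[r] + t_r10[r]) / (N * (n_r0[r] + n_r1[r])) for r in rs}

part0 = [("a", "b")] * 3 + [("b", "c")]
part1 = [("a", "b")] * 2 + [("b", "c")] * 2
print(deleted_estimates(part0, part1))
```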
Good-Turing estimation
Suitable for a large number of observations from a large vocabulary
Works well for n-grams
P_{GT} = \frac{r^*}{N}, \quad \text{where } r^* = \frac{(r+1)\, E(N_{r+1})}{E(N_r)}

(r* is an adjusted frequency; E denotes the expectation of a random variable)
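A minimal sketch of the Good-Turing idea, using the observed N_r directly in place of the expectations E(N_r) (a simplification; practical implementations smooth the N_r counts first):

```python
from collections import Counter

def good_turing_probs(ngrams):
    """Return {r: P_GT for one n-gram seen r times}, with raw N_r as E(N_r)."""
    counts = Counter(ngrams)
    N = len(ngrams)                    # total training instances
    n_r = Counter(counts.values())     # N_r: number of n-grams with frequency r
    probs = {}
    for r in sorted(n_r):
        if n_r.get(r + 1, 0) == 0:     # no higher-frequency bin observed: fall back to MLE
            probs[r] = r / N
        else:
            r_star = (r + 1) * n_r[r + 1] / n_r[r]   # adjusted frequency r*
            probs[r] = r_star / N
    return probs

toy = [("a", "b")] * 2 + [("b", "c")] + [("c", "d")] + [("d", "e")]
print(good_turing_probs(toy))  # singletons are discounted below their MLE of 0.2
```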
Combining Estimators[1/3]
Basic idea: consider how to combine multiple probability estimates from various different models.
How can you develop a model that uses n-grams of different lengths as appropriate?
Simple linear interpolation
Combination of trigram, bigram and unigram:

P_{li}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1}),
\quad \text{where } 0 \le \lambda_i \le 1 \text{ and } \sum_i \lambda_i = 1
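A minimal sketch of simple linear interpolation with fixed, hypothetical weights (in practice the λ_i are estimated, e.g. by EM on held-out data):

```python
def p_interpolated(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | w1, w2) = l1*P1(w) + l2*P2(w | w2) + l3*P3(w | w1, w2)."""
    l1, l2, l3 = lambdas                 # must satisfy 0 <= l_i <= 1 and sum to 1
    return (l1 * p_uni.get(w, 0.0)
            + l2 * p_bi.get((w2, w), 0.0)
            + l3 * p_tri.get((w1, w2, w), 0.0))

# Hypothetical component estimates
p_uni = {"as": 0.01}
p_bi = {("across", "as"): 0.3}
p_tri = {("comes", "across", "as"): 0.8}
print(p_interpolated("as", "comes", "across", p_uni, p_bi, p_tri))  # 0.492
```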
Combining Estimators[2/3]
Katz’s backing-off: used to smooth estimates and to combine information sources. If an n-gram appeared more than k times, use the (discounted) n-gram estimate; if it appeared k or fewer times, back off to the estimate from a shorter n-gram.
P_{bo}(w_i \mid w_{i-n+1} \dots w_{i-1}) =
\begin{cases}
(1 - d_{w_{i-n+1} \dots w_{i-1}}) \dfrac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})} & \text{if } C(w_{i-n+1} \dots w_i) > k \\[4pt]
\alpha_{w_{i-n+1} \dots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \dots w_{i-1}) & \text{otherwise}
\end{cases}
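A minimal sketch of the back-off recursion for a trigram model, with a flat discount d and a single hypothetical back-off weight α (a real Katz implementation computes d and α so that each conditional distribution normalizes):

```python
def p_backoff(w, history, counts, k=0, d=0.1, alpha=0.4):
    """Recursive Katz-style back-off; counts maps word tuples to frequencies."""
    if not history:                           # base case: unigram MLE
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get((w,), 0) / total
    hist_count = counts.get(history, 0)
    ngram_count = counts.get(history + (w,), 0)
    if ngram_count > k and hist_count > 0:    # seen often enough: discounted MLE
        return (1 - d) * ngram_count / hist_count
    return alpha * p_backoff(w, history[1:], counts, k, d, alpha)  # back off

# Hypothetical counts
counts = {("comes",): 10, ("across",): 12, ("as",): 30,
          ("comes", "across"): 10, ("comes", "across", "as"): 8}
print(p_backoff("as", ("comes", "across"), counts))  # 0.72 from the trigram
print(p_backoff("as", ("swims", "across"), counts))  # backs off to bigram, then unigram
```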
Combining Estimators[3/3]
General linear interpolation: the weights are functions of the history. A very general way to combine models (commonly used).
P_{li}(w \mid h) = \sum_{i=1}^{k} \lambda_i(h) \, P_i(w \mid h), \quad \text{where for all } h:\ 0 \le \lambda_i(h) \le 1 \text{ and } \sum_i \lambda_i(h) = 1
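A minimal sketch in which the weights depend on how often the history was seen (a purely hypothetical bucketing scheme; real systems typically train λ_i(h) over history buckets, e.g. with EM):

```python
def lambdas_for_history(history_count):
    """Hypothetical weights: trust higher-order models more for frequent histories."""
    if history_count >= 100:
        return (0.1, 0.3, 0.6)
    if history_count >= 10:
        return (0.3, 0.4, 0.3)
    return (0.6, 0.3, 0.1)

def p_general_li(component_probs, history_count):
    """P_li(w | h) = sum_i lambda_i(h) * P_i(w | h)."""
    lams = lambdas_for_history(history_count)
    return sum(l * p for l, p in zip(lams, component_probs))

# Component estimates P_1(w), P_2(w | w_{n-1}), P_3(w | w_{n-2}, w_{n-1})
print(p_general_li((0.01, 0.3, 0.8), history_count=250))  # frequent history: 0.571
print(p_general_li((0.01, 0.3, 0.8), history_count=3))    # rare history: 0.176
```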
Conclusions
For the problems of sparse data, use Good-Turing estimation combined with linear interpolation or back-off. Good-Turing smoothing works well in practice (Church & Gale, 1991).
Combining probability models and dealing with sparse data remain active research areas.