Chapter 6. Statistical Inference: n-gram Models over Sparse Data
2005. 1. 13
이 동 훈 [email protected]
Foundations of Statistical Natural Language Processing
Table of Contents
Introduction
Bins: Forming Equivalence Classes
- Reliability vs. Discrimination
- N-gram models
Statistical Estimators
- Maximum Likelihood Estimation (MLE)
- Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
- Held out estimation
- Cross-validation (deleted estimation)
- Good-Turing estimation
Combining Estimators
- Simple linear interpolation
- Katz’s backing-off
- General linear interpolation
Conclusions
Introduction
Object of Statistical NLP: perform statistical inference for the field of natural language.
Statistical inference in general consists of taking some data generated by an unknown probability distribution and making inferences about that distribution.
The problem divides into three areas: dividing the training data into equivalence classes, finding a good statistical estimator for each equivalence class, and combining multiple estimators.
Bins: Forming Equivalence Classes[1/2]
Reliability vs. Discrimination
“large green ___________”
- tree? mountain? frog? car?
“swallowed the large green ________”
- pill? broccoli?
Larger n: more information about the context of the specific instance (greater discrimination)
Smaller n: more instances in the training data, better statistical estimates (more reliability)
Bins: Forming Equivalence Classes[2/2]
N-gram models: an “n-gram” is a sequence of n words. Predicting the next word rests on the Markov assumption:
- Only the prior local context (the last few words) affects the next word:

P(w_n \mid w_1, \dots, w_{n-1})

Selecting an n (vocabulary size = 20,000 words):
- n = 2 (bigrams): 400,000,000 bins
- n = 3 (trigrams): 8,000,000,000,000 bins
- n = 4 (4-grams): 1.6 × 10^17 bins
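As a quick check, the bin counts above are simply V^n for a vocabulary of V = 20,000 word types; a minimal sketch:

```python
# Number of parameters (bins) an n-gram model must estimate over a
# vocabulary of V word types: one bin per possible n-word sequence.
V = 20_000

for n in (2, 3, 4):
    print(f"{n}-grams: {V ** n:.1e} bins")
# 2-grams: 4.0e+08, 3-grams: 8.0e+12, 4-grams: 1.6e+17
```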
Statistical Estimators[1/3]
Given the observed training data, how do you develop a model (probability distribution) to predict future events?
- From the training data, derive probability estimates of the target feature.
Estimating the unknown probability distribution of n-grams.
P(w_n \mid w_1 \dots w_{n-1}) = \frac{P(w_1 \dots w_n)}{P(w_1 \dots w_{n-1})}
Statistical Estimators[2/3]
Notation for the statistical estimation chapter:
- N: number of training instances
- B: number of bins the training instances are divided into
- w_1n: an n-gram w_1 … w_n in the training text
- C(w_1 … w_n): frequency of the n-gram w_1 … w_n in the training text
- r: frequency of an n-gram
- f(·): frequency estimate of a model
- N_r: number of bins that have r training instances in them
- T_r: total count of n-grams of frequency r in further data
- h: ‘history’ of preceding words
Statistical Estimators[3/3]
Example - Instances in the training corpus:
“inferior to ________”
Maximum Likelihood Estimation (MLE)[1/2]
Definition: use the relative frequency as the probability estimate.
Example: the corpus contains 10 training instances of “comes across”. 8 times it was followed by “as”: P(as) = 0.8. Once each by “more” and “a”: P(more) = 0.1, P(a) = 0.1. Any word x not among the above three: P(x) = 0.0.
Formula
P_{MLE}(w_1 \dots w_n) = \frac{C(w_1 \dots w_n)}{N}

P_{MLE}(w_n \mid w_1 \dots w_{n-1}) = \frac{C(w_1 \dots w_n)}{C(w_1 \dots w_{n-1})}
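A minimal sketch of the conditional MLE estimate, using hypothetical counts matching the “comes across” example above:

```python
from collections import Counter

# Hypothetical counts: "comes across" seen 10 times, followed 8 times by "as"
# and once each by "more" and "a".
history_counts = Counter({("comes", "across"): 10})
ngram_counts = Counter({
    ("comes", "across", "as"): 8,
    ("comes", "across", "more"): 1,
    ("comes", "across", "a"): 1,
})

def p_mle(word, history):
    """P_MLE(w | history) = C(history, w) / C(history)."""
    return ngram_counts[history + (word,)] / history_counts[history]

print(p_mle("as", ("comes", "across")))   # 0.8
print(p_mle("a", ("comes", "across")))    # 0.1
print(p_mle("the", ("comes", "across")))  # 0.0 -- unseen events get zero probability
```

The zero estimate for unseen continuations is exactly the sparse-data problem that the smoothing methods below address.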
Laplace’s law, Lidstone’s law and the Jeffreys-Perks law[1/2]
Laplace’s law
Add a little bit of probability space to unseen events
P_{LAP}(w_1 \dots w_n) = \frac{C(w_1 \dots w_n) + 1}{N + B}
Laplace’s law, Lidstone’s law and the Jeffreys-Perks law[2/2]
Lidstone’s law and the Jeffreys-Perks law
Lidstone’s Law
- Add some positive value λ to each count.
Jeffreys-Perks Law
- λ = 0.5
- Also called ELE (Expected Likelihood Estimation).

P_{Lid}(w_1 \dots w_n) = \frac{C(w_1 \dots w_n) + \lambda}{N + \lambda B}
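A minimal sketch of add-λ (Lidstone) smoothing over B bins; λ = 1 gives Laplace’s law and λ = 0.5 gives the Jeffreys-Perks (ELE) estimate. The counts and bin count here are hypothetical toy values:

```python
from collections import Counter

def p_lidstone(ngram, counts, N, B, lam):
    """P_Lid(w_1..w_n) = (C(w_1..w_n) + lambda) / (N + lambda * B)."""
    return (counts[ngram] + lam) / (N + lam * B)

# Toy data: 10 observed bigram tokens over B = 20000**2 possible bigrams.
counts = Counter({("comes", "across"): 10})
N = 10
B = 20_000 ** 2

print(p_lidstone(("comes", "across"), counts, N, B, lam=1.0))  # Laplace (add-one)
print(p_lidstone(("comes", "across"), counts, N, B, lam=0.5))  # Jeffreys-Perks / ELE
print(p_lidstone(("never", "seen"), counts, N, B, lam=0.5))    # unseen bigram gets mass > 0
```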
Held out estimation
Validate by holding out part of the training data.
- C_1(w_1 … w_n) = frequency of w_1 … w_n in the training data
- C_2(w_1 … w_n) = frequency of w_1 … w_n in the held-out data
- T = number of tokens in the held-out data
T_r = \sum_{\{w_1 \dots w_n :\ C_1(w_1 \dots w_n) = r\}} C_2(w_1 \dots w_n)

P_{ho}(w_1 \dots w_n) = \frac{T_r}{N_r \, T}, \quad \text{where } C(w_1 \dots w_n) = r
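A minimal sketch of held-out estimation under these definitions; the two toy bigram streams are hypothetical:

```python
from collections import Counter

def held_out_probs(train_ngrams, heldout_ngrams):
    """Return {r: P_ho for an n-gram seen r times in the training data}."""
    c1 = Counter(train_ngrams)       # C_1: counts in training data
    c2 = Counter(heldout_ngrams)     # C_2: counts in held-out data
    T = len(heldout_ngrams)          # tokens in held-out data
    n_r = Counter(c1.values())       # N_r: bins with r training instances
    t_r = Counter()                  # T_r: held-out mass of those bins
    for ngram, r in c1.items():
        t_r[r] += c2[ngram]
    return {r: t_r[r] / (n_r[r] * T) for r in n_r}

train = [("a", "b")] * 3 + [("b", "c")]
heldout = [("a", "b")] * 2 + [("b", "c")] * 2
print(held_out_probs(train, heldout))  # {3: 0.5, 1: 0.5}
```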
Cross-validation (deleted estimation)[1/2]
Use data for both training and validation.
Divide the training data into two parts, A and B: train on A and validate on B (Model 1), then train on B and validate on A (Model 2).
Combine the two models: Model 1 + Model 2 → final model.
Cross-validation (deleted estimation)[2/2]
Cross-validation: the training data is used both as initial training data and as held-out data.
On large training corpora, deleted estimation works better than held-out estimation
Split the training data into parts 0 and 1. Let N_r^a be the number of n-grams occurring r times in part a, and T_r^{ab} the total occurrences in part b of the n-grams occurring r times in part a.

P_{ho}(w_1 \dots w_n) = \frac{T_r^{01}}{N_r^0 \, N} \ \text{ or } \ \frac{T_r^{10}}{N_r^1 \, N}, \quad \text{where } C(w_1 \dots w_n) = r

P_{del}(w_1 \dots w_n) = \frac{T_r^{01} + T_r^{10}}{N \,(N_r^0 + N_r^1)}, \quad \text{where } C(w_1 \dots w_n) = r
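A minimal sketch of deleted estimation under these definitions, using hypothetical toy halves of a training corpus:

```python
from collections import Counter

def deleted_estimates(part0, part1):
    """Return {r: P_del for an n-gram seen r times}, combining both directions."""
    c0, c1 = Counter(part0), Counter(part1)
    N = len(part0) + len(part1)          # total training tokens
    n_r0 = Counter(c0.values())          # N_r^0
    n_r1 = Counter(c1.values())          # N_r^1
    t_r01, t_r10 = Counter(), Counter()
    for g, r in c0.items():              # T_r^{01}: part-1 occurrences of n-grams
        t_r01[r] += c1[g]                #   that occurred r times in part 0
    for g, r in c1.items():              # T_r^{10}: the reverse direction
        t_r10[r] += c0[g]
    rs = set(n_r0) | set(n_r1)
    return {r: (t_r01[r] + t_r10[r]) / (N * (n_r0[r] + n_r1[r])) for r in rs}

part0 = [("a", "b")] * 3 + [("b", "c")]
part1 = [("a", "b")] * 2 + [("b", "c")] * 2
print(deleted_estimates(part0, part1))
```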
Good-Turing estimation
Suitable for a large number of observations from a large vocabulary
Works well for n-grams
P_{GT} = \frac{r^*}{N}, \quad \text{where } r^* = \frac{(r+1)\, E(N_{r+1})}{E(N_r)}

(r* is an adjusted frequency; E denotes the expectation of a random variable)
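A minimal sketch of the Good-Turing idea, using the observed N_r directly in place of the expectations E(N_r) (a simplification; practical implementations smooth the N_r counts first):

```python
from collections import Counter

def good_turing_probs(ngrams):
    """Return {r: P_GT for one n-gram seen r times}, with raw N_r as E(N_r)."""
    counts = Counter(ngrams)
    N = len(ngrams)                    # total training instances
    n_r = Counter(counts.values())     # N_r: number of n-grams with frequency r
    probs = {}
    for r in sorted(n_r):
        if n_r.get(r + 1, 0) == 0:     # no higher-frequency bin observed: fall back to MLE
            probs[r] = r / N
        else:
            r_star = (r + 1) * n_r[r + 1] / n_r[r]   # adjusted frequency r*
            probs[r] = r_star / N
    return probs

toy = [("a", "b")] * 2 + [("b", "c")] + [("c", "d")] + [("d", "e")]
print(good_turing_probs(toy))  # singletons are discounted below their MLE of 0.2
```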
Combining Estimators[1/3]
Basic idea: consider how to combine multiple probability estimates from various different models.
How can you develop a model that uses n-grams of different lengths as appropriate?
Simple linear interpolation
Combination of trigram, bigram and unigram:

P_{li}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1}),
\quad \text{where } 0 \le \lambda_i \le 1 \text{ and } \sum_i \lambda_i = 1
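A minimal sketch of simple linear interpolation with fixed, hypothetical weights (in practice the λ_i are estimated, e.g. by EM on held-out data):

```python
def p_interpolated(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | w1, w2) = l1*P1(w) + l2*P2(w | w2) + l3*P3(w | w1, w2)."""
    l1, l2, l3 = lambdas                 # must satisfy 0 <= l_i <= 1 and sum to 1
    return (l1 * p_uni.get(w, 0.0)
            + l2 * p_bi.get((w2, w), 0.0)
            + l3 * p_tri.get((w1, w2, w), 0.0))

# Hypothetical component estimates
p_uni = {"as": 0.01}
p_bi = {("across", "as"): 0.3}
p_tri = {("comes", "across", "as"): 0.8}
print(p_interpolated("as", "comes", "across", p_uni, p_bi, p_tri))  # 0.492
```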
Combining Estimators[2/3]
Katz’s backing-off: used to smooth estimates and to combine information sources. If an n-gram appeared more than k times, use the (discounted) n-gram estimate; if it appeared k or fewer times, back off to the estimate from a shorter n-gram.
P_{bo}(w_i \mid w_{i-n+1} \dots w_{i-1}) =
\begin{cases}
(1 - d_{w_{i-n+1} \dots w_{i-1}}) \dfrac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})} & \text{if } C(w_{i-n+1} \dots w_i) > k \\[4pt]
\alpha_{w_{i-n+1} \dots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \dots w_{i-1}) & \text{otherwise}
\end{cases}
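A minimal sketch of the back-off recursion for a trigram model, with a flat discount d and a single hypothetical back-off weight α (a real Katz implementation computes d and α so that each conditional distribution normalizes):

```python
def p_backoff(w, history, counts, k=0, d=0.1, alpha=0.4):
    """Recursive Katz-style back-off; counts maps word tuples to frequencies."""
    if not history:                           # base case: unigram MLE
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get((w,), 0) / total
    hist_count = counts.get(history, 0)
    ngram_count = counts.get(history + (w,), 0)
    if ngram_count > k and hist_count > 0:    # seen often enough: discounted MLE
        return (1 - d) * ngram_count / hist_count
    return alpha * p_backoff(w, history[1:], counts, k, d, alpha)  # back off

# Hypothetical counts
counts = {("comes",): 10, ("across",): 12, ("as",): 30,
          ("comes", "across"): 10, ("comes", "across", "as"): 8}
print(p_backoff("as", ("comes", "across"), counts))  # 0.72 from the trigram
print(p_backoff("as", ("swims", "across"), counts))  # backs off to bigram, then unigram
```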
Combining Estimators[3/3]
General linear interpolation: the weights are functions of the history. A very general way to combine models (commonly used).
P_{li}(w \mid h) = \sum_{i=1}^{k} \lambda_i(h) \, P_i(w \mid h), \quad \text{where for all } h:\ 0 \le \lambda_i(h) \le 1 \text{ and } \sum_i \lambda_i(h) = 1
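A minimal sketch in which the weights depend on how often the history was seen (a purely hypothetical bucketing scheme; real systems typically train λ_i(h) over history buckets, e.g. with EM):

```python
def lambdas_for_history(history_count):
    """Hypothetical weights: trust higher-order models more for frequent histories."""
    if history_count >= 100:
        return (0.1, 0.3, 0.6)
    if history_count >= 10:
        return (0.3, 0.4, 0.3)
    return (0.6, 0.3, 0.1)

def p_general_li(component_probs, history_count):
    """P_li(w | h) = sum_i lambda_i(h) * P_i(w | h)."""
    lams = lambdas_for_history(history_count)
    return sum(l * p for l, p in zip(lams, component_probs))

# Component estimates P_1(w), P_2(w | w_{n-1}), P_3(w | w_{n-2}, w_{n-1})
print(p_general_li((0.01, 0.3, 0.8), history_count=250))  # frequent history: 0.571
print(p_general_li((0.01, 0.3, 0.8), history_count=3))    # rare history: 0.176
```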
Conclusions
For the problems of sparse data, use Good-Turing estimation combined with linear interpolation or back-off. Good-Turing smoothing works well in practice (Church & Gale, 1991).
Combining probability models and dealing with sparse data remain active research areas.