Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Statistical Methods for NLP
Statistical Inference
Joakim Nivre
Uppsala UniversityDepartment of Linguistics and Philology
Statistical Methods for NLP 1(27)
Stochastic Variables
Stochastic Variables
I A stochastic variable (or random variable) X is a functionfrom a sample space Ω (the domain of X ) to a value spaceΩX (the range of X ).
I Examples:1. The part-of-speech of an arbitrary word from a corpus is a
stochastic variable X with Ω = w |w is a corpus token andΩX = noun, verb,adjective, . . ..
2. The sum of two dice is a stochastic variable Y withΩ = (x , y)|1 ≤ x , y ≤ 6 and ΩY = z|2 ≤ z ≤ 12.
Statistical Methods for NLP 2(27)
Stochastic Variables
Types of Variables
I If ΩX is a subset of the set of real numbers, then X is saidto to be numerical; otherwise it is categorical.
I If ΩX is finite or countably infinite, then X is said to bediscrete.
I Examples:1. The part-of-speech X of an arbitrary word from a corpus is
a discrete, categorical variable, since ΩX is finite and notnumerical.
2. The sum Y of two dice is a discrete numerical variable,since ΩY is finite and numerical.
Statistical Methods for NLP 3(27)
Stochastic Variables
Frequency Functions
I The probability P(X = x) of variable X assuming value x isgiven by the frequency function fX :
fX (x) = P(X = x)
I For discrete variables, this is equivalent to summing theprobability of all outcomes in Ω that are mapped to x by X :
fX (x) = P(u ∈ Ω|X (u) = x) =∑
u:X(u)=x
P(u)
I Example:I The probability of sampling a noun from a corpus is the
sum of the probabilities of sampling each noun.
Statistical Methods for NLP 4(27)
Stochastic Variables
Expectation
I Let X be a discrete numerical variable with value spaceΩX . The expectation of X , E [X ], is defined as follows:
E [X ] =∑
x∈ΩX
x · fX (x)
I Example: The expectation of the sum Y of two dice:
E [Y ] =12∑
y=2
y · fY (y) =25236
= 7
Statistical Methods for NLP 5(27)
Stochastic Variables
Variance
I Let X be a discrete stochastic variable with expectation µ.The variance of X , Var [X ], is defined as follows:
Var [X ] = E [(X − µ)2] =∑
x∈ΩX
(x − µ)2 · fX (x)
I Example: The variance of the sum Y of two dice:
Var [Y ] =12∑
y=2
(y − 7)2 · fY (y) =21036≈ 5.83
I If X is a variable with variance σ2, then σ =√σ2 is the
standard deviation of X .
Statistical Methods for NLP 6(27)
Stochastic Variables
Entropy
I Let X be a discrete stochastic variable. The entropy of X ,H[X ], is defined as follows:
H[X ] = E [− log2 fX ] = −∑
x∈ΩX
fX (x) log2 fX (x)
I Example: The entropy of the sum Y of two dice:
H[Y ] = −12∑
y=2
fY (y) log2 fY (y) ≈ 3.27
Statistical Methods for NLP 7(27)
Stochastic Variables
More on Entropy
I The entropy of a variable X can be interpreted as theexpected amount of information (measured in bits) whenlearning the value of X :
IX (x) = − log2 fX (x)
I Given a finite value space ΩX of size n, entropy ismaximized if fX (x) = 1
n for all x ∈ ΩX .I Example: Entropy of the outcome Z of an 11-sided die
(2–12):
H[Z ] = −12∑
z=2
111
log21
11≈ 3.46
Statistical Methods for NLP 8(27)
Stochastic Variables
Joint and Conditional Probability
I Let X and Y be stochastic variables with sample spacesΩ1 and Ω2 and value spaces ΩX and ΩY , respectively.
1. The joint probability of X and Y is given by their jointprobability function f(X ,Y ):
f(X ,Y )(x , y) = P(X = x ,Y = y)= P((u, v) ∈ Ω1 × Ω2|X (u) = x ,Y (v) = y)
2. The conditional probability of X given Y is given by theconditional probability function fX |Y :
fX |Y (x |y) = P(X = x |Y = y) =P(X = x ,Y = y)
P(Y = y)
Statistical Methods for NLP 9(27)
Stochastic Variables
Independence
I Stochastic variables X and Y (defined on the sameunderlying sample space) are independent if and only ifthe following holds for all x ∈ ΩX and y ∈ ΩY :
P(X = x ,Y = y) = P(X = x)P(Y = y)
I Corollary: If X and Y are independent variables then thefollowing conditions hold (for all x ∈ ΩX and y ∈ ΩY ):
1. P(X = x |Y = y) = P(X = x)2. P(Y = y |X = x) = P(Y = y)
Statistical Methods for NLP 10(27)
Stochastic Variables
Part-of-Speech Bigrams 1
I Let (X1,X2) be the parts-of-speech of an arbitrary bigramand let the following probabilities be given:
1. P(X1 = noun) = P(X2 = noun) = 0.22. P(X1 = adj) = P(X2 = adj) = 0.053. P(X1 = det|X2 = noun) = 0.34. P(X1 = det|X2 = adj) = 0.65. P(X1 = det|X2 6∈ noun,adj) = 0
I Question: What is P(X2 = noun|X1 = det)?
Statistical Methods for NLP 11(27)
Stochastic Variables
Part-of-Speech Bigrams 2
I Using Bayes’ law:
P(X2 = noun) · P(X1 = det|X2 = noun)
P(X1 = det)
I Using the law of total probability:
P(X2 = noun) · P(X1 = det|X2 = noun)
P(X1 = d|X2 = n) · P(X2 = n) + P(X1 = d|X2 = a) · P(X2 = a)
I Putting in the numbers:
0.2 · 0.30.3 · 0.2 + 0.6 · 0.05
= 0.67
Statistical Methods for NLP 12(27)
Stochastic Variables
Part-of-Speech Bigrams 3
I Consider:1. P(X1 = det) = 0.092. P(X2 = noun) = 0.23. P(X1 = d,X2 = n) = P(X1 = d) · P(X2 = n|X1 = d) = 0.064. P(X1 = det) · P(X2 = noun) = 0.2 · 0.09 = 0.0185. 0.018 6= 0.06
I Conclusion: X1 and X2 are not independent variables.
Statistical Methods for NLP 13(27)
Statistical Inference
Statistical Inference
I Statistical inference is the science of making predictions orinferences from finite sets of observations (samples) to(potentially infinite) sets of new observations (populations).
I Two main kinds of statistical inference:1. Estimation: Use samples and sample variables to predict
population variables.2. Hypothesis testing: Use samples and sample variables to
test hypotheses about population variables.I Note: In statistical modeling, we often talk about models
instead of populations.
Statistical Methods for NLP 14(27)
Statistical Inference
Sampling
I Let X be a stochastic variable.1. A vector (X1, . . . ,Xn) of independent variables Xi with the
same distribution as X is said to be a random sample of X.2. A value vector (x1, . . . , xn) such that X1 = x1, . . . ,Xn = xn in
a particular experiment is called a statistical material.I Example:
I Consider a corpus C consisting of words (w1, . . . ,wn).I Can we regard C as a statistical material resulting from a
sample (W1, . . . ,Wn) of the word variable W?I Why (not)?
Statistical Methods for NLP 15(27)
Statistical Inference
Sample Variables
I Given a random sample of a variable X , we can define newstochastic variables that are functions of the sample, calledsample variables:
1. The sample mean: X n =1n
n∑i=1
Xi
2. The sample variance: sn2 =
1(n − 1)
n∑i=1
(Xi − X n)2
I These variables are called sample variables to distinguishthem from the expectation µ and (true) variance σ2 of X ,which are called population variables or model parameters.
Statistical Methods for NLP 16(27)
Statistical Inference
Estimation
I Two kinds of estimation:1. Point estimation: Use sample variable f (X1, . . . ,Xn) to
estimate parameter φ.2. Interval estimation: Use sample variables f1(X1, . . . ,Xn) and
f2(X1, . . . ,Xn) to construct an interval such thatP(f1(X1, . . . ,Xn) < φ < f2(X1, . . . ,Xn)) = p, where p is theconfidence level adopted.
Statistical Methods for NLP 17(27)
Statistical Inference
Maximum Likelihood Estimation (MLE)
I Given a statistical material x1, . . . , xn and a set ofparameters θ, the likelihood function L is:
L(x1, . . . , xn, θ) =n∏
i=1
Pθ(xi)
where Pθ(xi) is the probability that the variable Xi assumesthe value xi given a set of values for the parameters in θ.
I Maximum likelihood estimation means choosing θ so thatthe likelihood function is maximized:
maxθ
L(x1, . . . , xn, θ)
Statistical Methods for NLP 18(27)
Statistical Inference
MLE: Example 1
I Given a random sample (X1, . . . ,Xn) of a numericalvariable X , the sample mean X n is a maximum likelihoodestimate of the expectation E [X ].
I The average sentence length X in a certain type of textcan be estimated with the mean sentence length in arepresentative sample:
E [X ] = X n
Statistical Methods for NLP 19(27)
Statistical Inference
MLE: Example 2
I Given a random sample (X1, . . . ,Xn) of a categoricalvariable X , the relative frequency of the value x , fn(x), is amaximum likelihood estimate of the probability P(X = x).
I The probability of an arbitrary word from a text being anoun can be estimated with the relative frequency of nounsin a suitable corpus of texts:
P(noun) = fn(noun)
Statistical Methods for NLP 20(27)
Statistical Inference
The Rationale of MLE
I We want to choose the most probable model given thedata:
P(θ|x1, . . . , xn) =P(x1, . . . , xn|θ)P(θ)
P(x1, . . . , xn)
argmaxθ
P(θ|x1, . . . , xn) = argmaxθ
P(x1, . . . , xn|θ)P(θ)
I If we assume a uniform distribution for P(θ), then
argmaxθ
P(θ|x1, . . . , xn) = argmaxθ
P(x1, . . . , xn|θ)
I The status of P(θ) is controversial in statistical theory(Bayesians vs. Frequentists)
Statistical Methods for NLP 21(27)
Statistical Inference
MLE and Smoothing
I MLE is usually a good solution to the estimation problem ifthe statistical material is large enough.
I For language data, MLE is often suboptimal because ofsparse data and requires smoothing (or regularization).
I Example:I Additive smoothing:
Padd(X = x) =fn(x) + m
n + m · |ΩX |
where m is a constant (usually m ≤ 1).
I Note: Padd(X = x) 6= 0.
Statistical Methods for NLP 22(27)
Statistical Inference
Interval Estimation
I In general, we can derive a 95% confidence interval for ourmaximum likelihood estimate µ of a mean as follows:
µ± 1.96
√σ2
n
I Examples:
1. Sentence length: E [X ] = X n ± 1.96√
σ2
n
2. Noun probability: P(noun) = fn(noun)± 1.96√
σ2
n
I Note:I The interval grows with increasing variance σ2.I The interval shrinks with the sample size n.
Statistical Methods for NLP 23(27)
Statistical Inference
More on Interval Estimation
I Where does the number 1.96 come from?
I Assumptions:1. Parameter has a normal distribution – okay for large n.2. True variance σ2 is known – usually not the case.
Statistical Methods for NLP 24(27)
Statistical Inference
Hypothesis Testing
I When are differences in sample variables f (x1, . . . , xm) andf (y1, . . . , yn) significant?
I Do they reflect differences in population variables ξ and υ?I Null hypothesis H0: ξ = υ
I Test procedure:I Choose a test statistic f (X ,Y ) whose distribution is known
when H0 is true.I Calculate the probability p of f (x , y) given H0.I If p < α, reject H0 at significance level α.
Statistical Methods for NLP 25(27)
Statistical Inference
Example: Z-test
I Data:I Mean sentence length in 50 novels from 1950s: X 1 = 19.3.I Mean sentence length in 50 novels from 2000s: X 2 = 16.4.I X is normally distributed with variance σ2 = 134.2.
I Test statistic:
Z =X 1 − X 2√
σ2
n1+n2
=19.3− 16.9√
134.250+50
= 2.28
I Probability calculation:I Z = 2.28 corresponds to p = P(Z = 2.28|H0) = 0.0226.I Reject H0 at α = 0.05 (but not α = 0.01).I Note: This does not mean that P(H1) > 0.95.
Statistical Methods for NLP 26(27)
Statistical Inference
Tips and Tricks
I What to do in real life?I When variance is not known – use sample variance and a
t-test (instead of Z -test):
t =X 1 − X 2√
s21
n1+
s22
n2
I When distribution is not normal – use a non-parametric test.I The special case of proportions:
I If X is binary with P(X = 1) = p, then Var [X ] = p(1− p):
p = fn(1)±√
p(1−p)n Z = p1−p2r
p1(1−p1)
n1+
p2(1−p2)
n2
Statistical Methods for NLP 27(27)