
Basic Statistics and Probability Theory

Based on
"Foundations of Statistical NLP",
C. Manning & H. Schütze, ch. 2, MIT Press, 2002

"Probability theory is nothing but common sense reduced to calculation."
Pierre Simon, Marquis de Laplace (1749-1827)


PLAN

1. Elementary Probability Notions:
• Event Space and Probability Function
• Conditional Probability
• Bayes' Theorem
• Independence of Probabilistic Events

2. Random Variables:
• Discrete Variables and Continuous Variables
• Mean, Variance and Standard Deviation
• Standard Distributions
• Joint, Marginal and Conditional Distributions
• Independence of Random Variables

3. Limit Theorems

4. Estimating the parameters of probabilistic models from data

5. Elementary Information Theory


1. Elementary Probability Notions

• sample space: Ω (either discrete or continuous)

• event: A ⊆ Ω
  – the certain event: Ω
  – the impossible event: ∅

• event space: F = 2^Ω (or a subspace of 2^Ω that contains ∅ and is closed under complement and countable union)

• probability function/distribution: P : F → [0, 1] such that:
  – P(Ω) = 1
  – the "countable additivity" property: for any pairwise disjoint events A1, A2, ..., P(∪i Ai) = Σi P(Ai)

Consequence: for a uniform distribution over a finite sample space,

  P(A) = #favorable outcomes / #all outcomes


Conditional Probability

• P(A | B) = P(A ∩ B) / P(B)

  Note: P(A | B) is called the a posteriori probability of A, given B.

• The "multiplication" rule:

  P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)

• The "chain" rule:

  P(A1 ∩ A2 ∩ ... ∩ An) = P(A1) P(A2 | A1) P(A3 | A1, A2) ... P(An | A1, A2, ..., An−1)


• The "total probability" formula:

  P(A) = P(A | B) P(B) + P(A | ¬B) P(¬B)

  More generally: if A ⊆ ∪i Bi and Bi ∩ Bj = ∅ for all i ≠ j, then

  P(A) = Σi P(A | Bi) P(Bi)

• Bayes' Theorem:

  P(B | A) = P(A | B) P(B) / P(A)

  or P(B | A) = P(A | B) P(B) / [ P(A | B) P(B) + P(A | ¬B) P(¬B) ]

  or ...
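A quick numerical illustration (my addition, not part of the original slides): the sketch below applies the total probability formula and then Bayes' theorem. The prior and the two conditional probabilities are assumed values chosen only for the example.

```python
# Minimal sketch: Bayes' theorem via the total probability formula.
# The numbers below are illustrative assumptions, not from the slides.
p_B = 0.3             # prior P(B)
p_A_given_B = 0.8     # P(A | B)
p_A_given_notB = 0.2  # P(A | ¬B)

# total probability formula: P(A) = P(A|B)P(B) + P(A|¬B)P(¬B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_A, 4), round(p_B_given_A, 4))  # 0.38, ~0.6316
```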


Independence of Probabilistic Events

• Independent events: P(A ∩ B) = P(A) P(B)

  Note: When P(B) ≠ 0, the above definition is equivalent to P(A | B) = P(A).

• Conditionally independent events: P(A ∩ B | C) = P(A | C) P(B | C), assuming, of course, that P(C) ≠ 0.

  Note: When P(B ∩ C) ≠ 0, the above definition is equivalent to P(A | B, C) = P(A | C).


2. Random Variables

2.1 Basic Definitions

Let Ω be a sample space, and P : 2^Ω → [0, 1] a probability function.

• A random variable of distribution P is a function X : Ω → R^n.

  For now, let us consider n = 1.

  The cumulative distribution function of X is F : R → [0, 1], defined by

  F(x) = P(X ≤ x) = P({ω ∈ Ω | X(ω) ≤ x})


2.2 Discrete Random Variables

Definition: Let P : 2^Ω → [0, 1] be a probability function, and X be a random variable of distribution P.

• If Image(X) is either finite or countably infinite, then X is called a discrete random variable.

  For such a variable we define the probability mass function (pmf) p : R → [0, 1] as
  p(x) := p(X = x) = P({ω ∈ Ω | X(ω) = x}).

  (Obviously, it follows that Σ_{xi ∈ Image(X)} p(xi) = 1.)

Mean, Variance, and Standard Deviation:

• Expectation / mean of X: E(X), also written E[X], equals Σx x p(x) if X is a discrete random variable.

• Variance of X: Var(X) = E((X − E(X))²).

• Standard deviation: σ = √Var(X).

Covariance of X and Y, two random variables of distribution P:

• Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
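A small sketch (my addition) that computes the mean, variance and standard deviation of a discrete random variable directly from its pmf; the fair six-sided die used here is just an assumed example.

```python
import math

# pmf of a fair die: values 1..6, each with probability 1/6
pmf = {x: 1 / 6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                # E(X)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())   # E((X - E(X))^2)
std = math.sqrt(var)

print(round(mean, 4), round(var, 4), round(std, 4))      # 3.5, 2.9167, 1.7078
```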


Exemplification:

• the Binomial distribution: b(r; n, p) = C^r_n p^r (1 − p)^{n−r}  (0 ≤ r ≤ n)

  mean: np, variance: np(1 − p)

  the Bernoulli distribution: b(r; 1, p)

The probability mass function and the cumulative distribution function of the Binomial distribution:
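The following sketch (an addition, standard library only) evaluates the binomial pmf and checks the mean np and variance np(1 − p) stated above; the choice n = 10, p = 0.3 is an assumption made only for the example.

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) p^r (1 - p)^(n - r)"""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.3  # assumed example values
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(pmf))
var = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))

print(abs(sum(pmf) - 1) < 1e-12)       # the pmf sums to 1
print(round(mean, 6), round(var, 6))   # 3.0 and 2.1, i.e. np and np(1 - p)
```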


2.3 Continuous Random Variables

Definitions: Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R be a random variable of distribution P.

• If Image(X) is an uncountably infinite set, and F, the cumulative distribution function of X, is continuous, then X is called a continuous random variable.

  (It follows, naturally, that P(X = x) = 0 for all x ∈ R.)

  If there exists p : R → [0, ∞) such that F(x) = ∫_{−∞}^{x} p(t) dt, then X is called absolutely continuous. In such a case, p is called the probability density function (pdf) of X.

  For B ⊆ R for which ∫_B p(x) dx exists,

  Pr(B) := P({ω ∈ Ω | X(ω) ∈ B}) = ∫_B p(x) dx.

• In particular, ∫_{−∞}^{+∞} p(x) dx = 1.

• Expectation / mean of X: E(X), also written E[X], equals ∫ x p(x) dx.


Exemplification:

• Normal (Gaussian) distribution: N(x; µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}

  mean: µ, variance: σ²

  Standard Normal distribution: N(x; 0, 1)

• Remark: For n, p such that np(1 − p) > 5, the Binomial distribution can be approximated by a Normal distribution.


The Normal distribution: the probability density function and the cumulative distribution function


2.4 Basic Properties of Random Variables

Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R^n be a discrete/continuous random variable of distribution P.

• If g : R^n → R^m is a function, then g(X) is a random variable.

  If g(X) is discrete, then E(g(X)) = Σx g(x) p(x).
  If g(X) is continuous, then E(g(X)) = ∫ g(x) p(x) dx.

• E(aX + b) = aE(X) + b.

  If g is non-linear, this does not imply E(g(X)) = g(E(X)).

• E(X + Y) = E(X) + E(Y).

• Var(X) = E(X²) − E²(X).

  Var(aX) = a² Var(X).

• Cov(X, Y) = E[XY] − E[X]E[Y].


2.5 Joint, Marginal and Conditional Distributions

Exemplification for the bi-variate case:

Let Ω be a sample space, P : 2^Ω → [0, 1] a probability function, and V : Ω → R² a random variable of distribution P.
One could naturally see V as a pair of two random variables X : Ω → R and Y : Ω → R. (More precisely, V(ω) = (x, y) = (X(ω), Y(ω)).)

• the joint pmf/pdf of X and Y is defined by

  p(x, y) := p_{X,Y}(x, y) = P(X = x, Y = y) = P({ω ∈ Ω | X(ω) = x, Y(ω) = y}).

• the marginal pmf/pdf functions of X and Y are:

  for the discrete case: p_X(x) = Σy p(x, y),  p_Y(y) = Σx p(x, y)

  for the continuous case: p_X(x) = ∫y p(x, y) dy,  p_Y(y) = ∫x p(x, y) dx

• the conditional pmf/pdf of X given Y is:

  p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)
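A minimal sketch (my addition) for the discrete bi-variate case: starting from an assumed joint pmf given as a table, it computes both marginals and one conditional pmf, following the formulas above.

```python
# Assumed joint pmf p(x, y) over X in {0, 1} and Y in {0, 1}; entries sum to 1.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

p_X = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # p_X(x) = sum_y p(x, y)
p_Y = {y: sum(joint[(x, y)] for x in xs) for y in ys}  # p_Y(y) = sum_x p(x, y)

# conditional pmf of X given Y = 1: p(x | 1) = p(x, 1) / p_Y(1)
p_X_given_Y1 = {x: joint[(x, 1)] / p_Y[1] for x in xs}

print(p_X)            # {0: 0.5, 1: 0.5}
print(p_Y)            # {0: 0.4, 1: ~0.6}
print(p_X_given_Y1)   # {0: ~1/3, 1: ~2/3}
```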


2.6 Independence of Random Variables

Definitions:

• Let X, Y be random variables of the same type (i.e. either discrete or continuous), and p_{X,Y} their joint pmf/pdf.

  X and Y are said to be independent if

  p_{X,Y}(x, y) = p_X(x) · p_Y(y)

  for all possible values x and y of X and Y respectively.

• Similarly, let X, Y and Z be random variables of the same type, and p their joint pmf/pdf.

  X and Y are conditionally independent given Z if

  p_{X,Y|Z}(x, y | z) = p_{X|Z}(x | z) · p_{Y|Z}(y | z)

  for all possible values x, y and z of X, Y and Z respectively.


Properties of random variables pertaining to independence

• If X, Y are independent, then Var(X + Y) = Var(X) + Var(Y).

• If X, Y are independent, then E(XY) = E(X)E(Y), i.e. Cov(X, Y) = 0.

  However, Cov(X, Y) = 0 does not imply that X, Y are independent.

  The covariance matrix corresponding to a vector of random variables is symmetric and positive semi-definite.

• If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then the marginal distributions are independent.
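The sketch below (my addition) illustrates the one-way implication above with a standard counterexample: X uniform on {−1, 0, 1} and Y = X² have zero covariance, yet Y is a deterministic function of X, so they are not independent.

```python
# X uniform on {-1, 0, 1}, Y = X^2: Cov(X, Y) = 0, but X and Y are not independent.
support = [-1, 0, 1]
p = 1 / 3

E_X = sum(x * p for x in support)          # 0
E_Y = sum(x**2 * p for x in support)       # 2/3
E_XY = sum(x * x**2 * p for x in support)  # 0

cov = E_XY - E_X * E_Y
print(cov)                                 # 0.0

# not independent: P(X = 0, Y = 0) = 1/3, while P(X = 0) * P(Y = 0) = 1/9
print(1 / 3, (1 / 3) * (1 / 3))
```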


3. Limit Theorems

[ Sheldon Ross, A First Course in Probability, 5th ed., 1998 ]

"The most important results in probability theory are limit theorems. Of these, the most important are...

laws of large numbers, concerned with stating conditions under which the average of a sequence of random variables converges (in some sense) to the expected average;

central limit theorems, concerned with determining the conditions under which the sum of a large number of random variables has a probability distribution that is approximately normal."


Two basic inequalities and the weak law of large numbers

Markov's inequality: If X is a random variable that takes only non-negative values, then for any value a > 0,

  P(X ≥ a) ≤ E[X] / a

Chebyshev's inequality: If X is a random variable with finite mean µ and variance σ², then for any value k > 0,

  P(|X − µ| ≥ k) ≤ σ² / k²

The weak law of large numbers (Bernoulli; Khintchine): Let X1, X2, ..., Xn be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi] = µ. Then, for any value ε > 0,

  P( |(X1 + ... + Xn)/n − µ| ≥ ε ) → 0 as n → ∞
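A simulation sketch (my addition, using Python's random module) illustrating the weak law of large numbers: the sample average of i.i.d. Bernoulli(0.5) variables concentrates around µ = 0.5 as n grows. The sample sizes and the seed are arbitrary choices.

```python
import random

random.seed(0)
mu = 0.5  # mean of a fair-coin Bernoulli variable

for n in (10, 100, 10_000, 1_000_000):
    # average of n simulated Bernoulli(mu) draws
    avg = sum(random.random() < mu for _ in range(n)) / n
    print(n, avg, abs(avg - mu))  # the deviation |avg - mu| tends to shrink with n
```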


The central limit theorem for i.i.d. random variables

[ Pierre Simon, Marquis de Laplace; Liapunoff in 1901-1902 ]

Let X1, X2, ..., Xn be a sequence of independent and identically distributed random variables, each having mean µ and variance σ². Then the distribution of

  (X1 + ... + Xn − nµ) / (σ√n)

tends to the standard normal (Gaussian) as n → ∞.

That is, for −∞ < a < ∞,

  P( (X1 + ... + Xn − nµ) / (σ√n) ≤ a ) → (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx  as n → ∞
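A simulation sketch (my addition) of the i.i.d. central limit theorem: sums of Uniform(0, 1) variables, standardized by nµ and σ√n, fall inside [−1.96, 1.96] roughly 95% of the time, as the standard normal predicts. The sample size n, the number of trials and the seed are arbitrary assumptions.

```python
import math
import random

random.seed(0)
n, trials = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and standard deviation of Uniform(0, 1)

inside = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (sigma * math.sqrt(n))  # standardized sum
    inside += (-1.96 <= z <= 1.96)

print(inside / trials)  # close to 0.95
```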


The central limit theorem for independent random variables

Let X1, X2, ..., Xn be a sequence of independent random variables having respective means µi and variances σi².

If

(a) the variables Xi are uniformly bounded, i.e. for some M ∈ R+, P(|Xi| < M) = 1 for all i,

and

(b) Σ_{i=1}^{∞} σi² = ∞,

then

  P( Σ_{i=1}^{n} (Xi − µi) / √(Σ_{i=1}^{n} σi²) ≤ a ) → Φ(a) as n → ∞,

where Φ is the cumulative distribution function of the standard normal (Gaussian) distribution.


The strong law of large numbers

Let X1, X2, ..., Xn be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi] = µ. Then, with probability 1,

  (X1 + ... + Xn)/n → µ as n → ∞

That is,

  P( lim_{n→∞} (X1 + ... + Xn)/n = µ ) = 1


Other inequalities

One-sided Chebyshev inequality: If X is a random variable with mean 0 and finite variance σ², then for any a > 0,

  P(X ≥ a) ≤ σ² / (σ² + a²)

Corollary: If E[X] = µ and Var(X) = σ², then for a > 0,

  P(X ≥ µ + a) ≤ σ² / (σ² + a²)
  P(X ≤ µ − a) ≤ σ² / (σ² + a²)

Chernoff bounds: Let M(t) := E[e^{tX}]. Then

  P(X ≥ a) ≤ e^{−ta} M(t) for all t > 0
  P(X ≤ a) ≤ e^{−ta} M(t) for all t < 0


4. Estimation/inference of the parameters of probabilistic models from data

(based on [Durbin et al., Biological Sequence Analysis, 1998], p. 311-313, 319-321)

A probabilistic model can be anything from a simple distribution to a complex stochastic grammar with many implicit probability distributions. Once the type of the model is chosen, the parameters have to be inferred from data.

We will first consider the case of the multinomial distribution, and then we will present the different strategies that can be used in general.


A case study: Estimation of the parameters of a multinomial distribution from data

Assume that the observations (for example, when rolling a die about which we don't know whether it is fair or not, or when counting the number of times the amino acid i occurs in a column of a multiple sequence alignment) can be expressed as counts ni for each outcome i (i = 1, ..., K), and we want to estimate the probabilities θi of the underlying distribution.

Case 1:

When we have plenty of data, it is natural to use the maximum likelihood (ML) solution, i.e. the observed frequency

  θi^ML = ni / Σj nj := ni / N.

Note: it is easy to show that indeed P(n | θ^ML) > P(n | θ) for any θ ≠ θ^ML:

  log [ P(n | θ^ML) / P(n | θ) ] = log [ Πi (θi^ML)^{ni} / Πi θi^{ni} ] = Σi ni log(θi^ML / θi) = N Σi θi^ML log(θi^ML / θi) > 0

The inequality follows from the fact that the relative entropy is always positive except when the two distributions are identical.
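A sketch (my addition) of the computation above: counts from an assumed die-rolling experiment give θ^ML = ni/N, and the log-likelihood ratio against any other θ equals N times the relative entropy, which is non-negative. The counts and the alternative distribution are assumptions for illustration.

```python
import math

counts = [12, 8, 10, 9, 11, 10]     # assumed observed counts n_i (all > 0)
N = sum(counts)
theta_ml = [n / N for n in counts]  # ML estimate: the observed frequencies

theta_other = [1 / 6] * 6           # any other distribution, e.g. the fair die

# log P(n | theta_ml) - log P(n | theta_other) = N * KL(theta_ml || theta_other) >= 0
log_ratio = sum(n * math.log(t_ml / t) for n, t_ml, t in zip(counts, theta_ml, theta_other))
kl = sum(t_ml * math.log(t_ml / t) for t_ml, t in zip(theta_ml, theta_other))

print(round(log_ratio, 6), round(N * kl, 6), log_ratio >= 0)  # equal values, and >= 0
```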


Case 2:

When the data is scarce, it is not clear what the best estimate is. In general, we should use prior knowledge, via Bayesian statistics. For instance, one can use the Dirichlet distribution with parameters α:

  P(θ | n) = P(n | θ) D(θ | α) / P(n)

It can be shown (see the calculation in the BSA book of R. Durbin et al., p. 320) that the posterior mean estimate (PME) of the parameters is

  θi^PME := ∫ θi P(θ | n) dθ = (ni + αi) / (N + Σj αj)

The α's are like pseudocounts added to the real counts. (If we think of the α's as extra observations added to the real ones, this is precisely the ML estimate!) This makes the Dirichlet regulariser very intuitive.

How to use the pseudocounts: If it is fairly obvious that a certain residue, let's say i, is very common, then we should give it a very high pseudocount αi; if the residue j is generally rare, we should give it a low pseudocount.
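A minimal sketch (my addition) of the posterior mean estimate with Dirichlet pseudocounts, θi^PME = (ni + αi)/(N + Σj αj), next to the plain ML estimate; the counts and the α's are assumed values meant to mimic scarce data.

```python
counts = [3, 0, 1]         # assumed scarce data: counts n_i
alphas = [2.0, 2.0, 2.0]   # assumed Dirichlet pseudocounts alpha_i

N = sum(counts)
A = sum(alphas)

theta_ml = [n / N for n in counts]
theta_pme = [(n + a) / (N + A) for n, a in zip(counts, alphas)]

print(theta_ml)   # [0.75, 0.0, 0.25] -- the unseen outcome gets probability 0
print(theta_pme)  # [0.5, 0.2, 0.3]   -- pseudocounts keep every outcome possible
```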


Strategies to be used in the general case

A. The Maximum Likelihood (ML) Estimate

When we wish to infer the parameters θ = (θi) for a model M from a set of data D, the most obvious strategy is to maximise P(D | θ, M) over all possible values of θ. Formally:

  θ^ML = argmax_θ P(D | θ, M)

Note: Generally speaking, when we treat P(x | y) as a function of x (and y is fixed), we refer to it as a probability. When we treat P(x | y) as a function of y (and x is fixed), we call it a likelihood. Note that a likelihood is not a probability distribution or density; it is simply a function of the variable y.

A serious drawback of maximum likelihood is that it gives poor results when data is scarce. The solution then is to introduce more prior knowledge, using Bayes' theorem. (In the Bayesian framework, the parameters are themselves seen as random variables!)


B. The Maximum A Posteriori Probability (MAP) Estimate

  θ^MAP := argmax_θ P(θ | D, M) = argmax_θ [ P(D | θ, M) P(θ | M) / P(D | M) ] = argmax_θ P(D | θ, M) P(θ | M)

The prior probability P(θ | M) has to be chosen in some reasonable manner, and this is the art of Bayesian estimation (although this freedom to choose a prior has made Bayesian statistics controversial at times...).

C. The Posterior Mean Estimator (PME)

  θ^PME = ∫ θ P(θ | D, M) dθ

where the integral is over all probability vectors, i.e. all those that sum to one.

D. Yet another solution is to use the posterior probability P(θ | D, M) to sample from it (see [Durbin et al., 1998], section 11.4) and thereby locate regions of high probability for the model parameters.


5. Elementary Information Theory

Definitions: Let X and Y be discrete random variables.

• Entropy: H(X) := Σx p(x) log(1/p(x)) = −Σx p(x) log p(x) = E_p[−log p(X)].

• Specific conditional entropy: H(Y | X = x) := −Σ_{y∈Y} p(y | x) log p(y | x).

• Average conditional entropy:

  H(Y | X) := Σ_{x∈X} p(x) H(Y | X = x) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y | x).

• Joint entropy:

  H(X, Y) := −Σ_{x,y} p(x, y) log p(x, y) = H(X) + H(Y | X) = H(Y) + H(X | Y).

• Mutual information (or: information gain):

  IG(X; Y) := H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X, Y) − H(X | Y) − H(Y | X).
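The sketch below (my addition) computes these quantities for an assumed joint pmf of two binary variables and checks two of the identities above, H(X, Y) = H(X) + H(Y | X) and H(X) − H(X | Y) = H(Y) − H(Y | X); logarithms are taken in base 2.

```python
from math import log2

# assumed joint pmf of two binary variables (all entries > 0, summing to 1)
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_X = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}
p_Y = {y: p[(0, y)] + p[(1, y)] for y in (0, 1)}

H_X = -sum(px * log2(px) for px in p_X.values())
H_Y = -sum(py * log2(py) for py in p_Y.values())
H_XY = -sum(pxy * log2(pxy) for pxy in p.values())

# average conditional entropies: H(Y|X) = -sum p(x,y) log p(y|x), and symmetrically
H_Y_given_X = -sum(pxy * log2(pxy / p_X[x]) for (x, y), pxy in p.items())
H_X_given_Y = -sum(pxy * log2(pxy / p_Y[y]) for (x, y), pxy in p.items())

IG = H_X - H_X_given_Y
print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-9)   # chain rule for entropy
print(abs(IG - (H_Y - H_Y_given_X)) < 1e-9)     # symmetry of mutual information
```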


Basic properties of Entropy, Conditional Entropy, Joint Entropy and Mutual Information / Information Gain

• 0 ≤ H(p1, ..., pn) ≤ H(1/n, ..., 1/n);

  H(X) = 0 iff X is a constant random variable.

• H(X, Y) ≤ H(X) + H(Y);
  H(X, Y) = H(X) + H(Y) iff X and Y are independent;
  H(X | Y) = H(X) iff X and Y are independent.

• IG(X; Y) ≥ 0;
  IG(X; Y) = 0 iff X and Y are independent.


The Relationship between Entropy, Conditional Entropy, Joint Entropy and Mutual Information

(Figure: Venn-style diagram. Inside H(X, Y), the circle H(X) is split into H(X|Y) and I(X, Y), and the circle H(Y) is split into I(X, Y) and H(Y|X).)


Other definitions

Let X and Y be discrete random variables, and p and q their respective pmf's.

• Relative entropy (or, Kullback-Leibler divergence):

  KL(p || q) = −Σ_{x∈X} p(x) log( q(x)/p(x) ) = E_p[ log( p(X)/q(X) ) ]

• Cross-entropy:

  CH(X, q) = −Σ_{x∈X} p(x) log q(x) = E_p[ log( 1/q(X) ) ]


Basic properties of relative entropy and cross-entropy

• KL(p || q) ≥ 0 for all p and q;
  KL(p || q) = 0 iff p and q are identical.

• KL is NOT a distance metric (because it is not symmetric)!!

  The quantity

  d(X, Y) := H(X, Y) − IG(X; Y) = H(X) + H(Y) − 2 IG(X; Y) = H(X | Y) + H(Y | X),

  known as the variation of information, is a distance metric.

• IG(X; Y) = KL(p_{XY} || p_X p_Y) = Σx Σy p(x, y) log( p(x, y) / (p(x) p(y)) ).

• If X is a discrete random variable, p its pmf and q another pmf (usually a model of p), then CH(X, q) = H(X) + KL(p || q), and therefore CH(X, q) ≥ H(X) ≥ 0.
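A sketch (my addition) that evaluates the KL divergence and the cross-entropy for two assumed pmf's over the same three outcomes, checking the identity CH(X, q) = H(X) + KL(p || q) and the asymmetry of KL; logarithms are base 2.

```python
from math import log2

p = [0.5, 0.25, 0.25]  # true pmf of X (assumed)
q = [0.4, 0.4, 0.2]    # model pmf (assumed)

H_p = -sum(pi * log2(pi) for pi in p)
KL = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
CH = -sum(pi * log2(qi) for pi, qi in zip(p, q))

print(KL >= 0)                          # True
print(abs(CH - (H_p + KL)) < 1e-9)      # CH(X, q) = H(X) + KL(p || q)

KL_rev = sum(qi * log2(qi / pi) for pi, qi in zip(p, q))
print(round(KL, 6), round(KL_rev, 6))   # different values: KL is not symmetric
```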


6. Recommended Exercises

• From [Manning & Schütze, 2002], ch. 2:

  Examples 1, 2, 4, 5, 7, 8, 9
  Exercises 2.1, 2.3, 2.4, 2.5

• From [Sheldon Ross, 1998], ch. 8:

  Examples 2a, 2b, 3a, 3b, 3c, 5a, 5b


Addenda

A. Other Examples of Probabilistic Distributions



Multinomial distribution: generalises the binomial distribution to the case where there are K independent outcomes with probabilities θi, i = 1, ..., K. The probability of getting ni occurrences of outcome i is given by

  P(n | θ) = ( n! / Πi ni! ) Π_{i=1}^{K} θi^{ni}

where n = n1 + ... + nK and θ = (θ1, ..., θK).

Example: The outcome of rolling a die N times is described by a multinomial. The probabilities of each of the 6 outcomes are θ1, ..., θ6. For a fair die, θ1 = ... = θ6 = 1/6, and the probability of rolling it 12 times and getting each outcome twice is:

  ( 12! / (2!)^6 ) · (1/6)^{12} ≈ 3.4 × 10^{−3}

Note: The particular case n = 1 represents the categorical distribution. This is a generalisation of the Bernoulli distribution.
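A quick check (my addition) of the worked example above: the multinomial probability of getting each face exactly twice in 12 rolls of a fair die.

```python
from math import factorial

counts = [2] * 6            # each of the 6 outcomes occurs twice
n = sum(counts)             # 12 rolls
theta = 1 / 6               # fair die

coeff = factorial(n)        # multinomial coefficient n! / prod(n_i!)
for ni in counts:
    coeff //= factorial(ni)

prob = coeff * theta ** n
print(prob)                 # ~0.00344, i.e. about 3.4e-3
```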


Poisson distribution (or, Poisson law of small numbers):

  p(k) = (λ^k / k!) · e^{−λ}, with k ∈ N and parameter λ > 0.

Mean = variance = λ.

The probability mass function and the cumulative distribution function:


Exponential distribution (a.k.a. the negative exponential distribution):

  p(x) = λ e^{−λx} for x ≥ 0 and parameter λ > 0.

Mean = λ^{−1}, variance = λ^{−2}.

The probability density function and the cumulative distribution function:


Gamma distribution:

  p(x) = x^{k−1} e^{−x/θ} / (Γ(k) θ^k) for x ≥ 0 and parameters k > 0 (shape) and θ > 0 (scale).

Mean = kθ, variance = kθ².

The gamma function is a generalisation of the factorial function to real values. For any positive real number x, Γ(x + 1) = xΓ(x). (Thus, for integers, Γ(n) = (n − 1)!.)

The probability density function and the cumulative distribution function:


Student's t distribution:

  p(x) = [ Γ((ν+1)/2) / ( √(νπ) Γ(ν/2) ) ] · (1 + x²/ν)^{−(ν+1)/2}

for x ∈ R and the parameter ν > 0 (the degrees of freedom).

Mean = 0 for ν > 1, otherwise undefined.
Variance = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2, otherwise undefined.

The probability density function and the cumulative distribution function:

Note [from Wikipedia]: The t-distribution is symmetric and bell-shaped, like the normal distribution, but it has heavier tails, meaning that it is more prone to producing values that fall far from its mean.


Dirichlet distribution:

  D(θ | α) = (1/Z(α)) Π_{i=1}^{K} θi^{αi−1} δ( Σ_{i=1}^{K} θi − 1 )

where

α = (α1, ..., αK) with αi > 0 are the parameters,

the θi satisfy 0 ≤ θi ≤ 1 and sum to 1, this being indicated by the delta function term δ(Σi θi − 1), and

the normalising factor can be expressed in terms of the gamma function:

  Z(α) = ∫ Π_{i=1}^{K} θi^{αi−1} δ( Σi θi − 1 ) dθ = Πi Γ(αi) / Γ( Σi αi )

Mean of θi: αi / Σj αj.

For K = 2, the Dirichlet distribution reduces to the beta distribution, and the normalising constant is the beta function.


Remark: Concerning the multinomial and Dirichlet distributions:

The algebraic expression for the parameters θi is similar in the two distributions. However, the multinomial is a distribution over its exponents ni, whereas the Dirichlet is a distribution over the numbers θi that are exponentiated.

The two distributions are said to be conjugate distributions, and their close formal relationship leads to a harmonious interplay in many estimation problems.

Similarly, the gamma distribution is the conjugate of the Poisson distribution.


Addenda

B. Some Proofs



E[X + Y] = E[X] + E[Y],
where X and Y are random variables of the same type (i.e. either discrete or continuous)

The discrete case:

  E[X + Y] = Σ_{ω∈Ω} (X(ω) + Y(ω)) · P(ω)
           = Σ_ω X(ω) · P(ω) + Σ_ω Y(ω) · P(ω) = E[X] + E[Y]

The continuous case:

  E[X + Y] = ∫x ∫y (x + y) p_{XY}(x, y) dy dx
           = ∫x ∫y x p_{XY}(x, y) dy dx + ∫x ∫y y p_{XY}(x, y) dy dx
           = ∫x x ( ∫y p_{XY}(x, y) dy ) dx + ∫y y ( ∫x p_{XY}(x, y) dx ) dy
           = ∫x x p_X(x) dx + ∫y y p_Y(y) dy = E[X] + E[Y]


X and Y are independent ⇒ E[XY] = E[X] · E[Y],
X and Y being random variables of the same type (i.e. either discrete or continuous)

The discrete case:

  E[XY] = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} x y P(X = x, Y = y) = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} x y P(X = x) · P(Y = y)
        = Σ_{x∈Val(X)} x P(X = x) Σ_{y∈Val(Y)} y P(Y = y)
        = Σ_{x∈Val(X)} x P(X = x) E[Y] = E[X] · E[Y]

The continuous case:

  E[XY] = ∫x ∫y x y p(X = x, Y = y) dy dx = ∫x ∫y x y p(X = x) · p(Y = y) dy dx
        = ∫x x p(X = x) ( ∫y y p(Y = y) dy ) dx = ∫x x p(X = x) E[Y] dx
        = E[Y] · ∫x x p(X = x) dx = E[X] · E[Y]


Binomial distribution: b(r; n, p) := C^r_n p^r (1 − p)^{n−r}

Significance: b(r; n, p) is the probability of getting r heads in n independent flips of a coin having head probability p.

b(r; n, p) indeed represents a probability distribution:

• b(r; n, p) = C^r_n p^r (1 − p)^{n−r} ≥ 0 for all p ∈ [0, 1], n ∈ N and r ∈ {0, 1, ..., n},

• Σ_{r=0}^{n} b(r; n, p) = 1:

  (1 − p)^n + C^1_n p(1 − p)^{n−1} + ... + C^{n−1}_n p^{n−1}(1 − p) + p^n = [p + (1 − p)]^n = 1


Binomial distribution: calculating the mean

  E[b(r; n, p)] := Σ_{r=0}^{n} r · b(r; n, p)
    = 1 · C^1_n p(1 − p)^{n−1} + 2 · C^2_n p²(1 − p)^{n−2} + ... + (n − 1) · C^{n−1}_n p^{n−1}(1 − p) + n · p^n
    = p [ C^1_n (1 − p)^{n−1} + 2 · C^2_n p(1 − p)^{n−2} + ... + (n − 1) · C^{n−1}_n p^{n−2}(1 − p) + n · p^{n−1} ]
    = np [ (1 − p)^{n−1} + C^1_{n−1} p(1 − p)^{n−2} + ... + C^{n−2}_{n−1} p^{n−2}(1 − p) + C^{n−1}_{n−1} p^{n−1} ]
    = np [ p + (1 − p) ]^{n−1} = np


Binomial distribution: calculating the variance

following www.proofwiki.org/wiki/Variance of Binomial Distribution, which cites
"Probability: An Introduction", by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986

We will make use of the formula Var[X] = E[X²] − E²[X]. By denoting q = 1 − p, it follows:

  E[b²(r; n, p)] := Σ_{r=0}^{n} r² C^r_n p^r q^{n−r} = Σ_{r=0}^{n} r² [ n(n−1)...(n−r+1) / r! ] p^r q^{n−r}
    = Σ_{r=1}^{n} r [ n(n−1)...(n−r+1) / (r−1)! ] p^r q^{n−r} = Σ_{r=1}^{n} r n C^{r−1}_{n−1} p^r q^{n−r}
    = np Σ_{r=1}^{n} r C^{r−1}_{n−1} p^{r−1} q^{(n−1)−(r−1)}


Binomial distribution: calculating the variance (cont'd)

By denoting j = r − 1 and m = n − 1, we get:

  E[b²(r; n, p)] = np Σ_{j=0}^{m} (j + 1) C^j_m p^j q^{m−j}
    = np [ Σ_{j=0}^{m} j C^j_m p^j q^{m−j} + Σ_{j=0}^{m} C^j_m p^j q^{m−j} ]
    = np [ Σ_{j=0}^{m} j ( m·...·(m−j+1) / j! ) p^j q^{m−j} + (p + q)^m ]      (p + q = 1)
    = np [ Σ_{j=1}^{m} m C^{j−1}_{m−1} p^j q^{m−j} + 1 ]
    = np [ mp Σ_{j=1}^{m} C^{j−1}_{m−1} p^{j−1} q^{(m−1)−(j−1)} + 1 ]
    = np [ (n − 1) p (p + q)^{m−1} + 1 ] = np [ (n − 1)p + 1 ] = n²p² − np² + np

Finally,

  Var[X] = E[b²(r; n, p)] − (E[b(r; n, p)])² = n²p² − np² + np − n²p² = np(1 − p)


Binomial distribution: calculating the variance

Another solution

• it can be shown relatively easily that any random variable following the binomial distribution b(r; n, p) can be seen as a sum of n independent variables, each following the Bernoulli distribution of parameter p;

• we know (or it can be proved immediately) that the variance of the Bernoulli distribution of parameter p is p(1 − p);

• using the additivity of variances for independent variables (Var[X1 + X2 + ... + Xn] = Var[X1] + Var[X2] + ... + Var[Xn], if X1, X2, ..., Xn are independent variables), it follows that Var[X] = np(1 − p).

(See www.proofwiki.org/wiki/Bernoulli Process as Binomial Distribution, which also cites as a source "Probability: An Introduction" by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986.)


The Gaussian distribution: p(X = x) = (1/(√(2π) σ)) e^{−(x − µ)²/(2σ²)}

Calculating the mean

  E[N_{µ,σ}(x)] := ∫_{−∞}^{∞} x p(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} x · e^{−(x − µ)²/(2σ²)} dx

Using the variable transformation v = (x − µ)/σ, which implies x = σv + µ and dx = σ dv:

  E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (σv + µ) e^{−v²/2} (σ dv)
       = (1/√(2π)) [ σ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
       = (1/√(2π)) [ −σ e^{−v²/2} |_{−∞}^{∞} + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
       = (1/√(2π)) [ 0 + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
       = (µ/√(2π)) ∫_{−∞}^{∞} e^{−v²/2} dv.

The last integral is computed as shown on the next slide; it equals √(2π), so E[X] = µ.


The Gaussian distribution: calculating the mean (cont'd)

  ( ∫_{v=−∞}^{∞} e^{−v²/2} dv )² = ( ∫_{x=−∞}^{∞} e^{−x²/2} dx ) ( ∫_{y=−∞}^{∞} e^{−y²/2} dy )
    = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{−(x² + y²)/2} dy dx = ∫∫_{R²} e^{−(x² + y²)/2} dy dx

By switching from x, y to polar coordinates r, θ, it follows:

  ( ∫_{v=−∞}^{∞} e^{−v²/2} dv )² = ∫_{r=0}^{∞} ∫_{θ=0}^{2π} e^{−r²/2} (r dr dθ) = ∫_{r=0}^{∞} r e^{−r²/2} ( ∫_{θ=0}^{2π} dθ ) dr
    = ∫_{r=0}^{∞} r e^{−r²/2} θ|_{0}^{2π} dr = 2π ∫_{r=0}^{∞} r e^{−r²/2} dr = 2π (−e^{−r²/2}) |_{0}^{∞} = 2π (0 − (−1)) = 2π

Note: x = r cos θ and y = r sin θ, with r ≥ 0 and θ ∈ [0, 2π). Therefore x² + y² = r², and the Jacobian is

  ∂(x, y)/∂(r, θ) = det [ ∂x/∂r  ∂x/∂θ ; ∂y/∂r  ∂y/∂θ ] = det [ cos θ  −r sin θ ; sin θ  r cos θ ] = r cos²θ + r sin²θ = r ≥ 0.

So dx dy = r dr dθ.


The Gaussian distribution: calculating the variance

We will make use of the formula Var[X] = E[X²] − E²[X].

  E[X²] = ∫_{−∞}^{∞} x² p(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} x² · e^{−(x − µ)²/(2σ²)} dx

Again, using v = (x − µ)/σ, which implies x = σv + µ and dx = σ dv, therefore:

  E[X²] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (σv + µ)² e^{−v²/2} (σ dv)
        = (1/√(2π)) ∫_{−∞}^{∞} (σ²v² + 2σµv + µ²) e^{−v²/2} dv
        = (1/√(2π)) [ σ² ∫_{−∞}^{∞} v² e^{−v²/2} dv + 2σµ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ² ∫_{−∞}^{∞} e^{−v²/2} dv ]

Note that we have already computed ∫_{−∞}^{∞} v e^{−v²/2} dv = 0 and ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).


The Gaussian distribution: calculating the variance (cont'd)

Therefore, we only need to compute

  ∫_{−∞}^{∞} v² e^{−v²/2} dv = ∫_{−∞}^{∞} (−v) (−v e^{−v²/2}) dv = ∫_{−∞}^{∞} (−v) (e^{−v²/2})′ dv
    = (−v) e^{−v²/2} |_{−∞}^{∞} − ∫_{−∞}^{∞} (−1) e^{−v²/2} dv = 0 + ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

So,

  E[X²] = (1/√(2π)) ( σ² √(2π) + 2σµ · 0 + µ² √(2π) ) = σ² + µ².

Finally,

  Var[X] = E[X²] − (E[X])² = (σ² + µ²) − µ² = σ².


The covariance matrix Σ corresponding to a vector X made of n random variables is symmetric and positive semi-definite

a. Cov(X)_{i,j} := Cov(Xi, Xj), for all i, j ∈ {1, ..., n}, and

   Cov(Xi, Xj) := E[(Xi − E[Xi])(Xj − E[Xj])] = Cov(Xj, Xi),

   therefore Cov(X) is a symmetric matrix.

b. We will show that z^T Σ z ≥ 0 for any z ∈ R^n:

   z^T Σ z = Σ_{i=1}^{n} Σ_{j=1}^{n} (zi Σij zj) = Σ_{i=1}^{n} Σ_{j=1}^{n} (zi Cov[Xi, Xj] zj)
           = Σ_{i=1}^{n} Σ_{j=1}^{n} (zi E[(Xi − E[Xi])(Xj − E[Xj])] zj)
           = Σ_{i=1}^{n} Σ_{j=1}^{n} (E[(Xi − E[Xi])(Xj − E[Xj])] zi zj)
           = E[ Σ_{i=1}^{n} Σ_{j=1}^{n} (Xi − E[Xi])(Xj − E[Xj]) zi zj ]
           = E[ ((X − E[X])^T · z)² ] ≥ 0


If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then the density of this distribution is equal to the product of independent univariate Gaussian densities

Let's consider X = [X1 ... Xn]^T, µ ∈ R^n and Σ ∈ S^n_+, where S^n_+ is the set of symmetric positive definite matrices (which implies |Σ| ≠ 0 and (x − µ)^T Σ^{−1} (x − µ) > 0, therefore −(1/2)(x − µ)^T Σ^{−1} (x − µ) < 0).

The probability density function of a multi-variate Gaussian distribution of parameters µ and Σ is:

  p(x; µ, Σ) = ( 1 / ( (2π)^{n/2} |Σ|^{1/2} ) ) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

Notation: X ∼ N(µ, Σ)

We will give the proof for n = 2:

  x = [x1, x2]^T,  µ = [µ1, µ2]^T,  Σ = [ σ1²  0 ; 0  σ2² ]


A property of multi-variate Gaussians whose covariance matrices are diagonal (cont'd)

  p(x; µ, Σ) = ( 1 / ( 2π | σ1²  0 ; 0  σ2² |^{1/2} ) ) exp( −(1/2) [x1 − µ1, x2 − µ2] [ σ1²  0 ; 0  σ2² ]^{−1} [x1 − µ1, x2 − µ2]^T )

    = ( 1 / (2π σ1 σ2) ) exp( −(1/2) [x1 − µ1, x2 − µ2] [ 1/σ1²  0 ; 0  1/σ2² ] [x1 − µ1, x2 − µ2]^T )

    = ( 1 / (2π σ1 σ2) ) exp( −(1/2) [x1 − µ1, x2 − µ2] [ (x1 − µ1)/σ1² , (x2 − µ2)/σ2² ]^T )

    = ( 1 / (2π σ1 σ2) ) exp( −(x1 − µ1)²/(2σ1²) − (x2 − µ2)²/(2σ2²) )

    = p(x1; µ1, σ1²) · p(x2; µ2, σ2²).


Derivation of the entropy definition, starting from a set of desirable properties

CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2.2


Remark:

The definition Hn(X) = −Σi pi log pi is not very intuitive.

Theorem:

If ψn(p1, ..., pn) satisfies the following axioms

A1. Hn should be continuous in the pi and symmetric in its arguments;

A2. if pi = 1/n, then Hn should be a monotonically increasing function of n
    (if all events are equally likely, then having more events means being more uncertain);

A3. if a choice among N events is broken down into successive choices, then the entropy should be the weighted sum of the entropies at each stage;

then ψn(p1, ..., pn) = −K Σi pi log pi, where K is a positive constant.

Note: We will restrict the proof to the case p1, ..., pn ∈ Q.


Example for the axiom A3:

Encoding 1: (a, b, c) is chosen directly, with probabilities 1/2, 1/3, 1/6.

Encoding 2: first choose between a and (b, c), with probabilities 1/2, 1/2; then, if (b, c) was chosen, choose between b and c, with probabilities 2/3, 1/3.

(Figure: the two corresponding encoding trees.)

  H(1/2, 1/3, 1/6) = (1/2) log 2 + (1/3) log 3 + (1/6) log 6
                   = (1/2 + 1/6) log 2 + (1/3 + 1/6) log 3 = 2/3 + (1/2) log 3

  H(1/2, 1/2) + (1/2) H(2/3, 1/3) = 1 + (1/2) [ (2/3) log(3/2) + (1/3) log 3 ]
                                  = 1 + (1/2) ( log 3 − 2/3 ) = 2/3 + (1/2) log 3

(The logarithms are taken in base 2, so log 2 = 1.)

The next 3 slides: Case 1: pi = 1/n for i = 1, ..., n; proof steps


a. A(n) := ψ(1/n, 1/n, ..., 1/n) implies

   A(s^m) = m A(s) for any s, m ∈ N*.   (1)

b. If s, m ∈ N* (fixed), s ≠ 1, and t, n ∈ N* such that s^m ≤ t^n ≤ s^{m+1}, then

   | m/n − log t / log s | ≤ 1/n.   (2)

c. For s^m ≤ t^n ≤ s^{m+1} as above, it follows (immediately) that

   ψ_{s^m}(1/s^m, ..., 1/s^m) ≤ ψ_{t^n}(1/t^n, ..., 1/t^n) ≤ ψ_{s^{m+1}}(1/s^{m+1}, ..., 1/s^{m+1}),

   i.e. A(s^m) ≤ A(t^n) ≤ A(s^{m+1}). Show that

   | m/n − A(t)/A(s) | ≤ 1/n for s ≠ 1.   (3)

d. Combining (2) + (3) gives immediately

   | A(t)/A(s) − log t / log s | ≤ 2/n for s ≠ 1.   (4)

e. Show that this inequality implies

   A(t) = K log t with K > 0 (due to A2).   (5)


Proof

a. (Figure: an s-ary tree of depth m; at each level every node branches into s equally likely choices of probability 1/s, so the leaves are the s^m equally likely outcomes.)

Applying the axiom A3 to this encoding gives:

  A(s^m) = A(s) + s · (1/s) A(s) + s² · (1/s²) A(s) + ... + s^{m−1} · (1/s^{m−1}) A(s)
         = A(s) + A(s) + A(s) + ... + A(s)   (m times)
         = m A(s)


Proof (cont'd)

b.

  s^m ≤ t^n ≤ s^{m+1} ⇒ m log s ≤ n log t ≤ (m + 1) log s
  ⇒ m/n ≤ log t / log s ≤ m/n + 1/n ⇒ 0 ≤ log t / log s − m/n ≤ 1/n ⇒ | log t / log s − m/n | ≤ 1/n

c.

  A(s^m) ≤ A(t^n) ≤ A(s^{m+1})  ⇒ (by (1))  m A(s) ≤ n A(t) ≤ (m + 1) A(s)
  ⇒ (s ≠ 1)  m/n ≤ A(t)/A(s) ≤ m/n + 1/n ⇒ 0 ≤ A(t)/A(s) − m/n ≤ 1/n ⇒ | A(t)/A(s) − m/n | ≤ 1/n

d. Consider again s^m ≤ t^n ≤ s^{m+1} with s, t fixed. If m → ∞ then n → ∞, and from

  | A(t)/A(s) − log t / log s | ≤ 2/n

it follows that | A(t)/A(s) − log t / log s | → 0.

Therefore | A(t)/A(s) − log t / log s | = 0, and so A(t)/A(s) = log t / log s.

Finally, A(t) = (A(s)/log s) · log t = K log t, where K = A(s)/log s > 0 (if s ≠ 1).


Case 2: pi ∈ Q for i = 1, ..., n

Let's consider a set of N equiprobable random events, and P = (S1, S2, ..., Sk) a partition of this set. Let's denote pi = |Si| / N.

A "natural" two-step encoding (first choose the group Si, with probability |Si|/N; then choose one of the |Si| equiprobable events inside it) leads, based on the axiom A3, to

  A(N) = ψk(p1, ..., pk) + Σi pi A(|Si|).

(Figure: the corresponding two-level encoding tree, with branches |S1|/N, ..., |Sk|/N at level 1 and branches 1/|Si| at level 2.)

Finally, using the result A(t) = K log t gives:

  K log N = ψk(p1, ..., pk) + K Σi pi log |Si|

  ⇒ ψk(p1, ..., pk) = K [ log N − Σi pi log |Si| ]
                    = K [ log N Σi pi − Σi pi log |Si| ] = −K Σi pi log( |Si| / N )
                    = −K Σi pi log pi


Addenda

C. Some Examples



Exemplifying the computation of expected values for random variables and the use of the sensitivity of a test in a real-world application

CMU, 2009 fall, Geoff Gordon, HW1, pr. 2


There is a disease which affects 1 in 500 people. A 100.00 dollar blood test can help reveal whether a person has the disease. A positive outcome indicates that the person may have the disease. The test has perfect sensitivity (true positive rate), i.e., a person who has the disease tests positive 100% of the time. However, the test has 99% specificity (true negative rate), i.e., a healthy person tests positive 1% of the time.

a. A randomly selected individual is tested and the result is positive. What is the probability of the individual having the disease?

b. There is a second, more expensive test which costs 10,000.00 dollars but is exact, with 100% sensitivity and specificity. If we require all people who test positive with the less expensive test to be tested with the more expensive test, what is the expected cost to check whether an individual has the disease?

c. A pharmaceutical company is attempting to decrease the cost of the second (perfect) test. How much would it have to make the second test cost, so that the first test is no longer needed? That is, at what cost is it cheaper simply to use the perfect test alone, instead of screening with the cheaper test as described in part b?


Random variables:

B: 1/true for persons having this disease, 0/false otherwise;
T1: the result of the first test: + or −;
T2: the result of the second test: again + or −.

Known facts:

  P(B) = 1/500
  P(T1 = + | B) = 1,  P(T1 = + | ¬B) = 1/100,
  P(T2 = + | B) = 1,  P(T2 = + | ¬B) = 0

a.

  P(B | T1 = +) = P(T1 = + | B) · P(B) / [ P(T1 = + | B) · P(B) + P(T1 = + | ¬B) · P(¬B) ]
                = (1 · 1/500) / (1 · 1/500 + 1/100 · 499/500) = 100/599 ≈ 0.1669


b.

  C = c1 if the person is tested only with the first test,
      c1 + c2 if the person is tested with both tests

  ⇒ P(C = c1) = P(T1 = −) and P(C = c1 + c2) = P(T1 = +)

  P(T1 = +) = P(T1 = + | B) · P(B) + P(T1 = + | ¬B) · P(¬B)
            = 1 · 1/500 + 1/100 · 499/500 = 599/50000 = 0.01198

  ⇒ E[C] = c1 · (1 − P(T1 = +)) + (c1 + c2) · P(T1 = +)
         = c1 − c1 · P(T1 = +) + c1 · P(T1 = +) + c2 · P(T1 = +)
         = c1 + c2 · P(T1 = +)
         = 100 + 10000 · 599/50000 = 219.8 ≈ 220$


c.

Let cn be the new price for the second test (T2′). The first test is no longer needed when

  cn ≤ E[C′] = c1 · P(C′ = c1) + (c1 + cn) · P(C′ = c1 + cn)
             = c1 + cn · P(T1 = +) = 100 + cn · 599/50000

At the break-even point, cn = 100 + cn · 0.01198, so cn ≈ 101.2125.
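A numerical sketch (my addition) reproducing parts a, b and c of this exercise with the numbers given in the problem statement.

```python
p_B = 1 / 500               # prevalence of the disease
sens1, spec1 = 1.0, 0.99    # first test: sensitivity and specificity
c1, c2 = 100.0, 10_000.0    # costs of the two tests

# a. posterior probability of disease given a positive first test
p_pos = sens1 * p_B + (1 - spec1) * (1 - p_B)   # P(T1 = +)
p_B_given_pos = sens1 * p_B / p_pos
print(round(p_B_given_pos, 4))                  # ~0.1669

# b. expected cost of the two-stage screening
E_cost = c1 + c2 * p_pos
print(round(E_cost, 2))                         # 219.8

# c. break-even price cn for the perfect test: cn = c1 + cn * P(T1 = +)
cn = c1 / (1 - p_pos)
print(round(cn, 4))                             # ~101.2125
```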


Using the Central Limit Theorem (the i.i.d. version) to compute the real error of a classifier

CMU, 2008 fall, Eric Xing, HW3, pr. 3.3


Chris recently adopted a new (binary) classifier to filter email spam. He wants to quantitatively evaluate how good the classifier is.

He has a small dataset of 100 emails on hand which, you can assume, are randomly drawn from all emails.

He tests the classifier on the 100 emails and gets 83 classified correctly, so the error rate on the small dataset is 17%.

However, the number measured on 100 samples could be either higher or lower than the real error rate just by chance.

With a confidence level of 95%, what is likely to be the range of the real error rate? Please write down all important steps.

(Hint: You need some approximation in this problem.)


Notations:

Let Xi, i = 1, ..., n = 100 be defined as: Xi = 1 if the email i was incorrectly classified, and 0 otherwise;

  E[Xi] := µ := e_real ;  Var(Xi) := σ²

  e_sample := (X1 + ... + Xn)/n = 0.17

  Zn = (X1 + ... + Xn − nµ) / (√n σ)   (the standardized form of X1 + ... + Xn)

Key insight:

Calculating the real error of the classifier (more exactly, a symmetric interval around the real error p := µ) with a "confidence" of 95% amounts to finding a > 0 such that P(|Zn| ≤ a) ≥ 0.95.


Calculus:

  |Zn| ≤ a ⇔ | (X1 + ... + Xn − nµ) / (√n σ) | ≤ a ⇔ | X1 + ... + Xn − nµ | ≤ a σ √n

  ⇔ | (X1 + ... + Xn − nµ)/n | ≤ a σ / √n ⇔ | (X1 + ... + Xn)/n − µ | ≤ a σ / √n

  ⇔ |e_sample − e_real| ≤ a σ / √n ⇔ |e_real − e_sample| ≤ a σ / √n

  ⇔ −a σ / √n ≤ e_real − e_sample ≤ a σ / √n

  ⇔ e_sample − a σ / √n ≤ e_real ≤ e_sample + a σ / √n

  ⇔ e_real ∈ [ e_sample − a σ / √n , e_sample + a σ / √n ]


Important facts:

The Central Limit Theorem: Zn → N(0, 1).

Therefore, P(|Zn| ≤ a) ≈ P(|X| ≤ a) = Φ(a) − Φ(−a), where X ∼ N(0, 1) and Φ is the cumulative distribution function of N(0, 1).

Calculus:

  Φ(−a) + Φ(a) = 1 ⇒ P(|Zn| ≤ a) = Φ(a) − Φ(−a) = 2Φ(a) − 1

  P(|Zn| ≤ a) = 0.95 ⇔ 2Φ(a) − 1 = 0.95 ⇔ Φ(a) = 0.975 ⇔ a ≈ 1.96 (see the Φ table)

Finally:

  σ² := Var_real ≈ Var_sample due to the above theorem, and
  Var_sample = e_sample (1 − e_sample), because the Xi are Bernoulli variables.

  ⇒ a σ / √n = 1.96 · √( 0.17 (1 − 0.17) ) / √100 ≈ 0.07

  |e_real − e_sample| ≤ 0.07 ⇔ |e_real − 0.17| ≤ 0.07 ⇔ −0.07 ≤ e_real − 0.17 ≤ 0.07
  ⇔ e_real ∈ [0.10, 0.24]
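A short sketch (my addition) that reproduces this interval; the normal quantile 1.96 is hard-coded rather than read from a Φ table.

```python
from math import sqrt

n = 100
e_sample = 0.17
z = 1.96  # quantile with Phi(z) = 0.975

# half-width of the approximate 95% confidence interval for the real error rate
half_width = z * sqrt(e_sample * (1 - e_sample)) / sqrt(n)
lo, hi = e_sample - half_width, e_sample + half_width

print(round(half_width, 3), (round(lo, 2), round(hi, 2)))  # ~0.074, (0.1, 0.24)
```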