
Basic Statistics and Probability Theory

Based on
"Foundations of Statistical NLP",
C. Manning & H. Schütze, ch. 2, MIT Press, 2002

"Probability theory is nothing but common sense reduced to calculation."
Pierre Simon, Marquis de Laplace (1749-1827)


PLAN

1. Elementary Probability Notions:
• Event Space and Probability Function
• Conditional Probability
• Bayes' Theorem
• Independence of Probabilistic Events

2. Random Variables:
• Discrete Variables and Continuous Variables
• Mean, Variance and Standard Deviation
• Standard Distributions
• Joint, Marginal and Conditional Distributions
• Independence of Random Variables

3. Limit Theorems

4. Estimating the parameters of probabilistic models from data

5. Elementary Information Theory


1. Elementary Probability Notions

• sample space: Ω (either discrete or continuous)

• event: A ⊆ Ω
  – the certain event: Ω
  – the impossible event: ∅

• event space: F = 2^Ω (or a subspace of 2^Ω that contains ∅ and is closed under complement and countable union)

• probability function/distribution: P : F → [0, 1] such that:
  – P(Ω) = 1
  – the "countable additivity" property: for any pairwise disjoint events A1, A2, ..., P(∪i Ai) = Σi P(Ai)

Consequence: for a uniform distribution over a finite sample space,

  P(A) = #favorable outcomes / #all outcomes


Conditional Probability

• P(A | B) = P(A ∩ B) / P(B)

  Note: P(A | B) is called the a posteriori probability of A, given B.

• The "multiplication" rule:

  P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)

• The "chain" rule:

  P(A1 ∩ A2 ∩ ... ∩ An) = P(A1) P(A2 | A1) P(A3 | A1, A2) ... P(An | A1, A2, ..., An−1)


• The "total probability" formula:

  P(A) = P(A | B) P(B) + P(A | ¬B) P(¬B)

  More generally: if A ⊆ ∪i Bi and Bi ∩ Bj = ∅ for all i ≠ j, then

  P(A) = Σi P(A | Bi) P(Bi)

• Bayes' Theorem:

  P(B | A) = P(A | B) P(B) / P(A)

  or P(B | A) = P(A | B) P(B) / [ P(A | B) P(B) + P(A | ¬B) P(¬B) ]

  or ...
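A quick numerical illustration (my addition, not part of the original slides): the sketch below applies the total probability formula and then Bayes' theorem. The prior and the two conditional probabilities are assumed values chosen only for the example.

```python
# Minimal sketch: Bayes' theorem via the total probability formula.
# The numbers below are illustrative assumptions, not from the slides.
p_B = 0.3             # prior P(B)
p_A_given_B = 0.8     # P(A | B)
p_A_given_notB = 0.2  # P(A | ¬B)

# total probability formula: P(A) = P(A|B)P(B) + P(A|¬B)P(¬B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_A, 4), round(p_B_given_A, 4))  # 0.38, ~0.6316
```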


Independence of Probabilistic Events

• Independent events: P(A ∩ B) = P(A) P(B)

  Note: When P(B) ≠ 0, the above definition is equivalent to P(A | B) = P(A).

• Conditionally independent events: P(A ∩ B | C) = P(A | C) P(B | C), assuming, of course, that P(C) ≠ 0.

  Note: When P(B ∩ C) ≠ 0, the above definition is equivalent to P(A | B, C) = P(A | C).


2. Random Variables

2.1 Basic Definitions

Let Ω be a sample space, and P : 2^Ω → [0, 1] a probability function.

• A random variable of distribution P is a function X : Ω → R^n.

  For now, let us consider n = 1.

  The cumulative distribution function of X is F : R → [0, 1], defined by

  F(x) = P(X ≤ x) = P({ω ∈ Ω | X(ω) ≤ x})


2.2 Discrete Random Variables

Definition: Let P : 2^Ω → [0, 1] be a probability function, and X be a random variable of distribution P.

• If Image(X) is either finite or countably infinite, then X is called a discrete random variable.

  For such a variable we define the probability mass function (pmf) p : R → [0, 1] as
  p(x) := p(X = x) = P({ω ∈ Ω | X(ω) = x}).

  (Obviously, it follows that Σ_{xi ∈ Image(X)} p(xi) = 1.)

Mean, Variance, and Standard Deviation:

• Expectation / mean of X: E(X), also written E[X], equals Σx x p(x) if X is a discrete random variable.

• Variance of X: Var(X) = E((X − E(X))²).

• Standard deviation: σ = √Var(X).

Covariance of X and Y, two random variables of distribution P:

• Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
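A small sketch (my addition) that computes the mean, variance and standard deviation of a discrete random variable directly from its pmf; the fair six-sided die used here is just an assumed example.

```python
import math

# pmf of a fair die: values 1..6, each with probability 1/6
pmf = {x: 1 / 6 for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                # E(X)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())   # E((X - E(X))^2)
std = math.sqrt(var)

print(round(mean, 4), round(var, 4), round(std, 4))      # 3.5, 2.9167, 1.7078
```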


Exemplification:

• the Binomial distribution: b(r; n, p) = C^r_n p^r (1 − p)^{n−r}  (0 ≤ r ≤ n)

  mean: np, variance: np(1 − p)

  the Bernoulli distribution: b(r; 1, p)

The probability mass function and the cumulative distribution function of the Binomial distribution:
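The following sketch (an addition, standard library only) evaluates the binomial pmf and checks the mean np and variance np(1 − p) stated above; the choice n = 10, p = 0.3 is an assumption made only for the example.

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) p^r (1 - p)^(n - r)"""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.3  # assumed example values
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(pmf))
var = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))

print(abs(sum(pmf) - 1) < 1e-12)       # the pmf sums to 1
print(round(mean, 6), round(var, 6))   # 3.0 and 2.1, i.e. np and np(1 - p)
```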


2.3 Continuous Random Variables

Definitions: Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R be a random variable of distribution P.

• If Image(X) is an uncountably infinite set, and F, the cumulative distribution function of X, is continuous, then X is called a continuous random variable.

  (It follows, naturally, that P(X = x) = 0 for all x ∈ R.)

  If there exists p : R → [0, ∞) such that F(x) = ∫_{−∞}^{x} p(t) dt, then X is called absolutely continuous. In such a case, p is called the probability density function (pdf) of X.

  For B ⊆ R for which ∫_B p(x) dx exists,

  Pr(B) := P({ω ∈ Ω | X(ω) ∈ B}) = ∫_B p(x) dx.

• In particular, ∫_{−∞}^{+∞} p(x) dx = 1.

• Expectation / mean of X: E(X), also written E[X], equals ∫ x p(x) dx.


Exemplification:

• Normal (Gaussian) distribution: N(x; µ, σ) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}

  mean: µ, variance: σ²

  Standard Normal distribution: N(x; 0, 1)

• Remark: For n, p such that np(1 − p) > 5, the Binomial distribution can be approximated by a Normal distribution.


The Normal distribution: the probability density function and the cumulative distribution function


2.4 Basic Properties of Random Variables

Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R^n be a discrete/continuous random variable of distribution P.

• If g : R^n → R^m is a function, then g(X) is a random variable.

  If g(X) is discrete, then E(g(X)) = Σx g(x) p(x).
  If g(X) is continuous, then E(g(X)) = ∫ g(x) p(x) dx.

• E(aX + b) = aE(X) + b.

  If g is non-linear, this does not imply E(g(X)) = g(E(X)).

• E(X + Y) = E(X) + E(Y).

• Var(X) = E(X²) − E²(X).

  Var(aX) = a² Var(X).

• Cov(X, Y) = E[XY] − E[X]E[Y].


2.5 Joint, Marginal and Conditional Distributions

Exemplification for the bi-variate case:

Let Ω be a sample space, P : 2^Ω → [0, 1] a probability function, and V : Ω → R² a random variable of distribution P.
One could naturally see V as a pair of two random variables X : Ω → R and Y : Ω → R. (More precisely, V(ω) = (x, y) = (X(ω), Y(ω)).)

• the joint pmf/pdf of X and Y is defined by

  p(x, y) := p_{X,Y}(x, y) = P(X = x, Y = y) = P({ω ∈ Ω | X(ω) = x, Y(ω) = y}).

• the marginal pmf/pdf functions of X and Y are:

  for the discrete case: p_X(x) = Σy p(x, y),  p_Y(y) = Σx p(x, y)

  for the continuous case: p_X(x) = ∫y p(x, y) dy,  p_Y(y) = ∫x p(x, y) dx

• the conditional pmf/pdf of X given Y is:

  p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)
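A minimal sketch (my addition) for the discrete bi-variate case: starting from an assumed joint pmf given as a table, it computes both marginals and one conditional pmf, following the formulas above.

```python
# Assumed joint pmf p(x, y) over X in {0, 1} and Y in {0, 1}; entries sum to 1.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

p_X = {x: sum(joint[(x, y)] for y in ys) for x in xs}  # p_X(x) = sum_y p(x, y)
p_Y = {y: sum(joint[(x, y)] for x in xs) for y in ys}  # p_Y(y) = sum_x p(x, y)

# conditional pmf of X given Y = 1: p(x | 1) = p(x, 1) / p_Y(1)
p_X_given_Y1 = {x: joint[(x, 1)] / p_Y[1] for x in xs}

print(p_X)            # {0: 0.5, 1: 0.5}
print(p_Y)            # {0: 0.4, 1: ~0.6}
print(p_X_given_Y1)   # {0: ~1/3, 1: ~2/3}
```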


2.6 Independence of Random Variables

Definitions:

• Let X, Y be random variables of the same type (i.e. either discrete or continuous), and p_{X,Y} their joint pmf/pdf.

  X and Y are said to be independent if

  p_{X,Y}(x, y) = p_X(x) · p_Y(y)

  for all possible values x and y of X and Y respectively.

• Similarly, let X, Y and Z be random variables of the same type, and p their joint pmf/pdf.

  X and Y are conditionally independent given Z if

  p_{X,Y|Z}(x, y | z) = p_{X|Z}(x | z) · p_{Y|Z}(y | z)

  for all possible values x, y and z of X, Y and Z respectively.


Properties of random variables pertaining to independence

• If X, Y are independent, then Var(X + Y) = Var(X) + Var(Y).

• If X, Y are independent, then E(XY) = E(X)E(Y), i.e. Cov(X, Y) = 0.

  However, Cov(X, Y) = 0 does not imply that X, Y are independent.

  The covariance matrix corresponding to a vector of random variables is symmetric and positive semi-definite.

• If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then the marginal distributions are independent.
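The sketch below (my addition) illustrates the one-way implication above with a standard counterexample: X uniform on {−1, 0, 1} and Y = X² have zero covariance, yet Y is a deterministic function of X, so they are not independent.

```python
# X uniform on {-1, 0, 1}, Y = X^2: Cov(X, Y) = 0, but X and Y are not independent.
support = [-1, 0, 1]
p = 1 / 3

E_X = sum(x * p for x in support)          # 0
E_Y = sum(x**2 * p for x in support)       # 2/3
E_XY = sum(x * x**2 * p for x in support)  # 0

cov = E_XY - E_X * E_Y
print(cov)                                 # 0.0

# not independent: P(X = 0, Y = 0) = 1/3, while P(X = 0) * P(Y = 0) = 1/9
print(1 / 3, (1 / 3) * (1 / 3))
```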


3. Limit Theorems

[ Sheldon Ross, A First Course in Probability, 5th ed., 1998 ]

"The most important results in probability theory are limit theorems. Of these, the most important are...

laws of large numbers, concerned with stating conditions under which the average of a sequence of random variables converges (in some sense) to the expected average;

central limit theorems, concerned with determining the conditions under which the sum of a large number of random variables has a probability distribution that is approximately normal."


Two basic inequalities and the weak law of large numbers

Markov's inequality: If X is a random variable that takes only non-negative values, then for any value a > 0,

  P(X ≥ a) ≤ E[X] / a

Chebyshev's inequality: If X is a random variable with finite mean µ and variance σ², then for any value k > 0,

  P(|X − µ| ≥ k) ≤ σ² / k²

The weak law of large numbers (Bernoulli; Khintchine): Let X1, X2, ..., Xn be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi] = µ. Then, for any value ε > 0,

  P( |(X1 + ... + Xn)/n − µ| ≥ ε ) → 0 as n → ∞
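A simulation sketch (my addition, using Python's random module) illustrating the weak law of large numbers: the sample average of i.i.d. Bernoulli(0.5) variables concentrates around µ = 0.5 as n grows. The sample sizes and the seed are arbitrary choices.

```python
import random

random.seed(0)
mu = 0.5  # mean of a fair-coin Bernoulli variable

for n in (10, 100, 10_000, 1_000_000):
    # average of n simulated Bernoulli(mu) draws
    avg = sum(random.random() < mu for _ in range(n)) / n
    print(n, avg, abs(avg - mu))  # the deviation |avg - mu| tends to shrink with n
```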


The central limit theorem for i.i.d. random variables

[ Pierre Simon, Marquis de Laplace; Liapunoff in 1901-1902 ]

Let X1, X2, ..., Xn be a sequence of independent and identically distributed random variables, each having mean µ and variance σ². Then the distribution of

  (X1 + ... + Xn − nµ) / (σ√n)

tends to the standard normal (Gaussian) as n → ∞.

That is, for −∞ < a < ∞,

  P( (X1 + ... + Xn − nµ) / (σ√n) ≤ a ) → (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx  as n → ∞
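A simulation sketch (my addition) of the i.i.d. central limit theorem: sums of Uniform(0, 1) variables, standardized by nµ and σ√n, fall inside [−1.96, 1.96] roughly 95% of the time, as the standard normal predicts. The sample size n, the number of trials and the seed are arbitrary assumptions.

```python
import math
import random

random.seed(0)
n, trials = 50, 20_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and standard deviation of Uniform(0, 1)

inside = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (sigma * math.sqrt(n))  # standardized sum
    inside += (-1.96 <= z <= 1.96)

print(inside / trials)  # close to 0.95
```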


The central limit theorem for independent random variables

Let X1, X2, ..., Xn be a sequence of independent random variables having respective means µi and variances σi².

If

(a) the variables Xi are uniformly bounded, i.e. for some M ∈ R+, P(|Xi| < M) = 1 for all i,

and

(b) Σ_{i=1}^{∞} σi² = ∞,

then

  P( Σ_{i=1}^{n} (Xi − µi) / √(Σ_{i=1}^{n} σi²) ≤ a ) → Φ(a) as n → ∞,

where Φ is the cumulative distribution function of the standard normal (Gaussian) distribution.


The strong law of large numbers

Let X1, X2, ..., Xn be a sequence of independent and identically distributed random variables, each having a finite mean E[Xi] = µ. Then, with probability 1,

  (X1 + ... + Xn)/n → µ as n → ∞

That is,

  P( lim_{n→∞} (X1 + ... + Xn)/n = µ ) = 1


Other inequalities

One-sided Chebyshev inequality: If X is a random variable with mean 0 and finite variance σ², then for any a > 0,

  P(X ≥ a) ≤ σ² / (σ² + a²)

Corollary: If E[X] = µ and Var(X) = σ², then for a > 0,

  P(X ≥ µ + a) ≤ σ² / (σ² + a²)
  P(X ≤ µ − a) ≤ σ² / (σ² + a²)

Chernoff bounds: Let M(t) := E[e^{tX}]. Then

  P(X ≥ a) ≤ e^{−ta} M(t) for all t > 0
  P(X ≤ a) ≤ e^{−ta} M(t) for all t < 0


4. Estimation/inference of the parameters of probabilistic models from data

(based on [Durbin et al., Biological Sequence Analysis, 1998], p. 311-313, 319-321)

A probabilistic model can be anything from a simple distribution to a complex stochastic grammar with many implicit probability distributions. Once the type of the model is chosen, the parameters have to be inferred from data.

We will first consider the case of the multinomial distribution, and then we will present the different strategies that can be used in general.


A case study: Estimation of the parameters of a multinomial distribution from data

Assume that the observations (for example, when rolling a die about which we don't know whether it is fair or not, or when counting the number of times the amino acid i occurs in a column of a multiple sequence alignment) can be expressed as counts ni for each outcome i (i = 1, ..., K), and we want to estimate the probabilities θi of the underlying distribution.

Case 1:

When we have plenty of data, it is natural to use the maximum likelihood (ML) solution, i.e. the observed frequency

  θi^ML = ni / Σj nj := ni / N.

Note: it is easy to show that indeed P(n | θ^ML) > P(n | θ) for any θ ≠ θ^ML:

  log [ P(n | θ^ML) / P(n | θ) ] = log [ Πi (θi^ML)^{ni} / Πi θi^{ni} ] = Σi ni log(θi^ML / θi) = N Σi θi^ML log(θi^ML / θi) > 0

The inequality follows from the fact that the relative entropy is always positive except when the two distributions are identical.
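A sketch (my addition) of the computation above: counts from an assumed die-rolling experiment give θ^ML = ni/N, and the log-likelihood ratio against any other θ equals N times the relative entropy, which is non-negative. The counts and the alternative distribution are assumptions for illustration.

```python
import math

counts = [12, 8, 10, 9, 11, 10]     # assumed observed counts n_i (all > 0)
N = sum(counts)
theta_ml = [n / N for n in counts]  # ML estimate: the observed frequencies

theta_other = [1 / 6] * 6           # any other distribution, e.g. the fair die

# log P(n | theta_ml) - log P(n | theta_other) = N * KL(theta_ml || theta_other) >= 0
log_ratio = sum(n * math.log(t_ml / t) for n, t_ml, t in zip(counts, theta_ml, theta_other))
kl = sum(t_ml * math.log(t_ml / t) for t_ml, t in zip(theta_ml, theta_other))

print(round(log_ratio, 6), round(N * kl, 6), log_ratio >= 0)  # equal values, and >= 0
```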


Case 2:

When the data is scarce, it is not clear what the best estimate is. In general, we should use prior knowledge, via Bayesian statistics. For instance, one can use the Dirichlet distribution with parameters α:

  P(θ | n) = P(n | θ) D(θ | α) / P(n)

It can be shown (see the calculation in the BSA book of R. Durbin et al., p. 320) that the posterior mean estimate (PME) of the parameters is

  θi^PME := ∫ θi P(θ | n) dθ = (ni + αi) / (N + Σj αj)

The α's are like pseudocounts added to the real counts. (If we think of the α's as extra observations added to the real ones, this is precisely the ML estimate!) This makes the Dirichlet regulariser very intuitive.

How to use the pseudocounts: If it is fairly obvious that a certain residue, let's say i, is very common, then we should give it a very high pseudocount αi; if the residue j is generally rare, we should give it a low pseudocount.
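A minimal sketch (my addition) of the posterior mean estimate with Dirichlet pseudocounts, θi^PME = (ni + αi)/(N + Σj αj), next to the plain ML estimate; the counts and the α's are assumed values meant to mimic scarce data.

```python
counts = [3, 0, 1]         # assumed scarce data: counts n_i
alphas = [2.0, 2.0, 2.0]   # assumed Dirichlet pseudocounts alpha_i

N = sum(counts)
A = sum(alphas)

theta_ml = [n / N for n in counts]
theta_pme = [(n + a) / (N + A) for n, a in zip(counts, alphas)]

print(theta_ml)   # [0.75, 0.0, 0.25] -- the unseen outcome gets probability 0
print(theta_pme)  # [0.5, 0.2, 0.3]   -- pseudocounts keep every outcome possible
```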


Strategies to be used in the general case

A. The Maximum Likelihood (ML) Estimate

When we wish to infer the parameters θ = (θi) for a model M from a set of data D, the most obvious strategy is to maximise P(D | θ, M) over all possible values of θ. Formally:

  θ^ML = argmax_θ P(D | θ, M)

Note: Generally speaking, when we treat P(x | y) as a function of x (and y is fixed), we refer to it as a probability. When we treat P(x | y) as a function of y (and x is fixed), we call it a likelihood. Note that a likelihood is not a probability distribution or density; it is simply a function of the variable y.

A serious drawback of maximum likelihood is that it gives poor results when data is scarce. The solution then is to introduce more prior knowledge, using Bayes' theorem. (In the Bayesian framework, the parameters are themselves seen as random variables!)


B. The Maximum A Posteriori Probability (MAP) Estimate

  θ^MAP := argmax_θ P(θ | D, M) = argmax_θ [ P(D | θ, M) P(θ | M) / P(D | M) ] = argmax_θ P(D | θ, M) P(θ | M)

The prior probability P(θ | M) has to be chosen in some reasonable manner, and this is the art of Bayesian estimation (although this freedom to choose a prior has made Bayesian statistics controversial at times...).

C. The Posterior Mean Estimator (PME)

  θ^PME = ∫ θ P(θ | D, M) dθ

where the integral is over all probability vectors, i.e. all those that sum to one.

D. Yet another solution is to use the posterior probability P(θ | D, M) to sample from it (see [Durbin et al., 1998], section 11.4) and thereby locate regions of high probability for the model parameters.


5. Elementary Information Theory

Definitions: Let X and Y be discrete random variables.

• Entropy: H(X) := Σx p(x) log(1/p(x)) = −Σx p(x) log p(x) = E_p[−log p(X)].

• Specific conditional entropy: H(Y | X = x) := −Σ_{y∈Y} p(y | x) log p(y | x).

• Average conditional entropy:

  H(Y | X) := Σ_{x∈X} p(x) H(Y | X = x) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y | x).

• Joint entropy:

  H(X, Y) := −Σ_{x,y} p(x, y) log p(x, y) = H(X) + H(Y | X) = H(Y) + H(X | Y).

• Mutual information (or: information gain):

  IG(X; Y) := H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X, Y) − H(X | Y) − H(Y | X).
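The sketch below (my addition) computes these quantities for an assumed joint pmf of two binary variables and checks two of the identities above, H(X, Y) = H(X) + H(Y | X) and H(X) − H(X | Y) = H(Y) − H(Y | X); logarithms are taken in base 2.

```python
from math import log2

# assumed joint pmf of two binary variables (all entries > 0, summing to 1)
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_X = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}
p_Y = {y: p[(0, y)] + p[(1, y)] for y in (0, 1)}

H_X = -sum(px * log2(px) for px in p_X.values())
H_Y = -sum(py * log2(py) for py in p_Y.values())
H_XY = -sum(pxy * log2(pxy) for pxy in p.values())

# average conditional entropies: H(Y|X) = -sum p(x,y) log p(y|x), and symmetrically
H_Y_given_X = -sum(pxy * log2(pxy / p_X[x]) for (x, y), pxy in p.items())
H_X_given_Y = -sum(pxy * log2(pxy / p_Y[y]) for (x, y), pxy in p.items())

IG = H_X - H_X_given_Y
print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-9)   # chain rule for entropy
print(abs(IG - (H_Y - H_Y_given_X)) < 1e-9)     # symmetry of mutual information
```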


Basic properties of Entropy, Conditional Entropy, Joint Entropy and Mutual Information / Information Gain

• 0 ≤ H(p1, ..., pn) ≤ H(1/n, ..., 1/n);

  H(X) = 0 iff X is a constant random variable.

• H(X, Y) ≤ H(X) + H(Y);
  H(X, Y) = H(X) + H(Y) iff X and Y are independent;
  H(X | Y) = H(X) iff X and Y are independent.

• IG(X; Y) ≥ 0;
  IG(X; Y) = 0 iff X and Y are independent.


The Relationship between Entropy, Conditional Entropy, Joint Entropy and Mutual Information

(Figure: Venn-style diagram. Inside H(X, Y), the circle H(X) is split into H(X|Y) and I(X, Y), and the circle H(Y) is split into I(X, Y) and H(Y|X).)


Other definitions

Let X and Y be discrete random variables, and p and q their respective pmf's.

• Relative entropy (or, Kullback-Leibler divergence):

  KL(p || q) = −Σ_{x∈X} p(x) log( q(x)/p(x) ) = E_p[ log( p(X)/q(X) ) ]

• Cross-entropy:

  CH(X, q) = −Σ_{x∈X} p(x) log q(x) = E_p[ log( 1/q(X) ) ]


Basic properties of relative entropy and cross-entropy

• KL(p || q) ≥ 0 for all p and q;
  KL(p || q) = 0 iff p and q are identical.

• KL is NOT a distance metric (because it is not symmetric)!!

  The quantity

  d(X, Y) := H(X, Y) − IG(X; Y) = H(X) + H(Y) − 2 IG(X; Y) = H(X | Y) + H(Y | X),

  known as the variation of information, is a distance metric.

• IG(X; Y) = KL(p_{XY} || p_X p_Y) = Σx Σy p(x, y) log( p(x, y) / (p(x) p(y)) ).

• If X is a discrete random variable, p its pmf and q another pmf (usually a model of p), then CH(X, q) = H(X) + KL(p || q), and therefore CH(X, q) ≥ H(X) ≥ 0.
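A sketch (my addition) that evaluates the KL divergence and the cross-entropy for two assumed pmf's over the same three outcomes, checking the identity CH(X, q) = H(X) + KL(p || q) and the asymmetry of KL; logarithms are base 2.

```python
from math import log2

p = [0.5, 0.25, 0.25]  # true pmf of X (assumed)
q = [0.4, 0.4, 0.2]    # model pmf (assumed)

H_p = -sum(pi * log2(pi) for pi in p)
KL = sum(pi * log2(pi / qi) for pi, qi in zip(p, q))
CH = -sum(pi * log2(qi) for pi, qi in zip(p, q))

print(KL >= 0)                          # True
print(abs(CH - (H_p + KL)) < 1e-9)      # CH(X, q) = H(X) + KL(p || q)

KL_rev = sum(qi * log2(qi / pi) for pi, qi in zip(p, q))
print(round(KL, 6), round(KL_rev, 6))   # different values: KL is not symmetric
```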


6. Recommended Exercises

• From [Manning & Schütze, 2002], ch. 2:

  Examples 1, 2, 4, 5, 7, 8, 9
  Exercises 2.1, 2.3, 2.4, 2.5

• From [Sheldon Ross, 1998], ch. 8:

  Examples 2a, 2b, 3a, 3b, 3c, 5a, 5b


Addenda

A. Other Examples of Probabilistic Distributions



Multinomial distribution: generalises the binomial distribution to the case where there are K independent outcomes with probabilities θi, i = 1, ..., K. The probability of getting ni occurrences of outcome i is given by

  P(n | θ) = ( n! / Πi ni! ) Π_{i=1}^{K} θi^{ni}

where n = n1 + ... + nK and θ = (θ1, ..., θK).

Example: The outcome of rolling a die N times is described by a multinomial. The probabilities of each of the 6 outcomes are θ1, ..., θ6. For a fair die, θ1 = ... = θ6 = 1/6, and the probability of rolling it 12 times and getting each outcome twice is:

  ( 12! / (2!)^6 ) · (1/6)^{12} ≈ 3.4 × 10^{−3}

Note: The particular case n = 1 represents the categorical distribution. This is a generalisation of the Bernoulli distribution.
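A quick check (my addition) of the worked example above: the multinomial probability of getting each face exactly twice in 12 rolls of a fair die.

```python
from math import factorial

counts = [2] * 6            # each of the 6 outcomes occurs twice
n = sum(counts)             # 12 rolls
theta = 1 / 6               # fair die

coeff = factorial(n)        # multinomial coefficient n! / prod(n_i!)
for ni in counts:
    coeff //= factorial(ni)

prob = coeff * theta ** n
print(prob)                 # ~0.00344, i.e. about 3.4e-3
```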


Poisson distribution (or, Poisson law of small numbers):

  p(k) = (λ^k / k!) · e^{−λ}, with k ∈ N and parameter λ > 0.

Mean = variance = λ.

The probability mass function and the cumulative distribution function:


Exponential distribution (a.k.a. the negative exponential distribution):

  p(x) = λ e^{−λx} for x ≥ 0 and parameter λ > 0.

Mean = λ^{−1}, variance = λ^{−2}.

The probability density function and the cumulative distribution function:


Gamma distribution:

  p(x) = x^{k−1} e^{−x/θ} / (Γ(k) θ^k) for x ≥ 0 and parameters k > 0 (shape) and θ > 0 (scale).

Mean = kθ, variance = kθ².

The gamma function is a generalisation of the factorial function to real values. For any positive real number x, Γ(x + 1) = xΓ(x). (Thus, for integers, Γ(n) = (n − 1)!.)

The probability density function and the cumulative distribution function:


Student's t distribution:

  p(x) = [ Γ((ν+1)/2) / ( √(νπ) Γ(ν/2) ) ] · (1 + x²/ν)^{−(ν+1)/2}

for x ∈ R and the parameter ν > 0 (the degrees of freedom).

Mean = 0 for ν > 1, otherwise undefined.
Variance = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2, otherwise undefined.

The probability density function and the cumulative distribution function:

Note [from Wikipedia]: The t-distribution is symmetric and bell-shaped, like the normal distribution, but it has heavier tails, meaning that it is more prone to producing values that fall far from its mean.


Dirichlet distribution:

  D(θ | α) = (1/Z(α)) Π_{i=1}^{K} θi^{αi−1} δ( Σ_{i=1}^{K} θi − 1 )

where

α = (α1, ..., αK) with αi > 0 are the parameters,

the θi satisfy 0 ≤ θi ≤ 1 and sum to 1, this being indicated by the delta function term δ(Σi θi − 1), and

the normalising factor can be expressed in terms of the gamma function:

  Z(α) = ∫ Π_{i=1}^{K} θi^{αi−1} δ( Σi θi − 1 ) dθ = Πi Γ(αi) / Γ( Σi αi )

Mean of θi: αi / Σj αj.

For K = 2, the Dirichlet distribution reduces to the beta distribution, and the normalising constant is the beta function.


Remark: Concerning the multinomial and Dirichlet distributions:

The algebraic expression for the parameters θi is similar in the two distributions. However, the multinomial is a distribution over its exponents ni, whereas the Dirichlet is a distribution over the numbers θi that are exponentiated.

The two distributions are said to be conjugate distributions, and their close formal relationship leads to a harmonious interplay in many estimation problems.

Similarly, the gamma distribution is the conjugate of the Poisson distribution.


Addenda

B. Some Proofs



E[X + Y] = E[X] + E[Y],
where X and Y are random variables of the same type (i.e. either discrete or continuous)

The discrete case:

  E[X + Y] = Σ_{ω∈Ω} (X(ω) + Y(ω)) · P(ω)
           = Σ_ω X(ω) · P(ω) + Σ_ω Y(ω) · P(ω) = E[X] + E[Y]

The continuous case:

  E[X + Y] = ∫x ∫y (x + y) p_{XY}(x, y) dy dx
           = ∫x ∫y x p_{XY}(x, y) dy dx + ∫x ∫y y p_{XY}(x, y) dy dx
           = ∫x x ( ∫y p_{XY}(x, y) dy ) dx + ∫y y ( ∫x p_{XY}(x, y) dx ) dy
           = ∫x x p_X(x) dx + ∫y y p_Y(y) dy = E[X] + E[Y]


X and Y are independent ⇒ E[XY] = E[X] · E[Y],
X and Y being random variables of the same type (i.e. either discrete or continuous)

The discrete case:

  E[XY] = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} x y P(X = x, Y = y) = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} x y P(X = x) · P(Y = y)
        = Σ_{x∈Val(X)} x P(X = x) Σ_{y∈Val(Y)} y P(Y = y)
        = Σ_{x∈Val(X)} x P(X = x) E[Y] = E[X] · E[Y]

The continuous case:

  E[XY] = ∫x ∫y x y p(X = x, Y = y) dy dx = ∫x ∫y x y p(X = x) · p(Y = y) dy dx
        = ∫x x p(X = x) ( ∫y y p(Y = y) dy ) dx = ∫x x p(X = x) E[Y] dx
        = E[Y] · ∫x x p(X = x) dx = E[X] · E[Y]


Binomial distribution: b(r; n, p) := C^r_n p^r (1 − p)^{n−r}

Significance: b(r; n, p) is the probability of getting r heads in n independent flips of a coin having head probability p.

b(r; n, p) indeed represents a probability distribution:

• b(r; n, p) = C^r_n p^r (1 − p)^{n−r} ≥ 0 for all p ∈ [0, 1], n ∈ N and r ∈ {0, 1, ..., n},

• Σ_{r=0}^{n} b(r; n, p) = 1:

  (1 − p)^n + C^1_n p(1 − p)^{n−1} + ... + C^{n−1}_n p^{n−1}(1 − p) + p^n = [p + (1 − p)]^n = 1


Binomial distribution: calculating the mean

  E[b(r; n, p)] := Σ_{r=0}^{n} r · b(r; n, p)
    = 1 · C^1_n p(1 − p)^{n−1} + 2 · C^2_n p²(1 − p)^{n−2} + ... + (n − 1) · C^{n−1}_n p^{n−1}(1 − p) + n · p^n
    = p [ C^1_n (1 − p)^{n−1} + 2 · C^2_n p(1 − p)^{n−2} + ... + (n − 1) · C^{n−1}_n p^{n−2}(1 − p) + n · p^{n−1} ]
    = np [ (1 − p)^{n−1} + C^1_{n−1} p(1 − p)^{n−2} + ... + C^{n−2}_{n−1} p^{n−2}(1 − p) + C^{n−1}_{n−1} p^{n−1} ]
    = np [ p + (1 − p) ]^{n−1} = np


Binomial distribution: calculating the variance

following www.proofwiki.org/wiki/Variance of Binomial Distribution, which cites
"Probability: An Introduction", by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986

We will make use of the formula Var[X] = E[X²] − E²[X]. By denoting q = 1 − p, it follows:

  E[b²(r; n, p)] := Σ_{r=0}^{n} r² C^r_n p^r q^{n−r} = Σ_{r=0}^{n} r² [ n(n−1)...(n−r+1) / r! ] p^r q^{n−r}
    = Σ_{r=1}^{n} r [ n(n−1)...(n−r+1) / (r−1)! ] p^r q^{n−r} = Σ_{r=1}^{n} r n C^{r−1}_{n−1} p^r q^{n−r}
    = np Σ_{r=1}^{n} r C^{r−1}_{n−1} p^{r−1} q^{(n−1)−(r−1)}


Binomial distribution: calculating the variance (cont'd)

By denoting j = r − 1 and m = n − 1, we get:

  E[b²(r; n, p)] = np Σ_{j=0}^{m} (j + 1) C^j_m p^j q^{m−j}
    = np [ Σ_{j=0}^{m} j C^j_m p^j q^{m−j} + Σ_{j=0}^{m} C^j_m p^j q^{m−j} ]
    = np [ Σ_{j=0}^{m} j ( m·...·(m−j+1) / j! ) p^j q^{m−j} + (p + q)^m ]      (p + q = 1)
    = np [ Σ_{j=1}^{m} m C^{j−1}_{m−1} p^j q^{m−j} + 1 ]
    = np [ mp Σ_{j=1}^{m} C^{j−1}_{m−1} p^{j−1} q^{(m−1)−(j−1)} + 1 ]
    = np [ (n − 1) p (p + q)^{m−1} + 1 ] = np [ (n − 1)p + 1 ] = n²p² − np² + np

Finally,

  Var[X] = E[b²(r; n, p)] − (E[b(r; n, p)])² = n²p² − np² + np − n²p² = np(1 − p)


Binomial distribution: calculating the variance

Another solution

• it can be shown relatively easily that any random variable following the binomial distribution b(r; n, p) can be seen as a sum of n independent variables, each following the Bernoulli distribution of parameter p;

• we know (or it can be proved immediately) that the variance of the Bernoulli distribution of parameter p is p(1 − p);

• using the additivity of variances for independent variables (Var[X1 + X2 + ... + Xn] = Var[X1] + Var[X2] + ... + Var[Xn], if X1, X2, ..., Xn are independent variables), it follows that Var[X] = np(1 − p).

(See www.proofwiki.org/wiki/Bernoulli Process as Binomial Distribution, which also cites as a source "Probability: An Introduction" by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986.)


The Gaussian distribution: p(X = x) = (1/(√(2π) σ)) e^{−(x − µ)²/(2σ²)}

Calculating the mean

  E[N_{µ,σ}(x)] := ∫_{−∞}^{∞} x p(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} x · e^{−(x − µ)²/(2σ²)} dx

Using the variable transformation v = (x − µ)/σ, which implies x = σv + µ and dx = σ dv:

  E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (σv + µ) e^{−v²/2} (σ dv)
       = (1/√(2π)) [ σ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
       = (1/√(2π)) [ −σ e^{−v²/2} |_{−∞}^{∞} + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
       = (1/√(2π)) [ 0 + µ ∫_{−∞}^{∞} e^{−v²/2} dv ]
       = (µ/√(2π)) ∫_{−∞}^{∞} e^{−v²/2} dv.

The last integral is computed as shown on the next slide; it equals √(2π), so E[X] = µ.


The Gaussian distribution: calculating the mean (cont'd)

  ( ∫_{v=−∞}^{∞} e^{−v²/2} dv )² = ( ∫_{x=−∞}^{∞} e^{−x²/2} dx ) ( ∫_{y=−∞}^{∞} e^{−y²/2} dy )
    = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{−(x² + y²)/2} dy dx = ∫∫_{R²} e^{−(x² + y²)/2} dy dx

By switching from x, y to polar coordinates r, θ, it follows:

  ( ∫_{v=−∞}^{∞} e^{−v²/2} dv )² = ∫_{r=0}^{∞} ∫_{θ=0}^{2π} e^{−r²/2} (r dr dθ) = ∫_{r=0}^{∞} r e^{−r²/2} ( ∫_{θ=0}^{2π} dθ ) dr
    = ∫_{r=0}^{∞} r e^{−r²/2} θ|_{0}^{2π} dr = 2π ∫_{r=0}^{∞} r e^{−r²/2} dr = 2π (−e^{−r²/2}) |_{0}^{∞} = 2π (0 − (−1)) = 2π

Note: x = r cos θ and y = r sin θ, with r ≥ 0 and θ ∈ [0, 2π). Therefore x² + y² = r², and the Jacobian is

  ∂(x, y)/∂(r, θ) = det [ ∂x/∂r  ∂x/∂θ ; ∂y/∂r  ∂y/∂θ ] = det [ cos θ  −r sin θ ; sin θ  r cos θ ] = r cos²θ + r sin²θ = r ≥ 0.

So dx dy = r dr dθ.


The Gaussian distribution: calculating the variance

We will make use of the formula Var[X] = E[X²] − E²[X].

  E[X²] = ∫_{−∞}^{∞} x² p(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} x² · e^{−(x − µ)²/(2σ²)} dx

Again, using v = (x − µ)/σ, which implies x = σv + µ and dx = σ dv, therefore:

  E[X²] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (σv + µ)² e^{−v²/2} (σ dv)
        = (1/√(2π)) ∫_{−∞}^{∞} (σ²v² + 2σµv + µ²) e^{−v²/2} dv
        = (1/√(2π)) [ σ² ∫_{−∞}^{∞} v² e^{−v²/2} dv + 2σµ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ² ∫_{−∞}^{∞} e^{−v²/2} dv ]

Note that we have already computed ∫_{−∞}^{∞} v e^{−v²/2} dv = 0 and ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).


The Gaussian distribution: calculating the variance (cont'd)

Therefore, we only need to compute

  ∫_{−∞}^{∞} v² e^{−v²/2} dv = ∫_{−∞}^{∞} (−v) (−v e^{−v²/2}) dv = ∫_{−∞}^{∞} (−v) (e^{−v²/2})′ dv
    = (−v) e^{−v²/2} |_{−∞}^{∞} − ∫_{−∞}^{∞} (−1) e^{−v²/2} dv = 0 + ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

So,

  E[X²] = (1/√(2π)) ( σ² √(2π) + 2σµ · 0 + µ² √(2π) ) = σ² + µ².

Finally,

  Var[X] = E[X²] − (E[X])² = (σ² + µ²) − µ² = σ².


The covariance matrix Σ corresponding to a vector X made of n random variables is symmetric and positive semi-definite

a. Cov(X)_{i,j} := Cov(Xi, Xj), for all i, j ∈ {1, ..., n}, and

   Cov(Xi, Xj) := E[(Xi − E[Xi])(Xj − E[Xj])] = Cov(Xj, Xi),

   therefore Cov(X) is a symmetric matrix.

b. We will show that z^T Σ z ≥ 0 for any z ∈ R^n:

   z^T Σ z = Σ_{i=1}^{n} Σ_{j=1}^{n} (zi Σij zj) = Σ_{i=1}^{n} Σ_{j=1}^{n} (zi Cov[Xi, Xj] zj)
           = Σ_{i=1}^{n} Σ_{j=1}^{n} (zi E[(Xi − E[Xi])(Xj − E[Xj])] zj)
           = Σ_{i=1}^{n} Σ_{j=1}^{n} (E[(Xi − E[Xi])(Xj − E[Xj])] zi zj)
           = E[ Σ_{i=1}^{n} Σ_{j=1}^{n} (Xi − E[Xi])(Xj − E[Xj]) zi zj ]
           = E[ ((X − E[X])^T · z)² ] ≥ 0


If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then the density of this distribution is equal to the product of independent univariate Gaussian densities

Let's consider X = [X1 ... Xn]^T, µ ∈ R^n and Σ ∈ S^n_+, where S^n_+ is the set of symmetric positive definite matrices (which implies |Σ| ≠ 0 and (x − µ)^T Σ^{−1} (x − µ) > 0, therefore −(1/2)(x − µ)^T Σ^{−1} (x − µ) < 0).

The probability density function of a multi-variate Gaussian distribution of parameters µ and Σ is:

  p(x; µ, Σ) = ( 1 / ( (2π)^{n/2} |Σ|^{1/2} ) ) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

Notation: X ∼ N(µ, Σ)

We will give the proof for n = 2:

  x = [x1, x2]^T,  µ = [µ1, µ2]^T,  Σ = [ σ1²  0 ; 0  σ2² ]


A property of multi-variate Gaussians whose covariance matrices are diagonal (cont'd)

  p(x; µ, Σ) = ( 1 / ( 2π | σ1²  0 ; 0  σ2² |^{1/2} ) ) exp( −(1/2) [x1 − µ1, x2 − µ2] [ σ1²  0 ; 0  σ2² ]^{−1} [x1 − µ1, x2 − µ2]^T )

    = ( 1 / (2π σ1 σ2) ) exp( −(1/2) [x1 − µ1, x2 − µ2] [ 1/σ1²  0 ; 0  1/σ2² ] [x1 − µ1, x2 − µ2]^T )

    = ( 1 / (2π σ1 σ2) ) exp( −(1/2) [x1 − µ1, x2 − µ2] [ (x1 − µ1)/σ1² , (x2 − µ2)/σ2² ]^T )

    = ( 1 / (2π σ1 σ2) ) exp( −(x1 − µ1)²/(2σ1²) − (x2 − µ2)²/(2σ2²) )

    = p(x1; µ1, σ1²) · p(x2; µ2, σ2²).


Derivation of the entropy definition, starting from a set of desirable properties

CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2.2


Remark:

The definition Hn(X) = −Σi pi log pi is not very intuitive.

Theorem:

If ψn(p1, ..., pn) satisfies the following axioms

A1. Hn should be continuous in the pi and symmetric in its arguments;

A2. if pi = 1/n, then Hn should be a monotonically increasing function of n
    (if all events are equally likely, then having more events means being more uncertain);

A3. if a choice among N events is broken down into successive choices, then the entropy should be the weighted sum of the entropies at each stage;

then ψn(p1, ..., pn) = −K Σi pi log pi, where K is a positive constant.

Note: We will restrict the proof to the case p1, ..., pn ∈ Q.


Example for the axiom A3:

Encoding 1: (a, b, c) is chosen directly, with probabilities 1/2, 1/3, 1/6.

Encoding 2: first choose between a and (b, c), with probabilities 1/2, 1/2; then, if (b, c) was chosen, choose between b and c, with probabilities 2/3, 1/3.

(Figure: the two corresponding encoding trees.)

  H(1/2, 1/3, 1/6) = (1/2) log 2 + (1/3) log 3 + (1/6) log 6
                   = (1/2 + 1/6) log 2 + (1/3 + 1/6) log 3 = 2/3 + (1/2) log 3

  H(1/2, 1/2) + (1/2) H(2/3, 1/3) = 1 + (1/2) [ (2/3) log(3/2) + (1/3) log 3 ]
                                  = 1 + (1/2) ( log 3 − 2/3 ) = 2/3 + (1/2) log 3

(The logarithms are taken in base 2, so log 2 = 1.)

The next 3 slides: Case 1: pi = 1/n for i = 1, ..., n; proof steps


a. A(n) := ψ(1/n, 1/n, ..., 1/n) implies

   A(s^m) = m A(s) for any s, m ∈ N*.   (1)

b. If s, m ∈ N* (fixed), s ≠ 1, and t, n ∈ N* such that s^m ≤ t^n ≤ s^{m+1}, then

   | m/n − log t / log s | ≤ 1/n.   (2)

c. For s^m ≤ t^n ≤ s^{m+1} as above, it follows (immediately) that

   ψ_{s^m}(1/s^m, ..., 1/s^m) ≤ ψ_{t^n}(1/t^n, ..., 1/t^n) ≤ ψ_{s^{m+1}}(1/s^{m+1}, ..., 1/s^{m+1}),

   i.e. A(s^m) ≤ A(t^n) ≤ A(s^{m+1}). Show that

   | m/n − A(t)/A(s) | ≤ 1/n for s ≠ 1.   (3)

d. Combining (2) + (3) gives immediately

   | A(t)/A(s) − log t / log s | ≤ 2/n for s ≠ 1.   (4)

e. Show that this inequality implies

   A(t) = K log t with K > 0 (due to A2).   (5)


Proof

a. (Figure: an s-ary tree of depth m; at each level every node branches into s equally likely choices of probability 1/s, so the leaves are the s^m equally likely outcomes.)

Applying the axiom A3 to this encoding gives:

  A(s^m) = A(s) + s · (1/s) A(s) + s² · (1/s²) A(s) + ... + s^{m−1} · (1/s^{m−1}) A(s)
         = A(s) + A(s) + A(s) + ... + A(s)   (m times)
         = m A(s)


Proof (cont'd)

b.

  s^m ≤ t^n ≤ s^{m+1} ⇒ m log s ≤ n log t ≤ (m + 1) log s
  ⇒ m/n ≤ log t / log s ≤ m/n + 1/n ⇒ 0 ≤ log t / log s − m/n ≤ 1/n ⇒ | log t / log s − m/n | ≤ 1/n

c.

  A(s^m) ≤ A(t^n) ≤ A(s^{m+1})  ⇒ (by (1))  m A(s) ≤ n A(t) ≤ (m + 1) A(s)
  ⇒ (s ≠ 1)  m/n ≤ A(t)/A(s) ≤ m/n + 1/n ⇒ 0 ≤ A(t)/A(s) − m/n ≤ 1/n ⇒ | A(t)/A(s) − m/n | ≤ 1/n

d. Consider again s^m ≤ t^n ≤ s^{m+1} with s, t fixed. If m → ∞ then n → ∞, and from

  | A(t)/A(s) − log t / log s | ≤ 2/n

it follows that | A(t)/A(s) − log t / log s | → 0.

Therefore | A(t)/A(s) − log t / log s | = 0, and so A(t)/A(s) = log t / log s.

Finally, A(t) = (A(s)/log s) · log t = K log t, where K = A(s)/log s > 0 (if s ≠ 1).


Case 2: pi ∈ Q for i = 1, ..., n

Let's consider a set of N equiprobable random events, and P = (S1, S2, ..., Sk) a partition of this set. Let's denote pi = |Si| / N.

A "natural" two-step encoding (first choose the group Si, with probability |Si|/N; then choose one of the |Si| equiprobable events inside it) leads, based on the axiom A3, to

  A(N) = ψk(p1, ..., pk) + Σi pi A(|Si|).

(Figure: the corresponding two-level encoding tree, with branches |S1|/N, ..., |Sk|/N at level 1 and branches 1/|Si| at level 2.)

Finally, using the result A(t) = K log t gives:

  K log N = ψk(p1, ..., pk) + K Σi pi log |Si|

  ⇒ ψk(p1, ..., pk) = K [ log N − Σi pi log |Si| ]
                    = K [ log N Σi pi − Σi pi log |Si| ] = −K Σi pi log( |Si| / N )
                    = −K Σi pi log pi


Addenda

C. Some Examples



Exemplifying the computation of expected values for random variables and the use of the sensitivity of a test in a real-world application

CMU, 2009 fall, Geoff Gordon, HW1, pr. 2


There is a disease which affects 1 in 500 people. A 100.00 dollar blood test can help reveal whether a person has the disease. A positive outcome indicates that the person may have the disease. The test has perfect sensitivity (true positive rate), i.e., a person who has the disease tests positive 100% of the time. However, the test has 99% specificity (true negative rate), i.e., a healthy person tests positive 1% of the time.

a. A randomly selected individual is tested and the result is positive. What is the probability of the individual having the disease?

b. There is a second, more expensive test which costs 10,000.00 dollars but is exact, with 100% sensitivity and specificity. If we require all people who test positive with the less expensive test to be tested with the more expensive test, what is the expected cost to check whether an individual has the disease?

c. A pharmaceutical company is attempting to decrease the cost of the second (perfect) test. How much would it have to make the second test cost, so that the first test is no longer needed? That is, at what cost is it cheaper simply to use the perfect test alone, instead of screening with the cheaper test as described in part b?


Random variables:

B: 1/true for persons having this disease, 0/false otherwise;
T1: the result of the first test: + or −;
T2: the result of the second test: again + or −.

Known facts:

  P(B) = 1/500
  P(T1 = + | B) = 1,  P(T1 = + | ¬B) = 1/100,
  P(T2 = + | B) = 1,  P(T2 = + | ¬B) = 0

a.

  P(B | T1 = +) = P(T1 = + | B) · P(B) / [ P(T1 = + | B) · P(B) + P(T1 = + | ¬B) · P(¬B) ]
                = (1 · 1/500) / (1 · 1/500 + 1/100 · 499/500) = 100/599 ≈ 0.1669


b.

  C = c1 if the person is tested only with the first test,
      c1 + c2 if the person is tested with both tests

  ⇒ P(C = c1) = P(T1 = −) and P(C = c1 + c2) = P(T1 = +)

  P(T1 = +) = P(T1 = + | B) · P(B) + P(T1 = + | ¬B) · P(¬B)
            = 1 · 1/500 + 1/100 · 499/500 = 599/50000 = 0.01198

  ⇒ E[C] = c1 · (1 − P(T1 = +)) + (c1 + c2) · P(T1 = +)
         = c1 − c1 · P(T1 = +) + c1 · P(T1 = +) + c2 · P(T1 = +)
         = c1 + c2 · P(T1 = +)
         = 100 + 10000 · 599/50000 = 219.8 ≈ 220$


c.

Let cn be the new price for the second test (T2′). The first test is no longer needed when

  cn ≤ E[C′] = c1 · P(C′ = c1) + (c1 + cn) · P(C′ = c1 + cn)
             = c1 + cn · P(T1 = +) = 100 + cn · 599/50000

At the break-even point, cn = 100 + cn · 0.01198, so cn ≈ 101.2125.
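A numerical sketch (my addition) reproducing parts a, b and c of this exercise with the numbers given in the problem statement.

```python
p_B = 1 / 500               # prevalence of the disease
sens1, spec1 = 1.0, 0.99    # first test: sensitivity and specificity
c1, c2 = 100.0, 10_000.0    # costs of the two tests

# a. posterior probability of disease given a positive first test
p_pos = sens1 * p_B + (1 - spec1) * (1 - p_B)   # P(T1 = +)
p_B_given_pos = sens1 * p_B / p_pos
print(round(p_B_given_pos, 4))                  # ~0.1669

# b. expected cost of the two-stage screening
E_cost = c1 + c2 * p_pos
print(round(E_cost, 2))                         # 219.8

# c. break-even price cn for the perfect test: cn = c1 + cn * P(T1 = +)
cn = c1 / (1 - p_pos)
print(round(cn, 4))                             # ~101.2125
```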


Using the Central Limit Theorem (the i.i.d. version) to compute the real error of a classifier

CMU, 2008 fall, Eric Xing, HW3, pr. 3.3


Chris recently adopted a new (binary) classifier to filter email spam. He wants to quantitatively evaluate how good the classifier is.

He has a small dataset of 100 emails on hand which, you can assume, are randomly drawn from all emails.

He tests the classifier on the 100 emails and gets 83 classified correctly, so the error rate on the small dataset is 17%.

However, the number measured on 100 samples could be either higher or lower than the real error rate just by chance.

With a confidence level of 95%, what is likely to be the range of the real error rate? Please write down all important steps.

(Hint: You need some approximation in this problem.)


Notations:

Let Xi, i = 1, ..., n = 100 be defined as: Xi = 1 if the email i was incorrectly classified, and 0 otherwise;

  E[Xi] := µ := e_real ;  Var(Xi) := σ²

  e_sample := (X1 + ... + Xn)/n = 0.17

  Zn = (X1 + ... + Xn − nµ) / (√n σ)   (the standardized form of X1 + ... + Xn)

Key insight:

Calculating the real error of the classifier (more exactly, a symmetric interval around the real error p := µ) with a "confidence" of 95% amounts to finding a > 0 such that P(|Zn| ≤ a) ≥ 0.95.


Calculus:

  |Zn| ≤ a ⇔ | (X1 + ... + Xn − nµ) / (√n σ) | ≤ a ⇔ | X1 + ... + Xn − nµ | ≤ a σ √n

  ⇔ | (X1 + ... + Xn − nµ)/n | ≤ a σ / √n ⇔ | (X1 + ... + Xn)/n − µ | ≤ a σ / √n

  ⇔ |e_sample − e_real| ≤ a σ / √n ⇔ |e_real − e_sample| ≤ a σ / √n

  ⇔ −a σ / √n ≤ e_real − e_sample ≤ a σ / √n

  ⇔ e_sample − a σ / √n ≤ e_real ≤ e_sample + a σ / √n

  ⇔ e_real ∈ [ e_sample − a σ / √n , e_sample + a σ / √n ]


Important facts:

The Central Limit Theorem: Zn → N(0, 1).

Therefore, P(|Zn| ≤ a) ≈ P(|X| ≤ a) = Φ(a) − Φ(−a), where X ∼ N(0, 1) and Φ is the cumulative distribution function of N(0, 1).

Calculus:

  Φ(−a) + Φ(a) = 1 ⇒ P(|Zn| ≤ a) = Φ(a) − Φ(−a) = 2Φ(a) − 1

  P(|Zn| ≤ a) = 0.95 ⇔ 2Φ(a) − 1 = 0.95 ⇔ Φ(a) = 0.975 ⇔ a ≈ 1.96 (see the Φ table)

Finally:

  σ² := Var_real ≈ Var_sample due to the above theorem, and
  Var_sample = e_sample (1 − e_sample), because the Xi are Bernoulli variables.

  ⇒ a σ / √n = 1.96 · √( 0.17 (1 − 0.17) ) / √100 ≈ 0.07

  |e_real − e_sample| ≤ 0.07 ⇔ |e_real − 0.17| ≤ 0.07 ⇔ −0.07 ≤ e_real − 0.17 ≤ 0.07
  ⇔ e_real ∈ [0.10, 0.24]
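A short sketch (my addition) that reproduces this interval; the normal quantile 1.96 is hard-coded rather than read from a Φ table.

```python
from math import sqrt

n = 100
e_sample = 0.17
z = 1.96  # quantile with Phi(z) = 0.975

# half-width of the approximate 95% confidence interval for the real error rate
half_width = z * sqrt(e_sample * (1 - e_sample)) / sqrt(n)
lo, hi = e_sample - half_width, e_sample + half_width

print(round(half_width, 3), (round(lo, 2), round(hi, 2)))  # ~0.074, (0.1, 0.24)
```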