
Page 1: Probability and Statistics Basic concepts II (from a physicist point of view)

Probability and Statistics

Basic concepts II (from a physicist point of view)

Benoit CLEMENT – Université J. Fourier / LPSC

[email protected]

Page 2: Probability and Statistics Basic concepts II (from a physicist point of view)

Statistics


Parametric estimation: give one value to each parameter

Interval estimation: derive an interval that probably contains the true value

Non-parametric estimation: estimate the full pdf of the population

[Diagram: PHYSICS parameters θ → Observable → POPULATION f(x;θ) → (EXPERIMENT) → SAMPLE of finite size {xi} → (INFERENCE) back to the parameters]

Page 3: Probability and Statistics Basic concepts II (from a physicist point of view)

Parametric estimation

From a finite sample {xi}, estimate a parameter θ.

Statistic = a function S = f({xi})

Any statistic can be considered as an estimator of θ. To be a good estimator it needs to satisfy:

• Consistency: the estimator converges to the true value in the limit of an infinite sample.
• Bias: the difference between the mean value of the estimator and the true value should vanish.
• Efficiency: the speed of convergence should be high.
• Robustness: the sensitivity to statistical fluctuations should be low.

A good estimator should at least be consistent and asymptotically unbiased.

Efficient / unbiased / robust often contradict each other: different choices for different applications.


Page 4: Probability and Statistics Basic concepts II (from a physicist point of view)

Bias and consistency
As the sample is a set of realizations of random variables (or one vector variable), so is the estimator: it has a mean, a variance, ... and a probability density function. $\hat\theta$ is a realization of $\hat\Theta$.

Bias: the mean value of the estimator,
$$b(\theta) = E[\hat\Theta - \theta] = \mu_{\hat\Theta} - \theta$$
unbiased estimator: $b(\theta) = 0$; asymptotically unbiased: $b(\theta) \xrightarrow[n \to \infty]{} 0$.

Consistency: formally,
$$\forall\,\varepsilon > 0,\quad P\big(|\hat\Theta - \theta| > \varepsilon\big) \xrightarrow[n \to \infty]{} 0$$
in practice, for an asymptotically unbiased estimator, $\sigma_{\hat\Theta} \xrightarrow[n \to \infty]{} 0$ is enough.

[Figure: pdfs of a biased, an asymptotically unbiased, and an unbiased estimator]
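As a quick illustration (not from the slides), a minimal Monte Carlo sketch in Python: the naive sample variance $s^2 = \frac1n\sum_i (x_i-\hat\mu)^2$ has bias $-\sigma^2/n$, which vanishes with n while its spread also shrinks, i.e. the estimator is asymptotically unbiased and consistent. The Gaussian population and the sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 2.0   # true population parameters (arbitrary)

for n in (5, 50, 500, 5000):
    # 10000 pseudo-experiments, each providing one realization of s^2
    samples = rng.normal(mu, sigma, size=(10_000, n))
    s2 = samples.var(axis=1)            # biased estimator (1/n convention)
    print(f"n={n:5d}  E[s2]={s2.mean():6.3f}  "
          f"bias={s2.mean() - sigma**2:7.3f}  sigma(s2)={s2.std():6.3f}")
# bias -> -sigma^2/n -> 0 and sigma(s2) -> 0: asymptotically unbiased, consistent
```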

Page 5: Probability and Statistics Basic concepts II (from a physicist point of view)

Empirical estimators
The sample mean is a good estimator of the population mean (weak law of large numbers): convergent and unbiased. The sample variance can be used as an estimator of the population variance:

$$\hat\mu = \frac{1}{n}\sum_i x_i,\qquad E[\hat\mu] = \mu,\qquad \sigma^2_{\hat\mu} = E[(\hat\mu-\mu)^2] = \frac{\sigma^2}{n}$$

$$s^2 = \frac{1}{n}\sum_i (x_i-\hat\mu)^2 = \frac{1}{n}\sum_i x_i^2 - \hat\mu^2,\qquad E[s^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$

biased, asymptotically unbiased

unbiased variance estimator:
$$\hat\sigma^2 = \frac{1}{n-1}\sum_i (x_i-\hat\mu)^2$$

variance of the estimator (convergence):
$$\sigma^2_{\hat\sigma^2} = \frac{1}{n}\left(\gamma_2 + \frac{2n}{n-1}\right)\sigma^4$$

Page 6: Probability and Statistics Basic concepts II (from a physicist point of view)

Errors on these estimators
Uncertainty = estimator standard deviation. Since σ is unknown, use an estimator of the standard deviation, $\hat\sigma = \sqrt{\hat\sigma^2}$ (!!! biased).

Mean:
$$\hat\mu = \frac{1}{n}\sum_i x_i,\qquad \sigma^2_{\hat\mu} = \frac{\sigma^2}{n},\qquad \Delta\hat\mu = \frac{\hat\sigma}{\sqrt n}$$

Variance:
$$\hat\sigma^2 = \frac{1}{n-1}\sum_i (x_i-\hat\mu)^2,\qquad \sigma^2_{\hat\sigma^2} \approx \frac{2\sigma^4}{n},\qquad \Delta\hat\sigma^2 = \hat\sigma^2\sqrt{\frac{2}{n}}$$

Central-limit theorem: the empirical estimators of mean and variance are normally distributed for large enough samples, so
$$\hat\mu \pm \Delta\hat\mu \qquad\text{and}\qquad \hat\sigma^2 \pm \Delta\hat\sigma^2$$
define 68% confidence intervals.
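A minimal sketch of these formulas on a single toy sample (the exponential population is an arbitrary choice):

```python
import numpy as np

x = np.random.default_rng(1).exponential(scale=2.5, size=200)  # toy sample
n = len(x)

mu_hat  = x.mean()                     # empirical mean
var_hat = x.var(ddof=1)                # unbiased variance estimator
d_mu  = np.sqrt(var_hat / n)           # Delta mu = sigma_hat / sqrt(n)
d_var = var_hat * np.sqrt(2.0 / n)     # Delta sigma^2 = sigma_hat^2 * sqrt(2/n)

print(f"mu      = {mu_hat:.3f} +/- {d_mu:.3f}")
print(f"sigma^2 = {var_hat:.3f} +/- {d_var:.3f}")
```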

Page 7: Probability and Statistics Basic concepts II (from a physicist point of view)

Likelihood function

Generic function k(x, θ): x = random variable(s), θ = parameter(s).

Probability density function: fix θ = θ0 (the true value),
$$f(x;\theta) = k(x, \theta_0),\qquad \int f(x;\theta)\,dx = 1$$
for a Bayesian, f(x|θ) = f(x;θ).

Likelihood function: fix x = u (one realization of the random variable),
$$\mathcal L(\theta) = k(u, \theta),\qquad \int \mathcal L(\theta)\,d\theta = \;???$$
for a Bayesian, $f(\theta|x) = \mathcal L(\theta) \big/ \int \mathcal L(\theta)\,d\theta$.

For a sample, n independent realizations of the same variable X:
$$\mathcal L(\theta) = \prod_i k(x_i, \theta) = \prod_i f(x_i;\theta)$$
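To make the distinction concrete, a small sketch (my example, not the slide's): the same function $k(x,\lambda) = \lambda e^{-\lambda x}$ read as a log-likelihood of λ for a fixed observed sample:

```python
import numpy as np

def log_likelihood(lam, x):
    """ln L(lambda) = sum_i ln f(x_i; lambda), with f(x; lambda) = lambda*exp(-lambda*x)."""
    return np.sum(np.log(lam) - lam * x)

x = np.random.default_rng(0).exponential(scale=2.0, size=100)  # true lambda = 0.5
for lam in (0.3, 0.5, 0.7):
    print(f"lambda = {lam:.1f}   lnL = {log_likelihood(lam, x):9.2f}")
# lnL peaks near the true value lambda = 0.5
```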

Page 8: Probability and Statistics Basic concepts II (from a physicist point of view)

Estimator variance
Start from the generic function k and differentiate the pdf normalization condition twice with respect to θ:
$$1 = \int k(x,\theta)\,dx$$
$$0 = \int \frac{\partial k}{\partial\theta}\,dx = \int \frac{\partial \ln k}{\partial\theta}\,k\,dx = E\!\left[\frac{\partial \ln k}{\partial\theta}\right]$$
$$0 = \int \frac{\partial^2 \ln k}{\partial\theta^2}\,k\,dx + \int \left(\frac{\partial \ln k}{\partial\theta}\right)^{\!2} k\,dx \;\Rightarrow\; E\!\left[\left(\frac{\partial \ln k}{\partial\theta}\right)^{\!2}\right] = -\,E\!\left[\frac{\partial^2 \ln k}{\partial\theta^2}\right]$$

Now differentiate the estimator bias, $b(\theta) = \int \hat\theta(x)\,k(x,\theta)\,dx - \theta$:
$$b'(\theta) = \int \hat\theta\,\frac{\partial \ln k}{\partial\theta}\,k\,dx - 1 \;\Rightarrow\; 1 + b'(\theta) = \int \big(\hat\theta - b - \theta\big)\,\frac{\partial \ln k}{\partial\theta}\,k\,dx$$
(the last step uses $E[\partial \ln k / \partial\theta] = 0$).

Finally, using the Cauchy-Schwarz inequality:
$$\big(1 + b'(\theta)\big)^2 \le \int \big(\hat\theta - b - \theta\big)^2\,k\,dx \,\times\, \int \left(\frac{\partial \ln k}{\partial\theta}\right)^{\!2} k\,dx$$
$$\sigma^2_{\hat\Theta} \ge \frac{\big(1 + b'(\theta)\big)^2}{E\!\left[\left(\partial \ln k / \partial\theta\right)^{2}\right]}$$
the Cramer-Rao bound.

Page 9: Probability and Statistics Basic concepts II (from a physicist point of view)

Efficiency

For any unbiased estimator of θ, the variance cannot be smaller than the Cramer-Rao bound:
$$\sigma^2_{\hat\Theta} \ge \frac{1}{E\!\left[\left(\frac{\partial \ln \mathcal L}{\partial\theta}\right)^{\!2}\right]} = \frac{-1}{E\!\left[\frac{\partial^2 \ln \mathcal L}{\partial\theta^2}\right]}$$

The efficiency of a convergent estimator is given by its variance.

An efficient estimator reaches the Cramer-Rao bound (at least asymptotically): the Minimal Variance Estimator (MVE).

The MVE will often be biased, but asymptotically unbiased.
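A worked example, not on the slide, assuming an exponential law $f(x;\lambda) = \lambda e^{-\lambda x}$:

```latex
% Cramer-Rao bound for an exponential law f(x;\lambda) = \lambda e^{-\lambda x}
\[
\frac{\partial \ln f}{\partial \lambda} = \frac{1}{\lambda} - x ,
\qquad
E\!\left[\Big(\frac{\partial \ln f}{\partial \lambda}\Big)^{2}\right]
   = \operatorname{Var}(x) = \frac{1}{\lambda^{2}}
\]
\[
\text{for $n$ independent measurements:}\quad
E\!\left[\Big(\frac{\partial \ln \mathcal{L}}{\partial \lambda}\Big)^{2}\right]
   = \frac{n}{\lambda^{2}}
\;\Rightarrow\;
\sigma^{2}_{\hat\lambda} \ge \frac{\lambda^{2}}{n}
\]
```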

Page 10: Probability and Statistics Basic concepts II (from a physicist point of view)

Maximum likelihood
For a sample of measurements {xi}: the analytical form of the density is known, and it depends on several unknown parameters θ. E.g. event counting: each count follows a Poisson distribution, with a parameter that depends on the physics, λi(θ):
$$\mathcal L(\theta) = \prod_i e^{-\lambda_i(\theta)}\,\frac{\lambda_i(\theta)^{x_i}}{x_i!}$$

An estimator of the parameters θ is the set of values that maximizes the probability of observing the observed result, i.e. the maximum of the likelihood function:
$$\left.\frac{\partial \mathcal L}{\partial\theta}\right|_{\hat\theta} = 0$$

Rem: a system of equations for several parameters.
Rem: one often minimizes −lnL instead, which simplifies the expressions.
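A minimal numerical sketch, assuming the simplest model λi(θ) = θ (my choice): minimize −lnL with scipy.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

counts = np.array([12, 8, 15, 11, 9])   # hypothetical observed event counts

def nll(theta):
    """-ln L(theta) for Poisson counts with lambda_i(theta) = theta."""
    return np.sum(theta - counts * np.log(theta) + gammaln(counts + 1.0))

res = minimize_scalar(nll, bounds=(1e-3, 100.0), method="bounded")
print(f"theta_hat = {res.x:.3f}")   # equals counts.mean() = 11.0 for this model
```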

Page 11: Probability and Statistics Basic concepts II (from a physicist point of view)

Properties of the MLE
Mostly asymptotic properties: valid for large samples, often assumed in any case for lack of better information.
• Asymptotically unbiased
• Asymptotically efficient (reaches the CR bound)
• Asymptotically normally distributed: multinormal law, with covariance given by the generalization of the CR bound:
$$f(\hat\theta;\theta,\Sigma) = \frac{1}{\sqrt{(2\pi)^{n_\theta}\,|\Sigma|}}\; e^{-\frac12 (\hat\theta-\theta)^{T}\Sigma^{-1}(\hat\theta-\theta)},\qquad (\Sigma^{-1})_{ij} = E\!\left[-\frac{\partial^2 \ln\mathcal L}{\partial\theta_i\,\partial\theta_j}\right]$$

Goodness of fit: the value of $-2\ln\mathcal L(\hat\theta)$ is χ²-distributed (khi-2), with ndf = sample size − number of parameters:
$$p\text{-value} = \int_{-2\ln\mathcal L}^{+\infty} f_{\chi^2}(x;\,\mathrm{ndf})\,dx$$

Probability of getting a worse agreement
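The p-value in the last formula is one line with scipy; the numbers below are hypothetical:

```python
from scipy.stats import chi2

m2lnL = 27.3   # hypothetical value of -2 ln L at its minimum
ndf = 20       # sample size minus number of fitted parameters
p_value = chi2.sf(m2lnL, ndf)   # survival function = integral from m2lnL to infinity
print(f"p-value = {p_value:.3f}")   # probability of getting a worse agreement
```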

Page 12: Probability and Statistics Basic concepts II (from a physicist point of view)

Least squares
Set of measurements (xi, yi) with uncertainties Δyi on the yi. Theoretical law: y = f(x,θ).

Naïve approach, use regression:
$$w(\theta) = \sum_i \big(y_i - f(x_i,\theta)\big)^2,\qquad \left.\frac{\partial w}{\partial\theta}\right|_{\hat\theta} = 0$$

Reweight each term by the error:
$$K^2(\theta) = \sum_i \frac{\big(y_i - f(x_i,\theta)\big)^2}{\Delta y_i^2},\qquad \left.\frac{\partial K^2}{\partial\theta}\right|_{\hat\theta} = 0$$

Maximum likelihood: assume each yi is normally distributed, with a mean equal to f(xi,θ) and a standard deviation equal to Δyi. Then the likelihood is:
$$\mathcal L(\theta) = \prod_i \frac{1}{\sqrt{2\pi}\,\Delta y_i}\; e^{-\frac12\left(\frac{y_i - f(x_i,\theta)}{\Delta y_i}\right)^{2}} \quad\Rightarrow\quad -2\ln\mathcal L(\theta) = K^2(\theta) + \text{const}$$

Least squares or khi-2 (χ²) fit is the MLE, for Gaussian errors.

Generic case with correlations:
$$K^2(\theta) = \big(y - f(x,\theta)\big)^{T}\,\Sigma^{-1}\,\big(y - f(x,\theta)\big)$$
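A minimal χ² fit sketch with scipy's curve_fit, one standard implementation of this weighted least-squares MLE (toy data):

```python
import numpy as np
from scipy.optimize import curve_fit

x  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # toy measurements
y  = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
dy = np.array([0.3, 0.3, 0.4, 0.4, 0.5])    # uncertainties on y

def f(x, a, b):          # theoretical law y = f(x; theta)
    return a * x + b

popt, pcov = curve_fit(f, x, y, sigma=dy, absolute_sigma=True)
(a, b), (da, db) = popt, np.sqrt(np.diag(pcov))
K2 = np.sum(((y - f(x, *popt)) / dy) ** 2)
print(f"a = {a:.2f} +/- {da:.2f}, b = {b:.2f} +/- {db:.2f}, K2/ndf = {K2/3:.2f}")
```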

Page 13: Probability and Statistics Basic concepts II (from a physicist point of view)

Errors on MLE

Errors on parameters -> from the covariance matrix,
$$(\Sigma^{-1})_{ij} = E\!\left[-\frac{\partial^2\ln\mathcal L}{\partial\theta_i\,\partial\theta_j}\right] \approx -\left.\frac{\partial^2\ln\mathcal L}{\partial\theta_i\,\partial\theta_j}\right|_{\hat\theta}$$
(only one realization of the estimator -> empirical mean of 1 value...)

For one parameter, the 68% interval is $\hat\theta \pm \Delta\hat\theta$ with
$$\Delta\hat\theta = \sigma_{\hat\theta} = \left(-\left.\frac{\partial^2\ln\mathcal L}{\partial\theta^2}\right|_{\hat\theta}\right)^{-1/2}$$

More generally, expanding lnL around its maximum:
$$\ln\mathcal L(\theta) = \ln\mathcal L(\hat\theta) - \frac12 \sum_{i,j} (\Sigma^{-1})_{ij}\,(\theta_i - \hat\theta_i)(\theta_j - \hat\theta_j) + O(\theta^3)$$

Confidence contours are defined by the equation
$$\Delta\ln\mathcal L = \ln\mathcal L(\theta) - \ln\mathcal L(\hat\theta) = -\beta,\qquad 2\beta = \chi^2_\alpha(n_\theta),\quad \alpha = \int_0^{\chi^2_\alpha} f_{\chi^2}(x;\,n_\theta)\,dx$$

Values of β for different numbers of parameters nθ and confidence levels α:

α (%)     nθ=1    nθ=2    nθ=3
68.3      0.5     1.15    1.76
95.4      2       3.09    4.01
99.7      4.5     5.92    7.08
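The β table can be reproduced from the χ² quantiles (a small check, not on the slide):

```python
from scipy.stats import chi2

# beta = chi2 quantile / 2, for n_theta parameters at confidence level alpha
for alpha in (0.683, 0.954, 0.997):
    row = [0.5 * chi2.ppf(alpha, n) for n in (1, 2, 3)]
    print(f"alpha = {alpha:.3f}:  " + "  ".join(f"{b:.2f}" for b in row))
```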

Page 14: Probability and Statistics Basic concepts II (from a physicist point of view)

Example : fitting a line
For f(x) = ax:
$$K^2(a) = \sum_i \frac{(y_i - a x_i)^2}{\Delta y_i^2} = A a^2 - 2B a + C = -2\ln\mathcal L + \text{const}$$
$$A = \sum_i \frac{x_i^2}{\Delta y_i^2},\qquad B = \sum_i \frac{x_i y_i}{\Delta y_i^2},\qquad C = \sum_i \frac{y_i^2}{\Delta y_i^2}$$
$$\frac{\partial K^2}{\partial a} = 2Aa - 2B = 0 \;\Rightarrow\; \hat a = \frac{B}{A};\qquad \frac{\partial^2 K^2}{\partial a^2} = 2A \;\Rightarrow\; \sigma_a^2 = \Delta a^2 = \frac{1}{A}$$
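The closed-form result in a few lines of Python (toy numbers):

```python
import numpy as np

def fit_through_origin(x, y, dy):
    """Chi2 fit of y = a*x: a_hat = B/A, sigma_a = 1/sqrt(A)."""
    A = np.sum(x**2 / dy**2)
    B = np.sum(x * y / dy**2)
    return B / A, 1.0 / np.sqrt(A)

x  = np.array([1.0, 2.0, 3.0, 4.0])
y  = np.array([2.2, 3.8, 6.1, 8.3])
dy = np.array([0.2, 0.2, 0.3, 0.3])
a_hat, da = fit_through_origin(x, y, dy)
print(f"a = {a_hat:.3f} +/- {da:.3f}")
```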

Page 15: Probability and Statistics Basic concepts II (from a physicist point of view)

Example : fitting a line

For f(x) = ax + b:
$$K^2(a,b) = \sum_i \frac{(y_i - a x_i - b)^2}{\Delta y_i^2} = A a^2 + B b^2 + 2C ab - 2D a - 2E b + F = -2\ln\mathcal L + \text{const}$$
$$A = \sum_i \frac{x_i^2}{\Delta y_i^2},\quad B = \sum_i \frac{1}{\Delta y_i^2},\quad C = \sum_i \frac{x_i}{\Delta y_i^2},\quad D = \sum_i \frac{x_i y_i}{\Delta y_i^2},\quad E = \sum_i \frac{y_i}{\Delta y_i^2},\quad F = \sum_i \frac{y_i^2}{\Delta y_i^2}$$
$$\frac{\partial K^2}{\partial a} = 2Aa + 2Cb - 2D = 0,\qquad \frac{\partial K^2}{\partial b} = 2Ca + 2Bb - 2E = 0$$
$$\hat a = \frac{BD - CE}{AB - C^2},\qquad \hat b = \frac{AE - CD}{AB - C^2}$$
$$\frac{\partial^2 K^2}{\partial a^2} = 2A,\qquad \frac{\partial^2 K^2}{\partial b^2} = 2B,\qquad \frac{\partial^2 K^2}{\partial a\,\partial b} = 2C$$
$$\Sigma^{-1} = \begin{pmatrix} A & C\\ C & B \end{pmatrix} \;\Rightarrow\; \Sigma = \frac{1}{AB - C^2}\begin{pmatrix} B & -C\\ -C & A \end{pmatrix},\qquad \sigma_a^2 = \frac{B}{AB - C^2},\quad \sigma_b^2 = \frac{A}{AB - C^2}$$

Page 16: Probability and Statistics Basic concepts II (from a physicist point of view)

Example : fitting a line

2-dimensional error contours on a and b:
68.3% : Δχ² = 2.3    95.4% : Δχ² = 6.2    99.7% : Δχ² = 11.8

Page 17: Probability and Statistics Basic concepts II (from a physicist point of view)

Example : fitting a line

1-dimensional errors from the contours on a and b: the 1d errors (68.3%) are given by the edges of the Δχ² = 1 ellipse and depend on the covariance. The tilt angle of the ellipse is

$$\tan 2\varphi = \frac{2\rho\,\sigma_a \sigma_b}{\sigma_a^2 - \sigma_b^2}$$

Page 18: Probability and Statistics Basic concepts II (from a physicist point of view)

Frequentist vs Bayes

Frequentist: the estimator of the parameters θ maximizes the probability to observe the data: maximum of the likelihood function.

Bayesian: Bayes' theorem links the probability of θ knowing the data (the posterior) to the probability of the data knowing θ (the likelihood).

Frequentist:
$$\left.\frac{\partial\ln\mathcal L}{\partial\theta}\right|_{\hat\theta} = 0,\qquad (\Sigma^{-1})_{ij} = E\!\left[-\frac{\partial^2\ln\mathcal L}{\partial\theta_i\,\partial\theta_j}\right],\qquad 2\,\Delta\ln\mathcal L(\theta) = -\chi^2_\alpha(n_\theta)\ \text{for the contours}$$

Bayesian:
$$f(\theta\,|\,m) = \frac{\mathcal L(\theta)\,\pi(\theta)}{\int \mathcal L(\theta)\,\pi(\theta)\,d\theta},\qquad \alpha = \int_a^b f(\theta\,|\,m)\,d\theta$$
where π(θ) is the prior density on θ.
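A numerical sketch of the Bayesian recipe (my toy model: an exponential sample with a flat prior on λ, posterior evaluated on a grid):

```python
import numpy as np

x = np.random.default_rng(3).exponential(scale=2.0, size=50)  # true lambda = 0.5
lam = np.linspace(1e-3, 2.0, 2000)                            # parameter grid
dlam = lam[1] - lam[0]

log_like = len(x) * np.log(lam) - lam * x.sum()   # ln L(lambda), exponential pdf
post = np.exp(log_like - log_like.max())          # L * flat prior, unnormalized
post /= post.sum() * dlam                         # normalize: f(lambda | x)

print(f"posterior mean = {np.sum(lam * post) * dlam:.3f}")
```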

Page 19: Probability and Statistics Basic concepts II (from a physicist point of view)

Confidence interval
For a random variable, a confidence interval with confidence level α is any interval [a, b] such that
$$P(X \in [a,b]) = \int_a^b f_X(x)\,dx = \alpha$$
(the probability of finding a realization inside the interval). It generalizes the concept of uncertainty: an interval that contains the true value with a given probability. Frequentists and Bayesians use slightly different concepts.

For Bayesians: the posterior density is the probability density of the true value; it can be used to derive an interval such that
$$P(\theta \in [a,b]) = \alpha$$

No such thing for a frequentist: the interval itself becomes the random variable; [a, b] is a realization of [A, B], with
$$P(A \le \theta \text{ and } \theta \le B) = \alpha,\ \text{independently of } \theta$$

Page 20: Probability and Statistics Basic concepts II (from a physicist point of view)

Confidence interval
Mean-centered, symmetric interval [μ − a, μ + a]:
$$\int_{\mu-a}^{\mu+a} f(x)\,dx = \alpha$$

Mean-centered, probability-symmetric interval [a, b]:
$$\int_a^{\mu} f(x)\,dx = \int_{\mu}^{b} f(x)\,dx = \frac{\alpha}{2}$$

Highest Probability Density (HPD) interval [a, b]:
$$\int_a^b f(x)\,dx = \alpha \qquad\text{and}\qquad f(x) \ge f(y)\ \text{for } x \in [a,b],\ y \notin [a,b]$$

Page 21: Probability and Statistics Basic concepts II (from a physicist point of view)

Confidence Belt
To build a frequentist interval for an estimator θ̂ of θ:
1. Make pseudo-experiments for several values of θ and compute the estimator for each (MC sampling of the estimator pdf).
2. For each θ, determine Ξ(θ) and Ω(θ) such that
   θ̂ ≤ Ξ(θ) for a fraction (1−α)/2 of the pseudo-experiments,
   θ̂ ≥ Ω(θ) for a fraction (1−α)/2 of the pseudo-experiments.
   These 2 curves are the confidence belt, for a CL α.
3. Invert these functions. The interval [Ω⁻¹(θ̂), Ξ⁻¹(θ̂)] satisfies
$$P\big(\Omega^{-1}(\hat\theta) \le \theta \le \Xi^{-1}(\hat\theta)\big) = 1 - P\big(\hat\theta \le \Xi(\theta)\big) - P\big(\hat\theta \ge \Omega(\theta)\big) = \alpha$$

[Figure: confidence belt for a Poisson parameter λ estimated with the empirical mean of 3 realizations (68% CL)]
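Steps 1 and 2 in a short Monte Carlo sketch, for the Poisson example of the figure (mean of 3 realizations, 68% CL):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = 0.68

for lam in (1.0, 2.0, 5.0, 10.0):                  # scan the true parameter
    est = rng.poisson(lam, size=(100_000, 3)).mean(axis=1)   # pseudo-experiments
    xi, omega = np.quantile(est, [(1 - alpha) / 2, (1 + alpha) / 2])
    print(f"lambda = {lam:5.1f}   belt: [{xi:.2f}, {omega:.2f}]")
# Step 3: invert the two curves in lambda to get the interval for an observed value
```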

Page 22: Probability and Statistics Basic concepts II (from a physicist point of view)

Dealing with systematics

The variance of the estimator only measures the statistical uncertainty.

Often, we will have to deal with some parameters whose values are known with limited precision: systematic uncertainties.

The likelihood function becomes
$$\mathcal L(\theta, \nu),\qquad \text{with } \nu = \nu_0 \pm \Delta\nu_0$$

These partially known parameters ν are the nuisance parameters.

Page 23: Probability and Statistics Basic concepts II (from a physicist point of view)

Bayesian inference
In Bayesian statistics, nuisance parameters are dealt with by assigning them a prior π(ν). Usually a multinormal law is used, with mean ν0 and a covariance matrix estimated from Δν0 (+ correlations, if needed):
$$f(\theta, \nu\,|\,x) = \frac{f(x\,|\,\theta,\nu)\,\pi(\theta)\,\pi(\nu)}{\iint f(x\,|\,\theta,\nu)\,\pi(\theta)\,\pi(\nu)\,d\theta\,d\nu}$$

The final posterior is obtained by marginalization over the nuisance parameters:
$$f(\theta\,|\,x) = \int f(\theta,\nu\,|\,x)\,d\nu = \frac{\int f(x\,|\,\theta,\nu)\,\pi(\theta)\,\pi(\nu)\,d\nu}{\iint f(x\,|\,\theta,\nu)\,\pi(\theta)\,\pi(\nu)\,d\theta\,d\nu}$$
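A grid-integration sketch of this marginalization, for a hypothetical counting model n ~ Poisson(θν) with a Gaussian prior on the nuisance ν and a flat prior on θ (all numbers are invented):

```python
import numpy as np

n_obs, nu0, dnu0 = 16, 1.0, 0.1
theta = np.linspace(0.1, 40.0, 400)[:, None]   # parameter grid (column)
nu    = np.linspace(0.6, 1.4, 200)[None, :]    # nuisance grid (row)
dtheta = theta[1, 0] - theta[0, 0]

like = np.exp(-theta * nu) * (theta * nu) ** n_obs    # Poisson L (x! constant dropped)
prior_nu = np.exp(-0.5 * ((nu - nu0) / dnu0) ** 2)    # Gaussian pi(nu)

post = (like * prior_nu).sum(axis=1)    # marginalize over the nuisance
post /= post.sum() * dtheta             # normalized posterior f(theta | x)
print(f"posterior mode: theta = {theta[post.argmax(), 0]:.2f}")
```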

Page 24: Probability and Statistics Basic concepts II (from a physicist point of view)

Profile Likelihood

No true frequentist way to add systematic effects. Popular method of the day: profiling.
Deal with nuisance parameters as realizations of random variables and extend the likelihood:
$$\mathcal L'(\theta, \nu) = \mathcal L(\theta, \nu)\; G(\nu)$$
G(ν) is the likelihood of the new parameters (identical to the prior). For each value of θ, maximize the likelihood with respect to the nuisances: this defines the profile likelihood PL(θ).

PL(θ) has the same asymptotic statistical properties as the regular likelihood.
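The same toy counting model as above, profiled instead of marginalized (a sketch; the 68% interval uses the Δ(−2 ln PL) = 1 rule):

```python
import numpy as np

n_obs, nu0, dnu0 = 16, 1.0, 0.1
nu = np.linspace(0.6, 1.4, 400)

def m2ln_profile(theta):
    # -2 ln L'(theta, nu): Poisson part + Gaussian constraint G(nu), minimized in nu
    m2 = 2 * (theta * nu - n_obs * np.log(theta * nu)) + ((nu - nu0) / dnu0) ** 2
    return m2.min()

thetas = np.linspace(5.0, 35.0, 300)
pl = np.array([m2ln_profile(t) for t in thetas])
inside = thetas[pl - pl.min() < 1.0]
print(f"theta = {thetas[pl.argmin()]:.1f}, "
      f"68% interval ~ [{inside.min():.1f}, {inside.max():.1f}]")
```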

Page 25: Probability and Statistics Basic concepts II (from a physicist point of view)

Non-parametric estimation

Directly estimating the probability density function. Used for:
• Likelihood ratio discriminant
• Separating power of variables
• Data/MC agreement
• ...

Frequency table: for a sample {xi}, i = 1..n,
1. Define successive intervals (bins) Ck = [ak, ak+1[
2. Count the number of events nk in Ck

Histogram: graphical representation of the frequency table,
$$h(x) = n_k \ \text{if}\ x \in C_k$$

Page 26: Probability and Statistics Basic concepts II (from a physicist point of view)

Histogram


N/Z for stable heavy nuclei:
1.321, 1.357, 1.392, 1.410, 1.428, 1.446, 1.464, 1.421, 1.438, 1.344, 1.379, 1.413, 1.448, 1.389, 1.366, 1.383, 1.400, 1.416, 1.433, 1.466, 1.500, 1.322, 1.370, 1.387, 1.403, 1.419, 1.451, 1.483, 1.396, 1.428, 1.375, 1.406, 1.421, 1.437, 1.453, 1.468, 1.500, 1.446, 1.363, 1.393, 1.424, 1.439, 1.454, 1.469, 1.484, 1.462, 1.382, 1.411, 1.441, 1.455, 1.470, 1.500, 1.449, 1.400, 1.428, 1.442, 1.457, 1.471, 1.485, 1.514, 1.464, 1.478, 1.416, 1.444, 1.458, 1.472, 1.486, 1.500, 1.465, 1.479, 1.432, 1.459, 1.472, 1.486, 1.513, 1.466, 1.493, 1.421, 1.447, 1.460, 1.473, 1.486, 1.500, 1.526, 1.480, 1.506, 1.435, 1.461, 1.487, 1.500, 1.512, 1.538, 1.493, 1.450, 1.475, 1.500, 1.512, 1.525, 1.550, 1.506, 1.530, 1.487, 1.512, 1.524, 1.536, 1.518, 1.577, 1.554, 1.586, 1.586

Page 27: Probability and Statistics Basic concepts II (from a physicist point of view)

Histogram
Statistical description: the nk are multinomial random variables, with parameters
$$n = \sum_k n_k,\qquad p_k = P(x \in C_k) = \int_{C_k} f_X(x)\,dx$$
$$\mu_{n_k} = n p_k,\qquad \sigma^2_{n_k} = n p_k (1 - p_k) \underset{p_k \ll 1}{\approx} n p_k,\qquad \mathrm{Cov}(n_k, n_r) = -n p_k p_r \approx 0$$

For a large sample:
$$\frac{n_k}{n} \xrightarrow[n \to \infty]{} p_k$$
For small classes (width δ):
$$p_k = \int_{C_k} f_X(x)\,dx \xrightarrow[\delta \to 0]{} \delta\, f(x_k)$$
So finally:
$$f(x) = \lim_{n \to \infty,\ \delta \to 0} \frac{1}{n\delta}\, h(x)$$

The histogram is an estimator of the probability density. Each bin can be described by a Poisson density; the 1σ error on nk is then
$$\Delta n_k = \hat\sigma_{n_k} = \sqrt{n_k}$$
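The histogram-as-density-estimator, with its √nk Poisson errors, in a few lines (toy Gaussian sample):

```python
import numpy as np

x = np.random.default_rng(5).normal(0.0, 1.0, size=1000)   # toy sample

counts, edges = np.histogram(x, bins=20)
delta = edges[1] - edges[0]                  # class width
f_hat  = counts / (len(x) * delta)           # f(x) ~ h(x) / (n * delta)
df_hat = np.sqrt(counts) / (len(x) * delta)  # Poisson 1-sigma error per bin

centers = 0.5 * (edges[:-1] + edges[1:])
for c, f, df in zip(centers[:3], f_hat[:3], df_hat[:3]):
    print(f"x = {c:+.2f}   f_hat = {f:.3f} +/- {df:.3f}")
```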

Page 28: Probability and Statistics Basic concepts II (from a physicist point of view)

Kernel density estimators
The histogram is a step function -> sometimes a smoother estimator is needed. One possible solution: the kernel density estimator. Attribute to each point of the sample a "kernel" function k(u), with
$$u_i = \frac{x - x_i}{w},\qquad k(-u) = k(u),\qquad \int k(u)\,du = 1$$

Triangle kernel: $k(u) = 1 - |u|$ for $-1 \le u \le 1$
Parabolic kernel: $k(u) = \frac34\,(1 - u^2)$ for $-1 \le u \le 1$
Gaussian kernel: $k(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$
...

w = kernel width, similar to the bin width of the histogram.

The pdf estimator is:
$$\hat f(x) = \frac{1}{nw}\sum_i k(u_i) = \frac{1}{nw}\sum_i k\!\left(\frac{x - x_i}{w}\right)$$
(the 1/w keeps $\hat f$ normalized to 1).

Rem: for a multidimensional pdf:
$$u_i^2 = \sum_k \left(\frac{x^{(k)} - x_i^{(k)}}{w^{(k)}}\right)^{\!2}$$
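A direct implementation of the Gaussian-kernel estimator (a sketch; the sample and the width w are arbitrary choices):

```python
import numpy as np

def kde(x_eval, sample, w):
    """f_hat(x) = 1/(n*w) * sum_i k((x - x_i)/w), Gaussian kernel."""
    u = (x_eval[:, None] - sample[None, :]) / w
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=1) / (len(sample) * w)

sample = np.random.default_rng(2).normal(0.0, 1.0, 300)
grid = np.linspace(-4.0, 4.0, 9)
print(np.round(kde(grid, sample, w=0.4), 3))   # smooth estimate of the pdf
```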

Page 29: Probability and Statistics Basic concepts II (from a physicist point of view)

Kernel density estimators
If the estimated density is normal, the optimal width is
$$w = \sigma \left(\frac{4}{(d+2)\,n}\right)^{\frac{1}{d+4}}$$
with n the sample size and d the dimension (for d = 1 this gives $w = \sigma\,(4/3n)^{1/5} \approx 1.06\,\sigma\,n^{-1/5}$).

As for the histogram binning, there is no generic result: try and see.

Page 30: Probability and Statistics Basic concepts II (from a physicist point of view)

Statistical Tests
Statistical tests aim at:

• Checking the compatibility of a dataset {xi} with a given distribution

• Checking the compatibility of two datasets {xi}, {yi}: are they drawn from the same distribution?

• Comparing different hypotheses: background vs signal+background

In every case:
• build a statistic that quantifies the agreement with the hypothesis
• convert it into a probability of compatibility/incompatibility: the p-value

Page 31: Probability and Statistics Basic concepts II (from a physicist point of view)

Pearson test
Test for binned data: use the Poisson limit of the histogram.

• Sort the sample into k bins Ci, with counts ni.

• Compute the probability of each class: pi = ∫Ci f(x)dx.

• The test statistic compares, for each bin, the deviation of the observation (data) from the expected (Poisson) mean to the theoretical (Poisson) standard deviation:
$$\chi^2 = \sum_{i\,\in\,\mathrm{bins}} \frac{(n_i - n p_i)^2}{n p_i}$$

Then χ² follows (asymptotically) a khi-2 (χ²) law with k − 1 degrees of freedom (1 constraint: Σni = n).

p-value: the probability of doing worse,
$$p\text{-value} = \int_{\chi^2}^{+\infty} f_{\chi^2}(x;\,k-1)\,dx$$
For a "good" agreement χ²/(k−1) ~ 1; more precisely, the 1σ interval (~68% CL) is
$$\chi^2 \in \left[(k-1) - \sqrt{2(k-1)},\ (k-1) + \sqrt{2(k-1)}\right]$$
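A Pearson test sketch against a standard normal law (the binning and the sample are arbitrary; events outside the binned range are neglected):

```python
import numpy as np
from scipy.stats import chi2, norm

x = np.random.default_rng(4).normal(0.0, 1.0, 500)   # sample to test
edges = np.linspace(-3.0, 3.0, 11)                   # k = 10 bins
n_i, _ = np.histogram(x, bins=edges)
p_i = np.diff(norm.cdf(edges))        # p_i = integral of f over C_i
p_i /= p_i.sum()                      # renormalize to the binned range
n = n_i.sum()

chi2_val = np.sum((n_i - n * p_i) ** 2 / (n * p_i))
p_value = chi2.sf(chi2_val, len(n_i) - 1)     # k - 1 degrees of freedom
print(f"chi2/(k-1) = {chi2_val/(len(n_i)-1):.2f}, p-value = {p_value:.3f}")
```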

Page 32: Probability and Statistics Basic concepts II (from a physicist point of view)

Kolmogorov-Smirnov test
Test for unbinned data: compare the sample cumulative density function to the tested one.

Sample pdf and cdf (ordered sample):
$$f_S(x) = \frac{1}{n}\sum_i \delta(x - x_i),\qquad F_S(x) = \begin{cases} 0 & x < x_1 \\ k/n & x_k \le x < x_{k+1} \\ 1 & x \ge x_n \end{cases}$$

Then the Kolmogorov statistic is the largest deviation:
$$D_n = \sup_x \big|F_S(x) - F(x)\big|$$

The test distribution has been computed by Kolmogorov:
$$P(\sqrt n\,D_n < \beta) = 1 - 2\sum_{r=1}^{\infty} (-1)^{r-1}\, e^{-2 r^2 \beta^2}$$

[0; β] defines a confidence interval for Dn:
β = 0.9584/√n for 68.3% CL
β = 1.3754/√n for 95.4% CL
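scipy ships the Kolmogorov distribution, so the test is one call (toy sample; for an exponential of rate λ, scipy's scale parameter is 1/λ):

```python
import numpy as np
from scipy.stats import kstest

x = np.random.default_rng(6).exponential(scale=2.5, size=120)   # toy sample

# Test against f(x) = lambda * exp(-lambda * x) with lambda = 0.4
res = kstest(x, "expon", args=(0.0, 1.0 / 0.4))   # (loc, scale) = (0, 1/lambda)
print(f"Dn = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")
```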

Page 33: Probability and Statistics Basic concepts II (from a physicist point of view)

Example
Test compatibility with an exponential law:

0.008, 0.036, 0.112, 0.115, 0.133, 0.178, 0.189, 0.238, 0.274, 0.323, 0.364, 0.386, 0.406, 0.409, 0.418, 0.421, 0.423, 0.455, 0.459, 0.496, 0.519, 0.522, 0.534, 0.582, 0.606, 0.624, 0.649, 0.687, 0.689, 0.764, 0.768, 0.774, 0.825, 0.843, 0.921, 0.987, 0.992, 1.003, 1.004, 1.015, 1.034, 1.064, 1.112, 1.159, 1.163, 1.208, 1.253, 1.287, 1.317, 1.320, 1.333, 1.412, 1.421, 1.438, 1.574, 1.719, 1.769, 1.830, 1.853, 1.930, 2.041, 2.053, 2.119, 2.146, 2.167, 2.237, 2.243, 2.249, 2.318, 2.325, 2.349, 2.372, 2.465, 2.497, 2.553, 2.562, 2.616, 2.739, 2.851, 3.029, 3.327, 3.335, 3.390, 3.447, 3.473, 3.568, 3.627, 3.718, 3.720, 3.814, 3.854, 3.929, 4.038, 4.065, 4.089, 4.177, 4.357, 4.403, 4.514, 4.771, 4.809, 4.827, 5.086, 5.191, 5.928, 5.952, 5.968, 6.222, 6.556, 6.670, 7.673, 8.071, 8.165, 8.181, 8.383, 8.557, 8.606, 9.032, 10.482, 14.174

$$f(x) = \lambda\, e^{-\lambda x},\qquad \lambda = 0.4$$

Dn = 0.069

p-value = 0.617

1σ : [0, 0.0875]

Page 34: Probability and Statistics Basic concepts II (from a physicist point of view)

Hypothesis testing
Two exclusive hypotheses H0 and H1:
- which one is the most compatible with data?
- how incompatible is the other one?
P(data|H0) vs P(data|H1).

Build a statistic and define a critical region w:
- if the observation falls into w: accept H1
- else: accept H0

Size of the test (probability of wrongly accepting H1):
$$\alpha = \int_w \mathcal L(x\,|\,H_0)\,dx$$
Power of the test (probability of rightly accepting H1):
$$1 - \beta = \int_w \mathcal L(x\,|\,H_1)\,dx$$

Neyman-Pearson lemma: the optimal statistic for testing hypotheses is the likelihood ratio; the critical region is
$$\lambda = \frac{\mathcal L(x\,|\,H_0)}{\mathcal L(x\,|\,H_1)} \le k_\alpha$$

Page 35: Probability and Statistics Basic concepts II (from a physicist point of view)

CLb and CLs

Two hypotheses, for a counting experiment:
- background only: expect 10 events
- signal + background: expect 15 events
You observe 16 events.

CLb = confidence in the background hypothesis (power of the test).
Discovery: 1 − CLb < 5.7×10⁻⁷

CLs+b = confidence in the signal+background hypothesis (size of the test).
Rejection: CLs+b < 5×10⁻²

Test for signal (non-standard): CLs = CLs+b / CLb

Here: CLb = 1 − 0.049 = 0.951 and CLs+b = 0.663, so CLs = 0.697.
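These numbers can be reproduced with Poisson tails (a sketch; note the discrete-count conventions: 1 − CLb counts n ≥ n_obs, CLs+b counts n ≤ n_obs):

```python
from scipy.stats import poisson

b, sb, n_obs = 10, 15, 16

one_minus_clb = poisson.sf(n_obs - 1, b)   # P(n >= 16 | b) ~ 0.049
cl_sb = poisson.cdf(n_obs, sb)             # P(n <= 16 | s+b) ~ 0.664
cl_s = cl_sb / (1.0 - one_minus_clb)

print(f"1-CLb = {one_minus_clb:.3f}, CLs+b = {cl_sb:.3f}, CLs = {cl_s:.3f}")
```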