Review: Population, sample, and sampling distributions

Cohen Empirical Methods CS650

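[Figure: a population with mean µ and standard deviation σ (for instance, µ = 0, σ = 1). Samples of size N = 30 are drawn from it repeatedly and the interquartile range of each is recorded: Sample 1, interquartile range = 1.25; Sample 2, interquartile range = 1.7; ...; Sample 100000000000, interquartile range = 0.65. The distribution of these values is the sampling distribution of the interquartile range for samples of size N = 30.]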


How’s your IQ?

Ø Suppose the population IQ is a normal distribution with mean 100 and standard deviation 20.

Ø The mean IQ in this class, 23 students, is 130.

Ø Should we reject the null hypothesis that this class is no different in IQ from the population?


The Logic

Ø Our result: R = 130

Ø Assume Ho: Π = 100, that is, this class is a random sample drawn from the population of people with mean IQ 100

Ø If the result is very unlikely under Ho, that is, if Pr(R=130 | Π = 100) ≤ α, then we are inclined to reject Ho.

Ø Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=130 | Π = 100), read as the probability of a sample mean at least as large as 130

Ø Our residual uncertainty that Ho might be right is less than or equal to α


Calculate p = Pr(R=130 | Π = 100)
Find the sampling distribution of R for N = 23

Ø Since we know the population parameters (normal, mean = 100, standard deviation = 20) we can get the sampling distribution by Monte Carlo sampling:

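[Figure: histogram of the simulated sampling distribution of the mean; horizontal axis from 90 to 110, vertical axis (counts) from 10 to 30.]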

(defun sampling-distribution (n mean std k)
  "N is the sample size, MEAN and STD are the parameters
of a normal distribution, K is the size (number of samples)
of the sampling distribution."
  ;; MEAN and SAMPLE-NORMAL-TO-LIST are helpers defined elsewhere (not shown here).
  (loop repeat k
        collect (mean (sample-normal-to-list mean std n))))
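The slide code relies on two helpers, mean and sample-normal-to-list, that are not shown. The following is a minimal self-contained sketch, assuming plausible definitions for both; the Box-Muller sampler and the name sample-normal are illustrative choices, not necessarily what the course library uses.

```lisp
;;; Hypothetical stand-ins for the helpers the slide code calls.

(defun mean (xs)
  "Arithmetic mean of a non-empty list of numbers."
  (/ (reduce #'+ xs) (length xs)))

(defun sample-normal (mu sigma)
  "One draw from a normal distribution with mean MU and standard
deviation SIGMA, using the Box-Muller transform."
  (let ((u1 (- 1.0d0 (random 1.0d0)))   ; in (0, 1], so LOG is safe
        (u2 (random 1.0d0)))
    (+ mu (* sigma
             (sqrt (* -2 (log u1)))
             (cos (* 2 pi u2))))))

(defun sample-normal-to-list (mean std n)
  "A list of N independent draws from a normal distribution."
  (loop repeat n collect (sample-normal mean std)))

(defun sampling-distribution (n mean std k)
  "K means of samples of size N, as on the slide."
  (loop repeat k
        collect (mean (sample-normal-to-list mean std n))))
```

Calling (sampling-distribution 23 100 20 1000) returns a list of 1000 simulated class means; their average should land close to 100.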


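Ø The probability of getting a sample of size 23 with mean 130 by random sampling from a population with mean 100 and standard deviation 20 is virtually zero.

[Figure: histogram of the simulated sampling distribution (horizontal axis from 90 to 110) with the observed result, 130, marked far to the right.]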


Another way to write the code:

(defun sampling-distribution (n mean std r k)
  ;; Count how many of K simulated sample means exceed the observed result R.
  (loop repeat k
        counting (> (mean (sample-normal-to-list mean std n)) r)))

(sampling-distribution 23 100 20 130 1000)
=> 0
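The count above becomes an estimate of p once it is divided by the number of simulated samples. A minimal sketch, assuming the same hypothetical helper definitions (mean and a Box-Muller sample-normal-to-list); the name empirical-p is introduced here for illustration only.

```lisp
(defun mean (xs)
  "Arithmetic mean of a non-empty list of numbers."
  (/ (reduce #'+ xs) (length xs)))

(defun sample-normal-to-list (mean std n)
  "N Box-Muller draws from a normal distribution (assumed helper)."
  (loop repeat n
        collect (+ mean
                   (* std
                      (sqrt (* -2 (log (- 1.0d0 (random 1.0d0)))))
                      (cos (* 2 pi (random 1.0d0)))))))

(defun empirical-p (n mean std r k)
  "Proportion of K simulated sample means (sample size N) that are
at least as large as the observed result R."
  (/ (loop repeat k
           count (>= (mean (sample-normal-to-list mean std n)) r))
     k))
```

With the class result, (empirical-p 23 100 20 130 1000) will almost certainly return 0. The estimate cannot resolve probabilities much smaller than 1/K, so "0" here means "less than roughly one in a thousand", not exactly zero.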


Parametric statistical inference

Ø Testing hypotheses by simulating the process of sampling is cool but not always necessary

Ø The probability of tossing 15 heads in 20 with a fair coin can be worked out exactly

Ø The probability that a sample from a population has a particular mean can be estimated

Ø However, theory tells us about the sampling distributions of very few statistics; for the rest, simulation works great
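As a worked instance of the coin example, the binomial distribution gives the exact values. The first line is the probability of exactly 15 heads; the second, which is the quantity a one-tailed test would use, is the probability of 15 or more.

```latex
\Pr(X = 15) = \binom{20}{15}\left(\tfrac{1}{2}\right)^{20}
            = \frac{15504}{1048576} \approx 0.0148

\Pr(X \ge 15) = \sum_{k=15}^{20} \binom{20}{k}\left(\tfrac{1}{2}\right)^{20}
              = \frac{21700}{1048576} \approx 0.0207
```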


Central Limit Theorem

Ø The sampling distribution of the mean of samples of size N drawn from a population with mean µ and standard deviation σ approaches a normal distribution with mean µ and standard deviation σ / √N as N becomes large

Ø Good news! We know the sampling distribution of the mean and can estimate the probability of sample results!


The Logic

Ø Our result: R = 130

Ø Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100

Ø If the result is very unlikely under Ho, if Pr(R=130 | Π = 100) ≤ α, then we are inclined to reject Ho.

Ø Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=130 | Π = 100)

Ø The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √ 23 = 4.17

Ø So our sample result is 30 / 4.17 = 7.2 standard deviations above the mean of the sampling distribution!


Standard error: The standard deviation of the sampling distribution

[Figure: the sampling distribution of the mean under Ho, centered at 100 with one standard error marked at 104; the sample result, 130, lies far out in the right tail.]

Under Ho: Π = 100, the sampling distribution is normal, its mean is 100, and its standard deviation is 20 / √ 23 = 4.17

The standard error of the mean is 4.17

The sample result is 30 / 4.17 = 7.2 standard error units above the mean under Ho

About 95% of a normal distribution lies within two standard deviations of the mean, and about 99.7% within three. How probable is our sample result?


Try it again with a less extreme result




p = Pr(R=108 | Π = 100)

Ø The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √ 23 = 4.17

Ø So our sample result is 8 / 4.17 = 1.92 standard errors above the mean of the sampling distribution.


p values

[Figure: the sampling distribution of the mean under Ho, centered at 100 with one standard error (s.e.) marked at 104 and the sample result marked at 108.]

Under Ho: Π = 100, the sampling distribution is normal, its mean is 100, and its standard deviation is 20 / √ 23 = 4.17

The sample result, R=108, is 1.92 standard error units above the mean under Ho.

Now it isn’t so obvious that we should reject Ho.

How can we find p = Pr(R=108 | Π = 100)?

State the result in standard error units and look up its probability in a table.



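Standardizing – subtract the mean, divide by the standard error

[Figure: the sampling distribution on its original scale, with mean 100, one standard error (s.e.) marked at 104, and the sample result at 108.]

Under Ho: Π = 100, the sampling distribution is normal, its mean is 100, its standard deviation is 20 / √ 23 = 4.17, and the sample result is 108

[Figure: the same distribution on the standardized scale, with mean 0, one s.e. marked at 1, and the sample result at 1.92.]

Under Ho: Π = 0 (after standardizing), the sampling distribution is normal, its mean is 0, its standard deviation is 1.0, and the sample result is (108 - 100) / (20 / √ 23) = 1.92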


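Z scores or standard scores – subtract the mean, divide by the standard error

[Figure: the sampling distribution on its original scale (mean 100, one s.e. at 104, sample result at 108) and on the standardized scale (mean 0, one s.e. at 1, sample result at 1.92).]

$$Z = \frac{\bar{x} - \mu}{\text{s.e.}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$$

$$\frac{108 - 100}{4.17} = 1.92$$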


The Z test

Z is the number of standard error units the sample mean is from the mean of the sampling distribution under the null hypothesis.

If Z ≥ 1.645 then the sample result has p ≤ .05 probability given the null hypothesis (one-tailed)

If Z ≥ 2.33 then the sample result has p ≤ .01 probability given the null hypothesis (one-tailed; Z ≥ 1.96 corresponds to p ≤ .025)

$$Z = \frac{\bar{x} - \mu}{\text{s.e.}}$$

$$1.92 = \frac{108 - 100}{20 / \sqrt{23}}$$


The Z test




p = Pr(R=108 | Π = 100)

Ø The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √ 23 = 4.17

Ø So our sample result is 8 / 4.17 = 1.92 standard errors above the mean of the sampling distribution

Ø Equivalently, Z = (108 - 100) / 4.17 = 1.92

Ø p = Pr(R=108 | Π = 100) = Pr(Z ≥ 1.92) = .0274

Ø With α = .01, p > α, so do not reject Ho.
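Instead of a printed table, the tail probability can be approximated in code. The sketch below is one way to do it, assuming a one-tailed (upper) test; normal-upper-tail and z-test are names introduced here, and the tail area uses the Abramowitz and Stegun polynomial approximation to the complementary error function (accurate to roughly seven decimal places), since Common Lisp has no built-in normal distribution function.

```lisp
(defun normal-upper-tail (z)
  "Approximate Pr(Z >= z) for a standard normal Z, using the
Abramowitz-Stegun 7.1.26 approximation to erfc."
  (if (minusp z)
      (- 1 (normal-upper-tail (- z)))
      (let* ((x (/ z (sqrt 2d0)))
             (u (/ 1 (+ 1 (* 0.3275911d0 x))))
             ;; Horner evaluation of a1*u + a2*u^2 + ... + a5*u^5
             (poly (* u (reduce (lambda (acc c) (+ c (* u acc)))
                                '(1.061405429d0 -1.453152027d0
                                  1.421413741d0 -0.284496736d0
                                  0.254829592d0)))))
        (* 0.5d0 poly (exp (- (* x x)))))))

(defun z-test (sample-mean n mu sigma)
  "Return two values: Z and the one-tailed (upper) p value for a
sample mean of N observations, given population MU and SIGMA."
  (let ((z (/ (- sample-mean mu)
              (/ sigma (sqrt (float n 1d0))))))
    (values z (normal-upper-tail z))))
```

For the class example, (z-test 108 23 100 20) gives Z of about 1.92 and p a little above .027, in line with the table lookup; the small difference from .0274 comes from not rounding Z to two decimals first.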


You do it:

Ø A sample of size 25 has mean 8. Test the hypothesisthat the sample is drawn from a population with mean12, standard deviation 10.


$$Z = \frac{8 - 12}{10 / \sqrt{25}} = -2$$
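To finish the exercise, the Z score still has to be turned into a probability. The values below are standard normal tail areas; whether a one-tailed or two-tailed test is intended is not stated in the exercise, so both are shown.

```latex
\Pr(Z \le -2) \approx 0.0228 \quad \text{(one-tailed)}

\Pr(|Z| \ge 2) \approx 0.0455 \quad \text{(two-tailed)}
```

Either way the result would be rejected at α = .05 but not at α = .01.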


Central limit theorem demo

[Figure: "Histogram OF Var[Dataset-3]", the population, with values of VAR from about -10 to 1, followed by histograms of the sampling distribution of the mean for three sample sizes: N=20, Std = .25; N=30, Std = .21; N=50, Std = .16.]

(loop repeat 1000 collect (mean (sample-from-population n)))

Std(population) = 1.11

s.e.(20) = 1.11 / √ 20 = .248

s.e.(30) = 1.11 / √ 30 = .203

s.e.(50) = 1.11 / √ 50 = .157
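The demo can be reproduced with any population. The sketch below assumes a hypothetical non-normal population (the negative of a unit exponential, which has mean -1 and standard deviation 1), because the actual dataset behind the slide is not available; mean, standard-deviation, sample-from-population, and clt-demo are all defined here for illustration.

```lisp
(defun mean (xs)
  "Arithmetic mean of a non-empty list of numbers."
  (/ (reduce #'+ xs) (length xs)))

(defun standard-deviation (xs)
  "Sample standard deviation (N - 1 in the denominator)."
  (let ((m (mean xs)))
    (sqrt (/ (loop for x in xs sum (expt (- x m) 2))
             (1- (length xs))))))

(defun sample-from-population (n)
  "N draws from a hypothetical skewed population: the negative of a
unit exponential (mean -1, standard deviation 1)."
  (loop repeat n collect (log (- 1.0d0 (random 1.0d0)))))

(defun clt-demo (n &optional (k 1000))
  "Return a list of two numbers: the observed standard deviation of K
sample means of size N, and the predicted standard error sigma/sqrt(N)
(sigma is 1 for this population)."
  (list (standard-deviation
         (loop repeat k collect (mean (sample-from-population n))))
        (/ 1 (sqrt (float n 1d0)))))
```

For example, (clt-demo 20) should return two numbers that are both near .22, and (clt-demo 50) two numbers near .14, even though the population itself is far from normal.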


Three components of all test statistics

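$$Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$$

Effect size: $\bar{x} - \mu$

Background variance: $\sigma$

Sample size: $N$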

You can make any Z score significant with a big enough sample, but you shouldn’t. Always try to control variance before increasing N.
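Rearranging the Z formula makes the role of sample size explicit. The numbers below use a hypothetical 2-point difference in mean IQ with σ = 20, chosen only for illustration.

```latex
Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}
  = \frac{\bar{x} - \mu}{\sigma} \cdot \sqrt{N}

\bar{x} - \mu = 2,\ \sigma = 20: \qquad
N = 100 \Rightarrow Z = 1.0, \quad
N = 400 \Rightarrow Z = 2.0, \quad
N = 1600 \Rightarrow Z = 4.0
```

The effect is the same tiny difference in all three cases; only N changes. Halving σ has the same effect on Z as quadrupling N.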


Parametric and computer-intensive hypothesis testing

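[Figure: the theoretical sampling distribution under Ho, centered at 100 with one standard deviation marked at 104 and the sample result at 130.]

Under Ho: Π = 100, the mean of the sampling distribution is 100 and the standard deviation is 20 / √ 23 = 4.17

[Figure: histogram of the simulated sampling distribution, horizontal axis from 90 to 110, with the sample result at 130.]

Empirically (by simulation) this distribution has a mean of 100.05 and a standard deviation of 4.38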


We do not know the sampling distribution of most statistics – but we can estimate them empirically!

(defun sampling-distribution (n mean std k)
  "N is the sample size, MEAN and STD are the parameters
of a normal distribution, K is the size (number of samples)
of the sampling distribution."
  (loop repeat k
        collect (mean (sample-normal-to-list mean std n))))

Any statistic can take the place of mean in this code: median, interquartile-range, trimmed-mean, median-divided-by-mom’s-age
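One way to make the statistic swappable is to pass it in as a function. The sketch below assumes hypothetical definitions of median and interquartile-range (quartile conventions vary; this one takes the medians of the lower and upper halves) and a Box-Muller sample-normal-to-list; sampling-distribution-of is a name introduced here.

```lisp
(defun sample-normal-to-list (mean std n)
  "N Box-Muller draws from a normal distribution (assumed helper)."
  (loop repeat n
        collect (+ mean
                   (* std
                      (sqrt (* -2 (log (- 1.0d0 (random 1.0d0)))))
                      (cos (* 2 pi (random 1.0d0)))))))

(defun median (xs)
  "Middle value of XS, or the average of the two middle values."
  (let* ((sorted (sort (copy-list xs) #'<))
         (n (length sorted))
         (mid (floor n 2)))
    (if (oddp n)
        (nth mid sorted)
        (/ (+ (nth (1- mid) sorted) (nth mid sorted)) 2))))

(defun interquartile-range (xs)
  "Median of the upper half minus median of the lower half
(one of several common quartile conventions)."
  (let* ((sorted (sort (copy-list xs) #'<))
         (half (floor (length sorted) 2)))
    (- (median (last sorted half))
       (median (subseq sorted 0 half)))))

(defun sampling-distribution-of (statistic n mean std k)
  "K values of STATISTIC, each computed on a fresh sample of size N."
  (loop repeat k
        collect (funcall statistic (sample-normal-to-list mean std n))))
```

For instance, (sampling-distribution-of #'interquartile-range 30 0 1 1000) estimates the sampling distribution of the interquartile range for samples of size 30 from a population with µ = 0 and σ = 1.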


Some issues for parametric and computer-intensive tests

Ø Z is fine if you know σ (recall, z = (x̄ - µ) / (σ / √ n)), but what if you don’t? Estimate σ with the sample standard deviation s and, for smaller samples, run t tests.

Ø Monte Carlo tests are fine if you know the parameters of the population from which samples are drawn, but what if you don’t? Estimate these parameters from the sample and run bootstrap or randomization tests.
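To give a flavor of the bootstrap idea, here is a minimal sketch of its resampling step only: the sample stands in for the unknown population, and new samples are drawn from it with replacement. The names resample and bootstrap-distribution are introduced here, and turning the resulting distribution into a full hypothesis test is not shown.

```lisp
(defun mean (xs)
  "Arithmetic mean of a non-empty list of numbers."
  (/ (reduce #'+ xs) (length xs)))

(defun resample (xs)
  "A bootstrap resample: as many draws from XS as XS has elements,
taken with replacement."
  (let* ((v (coerce xs 'vector))
         (n (length v)))
    (loop repeat n collect (aref v (random n)))))

(defun bootstrap-distribution (statistic xs k)
  "K values of STATISTIC, each computed on a bootstrap resample of XS."
  (loop repeat k collect (funcall statistic (resample xs))))
```

With a made-up sample such as '(4 8 15 16 23 42), (bootstrap-distribution #'mean '(4 8 15 16 23 42) 1000) returns 1000 resampled means whose spread estimates the standard error of the mean without assuming any population parameters.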
