Cohen Empirical Methods CS650
Review: Population, sample, and sampling distributions
[Figure: a population distribution, x-axis marked at 0 and 1]
A population with mean µ and standard deviation σ
For instance, µ = 0, σ = 1
Sample 1, N = 30: interquartile range = 1.25
Sample 2, N = 30: interquartile range = 1.7
Sample 100000000000: interquartile range = 0.65
The sampling distribution of the interquartile range for samples of size N = 30
How’s your IQ?
• Suppose the population IQ is a normal distribution with mean 100 and standard deviation 20.
• The mean IQ in this class, 23 students, is 130.
• Should we reject the null hypothesis that this class is no different in IQ from the population?
The Logic
• Our result: R = 130
• Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100
• If the result is very unlikely under Ho, if Pr(R=130 | Π = 100) ≤ α, then we are inclined to reject Ho.
• Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=130 | Π = 100)
• Our residual uncertainty that Ho might be right is less than or equal to α
Calculate p = Pr(R=130 | Π = 100)
Find the sampling distribution of R for N = 23
• Since we know the population parameters (normal, mean = 100, standard deviation = 20) we can get the sampling distribution by Monte Carlo sampling:
[Figure: histogram of the simulated sampling distribution of the mean; x-axis from 90 to 110, y-axis from 10 to 30]
(defun sampling-distribution (n mean std k)
  "N is the sample size, MEAN and STD are the parameters
of a normal distribution, K is the size (number of samples)
of the sampling distribution."
  (loop repeat k
        collect (mean (sample-normal-to-list mean std n))))
• The probability of getting a sample of size 23 with mean 130 by random sampling from a population with mean 100 and standard deviation 20 is virtually zero.
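[Figure: the same histogram with the x-axis extended to 130; the observed mean of 130 lies far to the right of the simulated distribution]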
Another way to write the code:
(defun sampling-distribution (n mean std r k)
  ;; Count how many of K simulated sample means exceed the result R.
  (loop repeat k
        counting (> (mean (sample-normal-to-list mean std n)) r)))

(sampling-distribution 23 100 20 130 1000) => 0
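The slide code calls MEAN and SAMPLE-NORMAL-TO-LIST without defining them. A minimal self-contained sketch follows, assuming nothing about how the course library implements them: the definitions below are stand-ins written for illustration (the normal sampler uses the Box-Muller transform), and COUNT-EXTREME-SAMPLES is a hypothetical name chosen so that it does not clash with SAMPLING-DISTRIBUTION.

```lisp
(defun mean (numbers)
  "Arithmetic mean of a non-empty list of numbers."
  (/ (reduce #'+ numbers) (length numbers)))

(defun sample-normal (mean std)
  "One draw from a normal distribution, via the Box-Muller transform."
  (let ((u1 (- 1d0 (random 1d0)))   ; lies in (0, 1], so LOG is safe
        (u2 (random 1d0)))
    (+ mean (* std (sqrt (* -2 (log u1))) (cos (* 2 pi u2))))))

(defun sample-normal-to-list (mean std n)
  "A list of N independent draws from a normal distribution."
  (loop repeat n collect (sample-normal mean std)))

(defun count-extreme-samples (n mean std r k)
  "Of K simulated samples of size N, count those whose mean exceeds R."
  (loop repeat k
        count (> (mean (sample-normal-to-list mean std n)) r)))

;; Results vary from run to run because the samples are random.
(count-extreme-samples 23 100 20 130 1000)   ; almost certainly 0
(count-extreme-samples 23 100 20 108 1000)   ; typically a few dozen
```

Dividing the count by K gives a Monte Carlo estimate of p. With K = 1000 the estimate cannot resolve probabilities much smaller than .001, which is why the result for R = 130 is simply 0.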
Parametric statistical inference
• Testing hypotheses by simulating the process of sampling is cool but not always necessary
• The probability of tossing 15 heads in 20 with a fair coin can be worked out exactly
• The probability that a sample from a population has a particular mean can be estimated
• However, theory tells us about the sampling distributions of very few statistics; for the rest, simulation works great
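As an illustration of the coin-toss case, the binomial formula can be evaluated directly. This is a sketch, not part of the original slides; CHOOSE and BINOMIAL-PROBABILITY are hypothetical helper names. Passing the rational 1/2 rather than 0.5 keeps the arithmetic exact.

```lisp
(defun choose (n k)
  "Binomial coefficient C(N, K), computed with exact integer arithmetic."
  (let ((result 1))
    (loop for i from 1 to k
          do (setf result (/ (* result (+ (- n k) i)) i)))
    result))

(defun binomial-probability (k n p)
  "Probability of exactly K successes in N independent trials,
each succeeding with probability P."
  (* (choose n k) (expt p k) (expt (- 1 p) (- n k))))

;; Exactly 15 heads in 20 tosses of a fair coin.
(binomial-probability 15 20 1/2)             ; => 969/65536, about 0.0148

;; 15 or more heads, the tail probability a test would use.
(loop for k from 15 to 20
      sum (binomial-probability k 20 1/2))   ; => 5425/262144, about 0.0207
```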
Central Limit Theorem
• The sampling distribution of the mean of samples of size N drawn from a population with mean µ and standard deviation σ approaches a normal distribution with mean µ and standard deviation σ / √N as N becomes large
• Good news! We know the sampling distribution of the mean and can estimate the probability of sample results!
The Logic
• Our result: R = 130
• Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100
• If the result is very unlikely under Ho, if Pr(R=130 | Π = 100) ≤ α, then we are inclined to reject Ho.
• Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=130 | Π = 100)
• The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √23 = 4.17
• So our sample result is 30 / 4.17 = 7.2 standard deviations above the mean of the sampling distribution!
Standard error: The standard deviation of the sampling distribution
[Figure: the sampling distribution under Ho, centered at 100 with one standard error marked at 104; the sample result, 130, lies far to the right]
Standard error of the mean: under Ho: Π = 100, the sampling distribution is normal, its mean is 100, its standard deviation is 20 / √23 = 4.17
The standard error is 4.17
The sample result is 30 / 4.17 = 7.2 standard error units above the mean under Ho
About 95% of a normal distribution lies within two standard deviations of the mean. How probable is our sample result?
Try it again with a less extreme result
• Our result: R = 108
• Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100
• If the result is very unlikely under Ho, if Pr(R=108 | Π = 100) ≤ α, then we are inclined to reject Ho.
• Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=108 | Π = 100)
• The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √23 = 4.17
• So our sample result is 8 / 4.17 = 1.92 standard errors above the mean of the sampling distribution.
p values
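[Figure: the sampling distribution under Ho, centered at 100 with one s.e. marked at 104 and the sample result marked at 108]
Under Ho: Π = 100, the sampling distribution is normal, its mean is 100, its standard deviation is 20 / √23 = 4.17
The sample result, R=108, is 1.92 standard error units above the mean under Ho.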
Now it isn’t so obvious that we should reject Ho.
How can we find p = Pr(R=108 | Π = 100) ?
State the result in standard error units and look up its probability in a table.
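Standardizing – subtract the mean, divide by the standard error
[Figure: the sampling distribution before standardizing, with mean 100, one s.e. marked at 104, and the sample result at 108]
Under Ho: Π = 100, the sampling distribution is normal, its mean is 100, its standard deviation is 20 / √23 = 4.17, and the sample result is 108
[Figure: the sampling distribution after standardizing, with mean 0, one s.e. marked at 1, and the sample result at 1.92]
Under Ho: Π = 0, the sampling distribution is normal, its mean is 0, its standard deviation is 1.0, the sample result is (108 - 100) / (20 / √23) = 1.92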
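Z scores or standard scores – subtract the mean, divide by the standard error
[Figure: the sampling distribution before standardizing (mean 100, one s.e. marked at 104, sample result 108) and after (mean 0, one s.e. marked at 1, sample result 1.92)]
$$Z = \frac{\bar{x} - \mu}{\text{s.e.}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}} = \frac{108 - 100}{4.17} = 1.92$$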
The Z test
Z is the number of standard error units the sample mean is from the mean of the sampling distribution under the null hypothesis.
If Z ≥ 1.645 then the sample result has p ≤ .05 probability (one-tailed) given the null hypothesis
If Z ≥ 2.33 then the sample result has p ≤ .01 probability (one-tailed) given the null hypothesis
$$Z = \frac{\bar{x} - \mu}{\text{s.e.}} = \frac{108 - 100}{20 / \sqrt{23}} = 1.92$$
The Z test
• Our result: R = 108
• Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100
• If the result is very unlikely under Ho, if Pr(R=108 | Π = 100) ≤ α, then we are inclined to reject Ho.
• Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=108 | Π = 100)
• The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √23 = 4.17
• So our sample result is 8 / 4.17 = 1.92 standard errors above the mean of the sampling distribution
• Equivalently, Z = (108 - 100) / 4.17 = 1.92
• p = Pr(R=108 | Π = 100) = Pr(Z ≥ 1.92) ≈ .0274
• Since p > α = .01, do not reject Ho.
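The table lookup in the steps above can be replaced by a short computation. The sketch below is an illustration only: Common Lisp has no built-in normal distribution function, so STANDARD-NORMAL-UPPER-TAIL uses the polynomial approximation from Abramowitz and Stegun (formula 26.2.17), and all three function names are hypothetical.

```lisp
(defun standard-normal-upper-tail (z)
  "Approximate Pr(Z >= z) for a standard normal variable
(Abramowitz and Stegun, formula 26.2.17)."
  (if (< z 0)
      (- 1 (standard-normal-upper-tail (- z)))
      (let* ((z (float z 1d0))
             (u (/ 1 (+ 1 (* 0.2316419d0 z))))
             ;; Fifth-degree polynomial in U, evaluated by Horner's rule.
             (poly (* u (+ 0.319381530d0
                           (* u (+ -0.356563782d0
                                   (* u (+ 1.781477937d0
                                           (* u (+ -1.821255978d0
                                                   (* u 1.330274429d0)))))))))))
        ;; Normal density at z, times the polynomial.
        (* (/ (exp (* -0.5d0 z z)) (sqrt (* 2 pi))) poly))))

(defun z-score (sample-mean mu sigma n)
  "Distance of SAMPLE-MEAN from MU in standard error units."
  (/ (- sample-mean mu)
     (/ sigma (sqrt (float n 1d0)))))

(defun z-test (sample-mean mu sigma n alpha)
  "Return Z, the one-tailed p value, and whether to reject Ho at level ALPHA."
  (let* ((z (z-score sample-mean mu sigma n))
         (p (standard-normal-upper-tail z)))
    (values z p (<= p alpha))))

(z-test 108 100 20 23 0.01)   ; Z about 1.92, p about 0.0275, reject? NIL
(z-test 130 100 20 23 0.01)   ; Z about 7.19, p vanishingly small, reject? T
```

The p value here comes out near .0275 rather than .0274 because the code keeps Z unrounded (about 1.918) instead of rounding it to 1.92 before the lookup.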
You do it:
• A sample of size 25 has mean 8. Test the hypothesis that the sample is drawn from a population with mean 12, standard deviation 10.
$$Z = \frac{8 - 12}{10 / \sqrt{25}} = -2$$
Central limit theorem demo
[Figure: histogram of Var (Dataset-3), the population; x-axis from -10 to 1]
[Figure: three histograms of the sampling distribution of the mean: N = 20, Std = .25; N = 30, Std = .21; N = 50, Std = .16]
(loop repeat 1000 collect (mean (sample-from-population n)))
Std(population) = 1.11
s.e.(20) = 1.11 / √ 20 = .248
s.e.(30) = 1.11 / √ 30 = .203
s.e.(50) = 1.11 / √ 50 = .157
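The demo can be reproduced in outline with a stand-in population. The sketch below does not use the slide's Dataset-3, which is not available here. Instead SAMPLE-FROM-POPULATION is assumed to draw from a left-skewed distribution (the negative of an exponential, with mean -1 and standard deviation 1), and the other names are likewise hypothetical.

```lisp
(defun mean (numbers)
  "Arithmetic mean of a non-empty list of numbers."
  (/ (reduce #'+ numbers) (length numbers)))

(defun standard-deviation (numbers)
  "Sample standard deviation (N - 1 in the denominator)."
  (let ((m (mean numbers)))
    (sqrt (/ (loop for x in numbers sum (expt (- x m) 2))
             (1- (length numbers))))))

(defun sample-from-population (n)
  "N draws from a left-skewed stand-in population: the negative of an
exponential variable, which has mean -1 and standard deviation 1."
  (loop repeat n collect (log (- 1d0 (random 1d0)))))

(defun std-of-sample-means (n k)
  "Standard deviation of the means of K samples of size N."
  (standard-deviation
   (loop repeat k collect (mean (sample-from-population n)))))

;; Compare the simulated value with sigma / sqrt(N), where sigma = 1.
(loop for n in '(20 30 50)
      collect (list n (std-of-sample-means n 1000) (/ 1 (sqrt n))))
```

Each triple holds N, the simulated standard deviation of the sample means, and the value the central limit theorem predicts. The two should agree to about two decimal places, and the exact figures change from run to run.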
Three components of all test statistics
$$Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$$
Effect size: x̄ − µ
Background variance: σ
Sample size: N
You can make any Z score significant with a big enough sample, but you shouldn’t. Always try to control variance before increasing N.
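A worked example of the trade-off, using the IQ numbers from the earlier slides (sample mean 108, µ = 100, σ = 20, N = 23). The alternative values N = 92 and σ = 10 are hypothetical, chosen only for illustration:

```latex
Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}:
\qquad
\frac{108 - 100}{20 / \sqrt{23}} \approx 1.92,
\qquad
\frac{108 - 100}{20 / \sqrt{92}} \approx 3.84,
\qquad
\frac{108 - 100}{10 / \sqrt{23}} \approx 3.84
```

With the effect size fixed at 8 points, quadrupling the sample size and halving the standard deviation raise Z by exactly the same factor of two.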
Parametric and computer-intensive hypothesis testing
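[Figure: the parametric sampling distribution, centered at 100 with one standard deviation marked at 104 and the sample result at 130]
Under Ho: Π = 100, the mean of the sampling distribution is 100, the standard deviation is 20 / √23 = 4.17
[Figure: histogram of the simulated sampling distribution; x-axis from 90 to 130, y-axis from 10 to 30]
Empirically (by simulation) this distribution has a mean of 100.05 and a standard deviation of 4.38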
We do not know the sampling distribution of most statistics – but we can estimate them empirically!
(defun sampling-distribution (n mean std k)
  "N is the sample size, MEAN and STD are the parameters
of a normal distribution, K is the size (number of samples)
of the sampling distribution."
  (loop repeat k
        collect (mean (sample-normal-to-list mean std n))))
Other statistics to use in place of mean: median, interquartile-range, trimmed-mean, median-divided-by-mom’s-age
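One way to swap in a different statistic is to pass it as a function argument. The sketch below is an assumption about how that might look, not the course library's interface: SAMPLING-DISTRIBUTION-OF, QUANTILE, and this INTERQUARTILE-RANGE are hypothetical, the quantiles use simple linear interpolation, and the normal sampler is a Box-Muller stand-in for SAMPLE-NORMAL-TO-LIST.

```lisp
(defun sample-normal-to-list (mean std n)
  "A list of N draws from a normal distribution (Box-Muller transform)."
  (loop repeat n
        collect (let ((u1 (- 1d0 (random 1d0)))   ; in (0, 1]
                      (u2 (random 1d0)))
                  (+ mean (* std (sqrt (* -2 (log u1)))
                             (cos (* 2 pi u2)))))))

(defun quantile (sorted-vector q)
  "Linearly interpolated Q-th quantile (0 <= Q <= 1) of a sorted vector."
  (let* ((last-index (1- (length sorted-vector)))
         (pos (* q last-index))
         (lower (floor pos))
         (upper (min (1+ lower) last-index))
         (fraction (- pos lower)))
    (+ (* (- 1 fraction) (aref sorted-vector lower))
       (* fraction (aref sorted-vector upper)))))

(defun interquartile-range (numbers)
  "Third quartile minus first quartile of a list of numbers."
  (let ((sorted (sort (coerce numbers 'vector) #'<)))
    (- (quantile sorted 3/4) (quantile sorted 1/4))))

(defun sampling-distribution-of (statistic n mean std k)
  "K values of STATISTIC, each computed on a fresh normal sample of size N."
  (loop repeat k
        collect (funcall statistic (sample-normal-to-list mean std n))))

;; Sampling distribution of the interquartile range for N = 30, mu = 0, sigma = 1.
(sampling-distribution-of #'interquartile-range 30 0 1 1000)
```

Any function that maps a list of numbers to a single number can be passed as STATISTIC, which is what makes the simulation approach apply to statistics with no known sampling distribution.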
Some issues for parametric and computer-intensive tests
• Z is fine if you know σ (recall, Z = (x̄ - µ) / (σ / √N)), but what if you don’t? Estimate σ by the sample standard deviation s, and for smaller samples run t tests.
• Monte Carlo tests are fine if you know the parameters of the population from which samples are drawn, but what if you don’t? Estimate these parameters from the sample and run bootstrap or randomization tests.
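For reference, the standard one-sample t statistic has the same three components as Z, with the sample standard deviation s in place of σ. This is the textbook definition, stated here as a reminder rather than taken from these slides:

```latex
t = \frac{\bar{x} - \mu}{s / \sqrt{N}},
\qquad
s = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2}
```

Under Ho, and for samples from a normal population, t follows a t distribution with N − 1 degrees of freedom, which approaches the standard normal distribution as N grows.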