I. Estimation continued… • Least Squares (OLS)
• Likelihood
II. Hypothesis Testing • t-test
one sample
two sample
paired
• non-parametric tests and their assumptions
Mann-Whitney-Wilcoxon
Frequentist (classical) approaches
• when sampling distribution is known
– Ordinary Least Squares
– Maximum Likelihood
• when sampling distribution is unknown
– Numerical Resampling
• Bootstrap
• Jackknife
• Permutation / Randomization tests
Bayesian inference - estimation
Methods for estimating parameters
Ordinary Least Squares (OLS)
• the parameter estimate that minimizes the sum of
squared differences between each value in a
sample and the parameter
in this example, the parameter is the mean
OLS estimate: min Σᵢ₌₁ⁿ (yᵢ − ȳ)²

SS = Σ d²
• OLS: identifies parameter estimate that minimizes the sum
of squared differences between each value in a sample
and the parameter
parameter value
Ordinary Least Squares (OLS)
• frequently used to estimate parameters of linear models (e.g., linear regression, y=a+bx)
• unbiased and has minimum variance when distributional assumptions are met (i.e., is a precise estimator)
• no distributional assumptions required for point estimates
• for interval estimation & hypothesis testing, OLS estimators have restrictive assumptions of normality and patterns of variance
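As a sketch of the minimization idea above (not taken from the slides), the OLS estimate of a mean can be found by evaluating the sum of squared deviations over a grid of candidate parameter values; the data values and the grid are made up purely for illustration.

```python
# Sketch: the OLS estimate of a mean minimizes the sum of squared deviations.
# Data values are hypothetical, chosen only for illustration.
y = [4.0, 5.5, 6.0, 7.5, 2.0]

def ss(theta, data):
    """Sum of squared differences between each value and the candidate parameter."""
    return sum((yi - theta) ** 2 for yi in data)

# Evaluate SS over a fine grid of candidate parameter values.
candidates = [i / 1000 for i in range(0, 10001)]  # 0.000 .. 10.000
ols_estimate = min(candidates, key=lambda t: ss(t, y))

# For the mean, the OLS solution is the sample mean.
sample_mean = sum(y) / len(y)
print(ols_estimate, sample_mean)  # both 5.0
```

For the mean, the grid search just rediscovers the analytical solution (the sample mean), which is why OLS point estimates need no distributional assumptions.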
Maximum Likelihood (ML)
• estimate of a parameter that maximizes the
likelihood function based on the observed data
likelihood function estimates likelihood of observing sample data for all possible values of the parameter
• e.g., likelihood of observing the data for all possible values of
the mean, μ, in a normally distributed population
when assumptions of specified underlying distribution
are met, ML estimators are unbiased and have
minimum variance
Maximum Likelihood (ML)
• differences between a likelihood function and a
probability distribution
in a probability distribution for a random variable, the data are variable and the parameter fixed
in a likelihood function, the sample data are fixed and
the parameter varies across all possible values
• the maximum likelihood is the value of the
parameter that best fits the observed data
constraints of ML estimators:
• requires knowing sampling distribution underlying the statistic
(e.g., normal, multinomial, etc.)
• great for large samples, biased estimates for small samples
General Likelihood function:
L(y; θ) = Πᵢ₌₁ⁿ f(yᵢ; θ)

• L is the joint probability distribution of yᵢ and θ (the
probability distribution of y for possible values of θ)
Where:
L = likelihood
y = the frequency distribution of your data
θ = some parameter you want to estimate (e.g., the mean)
yᵢ = variates of your sample
f(yᵢ; θ) = the function describing the sampling distribution (e.g.,
the equation for the normal distribution)
Log-Likelihood function:
• instead, we maximize the log-likelihood function rather than
the likelihood function because…
working with products is computationally difficult, and the probability
distribution of L is poorly known
the natural logarithm of L is easy to work with, and for large sample
sizes it is approximately χ² distributed

ln L = ln Πᵢ₌₁ⁿ f(yᵢ; θ) = Σᵢ₌₁ⁿ ln[f(yᵢ; θ)]
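The log-likelihood maximization described above can be sketched numerically; the data, the candidate grid, and the fixed σ below are hypothetical, chosen only to keep the example one-dimensional.

```python
import math

# Sketch: maximum likelihood estimate of a normal mean via the log-likelihood.
# Data and the fixed sigma are made up for illustration.
y = [4.0, 5.5, 6.0, 7.5, 2.0]
sigma = 2.0  # treated as known here so only the mean varies

def log_likelihood(mu, data, sd):
    """ln L = sum over i of ln f(yi; mu) for a normal density with fixed sd."""
    return sum(-0.5 * math.log(2 * math.pi * sd ** 2)
               - (yi - mu) ** 2 / (2 * sd ** 2) for yi in data)

candidates = [i / 1000 for i in range(0, 10001)]
mle = max(candidates, key=lambda m: log_likelihood(m, y, sigma))
print(mle)  # 5.0 -- for a normal mean, the MLE equals the sample mean
```

The result illustrates the point made in the next slide: under normality, the ML and OLS estimates of the mean coincide.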
Maximum Likelihood vs. Ordinary Least Squares
• for most population parameters, ML and OLS estimators
are the same when normality assumptions of OLS are met
exception is the variance for which ML estimator is slightly biased
unless n is large
• in balanced linear models (e.g., regression and ANOVA) for which normality assumptions hold, ML and OLS
estimators are identical
• OLS cannot be used for estimation for other distributions
(e.g., binomial and multinomial)
so generalized linear modeling (e.g., logistic regression and log-
linear models) and non-linear modeling are based on ML
estimation
Introduction to hypothesis testing
• main approach to statistical inference in biology
So far covered:
• collecting data samples
• estimating parameters from sample statistics
• calculating SEs and CIs as estimators of the reliability
of these statistics
Now, we want…
some objective way to decide whether our sample
differs from some expected distribution, or from other similarly collected samples
Hypothesis testing
Classical statistical hypothesis testing rests on disproving H0:
• must state a statistical null hypothesis (Ho)
includes all possibilities except the prediction of research hypothesis, HA
H0 is usually a hypothesis of no difference or effect
because…
(1) inductive reasoning requires verification of all possible
observations (impossible) and…
(2) we often don’t know what constitutes proof of a hypothesis
(alternative biological hypotheses are often more vague than
H0)
in contrast, we know what constitutes disproof (falsification),
because null hypotheses are exact
disproof of H0 constitutes evidence for HA
Common philosophy of all tests

Possible outcomes of tests of the null hypothesis (H0):

                H0 accepted        H0 rejected
H0 true         correct decision   Type I error
H0 false        Type II error      correct decision
• P(Type I error) = α = critical p-value
• the smaller the allowable Type I error, the more deviant an outcome has to be from the expected outcome in order to reject H0
• α is AKA the "significance level", since it is the level at which we decide to accept or reject H0 – by convention, it is 0.05 (5%)
– this level is a convention and is arbitrary, not a law! it might be altered based on the context
– the 5% in the tails of the distribution comprises the rejection region, and the range comprising 95% of the outcomes comprises the acceptance region
What level of Type I error is acceptable?
• when we reject Ho at a specified significance level, α, we say that the sample is significantly different from what we expect under the null hypothesis
• the 5% rejection region can be split between 2 tails (2.5% in each one), or can be all in one tail
– these are called 2-tailed or 1-tailed tests, respectively
– decision about which to use is based on whether you have any a priori knowledge or assumptions about possible alternatives to the null hypothesis
• if expected outcomes could be either above or below the expectation under the null, then a two-tailed test is appropriate
• if outcomes are only likely and interesting in one direction, then the test is 1-tailed
– advantage of 1-tailed tests: more powerful, easier to reject Ho, with a lower probability of a type II error
One vs. Two-tailed tests and rejection region
• power = 1 − β
• once we specify α, then P(Type I error) = α,
P(Type II error) = β, and power are all set
• decreasing α increases β and
thus decreases power
• increasing α decreases β and
thus increases power
• why is a one-tailed test more
powerful than a two-tailed test?
Power = probability that we
correctly accept HA (i.e., reject H0 when HA is true)
Power of a test
• with a constant and a given
s2, the power decreases as
the location (mean) of the distribution of Ha approaches
that of Ho
• power also decreases as
spread (s2) increases
• thus, increase power by…
reducing spread (s2), (i.e.,
maximize n)
or test for large effects (big
difference between means)
Power of a test
the t-test review: the t statistic – development of a confidence interval

t = (ȳ − μ) / (s/√n) = (ȳ − μ) / SEM

Rearrange to solve for μ for a confidence interval:

1. t(s/√n) = (ȳ − μ)
2. μ = ȳ − t(s/√n) and μ = ȳ + t(s/√n)
3. for a two-tailed test: P[ȳ − t(s/√n) ≤ μ ≤ ȳ + t(s/√n)]

Solve for t (using df) from:
1. calculated t values, or…
2. the desired confidence level (to determine the range of values that are
likely to contain μ)
Review

Critical values of t (two-tailed probability):

Degrees of Freedom   .01     .02     .05     .10     .20
1                   63.66   31.82   12.71   6.314   3.078
2                   9.925   6.965   4.303   2.920   1.886
3                   5.841   4.541   3.182   2.353   1.638
4                   4.604   3.747   2.776   2.132   1.533
5                   4.032   3.365   2.571   2.015   1.476
10                  3.169   2.764   2.228   1.812   1.372
15                  2.947   2.602   2.132   1.753   1.341
20                  2.845   2.528   2.086   1.725   1.325
25                  2.787   2.485   2.060   1.708   1.316
38                  2.705   2.426   2.020   1.685   1.302

95% Confidence Interval (2-tailed) Review — Lovett et al. (2000):
sample mean ȳ = 61.92, SEM = 0.84, df = 38, so t(0.05, 38 df) = 2.02

P[ȳ − t(s/√n) ≤ μ ≤ ȳ + t(s/√n)]
61.92 − 2.02(0.84) = 60.22
61.92 + 2.02(0.84) = 63.62
95% CI: 60.22 < μ < 63.62
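The Lovett et al. (2000) interval arithmetic can be reproduced directly from the summary values reported on the slide:

```python
# Reproducing the slide's 95% CI calculation (Lovett et al. 2000 summary values).
mean, sem, t_crit = 61.92, 0.84, 2.02  # t(0.05, 38 df), two-tailed

lower = mean - t_crit * sem
upper = mean + t_crit * sem
print(round(lower, 2), round(upper, 2))  # 60.22 63.62
```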
3 types of t-tests — hypothesis tests based on the t distribution
• one-sample t-test: the mean of a sample is different from a constant
• two-sample t-test: the means of two samples are different
• paired t-test: the mean difference between paired observations is different from a constant (usually 0)
• t-tests are parametric tests
• thus, the t statistic only follows t distribution if:
– variable has normal distribution (normality assumption)
– two groups have equal population variances
(homogeneity of variance assumption)
– observations are independent or specifically paired (independence assumption)
Assumptions of t-test
• test of H0 that the population mean equals a
particular value (H0: μ = θ)
– e.g., the population mean density of kelp after some
impact (e.g., an oil spill) is the same as before (H0: μ = mean
before)
• the mean (θ) may be from the literature, other
research, or legislation
One-sample t-test: testing a simple H0
general form of the t statistic:

t = (St − θ) / SE

where…
• St is the sample statistic
• θ is the parameter value (e.g., the value of the mean) specified in H0
• SE is the standard error of the sample statistic

t-statistic
specific form for the population mean:

t = (ȳ − μ) / SEM = (ȳ − μ) / (s/√n)
• different sampling distributions of t for different
sample sizes
– use degrees of freedom (df = n - 1)
• area under each sampling (probability) distribution
equals one
• we determine probabilities of obtaining particular
ranges of t when H0 is true
sampling distribution of t
[Figure: P(t) plotted against t, centred on t = 0, with possible rejection regions at t > 0 and t < 0]

one-tailed tests — two possible independent alternative hypotheses:
1) H0: μ ≤ θ, HA: μ > θ: only reject H0 for large + values of t, i.e.,
when the sample mean is much greater than θ
2) H0: μ ≥ θ, HA: μ < θ: only reject H0 for large − values of t, i.e.,
when the sample mean is much less than θ
two possible tests, t = (ȳ − θ)/SE, with rejection region
t > 0 for (1) and t < 0 for (2)

two-tailed tests — only one alternative hypothesis:
H0: μ = θ, HA: μ > θ or μ < θ
• reject H0 for large + or − values of t, i.e., when the sample
mean is much greater than or much less than θ
• with α = 0.05, each tail holds α/2 = 0.025 of the rejection region
only one test possible: t = (ȳ − θ)/SE
Degrees of Freedom .005/.01 .01/.02 .025/.05 .05/.10 .10/.20
1 63.66 31.82 12.71 6.314 3.078
2 9.925 6.965 4.303 2.920 1.886
3 5.841 4.541 3.182 2.353 1.638
4 4.604 3.747 2.776 2.132 1.533
5 4.032 3.365 2.571 2.015 1.476
10 3.169 2.764 2.228 1.812 1.372
15 2.947 2.602 2.132 1.753 1.341
20 2.845 2.528 2.086 1.725 1.325
25 2.787 2.485 2.060 1.708 1.316
∞ 2.575 2.326 1.960 1.645 1.282
[Figure: t distribution with df = 4 — two-tailed test: rejection regions beyond ±2.78 (central 95%); one-tailed tests: rejection region beyond +2.132, or below −2.132 (95%)]
One and two tailed t-values (df = 4)
General question:
Are birth:death ratios of human
populations near the no-population-growth ratio of 1.25?
Ho: B/D ratio = 1.25
HA: B/D ratio ≠ 1.25
• Are the B/D ratios for any of these
groups ≠ 1.25?
• test using a one sample t-test
Ourworld.syd
Example: one-sample t-test
Single population:
H0: μ = 1.25
df = n - 1
Example: one-sample t-test
t = (ȳ − 1.25) / SEM = (ȳ − 1.25) / s_ȳ = (ȳ − 1.25) / (s/√n)
1. Box plot
2. Normal approximation
3. Dot plot
Hypothesis Testing: One-sample t-test
Results for GROUP$ = Europe
One-sample t-test of B_TO_D with 20 Cases
Ho: Mean = 1.25000 vs Alternative = 'not equal'
Mean : 1.25701
95.00% Confidence Interval : 1.15735 to 1.35668
Standard Deviation : 0.21295
t : 0.14727
df : 19
p-value : 0.88447
Example: one-sample t-test: Results
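The reported t for the Europe group can be checked against the one-sample formula using the summary statistics in the output above (n = 20, mean = 1.25701, SD = 0.21295):

```python
import math

# Reproducing the Europe one-sample t-test from the reported summary statistics.
mean, sd, n, mu0 = 1.25701, 0.21295, 20, 1.25

sem = sd / math.sqrt(n)
t = (mean - mu0) / sem
print(round(t, 3))  # 0.147, matching the reported t = 0.14727
```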
Results for GROUP$ = Islamic
One-sample t-test of B_TO_D with 16 Cases
Ho: Mean = 1.25000 vs Alternative = 'not equal'
Mean : 3.47825
95.00% Confidence Interval : 2.84977 to 4.10672
Standard Deviation : 1.17943
t : 7.55705
df : 15
p-value : 0.00000
Results for GROUP$ = NewWorld
One-sample t-test of B_TO_D with 21 Cases
Ho: Mean = 1.25000 vs Alternative = 'not equal'
Mean : 3.95091
95.00% Confidence Interval : 3.26380 to 4.63802
Standard Deviation : 1.50949
t : 8.19954
df : 20
p-value : 0.00000
Example: one-sample t-test: Results
[Figure: mean birth:death ratio (± 95% CI), on a 0–8 scale, for the Europe, Islamic, and New World country groups, with a reference line at the H0 value]
H0: μ = 1.25
Example: one-sample t-test:
• a way to present the results
• used to compare two populations, each of which has
been sampled
• the simplest form of tests comparing populations
• example: does the average annual income differ for
males and females?
H0: μ1 = μ2; income (males) = income (females)
Survey2.syd
Two-sample t-test
H0: μ1 = μ2, i.e., μ1 − μ2 = 0
- independent observations
df = (n1 − 1) + (n2 − 1) = n1 + n2 − 2

Calculation of t for the two-sample t-test:

t = (ȳ1 − ȳ2) / s_(ȳ1−ȳ2)

where s_(ȳ1−ȳ2) = sp √(1/n1 + 1/n2)
and sp = the pooled standard deviation (more later)
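A minimal sketch of the pooled-variance calculation just described; the two samples below are made up for illustration:

```python
import math

# Sketch of the pooled-variance two-sample t calculation (data are hypothetical).
def pooled_t(y1, y2):
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    ss1 = sum((v - m1) ** 2 for v in y1)
    ss2 = sum((v - m2) ** 2 for v in y2)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)          # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))    # SE of the difference in means
    return (m1 - m2) / se, n1 + n2 - 2         # t and df

t, df = pooled_t([3.1, 2.8, 3.6, 3.0], [2.1, 1.9, 2.5, 2.3])
print(round(t, 2), df)
```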
Logic of the two-sample t-test
Assume H0: μ1 = μ2 and HA: μ1 > μ2:
1) if H0 is true, then the null distribution of t is
known (for a set df)
2) if HA is true, we don't know the distribution of t,
but we do know that it is not the null distribution
[Figure: probability of t when H0 is true (null distribution) and when HA is true (shifted distribution)]
If H0 is true: μ1 = μ2 (given 4 df)
t(0.05, 4 df) = 2.14
Any t > 2.14 will lead to
incorrect rejection of H0
1. this implies that the
difference between ȳ1
and ȳ2 is > 2.14
standard errors (pooled)
2. this will happen 5% of
the time
[Figure: null distribution of t with the rejection region beyond t = 2.14]
If H0 is false: μ1 > μ2 (given 4 df)
t(0.05, 4 df) = 2.14
Any t < 2.14 will lead to incorrect
rejection of HA (i.e., incorrect
acceptance of H0)
1. this means that the
difference between ȳ1 and
ȳ2 is < 2.14
standard errors (pooled)
2. the probability that this will
happen depends on α,
n, and the true difference
between μ1 and μ2
[Figure: alternative (HA true) distribution of t overlaid on the null distribution]
Two-sample t-test on INCOME Grouped by SEX$ vs Alternative = 'not equal'
Standard
GROUP N Mean Deviation
-------+---------------------------
Female 152 20.25658 14.82771
Male 104 24.97115 16.41776
Separate Variance
Difference in Means : -4.71457
95.00% Confidence Interval : -8.67643 to -0.75272
t : -2.34611
df : 206.23313
p-value : 0.01992
Pooled Variance
Difference in Means : -4.71457
95.00% Confidence Interval : -8.59712 to -0.83203
t : -2.39138
df : 254.00000
p-value : 0.01751
The separate variance t-test is based
on the Satterthwaite adjustment (of the degrees of freedom); it is not
recommended unless the variances and the
sample sizes (n) are very different.
What is the conclusion?
Results of example
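The pooled-variance t in the output above can be verified from the reported group summaries (means, SDs, and n for each sex):

```python
import math

# Checking the pooled-variance t in the INCOME output from the reported summaries.
m_f, s_f, n_f = 20.25658, 14.82771, 152   # Female
m_m, s_m, n_m = 24.97115, 16.41776, 104   # Male

sp2 = ((n_f - 1) * s_f ** 2 + (n_m - 1) * s_m ** 2) / (n_f + n_m - 2)
se = math.sqrt(sp2 * (1 / n_f + 1 / n_m))
t = (m_f - m_m) / se
print(round(t, 3))  # -2.391, matching the reported t = -2.39138
```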
• the probability of obtaining our sample data if H0 is
true, i.e., P(data | H0)
• NOT the probability that H0 is true!
• strictly, it is the long-run probability (from repeated
sampling) of obtaining the sample result if H0 is true
meaning of the P value, review
[Figures: (left) boxplots of INCOME (0–70) by SEX (Male, Female), with marginal count histograms; (right) mean annual income (× 1000, + 95% CI) for females (n = 152) and males (n = 104)]
Graphical results of example
• which graph would you present in a
talk or paper?
• which tells you that the assumptions of this analysis may have been
violated?
1. often we want to compare observations that can be considered 'paired' within a subject (replicate)
for example:
i. comparison of activity level before and after eating in the same individual
ii. comparison of longevity of males vs females, where county is the replicate
2. in such cases, there is often benefit in accounting for variance that could be caused by differences among subjects (= replicates)
3. and it is wrong to consider the observations on the same subject as being true replicates – they are not independent
Paired t-test: the logic
Paired t-test uses paired observations
H0: μd = 0

t = d̄ / s_d̄ = d̄ / (s_d/√n_d)

where…
d̄ = the mean difference between paired observations
s_d = the standard deviation of the differences
df = n − 1, where n is the number of pairs
• the null hypothesis is no difference between the paired
observations
• Sea star Pisaster comes in two colors along the west
coast: purple and orange:
– Ho: density of purple per site = density of orange
– individual reefs are the replicates of interest
– looks like a no-brainer
Sea star colors all sites two sample.syd
Paired t-test: example
barely significant
WHY?
Standard
GROUP N Mean Deviation
-------+--------------------------
Orange 7 144.71429 101.75086
Purple 7 457.28571 353.47829
Separate Variance
Difference in Means : -312.57143
95.00% Confidence Interval : -641.43752 to 16.29466
t : -2.24827
df : 6.98755
p-value : 0.05942
Pooled Variance
Difference in Means : -312.57143
95.00% Confidence Interval : -615.48591 to -9.65695
t : -2.24827
df : 12.00000
p-value : 0.04413
Paired t-test: example, results if incorrectly
treated as a 2-sample t-test
[Figure: density (± 95% CI) of orange and purple sea stars, 0–1200 scale — equal variances?]
• given that the observations are paired at the level of
site, can we account for variation among sites?
Consider site-to-site variability (remember, sites are replicates)
Note slopes – are they the
same?
• perhaps log
transform
Sea star colors all sites.syd
Paired Samples t-test on PURPLE vs ORANGE with 7 Cases Alternative = 'not equal'
Mean PURPLE : 457.28571
Mean ORANGE : 144.71429
Mean Difference : 312.57143
95.00% Confidence Interval : 74.58766 to 550.55520
Standard Deviation of Difference : 257.32266
t : 3.21381
df : 6
p-value : 0.01828
Paired t-test: details of calculation
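The paired t reported above follows directly from the mean and standard deviation of the differences given in the output (n = 7 pairs):

```python
import math

# Checking the paired t from the reported mean and SD of the differences.
mean_d, sd_d, n = 312.57143, 257.32266, 7

t = mean_d / (sd_d / math.sqrt(n))
print(round(t, 3))  # 3.214, matching the reported t = 3.21381
```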
Smaller than before
WHY?
Note slopes – much more
similar
indicates that:
• purples are more
common
by a constant ratio
rather than by a
constant amount

Paired Samples t-test on LPURPLE vs LORANGE with 7 Cases Alternative = 'not equal'
Mean LPURPLE : 2.48624
Mean LORANGE : 1.99536
Mean Difference : 0.49088
95.00% Confidence Interval : 0.37685 to 0.60492
Standard Deviation of Difference : 0.12330
t : 10.53299
df : 6
p-value : 0.00004
Paired t-test: details of calculation:
use of log-transformed data
Smaller than before
WHY?
Review of calculations of t for the 3 kinds of t-test:

• One-sample test: t = (ȳ − μ) / (s/√n)
• Two-sample test: t = (ȳ1 − ȳ2) / [sp √(1/n1 + 1/n2)]
• Paired test: t = d̄ / (s_d/√n_d)

Standard Error and Variance (s²) used for calculating t:

• One-sample test: SE = s/√n; s² = SS/(n − 1)
• Two-sample test (calculation based on the pooled variance term):
SE = sp √(1/n1 + 1/n2); sp² = (SS1 + SS2) / [(n1 − 1) + (n2 − 1)]
or, equivalently, sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
• Paired test: SE = s_d/√n_d; s_d² = SS_d/(n_d − 1)
• Methods:
– “A two-sample t-test was used to compare the
mean number of eggs per capsule from the two
zones. Assumptions were checked with….”
• Results:
– “The mean number of eggs per capsule from the
mussel zone was significantly greater than that
from the littorinid zone (t = 5.39, df = 77, P <
0.001; Fig. 2).”
Presenting results of t-tests in scientific writing
• Assumption: data in each group are normally distributed
• Checks:
– Frequency distributions – be careful
– Boxplots
– Probability plots
– formal tests for normality (too powerful, not powerful enough?)
• Solutions:
– transformations
– don't worry, run it anyway, and give a disclaimer
• if there is another appropriate test whose assumptions
are met, use it instead, but often violations make little difference to the reported P-value
Evaluating Assumptions of the t-test: Normality
• Assumption: population variances equal in 2
groups
• Checks:
– subjective comparison of sample variances
– boxplots
– F-ratio test of H0: σ1² = σ2²
• Solutions
– transformations
– run it anyway – same comments as for normality
assumption
Evaluating Assumptions of t-test: Homogeneity of
Variance
• H0: σ1² = σ2²
• F-statistic = ratio of the 2 sample variances

F = s1² / s2²

reject H0 if F is significantly < or > 1
• if H0 is true, F-ratio follows F distribution
• follows usual logic of a statistical test
• will this test be too powerful or not powerful
enough?
Evaluating Assumptions of t-test: Homogeneity of
Variance the F-statistic (AKA F-ratio)
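The F-ratio itself is just a ratio of sample variances; a minimal sketch, with two made-up samples chosen so their variances differ clearly:

```python
# Sketch of the F-ratio used to compare two sample variances (data are hypothetical).
def variance(data):
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / (len(data) - 1)

g1 = [12.0, 15.0, 9.0, 14.0, 10.0]   # more spread
g2 = [11.0, 12.0, 11.5, 12.5, 12.0]  # less spread

s1, s2 = variance(g1), variance(g2)
F = s1 / s2  # by convention, put the larger variance on top
print(round(F, 2))
```

Whether an F this far from 1 is significant would be judged against the F distribution with (n1 − 1, n2 − 1) df.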
Evaluating Assumptions: boxplot & histogram
[Figure: histogram (Count, 0–70) of limpet numbers per quadrat (0–90), strongly skewed]

Evaluating Assumptions: boxplots
[Figure: boxplots illustrating 1. IDEAL, 2. SKEWED, 3. OUTLIERS (*), 4. UNEQUAL VARIANCES]
Evaluating Assumptions: transformations to mitigate departures from normality and homoscedasticity (Ourworld.syd)

Variance:
                 Pop_1990   Lpop1990
Europe           441        0.17
Islamic          1378       0.30
Newworld         1042       0.34
Greatest ratio:  3.12:1     2:1
[Figures: normal probability plots (pplots) and boxplots of the raw and log-transformed data]
• these tests don’t assume particular underlying distribution of data – normal distributions not necessary
• usually based on ranks of the data
• H0: samples come from populations with identical distributions – equal means or medians
• equal variances and independence still required
• typically less powerful than parametric tests
Nonparametric Tests
• use the test that is more efficient (i.e., has the
greatest power given the sample size (n); results in less cost and effort)
• if assumptions of parametric tests are met, they
are always more efficient
• parametric tests are able to deal with more
complex experimental designs – there may be no
nonparametric equivalent
Which type of test to use??
• if assumptions not met, then explore the data – try transformations to ‘normalize’ the data or equate
variances
– if normality assumption violated, still hard to recommend NP tests unless distributions are very weird, transformations do not help, or outliers are present
• do a parametric test based on the ranks!
• use a robust parametric test that does not assume equal variances (e.g., separate variance t-test)
• do a randomization test of your data in conjunction with a parametric test
Parametric tests are usually better
• calculates sum of ranks in 2 samples
– should be similar if H0 is true
• compares rank sum to sampling distribution of
rank sums
– i.e., the distribution of rank sums when H0 true
• equivalent to t-test on data transformed to ranks
Mann-Whitney U / Wilcoxon test
— a nonparametric 2-sample t-test
• DATA: consist of 2 random samples
• ASSUMPTIONS
– both samples are random samples from respective
populations
– independent samples
– measurement scale is at least ordinal
– if there is a difference between sample distributions,
that difference is one of location (i.e., the variances are
equal)
Mann-Whitney U / Wilcoxon test
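The rank-sum logic behind the test can be sketched by counting, for every pair of observations across the two groups, how often a value in one sample exceeds a value in the other (which is equivalent to the rank-sum formulation); the two samples are made up for illustration:

```python
# Sketch of the Mann-Whitney U statistic via direct pair counting (data hypothetical).
def mann_whitney_u(a, b):
    """U = number of (a_i, b_j) pairs with a_i > b_j, plus half for ties.
    Equivalent to the rank-sum formulation of the statistic."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1
            elif x == y:
                u += 0.5
    return u

a = [7.1, 8.2, 9.0]
b = [5.5, 6.1, 6.8]
print(mann_whitney_u(a, b))  # 9.0 -- every value in a exceeds every value in b
```

In practice U is compared to its sampling distribution under H0 (or a normal approximation for larger n).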
• e.g., Satterthwaite’s adjusted t-test for unequal
variances (= separate variance t-test)
• the common version is to recalculate the df for the test
to make it more conservative (a lower df, which may no
longer be an integer):

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

• these tests are more reliable than the traditional tests
when variances or sample sizes are very unequal, but
still require normality
“Robust” parametric tests
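The adjusted df can be checked against the separate-variance INCOME output shown earlier (df = 206.23313), using the reported SDs and sample sizes:

```python
# Checking the separate-variance (Satterthwaite) df against the INCOME output,
# using the (n - 1) form of the adjustment.
s_f, n_f = 14.82771, 152   # Female SD, n
s_m, n_m = 16.41776, 104   # Male SD, n

a, b = s_f ** 2 / n_f, s_m ** 2 / n_m
df = (a + b) ** 2 / (a ** 2 / (n_f - 1) + b ** 2 / (n_m - 1))
print(round(df, 2))  # 206.23, matching the reported df = 206.23313
```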
calculate the difference between the averages of the groups (D0)
randomly reassign the observations so that there are n1 in group 1 and
n2 in group 2
calculate D1
repeat this procedure ~1000 times, each time calculating Di
calculate the proportion of all Di's that are ≥ D0; this is the p-value that
can be compared to α to decide whether to accept or reject H0.
Given the power of computers, this procedure is starting to replace the use of non-
parametric testing when distributional assumptions are violated, distributions are unknown, or random sampling is not possible
• reshuffling the data many times to generate the sampling
distribution of a statistic directly
• principle: if H0 is true, then any random arrangement of observations to groups is equally likely
Randomization (permutation) tests
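The steps above can be sketched directly; for the tiny made-up samples below, every possible reassignment is enumerated exactly rather than sampled ~1000 times, which makes the p-value deterministic:

```python
from itertools import combinations

# Sketch of a randomization (permutation) test on a difference in means.
# The two groups are hypothetical; exact enumeration replaces random resampling
# because the samples are small enough to list every reassignment.
def permutation_test(g1, g2):
    pooled = g1 + g2
    n1 = len(g1)
    d0 = abs(sum(g1) / n1 - sum(g2) / len(g2))  # observed difference D0
    count = total = 0
    for idx in combinations(range(len(pooled)), n1):
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        di = abs(sum(a) / len(a) - sum(b) / len(b))
        total += 1
        if di >= d0 - 1e-12:  # proportion of |Di| >= |D0|
            count += 1
    return count / total  # the p-value

p = permutation_test([1.0, 2.0, 3.0], [8.0, 9.0, 10.0])
print(p)  # 0.1 -- the observed split is one of the 2 most extreme of 20 reassignments
```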