
IE241: Introduction to Hypothesis Testing

Topics

- Hypothesis testing
- Light bulb example
- Null and alternative hypotheses
- Two types of error
- Decision rule
  - test statistic
  - critical region
  - power of the test
- Simple hypothesis testing
  - Neyman-Pearson lemma
  - example
- Composite hypothesis testing
  - example
  - likelihood ratio test
    - relationship to mean
- Examples of 1-sided composite hypotheses
  - drug to help sleep
  - civil service exam
  - difference between two proportions
    - effect of size of n
  - railroad ties
  - fertilizer to improve yield of corn
  - test of two variances
    - F distribution
- Tests of correlated means
- Bayes' likelihood ratio test
  - example
- Chi-square tests
  - goodness of fit
  - independence in contingency tables
  - testing sample vs hypothesized variance
- Significance testing

We said before that estimation of parameters was one of the two major areas of statistics. Now let’s turn to the second major area of statistics, hypothesis testing.

A test of a statistical hypothesis is a procedure for deciding whether or not to reject the hypothesis.

What is a statistical hypothesis? A statistical hypothesis is an assumption about f(X) if X is continuous or p(X) if X is discrete.

Let’s look at an example.

A buyer of light bulbs bought 50 bulbs of each of two brands. When he tested them, Brand A had an average life of 1208 hours with a standard deviation of 94 hours. Brand B had a mean life of 1282 hours with a standard deviation of 80 hours. Are brands A and B really different in quality?

We set up two hypotheses.

The first, called the null hypothesis Ho, is the hypothesis of no difference.

Ho: μA = μB

The second, called the alternative hypothesis Ha, is the hypothesis that there is a difference.

Ha: μA ≠ μB

On the basis of the sample of 50 from each of the two populations of light bulbs, we shall either reject or not reject the hypothesis of no difference.

In statistics, we always test the null hypothesis. The alternative hypothesis is the default winner if the null hypothesis is rejected.

We never really accept the null hypothesis; we simply fail to reject it on the basis of the evidence in hand.

Now we need a procedure to test the null hypothesis. A test of a statistical hypothesis is a procedure for deciding whether or not to reject the null hypothesis.

There are two possible decisions, reject or not reject. This means there are also two kinds of error we could make.

The two types of error are shown in the table below.

                          True state
Decision                  Ho true             Ho false
Reject Ho                 Type 1 error (α)    Correct decision
Do not reject Ho          Correct decision    Type 2 error (β)

If we reject Ho when Ho is in fact true, then we make a type 1 error. The probability of a type 1 error is α.

If we do not reject Ho when Ho is really false, then we make a type 2 error. The probability of a type 2 error is β.

Now we need a decision rule that will make the probability of the two types of error very small. The problem is that the rule cannot make both of them small simultaneously.

The one type of error the experimenter has under his control is α error. He can choose the size of α.

Because in science we have to take the conservative route and never claim that we have found a new result unless we are really convinced that it is true, we choose a very small α, the probability of type 1 error.

Then among all possible decision rules given α, we choose the one that makes β as small as possible.

The decision rule consists of a test statistic and a critical region where the test statistic may fall. For means from a normal population, the test statistic is

$$t = \frac{\bar{X}_A - \bar{X}_B}{s_{\text{diff}}}, \qquad s_{\text{diff}} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}$$

where the denominator is the standard deviation of the difference between two independent means.

The critical region is a tail of the distribution of the test statistic. If the test statistic falls in the critical region, Ho is rejected.

Now, how much of the tail should be in the critical region? That depends on just how small you want α to be. The usual choice is α = .05, but in some very critical cases, α is set at .01.

Here we have just a non-critical choice of light bulbs, so we’ll choose α = .05. This means that the critical region has probability = .025 in each tail of the t distribution.

For a t distribution with .025 in each tail, the critical value of t = 1.96, the same as z because the sample size is greater than 30. The critical region then is |t |> 1.96.

In our light bulb example, the test statistic is

$$t = \frac{1282 - 1208}{\sqrt{\dfrac{80^2}{50} + \dfrac{94^2}{50}}} = \frac{74}{17.45} = 4.23$$

Now 4.23 is much greater than 1.96 so we reject the null hypothesis of no difference and declare that the average life of the B bulbs is longer than that of the A bulbs.

Because α = .05, we have 95% confidence in the decision we made.

We cannot say that there is a 95% probability that we are right because we are either right or wrong and we don’t know which.

But there is such a small probability that t will land in the critical region if Ho is true that if it does get there, we choose to believe that Ho is not true.

If we had chosen α = .01, the critical value of t would be 2.58 and because 4.23 is greater than 2.58, we would still reject Ho. This time it would be with 99% confidence.
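To make the arithmetic concrete, here is a minimal sketch in Python (SciPy assumed; the normal approximation is used, as on the slides, because n > 30):

```python
import math
from scipy import stats

# Light bulb example: two independent samples of n = 50 bulbs each
mean_a, sd_a, n_a = 1208, 94, 50
mean_b, sd_b, n_b = 1282, 80, 50

# Standard deviation of the difference between two independent means
s_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
t = (mean_b - mean_a) / s_diff            # about 4.23

# Two-sided test at alpha = .05: reject when |t| > 1.96
print(t, abs(t) > stats.norm.ppf(0.975))
```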

How do we know that the test we used is the best test possible?

We have controlled the probability of Type 1 error. But what is the probability of Type 2 error in this test? Does this test minimize it subject to the chosen value of α?

To answer this question, we need to consider the concept of test power. The power of a statistical test is the probability of rejecting Ho when Ho is really false. Thus power = 1-β.

Clearly if the test maximizes power, it minimizes the probability of Type 2 error β. If a test maximizes power for given α, it is called an admissible testing strategy.

Before going further, we need to distinguish between two types of hypotheses.

A simple hypothesis is one where the value of the parameter under Ho is a specified constant and the value of the parameter under Ha is a different specified constant.

For example, if you test

Ho: μ = 0 vs Ha: μ = 10

then you have a simple hypothesis test.

Here you have a particular value for Ho and a different particular value for Ha.

For testing one simple hypothesis Ha against the simple hypothesis Ho, a ground-breaking result called the Neyman-Pearson lemma provides the most powerful test.

λ is a likelihood ratio with the Ha parameter MLE in the numerator and the Ho parameter MLE in the denominator:

$$\lambda = \frac{L(\hat{\theta}_a)}{L(\hat{\theta}_0)}$$

Clearly, any value of λ > 1 would favor the alternative hypothesis, while values less than 1 would favor the null hypothesis.

Basically, this likelihood ratio says that if there exists a critical region A of size α and a constant k such that

$$\lambda = \frac{L_a}{L_o} = \frac{\prod_{i=1}^{n} f(x_i;\theta_a)}{\prod_{i=1}^{n} f(x_i;\theta_o)} \ge k \quad \text{inside } A$$

and

$$\lambda \le k \quad \text{outside } A$$

then A is a best (most powerful) critical region of size α.

Consider the following example of a test of two simple hypotheses.

A coin is either fair or has P(H) = 2/3. Under Ho, P(H) = 1/2 and under Ha, P(H) = 2/3.

The coin will be tossed 3 times and a decision will be made between the two hypotheses. Thus X = number of heads = 0, 1, 2, or 3. Now let’s look at how the decision will be made.

First, let’s look at the probability of Type 1 error α. In the table below, Ho⇒ P(H) =1/2 and Ha⇒ P(H) = 2/3.

Now what should the critical region be?

X P(X|Ho) P(X|Ha)

0 1/8 1/27

1 3/8 6/27

2 3/8 12/27

3 1/8 8/27

Under Ho, if X = 0 is the critical region, α = 1/8. Under Ho, if X = 3 is the critical region, α = 1/8. So if either of these two values is chosen as the critical region, the probability of Type 1 error would be the same.

Now what if Ha is true? If X = 0 is chosen as the critical region, the value of β = 26/27 because that is the probability that X ≠ 0.

On the other hand, if X = 3 is chosen as the critical region, the value of β = 19/27 because that is the probability that X ≠ 3.

Clearly, the better choice for the critical region is X=3 because that is the region that minimizes β for fixed α. So this critical region provides the more powerful test.

In discrete variable problems like this, it may not be possible to choose a critical region of the desired α. In this illustration, you simply cannot find a critical region where α = .05 or .01.

This is seldom a problem in real-life experimentation because n is usually sufficiently large so that there is a wide variety of choices for critical regions.
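For reference, this α/β bookkeeping is easy to reproduce in Python (SciPy assumed):

```python
from scipy.stats import binom

# Coin example: n = 3 tosses; Ho: p = 1/2, Ha: p = 2/3
n = 3
for region in (0, 3):                      # candidate critical regions X = 0 and X = 3
    alpha = binom.pmf(region, n, 1/2)      # P(reject Ho | Ho true) = 1/8 either way
    beta = 1 - binom.pmf(region, n, 2/3)   # P(fail to reject | Ha true)
    print(region, alpha, beta, 1 - beta)   # X = 3 gives beta = 19/27, the higher power
```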

This problem, used to illustrate the general method for selecting the best test, was easy to discuss because there was only a single alternative to Ho.

Most problems involve more than a single alternative. Such hypotheses are called composite hypotheses.

Examples of composite hypotheses:

Ho: μ = 0 vs Ha: μ ≠ 0

which is a two-sided Ha.

A one-sided Ha can be written as

Ho: μ = 0 vs Ha: μ > 0 or Ho: μ = 0 vs Ha: μ < 0

All of these hypotheses are composite because they include more than one value for Ha. And unfortunately, the size of β here depends on the particular alternative value of μ being considered.

In the composite case, it is necessary to compare Type 2 errors for all possible alternative values under Ha. So now the size of Type 2 error is a function of the alternative parameter value θ.

So β(θ) is the probability that the sample point will fall in the noncritical region when θ is the true value of the parameter.

Because it is more convenient to work with the critical region, the power function 1-β(θ) is usually used.

The power function is the probability that the sample point will fall in the critical region when θ is the true value of the parameter.

As an illustration of these points, consider the following continuous example.

Let X = the time that elapses between two successive trippings of a Geiger counter in studying cosmic radiation. The density function is

f(x;θ) = θe-θx

where θ is a parameter which depends on experimental conditions.

Under Ho, θ = 2. Now a physicist believes that θ < 2. So under Ha, θ < 2.

Now one choice for the critical region is the right tail of the distribution, X ≥ 1, for which

$$\alpha = \int_1^{\infty} 2e^{-2x}\,dx = e^{-2} \approx .135$$

Another choice is the left tail, X ≤ .07, for which

$$\alpha = \int_0^{.07} 2e^{-2x}\,dx = 1 - e^{-.14} \approx .135$$

Now let's examine the power for the two competing critical regions.

For the right-tail critical region X ≥ 1,

$$1 - \beta(\theta) = \int_1^{\infty} \theta e^{-\theta x}\,dx = e^{-\theta}$$

and for the left-tail critical region X ≤ .07,

$$1 - \beta(\theta) = \int_0^{.07} \theta e^{-\theta x}\,dx = 1 - e^{-.07\theta}$$

The graphs of these two functions are called the power curves for the two critical regions.

Note that the power function for the X ≥ 1 region is always higher than the power function for the X ≤ .07 region before they cross near θ = 2. Since the alternative θ values in the problem are all θ < 2, clearly the right-tail critical region X ≥ 1 is more powerful than the left-tail region.

[Figure: power curves for the two critical regions, X ≥ 1 and X ≤ .07, plotted for θ from 0 to 4.]
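A minimal sketch that tabulates the two power functions (NumPy assumed):

```python
import numpy as np

theta = np.linspace(0.1, 4.0, 40)
power_right = np.exp(-theta)             # critical region X >= 1
power_left = 1 - np.exp(-0.07 * theta)   # critical region X <= .07

# For every alternative theta < 2, the right-tail region has higher power
print(np.all(power_right[theta < 2] > power_left[theta < 2]))
```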

What we just saw was a 1-sided composite alternative hypothesis test. Unfortunately, with two-sided composite alternative hypotheses, there is no best test that covers all alternative values.

Clearly, if the alternative were θa < θo , the left tail would be best, and if the alternative were θa > θo , the right tail would be best.

This shows that best critical regions exist only if the alternative hypothesis is suitably restricted.

So for composite hypotheses, a new principle needs to be introduced to find a good test. This principle is called a likelihood ratio test.

$$\lambda = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})}$$

where the denominator is the maximum of the likelihood function with respect to all the parameters, and the numerator is the maximum of the likelihood function after some or all of the parameters have been restricted by Ho.

Consequently, the numerator can never exceed the denominator, so λ can assume values only between 0 and 1.

A value of λ close to 1 lends support to Ho because then it is clear that allowing the parameters to assume values other than those possible under Ho would not increase the likelihood of the sample values very much, if at all.

If, however, λ is close to 0, then the probability of the sample values of X is very low under Ho, and Ho is therefore not supported by the data.

Because increasing values of λ correspond to increasing degrees of belief in Ho, λ may serve as a statistic for testing Ho, with small values leading to rejection of Ho.

Now the MLEs are functions of the values of the random variable X, so λ is also a function of these values of X and is therefore an observable random variable.

λ is often related to a statistic such as X̄ whose distribution is known, so it is not necessary to find the distribution of λ itself.

Suppose we have a normal population with σ = 1 and we are interested in testing whether the mean = μo. That is,

Ho: μ = μo vs Ha: μ ≠ μo

Let's see how we would construct a likelihood ratio test. Here the density is

$$f(x;\mu) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\mu)^2}$$

In this case,

$$L(\mu) = (2\pi)^{-n/2}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2}$$

Since maximizing L(μ) is equivalent to maximizing log L(μ),

$$\frac{d \log L(\mu)}{d\mu} = \sum_{i=1}^{n}(x_i-\mu) = 0$$

so $\hat{\mu} = \bar{X}$, and therefore

$$L(\hat{\mu}) = (2\pi)^{-n/2}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar{X})^2}$$

Under Ho, there are no parameters to be estimated, so

$$L(\mu_o) = (2\pi)^{-n/2}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu_o)^2}$$

and λ then is

$$\lambda = \frac{L(\mu_o)}{L(\hat{\mu})} = e^{-\frac{1}{2}\left[\sum_{i=1}^{n}(x_i-\mu_o)^2 - \sum_{i=1}^{n}(x_i-\bar{X})^2\right]} = e^{-\frac{n}{2}(\bar{X}-\mu_o)^2}$$

This expression shows a relationship between λ and X̄ such that for each value of λ, there are two critical values of X̄, which are symmetrical with respect to X̄ = μo.

So the 5% critical region for λ corresponds to the two 2.5% tails of the normal distribution given by

$$\left|\frac{\bar{X}-\mu_o}{1/\sqrt{n}}\right| \ge 1.96$$

Thus the likelihood ratio test is identical to the t test and serves as a compromise test when no best test is available.
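A small numerical sketch of this equivalence (NumPy assumed; the data are simulated, so the particular numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.4, scale=1.0, size=25)   # sigma = 1 known; true mean 0.4
mu0 = 0.0

lam = np.exp(-len(x) / 2 * (x.mean() - mu0) ** 2)  # likelihood ratio
z = np.sqrt(len(x)) * (x.mean() - mu0)             # equivalent z statistic

# Small lambda corresponds exactly to large |z|: reject Ho when |z| > 1.96
print(lam, z, abs(z) > 1.96)
```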

It is because of the concept of power that we simply fail to reject the null hypothesis and do not accept it when the test value does not fall into the rejection region.

The reason is that if we had a more powerful test, we might have been able to reject Ho.

Now let’s look at some examples.

As an example of a one-sided composite hypothesis test, suppose a new drug is available which claims to produce additional sleep. The drug is tested on 10 patients with the results shown.

We are testing the hypothesis Ho: μ = 0 vs Ha: μ > 0

Patient:       1     2     3     4    5    6    7    8    9    10
Hours gained:  0.7  -1.1  -0.2  1.2  0.1  3.4  3.7  0.8  1.8  2.0

The mean hours gained = 1.24 and s = 1.45. So the t statistic is

$$t = \frac{1.24 - 0}{1.45/\sqrt{10}} = 2.70$$

which has 9 df.

For df = 9, the required t at α = .05 is 1.833 for the 1-tailed test specified here (2.262 for a 2-tailed test). Since our obtained t exceeds either value, we can, with 95% confidence, reject Ho.

So in this case, even with only 10 patients, we can endorse the drug for obtaining longer sleep.
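A minimal check in Python. The data are as reconstructed in the table above; note that the slide's s = 1.45 corresponds to the divisor-n form of the standard deviation (the usual divisor n − 1 would give s ≈ 1.53 and t ≈ 2.57):

```python
import math

gains = [0.7, -1.1, -0.2, 1.2, 0.1, 3.4, 3.7, 0.8, 1.8, 2.0]
n = len(gains)

mean = sum(gains) / n                                   # 1.24
s = math.sqrt(sum((g - mean) ** 2 for g in gains) / n)  # 1.45, divisor-n form

t = (mean - 0) / (s / math.sqrt(n))                     # about 2.7, with 9 df
print(round(mean, 2), round(s, 2), round(t, 2))
```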

Now let’s take a second example. A civil service exam is given to a group of 200 candidates. Based on their total scores, the 200 candidates are divided into two groups, the top 30% and the bottom 70%.

Now consider the first question in the examination. In the upper 30% group, 40 had the right answer. In the lower 70% group, 80 had the right answer. Is the question a good discriminator between the top scorers and the lower scorers?

To answer this question, we first set up the two hypotheses.

In this case, the null hypothesis is Ho: pu = pl and the alternative is Ha: pu > pl

because we would expect the upper group to do better than the lower group on all questions.

In binomial situations, we must deal with proportions instead of counts unless the two sample sizes are the same.

The proportion of successes p = x/n may be assumed to be normally distributed with mean p and variance pq/n if n is large.

Then the difference between two sample proportions may also be approximately normally distributed if n is large. In this situation,

$$\mu_{p_1-p_2} = p_1 - p_2 \qquad \text{and} \qquad \sigma^2_{p_1-p_2} = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$$

Just as for the binomial distribution, the normal approximation will be satisfactory if each n_i p_i exceeds 5 when p ≤ 1/2 and each n_i q_i exceeds 5 when p > 1/2.

The test statistic is

$$t = \frac{p_u - p_l}{\sqrt{\dfrac{pq}{n_u} + \dfrac{pq}{n_l}}}$$

We need the common estimate of p under Ho to use in the denominator, so we use the estimate for the entire group. So p = 120/200 = .6 and q = .4. The p for the upper group = 40/60 ≈ .67. The p for the lower group = 80/140 ≈ .57.

So inserting our values into the test statistic, we get

$$t = \frac{.67 - .57}{\sqrt{\dfrac{(.6)(.4)}{60} + \dfrac{(.6)(.4)}{140}}} = \frac{.10}{.076} = 1.32$$

Our critical region is t > 1.65 because we have set α = .05 as the critical value in this 1-tailed test. Because of the large sample size, t.95 = z.95 = 1.65.

Because the obtained t = 1.32 is lower than the required t = 1.65, the data do not allow us to reject the null hypothesis.

So, given the data, we conclude that the first question is not a good one for distinguishing between the upper scorers and the lower scorers on the entire test.

Now let’s look at our test problem again. Suppose instead of 200 candidates we tested 500, but kept everything else in the problem the same.

Now we will reject Ho because now t = 2.092, which is greater than 1.65, the critical value of t.

092.20478.

10.

350)4(.6.

150)4(.6.

57.67.

t
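The effect of n is easy to see in code. A minimal sketch in plain Python using the pooled p under Ho; it works from the exact counts, so the results come out slightly below the slides' values, which use the rounded proportions .67 and .57:

```python
import math

def pooled_prop_t(x_u, n_u, x_l, n_l):
    """Difference of two proportions with the pooled estimate of p under Ho."""
    p = (x_u + x_l) / (n_u + n_l)
    q = 1 - p
    se = math.sqrt(p * q / n_u + p * q / n_l)
    return (x_u / n_u - x_l / n_l) / se

print(pooled_prop_t(40, 60, 80, 140))     # n = 200: about 1.26 (1.32 on the slides)
print(pooled_prop_t(100, 150, 200, 350))  # n = 500: about 1.99 (2.092 on the slides)
```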

This is why we never accept Ho, but only fail to reject it with the evidence in hand. It is always possible that a more powerful test will provide evidence to reject Ho.

But this leads to another question. If, theoretically, we can always keep increasing sample size, then eventually we will always be able to reject Ho. So why do the test to begin with?

The reality is that you can’t keep increasing n in the real world because there are constraints on time, money, and manpower that prevent having n so large that rejection of Ho is a foregone conclusion.

We usually have to get by with the n we have available.

Furthermore, even if we could get a larger sample size, there is no guarantee that everything else will remain the same.

The mean difference in the numerator could change. So could the variance estimates in the denominator.

So we do the test because there is no other choice.

Let’s look at another example of testing the difference between two proportions.

A railroad company installed two sets of 50 ties. The two sets were treated by creosote using two different processes.

After a number of years in service, 22 ties of the first set and 18 ties of the second set were still in good condition.

The question is whether one method of treating with creosote is better than the other. So we set up two hypotheses:

Ho: p1 = p2

Ha: p1 ≠ p2

Now we can use the t test statistic

$$t = \frac{p_1 - p_2}{\sqrt{\dfrac{pq}{n_1} + \dfrac{pq}{n_2}}}$$

because the samples are large enough to assume normality of p1 − p2. For a 2-tailed test with α = .05, the critical value of t = 1.96.

First, we need to get the values of p and q for the denominator. Since Ho treats both p1 and p2 as coming from populations with the same p, the common estimate of p is (22+18)/100 = .4. So q = .6.

Now the t test is

$$t = \frac{.44 - .36}{\sqrt{\dfrac{(.4)(.6)}{50} + \dfrac{(.4)(.6)}{50}}} = \frac{.08}{.098} = .816$$

Clearly, we cannot reject Ho.

As another example, consider the application of a fertilizer to plots of farm ground and the effect it has on the yield of corn in bushels. The data are

The average yield for the treated plots = 6.0, with s2 = 0.0711. The average yield for the untreated plots = 5.7 with s2 =0.0267.

Ho: μtreated = μuntreated
Ha: μtreated ≠ μuntreated

Treated:    6.2  5.7  6.5  6.0  6.3  5.8  5.7  6.0  6.0  5.8
Untreated:  5.6  5.9  5.6  5.7  5.8  5.7  6.0  5.5  5.7  5.5

The test statistic is

$$t = \frac{6.0 - 5.7}{\sqrt{\dfrac{0.0711}{10} + \dfrac{0.0267}{10}}} = \frac{.3}{.0989} = 3.03$$

So Ho can be rejected because α = .05 and t.025 = 2.101 with 18 df. When you test the difference between two means, df = (nA − 1) + (nB − 1).

So we can conclude that the fertilizer will help produce more bushels of corn.
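For reference, SciPy's pooled two-sample t test reproduces this value (equal variances assumed; the F test later in these slides checks that assumption):

```python
from scipy import stats

treated = [6.2, 5.7, 6.5, 6.0, 6.3, 5.8, 5.7, 6.0, 6.0, 5.8]
untreated = [5.6, 5.9, 5.6, 5.7, 5.8, 5.7, 6.0, 5.5, 5.7, 5.5]

# Pooled two-sample t test with (10-1) + (10-1) = 18 df
t, p = stats.ttest_ind(treated, untreated, equal_var=True)
print(t, p)   # t is about 3.03; p < .05, so reject Ho
```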

Now we can ask how many extra bushels of corn we will get with the fertilizer. The point estimate is .3, but we can find a 95% confidence interval around this estimate.

In the case of a confidence interval for the difference between two means,

$$(\bar{X}_A - \bar{X}_B) \pm t_{.975}\, s_{\bar{X}_A-\bar{X}_B} = .3 \pm 2.101(.0989) = .3 \pm .208$$

so that .092 < μA − μB < .508.

Note that the 95% confidence interval does not include 0, and the t test rejected Ho that the difference = 0. The confidence interval thus confirms the t test outcome.

Because the sample size was only 10 for each group, we can’t say with any degree of confidence that the increase in yield is more than .092, but it may be as much as .508.

One caution about using small samples to test the difference between means is that t assumes equality of the two variances. If the samples are large, this assumption is unnecessary.

Now how can we know if the two variances are equal? We can test them.

We already know that each variance is distributed as chi-square. Now how can we test to see if two variances are equal?

The answer is the F test.

The F distribution is a ratio of two chi-square variables. So if s1² and s2² possess independent chi-square distributions with v1 and v2 df, respectively, then

$$F = \frac{s_1^2/v_1}{s_2^2/v_2}$$

has the F distribution with v1 and v2 df.

The F distribution is

$$f(F) = c\, \frac{F^{(v_1-2)/2}}{\left(1 + \frac{v_1}{v_2}F\right)^{(v_1+v_2)/2}}$$

where c is given by

$$c = \frac{\Gamma\!\left(\frac{v_1+v_2}{2}\right)}{\Gamma\!\left(\frac{v_1}{2}\right)\Gamma\!\left(\frac{v_2}{2}\right)} \left(\frac{v_1}{v_2}\right)^{v_1/2}$$

and the symbol Γ(x) denotes the gamma or factorial function of x, which has the property that Γ(x+1) = xΓ(x).

Now let’s do the test to see if our two variances are equal. In the problem, the two variances are .0711 and .0267.

So and there are 10-1 and 10-1 df. Is 2.66 g

reater than would be expected if the two variances were equal? To answer this, we must consult the F distribution.

66.20267.

0711.22

21 s

sF

Now it turns out that the critical regions in the two tails have critical values that are reciprocals of each other. That is, if

$$F = \frac{s_1^2}{s_2^2}$$

then

$$\frac{1}{F} = \frac{s_2^2}{s_1^2}$$

Because of this reciprocal property, the procedure is always to place the larger variance over the smaller. Then we can refer to the F distribution for the .025 critical region to see if the hypothesis of a common variance is to be rejected.

For this case, with 9 and 9 df, the critical value of F = 4.025. There is an FINV function in EXCEL to find critical values for the F distribution.

Since the observed value of 2.66 is less than the critical value of 4.025, we cannot reject the null hypothesis of common variance.

To see the reciprocality, if we had placed the smaller variance over the larger, the observed ratio would be 1/2.66 = .376.

The critical value would be 1/F = 1/4.025 = .2484

In this case, for the left tail, the observed value should be less than the critical value. But here .376 > .2484, so we would not reject Ho.
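A minimal sketch of the variance test (SciPy assumed for the critical value, playing the role of Excel's FINV):

```python
from scipy import stats

s2_treated, s2_untreated = 0.0711, 0.0267

F = s2_treated / s2_untreated              # larger variance on top: about 2.66
crit = stats.f.ppf(0.975, dfn=9, dfd=9)    # upper .025 critical value, about 4.03
print(F, crit, F > crit)                   # cannot reject equal variances
```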

So far, we have always been talking about the difference between two independent means. What if the means are not independent?

In the situation of correlated means, we must find the standard error of the difference between two dependent means. So now the situation is complicated.

Suppose we have 10 heart patients who took a treadmill test, then went on a strict exercise program, then retook the treadmill test.

Patient   Test 1   Test 2   Gain
1         15       21       6
2         18       23       5
3         18       21       3
4         20       25       5
5         20       24       4
6         22       29       7
7         23       28       5
8         24       28       4
9         22       25       3
10        25       33       8
Mean      20.7     25.7     5
St dev    3.093    3.802    1.633

Our question is whether the strict exercise program really helped the patients’ treadmill performance.

Ho: μafter = μbefore

Ha: μafter > μbefore

Clearly, the before and after test values are correlated because they are scores for the same patients. In fact, the correlation coefficient = .908. We can use the t test statistic, but what do we put in the denominator?

We have to incorporate this dependence in the standard error. The simplest way to do this is to reformulate Ho and Ha.

Ho: μdiff = 0

Ha: μdiff > 0

Now all we have to do is test the observed difference against the null difference of 0. Clearly the difference between means is normally distributed, so the t test can be used.

Now we have

$$t = \frac{5 - 0}{1.633/\sqrt{10}} = \frac{5}{.5164} = 9.68$$

For α = .05 in the right tail (1-tailed test), the critical value of t with df = 9 is 1.833. Our observed value of 9.68 is much greater than 1.833, so we can reject Ho with 95% confidence.
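SciPy's paired test is exactly this one-sample t test on the differences; a quick sketch (the alternative= keyword requires a recent SciPy):

```python
from scipy import stats

test1 = [15, 18, 18, 20, 20, 22, 23, 24, 22, 25]
test2 = [21, 23, 21, 25, 24, 29, 28, 28, 25, 33]

# Paired (correlated) t test = one-sample t test on the gains, 9 df
t, p = stats.ttest_rel(test2, test1, alternative='greater')
print(t, p)   # t is about 9.68; reject Ho
```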

The point to see here is that with correlated data, the test of the differences gives the same result as a test of before and after values if the covariance is incorporated in the standard error.

Incorporating the covariance into the standard error directly is cumbersome. The simpler way to handle correlated values is to use the average of the differences in the t test statistic numerator and the standard deviation of the differences, divided by the square root of n, in the denominator.

Now what about a confidence interval around this difference?

The 95% confidence interval is 5 ± 2.262(.5164) = 5 ± 1.168, so 3.832 < μdiff < 6.168.

Again the 95% confidence interval does not include 0, confirming the t test result at α = .05. In this case, the t test is one-sided and the confidence interval is 2-sided. This does not always happen with 1-sided t tests.

Tests of correlated data are much less common than tests of independent groups. Nonetheless, they do happen and testing the differences is the way to handle this.

Most correlated data come from before-after situations, so you must be careful in your inferences about the effect of the intervening activity if you reject Ho.

For example, it is possible in our example, that some of the patients were taking some drug that helped their treadmill performance. So we can’t conclude that it was just the exercise program unless we make sure that there are no other factors to consider.

Another approach to an admissible test strategy is that developed by Bayes, which turns out to be a likelihood ratio test. Bayes' formula is used to determine the likelihood of a hypothesis, given an outcome:

$$P(H_i \mid D) = \frac{P(H_i)\, P(D \mid H_i)}{\sum_{j=1}^{k} P(H_j)\, P(D \mid H_j)}$$

This formula gives the likelihood of Hi given the data you actually got versus the total likelihood of every hypothesis given the data you got. So Bayes' strategy is a likelihood ratio test.

Consider an example where there are two identical boxes. Box 1 contains 2 red balls and Box 2 contains 1 red ball and 1 white ball.

Now a box is selected by chance and 1 ball is drawn from it. What is the probability that it was Box 1 that was selected if the ball that was drawn was red?

Let’s test this with Bayes’ formula.

There are only two hypotheses here, so H1 = Box 1 and H2 = Box 2. The data, of course, = R. So we can find

$$P(H_1 \mid R) = \frac{P(H_1)P(R \mid H_1)}{P(H_1)P(R \mid H_1) + P(H_2)P(R \mid H_2)} = \frac{(1/2)(1)}{(1/2)(1) + (1/2)(1/2)} = \frac{2}{3}$$

And we can find

$$P(H_2 \mid R) = \frac{P(H_2)P(R \mid H_2)}{P(H_1)P(R \mid H_1) + P(H_2)P(R \mid H_2)} = \frac{(1/2)(1/2)}{(1/2)(1) + (1/2)(1/2)} = \frac{1}{3}$$

So we can see that the odds of the data favoring Box 1 to Box 2 are 2:1.
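The same posterior computation as a minimal Python sketch:

```python
# Box 1 holds 2 red balls; Box 2 holds 1 red and 1 white; a box is chosen at random
prior = {'Box1': 0.5, 'Box2': 0.5}
likelihood_red = {'Box1': 1.0, 'Box2': 0.5}   # P(R | box)

total = sum(prior[h] * likelihood_red[h] for h in prior)
posterior = {h: prior[h] * likelihood_red[h] / total for h in prior}
print(posterior)   # Box1: 2/3, Box2: 1/3 -- odds of 2:1 for Box 1
```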

We are twice as likely to be right if we choose Box 1, but there is still some probability that it could be Box 2.

The reason we choose Box 1 is because it is more likely, given the data we have.

This is the whole idea behind likelihood ratio tests. We choose the hypothesis which has the greater likelihood, given the data we have. With other data, we might choose another hypothesis.

Now we’re going to look at tests where the test statistic is χ2.

The first case is a test of goodness-of-fit. Here we are comparing an observed distribution with some distribution expected under the null hypothesis to see if the data fit Ho or not.

Since we’re dealing with distributions here and not means, we will use frequencies instead of measurements.

Suppose you want to know if a die is fair. Under the null hypothesis, the die is fair. Under the alternative hypothesis, the die is not fair.

Ho: p = 1/6 for all sides of the die
Ha: p ≠ 1/6 for at least one side of the die

Now how do we test this?

We do an experiment to test the die. The observed data are shown in the table below:

The expected data are what is expected under Ho.

X   Observed   Expected
1   n1         n/6
2   n2         n/6
3   n3         n/6
4   n4         n/6
5   n5         n/6
6   n6         n/6

The question is whether or not the observed data agree with the expected data. If they do not agree, then we reject Ho that the die is fair.

This goodness-of-fit test is due to Karl Pearson. The test statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{oi})^2}{np_{oi}}$$

where k = number of categories and p_{oi} is the probability of category i under Ho.

In fact, this χ² test, like the t test, turns out to be equivalent to a likelihood ratio test.

Now suppose the experiment consists of 60 rolls of the die with the following results:

Now we can apply our test statistic to these data to see if Ho is to be rejected or not.

X   Observed   Expected
1   15         60/6 = 10
2   7          60/6 = 10
3   4          60/6 = 10
4   11         60/6 = 10
5   6          60/6 = 10
6   17         60/6 = 10

The test statistic is

$$\chi^2 = \frac{(15-10)^2}{10} + \frac{(7-10)^2}{10} + \frac{(4-10)^2}{10} + \frac{(11-10)^2}{10} + \frac{(6-10)^2}{10} + \frac{(17-10)^2}{10} = \frac{136}{10} = 13.6$$

Now for α = .05, the critical value of χ² with 6 − 1 = 5 df is 11.1. Since our observed χ² > 11.1, we reject Ho that the die is fair.
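SciPy's chisquare reproduces this directly; a quick sketch:

```python
from scipy import stats

observed = [15, 7, 4, 11, 6, 17]
expected = [10] * 6               # 60 rolls of a fair die

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)                    # chi2 = 13.6 with 5 df; p < .05, reject Ho
```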

The test statistic we used approaches the χ2 distribution when n is large because the proportions p are distributed normally when n is large.

A limitation of the use of χ2 is that all of the expected frequencies must be ≥ 5.

This is similar to the limitation for the use of the normal approximation to the binomial in which np and nq were required to be > 5.

The expected frequencies are not always equal. Consider the following example.

In experiments on breeding flowers, the colors were purple, red, green, and yellow. The flower colors are expected to occur in a 9:3:3:1 ratio. So

Ho: pp = 9/16 pr = 3/16 pg = 3/16 py = 1/16

in a multinomial distribution involving four categories for which n = 217.

The question is whether the colors are in accord with the theoretically expected frequencies.

The observed and expected data are shown below. The expected data are obtained by multiplying each expected p by n, so the expected frequencies for purple and red flowers are purple: (217)(9/16) = 122.06 and red: (217)(3/16) = 40.69. Because the expected values are rounded, the totals for the observed and expected frequencies may not be identical.

Color      purple   red   green   yellow   total
Observed   120      48    36      13       217
Expected   122      41    41      14       218

Here the test statistic is

$$\chi^2 = \frac{(120-122)^2}{122} + \frac{(48-41)^2}{41} + \frac{(36-41)^2}{41} + \frac{(13-14)^2}{14} = 1.909$$

The critical value of χ² with 3 df for a critical region of size α = .05 is 7.8. Since the observed χ² of 1.909 < 7.8, the null hypothesis cannot be rejected.
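The same test with unequal expected frequencies; a minimal SciPy sketch using the unrounded expected counts, which give nearly the same χ² as the rounded table values:

```python
from scipy import stats

observed = [120, 48, 36, 13]
n = sum(observed)                          # 217
ratios = [9/16, 3/16, 3/16, 1/16]          # hypothesized 9:3:3:1 ratio
expected = [n * r for r in ratios]         # unrounded expected counts

chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)                             # about 1.91 with 3 df; cannot reject Ho
```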

Another use of the χ2 distribution for hypothesis testing is for tests of independence in contingency tables.

A contingency table is a cross-classification of two variables:

             Variable B
Variable A   B1     B2     B3            B4     Totals
A1                                              p1.
A2                         (p2.)(p.3)           p2.
A3                                              p3.
Totals       p.1    p.2    p.3           p.4

where pi. is the probability of being in the ith row and p.j is the probability of being in the jth column.

Since, under Ho, these variables are independent, the probability of the ijth cell is the product of the row and column probabilities.

The dot notation is common in cross-classifications, where the dot is in place of what has been summed over to get the marginal.

Thus

$$n_{i.} = \sum_{j=1}^{c} n_{ij}, \qquad n_{.j} = \sum_{i=1}^{r} n_{ij}, \qquad p_{i.} = n_{i.}/n, \qquad p_{.j} = n_{.j}/n$$

In this case,

Ho: Variables A and B are independent
Ha: Variables A and B are not independent

and the test statistic is

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - n\,p_{i.}\,p_{.j})^2}{n\,p_{i.}\,p_{.j}}$$

with (r − 1)(c − 1) df. This test is also due to Karl Pearson.

Let’s look at an example. Suppose an experimenter is interested in whether or not educational level is related to marital adjustment, and has collected the following data, where the values in parentheses are expected frequencies.

Educational     Marital Adjustment Score
Level           Very low   Low       High      Very high   Totals
College         18 (27)    29 (39)   70 (64)   115 (102)   232
High school     17 (13)    28 (19)   30 (32)   41 (51)     116
Grade school    11 (6)     10 (9)    11 (14)   20 (23)     52
Totals          46         67        111       176         400

How do you get the expected frequencies? Just replace the pi. and p.j with their maximum likelihood estimates as frequencies.

The expected frequency is

$$e_{ij} = n\,\hat{p}_{i.}\,\hat{p}_{.j} = \frac{n_{i.}\, n_{.j}}{n}$$

and we can find the expected frequency for cell 11 by

$$e_{11} = \frac{(232)(46)}{400} = 26.68$$

Now the test statistic is

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left(n_{ij} - \dfrac{n_{i.}\, n_{.j}}{n}\right)^2}{\dfrac{n_{i.}\, n_{.j}}{n}}$$

which has a χ² distribution with (r − 1)(c − 1) df if n is sufficiently large and Ho is true.

Note that nij is the observed cell frequency and (ni. n.j)/n is the expected cell frequency under Ho.

Now we can find all the cell expected frequencies and find the observed χ2 just as we did for the goodness-of-fit test.

The χ2 for (3-1)(4-1)= 6 df with α = .05 is 12.592. The observed χ2 is 20.7 for our contingency table, so the null hypothesis can be rejected with 95% confidence.

Again, we must be sure that all expected frequencies are ≥ 5 to assure the validity of the χ2 test.
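SciPy handles the whole table at once; a quick sketch. Note the slide's χ² of 20.7 uses the rounded expected frequencies shown in parentheses above; the unrounded counts give about 19.9, and Ho is rejected either way:

```python
import numpy as np
from scipy import stats

# Rows: college, high school, grade school; columns: very low ... very high
table = np.array([[18, 29, 70, 115],
                  [17, 28, 30, 41],
                  [11, 10, 11, 20]])

chi2, p, df, expected = stats.chi2_contingency(table)
print(chi2, df, p)   # about 19.9 with 6 df; p < .05, reject independence
```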

Now let’s look at some more examples.

The following data are for school children in a city in Scotland.

Test to see whether hair color and eye color are independently distributed.

Eyes \ Hair   Fair   Red    Medium   Brown   Black   Total
blue          1368   170    1041     398     1       2978
light         2577   474    2703     932     11      6697
medium        1390   420    3826     1842    33      7511
dark          454    255    1848     2506    112     5175
Total         5789   1319   9418     5678    157     22361

In this study,

Ho: hair and eye color are independent
Ha: hair and eye color are not independent

The appropriate test statistic here is χ2 with (4-1)(5-1) = 12 df. For α = .05, the critical value of χ2 = 21.026.

Now this is a big table and involves so much computation that we will look for a shortcut.

Let’s look for some obvious incompatibility. The smallest eye color total overall is for blue eyes. Blue eyes has the smallest frequency for all hair colors except fair.

Since cell11 is incompatible with the other cells in row 1, let’s look at the χ2 value for this cell.

For cell 11, the expected frequency under Ho is

$$\frac{n_{1.}\,n_{.1}}{n} = \frac{(2978)(5789)}{22361} = 770.97 \approx 771$$

So the χ² value for this cell is

$$\frac{(1368 - 771)^2}{771} = \frac{356409}{771} = 462.27$$

Just with this one cell, we can reject Ho because 462.27 >> 21.026. We don't need to compute the χ² values for all the other cells because we already can reject Ho. If we were to add the χ² values for all the other cells, the rejection would be very much stronger.

This example illustrates the value of looking at the data to see where there is an inconsistency and then checking that cell.

We also could have checked cell35 because the highest total row is row 3 but this cell is not highest in its column.

Looking at the data also helps to find the cell(s) with the most unexpected result. It may be that most of the cells are close to expectation, but that one or two of them are very much not in accord with expectation. This could be a very useful finding.

Take another example. Five brands of canned salmon are being tested for quality. The tester examines 24 cans of each brand and finds the following results.

Can you say which brand of salmon you would most like to buy and which you would not accept?

Quality \ Brand   A    B    C    D    E    Total
High              21   14   17   22   16   90
Very low          3    10   7    2    8    30
Total             24   24   24   24   24   120

In this case,

Ho: brand and quality are independent
Ha: brand and quality are not independent

The critical value for the χ2 statistic with (2-1)(5-1) = 4 df where α = .05 is 9.488.

Let’s see if we can reject Ho. The expected frequencies are easy to compute here because the column totals are all equal. So for the first row, all the expected frequencies = 18. For the second row, all the expected frequencies are 6.

The χ² is

$$\chi^2 = \frac{(21-18)^2}{18} + \frac{(14-18)^2}{18} + \frac{(17-18)^2}{18} + \frac{(22-18)^2}{18} + \frac{(16-18)^2}{18} + \frac{(3-6)^2}{6} + \frac{(10-6)^2}{6} + \frac{(7-6)^2}{6} + \frac{(2-6)^2}{6} + \frac{(8-6)^2}{6} = 10.22$$

which is greater than the critical value of 9.488, so we can reject Ho with 95% confidence.

But we can do more than that. We can choose the salmon brand we will buy (brand D) and the brand we will avoid (brand B) because they have the largest combined (high quality + very low quality) χ² values.

Finally, there is a third way to use χ2 as a test statistic. We have already seen χ2 used to form a confidence interval for the variance σ2. Now we see how to use χ2 for testing an observed vs a hypothetical value of σ.

Consider the following problem. Past experience for a manufactured product has shown that σ = 7.5. However, the latest sample of size 25 gave s =10. Has the variability in this product increased?

We set up the two hypotheses:

Ho: σ = 7.5
Ha: σ > 7.5

The test statistic is

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} = \frac{(25-1)(10)^2}{(7.5)^2} = 42.67$$

and the critical value of χ² with 24 df for α = .05 in this 1-tailed test is 36.415. Since our observed value > the critical value, we reject Ho with 95% confidence and claim that variability has increased.
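A one-line check of this test in SciPy:

```python
from scipy import stats

n, s, sigma0 = 25, 10.0, 7.5

chi2 = (n - 1) * s**2 / sigma0**2        # 42.67
crit = stats.chi2.ppf(0.95, df=n - 1)    # 36.415 for 24 df
print(chi2, crit, chi2 > crit)           # reject Ho: variability has increased
```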

The right tail of the χ² distribution is a restricted best critical region; as we saw earlier, a best critical region exists only when the alternative is suitably restricted, here to σ > σo.

If the alternative were σ < σo, then the left tail of the χ2 distribution is a restricted best test.

But if the alternative were σ ≠ σo, there is no best critical region and we use the two equal tails of the χ2 distribution as the compromise critical region.

We have looked at hypothesis testing with three different statistics, t, F, χ2.

In all cases, we have chosen the desired α level and used the test statistic to see if the result falls in the critical region. When it does and we reject Ho at the chosen α level, we call the result significant.

There is another way to deal with testing that does not involve setting the chosen α level beforehand.

This way is called significance testing.

We still do hypothesis testing as before except that now we do not choose the α level. That is, we do not have a critical region or a critical value of the test statistic.

Instead we compute the observed value of the test statistic and then find the probability that the test statistic will exceed this value. If the probability is small enough, we say the result is significant at p < whatever probability value we get.

For example, in the light bulb test, we observed a t of 4.23. If we had been doing significance testing instead of hypothesis testing, we would find the probability of a t this great or greater.

It turns out that this probability is 0.0000994, so we would say that this result is significant at p < .0000994.

Significance testing, in effect, gives us the smallest α level at which we would reject Ho for the data we have.

For the treadmill test of correlated means, we observed a t = 9.68, which was significant at the .05 level.

If we were doing significance testing here, we would declare the result significant at p < 0.00000469.
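With SciPy, these p-values are one-liners; a sketch (the slides' exact figures depend on the df and on one- vs two-sided tails, so treat the comparisons as approximate):

```python
from scipy import stats

# Treadmill example: upper-tail p-value for t = 9.68 with 9 df
print(stats.t.sf(9.68, df=9))     # compare the slide's p < 0.00000469

# Light bulb example: two-sided p-value for t = 4.23, normal approximation
print(2 * stats.norm.sf(4.23))    # compare the slide's p < .0000994
```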

Significance testing vs hypothesis testing was at one time a very controversial issue. But today significance testing is more commonly used than hypothesis testing.

This is a case where technological advances changed statistical practice.

When hypothesis testing was first developed, there were no computers, so people selected a few standard α levels because critical values were tabled for the major test statistics only at those levels.

With computers, significance testing took over because there was no longer any need to use tables. The p-value was computable in a split second, so why not use it?