15. Nonparametric Methods: chi-square applications (nonparametric statistical methods: contingency-table applications)
hsueh/statII/ch15.pdf · 2006. 1. 17.


  • 1

    15. Nonparametric Methods : chi-square applications

    Nonparametric statistical methods: contingency-table applications

  • 2

    Outline

    • Chi-square distribution
    • Chi-square tests
      – Goodness-of-fit test: whether the data are drawn from a population with
        • a uniform distribution (every category has the same probability),
        • some specified distribution, or
        • a normal distribution.
      – Independence test: whether two discrete variables are independent or not.

  • 3

    • Nonparametric or distribution-free tests
      – No assumption about the population distribution is required.
      – Recall: when testing a population mean with unknown population variance, the t-test is available only when X ~ normal.
    • Chi-square tests are available for categorical data.
    • Categorical data have a nominal or ordinal scale.
      – Nominal: classification, e.g. gender, marital status.
      – Ordinal: classification + rank, e.g. performance = {bad, fair, good}.
      – A categorical data set is summarized in a frequency/contingency table.

  • 4

    Population: 1, 1, 1, 1, …, 2, 2, 2, 2, …, k, k, k, k, …

    The population data can be summarized in a frequency table:

    category     1        2        …    k        total
    frequency    $f_1$    $f_2$    …    $f_k$    N

    Further, the population distribution can be calculated:

    category     1                  2                  …    k                  total
    prob.        $\pi_1 = f_1/N$    $\pi_2 = f_2/N$    …    $\pi_k = f_k/N$    1

    [Figure: bar chart of the population distribution over categories 1, 2, 3, …, k]

  • 5

    Population: 1, 1, 1, 1, …, 2, 2, 2, 2, …, k, k, k, k, …

    ↓ sampling

    Sample: 1, 1, 1, 1, …, 2, 2, 2, …, k, k, k, k, …

    The sample data can also be summarized in a frequency table:

    category     1           2           …    k           total
    frequency    $f_{o1}$    $f_{o2}$    …    $f_{ok}$    n

  • 6

    A goodness-of-fit test: whether the population distribution is equal to a specific null distribution.

    [Figure: the population distribution over categories 1, 2, 3, …, k is set against the null distribution (H0: some distribution); a sample over categories 1, 2, …, k is drawn from the population and used for testing.]

  • 7

    Chi-square distribution $\chi^2$

    • The chi-square distribution is the sampling distribution of the chi-square test statistic.
      – The statistic is never negative.
      – There is a family of chi-square distributions, indexed by the degrees of freedom.
      – The chi-square distribution is positively skewed.
      – Theoretically, if $X_1, \dots, X_n$ are independent $N(0,1)$, then $\sum_{i=1}^{n} X_i^2 \sim \chi^2_n$.
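    The theoretical statement above can be checked by simulation. The sketch below is only an illustration: the function name, the seed and the number of draws are assumptions, not part of the slides. It adds up squared standard normal draws and checks that the sums are never negative, average roughly the degrees of freedom, and exceed the tabulated value 11.070 (d.f. 5, α=0.05) about 5% of the time.

```python
import random


def simulate_chi_square(df, draws, seed=0):
    """Return `draws` simulated values of X1^2 + ... + Xdf^2 with Xi ~ N(0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(draws)]


samples = simulate_chi_square(df=5, draws=20000)
mean = sum(samples) / len(samples)
# Share of simulated values in the right tail beyond the table value for d.f. = 5.
share_above = sum(s >= 11.070 for s in samples) / len(samples)

print(f"minimum            = {min(samples):.3f}")   # a sum of squares is never negative
print(f"mean               = {mean:.3f}")           # should be close to df = 5
print(f"P(sum >= 11.070)  ~= {share_above:.3f}")    # should be close to 0.05
```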

  • 8

  • 9

    I. Goodness-of-fit (GOF) test

    • Data:
      – A sample of n categorical observations is summarized in a frequency/contingency table with k categories.
    • Goal: to test
      – whether the data are sampled from a specified distribution, that is,
      – whether the population has a specified distribution.
    • Hypotheses:
      – H0: the population of the data follows the specified distribution.
      – H1: H0 is not true.

  • 10

    Data: an observed frequency table, $\{f_{oi},\ i = 1, \dots, k\}$

    Category                          1                    2                    …    k                    Sum
    Observed frequency                $f_{o1}$             $f_{o2}$             …    $f_{ok}$             n
    Hypothetical distribution (H0)    $\pi_1$              $\pi_2$              …    $\pi_k$              1
    Expected frequency                $f_{e1} = n\pi_1$    $f_{e2} = n\pi_2$    …    $f_{ek} = n\pi_k$    n

    Expected frequency: when H0 is true, the expected frequency in category i is $f_{ei} = n \times \pi_i$.

  • 11

    • Strategy:
      – Compare the observed frequency with the expected frequency in each category.
    • Test statistic: the chi-square test statistic

      $$\chi^2 = \sum_{\text{category}} \left\{ \frac{(f_o - f_e)^2}{f_e} \right\}$$

      – When the null hypothesis is true, the test statistic has a chi-square distribution with (k-1-p) degrees of freedom.
      – k = number of categories.
      – p = number of estimates used in calculating the expected frequencies $f_e$.

  • 12

    • Why (k-1-p) degrees of freedom? In the test statistic:
      – Observed frequencies: with total sample size n and k categories,
        • there are k observed frequencies, $f_{o1}, \dots, f_{ok}$;
        • however, there is one restriction, $\sum_{i=1}^{k} f_{oi} = n$;
        • thus the k observed frequencies carry k-1 degrees of freedom.
      – Expected frequencies: all k of them are calculated from p estimates,
        • thus they account for p degrees of freedom.
    • In summary, the d.f. = (k-1) - p.
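    A minimal sketch of the statistic and its degrees of freedom follows. The function names and the three-category data are hypothetical, chosen only to make the arithmetic easy to follow; the formulas are the ones stated on the slides.

```python
def chi_square_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of categories")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


def gof_degrees_of_freedom(k, p=0):
    """Degrees of freedom k - 1 - p: k categories, p estimated parameters."""
    return k - 1 - p


# Hypothetical data: n = 100 observations; H0 specifies pi = (0.5, 0.3, 0.2).
observed = [46, 34, 20]
null_probs = [0.5, 0.3, 0.2]
n = sum(observed)
expected = [n * pi for pi in null_probs]   # fe = n * pi for each category

chi2 = chi_square_statistic(observed, expected)
df = gof_degrees_of_freedom(k=len(observed), p=0)  # H0 is fully specified, so p = 0
print(round(chi2, 4), df)
```

    Here the contributions are 16/50, 16/30 and 0, so the statistic is about 0.853 with 2 degrees of freedom.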

  • 13

    Decision rule:
    • Rejection region: a right-tailed test!
      – H0 should be rejected if the test statistic is significantly large.
    • Critical value = ?
      – Under the null hypothesis, $\chi^2$ follows a chi-square distribution with (k-1-p) d.f.; check the chi-square distribution table in Appendix I.
      – At significance level α, H0 is rejected if

        $$\chi^2 \ge \chi^2_{(k-1-p,\ \alpha)}$$

        where $\chi^2_{(k-1-p,\ \alpha)}$ can be found in Appendix I with d.f. (k-1-p) and right-tail probability α.

  • 14

    Example. Appendix I. Table 15-3

    $\chi^2_{(5,\ 0.05)} = ?$    $\chi^2_{(10,\ 0.1)} = ?$    $\chi^2_{(9,\ 0.01)} = ?$

  • 15

    Example. Chart 15-1: $\chi^2_{(5,\ 0.05)} = 11.070$
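    When no printed table is at hand, a critical value can be approximated numerically. The sketch below is an assumption-laden illustration rather than the course's method: it evaluates the chi-square CDF with the standard power series for the incomplete gamma function and inverts it by bisection. Both function names are hypothetical. It is used here only on the four critical values quoted in these slides (11.070, 7.815, 9.488, 11.345).

```python
import math


def chi2_cdf(x, df):
    """P(chi-square with `df` d.f. <= x), via the incomplete-gamma power series."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    # Add terms z^n / (a (a+1) ... (a+n)) until they no longer change the sum.
    while term > total * 1e-16:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_critical_value(df, alpha):
    """Value c with right-tail probability P(chi-square >= c) = alpha (bisection)."""
    lo, hi = 0.0, 10.0 * df + 50.0   # generous upper bracket for table-sized problems
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 1.0 - chi2_cdf(mid, df) > alpha:
            lo = mid                 # tail still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2.0


for df, alpha in [(5, 0.05), (3, 0.05), (4, 0.05), (3, 0.01)]:
    print(df, alpha, round(chi2_critical_value(df, alpha), 3))
```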

  • 16

    • Types of problems with GOF tests:
      – Categorical data:
        • Equal expected frequencies: the population has a "uniform" distribution.

          $$H_0: \pi_1 = \pi_2 = \dots = \pi_k = 1/k$$

        • Unequal expected frequencies: the general case.

          $$H_0: \pi_1 = \pi_{10},\ \pi_2 = \pi_{20},\ \dots,\ \pi_k = \pi_{k0}$$

  • 17

    • Types of problems with GOF tests:
      – Continuous data:
        • Normal distribution: test the normality assumption before further analysis.
        • Example. The following procedures need the normality assumption:
          – Testing a single population mean with unknown variance and a small sample size (t-test).
          – Testing the difference of two population means with a common unknown variance and small sample sizes (t-test).
          – ANOVA
          – Linear regression

  • 18

    Type I: Categorical data

    • Step 1. State the hypotheses:

      $$H_0: \pi_1 = \pi_{10},\ \pi_2 = \pi_{20},\ \dots,\ \pi_k = \pi_{k0}$$

    • Step 2. Select the significance level, α=0.05.

    • Step 3. Select the test statistic: the chi-square test

      $$\chi^2 = \sum_{\text{category}} \left\{ \frac{(f_o - f_e)^2}{f_e} \right\}$$

      – $f_{oi}$: observed frequency of the ith category
      – $f_{ei}$: expected frequency of the ith category = $n \times \pi_{i0}$

  • 19

    • Step 4. Formulate the decision rule:
      – Under H0 the distribution is completely determined, so no estimate is needed to calculate the expected frequencies.
      – Thus p = 0 and df = k-1.
      – H0 is rejected if $\chi^2 \ge \chi^2_{(k-1,\ \alpha)}$.
    • Step 5. Collect the data, calculate the test statistic and draw a conclusion.

  • 20

    Example. P523 Equal expected frequencies

    Jan plans to begin a series of sports cards. One of the problems is the selection of the former players. At the end of a weekend she had sold a total of n=120 cards. The number of cards sold for each player is given below. Can she conclude that the sales are not the same for each player?

    X = player's card = {T, N, Ty, G, H, J}, a nominal-type variable

    Player                              T      N      Ty     G      H      J      Total
    Cards sold (observed frequency)     13     33     14     7      36     17     120
    Hypothetical distribution           1/6    1/6    1/6    1/6    1/6    1/6    1
    Expected frequency                  20     20     20     20     20     20     120

    Each expected frequency is $n\pi_i = 120(1/6) = 20$.

  • 21

    • Step 1. State the hypotheses:
      – H0: equal sales proportions, $H_0: \pi_T = \pi_N = \pi_{Ty} = \pi_G = \pi_H = \pi_J = 1/6$.
      – H1: H0 is not true.

    • Step 2. Select the significance level, α=0.05.

    • Step 3. Select the test statistic: the chi-square test.

    • Step 4. Formulate the decision rule:
      – k = 6
      – p = 0; no estimate is needed for the expected frequencies.
      – df = k-1-p = 6-1 = 5
      – H0 is rejected if $\chi^2 \ge \chi^2_{(5,\ 0.05)} = 11.07$.

  • 22

  • 23

    • Step 5. Draw the conclusion:
      – Since $\chi^2 = 34.40 > 11.07$, the null hypothesis is rejected at α=0.05.

    Player                        T       N       Ty      G       H        J       Total
    fo (observed frequency)       13      33      14      7       36       17      120
    Hypothetical distribution     1/6     1/6     1/6     1/6     1/6      1/6     1
    fe (expected frequency)       20      20      20      20      20       20      120
    (fo-fe)                       -7      13      -6      -13     16       -3      0
    (fo-fe)^2                     49      169     36      169     256      9
    (fo-fe)^2/fe                  2.45    8.45    1.80    8.45    12.80    0.45    34.40

    Each expected frequency is $n\pi_i = 120(1/6) = 20$.
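    The table can be reproduced in a few lines. This is a sketch using the numbers from the slide; the variable names are illustrative, and the critical value 11.070 is copied from Chart 15-1 rather than computed.

```python
# Cards sold per player (observed frequencies from the slide).
observed = {"T": 13, "N": 33, "Ty": 14, "G": 7, "H": 36, "J": 17}

n = sum(observed.values())   # 120 cards in total
k = len(observed)            # 6 players
expected = n / k             # uniform H0: fe = n * (1/6) for every player

# Contribution (fo - fe)^2 / fe of each player to the statistic.
contributions = {player: (fo - expected) ** 2 / expected for player, fo in observed.items()}
chi2 = sum(contributions.values())

critical_value = 11.070      # chi-square table value for d.f. = 5, alpha = 0.05
reject_h0 = chi2 >= critical_value

for player, part in contributions.items():
    print(f"{player:>2}: {part:.2f}")
print(f"chi-square = {chi2:.2f}, reject H0: {reject_h0}")
```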

  • 24

    Example. P530 Unequal expected frequencies

    Table 15-4 gives the result of the AHAA report on the admissions of senior citizens to hospitals in a one-year period. The community of Bartow Estates would like a comparison with AHAA's result. A sample of n=150 local senior citizens was selected, and the number of admissions of each individual was recorded. See Table 15-5.

    Use significance level α=0.05 to determine whether there is a difference between the national and local patterns.

    X = the number of admissions = {0, 1, 2, 3 or more}, a discrete variable

  • 25

    Table 15-4 Summary of the study by AHAA and a survey of Bartow Estates residents

    Number of times admitted    AHAA (%)    Number of Bartow residents (fo)    Expected number of residents
    0                           40          55                                 60
    1                           30          50                                 45
    2                           20          32                                 30
    3 or more                   10          13                                 15
    Total                       100         150                                150

  • 26

    • Step 1. State the hypotheses:
      – H0: no difference between the local and national patterns, $H_0: \pi_0 = 0.4,\ \pi_1 = 0.3,\ \pi_2 = 0.2,\ \pi_{\ge 3} = 0.1$.
      – H1: H0 is not true.

    • Step 2. Select the significance level, α=0.05.

    • Step 3. Select the test statistic: the chi-square test.

    • Step 4. Formulate the decision rule:
      – k = 4
      – p = 0; no estimate is needed for the expected frequencies.
      – df = k-1-p = 4-1 = 3
      – H0 is rejected if $\chi^2 \ge \chi^2_{(3,\ 0.05)} = 7.815$.

  • 27

  • 28

    • Step 5. Draw the conclusion.

    • Since $\chi^2 = 1.3723 < \chi^2_{(3,\ 0.05)} = 7.815$, the null hypothesis of no difference is not rejected at level 0.05.

    No. of admissions                0                1                2                ≥3               Total
    fo (observed frequency)          55               50               32               13               150
    π (hypothetical distribution)    0.40             0.30             0.20             0.10             1
    fe (expected frequency)          150(0.4) = 60    150(0.3) = 45    150(0.2) = 30    150(0.1) = 15    150
    (fo-fe)                          -5               5                2                -2               0
    (fo-fe)^2                        25               25               4                4
    (fo-fe)^2/fe                     0.4167           0.5556           0.1333           0.2667           1.3723
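    The same calculation with unequal null probabilities can be sketched as follows. Variable names are illustrative, and the critical value 7.815 is copied from the slide. Computed without intermediate rounding, the statistic is about 1.372; the slide's 1.3723 is the sum of the four rounded contributions.

```python
# Observed admissions in the Bartow Estates sample and the AHAA (national) proportions.
observed = [55, 50, 32, 13]              # 0, 1, 2, "3 or more" admissions
null_probs = [0.40, 0.30, 0.20, 0.10]

n = sum(observed)                        # 150 residents
expected = [n * pi for pi in null_probs]  # fe = n * pi: about 60, 45, 30, 15

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

df = len(observed) - 1                   # p = 0: nothing is estimated from the sample
critical_value = 7.815                   # chi-square table value for d.f. = 3, alpha = 0.05
reject_h0 = chi2 >= critical_value

print(f"chi-square = {chi2:.4f}, d.f. = {df}, reject H0: {reject_h0}")
```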

  • 29

    Limitations of the chi-square test: P531
    • The chi-square test is an approximate (asymptotic) test.
    • The approximation is accurate only when the sample size n, and thus the expected frequency fe in every category, is sufficiently large.
      – When some expected frequencies are small, the denominators of the statistic are small, so the chi-square statistic can become very large and unstable. This might result in an erroneous conclusion.

  • 30

    • Rule of thumb: when are the expected frequencies insufficient?
      1. With k=2 cells: if either cell has fe < 5.
      2. With k>2 cells: if more than 1/5 of the cells have fe < 5.
      Then the chi-square test should not be used, or categories need to be combined.

    • What to do when some fe are not large enough?
      – Combine categories.
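    The rule of thumb above is easy to encode. The sketch below is one possible reading of it; the function name and the `threshold` parameter are hypothetical. It is applied to the expected frequencies of the management-level example that follows in the slides, before and after the three vice-president categories are combined.

```python
def expected_frequencies_too_small(expected, threshold=5):
    """Apply the slide's rule of thumb to a list of expected frequencies."""
    k = len(expected)
    small = sum(fe < threshold for fe in expected)   # number of cells with fe < 5
    if k == 2:
        return small >= 1          # k = 2: a single small cell is already too many
    return small > k / 5           # k > 2: more than 1/5 of the cells are small


fe_before = [32, 113, 87, 24, 2, 4, 1]   # seven management levels
fe_after = [32, 113, 87, 24, 7]          # vice-president levels combined (2 + 4 + 1)

print(expected_frequencies_too_small(fe_before))  # 3 of 7 cells are below 5
print(expected_frequencies_too_small(fe_after))   # no cell is below 5
```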

  • 31

    Example. 531-532

    • Before combining categories, $\chi^2 = 14.008$ with p-value = 0.0295, so H0 is rejected at α=0.05.

    • However, about 98% (32.1% + 1.8% + 64.3%) of the chi-square value is due to the three vice-president categories, whose expected frequencies are small (fe = 2, 4 and 1).

    Page 560
    Level of management         fo     fe     fo-fe    (fo-fe)^2    (fo-fe)^2/fe    % of chi-square
    Foreman                     30     32     -2       4            0.125           0.89
    Supervisor                  110    113    -3       9            0.080           0.57
    Manager                     86     87     -1       1            0.011           0.08
    Middle management           23     24     -1       1            0.042           0.30
    Assistant vice president    5      2      3        9            4.500           32.12
    Vice president              5      4      1        1            0.250           1.78
    Senior vice president       4      1      3        9            9.000           64.25
    Total                       263    263                          14.008          100.00

  • 32

    Example. P532 after combination

    Page 561
    Level of management    fo     fe     fo-fe    (fo-fe)^2    (fo-fe)^2/fe    % of chi-square
    Foreman                30     32     -2       4            0.125           1.72
    Supervisor             110    113    -3       9            0.080           1.10
    Manager                86     87     -1       1            0.011           0.16
    Middle management      23     24     -1       1            0.042           0.57
    Vice president         14     7      7        49           7.000           96.45
    Total                  263    263                          7.258           100.00

    • After combining the three vice-president categories, $\chi^2 = 7.258 < \chi^2_{(4,\ 0.05)} = 9.488$, p-value = 0.1229.
    • H0 is not rejected at α=0.05.
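    Both versions of the management-level example can be recomputed side by side. The sketch below uses the slide's frequencies; the helper names are hypothetical. For the p-values it relies on a closed form of the chi-square right-tail probability that holds only for an even number of degrees of freedom, which happens to cover both cases here (d.f. 6 and 4).

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


def chi2_right_tail_even_df(x, df):
    """P(chi-square >= x); closed form valid only for even, positive df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers even, positive df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


fo = [30, 110, 86, 23, 5, 5, 4]   # foreman ... senior vice president
fe = [32, 113, 87, 24, 2, 4, 1]

chi2_before = chi_square_statistic(fo, fe)
p_before = chi2_right_tail_even_df(chi2_before, df=len(fo) - 1)

# Combine the last three (vice-president) categories into one cell.
fo_combined = fo[:4] + [sum(fo[4:])]
fe_combined = fe[:4] + [sum(fe[4:])]

chi2_after = chi_square_statistic(fo_combined, fe_combined)
p_after = chi2_right_tail_even_df(chi2_after, df=len(fo_combined) - 1)

print(f"before: chi-square = {chi2_before:.3f}, p-value = {p_before:.4f}")
print(f"after:  chi-square = {chi2_after:.3f}, p-value = {p_after:.4f}")
```

    The conclusion flips from "reject" to "do not reject" once the sparse cells are merged, which is the point the two slides make.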

  • 33

    Type II: Continuous data (optional)

    • Data: a continuous variable.
    • Goal: test an assumption about the population distribution, e.g.
      – Normal
      – t-distribution
      – F-distribution

  • 34

    • Strategy:
      – Data processing:
        • Summarize the continuous-scale observations in a frequency table.
        • Original continuous data → grouping → frequency table of fo. (Ch. 2)
      – Hypothetical distribution:
        • Calculate the probability of each group under the assumed distribution.
        • The expected frequency of each group is fe = n × prob.

  • 35

    Example. Normal assumption

    The president of Duval University collected data on the annual salaries of full professors at 160 colleges.

    The sample mean salary = 54.03 K and the sample standard deviation = 13.76K.

    With proper grouping, the frequency distribution of these annual salaries is given in Table 15-7.

    Do the observed frequencies coincide with the expected frequencies based on the normal distribution? Can we conclude that the distribution of salary is normal ?

  • 36

    • Step 1. State the hypotheses:
      – H0: X ~ normal
      – H1: H0 is not true.

    • Step 2. Select the significance level, α=0.05.

    • Step 3. Select the test statistic: the chi-square test
      – fo: obtained by summarizing the original data in a frequency table (see the next slide).
      – fe: estimated under the hypothesized (null) distribution.

  • 37

    Summarize the original data in a frequency table

    Table 15-7
    Salary     fo
    20~30      4
    30~40      20
    40~50      41
    50~60      44
    60~70      29
    70~80      16
    80~90      2
    90~100     4
    Total      160

  • 38

    How to estimate fe ?

    1. Estimate the hypothetical probability in each cell.

    Table 15-7-1
    Salary    Hypothetical probability
    ~30       P(X

  • 39

    How to estimate P(a

  • 40

    How to estimate P(a

  • 41

  • 42

    Table 15-7-2

    Salary Estimated Hypothetical probability

  • 43

    For n=160,

    Table 15-7-3
    Salary    Estimated hypothetical probability    Expected frequency fe = 160 × prob.

  • 44

    • Step 4. Formulate the decision rule.
      – Rejection region and critical value.
      – There are k = 7 categories in the frequency table.
      – The sample values $\bar{x}, s$ are used to estimate the population parameters $\mu, \sigma$ and, further, the expected frequencies. Thus p = 2.
      – The d.f. = k-1-p = 7-1-2 = 4.
      – At level α=0.05, the null hypothesis is rejected if $\chi^2 \ge \chi^2_{(4,\ 0.05)} = 9.488$.

  • 45

    • Step 5. Draw the conclusion

    • Since $\chi^2 = 2.59 < \chi^2_{(4,\ 0.05)} = 9.488$, the null hypothesis is not rejected at α=0.05. The data are consistent with full professors' salaries following a normal distribution.
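    The normality check can be sketched end to end, under one loudly stated assumption: the slides' tables of cell probabilities did not survive extraction, so the seven cells used below (below 30, 30~40, …, 70~80, and 80 or more, i.e. the last two classes of Table 15-7 merged) are a guess that matches k = 7 but is not confirmed by the source. The normal CDF is taken from `math.erf` instead of a printed z-table, so the statistic should land near the slide's 2.59 without matching it digit for digit.

```python
import math


def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


mean, sd, n = 54.03, 13.76, 160          # sample mean, sample s.d. (in K) and sample size

# ASSUMPTION: seven cells with these boundaries; the last cell merges 80~90 and 90~100.
cut_points = [30, 40, 50, 60, 70, 80]
observed = [4, 20, 41, 44, 29, 16, 2 + 4]

# Cell probabilities under the fitted normal: differences of the CDF at the boundaries.
cdf_values = [0.0] + [normal_cdf(c, mean, sd) for c in cut_points] + [1.0]
cell_probs = [upper - lower for lower, upper in zip(cdf_values, cdf_values[1:])]
expected = [n * p for p in cell_probs]   # fe = n * prob.

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1 - 2               # p = 2: mean and s.d. were estimated
critical_value = 9.488                   # chi-square table value for d.f. = 4, alpha = 0.05

print([round(fe, 2) for fe in expected])
print(f"chi-square = {chi2:.2f}, d.f. = {df}, reject H0: {chi2 >= critical_value}")
```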

  • 46

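    How to obtain $\bar{x}, s$ with grouped data?

    • Sometimes the data are already grouped.
    • When the original data have been grouped, use the formulas from Chap. 3 and 4:

      $$\bar{x} = \frac{\sum_{i:\text{class}} x_i f_i}{n}$$

      $$s = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{k} f_i x_i^2 - \left(\sum_{i=1}^{k} f_i x_i\right)^2 / n}{n-1}}$$

    • where $x_i$ = class mark and $f_i$ = frequency.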
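    As an illustration of the grouped-data formulas, the sketch below applies them to the eight classes of Table 15-7, taking the class marks to be the interval midpoints 25, 35, …, 95 (an assumption; the slides do not list them). The function name is hypothetical. The grouped results need not equal the 54.03 and 13.76 quoted earlier, which are described as the sample mean and standard deviation of the salaries themselves.

```python
import math


def grouped_mean_and_sd(class_marks, frequencies):
    """Sample mean and standard deviation from class marks x_i and frequencies f_i."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(class_marks, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(class_marks, frequencies))
    mean = sum_fx / n
    # Shortcut form: (sum f x^2 - (sum f x)^2 / n) / (n - 1)
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return mean, math.sqrt(variance)


class_marks = [25, 35, 45, 55, 65, 75, 85, 95]   # ASSUMPTION: midpoints of 20~30, ..., 90~100
frequencies = [4, 20, 41, 44, 29, 16, 2, 4]      # fo from Table 15-7

mean, sd = grouped_mean_and_sd(class_marks, frequencies)
print(f"grouped mean = {mean:.3f}, grouped s = {sd:.3f}")
```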

  • 47

    II. Independence test: an r×c contingency table analysis

    • Data:
      – For each subject, two discrete variables, X1 and X2, are studied.
      – The data are summarized in an r×c contingency table.

    • For example: provincial origin (4 levels) vs. party affiliation (2 levels)

    • Research question: "Is there a relationship between X1 and X2?"

    Provincial origin \ Party    Blue    Green    Total
    Native Taiwanese             400     400      800
    Mainlander                   180     20       200
    Hakka                        50      50       100
    Indigenous                   50      50       100
    Total                        680     520      1200

  • 48

    • Example. (P534) "Does a released prisoner make a different adjustment to civilian life if
      1. he returns to his hometown, or
      2. he goes elsewhere to live?"
    • Is there a relationship between place of residence (X1) and adjustment to civilian life (X2) after release from prison?

  • 49

    • Data: a 2×4 table of X1 = "residence" by X2 = "adjustment".
    • Determine:
      – Is there a relationship between "residence" and "adjustment"?
      – Is "residence" independent of "adjustment"?

    Table 15-10 Adjustment to civilian life and place of residence

                      Adjustment to civilian life (X2)
    Residence (X1)    outstanding    good    fair    unsatisfactory    total
    hometown          27             35      33      25                120
    elsewhere         13             15      27      25                80
    total             40             50      60      50                200

  • 50

    Data: an r×c contingency table

    Is there a relationship between X1 and X2?

    X1 \ X2    1            …    c            total
    1          $f_{o11}$    …    $f_{o1c}$    $n_{11}$
    …          …            …    …            …
    r          $f_{or1}$    …    $f_{orc}$    $n_{1r}$
    total      $n_{21}$     …    $n_{2c}$     n

  • 51

    • Step 1. Hypotheses
      – H0: X1 and X2 are independent.
      – H1: X1 and X2 are not independent.

    • Step 2. Significance level α
    • Step 3. Test statistic: the chi-square test

      $$\chi^2 = \sum_{i:\text{category}} \frac{(f_{oi} - f_{ei})^2}{f_{ei}}$$

      – where $f_{oi}$ = observed frequency and $f_{ei}$ = expected frequency = ?

  • 52

    Expected frequency, fe
    • $f_{ei}$ = expected frequency under the null hypothesis of independence.
    • Recall: if A and B are independent events, then P(A∩B) = P(A and B both occur) = P(A) P(B).
    • If X1 and X2 are independent, the probability of each cell is
      – joint prob. = (row marginal prob.) × (column marginal prob.)

        $$P(X_1 = i, X_2 = j) = P(X_1 = i)\,P(X_2 = j) \approx \left(\frac{n_{1i}}{n}\right)\left(\frac{n_{2j}}{n}\right)$$

      – where $n_{1i}$ = row marginal total
      – and $n_{2j}$ = column marginal total.

  • 53

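    • Thus, the expected frequency is fe = n × (cell joint prob.):

      $$f_e = n \times \left(\frac{n_{1i}}{n}\right)\left(\frac{n_{2j}}{n}\right) = \frac{n_{1i}\, n_{2j}}{n} = \frac{(\text{row total})(\text{column total})}{n}$$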
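    The expected-frequency formula translates directly into code. The sketch below uses a hypothetical 2×2 table, chosen so the numbers are easy to verify by hand; the function name is illustrative.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: (row total)(column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2x2 table: row totals 50 and 50, column totals 40 and 60, n = 100.
table = [[30, 20],
         [10, 40]]

for row in expected_frequencies(table):
    print(row)
```

    Every expected row and column total equals the observed one, which is why the margins count as the estimated quantities when the degrees of freedom are worked out.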

  • 54

    Degrees of freedom: k - 1 - p

    • For a chi-square test, df = k-1-p.
    • If r = number of rows and c = number of columns, then k = total number of cells = r × c.
    • And p = number of estimates used to calculate fe = ?
      – All fe's are determined by the row and column marginal probabilities.
      – For r rows, there are (r-1) free row marginal probabilities, since they sum to 1.
      – For c columns, there are (c-1) free column marginal probabilities.
      – Thus, p = (r-1) + (c-1).
    • Thus, df = k - 1 - p = rc - 1 - (r-1) - (c-1) = (r-1)(c-1).

  • 55

    • Step 4. Rejection region
      – A one-sided (right-tailed) chi-square test with df = (r-1)(c-1).
      – H0 is rejected at level α if $\chi^2 \ge \chi^2_{((r-1)(c-1),\ \alpha)}$.

    • Step 5. Conclusion: calculate the chi-square test statistic and draw a conclusion based on the decision rule in Step 4.

  • 56

    Example. P566

    • Step 1. Hypotheses
      – H0: "Residence" and "Adjustment" are independent.
      – H1: "Residence" and "Adjustment" are not independent.

    • Step 2. Significance level α=0.01
    • Step 3. Test statistic: the chi-square test

      $$\chi^2 = \sum_{i:\text{category}} \frac{(f_{oi} - f_{ei})^2}{f_{ei}}$$

      – where $f_{oi}$ = observed frequency and $f_{ei}$ = expected frequency.

  • 57

    • Step 4. Rejection region: r=2, c=4
      – A one-sided (right-tailed) chi-square test with df = (r-1)(c-1) = 3.
      – H0 is rejected at level α=0.01 if $\chi^2 \ge \chi^2_{(3,\ 0.01)} = 11.345$.

  • 58

    Step 5. Conclusion:

                    adjustment
    residence       outstanding      good             fair             unsatisfactory    total
    hometown        27               35               33               25                120
      expected      120×40/200=24    120×50/200=30    120×60/200=36    120×50/200=30
    elsewhere       13               15               27               25                80
      expected      80×40/200=16     80×50/200=20     80×60/200=24     80×50/200=20
    total           40               50               60               50                200

    $$\chi^2 = \sum_{i:\text{category}} \frac{(f_{oi} - f_{ei})^2}{f_{ei}} = \frac{(27-24)^2}{24} + \frac{(35-30)^2}{30} + \dots + \frac{(25-20)^2}{20} = 5.729 < 11.345$$

    • H0 is not rejected at α=0.01; there is no evidence of a relationship between adjustment to civilian life and place of residence.
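    The whole independence test for Table 15-10 fits in one short function. This is a sketch with illustrative names; the critical value 11.345 is copied from the slide rather than computed.

```python
def independence_test(table):
    """Return (chi-square statistic, degrees of freedom, expected table) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # fe = (row total)(column total) / n under the null hypothesis of independence
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(table, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


# Table 15-10: rows = hometown, elsewhere; columns = outstanding, good, fair, unsatisfactory.
table = [[27, 35, 33, 25],
         [13, 15, 27, 25]]

chi2, df, expected = independence_test(table)
critical_value = 11.345          # chi-square table value for d.f. = 3, alpha = 0.01
reject_h0 = chi2 >= critical_value

print(expected)
print(f"chi-square = {chi2:.3f}, d.f. = {df}, reject H0: {reject_h0}")
```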

  • 59

    Exercise.

    • GOF test:
      – Uniform distribution: 19
      – Some specified distribution: 21, 23
      – Poisson distribution: 22 (optional)

    • Independence test: 25, 27

  • 60

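    Bonus (+1%)

    During this year's Spring Festival, Taipei City Bank issued the "Lucky Fruit" instant lottery ticket and claimed a winning rate of 30%.

    1. Xiao-Ming randomly buys n=90 tickets, of which 38 win. At significance level α=5%, is the claim true? p-value = ?

    2. Suppose Xiao-Ming changes the way he investigates: each time he randomly buys 3 tickets and records how many of them win. The results of 30 such random experiments are

    0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3

    At significance level α=5%, is the claim true? Hint: under the claimed winning rate of 30%, let X = the number of winning tickets among three tickets; then

    X ~ Binomial(3, 0.3)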
