
APPLIED STATISTICS WITH HELPFUL DETAILS

Fascinating, eye-opening, and even life-changing explanations presented in a clear and thorough manner

Masayasu AOTANI (青谷正妥)

Kyoto University
Kyoto, Japan

Gampsocleis mikado (ヒガシキリギリス)

2014
No insects, no life!


Applied Statistics

Masayasu AOTANI (青谷正妥)

Fall 2014


Copyright ©1991–2014

Masayasu AOTANI

Permission is granted for personal use only.


Contents

1 What is Statistics?
  1.1 Some Definitions

2 Describing the Data
  2.1 Numerical Measures of Central Tendency
  2.2 Measures of Variability
  2.3 How Standard Deviation Is Interpreted
  2.4 Two Measures of Relative Standing
  Exercises

3 The Normal Distribution
  3.1 Random Variables and Probability Distribution
  3.2 Characteristics of the Normal Distribution
  3.3 The Standard Normal Random Variable z
  Exercises

4 Sampling Distribution
  4.1 Sampling Distribution of the Mean
  Exercises

5 Statistical Inferences Based on a Single Sample
  5.1 Estimation of the Population Mean Based on a Large Sample
    5.1.1 Constructing a Confidence Interval from a Large Sample
    5.1.2 Sample Size Determination
  5.2 Constructing a Confidence Interval from a Small Sample
    5.2.1 Small-Sample Confidence Interval for µ
  Exercises

6 Hypothesis Testing
  6.1 Introduction
  6.2 Two Hypotheses
  6.3 Steps, Errors, and Decision Making
  6.4 Specific Examples
  Exercises

7 Variance, Covariance, and Correlation
  7.1 Population Covariance
  7.2 Population Correlation
  7.3 Sample Covariance
  7.4 Sample Correlation
  7.5 Correlation as Inner Product
  7.6 Variance-Covariance Matrix
  Exercises

8 Analysis of Variance
  8.1 One-Way Analysis of Variance
    8.1.1 Partitioning the Total Sum of Squares
    8.1.2 Partitioning the Degrees of Freedom
    8.1.3 Mean Squares and an F-Statistic
  8.2 Effects of Different Treatments on High Blood Pressure
  Exercises

9 Simple Linear Regression
  9.1 The True Meaning of a Statistical Linear Relationship
  9.2 Sums of Squares, Sxx and Syy, and Sum of Cross Products, Sxy
    9.2.1 Sums of Squares: Sxx and Syy
    9.2.2 Sum of Cross Products: Sxy
  9.3 Principle of Least Squares
  9.4 Correlation, Least Squares, and Explained Variance
    9.4.1 Correlation Coefficient and the Slopes
    9.4.2 Explained Variance and the Coefficient of Determination
    9.4.3 Correlation Coefficient and Standardized Coefficient
  9.5 Interpreting and Using Software Output
    9.5.1 Variables Entered and Removed
    9.5.2 Model Summary
    9.5.3 Analysis of Variance Table
    9.5.4 Table of Coefficients
  9.6 Confidence Intervals for the Estimation of Y
    9.6.1 Confidence Interval for the Population Mean Y
    9.6.2 Confidence Interval for an Individual Y-value
  Exercises

10 Chi-Square Tests: Categorical Data Analysis
  10.1 Goodness-of-Fit Tests
  10.2 Tests of Independence
  Exercises

11 Factor Analysis and Principal Components Analysis
  11.1 Introduction
  11.2 Three Types of Variance: Common, Specific, and Error Variances
  11.3 Principal Components Analysis
    11.3.1 Properties and Principles
    11.3.2 From Correlation Matrix to Components
    11.3.3 Two-Variable Example
    11.3.4 Orthogonal Rotation
    11.3.5 Communalities, Variance, and Covariance
      11.3.5.1 Multivariate Variance
      11.3.5.2 Total Variance in the Data Set
      11.3.5.3 Communalities, Variance, and Covariance
  11.4 Three-Variable Example
    11.4.1 Unrotated Solution
    11.4.2 Rotated Solution
    11.4.3 The Effect of Dropping a Component
  11.5 Multi-Variable Example
    11.5.1 Correlation Matrix
    11.5.2 Stepwise Description of the Procedure
    11.5.3 Interpreting and Using the Computer Output
  11.6 Factor Analysis
    11.6.1 Differences from Principal Components Analysis
    11.6.2 How Does Factor Analysis Work?
    11.6.3 Interpreting and Using the Computer Output
  Exercises

12 Structural Equation Modeling
  12.1 Model Identification
  Exercises

13 Rasch Analysis
  13.1 Dichotomous Rasch Analysis
    13.1.1 Ability and Difficulty: Symmetry and Common Scale
    13.1.2 Rasch Measures as Relative Measures
    13.1.3 Invariance of Differences: An Interval Scale
    13.1.4 The Reference Pair, Abilities, and Difficulties
    13.1.5 Pmi in Terms of Bm − Di
  Exercises

Appendix A Scales of Measurement
Appendix B Chebyshev's Inequality and Markov's Inequality
Appendix C Independence and Uncorrelatedness
Appendix D Properties of Correlation
Appendix E The t Distribution
Appendix F The Chi-Squared Distribution
Appendix G The F-Distribution
Appendix H F-Tests for ANOVA and Linear Regression
  H.1 The F-Test for ANOVA
  H.2 The F-Test for Linear Regression
Appendix I Moment Generating Function
Appendix J Matrix Algebra
Appendix K Different Rotation Schemes
Appendix L Miscellaneous

Bibliography

Answers to Exercises (Chapters 2–13)

Subject Index


List of Tables

6.1 All possible decisions: errors and power

8.1 k Groups of Measurements
8.2 Systolic Blood Pressure after Treatments (Data 1)
8.3 ANOVA Table for Data 1
8.4 Systolic Blood Pressure after Treatments (Data 2)
8.5 ANOVA Table for Data 2
8.6 Descriptive Statistics for Data 2
8.7 Multiple Comparisons
8.8 Reading Comprehension Scores
8.9 ANOVA Table for Reading Comprehension

9.1 Variables Entered/Removed
9.2 Model Summary
9.3 ANOVA
9.4 Coefficients
9.5 Coefficients with 95% Confidence Intervals
9.6 Model Summary
9.7 ANOVA
9.8 Coefficients
9.9 Model Summary
9.10 ANOVA
9.11 Coefficients

10.1 Observed and Expected Cell Counts
10.2 Religion and Conservatism
10.3 Favorite Ice Cream Flavors for Different Age Brackets
10.4 Education and Longevity

11.1 Loadings, Communalities, Variance, and Covariance
11.2 Full Correlation Matrix
11.3 Full Correlation Matrix with a Block Structure
11.4 Communalities for Full Analysis
11.5 Total Variance Explained
11.6 Full Component Matrix
11.7 Reproduced Full Correlation Matrix
11.8 Component Score Coefficient Matrix
11.9 List of Eigenvalues
11.10 Communalities with Three Components
11.11 Component Transformation Matrix
11.12 Component Matrix
11.13 Rotated Component Matrix
11.14 Total Variance Explained
11.15 Component Matrix (≥ 0.4)
11.16 Rotated Component Matrix (≥ 0.4)
11.17 Initial and Final Communalities: Principal Axis Factoring
11.18 Full Correlation Matrix with Converged Communalities on the Diagonal
11.19 Total Variance Explained by Factor Analysis
11.20 Full Factor Matrix
11.21 Factor Score Covariance Matrix
11.22 Reproduced Full Correlation Matrix (PAF)
11.23 Communalities for Full Rotated Analysis
11.24 List of Eigenvalues
11.25 Rotated Component Matrix
11.26 Component Matrix
11.27 Component Score Coefficient Matrix
11.28 Standardized Test Scores for the Fourth Subject
11.29 Full Component Matrix


List of Figures

8.1 Means of Systolic Blood Pressure for Treatments A, B, C, and the Control D

11.1 The Relation Between (X, Y) and (C1, C2) Before Rotation
11.2 The Relation Between (X, Y) and (C′1, C′2) After Rotation
11.3 The Relation Between (X, Y, Z) and (C1, C2, C3) Before Rotation
11.4 The Relation Between (X, Y, Z) and (C1, C2) Before Rotation
11.5 The Relation Between (X, Y, Z) and (C1, C2) After Rotation
11.6 Scree Plot of 12 Eigenvalues
11.7 Scree Plot
11.8 Scree Plot

12.1 The Relation Between Latent and Indicator Variables
12.2 The Full Refined Model with 12 Indicators
12.3 The Full Model with GED
12.4 The Full Model without GED

13.1 Item-Person Map

Chapter 1

What is Statistics?

There are two types of statistics: descriptive statistics and inferential statistics.

• Descriptive statistics focuses on developing a numerical description of some phenomenon in summary form.

• Inferential statistics uses available numerical summaries, usually gained from a small portion of the target dataset, to assess the quantities of interest for the entire dataset and assist decision making.

Our focus will be on decision making using inferential statistics.

1.1 Some Definitions

You should see these formal definitions at least once in your lifetime.

• A population is a set of existing units of interest. (everything)

• A variable is a characteristic or property of each unit in the population.

• A sample is a subset of the population.

• A statistical inference is an estimate, a prediction, or some generalization about the population based on the information gained from a sample. (your educated guess)

Inferential statistics works stepwise as follows.

1. Pick the target population: the population of interest.

2. Decide on the variable(s) to be investigated.

3. Extract your sample units from the population.

4. Study the sample and compute relevant sample statistics.

5. Make inferences about the population based on the information gathered from the sample.

6. Investigate the reliability of your inference using appropriate tools.

Step 4 requires graphical and/or numerical techniques to describe the sample and extract necessary information.

Step 6 requires understanding basic concepts of probability. As it is easiest to grasp these concepts through concrete examples, we will not study probability theory per se. We will pick up the concepts we need as we go along.


Chapter 2

Describing the Data

Both graphical methods and numerical methods are used. A histogram, a frequency diagram, is one typical graphical method that you will encounter often. Other than this diagram, we will focus on numerical methods in this course.

A typical set of data can be described very roughly by its central tendency and variability.

1. The central tendency of a set of data is the tendency of the data to cluster around a certain numerical value. It is where the “peak” is.

2. The variability of the set is how spread out the data points are.

[Figures: distributions with different central tendencies but the same variability, and distributions with the same central tendency but different variabilities]

These are definitely not the same, and it is obvious that we need to specify both the central tendency and the variability in order to describe a distribution. So, we will learn how to measure, numerically, the central tendency first, followed by how to measure the variability of a data set.

2.1 Numerical Measures of Central Tendency


Definition 2.1 The mean of a set of quantitative data is equal to the sum of the measurements divided by the number of measurements contained in the data set. For $\{x_1, x_2, \ldots, x_n\}$, we have

\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{2.1}
\]

Notation

x̄ : the sample mean

µ : the population mean

Definition 2.2 The median of a data set is the middle number or the mean of the middle two numbers when the measurements are arranged in ascending or descending order.

1. If n is odd, the median is the middle number.

2. If n is even, the median is the mean of the middle two numbers.

Consider the set {2, 4, 0, 1, 6}.

In ascending order, we get {0, 1, 2, 4, 6},

while in descending order we get {6, 4, 2, 1, 0}.

Hence, the median is 2.

Now consider the set {2, 4, 0, 1, 3, 6}.

In ascending order, we get {0, 1, 2, 3, 4, 6},

while in descending order we get {6, 4, 3, 2, 1, 0}.

Hence, the median is (2 + 3)/2 = 2.5.

Comparing the Mean with the Median

median < mean : If the median is less than the mean, the data set is skewed to the right.

mean < median : If the median is greater than the mean, the data set is skewed to the left.

a symmetric data set : The median and the mean are the same when the data set is symmetric.

A sociologist, I believe, once said, “Don’t worry if you are below the average, because half the population is below the average anyway.” However, unless the distribution is symmetric, the correct statement would be: “Don’t worry if you are below the median, because half the population is below the median anyway.”

Definition 2.3 The mode(s) is (are) the measurement(s) that occurs (occur) with the greatest frequency in the data set.

As is obvious from the definition, there can be more than one mode. For example, the mode of the set

{1, 5, 6, 4, 8, 1}

is 1, while the modes of the set

{1, 5, 6, 4, 8, 1, 5}

are 1 and 5.
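These measures are quick to verify with Python's standard library; the following is just an illustrative sketch using the sets above:

```python
import statistics

print(statistics.median([2, 4, 0, 1, 6]))     # 2 (odd n: middle number)
print(statistics.median([2, 4, 0, 1, 3, 6]))  # 2.5 (even n: mean of middle two)

# multimode returns every value occurring with the greatest frequency
print(statistics.multimode([1, 5, 6, 4, 8, 1]))     # [1]
print(statistics.multimode([1, 5, 6, 4, 8, 1, 5]))  # [1, 5]
```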


2.2 Measures of Variability

Definition 2.4 The range of a data set is equal to the largest measurement minus the smallest measurement.

For {21, 33, . . . , 95}, where the data are arranged in ascending order, the range = 95 − 21 = 74.

Definition 2.5 The sample variance s² for a sample of n measurements is equal to the sum of the squared distances from the mean divided by n − 1.

\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \tag{2.2}
\]

In (2.2), the numerator is called the sum of squares of deviations from the mean, and the denominator is called the degrees of freedom.

So, for the set {0, 1, 2, 3, 4}, where x̄ = 2 and n = 5, the sample variance is given by

\[
s^2 = \frac{\sum_{i=1}^{5} (x_i - 2)^2}{5 - 1}
    = \frac{(0-2)^2 + (1-2)^2 + (2-2)^2 + (3-2)^2 + (4-2)^2}{5 - 1}
    = \frac{10}{4} = 2.5.
\]

Definition 2.6 The sample standard deviation, s, is defined as the positive square root of the sample variance s².

\[
s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \tag{2.3}
\]


For our example, s = √2.5 ≈ 1.6.
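For reference, a minimal sketch checking these values with the standard library; statistics.variance and statistics.stdev both use the n − 1 divisor of Definition 2.5:

```python
import statistics

data = [0, 1, 2, 3, 4]
print(statistics.variance(data))  # 2.5
print(statistics.stdev(data))     # 1.5811..., i.e., sqrt(2.5), about 1.6
```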

Note 2.1 The population variance, denoted by the symbol σ² (sigma squared), is the average of the squared distances of the measurements of all units in the population from the population mean µ. The population standard deviation is the positive square root of the population variance σ². Needless to say, we cannot compute µ, σ, and σ² directly: this would amount to examining every single member of the population, however large. The goal of inferential statistics is to avoid such a hassle by strategically guessing the population parameters based on a sample.

\[
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \tag{2.4}
\]

\[
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \tag{2.5}
\]

Let us stop here for a moment and take stock of the terms and symbols.

Sample variance: s²
Sample standard deviation: s
Population variance: σ²
Population standard deviation: σ

In order to place variance in a larger context, we will now introduce the concept of the expected value or expectation value.

Definition 2.7 The expected value E[X] of a discrete random variable X is defined by

\[
E[X] = \sum_{i=1}^{n} x_i p_i, \tag{2.6}
\]

where pᵢ is the probability associated with xᵢ, often denoted by P(xᵢ). The expected value E[X] of a continuous random variable X is defined by

\[
E[X] = \int_{-\infty}^{+\infty} x f(x)\, dx, \tag{2.7}
\]

where f(x) dx is the probability of finding x in the infinitesimally small interval (x, x + dx); that is, f(x) is the probability density function for X.

With these definitions, it is straightforward to prove the following properties of the expected value. Here, X and Y are any random variables, and a, b, and c are any numbers.¹

1. E[c] = c

2. X ≤ Y ⟹ E[X] ≤ E[Y]

3. E[X + c] = E[X] + c

4. E[X + Y] = E[X] + E[Y]

5. E[aX] = aE[X]

6. E[aX + bY + c] = aE[X] + bE[Y] + c (linearity: this is a consequence of 1, 3, 4, and 5.)
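Property 6 (linearity) can be confirmed numerically with a small simulation; the sketch below uses distributions and constants chosen arbitrarily, purely for illustration:

```python
import random

random.seed(2)
n = 200_000
X = [random.uniform(0, 1) for _ in range(n)]  # E[X] = 0.5
Y = [random.gauss(3, 2) for _ in range(n)]    # E[Y] = 3
a, b, c = 2.0, -1.0, 5.0

mean = lambda v: sum(v) / len(v)
lhs = mean([a * x + b * y + c for x, y in zip(X, Y)])  # E[aX + bY + c]
rhs = a * mean(X) + b * mean(Y) + c                    # aE[X] + bE[Y] + c
print(lhs, rhs)  # both close to 2(0.5) - 3 + 5 = 3.0
```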

Let us look at the formula for the population variance again.

\[
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \sum_{i=1}^{N} (x_i - \mu)^2 \frac{1}{N}
\]

When each of the N units of the population has the same probability of occurring, pᵢ = 1/N for i = 1, 2, . . . , N. So, $\sum_{i=1}^{N} (x_i - \mu)^2 \frac{1}{N}$ is of the form found in Definition 2.7. Therefore, the variance is the expected value of (X − µ)². We sometimes write Var[X] to denote the variance of X and express this fact as below.

\[
\mathrm{Var}[X] = E[(X - \mu)^2] \tag{2.8}
\]

Similarly,

\[
\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \sum_{i=1}^{N} x_i \frac{1}{N},
\]

and this is again of the form given in Definition 2.7. We have

\[
\mu = E[X]. \tag{2.9}
\]

¹Another property of E often encountered is E[XY] = E[X]E[Y] if the random variables X and Y are independent:

\[
E[XY] = \int\!\!\int xy\, f(x) g(y)\, dy\, dx = \left[\int x f(x)\, dx\right]\left[\int y g(y)\, dy\right] = E[X]E[Y],
\]

where f(x)g(y) is the joint probability density function of X and Y.

Both µ = E[X] and Var[X] = E[(X − µ)²] hold for continuous random variables as well, with summation replaced by an integral. In addition to

\[
E[X] = \int_{-\infty}^{+\infty} x f(x)\, dx, \tag{2.10}
\]

we also have

\[
\mathrm{Var}[X] = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\, dx, \tag{2.11}
\]

where f(x) is the probability density function of X.

Proposition 2.1 The following equality holds between Var and E.

\[
\mathrm{Var}[X] = E[X^2] - (E[X])^2 \tag{2.12}
\]

Proof

\[
\mathrm{Var}[X] = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2
= E[X^2] - 2\mu \cdot \mu + \mu^2 = E[X^2] - \mu^2 = E[X^2] - (E[X])^2
\]
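A quick numerical check of Proposition 2.1 on a small population (an illustration only):

```python
xs = [0, 1, 2, 3, 4]                # treat as the whole population
N = len(xs)
mu = sum(xs) / N
var_direct = sum((x - mu) ** 2 for x in xs) / N    # E[(X - mu)^2]
var_moment = sum(x * x for x in xs) / N - mu ** 2  # E[X^2] - (E[X])^2
print(var_direct, var_moment)  # 2.0 2.0
```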

Properties of Variance
Shown below are some properties of variance. Here, X, Y, Xᵢ, Yⱼ are all random variables and a, b, aᵢ, and bⱼ are constants. These apply to both the sample variance and the population variance.

1. Var(X) = 0 if and only if X = a; i.e., X is a constant.

2. Var(X + a) = Var(X)

3. Var(aX) = a²Var(X)

4. Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y).
In particular, Var(X ± Y) = Var(X) + Var(Y) ± 2Cov(X, Y).
(With this and the following properties, we are ahead of ourselves, as the notion of covariance Cov, a generalization of Var because Var(X) = Cov(X, X) as Property 5 below claims, will be discussed later in Chapter 7. Just glance through the rest of the properties for now.)

5. Var(X) = Cov(X, X)

6. In general,
\[
\mathrm{Var}\left(\sum_{i=1}^{N} a_i X_i\right) = \sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j \mathrm{Cov}(X_i, X_j) = \sum_{i=1}^{N} a_i^2 \mathrm{Var}(X_i) + \sum_{i \neq j} a_i a_j \mathrm{Cov}(X_i, X_j) = \sum_{i=1}^{N} a_i^2 \mathrm{Var}(X_i) + 2\sum_{i<j} a_i a_j \mathrm{Cov}(X_i, X_j).
\]
And, in particular,
\[
\mathrm{Var}\left(\sum_{i=1}^{N} X_i\right) = \sum_{i,j=1}^{N} \mathrm{Cov}(X_i, X_j) = \sum_{i=1}^{N} \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i, X_j).
\]

7. When random variables X₁, X₂, . . . , Xₙ are pairwise uncorrelated, Cov(Xᵢ, Xⱼ) = 0 for i ≠ j, and
\[
\mathrm{Var}\left(\sum_{i=1}^{N} a_i X_i\right) = \sum_{i=1}^{N} a_i^2 \mathrm{Var}(X_i).
\]

There is an alternative formula for the sample variance.

\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \tag{2.13}
\]

Let us use this formula to compute s² for the same sample that was considered before; namely, {0, 1, 2, 3, 4}.

\[
s^2 = \frac{\sum_{i=1}^{5} x_i^2 - \frac{\left(\sum_{i=1}^{5} x_i\right)^2}{5}}{5 - 1}
= \frac{0^2 + 1^2 + 2^2 + 3^2 + 4^2 - \frac{(0+1+2+3+4)^2}{5}}{5 - 1}
= \frac{30 - \frac{100}{5}}{4} = \frac{30 - 20}{4} = \frac{10}{4} = 2.5
\]

This checks.
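The agreement of (2.2) and (2.13) is also easy to confirm in code; a minimal sketch:

```python
data = [0, 1, 2, 3, 4]
n = len(data)
xbar = sum(data) / n

# Definitional formula (2.2)
s2_def = sum((x - xbar) ** 2 for x in data) / (n - 1)

# Alternative formula (2.13)
s2_alt = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(s2_def, s2_alt)  # 2.5 2.5
```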

Verification of the Alternative Formula
We only need to show

\[
\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.
\]

But,

\[
\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} 2 x_i \bar{x} + \sum_{i=1}^{n} \bar{x}^2,
\]

and we need only to show

\[
-\sum_{i=1}^{n} 2 x_i \bar{x} + \sum_{i=1}^{n} \bar{x}^2 = -\frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.
\]

Indeed,

\[
-\sum_{i=1}^{n} 2 x_i \bar{x} + \sum_{i=1}^{n} \bar{x}^2 = -2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 = -2\bar{x}\, n \left(\frac{\sum_{i=1}^{n} x_i}{n}\right) + n\bar{x}^2 = -2\bar{x} \cdot n\bar{x} + n\bar{x}^2
\]
\[
= -2n\bar{x}^2 + n\bar{x}^2 = -n\bar{x}^2 = -n\left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2 = -\frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.
\]

Done!

Note 2.2 The divisor for the sample variance s² and the sample standard deviation s is (n − 1) instead of n, known as Bessel's correction, because the use of n tends to underestimate the population variance σ² and the population standard deviation σ. The xᵢ's tend to be closer to their average x̄ than to the population mean µ. So, to compensate for this, the divisor n − 1 is used rather than n. It is customary to refer to s² as being based on n − 1 degrees of freedom (df). This terminology results from the following fact. Although s² is based on the n quantities x₁ − x̄, x₂ − x̄, . . . , xₙ − x̄, these sum to 0, and specifying the values of any n − 1 of the quantities determines the remaining one value [Devore and Berk, 2012, p. 35]. In mathematical statistics, the s² with a divisor (n − 1) is called an unbiased estimator of σ².

In order to explain what an “unbiased estimator” is, we need to define both an “estimator” and “bias”.

Definition 2.8 An estimator is a statistic, i.e., a function of the data, that is used to infer the value of an unknown parameter such as a population mean or a population variance.

It is customary to denote the estimator of θ by $\hat{\theta}$.

Definition 2.9 The bias of an estimator $\hat{\theta}$, sometimes denoted by Bias[$\hat{\theta}$], is defined as follows.

\[
\mathrm{Bias}[\hat{\theta}] = E[\hat{\theta}] - \theta \;\; (= E[\hat{\theta} - \theta]) \tag{2.14}
\]

The last equality holds as θ is a constant.


Definition 2.10 An unbiased estimator is an estimator $\hat{\theta}$ for which

\[
\mathrm{Bias}[\hat{\theta}] = E[\hat{\theta}] - \theta = 0 \quad \text{or} \quad E[\hat{\theta}] = \theta. \tag{2.15}
\]

Example 2.1 A good example of an unbiased estimator is the sample mean x̄ for the population mean µ.

\[
E[\bar{X}] = E\left[\frac{\sum_{i=1}^{n} X_i}{n}\right] = \frac{1}{n} E\left[\sum_{i=1}^{n} X_i\right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \sum_{i=1}^{n} \mu = \frac{1}{n} \cdot n\mu = \mu
\]

Whereas, a good example of a biased estimator is

\[
\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}
\]

for σ². In fact, we can show

\[
E\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}\right] = \frac{n-1}{n}\sigma^2, \tag{2.16}
\]

and this is the reason why we use

\[
\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
\]

as the estimator for σ². This estimator is unbiased because

\[
E\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}\right] = \frac{n-1}{n}\sigma^2 \iff E\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}\right] = \sigma^2. \tag{2.17}
\]

In order to prove the last equality, we will use the alternative formula for the variance, Proposition 2.1, and the following Fact.²

Fact 2.1 When random variables X₁, X₂, . . . , Xₙ are pairwise uncorrelated, then, for real numbers a₁, a₂, . . . , aₙ,

\[
\mathrm{Var}\left[\sum_{i=1}^{n} a_i X_i\right] = \sum_{i=1}^{n} a_i^2 \mathrm{Var}[X_i]. \tag{2.18}
\]

In particular, if X₁, X₂, . . . , Xₙ are independent³, the above equality holds, as independence implies there is no pairwise correlation.

²This is stated as a fact here, but its proof will be given when we discuss the notions of covariance and correlation later in this book.

³For the record, lack of pairwise correlation does not imply independence.


Proof that s² = (∑ᵢ₌₁ⁿ (xᵢ − x̄)²)/(n − 1) is unbiased

\[
E\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}\right]
= E\left[\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}\right]
= \frac{1}{n-1} E\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]
\]
\[
= \frac{1}{n-1}\left(E\left[\sum_{i=1}^{n} x_i^2\right] - E\left[\frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\right)
= \frac{1}{n-1}\left(\sum_{i=1}^{n} E[x_i^2] - \frac{1}{n} E\left[\left(\sum_{i=1}^{n} x_i\right)^2\right]\right)
\]

Now, from Proposition 2.1,

\[
E[x_i^2] = \mathrm{Var}[x_i] + (E[x_i])^2 = \sigma^2 + \mu^2,
\]

and

\[
E\left[\left(\sum_{i=1}^{n} x_i\right)^2\right] = \mathrm{Var}\left[\sum_{i=1}^{n} x_i\right] + \left(E\left[\sum_{i=1}^{n} x_i\right]\right)^2 = \sum_{i=1}^{n} \mathrm{Var}[x_i] + \left(\sum_{i=1}^{n} E[x_i]\right)^2 = n\sigma^2 + (n\mu)^2 = n\sigma^2 + n^2\mu^2.
\]

Therefore,

\[
\frac{1}{n-1}\left(\sum_{i=1}^{n} E[x_i^2] - \frac{1}{n} E\left[\left(\sum_{i=1}^{n} x_i\right)^2\right]\right)
= \frac{1}{n-1}\left(n(\sigma^2 + \mu^2) - \frac{1}{n}(n\sigma^2 + n^2\mu^2)\right)
\]
\[
= \frac{1}{n-1}\left(n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2\right) = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2,
\]

which is the desired result.
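The bias corrected by the n − 1 divisor also shows up clearly in simulation. The sketch below (an illustration with an arbitrary seed and sample size) draws many samples of size n = 5 from a population with σ² = 1 and averages both estimators:

```python
import random

random.seed(0)
n, trials = 5, 100_000
sum_biased = sum_unbiased = 0.0

for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(n)]  # population: mu = 0, sigma^2 = 1
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)      # sum of squared deviations
    sum_biased += ss / n                        # divisor n: biased
    sum_unbiased += ss / (n - 1)                # divisor n - 1: unbiased

print(sum_biased / trials)    # near (n - 1)/n * sigma^2 = 0.8
print(sum_unbiased / trials)  # near sigma^2 = 1.0
```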

2.3 How Standard Deviation Is Interpreted

There is a probabilistic theorem that applies to completely arbitrary distributions.

Theorem 2.1 (Chebyshev’s Theorem) For any real number k > 1, at least (1 − 1/k²) of the measurements will fall within (x̄ − ks, x̄ + ks); i.e., within k standard deviations of the mean.


And, this has the following implications.

1. At least 3/4 of the measurements will fall within (x̄ − 2s, x̄ + 2s).

2. At least 8/9 of the measurements will fall within (x̄ − 3s, x̄ + 3s).

3. It is possible that none of the measurements will fall within the interval (x̄ − s, x̄ + s).
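Chebyshev's bound can be checked against any sample whatsoever. The following sketch (an illustration with an arbitrarily chosen, strongly skewed distribution) compares the observed fractions with the guaranteed 1 − 1/k²:

```python
import random
import statistics

random.seed(1)
data = [random.expovariate(1.0) for _ in range(10_000)]  # strongly skewed sample
xbar, s = statistics.mean(data), statistics.stdev(data)

for k in (2, 3):
    inside = sum(xbar - k * s < x < xbar + k * s for x in data) / len(data)
    print(k, round(inside, 4), ">=", 1 - 1 / k**2)  # observed vs Chebyshev bound
```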

Furthermore, there are empirical rules that apply to a mound-shaped frequency distribution that is approximately symmetric. These rules are only empirical but provide sharper bounds.

Empirical Rules

1. Approximately 68% of the measurements will fall within the interval (x̄ − s, x̄ + s).

2. Approximately 95% of the measurements will fall within the interval (x̄ − 2s, x̄ + 2s).

3. Essentially all the measurements will fall within the interval (x̄ − 3s, x̄ + 3s).

These rules are quite robust and apply to many mound-shaped distributions which are not extremely skewed.

2.4 Two Measures of Relative Standing

Definition 2.11 The pth percentile is the value Q for which p% of the data are less than Q. Or, equivalently, (100 − p)% of the data are greater than Q.

Definition 2.12 The sample z-score for a measurement x is given by its signed distance from the mean divided by the standard deviation s.

\[
z = \frac{x - \bar{x}}{s} \tag{2.19}
\]


Similarly, the population z-score for a measurement x is given by

\[
z = \frac{x - \mu}{\sigma}. \tag{2.20}
\]

The z-scores defined above have a distribution with mean z̄ (or µz) = 0 and standard deviation sz (or σz) = 1. Let us verify this for a sample of size n.

For each sample {x₁, x₂, . . . , xₙ₋₁, xₙ}, we have the corresponding set of z-scores

\[
\{z_1, z_2, \ldots, z_{n-1}, z_n\} = \left\{\frac{x_1 - \bar{x}}{s}, \frac{x_2 - \bar{x}}{s}, \ldots, \frac{x_{n-1} - \bar{x}}{s}, \frac{x_n - \bar{x}}{s}\right\}.
\]

It is the mean and the standard deviation of the last set that we are interested in. We will compute the mean z̄ first.

\[
\bar{z} = \frac{\sum_{i=1}^{n} z_i}{n} = \frac{\sum_{i=1}^{n} \frac{x_i - \bar{x}}{s}}{n} = \frac{1}{s}\left(\frac{\sum_{i=1}^{n} x_i}{n} - \frac{\sum_{i=1}^{n} \bar{x}}{n}\right) = \frac{1}{s}\left(\bar{x} - \frac{n\bar{x}}{n}\right) = 0.
\]

Next, we will show the standard deviation sz is 1, which is equivalent to showing that the variance sz² is one.

\[
s_z^2 = \frac{\sum_{i=1}^{n} (z_i - \bar{z})^2}{n - 1} = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s} - 0\right)^2}{n - 1} = \frac{1}{s^2}\,\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{s^2}\, s^2 = 1
\]

Likewise, for a population of size N .
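For a concrete sample, the claim is easy to confirm; a minimal sketch (the data set here is arbitrary):

```python
import statistics

data = [6, 3, 5, 4, 7, 4, 5, 6]
xbar, s = statistics.mean(data), statistics.stdev(data)
z = [(x - xbar) / s for x in data]

print(round(statistics.mean(z), 12))   # 0.0 (up to floating-point rounding)
print(round(statistics.stdev(z), 12))  # 1.0
```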

Our “Empirical Rules” that apply to mound-shaped, roughly symmetric distributions can now be restated as below.

Empirical Rules: z-Score Version

1. Approximately 68% of the measurements will fall within the interval (−1, +1).

2. Approximately 95% of the measurements will fall within the interval (−2, +2).

3. Essentially all the measurements will fall within the interval (−3, +3).


Example 2.2 The average salary µ of Kyoto University professors is $72,000 annually, and the standard deviation σ is $4,000. As Professor Aotani’s salary is $60,000, while he definitely deserves at least five times that amount, the z-score is

\[
z = \frac{60 - 72}{4} = -3.
\]

What does this mean? How should we interpret this?

Interpretation: Because the salaries have a mound-shaped and more or less symmetric distribution, we can use the rules of thumb presented above. The third empirical rule states: “Essentially all the measurements will fall within the interval (−3, +3).” So, the annual salary of $60,000 is extremely low, with almost everyone else getting a higher salary.

Question
A sample of two hundred businesspersons was selected randomly to investigate how much they spend for lunch on a typical day. It was found that the mean was 480 yen and the standard deviation was 40 yen. Suppose that John Smith spends 560 yen on lunch. How would you interpret this amount, assuming the distribution is mound-shaped and approximately symmetric?

Answer
John’s z-score is

\[
z = \frac{x - \bar{x}}{s} = \frac{560 - 480}{40} = 2.
\]

We know 95% of the measurements will be in (−2, +2). So, 5% will be outside this interval, either smaller than −2 or greater than +2. By symmetry, 2.5% will be in the interval [+2, ∞). Hence, we know that about 2.5% of the businesspersons spend more than John Smith (more than 560 yen) on lunch and that about 97.5% of them spend less than John does.


Exercises
Click to go to solutions.

1. For a sample of 16 students at a high school cafeteria, the following sales amounts arranged in ascending order of magnitude are observed: 3.05, 3.10, 3.20, 3.25, 3.35, 3.35, 3.50, 3.75, 3.90, 4.15, 4.25, 4.55, 4.70, 5.00, 5.10. Determine the mean, the median, and the mode(s) for these amounts.

2. The number of cars sold by each of the 10 salespersons working for the same dealership in a month, arranged in ascending order, is: 3, 5, 7, 9, 10, 11, 12, 13, 14, 16.

(a) Determine the variance and the standard deviation.
(b) What if this was a sample of 10 salespersons out of 60 working for the same dealership?

3. Ten students took an examination whose possible high was 10. The score distribution was as follows.

{1, 2, 2, 3, 4, 4, 5, 6, 6, 7}

(a) What is the median?
(b) What is (are) the mode(s)?
(c) Determine the mean.
(d) Write the formula for the standard deviation with the numbers given above. You do not need to compute it.

4. Ten students took an examination whose possible high was 10. The score distribution was as follows.

{1, 2, 3, 3, 4, 4, 5, 5, 6, 7}

(a) What is the median?
(b) What is (are) the mode(s)?
(c) Determine the mean.
(d) Write the full formula for the standard deviation with the numbers given above. You do not need to compute it.


5. Ten students took an examination whose possible high was 10. The score distribution was as follows.

{1, 2, 2, 3, 4, 4, 5, 6, 6, 7}

(a) What is the median?
(b) What is (are) the mode(s)?
(c) Write the full formula for the variance with the numbers given above. You do not need to compute it. (Note: The mean is 4.)

6. Ten students took an examination whose possible high was 10. The score distribution was as follows.

{1, 2, 3, 3, 4, 4, 5, 5, 6, 7}

(a) What is the median?
(b) What is (are) the mode(s)?
(c) Determine the mean.
(d) Write the full formula for the standard deviation with the numbers given above. You do not need to compute it.

7. Ten students took an examination whose possible high was 10. The score distribution was as follows.

{1, 2, 2, 3, 4, 4, 5, 6, 6, 7}

(a) What is the median?
(b) What is (are) the mode(s)?
(c) Determine the mean.
(d) Write the formula for the standard deviation with the numbers given above. You do not need to compute it.

8. Ten students took an examination whose possible high was 10. The score distribution was as follows.

{1, 2, 2, 3, 4, 4, 5, 6, 6, 7}

(a) What is the median?
(b) What is (are) the mode(s)?
(c) Determine the mean.
(d) Write the formula for the standard deviation with the numbers given above. You do not need to compute it.

9. Ten students took an examination whose possible high was 10. The score distribution was as follows.

{1, 2, 3, 3, 4, 4, 5, 5, 6, 7}

(a) What is the median?
(b) What is (are) the mode(s)?
(c) Determine the mean.
(d) Write the full formula for the standard deviation with the numbers given above. You do not need to compute it.

10. Ten students took an examination whose possible high was 10. The score distribution was as follows. Regard this as the entire population.

{1, 2, 2, 3, 4, 4, 5, 6, 6, 7}

(a) What is the median?
(b) What is (are) the mode(s)?
(c) Determine the mean.
(d) Write the formula for the standard deviation with the numbers given above. You do not need to compute it.

11. A large number of people tossed a coin ten times each. The numbers of heads for a sample of eight people are shown below.

{6, 3, 5, 4, 7, 4, 5, 6}

(a) What is the mean?
(b) Write down the expression to compute the variance for the given sample. You do not have to carry out an actual computation.


Chapter 3

The Normal Distribution

3.1 Random Variables and Probability Distribution

In this book, we are mainly interested in the value of some numerical variable X which takes different values x with different probabilities f(x). In statistics, such a variable is called a random variable.

Example 3.1 Let X be the result of a roll of a die. Then, x = 1, 2, 3, 4, 5, or 6, and f(x) = 1/6 for all x. Note that X cannot take any value between 1 and 2, 2 and 3, 3 and 4, 4 and 5, and 5 and 6. Such X is called a discrete random variable. In this case, you can count the number of possible values X takes.

Example 3.2 Let X be the height of the students at some university. Then, x ranges from the height of the shortest student to the height of the tallest student. For example, it may be in the range [151.2, 194.5] in centimeters. In this case, it is not clear with what probability each height measurement will be encountered. We will typically draw a frequency diagram, a histogram, to see how the heights are distributed and how likely it is that each height appears in the measurement. More on this shortly. Note that one’s height can in principle be any number such as 172.245, 168.2121, and 180.000 centimeters. This is in stark contrast to the rolling of a die, and a random variable like this is called a continuous random variable. In this case, there are uncountably many values X can take.


As explained by the two examples above, a random variable can be discrete or continuous. But we will mostly deal with continuous random variables. When we have a continuous random variable, the frequency diagram, or histogram, becomes a smooth curve, f(x), as opposed to a bar chart. Such a curve is called a probability distribution. Its height tells us how likely it is to observe the corresponding x-value.

More precisely, the probability of observing x in the range [a, b], denoted by P([a, b]), is given by the area under the curve over the interval [a, b]. That is,

\[
P([a, b]) = \int_{a}^{b} f(x)\, dx. \tag{3.1}
\]

Needless to say, the curve should be so drawn, by rescaling, that

\[
\int_{-\infty}^{+\infty} f(x)\, dx = 1. \tag{3.2}
\]

The description of the example of a continuous random variable above states, “it is not clear with what probability each height measurement will be encountered.” In fact, the probability of finding exactly x = a is zero, as there are infinitely many possible values of x. This can also be understood, at least formally, based on the following definite integral:

\[
\text{“the probability of finding exactly } x = a\text{”} = P([a, a]) = \int_{a}^{a} f(x)\, dx = 0.
\]

Remark 3.1 One consequence of the above is

\[
P([a, b]) = P((a, b]) = P([a, b)) = P((a, b)). \tag{3.3}
\]

In other words, inclusion or exclusion of one point does not change anything when we have a continuous random variable.

What is the µ of a continuous random variable x?
Recall, for a discrete random variable, we had

\[
\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}. \tag{3.4}
\]

What we are doing here is actually multiplying each value xᵢ by the probability pᵢ (= 1/N in this case) associated with xᵢ and adding them together. In the continuous case, we replace xᵢ by x, pᵢ by f(x) dx, and the sum by an integral, generating

\[
\mu = \int_{-\infty}^{+\infty} x f(x)\, dx. \tag{3.5}
\]

Remark 3.2 µ is also called the expectation value of X and is denoted by E(X) or E[X].

What is the standard deviation σ or the variance σ² of a continuous random variable x?
Recall, for a discrete random variable, we had

\[
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \tag{3.6}
\]

\[
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}. \tag{3.7}
\]

Resorting to the same substitutions as above,

\[
x_i \to x \tag{3.8}
\]
\[
\frac{1}{N} \to f(x)\, dx \tag{3.9}
\]
\[
\sum_{i=1}^{N} \to \int_{-\infty}^{+\infty}, \tag{3.10}
\]

we get

\[
\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\, dx. \tag{3.11}
\]

This is the expectation value of (x − µ)², denoted by E[(X − µ)²].


We will focus our attention on one particular probability distribution known as the normal (probability) distribution. A random variable whose probability distribution is normal is called a normal variable. Hence,

the normal distribution f(x) = the probability distribution for a normal variable x. (3.12)

The normal distribution is one of the most useful and most frequently encountered distributions, with the following characteristics.

3.2 Characteristics of the Normal Distribution

Characteristics of the Normal Distribution

1. Continuous random variable

2. A bell-shaped curve with one hump

3. Extending indefinitely to ±∞, approaching, but never touching, the horizontal axis

4. Mean = Median = Mode

5. Symmetrical with respect to the mean

6. The distance between the two points of inflection is 2σ.

7. The total area under the normal curve is equal to 1.

In fact, the normal distribution is a collective name for a family of probability distributions. Depending on the central tendency µ and variability σ, the shapes of the curves are different. But the formula is common to all of them:

\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(1/2)[(x-\mu)/\sigma]^2} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \tag{3.13}
\]

where µ is the mean of the normal random variable x and σ² is the variance of the normal random variable x. Also recall π = 3.1416 . . . and e = 2.71828 . . ..
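As a numerical sanity check (not part of the original text), the density (3.13) can be summed over a wide interval and compared with 1; the µ and σ below are arbitrary choices:

```python
import math

def normal_pdf(x, mu, sigma):
    """The density of equation (3.13)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# crude midpoint Riemann sum over mu +/- 10 sigma
mu, sigma, steps = 50.0, 15.0, 100_000
lo, hi = mu - 10 * sigma, mu + 10 * sigma
dx = (hi - lo) / steps
total = sum(normal_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx
print(total)  # approximately 1.0
```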


This sameness becomes even clearer when we use the standard normal random variable z.

3.3 The Standard Normal Random Variable z

Definition 3.1 The standard normal random variable z is defined by the formula

\[
z = \frac{x - \mu}{\sigma}, \tag{3.14}
\]

where x is a normal random variable with mean µ and standard deviation σ.

You might think this is just a repetition of our previous definition, Definition 2.12. However, our previous definition was for a discrete random variable x, and we need a continuous-variable counterpart. Only with the clarifications about the mean µ, the variance σ², and the standard deviation σ for continuous random variables given in Section 3.1 does the meaning of Definition 3.1 for z become clear.

Since these definitions take mathematically different forms from the discrete case, we need to verify again that µz, the mean of z, is 0, and the standard deviation σz of z is 1.

\[
\mu_z = E[Z] = \int_{-\infty}^{+\infty} \frac{x - \mu}{\sigma}\, f(x)\, dx = \frac{1}{\sigma}\left[\int_{-\infty}^{+\infty} x f(x)\, dx - \int_{-\infty}^{+\infty} \mu f(x)\, dx\right] = \frac{1}{\sigma}[\mu - \mu] = 0 \tag{3.15}
\]

Note here that $\int_{-\infty}^{+\infty} x f(x)\, dx = \mu$ by definition, and $\int_{-\infty}^{+\infty} f(x)\, dx = 1$ as it is the probability of x taking some value in (−∞, +∞).

\[
\sigma_z^2 = E[(Z - \mu_z)^2] = \int_{-\infty}^{+\infty} (z - \mu_z)^2 f(x)\, dx = \int_{-\infty}^{+\infty} z^2 f(x)\, dx
= \int_{-\infty}^{+\infty} \left(\frac{x - \mu}{\sigma}\right)^2 f(x)\, dx = \frac{1}{\sigma^2}\int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\, dx = \frac{1}{\sigma^2}\, \sigma^2 = 1, \tag{3.16}
\]

where $\int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\, dx = \sigma^2$ by definition. So, we indeed have µz = 0 and σz² = σz = 1.


Here is the situation.

• x is a normal random variable with mean µ, standard deviation σ, and the probability distribution given by

\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(1/2)[(x-\mu)/\sigma]^2} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \tag{3.17}
\]

• z = (x − µ)/σ is the normal random variable whose mean µz is 0 and standard deviation σz is 1. This standardized variable z represents the number of standard deviations between x and µ, as is obvious from the defining formula. As the name suggests, z is normally distributed with

\[
f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}. \tag{3.18}
\]

This last expression is obtained when we substitute µ = 0 and σ = 1 into f(x) and replace x with z.

The point of all this is that we have the same formula for the probability distribution of any normally distributed random variable x upon converting to z. This is important because it makes a unified treatment of all normally distributed random variables possible, as we will see shortly.

The Meaning of the Curve f(z)
Simply put, the area under the curve gives the probability. Recall that f(x) dx expresses the probability of finding a measurement of the random variable x in the interval [x, x + dx]. For this reason, the function f(x) is nonnegative for all x, and $\int_{-\infty}^{+\infty} f(x)\, dx = 1$.

For example, the area of the region bounded by the curve f(z), the line segment of the z-axis [−1, +1], and two vertical lines drawn through z = −1 and z = +1 is the probability of finding the value of z in the range [−1, +1]. Incidentally, we already know that this is 0.68.


Similarly, the area under f(z) over the interval [0, 2] gives the probability of finding z in the range [0, 2].

The general formula for the probability of observing z in the range [a, b] is

\[
P(a \le z \le b) = \int_{a}^{b} f(z)\, dz = \int_{a}^{b} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}\, dz = \frac{1}{\sqrt{2\pi}} \int_{a}^{b} e^{-\frac{z^2}{2}}\, dz. \tag{3.19}
\]

Recall one more time that P(a ≤ x ≤ b) = P(a ≤ x < b) = P(a < x ≤ b) = P(a < x < b) for any continuous random variable.

The problem is that this integral has no closed form and can only be computed numerically. Hence, we need to use a table.

Here are some illustrative examples.

Example 3.3

1. P(0 < z < 1) = 0.3413

2. P(−1 < z < 1) = 2P(0 < z < 1) (by symmetry) = 2 × 0.3413 = 0.6826

3. What is the probability that z exceeds 1.64? ⟹ P(1.64 < z) = P(0 < z) − P(0 < z < 1.64) = 0.5 − 0.4495 = 0.0505

4. Find the probability that a normal random variable lies more than 1.96 standard deviations from its mean in either direction.

\[
P(|z| > 1.96) = P(z < -1.96 \text{ or } 1.96 < z) = 2P(1.96 < z) = 2(0.5 - P(0 < z < 1.96)) = 2 \times (0.5 - 0.4750) = 2 \times 0.0250 = 0.05 \text{ or } 5\%
\]
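Today a statistics library can play the role of the z-table. The sketch below reproduces the four results with SciPy (assuming SciPy is available; norm.cdf(z) is the area under f(z) to the left of z):

```python
from scipy.stats import norm

print(norm.cdf(1) - norm.cdf(0))   # 1. P(0 < z < 1)  = 0.3413...
print(norm.cdf(1) - norm.cdf(-1))  # 2. P(-1 < z < 1) = 0.6826...
print(1 - norm.cdf(1.64))          # 3. P(z > 1.64)   = 0.0505...
print(2 * (1 - norm.cdf(1.96)))    # 4. P(|z| > 1.96) = 0.0499..., about 0.05
```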

More generally, we need to convert the random variable x at hand to a standard random variable z first in order to make use of the z-table. The following word problem is such an example.


Example 3.4 Assume that the length of time, x, between charges of a new generation smart phone is normally distributed with a mean of 50 hours and a standard deviation of 15 hours. Find the probability that the battery will last between 30 and 70 hours between charges.

Answer
We will step into the Z-space from the X-space using µ = 50 and σ = 15.

\[
z_{x=30} = \frac{30 - 50}{15} = -1.33, \qquad z_{x=70} = \frac{70 - 50}{15} = +1.33
\]

\[
P(30 \le x \le 70) = P(-1.33 \le z \le +1.33) = 2 \times P(0 \le z \le 1.33) = 2 \times 0.4082 = 0.8164
\]

Therefore, the probability that the battery lasts between 30 and 70 hours is 0.8164 or 81.64%.
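The same answer without a table, again via SciPy (a sketch; passing loc and scale to norm.cdf performs the x → z conversion internally):

```python
from scipy.stats import norm

# P(30 <= x <= 70) for x ~ Normal(mu = 50, sigma = 15)
p = norm.cdf(70, loc=50, scale=15) - norm.cdf(30, loc=50, scale=15)
print(p)  # about 0.8176; the table answer 0.8164 reflects rounding z to 1.33
```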

Remark 3.3 The shift from the x-space to the z-space involves moving the origin from µ to 0 and rescaling both the horizontal and the vertical axes. The horizontal axis is rescaled so that the variability σz of z is 1, and the vertical axis is rescaled so that $P(-\infty, +\infty) = \int_{-\infty}^{+\infty} f(z)\, dz = 1$.


Exercises
Click to go to solutions.

1. Compute the following probabilities.

(a) P(0 < z < 2)
(b) P(−2 < z < 1)
(c) P(z > 1.64)

2. Assume that the length of time, x, between charges of a pocket calculator is normally distributed with a mean of 50 hours and a standard deviation of 15 hours. Find the probability that the calculator will last between 40 and 55 hours between charges.

3. The average salary of Kyoto University professors is $72,000 per year, and the standard deviation is $4,000. But, Professor Aotani's salary is only $60,000 while he is worth at least five times that amount. Assume that faculty salaries are normally distributed.

(a) What percentage of Kyoto University professors have salaries in the range ($64,000, $80,000)?

(b) Approximately, what is the percentage of Kyoto University professors who are overpaid; i.e. professors who make more money than Professor Aotani?

4. Compute or find the following probabilities, assuming the normal distribution.

(a) P(0 < z < 2)
(b) P(z > 1.64)
(c) The probability that a normal random variable z lies more than 1.96 standard deviations from its mean in either direction

5. Assume that the length of time, x, between charges of a pocket calculator is normally distributed with a mean of 50 hours and a standard deviation of 10 hours. Find the probability that the calculator will last between 40 and 55 hours between charges.

6. Compute or find the following probabilities.

(a) P (0 < z < 1)


(b) P(z > 1.65)
(c) The probability that a normal random variable z lies within 2.0 standard deviations of its mean

7. Assume that the length of time, x, between charges of an electronic dictionary is normally distributed with a mean of 100 hours and a standard deviation of 20 hours. Find the probability that the dictionary will last between 90 and 125 hours between charges.

8. Compute or find the following probabilities.

(a) P(0 < z < 1)
(b) P(−1.65 < z < 1)
(c) The probability that a normal random variable x satisfies x̄ − s < x < x̄ + s, where x̄ is the mean and s is the standard deviation.

9. Compute or find the following probabilities.

(a) P(0 < z < 1)
(b) P(z > 1.65)
(c) The probability that a normal random variable z lies within 2.0 standard deviations of its mean

10. Assume that the length of time, x, between charges of an electronic dictionary is normally distributed with a mean of 100 hours and a standard deviation of 20 hours. Find the probability that the dictionary will last between 90 and 125 hours between charges.

11. Compute or find the following probabilities.

(a) P(z > 1.65)
(b) The probability that a normal random variable z lies within 2.0 standard deviations of its mean

12. Assume that the length of time, x, between charges of a pocket calculator is normally distributed with a mean of 50 hours and a standard deviation of 10 hours. Find the probability that the calculator will last between 40 and 55 hours between charges.

13. Compute or find the following probabilities.


(a) P(0 < z < 1)
(b) P(−1.65 < z < 1)
(c) The probability that a normal random variable x satisfies x̄ − s < x < x̄ + s, where x̄ is the mean and s is the standard deviation.

14. Assume that the length of time, x, between charges of a pocket calculator is normally distributed with a mean of 50 hours and a standard deviation of 10 hours. Find the probability that the calculator will last between 40 and 55 hours between charges.

15. Compute or find the following probabilities.

(a) P(0 < z < 1)
(b) P(z > 1.65)
(c) The probability that a normal random variable z lies within 2.0 standard deviations of its mean

16. Consider 1,000 students whose scores on a certain test are normally distributed. If the mean is 50 and the standard deviation is 15, how many students received scores in the range [50, 80]? Round the answer to the nearest whole number.

17. Compute or find the following probabilities.

(a) P(0 < z < 1)
(b) P(−1.65 < z < 1)
(c) The probability that a normal random variable x satisfies x̄ − s < x < x̄ + s, where x̄ is the mean and s is the standard deviation.

18. Find the following probabilities.

(a) P(0 < z < 1.2)
(b) P(1 < x < 2) if µ = 1 and σ = 0.5

19. Assume that the length of time, x, between charges of an electronic dictionary is normally distributed with a mean of 100 hours and a standard deviation of 20 hours. Find the probability that the dictionary will last between 90 and 125 hours between charges.


Chapter 4

Sampling Distribution

Inferential statistics means inferring the "real" = "population" statistics/parameters based on samples. Sample sizes are often extremely small compared with the corresponding populations. For example, in the 2012 presidential election in the United States, Gallup used a sample size of between 1,000 and 1,500 in their surveys while the number of registered voters was estimated to be about 150 million. Gallup says "Surprisingly, however, once the survey sample gets to a size of 500, 600, 700 or more, there are fewer and fewer accuracy gains that come from increasing the sample size. Gallup and other major organizations use sample sizes of between 1,000 and 1,500 because they provide a solid balance of accuracy against the increased economic cost of larger and larger samples."

So, how sure can we be that our inferences are correct?

To make it even worse, sampling is often done only once. Let us consider the relation between the population mean µ and the sample mean x̄. Our oversimplified example is an examination with a possible high of 100 taken by 101 students (N = 101). Assume that the frequency for each score is 1; that is, one person received 0, one person received 1, . . ., one person received 99, and one person received 100. If we write out all the scores in the population in ascending order, it looks like

{0, 1, 2, 3, 4, 5, . . . , 96, 97, 98, 99, 100}.

Suppose we extract a sample of size 51 (n = 51). This is more than half the size of the population, and our intuition tells us it has to be sufficiently large. However, it is certainly possible, albeit unlikely, that the sample we get happens to be

{0, 1, 2, 3, 4, . . . , 46, 47, 48, 49, 50}.

In this case, the sample mean x̄ is 25, half the value of the population mean.

While we have no means of avoiding this type of disaster altogether, we would be able to make probabilistic predictions if we know how x̄ behaves under repeated random sampling of the same size n from a population of size N.ᵃ So, we are interested in the probability distribution of x̄, which is called the sampling distribution of the mean.

ᵃ There are C(N, n) = N!/(n!(N − n)!) ways to choose a sample of size n. Random sampling in this context means that each of these cases is equally likely to be chosen. Therefore, repeating the sampling infinitely many times is equivalent to including all possible combinations once each; that is, you will get the same relative frequency distribution for x̄ in both cases. Needless to say, if N is infinite, we will have infinitely many cases included in our combinations.

4.1 Sampling Distribution of the Mean

Fact 4.1 If a random sample of n observations is selected from a population with a normal distribution, the sampling distribution of x̄ will be a normal distribution.

The point here is that it is a normal sampling distribution no matter how small the sample size n is [Devore and Berk, 2012, p.297]. Compare this with Fact 4.2 below, where a sufficiently large sample size is explicitly required.

Fact 4.2 The Central Limit Theorem If a random sample of n observations is selected from a population (any population), then, when n is sufficiently large, the sampling distribution of x̄ will be approximately a normal distribution. The larger the sample size n, the better will be the normal approximation for the sampling distribution of x̄.

Fact 4.3 Most Important for Our Applications [Devore and Berk, 2012, p.296]
Regardless of the shape of the population's relative frequency distribution, we have the following equalities. (Familiarize yourself with the relations, but you can skip the proofs if so desired.)

1. The mean of the sampling distribution of x̄ is the population mean; that is,

µ_x̄ = µ.   (4.1)

Proof

E[x̄] = E[(1/n) Σ_{i=1}^n x_i] = (1/n) E[Σ_{i=1}^n x_i] = (1/n) Σ_{i=1}^n E[x_i] = (1/n) Σ_{i=1}^n µ = (n/n) µ = µ

2. The standard deviation of the sampling distribution of x̄ is equal to σ, the standard deviation of the population, divided by the square root of the sample size n; that is,

σ_x̄ = σ/√n.   (4.2)

This σ_x̄ is often referred to as the standard error of the mean.ᵃ

Proof

Var(x̄) = Var((1/n) Σ_{i=1}^n x_i) = (1/n²) Var(Σ_{i=1}^n x_i) = (1/n²) Σ_{i=1}^n Var(x_i) = (1/n²) Σ_{i=1}^n σ² = (n/n²) σ² = σ²/n;

where the third equality is due to Fact 2.1.

It is very important that both µ_x̄ = µ and σ_x̄ = σ/√n apply to any population no matter what the population relative frequency distribution is.
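These two relations are easy to check empirically. Below is an illustrative Python sketch (numpy assumed; the exponential population is an arbitrary choice made here precisely to show that normality of the population is not needed) that compares the simulated mean and standard deviation of x̄ with µ and σ/√n.

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)   # skewed, non-normal
    mu, sigma = population.mean(), population.std()

    n, reps = 25, 20_000
    xbars = rng.choice(population, size=(reps, n)).mean(axis=1)  # repeated samples

    print(mu, xbars.mean())                  # mean of the xbars is close to mu
    print(sigma / np.sqrt(n), xbars.std())   # std of the xbars is close to sigma/sqrt(n)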

Here is how we can use these relations.

Example 4.1 In the olden days it was very difficult to manufacture rechargeable batteries of a uniform quality. So, the population distribution profile of the lifetimes of rechargeable batteries was not well understood. One manufacturer claimed that the mean lifetime of their rechargeable batteries µ was 54 months and the standard deviation of the distribution was 6 months. A group of researchers wanted to test this claim. They took a random and unbiased sample of 50 of these batteries. Their average was 52 months.

1. What is the sampling distribution for x̄ like?
Because the sample size is sufficiently large, we can use the Central Limit Theorem. Therefore, the sampling distribution of the mean x̄ is approximately normal with

µ_x̄ = µ = 54 months

and

σ_x̄ = σ/√n = 6/√50 ≈ 0.85 months.

2. What is the probability that they observe a lifetime of 52 months or fewer for their sample of 50 batteries provided that the manufacturer's claim is correct?
In order to use the table, we need to convert to the standard normal variable z. For x̄ = 52,

z = (x̄ − µ_x̄)/σ_x̄ = (x̄ − µ)/σ_x̄ = (52 − 54)/0.85 = −2.35

From the z-table, P(0 < z < 2.35) = 0.4906

=⇒ P(z < −2.35) = 0.5 − 0.4906 = 0.0094.

So, it is very unlikely, but not impossible, that one observes a sample mean of 52 or fewer months. Therefore, the manufacturer's claim seems to be wrong based on the sample.

3. Let us now consider the effect of sample size n.
Consider n = 9 and assume we still have x̄ = 52 months. Then,

σ_x̄ = σ/√n = 6/√9 = 6/3 = 2 months,

and

z = (x̄ − µ_x̄)/σ_x̄ = (52 − 54)/2 = −1

=⇒ P(x̄ ≤ 52) = P(z ≤ −1) = 0.5 − 0.3413 = 0.1587.

There is a 16% chance that the sample mean x̄ ≤ 52. So, it is still unlikely, but it could have just happened.
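For the record, a minimal computational check of both parts of Example 4.1 (scipy assumed):

    from math import sqrt
    from scipy.stats import norm

    mu, sigma = 54, 6
    for n in (50, 9):
        se = sigma / sqrt(n)      # standard error of the mean
        z = (52 - mu) / se        # z-score of the observed sample mean
        print(n, round(z, 2), round(norm.cdf(z), 4))
    # n = 50: z ≈ -2.36, P(xbar <= 52) ≈ 0.0092 (the text's -2.35 and 0.0094 round se to 0.85)
    # n = 9 : z = -1.0,  P(xbar <= 52) ≈ 0.1587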

Remark 4.1 It is customary in statistics to consider 5%, or sometimes 1%, as significant or not negligible. We abide by this convention in judging the 16% obtained above as not negligible.

ᵃ The exact form for a finite N (simple random sampling without replacement) is

σ_x̄ = (σ/√n) √((N − n)/(N − 1)),

which reduces to the given formula, σ_x̄ = σ/√n, as N → ∞.


Exercises
Click to go to solutions.

1. A random sample of 64 observations is to be drawn from a large population with mean 500 and standard deviation 80. Find the probability that:

(a) x̄ ≤ 500
(b) x̄ ≤ 510
(c) 487 ≤ x̄ ≤ 525.3

2. A random sample of 40 observations is to be drawn from a large population of N measurements. It is known that 30% of the measurements in the population are 1's, 20% of the measurements are 2's, 20% are 3's, and 30% are 4's. Give the mean and standard deviation of the sampling distribution of x̄ (the sample mean of the 40 observations). Does your answer depend on the sample size?

3. (TO BE REPLACED: COPYRIGHTED MATERIAL) Steven Hartley (1983) conducted an experiment to evaluate the forecasting skills of 140 retail buyers of two large Midwestern retail organizations. In one part of the experiment, 61 of the buyers were given historical sales data for the previous 30 months and asked to forecast sales 6 months from now. For each buyer, Hartley calculated the difference between the actual number of units sold 6 months later and the buyer's forecast. This difference is sometimes called forecast error and is denoted here as x. In order to characterize the accuracy of this group of 61 buyers, Hartley calculated x̄, the mean forecast error for the sample. Assume the sample of 61 buyers was randomly selected from a large population of buyers whose forecast errors have a distribution with mean 10 and standard deviation 16. Estimate µ_x̄ and σ_x̄.


Chapter 5

Statistical Inferences Based on a Single Sample

In Chapter 4, both the population mean µ and the population standard deviation σ were given. However, this defeats the purpose, as the goal of inferential statistics is to estimate the population parameters based on a sample. Now, there are only two ways to get the population mean exactly.

1. Sample the entire population.

• While this option will certainly work if the size of the population N is finite, this is defeating the whole purpose of inferential statistics.

2. Repeat sampling infinitely many times.

• If we had the luxury of sampling infinitely many times from a given population, we could draw the sampling distribution curve and obtain the population mean µ and standard deviation σ from the curve. But, of course, this is clearly impractical.

In many cases, we only have one sample to base our estimation on. This chapter outlines the computations, assumptions, and uncertainties involved in such an inference obtained from single sampling.


5.1 Estimation of the Population Mean Based on a Large Sample

A large sample size, a large n, implies that the sample standard deviation s will be close to the population standard deviation σ for most samples. Therefore, we assume s ≈ σ and use s to approximately compute σ_x̄; i.e. we will use

σ_x̄ = s/√n   or, to be more precise,   σ_x̄ ≈ s/√n.   (5.1)

As one does not know where µ is, it is not possible to know the distance between x̄ and µ exactly. But, of course, the whole point of sampling is to have a good educated guess of where µ lies. While one cannot compute the population mean µ in a pinpoint fashion, it is possible to come up with an interval in which the population mean µ lies with a sufficiently large probability such as 0.9 (90%) or 0.95 (95%). Such an interval is called a 90% or 95% confidence interval, respectively. In order to achieve this, we will use σ_x̄ ≈ s/√n and the fact that the sampling distribution of x̄ is approximately normal due to the Central Limit Theorem.

Definition 5.1 The confidence coefficient is the probability, such as 0.9 and 0.95 above, that an interval encloses the actual mean.

Definition 5.2 The confidence level is the confidence coefficient expressed as a percentage.

For example, a 95% confidence interval (x̄ − c, x̄ + c) is the interval such that there is a 0.95 chance that the actual population mean µ lies in the interval.

It is important that you understand exactly what this means. To do that, it is first demonstrated how a confidence interval is constructed given a single sample.


5.1.1 Constructing a Confidence Interval from a Large Sample

Suppose we have a sample {x₁, x₂, . . . , xₙ} from a population with any distribution profile, a mean µ, and standard deviation σ, where n is sufficiently large. Then:

• The sampling distribution of x̄ is approximately normal (the Central Limit Theorem).

• The mean of the sampling distribution is given by µ_x̄ = µ.

• The standard deviation of the sampling distribution is σ_x̄ = σ/√n.

• The population standard deviation σ is closely approximated by the sample standard deviation s.

• Hence, the standard deviation of the sampling distribution is closely approximated by s/√n.

These facts and approximations indicate that the sampling distribution can be closely approximated by a normal distribution whose mean is µ and whose standard deviation is s/√n. Armed with this, we can find a confidence interval for µ with any desired confidence level.

Let us find a 90% confidence interval for the population mean µ based on the single sample described above. This involves the following steps. For a better understanding of the situation and procedure, we will first assume that the population mean µ and standard deviation σ, and hence the mean of the sampling distribution µ_x̄, which is equal to µ, and the standard deviation of the sampling distribution σ_x̄, which is equal to σ/√n, are known. In addition, we will assume that the sampling distribution is exactly normal.

1. Convert X̄ to the standard normal variable Z by

Z = (X̄ − µ)/(σ/√n).   (5.2)


2. Find a particular value of Z denoted by z_{0.05} such that P(0 < Z < z_{0.05}) = 0.45; that is,

P(0 < Z < z_{0.05}) = P(0 < (X̄ − µ)/(σ/√n) < z_{0.05}) = 0.45

and, by symmetry,

P(−z_{0.05} < Z < z_{0.05}) = P(−z_{0.05} < (X̄ − µ)/(σ/√n) < z_{0.05}) = 0.90.

3. Note that

−z_{0.05} < (X̄ − µ)/(σ/√n) < z_{0.05} ⇐⇒ −z_{0.05} σ/√n < X̄ − µ < z_{0.05} σ/√n

⇐⇒ −X̄ − z_{0.05} σ/√n < −µ < −X̄ + z_{0.05} σ/√n

⇐⇒ X̄ − z_{0.05} σ/√n < µ < X̄ + z_{0.05} σ/√n.

4. Therefore,

P(−z_{0.05} < (X̄ − µ)/(σ/√n) < z_{0.05}) = 0.90 ⇐⇒ P(X̄ − z_{0.05} σ/√n < µ < X̄ + z_{0.05} σ/√n) = 0.90.

What this is saying is that 90% of X̄ values satisfy

X̄ − z_{0.05} σ/√n < µ < X̄ + z_{0.05} σ/√n.

5. Equivalently, we can describe this situation as follows.

(a) Draw x̄ from the sampling distribution randomly and repeatedly.
(b) Construct the interval (x̄ − z_{0.05} σ/√n, x̄ + z_{0.05} σ/√n) for each x̄.
(c) Then, 90% of such intervals contain the population mean µ.


With this picture of the entire landscape in mind, we will replace σ of the expression

Z = (X̄ − µ)/(σ/√n)   (5.3)

in Step 1 with the sample standard deviation S, a good approximation of σ when n is large, to obtain

Z = (X̄ − µ)/(S/√n).   (5.4)

Proceeding through Steps 2, 3, and 4 as before, we now know that the 90% confidence interval based on our sample is (approximately)

(x̄ − z_{0.05} s/√n, x̄ + z_{0.05} s/√n).

This is approximate because the population standard deviation σ was replaced with the sample standard deviation s, and the sampling distribution is only approximately, and not exactly, normal.

Finally, note that our analysis above clearly indicates the true meaning of the 90% confidence interval. If we sample repeatedly, get x̄ and s for each sample, and construct the interval (x̄ − z_{0.05} s/√n, x̄ + z_{0.05} s/√n) for each pair (x̄, s), then (about) 90% of the intervals so constructed contain the population mean µ.

More concretely, suppose we draw 100 random samples of size n with replacement as follows.

{x¹₁, x²₁, . . . , xⁿ₁}, {x¹₂, x²₂, . . . , xⁿ₂}, . . . , {x¹₁₀₀, x²₁₀₀, . . . , xⁿ₁₀₀}.

Let the sample mean and standard deviation of the k-th sample be x̄ₖ and sₖ, and construct 100 intervals as we did above:

(x̄₁ − z_{0.05} s₁/√n, x̄₁ + z_{0.05} s₁/√n), (x̄₂ − z_{0.05} s₂/√n, x̄₂ + z_{0.05} s₂/√n), . . . , (x̄₁₀₀ − z_{0.05} s₁₀₀/√n, x̄₁₀₀ + z_{0.05} s₁₀₀/√n).

Then, out of these 100 intervals, about 90 should contain the population mean µ.
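This frequentist interpretation can be demonstrated by simulation. A minimal Python sketch (numpy assumed; the normal population, the seed, and the parameter values are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, reps = 100.0, 20.0, 50, 10_000
    z = 1.645                                 # z_{0.05} for a 90% interval

    hits = 0
    for _ in range(reps):
        sample = rng.normal(mu, sigma, size=n)
        xbar, s = sample.mean(), sample.std(ddof=1)   # ddof=1: sample std deviation
        half = z * s / np.sqrt(n)
        hits += (xbar - half < mu < xbar + half)

    print(hits / reps)   # ≈ 0.90: about 90% of the intervals contain mu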

If the confidence coefficient is 1 − α, or equivalently, if the confidence level is 100(1 − α)%, we will use z_{α/2} instead of z_{0.05}.

Example 5.1 (Unoccupied seats on flights) When an airline fails to sell a seat for a flight, it causes a revenue loss. Therefore, it is very important for airlines to monitor their flights for unoccupied seats. Suppose an airline wants to find the average number of unoccupied seats per flight. In a real-life situation, the financial impact of an unoccupied seat differs from one route to another. But, we will ignore such a factor here for simplicity.

Question An unbiased random sample of 225 flights has been taken, and it was found that the mean number of unoccupied seats was 11.6 and the standard deviation was 4.1; that is,

n = 225, x̄ = 11.6, and s = 4.1.

Estimate µ, the mean number of unoccupied seats per flight, using a 90% confidence interval. (Same as saying: Find the 90% confidence interval for µ, the mean number of unoccupied seats per flight.)

Answer Our α is 0.1, and we need to find z_{α/2} = z_{0.05}. From a z-table, z_{0.05} = 1.645. So, the 90% confidence interval is

(x̄ − z_{0.05} s/√n, x̄ + z_{0.05} s/√n)

= (11.6 − 1.645 × 4.1/√225, 11.6 + 1.645 × 4.1/√225)

= (11.6 − 0.45, 11.6 + 0.45) = (11.15, 12.05).

This does not mean we know much about this particular interval. Remember the meaning of 90%. If this process is repeated many times, the interval contains the real µ 90% of the time and does not contain µ 10% of the time. Whether this particular interval is one of the 90% that contain µ or one of the 10% that do not is completely unknown to us.
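The same interval in a few lines of Python (scipy assumed; norm.ppf returns the z-value for a given lower-tail probability):

    from math import sqrt
    from scipy.stats import norm

    n, xbar, s, alpha = 225, 11.6, 4.1, 0.10
    z = norm.ppf(1 - alpha / 2)          # z_{0.05} ≈ 1.645
    half = z * s / sqrt(n)               # half-width ≈ 0.45
    print(xbar - half, xbar + half)      # ≈ (11.15, 12.05)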

5.1.2 Sample Size Determination

When planning a sampling study, one of the most important design decisions is how large the sample should be for the study to generate a useful result. The appropriate sample size for an estimation of the population mean depends on both the desired confidence level and the desired reliability, that is, the width of the confidence interval. It is probably best to explain the procedure for sample size determination by way of an example.

Example 5.2 (Unoccupied Seats on Flights Revisited)

Question Suppose we wanted the 90% confidence interval to be of width 0.225. What minimum sample size will we need for this? Is it even possible to begin with?

Answer The confidence interval is given by

(x̄ − z_{0.05} s/√n, x̄ + z_{0.05} s/√n).

Hence,

w = (x̄ + z_{0.05} s/√n) − (x̄ − z_{0.05} s/√n) = 2 z_{0.05} s/√n

=⇒ √n = 2 z_{0.05} s/w =⇒ n = (2 z_{0.05} s/w)².

Substituting z_{0.05} = 1.645, s = 4.1, and w = 0.225 into this expression,

n = (2 × 1.645 × 4.1/0.225)² ≈ 3594.136.


Theoretically we need a sample of n = 3595 flights, but this is not possible here. Hence, a 90% confidence level and w = 0.225 cannot be achieved simultaneously.

If we want the 100(1 − α)% confidence interval to be of width w, all we need to do is to replace z_{0.05} with z_{α/2} to obtain

n = (2 z_{α/2} s/w)².   (5.5)
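Formula (5.5) in code form, applied to Example 5.2 (a sketch; scipy assumed, and ceil rounds up to the next whole observation):

    from math import ceil
    from scipy.stats import norm

    def sample_size(s, w, alpha):
        """Minimum n so the 100(1 - alpha)% interval has width w (Eq. 5.5)."""
        z = norm.ppf(1 - alpha / 2)
        return ceil((2 * z * s / w) ** 2)

    print(sample_size(s=4.1, w=0.225, alpha=0.10))
    # 3594 with the exact z ≈ 1.6449; the text's 3595 uses z rounded to 1.645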

5.2 Constructing a Confidence Interval from a Small Sample

We have been assuming that the sample size n is large enough (n ≥ 30 or so). Now, we will learn how to deal with small samples (n < 15, say).

Possible Problems

1. The probability distribution of the sample mean x̄, the sampling distribution of the mean, now depends on the shape of the population that is being sampled.

2. Although it is still true that σ_x̄ = σ/√n, the sample standard deviation s may provide a poor approximation of the population standard deviation σ when the sample size is small.

About 1 From Fact 4.1, the sampling distribution of x̄ will be normal (approximately normal) if the sampled population is normal (approximately normal).

At this point, it may appear that we can use the z-statistic as before with a large sample. However, there is additional uncertainty/variability in the sample standard deviation which prevents us from using the z-statistic even if the population distribution is (approximately) normal.

About 2 In order to compensate for the small sample size, we will use the t-statistic rather than the z-statistic. The t-statistic is more variable, having a larger standard deviation, and is dependent on the sample size by way of the degrees of freedom, df = n − 1.ᵃ


When we have a sample {x₁, x₂, . . . , xₙ} with a small n, the Central Limit Theorem does not apply, and the sampling distribution is not normal. However, if the population has a normal distribution,

T = (X̄ − µ)/(S/√n)   (5.6)

has the t distribution with n − 1 degrees of freedom.

In estimating a population parameter, we have to compromise somewhere to make the process work. Such is the fate of inferential statistics. After all, it is just an inference and not a proof or an exact measurement. In this case, the normal distribution assumption for the population is often made as the compromise. This is not unreasonable as so many real-life distributions are more or less normal.

Possible Points of Confusion: More on z and t
I think the fact that we use both

Z = (X̄ − µ)/(S/√n)   (5.7)

and

T = (X̄ − µ)/(S/√n),   (5.8)

where the expressions on the right-hand side are exactly the same, may be confusing. So, I would like to revisit this point and "beat it to death" this time.

Here is how, when, and why Z is used.

• The sample size n is large.

• Then, by the Central Limit Theorem, the sampling distribution is (near) normal.

• There is a good chance that the sample standard deviation S is close to the population standard deviation σ.

Page 62: APPLIED STATISTICS WITH HELPFUL DETAILS - …aoitani.net/Applied_Statistics.pdf · APPLIED STATISTICS WITH HELPFUL DETAILS Fascinating, eye-opening, and even life-changing explanations

60 CHAPTER 5. STATISTICAL INFERENCES BASED ON A SINGLE SAMPLE

• We also know

µ_x̄ = µ and σ_x̄ = σ/√n.   (5.9)

• So, if we knew the population standard deviation σ, we would use

Z = (X̄ − µ)/(σ/√n)   (5.10)

and the fact that the sampling distribution is (near) normal.

• However, in actuality, we do not know what the population standard deviation σ is and use the sample standard deviation S as a substitute:

(X̄ − µ)/(S/√n)   (5.11)

• Using S instead of σ is justified by the large sample size.

Here is how, when, and why T is used.

• The sample size n is small.

• So, the sampling distribution is not normal and the sample standard deviation S may not be close to the population standard deviation σ.

• Note at this point that

(X̄ − µ)/(S/√n)   (5.12)

cannot play the role it played as Z for a large sample.

• In order to use

Z = (X̄ − µ)/(S/√n),   (5.13)

we need a normal sampling distribution.


• Instead, we define a new variable T by

T = (X̄ − µ)/(S/√n).   (5.14)

• T has its own distribution that can be tabulated.

• In a sense, it is an accident that T and Z have distribution profiles which appear similar as these come from different kinds of considerations. However, mound-shaped distributions are very common both in natural and social sciences. And, in that sense, the similarity is not surprising.

It helps to remember that Z is a standard normal random variable with, of course, a normal probability distribution. So, Z does not make any sense unless we have a normal sampling distribution. Also note, from Fact 4.1, that the sampling distribution is normal if a sample is from a population with a normal distribution. We also know the mean µ_x̄ and standard deviation σ_x̄ of the sampling distribution given the population mean µ and standard deviation σ:

µ_x̄ = µ and σ_x̄ = σ/√n.   (5.15)

However, the problem lies with the fact that S no longer approximates σ closely. Hence,

(X̄ − µ)/(S/√n) ≉ (X̄ − µ)/(σ/√n) = Z,   (5.16)

and

(X̄ − µ)/(S/√n)   (5.17)

can no longer be used as a substitute for the true normal random variable

Z = (X̄ − µ)/(σ/√n)   (5.18)

as before. This is why a new random variable T is introduced.

Note the following about T as compared with Z.


• T has larger variability and is more spread out than Z.

• The distribution of T depends on the sample size n, or the degrees of freedom df = n − 1 more precisely, unlike Z.

The first property reflects greater uncertainty associated with the smaller sample size. The second property takes account of the fact that the impact of the difference between n and n + 1, for example, is greater if n is small. Contrast going from 1 to 1 + 1 = 2 with going from 100 to 100 + 1 = 101. Intuitively, it should be clear that the difference between n = 1 and n = 2 is far greater than the difference between n = 100 and n = 101.
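Both properties are visible in the critical values: t_{α/2} always exceeds z_{α/2} and shrinks toward it as the degrees of freedom grow. A quick sketch (scipy assumed):

    from scipy.stats import norm, t

    alpha = 0.05
    print(norm.ppf(1 - alpha / 2))            # z_{0.025} ≈ 1.960
    for df in (1, 4, 14, 29, 99):
        print(df, t.ppf(1 - alpha / 2, df))   # 12.71, 2.78, 2.14, 2.05, 1.98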

At any rate, we cannot take advantage of the fact that the sampling distribution is normal as we lack a solid handle on the population standard deviation σ.

In order to understand the difference between Z and T more clearly, we will now revisit T.

Once we decide on the use of T, the rest of the procedure is completely parallel to the large sample case. In other words, we need only replace z with t in the formulas derived previously.

ᵃ The concept of the degrees of freedom is not easy to grasp. For this kind of application, it is always n − 1, and you may want to commit it to memory and just use it. However, for those who want a little more justification, here is a story-telling type explanation.

• The degrees of freedom can be regarded as the number of independent pieces of data being used to make a calculation.
• We are looking at the distribution of X̄.
• Consider selecting random samples {x₁, x₂, . . . , xₙ}, each with the same fixed value x̄_fixed of X̄.
• The first n − 1 data x₁, x₂, . . . , xₙ₋₁ can always be chosen randomly and independently.
• However, for the sample mean to be x̄_fixed, we need to have xₙ = n x̄_fixed − Σ_{i=1}^{n−1} x_i, and xₙ is uniquely determined by {x₁, x₂, . . . , xₙ₋₁}.
• This is why df = n − 1.

5.2.1 Small-Sample Confidence Interval for µ


For a confidence level of 100(1 − α)%, we have

P(−t_{α/2} < (X̄ − µ)/(S/√n) < t_{α/2}) = 1 − α   (5.19)

⇐⇒ P(X̄ − t_{α/2} S/√n < µ < X̄ + t_{α/2} S/√n) = 1 − α.   (5.20)

So, the 100(1 − α)% confidence interval for µ is given by

(x̄ − t_{α/2} s/√n, x̄ + t_{α/2} s/√n).   (5.21)

Recall that a confidence level of 100(1 − α)% is equivalent to a confidence coefficient of 1 − α.

Example 5.3 A manufacturer of printers wanted to conduct quality control research that required destructive sampling, which means the printers used for the test would all be broken at the end of the test. They wanted to measure the mean number of characters printed before the printer breaks down. As they were not willing to destroy many products, the sample size had to be small. Given n = 15, x̄ = 1.23 (million characters), and s = 0.27 (million characters), form a 99% confidence interval for the mean number of characters their printer can print before breaking down.
As 100(1 − α) = 99, 1 − α = 0.99, and α = 0.01. With the degrees of freedom df = 15 − 1 = 14, t_{α/2} = t_{0.005} = 2.977. Therefore,

x̄ ± t_{0.005} (s/√n) = 1.23 ± 2.977 (0.27/√15) = 1.23 ± 0.21.

The desired 99% confidence interval is

(1.23 − 0.21, 1.23 + 0.21) = (1.02, 1.44).
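The same t-based interval, as a sketch in Python (scipy assumed; t.ppf takes the degrees of freedom as its second argument):

    from math import sqrt
    from scipy.stats import t

    n, xbar, s, alpha = 15, 1.23, 0.27, 0.01
    t_crit = t.ppf(1 - alpha / 2, n - 1)   # t_{0.005, 14} ≈ 2.977
    half = t_crit * s / sqrt(n)            # ≈ 0.21
    print(xbar - half, xbar + half)        # ≈ (1.02, 1.44)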


Exercises
Click to go to solutions.

1. (TO BE REPLACED: COPYRIGHTED MATERIAL) Nasser Arshadi and Edward Lawrence investigated the profiles (i.e., career patterns, social backgrounds, and so forth) of the top executives in the United States' banking industry in 1984. They sampled 96 executives and found that 80% studied business or economics and that 45% had a graduate degree. With respect to the number of years of service at the same bank, the group had a mean of 23.43 years and a standard deviation of 10.82 years. Construct a 90% confidence interval for the mean number of years of service µ.

2. A random sample of 64 observations is to be drawn from a large population with mean 500 and standard deviation 80. Find the probability that x̄ ≤ 515. (Note: Use the z-statistic, and show all your work.)

3. A major department store chain is interested in estimating the average amount its credit card customers spent on their first visit to the chain's new store in the mall. Twenty-five credit card accounts were randomly sampled and analyzed with the following results: the mean x̄ = 60 and the variance s² = 100. Construct the 90% confidence interval. (Note: Use the z-statistic.)

4. A car manufacturer wants to test a new engine to determine whether it meets new governmental standards. The mean emission µ must be less than 20 ppm of carbon. Ten engines were manufactured for this testing. The mean of ten measurements was x̄ = 17.1 ppm.

(a) If the standard deviation is s and the population mean is 20, give the formula for the t-statistic for x̄ = 17.1. You do not have to compute t.

(b) What is the probability that your t-value is smaller than −2.821? Don't forget to take the degrees of freedom into account.

5. A random sample { 5, 6, 2, 8, 4 } was selected from a normal distribution. Construct 90%, 95%, and 99% confidence intervals for the population mean µ.

6. A random sample of 100 observations is to be drawn from a large population with mean 1000 and standard deviation 100. Find the probability that x̄ ≤ 1010. (Note: Use the z-statistic, and show all your work.)


7. A manufacturer of cereal decides to test its box filling machines. The machine is designed to discharge an average of 12 ounces per box. After 100 observations, the average amount discharged was 11.85 ounces per box with the standard deviation of 0.5 ounces.

(a) Compute z for x̄ = 11.85.
(b) Explain why you can or cannot conclude that the machine is underfilling the boxes at α = 0.01.

8. Consider a population whose ages are normally distributed. The average age µ is 35 and the standard deviation σ is 30. Consider a simple random sampling of size 225 from this population. Needless to say, this is a large sample case.

(a) What is µ_x̄?
(b) What is σ_x̄?
(c) What is the probability that the sample mean is greater than 41?

9. Suppose that the standard deviation for the operating life for a particular car is known to be 2 years, but the mean operating life is not known. Assume that the operating life is normally distributed. For a large sample of size n = 100, the mean operating life is 10 years.

(a) Determine the 90% confidence interval for the population mean life.
(b) If one wants the 95% confidence interval to be [9.5, 10.5], what sample size does he/she need?

10. The mean diameter of a sample of 16 pipes is 2.50 mm with a standard deviation of 0.05 mm. There are a total of 100 pipes.

(a) Estimate the population mean diameter using a 99% confidence interval with the t-statistic.

(b) Estimate the probability that the average diameter for the population is greater than 2.522.

11. A random sample of 100 observations is to be drawn from a large population with mean 1000 and standard deviation 100. Find the probability that x̄ ≤ 1010. (Note: Use the z-statistic, and show all your work.)


12. A manufacturer of cereal decides to test its box filling machines. The machine is designed to discharge an average of 12 ounces per box. After 100 observations, the average amount discharged was 11.85 ounces per box with the standard deviation of 0.5 ounces.

(a) Compute z for x̄ = 11.85.
(b) Explain why you can or cannot conclude that the machine is underfilling the boxes at α = 0.01.

13. A manufacturer of computer disk drives has a product list-priced at $750. They want to know if the current mean retail price differs from the list price. The mean and the standard deviation of 17 retail prices are x̄ = $732 and s = $38.

(a) Give the formula for the t-statistic for x̄ = 732. You do not have to compute t.

(b) The t-value above is −1.95. Can the manufacturer conclude the retail price is different from the list price with 95% confidence?

14. Consider a population whose ages are normally distributed. The average age µ is 35 and the standard deviation σ is 30. Consider a simple random sampling of size 225 from this population. Needless to say, this is a large sample case.

(a) What is µ_x̄?
(b) What is σ_x̄?
(c) What is the probability that the sample mean is greater than 41?

15. Suppose that the standard deviation for the operating life for a particular car is known to be 2 years, but the mean operating life is not known. Assume that the operating life is normally distributed. For a large sample of size n = 100, the mean operating life is 10 years.

(a) Determine the 90% confidence interval for the population mean life.
(b) If one wants the 95% confidence interval to be [9.5, 10.5], what sample size does he/she need? Give the formula for the sample size with actual numbers substituted in. You do not need to compute it.

16. A group of 17 students selected from the entire school took an examination. The average score was 84, and the standard deviation was 16. Since the sample size is small, the sampling distribution is the t-distribution.

Page 69: APPLIED STATISTICS WITH HELPFUL DETAILS - …aoitani.net/Applied_Statistics.pdf · APPLIED STATISTICS WITH HELPFUL DETAILS Fascinating, eye-opening, and even life-changing explanations

67

(a) What is t_{0.05} such that P(|t| ≥ t_{0.05}) = 0.05? Note that the convention here may be different from the one used in our table.

(b) Find the 95% confidence interval for the population mean.

17. Tuitions of 225 colleges in our (large) sample have a mean x̄ of $25,000 and a standard deviation s of $9,000.

(a) What is the estimate for the standard deviation of the sampling distribution σ_x̄ of the sample mean x̄?

(b) What is the probability that the population mean µ is in the range [$24,400, $26,200]?

18. The number of minutes 9 workers in Kyoto City spend commuting to work had the sample average x̄ of 105 minutes and the sample standard deviation s of 30 minutes.

(a) What is the estimate for the standard deviation of the sampling distribution σ_x̄ of the sample mean x̄? What are the degrees of freedom for this sample?

(b) Find the 95% confidence interval for the population mean µ.

19. You measured the speed of 85 vehicles (a large sample) going past an observation point of a certain highway and got a mean speed of 66.3 mph (miles per hour). If from previous studies you know that the population standard deviation is 8.3 mph, then what is the (approximate) 95% confidence interval for the population mean speed at this observation point?

20. A random sample of small size n = 16 is taken from a normally distributed population with unknown µ and σ. If the sample has a mean x̄ = 27.9 and standard deviation s = 3.23, then what is the 95% confidence interval for µ (t-distribution)?

21. A sample of 400 business persons was randomly chosen in order to determine the average annual income of all business persons in Manhattan. The sample mean x̄ was 90,000 dollars and the standard deviation s = $20,000. Find the 95% confidence interval for the population mean µ following the steps outlined below. Needless to say, this is a large sample case.

(a) From the z-table, find z_{0.025}.
(b) Find the x̄-value that gives z_{0.025} if µ = 90000.


(c) Now compute the upper and the lower limits of the 95% confidence interval for µ.


Chapter 6

Hypothesis Testing

6.1 Introduction

So far
We have learned how to infer population parameters based on a sample. However, this is only one type of inferential statistics.

Where we are going
We now show how we can test whether a population parameter of interest is less than, equal to, or greater than a hypothesized value based on the information gathered from a sample. Making this kind of inference is known as hypothesis testing or a test of hypothesis. We will mainly focus on hypotheses about the population mean µ.

Note here that we are more interested in testing if a hypothesis is true than knowing what the value is precisely. We will make a statistical hypothesis, an assumption, about the population and decide whether that hypothesis about the population is consistent with the sample parameters we observe.

6.2 Two Hypotheses


There are two hypotheses involved in hypothesis testing.

Null Hypothesis : The null hypothesis denoted by H0 is the claim about the population that is initially assumed to be true. (the "prior belief" claim: However, see Note 6.1 below.)

Alternative Hypothesis : The alternative hypothesis denoted by H1 or Ha is the assertion that is contradictory to H0.

Note 6.1 While the null hypothesis is often referred to as a "prior belief" hypothesis, the picture is not so simple. This is because we do not reject H0 unless there is strong evidence to support such a decision. We will see that a level of confidence of 95% or higher is often required. This is sometimes likened to a court trial in which the defendant will not be found guilty unless one can prove it beyond any reasonable doubt. Therefore, the true meaning of "prior belief" should be understood clearly, and the choice of the null hypothesis should be made carefully. Due to this conservative attitude in decision making, we say we either reject or fail to reject the null hypothesis rather than accept it. The following examples illustrate this point.

Example 6.1

1. The amount of peanuts per bag is supposed to be 150 grams on the average; i.e. the population mean µ is 150, no less and no more. We would like to check if this is indeed the case. Our initial belief is that it is indeed the case.

H0: µ = 150 (grams)    Ha: µ ≠ 150 (grams)

2. The thickness of a glass plate in a physics experiment should be strictly less than 20 mm. As we want to be absolutely sure that the plate is less than 20 mm in thickness, it may make sense to adopt µ ≥ 20 mm as our null hypothesis even if we do not necessarily believe it is true. This is because we are more likely to fail to reject the null hypothesis in conventional hypothesis testing.

H0 : µ ≥ 20 mm Ha : µ < 20 mm

3. Suppose that the waiting time per customer at a fast food restaurant, the time it takes for the order to be filled after entering the place, has to be less than or equal to 4 minutes on the average in order to call it a fast food restaurant. A government official would like to conduct a hypothesis test to ensure the rule is followed.

H0 : µ ≤ 4 minutes Ha : µ > 4 minutes

Rejection of the null hypothesis means judging the classification/naming of the restaurant to be illegal. Because it has a dire consequence for the business if so judged, it makes sense not to reject the null hypothesis easily.

6.3 Steps, Errors, and Decision Making

As stated already, hypothesis testing hinges on the consistency check between the null hypothesis and the observed sample parameters.

Here is a step-by-step list of the procedure.

1. State the null hypothesis H0 and alternative hypothesis Ha, which are mutually exclusive. If one is true, the other must be false.

2. Set the decision rule, according to which you reject or fail to reject the null hypothesis.

3. Analyze sample data and find the value of the test statistic such as the mean.

4. Interpret the results by applying the decision rule above. If the value of the test statistic is unlikely under the null hypothesis, reject the null hypothesis.

In order to discuss the decision rule, we need to consider two types of errors first.

Type I Error : A Type I error occurs when the test rejects a null hypothesis when it is actually true. The probability of committing a Type I error is called the significance level. This probability is also called α.

Type II Error : A Type II error occurs when the test fails to reject a null hypothesis that is false. The probability of committing a Type II error is called β. The probability of not committing a Type II error, which is the probability of correctly rejecting the null hypothesis when it is false, is called the powerᵃ of the test. So, the power = 1 − β.

ᵃ This is called sensitivity in biostatistics.

                      H0: true             H0: false (Ha: true)
    Do not reject H0  Correct: 1 − α       Type II error: β
    Reject H0         Type I error: α      Correct: 1 − β = power

Table 6.1: All possible decisions: errors and power

The decision rule can be formulated in the following two ways, which are equivalent as we will see shortly.

• p-value: If the test statistic is equal to S, the p-value is the probability of observing a test statistic as extreme as or more extreme than S, assuming the null hypothesis is true. If the p-value is less than the significance level, we reject the null hypothesis.

• Region of Acceptance/Region of Rejection: If the test statistic falls within the region of acceptance, the null hypothesis is not rejected. The region of acceptance is defined so that the chance of making a Type I error is equal to the significance level.
The region outside the region of acceptance is called the region of rejection. If the test statistic falls within the region of rejection, the null hypothesis is rejected. We say that the hypothesis is rejected at the significance level α.

Another important aspect of a hypothesis test is whether it is a two-tailed test or a one-tailed test.

One-Tailed Test Suppose the null hypothesis H0 states µ ≤ 0. Then, our concern is whether x̄ is positive and large because that means a large deviation from µ. In this case we focus our attention only on the positive values of x̄. Similarly, if H0 states that 0 ≤ µ, our only concern is how small x̄ is if it is negative. In this case we focus our attention only on the negative values of x̄. These are called one-tailed tests.


Two-Tailed Test Suppose the null hypothesis H0 is µ = 0. Then, x̄ may be either positive or negative with a large absolute value. In this case, both positive and negative values of x̄ should be considered. We call this a two-tailed test.

6.4 Specific Examples

The best way to understand different aspects of hypothesis testing is by way of examples. The following examples demonstrate how hypothesis testing works. We will now revisit the three cases of Example 6.1.

Example 6.2 The amount of peanuts per bag is supposed to be 150 grams on the average. We would like to check if this is indeed the case. Our sample of 50 bags (n = 50) had a mean x̄ = 150.2 (grams) and standard deviation s = 1 (gram). We want to examine if µ = 150 is a reasonable conclusion.

1. Our initial belief is that the population mean µ is 150.

H0: µ = 150 (grams)    Ha: µ ≠ 150 (grams)

This is a two-tailed test.

2. Set the significance level α at 0.05. (It is customary to choose 0.05 or 0.01 as the significance level.)

3. The test statistic x̄ is already given.

4. This is where most of the work is done.

(a) Because the sample size is sufficiently large (n = 50), the sampling distribution, the distribution of x̄, is normal. In other words, we can use z, and not t. We will first work in the z-space.

(b) Here, µ = 150 means it is not acceptable even if µ > 150 though some consumers might like it. Therefore, we have a two-tailed situation.

z_{α/2} = z_{0.025} = 1.960


First, we know

z = (x̄ − µ)/(σ/√n) ≈ (x̄ − µ)/(s/√n),

and the use of the sample standard deviation s instead of the population standard deviation σ is nothing more than an approximation. But, we already discussed this extensively, and it is cumbersome to keep using ≈. Furthermore, statistics is basically about approximating and guessing in an educated and informed manner. Therefore, in the rest of this book, we will simply use = where it should really be ≈.

Going back to the x-space,

P(−1.960 < z < +1.960) = P(|z| < 1.960) = 0.95

translates to

P(µ − 1.960 s/√n < x̄ < µ + 1.960 s/√n) = 0.95

as follows:

|z| < 1.960 ⇐⇒ |(x̄ − µ)/(s/√n)| < 1.960 ⇐⇒ |x̄ − µ| < 1.960 s/√n ⇐⇒ µ − 1.960 s/√n < x̄ < µ + 1.960 s/√n.

This means that x̄ should lie in the range

(µ − 1.960 s/√n, µ + 1.960 s/√n)

with a probability of 0.95; i.e. 95% of the time.
Plugging in µ = 150 (the assumption of the null hypothesis), s = 1, and n = 50, we get

P(150 − 1.960 × 1/√50 < x̄ < 150 + 1.960 × 1/√50) = P(150 − 0.277 < x̄ < 150 + 0.277) = P(149.723 < x̄ < 150.277) = 0.95.

Therefore, upon random sampling repeated infinitely many times, 95% of the sample means {x̄} lie in the range (149.723, 150.277). This is the region of acceptance when the significance level α is 0.05.
We have x̄ = 150.2 ∈ [the region of acceptance], and we fail to reject, or we accept if not with extreme confidence, the null hypothesis that µ = 150. We conclude the average amount of peanuts per bag is 150 grams (within statistical error).

What if we compute the p-value and use it for our decision making rather than use the acceptance and rejection regions? In order to do this, we will convert x̄ = 150.2 to a z-score.

z = (x̄ − µ)/(s/√n) = (150.2 − 150)/(1/√50) = 0.2 × √50 = 1.414.

P (1.414 < z) = 0.07868

Because this is a two-tailed situation, we also need to consider the negative z values.

P (z < −1.414) = 0.07868

Hence, p = 0.157. As this p is greater than the significance level of 0.05, we fail to reject the null hypothesis.
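The two-tailed p-value of Example 6.2 in a few lines of Python (scipy assumed; norm.sf is the upper-tail probability):

    from math import sqrt
    from scipy.stats import norm

    n, xbar, s, mu0 = 50, 150.2, 1.0, 150.0
    z = (xbar - mu0) / (s / sqrt(n))   # ≈ 1.414
    p = 2 * norm.sf(abs(z))            # two-tailed p-value ≈ 0.157
    print(z, p)                        # p > 0.05: fail to reject H0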

Example 6.3 The thickness of a glass plate should be strictly less than 20 mm. But, it appears to be thicker. A sample of 40 plates was randomly selected, for which x̄ = 20.4 mm and s = 1 mm. This is a one-tailed test.

1. If we choose H0: µ ≥ 20 mm and Ha: µ < 20 mm as suggested in Example 6.1, there is nothing to examine as x̄ = 20.4 mm leads to an outright failure to reject H0. So, let us pretend that our initial belief is H0: µ ≤ 20 mm and Ha: µ > 20 mm.

2. Set the significance level α at 0.01 this time.


3. The test statistic x̄ is already chosen.

4. We only need to modify Step 4 slightly for this example.

(a) Because n = 40, we will use z.
(b) As mentioned already, we have a one-tailed situation.

z_α = z_{0.01} = 2.3263

Because

z = (x̄ − µ)/(s/√n),

z < 2.3263 ⇐⇒ (x̄ − µ)/(s/√n) < 2.3263 ⇐⇒ x̄ − µ < 2.3263 s/√n ⇐⇒ x̄ < µ + 2.3263 s/√n.

Substituting µ = 20, s = 1, and n = 40 into the right-hand side,

x̄ < 20 + 2.3263 × 1/√40 = 20.3678.

The region of acceptance is (−∞, 20.3678). As our sample mean x̄ = 20.4 ∉ (−∞, 20.3678), the null hypothesis is rejected. We conclude that the specification is not met.
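The cutoff of Example 6.3 in code form (scipy assumed):

    from math import sqrt
    from scipy.stats import norm

    n, s, mu0, alpha = 40, 1.0, 20.0, 0.01
    cutoff = mu0 + norm.ppf(1 - alpha) * s / sqrt(n)   # ≈ 20.3678
    print(cutoff, 20.4 > cutoff)   # True: xbar = 20.4 falls in the rejection region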

Example 6.4 Suppose that the waiting time per customer at a fast food restaurant, the time it takes for the order to be filled after entering the place, has to be less than or equal to 4 minutes on the average in order to call it a fast food restaurant. A government official would like to conduct a preliminary hypothesis test in preparation for a larger survey. As there are so many restaurants and only a few officials, the sample size n had to be limited to 15. For one restaurant, x̄ = 4.3 and s = 0.5. Conduct a hypothesis test for the mean waiting time at this restaurant.

1. The hypotheses are as follows.

H0 : µ ≤ 4 minutes Ha : µ > 4 minutes

2. Set the significance level α at 0.05.


3. The test statistic is x̄.

4. As you will see shortly, this is qualitatively different from Example 6.2 and Example 6.3.
Due to the small sample size, we have to use the t-statistic with df = 15 − 1 = 14.

T = (x̄ − µ)/(S/√n) =⇒ t = (4.3 − 4)/(0.5/√15) = 2.324

The one-tailed p-value for t = 2.324 and df = 14 is 0.0178 < 0.05.ᵃ And the null hypothesis is rejected. It is not a fast food restaurant.

ᵃ You cannot get this value from a typical table. Use, for example, the online calculator at http://www.danielsoper.com/statcalc3/.
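Alternatively, the p-value can be computed with scipy (a sketch; t.sf is the upper-tail probability of the t distribution):

    from math import sqrt
    from scipy.stats import t

    n, xbar, s, mu0 = 15, 4.3, 0.5, 4.0
    t_stat = (xbar - mu0) / (s / sqrt(n))   # ≈ 2.324
    p = t.sf(t_stat, df=n - 1)              # one-tailed p ≈ 0.0178
    print(t_stat, p)                        # p < 0.05: reject H0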

[Bureau, 2012, p.2] [Trek, 2012]


Exercises

Click to go to solutions.

1. The amount of peanuts per bag is supposed to be 150 grams on the average. We would like to check if this is indeed the case. Our sample of 50 bags (n = 50) had a mean $\bar{x} = 150.2$ (grams) and standard deviation s = 1 (gram). We want to examine if µ = 150 is a reasonable conclusion at p = 0.01.

2. The thickness of a glass plate should be strictly less than 20 mm. But it appears to be thicker. A sample of 40 plates was randomly selected, for which $\bar{x} = 20.4$ mm and s = 1 mm. This is a one-tailed test at p = 0.05.

3. Suppose that the waiting time per customer at a fast food restaurant, the time it takes for the order to be filled after entering the place, has to be at most 4 minutes on the average in order to call it a fast food restaurant. A government official would like to conduct a preliminary hypothesis test in preparation for a larger survey. As there are so many restaurants and only a few officials, the sample size n had to be limited to 15. For one restaurant, $\bar{x} = 4.3$ and s = 0.5. Conduct a hypothesis test for the mean waiting time at this restaurant. The p-value is 0.05. (Read, understand, and “copy” the solution in the lecture notes.)

4. In Problem 3 above, use the on-line statistics calculator at http://www.danielsoper.com/statcalc3/ to find the critical value of $\bar{x}$ below which it is a fast food restaurant. Assume that s remains the same.

5. A manufacturer of computer disk drives has a product list-priced at $750. They want to know if the current mean retail price differs from the list price. The mean and the standard deviation of 17 retail prices are $\bar{x} = \$732$ and s = $38.

(a) Give the formula for the t-statistic for $\bar{x} = 732$. You do not have to compute t.

(b) The t-value above is −1.95. Can the manufacturer conclude the retail price is different from the list price with 95% confidence?

6. A manufacturer of cereal decides to test its box filling machines. The machine is designed to discharge an average of 12 ounces per box. After 100 observations,


the average amount discharged was 11.85 ounces per box with the standard deviation of 0.5 ounces.

(a) Compute z for $\bar{x} = 11.85$.

(b) Explain why you can or cannot conclude that the machine is underfilling the boxes with α = 0.01.

7. A car manufacturer wants to test its new engine to determine whether it meets the air-pollution standard of less than 20 parts per million of carbon. They tested 9 engines, and the mean and standard deviation for the tests are $\bar{x} = 23$ and s = 3. Do the data supply enough evidence to allow the manufacturer to conclude that this type of engine meets the pollution standard at α = 0.01?

(a) Compute t for $\bar{x} = 23$ assuming µ = 20.

(b) What are the degrees of freedom df and $t_{0.01}$ for this problem?

(c) Explain, in words, why you can or cannot conclude that the engine meets the standard.

8. For a particular species of rat, it is known that the average birth weight is µ = 25 grams and the standard deviation σ is 6 grams. We want to study the effect of under-nourishment on the weight of the babies. The average weight of 100 babies was $\bar{x} = 22$ grams. Conduct a one-tailed hypothesis test at p = 0.05 to examine if this difference is indeed meaningful and possibly due to malnutrition.

9. A certain plant is known to grow very rapidly. A sample of 36 such plants was observed, and it was found that they grew by an average $\bar{x}$ of 3.35 inches per day during the rainy season. On the other hand, botanists had conducted an extensive survey and found that µ = 3.2 inches and σ = 0.6 inches during the dry season. Conduct a one-tailed test at p = 0.05 for the null hypothesis $H_0$ that the plant grows at the same speed during the dry and rainy seasons.

10. The mean batting average for all professional baseball players who were not using any drug, µ, was found to be 0.275 with a standard deviation of 0.02. When 25 players who were on a particular steroid were studied, their average was $\bar{x} = 0.270$. Experts' opinions are divided between performance enhancement and reduced level of performance as a result of taking the steroid. Perform a two-tailed hypothesis test, where the null hypothesis $H_0$ is that the steroid has no effect on the performance. Treat this as a large-sample case.


11. A car manufacturer is claiming that the paint on their Model A will last for 10 years on the average (µ = 10) before it needs repainting. A consumer group secured nine cars and conducted a long-term test. They found that the paint lasted for 9.6 years on the average, $\bar{x} = 9.6$, with a standard deviation of 0.5, s = 0.5. Conduct a one-tailed small-sample test for the manufacturer's claim at the significance level of 0.01. How about 0.05?

12. There is a municipal specification that requires a mean strength µ of more than 2,400 pounds per foot for residential sewer pipes. Let the null hypothesis be µ ≤ 2,400 and the research hypothesis be µ > 2,400. Our α is 0.05. When a sample of 50 one-foot segments was tested for their lineal breaking strength, the mean $\bar{x}$ was 2,460 and the standard deviation s was 200.

(a) What is the z that corresponds to this $\bar{x}$?

(b) Can we conclude from the above that the pipes satisfy the municipal specification? Don't fail to explain why.

13. A salesperson has a contractual obligation to sell an average of 100 pairs of shoes every month. In the last nine months, the person's monthly sales have been 95, 99, 98, 97, 95, 101, 98, 93, and 97 pairs. The sample standard deviation s = 2.4. Based solely on these data, conduct a one-sided t-test to judge if the salesperson is abiding by the contract at α = 0.05.

(a) What are the sample size n, the degrees of freedom df, and the sample mean $\bar{x}$?

(b) What is the t-value for $\bar{x}$?

(c) With $H_0$: µ ≥ 100, conduct a one-sided t-test. What can you conclude about the average number of pairs of shoes sold by the salesperson?


Chapter 7

Variance, Covariance, and Correlation

So far, we have only encountered the concept of variance. However, variance can be viewed as a special case of covariance, covariance can be viewed as a kind of extension of variance, and correlation is a rescaled version of covariance. In this chapter, we will start with covariance and derive variance and correlation in that order. It is hoped that this chapter will help readers put each of these concepts in proper perspective and understand the larger framework in which they are placed. Just like a network of neurons, statistical concepts are interrelated, forming their own networks. Understanding how different notions are related to each other provides an opportunity to look at the concepts from different angles. The author strongly believes that this is a very effective and efficient way to solidify one's knowledge.

Variance is a measure of the variability of one random variable X. But covariance is a measure of the extent to which two variables X and Y vary together. They covary, and hence the name covariance.

7.1 Population Covariance

Definition 7.1 (Population Covariance) Let X and Y be two random variables with means $\mu_X$ and $\mu_Y$. Then,

$$E[(X - \mu_X)(Y - \mu_Y)]$$


is called the (population) covariance of X and Y and is often denoted by Cov(X, Y). If the random variables are discrete, we have

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sum_x \sum_y (x - \mu_X)(y - \mu_Y)\,p(x, y),$$

where the summation is over all possible combinations (x, y) and p(x, y) is the probability, called the “joint probability”, associated with the pair (x, y). If the random variables are continuous, we have

$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} (x - \mu_X)(y - \mu_Y)\,f(x, y)\,dx\,dy,$$

where f(x, y) dx dy is the probability of finding the point (x, y) in the small rectangle defined by x < X < x + dx and y < Y < y + dy. The function f(x, y) is known as the “joint probability density function” of X and Y.

Qualitatively speaking, the idea here is quite simple. If X and Y tend to change together, then positive values of $X - \mu_X$ would be associated with positive values of $Y - \mu_Y$, and negative values of $X - \mu_X$ would be associated with negative values of $Y - \mu_Y$. Therefore, $E[(X - \mu_X)(Y - \mu_Y)]$ will be a large positive number. The variables can also change together except that Y becomes smaller as X becomes larger; in this case, we will get a large negative value for $E[(X - \mu_X)(Y - \mu_Y)]$. Finally, if there is not much relation between X and Y, then $X - \mu_X$ and $Y - \mu_Y$ have as much chance of having the same sign as different signs, making $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$ close to 0.

Proposition 7.1 The following equality is satisfied by the covariance.

$$E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X\mu_Y \tag{7.1}$$
or
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]. \tag{7.2}$$

This relation can be used to compute Cov(X, Y ) more easily.

Proof

$$E[(X - \mu_X)(Y - \mu_Y)] = E[XY - \mu_Y X - \mu_X Y + \mu_X\mu_Y] = E[XY] - \mu_Y E[X] - \mu_X E[Y] + \mu_X\mu_Y$$
$$= E[XY] - \mu_Y\mu_X - \mu_X\mu_Y + \mu_X\mu_Y = E[XY] - \mu_X\mu_Y \tag{7.3}$$

Let us consider one simple example.

Example 7.1 Suppose a discrete random variable X can take values 0, 1, 2, 3, and 4, while another discrete random variable Y assumes values 1, 2, 3, 4, and 5. The following grid shows all possible combinations (x, y) with the associated joint probability p(x, y) in each cell. For example, the number 0.2 in the upper left-hand corner indicates that the probability associated with (0, 1) is 0.2; p(0, 1) = 0.2.

        x = 0    1      2      3      4
y = 1     0.2    0      0      0      0
    2     0      0.2    0      0      0
    3     0      0      0.2    0      0
    4     0      0      0      0.2    0
    5     0      0      0      0      0.2

To make the computation simple, I set all joint probabilities except for the diagonal entries to 0. I also made all diagonal entries the same at 0.2. Note that the sum of the p(x, y) satisfies $\sum_X \sum_Y p(x, y) = 1$ as it should.

Needless to say, it is not necessary to have the same number of values for X and Y (5 here), to have many joint probabilities equal to 0, or for the nonzero probabilities to be the same, as in this simplified example. As many probabilities are set equal to 0, this dataset is the same as the following set, where only the pairs (x, y) = (0, 1), (1, 2), (2, 3), (3, 4), and (4, 5) are included, each with the same joint probability p(x, y) of 0.2.

X   0   1   2   3   4
Y   1   2   3   4   5

For this data, $\mu_X = 2$ and $\mu_Y = 3$, and

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sum_x \sum_y (x - \mu_X)(y - \mu_Y)\,p(x, y)$$
$$= (0-2)(1-3)(0.2) + (1-2)(2-3)(0.2) + (2-2)(3-3)(0.2) + (3-2)(4-3)(0.2) + (4-2)(5-3)(0.2)$$
$$= [(-2)(-2) + (-1)(-1) + 0 \cdot 0 + 1 \cdot 1 + 2 \cdot 2](0.2) = (4 + 1 + 0 + 1 + 4)(0.2) = 2. \tag{7.4}$$


On the other hand, according to Proposition 7.1,

$$E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X\mu_Y = (0 \cdot 1 + 1 \cdot 2 + 2 \cdot 3 + 3 \cdot 4 + 4 \cdot 5)(0.2) - 2 \cdot 3$$
$$= (0 + 2 + 6 + 12 + 20)(0.2) - 6 = 8 - 6 = 2. \tag{7.5}$$

This checks.
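Here is a minimal sketch in Python, assuming nothing beyond the standard library, that verifies Example 7.1 numerically by computing the covariance both from the definition and from Proposition 7.1.

# The five nonzero cells of the joint probability table and their common p.
pairs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
p = 0.2

mu_x = sum(x * p for x, _ in pairs)   # 2.0
mu_y = sum(y * p for _, y in pairs)   # 3.0
cov_def = sum((x - mu_x) * (y - mu_y) * p for x, y in pairs)  # definition
cov_alt = sum(x * y * p for x, y in pairs) - mu_x * mu_y      # E[XY] - mu_x mu_y
print(cov_def, cov_alt)               # both print 2.0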

Example 7.2 Next, consider another example with a nonzero joint probability for each combination (x, y). The table below summarizes the situation. The number in each cell represents the joint probability p(x, y). Again, $\sum_X \sum_Y p(x, y) = 1$ as it should be.

        x = 3    7      8
y = 2     0.1    0.1    0.2
    4     0.2    0.3    0.1

For this data, $\mu_X = 3(0.1 + 0.2) + 7(0.1 + 0.3) + 8(0.2 + 0.1) = 3(0.3) + 7(0.4) + 8(0.3) = 0.9 + 2.8 + 2.4 = 6.1$ and $\mu_Y = 2(0.1 + 0.1 + 0.2) + 4(0.2 + 0.3 + 0.1) = 2(0.4) + 4(0.6) = 0.8 + 2.4 = 3.2$,¹ and we have

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sum_{i=1}^{3}\sum_{j=1}^{2}(x_i - 6.1)(y_j - 3.2)\,p(x_i, y_j)$$
$$= (3-6.1)(2-3.2)(0.1) + (7-6.1)(2-3.2)(0.1) + (8-6.1)(2-3.2)(0.2) + (3-6.1)(4-3.2)(0.2) + (7-6.1)(4-3.2)(0.3) + (8-6.1)(4-3.2)(0.1)$$
$$= (-3.1)(-1.2)(0.1) + (0.9)(-1.2)(0.1) + (1.9)(-1.2)(0.2) + (-3.1)(0.8)(0.2) + (0.9)(0.8)(0.3) + (1.9)(0.8)(0.1)$$
$$= (3.72)(0.1) + (-1.08)(0.1) + (-2.28)(0.2) + (-2.48)(0.2) + (0.72)(0.3) + (1.52)(0.1)$$
$$= 0.372 - 0.108 - 0.456 - 0.496 + 0.216 + 0.152 = -0.32. \tag{7.6}$$

¹Don't commit a beginner's error here. The correct definition of the mean is $\mu = E[X] = \sum_{i=1}^N x_i p_i(x)$, the expected value, and not $\mu = \sum_{i=1}^N \frac{x_i}{N}$. The latter is valid only when all $p_i$'s are equal, so that $p_i = \frac{1}{N}$ for each of $i = 1, 2, \ldots, N$. Incidentally, if you make this mistake and use $\mu_X = 6$ and $\mu_Y = 3$, you will get $E[(X - \mu_X)(Y - \mu_Y)] = -0.3$, which does not agree with $E[XY] - \mu_X\mu_Y = 1.2$.


Alternatively,

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X\mu_Y$$
$$= (3)(2)(0.1) + (7)(2)(0.1) + (8)(2)(0.2) + (3)(4)(0.2) + (7)(4)(0.3) + (8)(4)(0.1) - (6.1)(3.2)$$
$$= 0.6 + 1.4 + 3.2 + 2.4 + 8.4 + 3.2 - 19.52 = 19.2 - 19.52 = -0.32$$

Note that the alternative formula simplifies the computation considerably.

Proposition 7.2
$$\mathrm{Cov}(X, X) = \mathrm{Var}(X)$$

This is a direct consequence of the definitions of variance and covariance.

$$\mathrm{Cov}(X, X) = E[(X - \mu_X)(X - \mu_X)] = E[(X - \mu_X)^2] = \mathrm{Var}(X)$$

Proposition 7.3 There is a simple relation among $\mathrm{Var}(X+Y)$, $\mathrm{Var}(X)$, $\mathrm{Var}(Y)$, and $\mathrm{Cov}(X, Y)$; namely,

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y).$$

Proof

$$\mathrm{Var}(X+Y) = \mathrm{Cov}(X+Y, X+Y) = \mathrm{Cov}(X, X) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(Y, X) + \mathrm{Cov}(Y, Y)$$
$$= \mathrm{Cov}(X, X) + \mathrm{Cov}(Y, Y) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(X, Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$$

Here is a list of the properties of covariance.

Properties of Population Covariance

Let X, Y, Z, V, and W be random variables, and let a, b, c, and d be arbitrary constants. Then, we have the following identities. Some are redundant but listed here for ready reference.

1. Cov(X, Y ) = E[XY ] − E[X]E[Y ]

2. Cov(X, Y ) = Cov(Y, X)

3. $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$ or $\sigma_X^2$

4. Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)


5. Cov(aX, bY ) = abCov(X, Y )

6. Cov(X + a, Y + b) = Cov(X, Y )

7. Cov(a + bX, Y ) = bCov(X, Y )

8. $\mathrm{Cov}(aX + bY, cV + dW) = ac\,\mathrm{Cov}(X, V) + ad\,\mathrm{Cov}(X, W) + bc\,\mathrm{Cov}(Y, V) + bd\,\mathrm{Cov}(Y, W)$

7.2 Population Correlation

While covariance has the reasonable properties given above, the number representing covariance depends on the units of the data, and it is difficult to compare covariances among data sets having different scales. It is the correlation coefficient that addresses this issue by normalizing the covariance by the product of the standard deviations of the variables, creating a dimensionless quantity that facilitates comparison between different data sets.

Definition 7.2 (Population Correlation) If X and Y are random variables with standard deviations $\sigma_X$ and $\sigma_Y$, then the number

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\sigma_Y}$$

is called the correlation coefficient of X and Y. $\rho_{X,Y}$ is also denoted by Corr(X, Y).

To see the difference in the information represented by covariance and correlation, consider the following simple example.

Example 7.3 Consider the two cases below.

1. Y = X + 1 and p(x, y) = 0.2 for all pairs (x, y) in the table. Implicit in this table is the assumption that other pairs not in the table, such as (2, 1) and (3, 5), do not occur.

   X   0   1   2   3   4
   Y   1   2   3   4   5

2. Y = X + 100 and p(x, y) = 0.2 for all pairs (x, y).

   X   0     100   200   300   400
   Y   100   200   300   400   500


In Case 1, $\bar{x} = 2$, $\bar{y} = 3$,
$$\sigma_x = \sqrt{\frac{\sum_{i=1}^{5}(x_i - 2)^2}{5}} = \sqrt{\frac{(0-2)^2 + (1-2)^2 + (2-2)^2 + (3-2)^2 + (4-2)^2}{5}} = \sqrt{\frac{4+1+0+1+4}{5}} = \sqrt{2},$$
$$\sigma_y = \sqrt{\frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5}} = \sqrt{\frac{4+1+0+1+4}{5}} = \sqrt{2}.$$

So,
$$\mathrm{Cov}(X, Y) = E[XY] - \mu_X\mu_Y = (0 + 2 + 6 + 12 + 20)(0.2) - (2)(3) = 2,$$
and
$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\sigma_Y} = \frac{2}{\sqrt{2}\sqrt{2}} = 1.$$

In Case 2, $\bar{x} = 200$, $\bar{y} = 300$,
$$\sigma_x = \sqrt{\frac{(0-200)^2 + (100-200)^2 + (200-200)^2 + (300-200)^2 + (400-200)^2}{5}} = \sqrt{\frac{40000+10000+0+10000+40000}{5}} = \sqrt{20000},$$
$$\sigma_y = \sqrt{\frac{(100-300)^2 + (200-300)^2 + (300-300)^2 + (400-300)^2 + (500-300)^2}{5}} = \sqrt{\frac{40000+10000+0+10000+40000}{5}} = \sqrt{20000}.$$

So,
$$\mathrm{Cov}(X, Y) = E[XY] - \mu_X\mu_Y = (0 + 20000 + 60000 + 120000 + 200000)(0.2) - (200)(300) = 20000,$$
and
$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\sigma_Y} = \frac{20000}{\sqrt{20000}\sqrt{20000}} = 1.$$

Comparing Cases 1 and 2, one can see that Cov(X, Y) is affected directly by the size of the random variables X and Y, while Corr(X, Y) successfully extracts the relationship between X and Y. In this example, the relationship is reflected in the slope of the straight line. In both cases, one unit of increase in X is associated with one unit of increase in Y.²

²We will see later that the slope is indeed Corr(X, Y) in the framework of linear regression.
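The following sketch illustrates this point numerically. It assumes NumPy is available (an assumption of this illustration, not part of the text); the arrays simply encode Cases 1 and 2.

# Rescaling inflates the covariance but leaves the correlation at 1.
import numpy as np

x1 = np.arange(5.0);         y1 = x1 + 1.0     # Case 1
x2 = 100.0 * np.arange(5.0); y2 = x2 + 100.0   # Case 2

for x, y in [(x1, y1), (x2, y2)]:
    cov = np.mean(x * y) - x.mean() * y.mean()  # population covariance
    corr = cov / (x.std() * y.std())            # population std (ddof = 0)
    print(cov, corr)                            # 2.0, 1.0 then 20000.0, 1.0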

Here is a list of the properties of correlation.

Properties of Population Correlation

Let X and Y be random variables, and let a and b be arbitrary constants. Then, we have the following identities. Some are redundant but listed here for ready reference.

1. The correlation coefficient is independent of the measurement scale. Hence, it gives the same value whether the height is measured in feet and inches or in centimeters, for example.

2. Corr(X, Y ) = Corr(Y, X)

3. Corr(X, Y ) and Cov(X, Y ) have the same sign.

4. Corr(a + bX, Y ) = Corr(X, Y ) if b > 0

5. Corr(a + bX, Y ) = −Corr(X, Y ) if b < 0

6. If a and c are either both positive or both negative, Corr(aX + b, cY + d) =Corr(X, Y ).a

7. The absolute value of the correlation coefficient is at most 1; |ρ| ≤ 1.

8. Corr(X, Y) = 1 if and only if Y = a + bX for some constants a and b > 0. Likewise, Corr(X, Y) = −1 if and only if Y = a + bX for some constants a and b < 0.

ᵃWe sometimes say the correlation is independent of both the origin and scale, as this property implies Corr(X, Y) = Corr(U, V) for U = c(X − X₀) and V = d(Y − Y₀) for any constants c, d, X₀, and Y₀ such that cd > 0.

Before closing this section, let us make a note of the following fact.

Fact 7.1 The correlation between X and Y is the covariance of their z-statistics $Z_X$ and $Z_Y$. That is, $\mathrm{Corr}(X, Y) = \mathrm{Cov}(Z_X, Z_Y)$. So, correlation is a standardized version of covariance.

Reason

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y} = E\left[\left(\frac{X - \mu_X}{\sigma_X}\right)\left(\frac{Y - \mu_Y}{\sigma_Y}\right)\right]$$
$$= E[Z_X Z_Y] = E[(Z_X - 0)(Z_Y - 0)] = E[(Z_X - \mu_{Z_X})(Z_Y - \mu_{Z_Y})] = \mathrm{Cov}(Z_X, Z_Y)$$

Recall that the mean of Z is 0.

7.3 Sample Covariance

As covariance is a measure of the association between two variables X and Y, our sample is necessarily of the form $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Given this sample, how should we define sample covariance? Recall the population covariance.

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sum_x \sum_y (x - \mu_X)(y - \mu_Y)\,p(x, y)$$


We cannot easily use an analogous form for the sample, as assessing each p(x, y) is not an easy task. However, without going deeply into mathematical details, we can give a convincing argument as to the form of sample covariance. We only need to compare

$$s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}, \quad \sigma^2 = \mathrm{Cov}(X, X) = E[(X - \mu_X)^2], \quad\text{and}\quad \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)].$$

In particular, comparing $(x_i - \bar{x})^2$, $(X - \mu_X)^2$, and $(X - \mu_X)(Y - \mu_Y)$, it seems reasonable to replace $(x_i - \bar{x})^2$ with $(x_i - \bar{x})(y_i - \bar{y})$ to obtain the sample covariance.

Definition 7.3 (Sample Covariance) Given a sample $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the sample covariance $S_{X,Y}$ is defined by

$$S_{X,Y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n-1}.$$³

³Though it is well beyond the scope of this book, it can be shown that this expression converges to the population covariance in a certain way, known as convergence in probability [Hogg et al., 2012, p.316]. It can also be shown that $E[S_{X,Y}] = \mathrm{Cov}(X, Y)$, and $S_{X,Y}$ is an unbiased estimator [Siegrist, 2012a].

Proposition 7.4 The sample covariance satisfies the following identity, which offers an alternative way to compute it.

$$S_{X,Y} = \frac{n}{n-1}\left(\overline{XY} - \bar{X}\cdot\bar{Y}\right) \quad\text{or}\quad S_{X,Y} = \frac{1}{n-1}\left(\sum_{i=1}^n x_iy_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n}\right) \tag{7.7}$$

I wrote $\bar{X}\cdot\bar{Y}$ rather than the usual $\bar{X}\,\bar{Y}$ to make the distinction between the mean of the product $\overline{xy}$ and the product of two means $\bar{x}\cdot\bar{y}$ clear.

Proof
It suffices to show

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_iy_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n}.$$

Now,
$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n (x_iy_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}) = \sum_{i=1}^n x_iy_i - \bar{y}\sum_{i=1}^n x_i$$


$$-\,\bar{x}\sum_{i=1}^n y_i + n\bar{x}\bar{y} = \sum_{i=1}^n x_iy_i - \bar{y}\,n\,\frac{\sum_{i=1}^n x_i}{n} - \bar{x}\,n\,\frac{\sum_{i=1}^n y_i}{n} + n\bar{x}\bar{y} = \sum_{i=1}^n x_iy_i - n\bar{y}\bar{x} - n\bar{x}\bar{y} + n\bar{x}\bar{y}$$
$$= \sum_{i=1}^n x_iy_i - n\bar{x}\bar{y} = \sum_{i=1}^n x_iy_i - n\,\frac{\sum_{i=1}^n x_i}{n}\,\frac{\sum_{i=1}^n y_i}{n} = \sum_{i=1}^n x_iy_i - \frac{\sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n}. \tag{7.8}$$

Here are some essential properties of sample covariance. Let X, Y, and Z be random variables and c be an arbitrary constant. The notation S(X, Y) is used instead of $S_{X,Y}$ for clarity where such is deemed appropriate.

1. $S_{X,X} = s_X^2$: Sample covariance is a generalization of sample variance.

2. S(X, Y) = S(Y, X)

3. S(X + Y, Z) = S(X, Z) + S(Y, Z) and S(X, Y + Z) = S(X, Y) + S(X, Z)

4. S(cX, Y) = S(X, cY) = cS(X, Y) for any number cᵃ

5. $S(X + Y, X + Y) = S(X, X) + S(Y, Y) + 2S(X, Y) = s_X^2 + s_Y^2 + 2S(X, Y)$

6. More generally, $S\left(\sum_{i=1}^k a_iX_i, \sum_{j=1}^l b_jY_j\right) = \sum_{i=1}^k \sum_{j=1}^l a_ib_j\,S(X_i, Y_j)$.

ᵃProperties 3 and 4 make S bilinear by definition.
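Here is a minimal sketch in plain Python checking Proposition 7.4 on a small sample; the numbers are made up for illustration, not taken from the text.

# Sample covariance two ways: Definition 7.3 and Proposition 7.4.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 5.0, 11.0]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

s_def = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
s_alt = (sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n) / (n - 1)
print(s_def, s_alt)   # identical values (10.5 for these data)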

7.4 Sample Correlation

Recall the population correlation $\rho_{X,Y}$ defined by $\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\sigma_Y}$. The sample correlation coefficient $r_{X,Y}$ is defined analogously.

$$r_{X,Y} = \frac{S_{X,Y}}{s_X s_Y}$$

This amounts to computing

$$r_{X,Y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}.$$


Note that one can also write

$$r_{X,Y} = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_X}\right)\left(\frac{y_i - \bar{y}}{s_Y}\right),$$

where $s_X$ and $s_Y$ are the standard deviations of X and Y, and hence $\frac{x_i - \bar{x}}{s_X}$ and $\frac{y_i - \bar{y}}{s_Y}$ are the corresponding standard scores.

Fact 7.2 Fact 7.1 still holds, and sample correlation is a standardized version of sample covariance.

Alternative formulas are also available.

$$r_{X,Y} = \frac{\sum_{i=1}^n x_iy_i - n\bar{x}\bar{y}}{(n-1)s_Xs_Y} = \frac{n\sum_{i=1}^n x_iy_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{\sqrt{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}\sqrt{n\sum_{i=1}^n y_i^2 - \left(\sum_{i=1}^n y_i\right)^2}} \tag{7.9}$$
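The following sketch, assuming NumPy is available and using made-up data, compares formula (7.9) with NumPy's built-in corrcoef.

# Formula (7.9) versus numpy.corrcoef on a tiny sample.
import numpy as np
from math import sqrt

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 3.0, 5.0, 11.0])
n = len(x)

num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = sqrt(n * np.sum(x**2) - np.sum(x)**2) * sqrt(n * np.sum(y**2) - np.sum(y)**2)
print(num / den, np.corrcoef(x, y)[0, 1])   # the two values agree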

The properties of sample correlation listed below follow from the corresponding properties of sample covariance. Let X and Y be random variables and a, b, c, and d be nonzero constants.

1. $r_{X,Y} = S_{z_X, z_Y}$: The correlation between X and Y is the covariance between the z-scores of X and Y.

2. $r_{X,Y} = r_{Y,X}$

3. $r_{aX,Y} = r_{X,aY} = r_{X,Y}$ if a > 0

4. $r_{aX,Y} = r_{X,aY} = -r_{X,Y}$ if a < 0

5. $r_{X+a,Y+b} = r_{X,Y}$

6. $r_{aX+b,cY+d} = r_{X,Y}$ if $a \times c > 0$

7. $r_{aX+b,cY+d} = -r_{X,Y}$ if $a \times c < 0$

8. $-1 \leq r_{X,Y} \leq +1$

9. $r_{X,Y} = +1$ if and only if the sample points lie on a line with a positive slope; i.e., Y = a + bX with b > 0. Likewise, $r_{X,Y} = -1$ if and only if the sample points lie on a line with a negative slope; i.e., Y = a + bX with b < 0.


Properties 6 and 7 are consequences of Property 2 through Property 5.⁴ For proofs of Properties 8 and 9, see Appendix D.

7.5 Correlation as Inner Product

Our discussion here is only heuristic, but comparing correlation and inner product will help our qualitative understanding of correlation.

Consider two vectors with n components:

$$\mathbf{X} = [(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})], \quad \mathbf{Y} = [(y_1 - \bar{y}), (y_2 - \bar{y}), \ldots, (y_n - \bar{y})].$$

Then, the inner product
$$\mathbf{X}\cdot\mathbf{Y} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}),$$
$$\|\mathbf{X}\| = \sqrt{\mathbf{X}\cdot\mathbf{X}} = \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2},$$
and
$$\|\mathbf{Y}\| = \sqrt{\mathbf{Y}\cdot\mathbf{Y}} = \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}.$$

Now, remember the definition of the sample correlation $r_{X,Y}$.

$$r_{X,Y} = \frac{S_{X,Y}}{s_Xs_Y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{\mathbf{X}\cdot\mathbf{Y}}{\|\mathbf{X}\|\,\|\mathbf{Y}\|} \tag{7.10}$$

⁴The sample correlation coefficient r is what is called the maximum likelihood estimator (mle) for the correlation parameter ρ of a distribution called the bivariate normal distribution [Hogg et al., 2012, pp. 471, 507, 584]. And this is the reason why our sample correlation takes the form given here. But the concept of mle is well beyond the scope of this book.


In two dimensions, such as the xy-plane, we have

$$\mathbf{X}\cdot\mathbf{Y} = \|\mathbf{X}\|\,\|\mathbf{Y}\|\cos\theta \quad\text{or}\quad \frac{\mathbf{X}\cdot\mathbf{Y}}{\|\mathbf{X}\|\,\|\mathbf{Y}\|} = \cos\theta,$$

where θ is the angle between the vectors X and Y.

If X and Y are parallel and pointing in the same direction, θ = 0 and cos θ = 1. If X and Y are anti-parallel, pointing in opposite directions, θ = π and cos θ = −1.

We can see that the correlation coefficient Corr(X, Y) or $r_{X,Y}$ is an n-dimensional analogue of cos θ.
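The following sketch, assuming NumPy, makes this concrete: after centering, the sample correlation is exactly the cosine of the angle between the two data vectors.

# Correlation as cosine of the angle between centered vectors.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 3.0, 5.0, 11.0])
X = x - x.mean()   # centered vector X
Y = y - y.mean()   # centered vector Y

cos_theta = X @ Y / (np.linalg.norm(X) * np.linalg.norm(Y))
print(cos_theta, np.corrcoef(x, y)[0, 1])   # identical up to rounding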

Looking ahead −→ Coefficient of Determination: When we discuss simple linear regression, which is an attempt to describe the relation between the variables Y and X in the familiar form Y = aX + b, we will see that the square of the correlation coefficient between the values predicted by Y = aX + b and the actually observed Y-values provides a measure of how good the prediction is. This is called the coefficient of determination.

7.6 Variance Covariance Matrix

Definition 7.4 (Variance Covariance Matrix) Consider a set of random variables $\{X_1, X_2, \ldots, X_n\}$ or equivalently a random column vector

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}. \tag{7.11}$$

The n × n symmetric matrix Σ such that its i-th row and j-th column entry $\Sigma_{ij}$ is given by

$$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)], \tag{7.12}$$


where $\mu_i$ and $\mu_j$ are the means/expected values of $X_i$ and $X_j$, is called the variance covariance matrix. We have

$$\Sigma = [\Sigma_{ij}] = \begin{bmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & E[(X_1-\mu_1)(X_2-\mu_2)] & \cdots & E[(X_1-\mu_1)(X_n-\mu_n)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & E[(X_2-\mu_2)(X_2-\mu_2)] & \cdots & E[(X_2-\mu_2)(X_n-\mu_n)] \\ \vdots & \vdots & \ddots & \vdots \\ E[(X_n-\mu_n)(X_1-\mu_1)] & E[(X_n-\mu_n)(X_2-\mu_2)] & \cdots & E[(X_n-\mu_n)(X_n-\mu_n)] \end{bmatrix}. \tag{7.13}$$

Note that

$$E[\mathbf{X}] = E\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} = \begin{bmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_n] \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}, \tag{7.14}$$

and

$$\mathbf{X} - E[\mathbf{X}] = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix} - \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix} = \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \\ \vdots \\ X_n - \mu_n \end{bmatrix} \tag{7.15}$$

Hence,

$$(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T = \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \\ \vdots \\ X_n - \mu_n \end{bmatrix}\begin{bmatrix} X_1 - \mu_1 & X_2 - \mu_2 & \cdots & X_n - \mu_n \end{bmatrix}$$
$$= \begin{bmatrix} (X_1-\mu_1)(X_1-\mu_1) & (X_1-\mu_1)(X_2-\mu_2) & \cdots & (X_1-\mu_1)(X_n-\mu_n) \\ (X_2-\mu_2)(X_1-\mu_1) & (X_2-\mu_2)(X_2-\mu_2) & \cdots & (X_2-\mu_2)(X_n-\mu_n) \\ \vdots & \vdots & \ddots & \vdots \\ (X_n-\mu_n)(X_1-\mu_1) & (X_n-\mu_n)(X_2-\mu_2) & \cdots & (X_n-\mu_n)(X_n-\mu_n) \end{bmatrix}$$


and

$$E\left[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T\right] = \begin{bmatrix} E[(X_1-\mu_1)(X_1-\mu_1)] & \cdots & E[(X_1-\mu_1)(X_n-\mu_n)] \\ E[(X_2-\mu_2)(X_1-\mu_1)] & \cdots & E[(X_2-\mu_2)(X_n-\mu_n)] \\ \vdots & \ddots & \vdots \\ E[(X_n-\mu_n)(X_1-\mu_1)] & \cdots & E[(X_n-\mu_n)(X_n-\mu_n)] \end{bmatrix} \tag{7.16}$$

Fact 7.3 (Compact Representation of Variance Covariance Matrix) We have the following matrix relation between the variance covariance matrix Σ and X of Definition 7.4.

$$\Sigma = E\left[(\mathbf{X} - E[\mathbf{X}])(\mathbf{X} - E[\mathbf{X}])^T\right] \tag{7.17}$$


Exercises

Click to go to solutions.

1. Write down the formula with the numbers given below to calculate the Pearson's correlation coefficient, r, between mother's education and daughter's education, both expressed in terms of the number of years in school. Do not use the summation symbol Σ. (Note: The Pearson's coefficient is the only correlation coefficient we encountered in the lecture.)

   Mother's education   Daughter's education
   9                    10
   11                   22
   13                   19

2. Write down the formula with the scores given below to calculate the Pearson's correlation coefficient, r, between Test 1 and Test 2. Do not use the summation symbol Σ. (Note: The Pearson's coefficient is the only correlation coefficient we encountered in the lecture.)

   Test 1   Test 2
   60       10
   70       30
   80       20

3. Give an example of a case where there is a high Pearson's correlation between two variables A and B, but there is no direct causality between them. (Note: You will receive full credit only if you answer both (a) and (b) correctly.)

   (a) What are your A and B?

   (b) Why is there a high correlation between A and B, but no causality? (In at most 50 words.)

4. Write down the formula with the scores given below to calculate the Pearson's correlation coefficient, r, between Test 1 and Test 2. Do not use the summation symbol Σ. (Note: The Pearson's coefficient is the only correlation coefficient we encountered in the lecture.)

   Test 1   Test 2
   60       10
   70       30
   80       20

5. Give an example of a case where there is a high Pearson's correlation between two variables A and B, but there is no direct causality between them. (Note: You will receive full credit only if you answer both (a) and (b) correctly.)

   (a) What are your A and B?

   (b) Why is there a high correlation between A and B, but no causality? (In at most 50 words.)

6. (a) Write down the formula with the numbers given below to calculate the Pearson's correlation coefficient, r, between mother's education and daughter's education, both expressed in terms of the number of years in school. Do not use the summation symbol Σ. (Note: The Pearson's coefficient is the only correlation coefficient we encountered in the lecture.)

      Mother's education   Daughter's education
      9                    10
      11                   22
      13                   19

   (b) Give an example of two correlated quantities A and B, where there is no causality either way. That is, neither A causes B nor B causes A. Explain why there is no causality.

7. Write down the formula with the scores given below to calculate the Pearson's correlation coefficient, r, between Test 1 and Test 2. Do not use the summation symbol Σ. (Note: The Pearson's coefficient is the only correlation coefficient we encountered in the lecture.)

   Test 1   Test 2
   60       10
   70       30
   80       20

8. (a) Write down the formula with the numbers given below to calculate the Pearson's correlation coefficient, r, between mother's education and daughter's education, both expressed in terms of the number of years in school. Do not use the summation symbol Σ. (Note: The Pearson's coefficient is the only correlation coefficient we encountered in the lecture.)

      Mother's education   Daughter's education
      9                    10
      11                   22
      13                   19

   (b) Give an example of two correlated quantities A and B, where there is no causality either way. That is, neither A causes B nor B causes A. Explain why there is no causality.


Chapter 8

Analysis of Variance

It is often of extreme importance to test if the means obtained for several groups, called group means, are equal. Analysis of variance (ANOVA) is a method to test the null hypothesis that multiple independent population means are the same. In other words, we want to test whether the data come from populations with the same mean for the variable of interest.

$$H_0: \text{All group means are the same.} \tag{8.1}$$
$$H_a: \text{At least one of the means is different.} \tag{8.2}$$

For ANOVA to work properly, we will need the assumption that the measurements/observations are from normal populations with equal variances [Norušis, 2008, p. 147].

8.1 One-Way Analysis of Variance

Suppose we have k groups of measurements, such that the i-th group contains $n_i$ measurements $\{x_{i1}, x_{i2}, \ldots, x_{in_i}\}$ whose group mean is $\mu_i$. Then, the sets of measurements we have are:

$$\{x_{11}, x_{12}, \ldots, x_{1n_1}\},\ \{x_{21}, x_{22}, \ldots, x_{2n_2}\},\ \ldots,\ \{x_{k1}, x_{k2}, \ldots, x_{kn_k}\}.$$

This is shown in Table 8.1. In order to use analysis of variance, the observations must be independent random samples from normal populations with equal variances¹ [Norušis, 2008, p. 147]. So, we will assume that these conditions are satisfied.

¹The equal variance condition is referred to as homoscedasticity.


Table 8.1: k Groups of Measurements

   Group 1     Group 2     ...   Group i     ...   Group k
   x_{11}      x_{21}      ...   x_{i1}      ...   x_{k1}
   x_{12}      x_{22}      ...   x_{i2}      ...   x_{k2}
   ...         ...         ...   ...         ...   ...
   x_{1n_1}    x_{2n_2}    ...   x_{in_i}    ...   x_{kn_k}
   --------------------------------------------------------
   µ_1         µ_2         ...   µ_i         ...   µ_k

(The columns generally have different lengths, since the group sizes $n_i$ differ.)

These k groups are referred to as k treatments when they arise as a result of different processes they have gone through. Think of different types of medical treatments these groups have received. Actually, it is common practice to use the term treatment as a catch-all name for groups² in an ANOVA context. This is the convention we adopt here as well.

²In this abuse of terminology, a treatment is defined as a specific combination of different factors whose influence is compared with other treatments.

8.1.1 Partitioning the Total Sum of Squares

Now, let $\bar{x}$ be the mean of all the measurements and $\bar{x}_i$ be the mean for the i-th treatment; that is,

$$\bar{x} = \frac{x_{11} + x_{12} + \ldots + x_{1n_1} + x_{21} + \ldots + x_{2n_2} + \ldots + x_{kn_k}}{n_1 + n_2 + \ldots + n_k} = \frac{\sum_{i=1}^k\left(\sum_{j=1}^{n_i} x_{ij}\right)}{\sum_{i=1}^k n_i} \tag{8.3}$$

and


$$\bar{x}_i = \frac{x_{i1} + x_{i2} + \ldots + x_{in_i}}{n_i} = \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i}. \tag{8.4}$$

Then, we can express the difference between a measurement $x_{ij}$ and $\bar{x}$ as a sum of two differences: the difference between $x_{ij}$ and its treatment mean $\bar{x}_i$, and the difference between the treatment mean $\bar{x}_i$ and $\bar{x}$.

$$x_{ij} - \bar{x} = (x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x}). \tag{8.5}$$

The sum of squares of the left-hand side of (8.5), called the total sum of squares (SST), can be decomposed into local and global sums of squares as below.

$$\sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 + \sum_{i=1}^k n_i (\bar{x}_i - \bar{x})^2 \tag{8.6}$$

The first term is called the sum of squares within groups (SSW), and the second term is the sum of squares between groups (SSB). With these notations, (8.6) can be rewritten as

$$SST = SSW + SSB. \tag{8.7}$$

Relation (8.6) is obtained by straightforward algebraic operations. If you are so inclined, here is an unabridged line-by-line derivation of (8.6).

$$\sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} [(x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x})]^2$$
$$= \sum_{i=1}^k \sum_{j=1}^{n_i} \left[(x_{ij} - \bar{x}_i)^2 + 2(x_{ij} - \bar{x}_i)(\bar{x}_i - \bar{x}) + (\bar{x}_i - \bar{x})^2\right]$$
$$= \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} \left[2(x_{ij} - \bar{x}_i)(\bar{x}_i - \bar{x}) + (\bar{x}_i - \bar{x})^2\right] \tag{8.8}$$

It remains to show

$$\sum_{i=1}^k \sum_{j=1}^{n_i} \left[2(x_{ij} - \bar{x}_i)(\bar{x}_i - \bar{x}) + (\bar{x}_i - \bar{x})^2\right] = \sum_{i=1}^k n_i (\bar{x}_i - \bar{x})^2. \tag{8.9}$$


But,

$$\sum_{i=1}^k \sum_{j=1}^{n_i} \left[2(x_{ij} - \bar{x}_i)(\bar{x}_i - \bar{x}) + (\bar{x}_i - \bar{x})^2\right]$$
$$= \sum_{i=1}^k \sum_{j=1}^{n_i} \left[2\left(x_{ij}\bar{x}_i - x_{ij}\bar{x} - \bar{x}_i^2 + \bar{x}_i\bar{x}\right) + \bar{x}_i^2 - 2\bar{x}_i\bar{x} + \bar{x}^2\right]$$
$$= \sum_{i=1}^k \sum_{j=1}^{n_i} \left(2x_{ij}\bar{x}_i - 2x_{ij}\bar{x} - 2\bar{x}_i^2 + 2\bar{x}_i\bar{x} + \bar{x}_i^2 - 2\bar{x}_i\bar{x} + \bar{x}^2\right) = \sum_{i=1}^k \sum_{j=1}^{n_i} \left(2x_{ij}\bar{x}_i - 2x_{ij}\bar{x} - \bar{x}_i^2 + \bar{x}^2\right)$$
$$= \sum_{i=1}^k \left[2\bar{x}_i \sum_{j=1}^{n_i} x_{ij} - 2\bar{x} \sum_{j=1}^{n_i} x_{ij} - \sum_{j=1}^{n_i} \bar{x}_i^2 + \sum_{j=1}^{n_i} \bar{x}^2\right]$$
$$= \sum_{i=1}^k \left[2\bar{x}_i n_i \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i} - 2\bar{x} n_i \frac{\sum_{j=1}^{n_i} x_{ij}}{n_i} - n_i\bar{x}_i^2 + n_i\bar{x}^2\right]$$
$$= \sum_{i=1}^k \left[2n_i\bar{x}_i^2 - 2n_i\bar{x}_i\bar{x} - n_i\bar{x}_i^2 + n_i\bar{x}^2\right] = \sum_{i=1}^k \left[n_i\bar{x}_i^2 - 2n_i\bar{x}_i\bar{x} + n_i\bar{x}^2\right]$$
$$= \sum_{i=1}^k n_i\left(\bar{x}_i^2 - 2\bar{x}_i\bar{x} + \bar{x}^2\right) = \sum_{i=1}^k n_i(\bar{x}_i - \bar{x})^2. \tag{8.10}$$

We have shown that the equality in (8.6) holds indeed.

8.1.2 Partitioning the Degrees of Freedom

We can also partition the degrees of freedom. We have

$$df_T = df_W + df_B, \tag{8.11}$$

where $df_T$ is the total degrees of freedom, $df_W$ is the within-groups degrees of freedom, and $df_B$ is the between-groups degrees of freedom. Let the total number of measurements be N.

$$N = \sum_{i=1}^k n_i \tag{8.12}$$

As one degree of freedom is lost when $\bar{x}$ is estimated, we have

$$df_T = N - 1. \tag{8.13}$$


Likewise, we lose k degrees of freedom when the means of the k groups are estimated, to give

$$df_W = N - k. \tag{8.14}$$

Finally, the between-groups degrees of freedom is given by the number of means, k, minus one degree of freedom lost when $\bar{x}$ is estimated.

$$df_B = k - 1 \tag{8.15}$$

Note that we indeed have

$$df_W + df_B = (N - k) + (k - 1) = N - 1 = df_T, \tag{8.16}$$

verifying (8.11).

8.1.3 Mean Squares and an F-Statistic

We now need the concept of mean square. A mean square, abbreviated as MS, is defined as the sum of squares divided by the degrees of freedom; that is, it is an “average” sum of squares. Hence, the total mean square $MS_T$, the within groups mean square $MS_W$, and the between groups mean square $MS_B$ are given by

$$MS_T = \frac{SST}{df_T}, \quad MS_W = \frac{SSW}{df_W}, \quad\text{and}\quad MS_B = \frac{SSB}{df_B}, \tag{8.17}$$

respectively. Then, it turns out the ratio $MS_B/MS_W$ gives an F-statistic $F_{df_B, df_W}$.

$$MS_B/MS_W = F_{df_B, df_W} \tag{8.18}$$

A derivation of Relation (8.18) can be found in Section H.1 of Appendix H. But the gist of the proof is that we have

$$SSB \sim \chi^2_{df_B} \tag{8.19}$$

and

$$SSW \sim \chi^2_{df_W}, \tag{8.20}$$

which in turn implies that

$$MS_B = \frac{SSB}{df_B} \sim \frac{\chi^2_{df_B}}{df_B} \tag{8.21}$$

and

$$MS_W = \frac{SSW}{df_W} \sim \frac{\chi^2_{df_W}}{df_W}. \tag{8.22}$$
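Here is a minimal computational sketch of the quantities defined above, assuming SciPy and NumPy are available and using three small made-up groups; scipy.stats.f_oneway reproduces the hand-computed F and its p-value.

# One-way ANOVA bookkeeping: SSB, SSW, F = MSB/MSW, and the p-value.
import numpy as np
from scipy.stats import f_oneway, f as f_dist

groups = [np.array([3.0, 5.0, 4.0]), np.array([6.0, 8.0, 7.0]),
          np.array([5.0, 5.0, 8.0])]
N = sum(len(g) for g in groups); k = len(groups)
grand = np.concatenate(groups).mean()

ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ssb / (k - 1)) / (ssw / (N - k))
p = f_dist.sf(F, k - 1, N - k)    # upper-tail area of the F-distribution
print(F, p)
print(f_oneway(*groups))          # same statistic and p-value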

We will now show how this F-statistic is used for our hypothesis testing by an example.

8.2 Effects of Different Treatments on High Blood Pressure

Our first example involves treatments of high blood pressure by different medicines. We will try to assess the effectiveness of these treatments by comparing the mean blood pressures of the patients after the treatment.

Example 8.1 Consider three medicines, A, B, and C, that are supposed to lower the systolic blood pressure³. We call these treatments A, B, and C, and let D be the control, where the participants received a placebo. The group sizes are $n_A = 12$, $n_B = 10$, $n_C = 14$, and $n_D = 12$, and their systolic blood pressure readings in mmHg as well as the means $\mu_A$, $\mu_B$, $\mu_C$, and $\mu_D$ are shown in Table 8.2. Assume that these groups were carefully matched for their blood pressure profile before the treatments.

While it may appear from the average systolic blood pressure values that A, B, and C indeed lowered the blood pressure compared with the placebo group D, we cannot easily tell whether this difference is statistically significant. One-way ANOVA is useful in this situation. Our null hypothesis is

$$H_0: \mu_A = \mu_B = \mu_C = \mu_D \tag{8.23}$$

and the alternative hypothesis is

$$H_a: \text{at least one of the means is different.} \tag{8.24}$$

According to Table 8.3, the significance level of the F-statistic is .658, and we cannot reject the null hypothesis at the α-level of .05.

³Systolic blood pressure is the top reading while diastolic blood pressure is the bottom reading. So, if your blood pressure ranges from 80 to 120, 80 is the diastolic blood pressure and 120 is the systolic blood pressure.


Table 8.2: Systolic Blood Pressure after Treatments (Data 1)

   Treatments
   A     B     C     D
   120   142   131   129
   136   126   154   155
   117   124   137   126
   125   113   130   139
   145   147   122   135
   137   129   135   123
   123   134   128   154
   128   141   117   155
   151   132   149   122
   119   119   127   143
   148         139   119
   133         122   139
               146
               119
   µA = 131.83   µB = 130.7   µC = 132.57   µD = 136.58

Table 8.3: ANOVA Table for Data 1

   ANOVA (Systolic)
                     Sum of Squares   df   Mean Square   F      Sig.
   Between Groups    225.888          3    75.296        .539   .658
   Within Groups     6144.112         44   139.639
   Total             6370.000         47


Therefore, despite the possible initial “hunch”, even the highest value $\mu_D = 136.58$ is not significantly different from the other three means at the α = .05 level. So, we cannot conclude that the treatments are indeed making a difference. Needless to say, the effects of the three treatments represented by $\mu_A$, $\mu_B$, and $\mu_C$ are not significantly different from one another, either.

Example 8.2 This example is the same as Example 8.1 except that the systolic pressure readings for the control D were each increased by 8 (Data 2).

Table 8.4: Systolic Blood Pressure after Treatments (Data 2)

   Treatments
   A     B     C     D
   120   142   131   137
   136   126   154   163
   117   124   137   134
   125   113   130   147
   145   147   122   143
   137   129   135   131
   123   134   128   162
   128   141   117   163
   151   132   149   130
   119   119   127   151
   148         139   127
   133         122   147
               146
               119
   µA = 131.83   µB = 130.7   µC = 132.57   µD = 144.58

Without much ado, let us go right to the ANOVA table for Data 2 (Table 8.5). As you can see, the F-value now gives the significance figure of .022 < .05. We reject the null hypothesis that $\mu_A = \mu_B = \mu_C = \mu_D$ in this case. The second column of Table 8.5 shows that “the total sum of squares = the sum of squares between groups + the sum of squares within groups” (7634.000 = 1489.888 + 6144.112), and the third column exhibits $df_T = N - 1$, $df_W = N - k$, and $df_B = k - 1$. The Mean Squares in the fourth column are computed as follows.

$$\frac{1489.888}{3} = 496.629 \quad\text{and}\quad \frac{6144.112}{44} = 139.639 \tag{8.25}$$


With these, we can compute the F-value;

$$F = \frac{MS_B}{MS_W} = \frac{496.629}{139.639} = 3.557, \tag{8.26}$$

which agrees with the entry in the fifth column of Table 8.5. Indeed, the p-value calculated with an online statistical calculator [Soper, nd] is 0.022, in agreement with Table 8.5.

Table 8.5: ANOVA Table for Data 2

   ANOVA (Systolic)
                     Sum of Squares   df   Mean Square   F       Sig.
   Between Groups    1489.888         3    496.629       3.557   .022
   Within Groups     6144.112         44   139.639
   Total             7634.000         47

Let us now examine other data included in the output of SPSS. The first is a table of descriptive statistics (Table 8.6). The column labeled “Std. Error” shows the result of “Std. Deviation”/√N. For example, for treatment A, we get

$$\text{Std. Error} = \frac{11.72}{\sqrt{12}} = 3.38. \tag{8.27}$$

Recall that this provides an estimate for the standard deviation of the sampling distribution of the mean. In that sense, Std. Error estimates the potential for sampling error.

Next is a table titled Multiple Comparisons (Table 8.7). This table lists the result of the significance test for the difference between two means, one pair at a time. The letters LSD right above the table stand for “Least Significant Difference t-test.”


Table 8.6: Descriptive Statistics for Data 2

   Descriptives
                                                95% Confidence Interval for Mean
           N    Mean     Std. Dev.   Std. Error   Lower Bound   Upper Bound   Min.     Max.
   A       12   131.83   11.72       3.38         124.39        139.28        117.00   151.00
   B       10   130.70   10.71       3.39         123.04        138.36        113.00   147.00
   C       14   132.57   11.39       3.04         125.99        139.15        117.00   154.00
   D       12   144.58   13.19       3.81         136.20        152.96        127.00   163.00
   Total   48   135.00   12.74       1.84         131.30        138.70        113.00   163.00

Table 8.7: Multiple Comparisons

   Multiple Comparisons (LSD)
                                   Mean                               95% Confidence Interval
   (I) Treatment   (J) Treatment   Difference (I-J)   Std. Error   Sig.    Lower Bound   Upper Bound
   A               B               1.133              5.060        0.824   -9.064        11.330
                   C               -0.738             4.649        0.875   -10.107       8.631
                   D               -12.750*           4.824        0.011   -22.473       -3.027
   B               A               -1.133             5.060        0.824   -11.330       9.064
                   C               -1.871             4.893        0.704   -11.732       7.989
                   D               -13.883*           5.060        0.009   -24.080       -3.686
   C               A               0.738              4.649        0.875   -8.631        10.107
                   B               1.871              4.893        0.704   -7.989        11.732
                   D               -12.012*           4.649        0.013   -21.381       -2.643
   D               A               12.750*            4.824        0.011   3.027         22.473
                   B               13.883*            5.060        0.009   3.686         24.080
                   C               12.012*            4.649        0.013   2.643         21.381
   *. The mean difference is significant at the 0.05 level.


The Mean Differences listed in the “Mean Difference (I-J)” column are $\mu_i - \mu_j$ for i, j = A, B, C, and D with i ≠ j. For example, the entry for the pair (A, D) is computed as below. This is the numerator of the formula for t.

$$\mu_A - \mu_D = 131.83 - 144.58 = -12.750 \tag{8.28}$$

The Standard Error in the next column was computed by the following formula.

$$s_{\mu_i - \mu_j} = \sqrt{MS_W\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} \tag{8.29}$$

For example, the entry for the pair (A, D) is computed as follows. This is the denominator of the formula for t.

$$s_{\mu_A - \mu_D} = \sqrt{MS_W\left(\frac{1}{n_A} + \frac{1}{n_D}\right)} = \sqrt{139.639\left(\frac{1}{12} + \frac{1}{12}\right)} = 4.824 \tag{8.30}$$

Hence, the t-value, also expressed as $LSD_{A\text{-}D}$, is given by

$$t = LSD_{A\text{-}D} = \frac{\mu_A - \mu_D}{s_{\mu_A - \mu_D}} = \frac{\mu_A - \mu_D}{\sqrt{MS_W\left(\frac{1}{n_A} + \frac{1}{n_D}\right)}} = \frac{-12.750}{4.824} = -2.643. \tag{8.31}$$

This statistic follows a t-distribution with N − k degrees of freedom, where N is the total number of data points, or the total number of the patients, and k is the number of treatments/groups. So, in our case, df = 48 − 4 = 44. A two-tailed test with Daniel Soper's Statistics Calculators [Soper, nd] gives a p-value of 0.011, in agreement with the Sig. entry for the pair (A, D) in Table 8.7. Now, for df = 44 Daniel Soper's calculator gives

$$t_{.025,44} = 2.01536757, \tag{8.32}$$

which allows us to compute the 95% confidence interval. Carrying as many significant figures as given by SPSS in order to generate results as close to the ones in Table 8.7 as possible, we get

$$-t_{.025,44} < \frac{(\mu_A - \mu_D) - (-12.750)}{s_{\mu_A - \mu_D}} < t_{.025,44}$$


$$\implies -2.01536757 < \frac{(\mu_A - \mu_D) - (-12.750)}{4.82422544605879} < 2.01536757$$
$$\implies -2.01536757 \times 4.82422544605879 - 12.750 < \mu_A - \mu_D < 2.01536757 \times 4.82422544605879 - 12.750$$
$$\implies -22.47258751 < \mu_A - \mu_D < -3.027412486. \tag{8.33}$$

These figures agree with the lower bound of −22.473 and the upper bound of −3.027 given as the (A, D) entries of the Lower Bound and Upper Bound columns of Table 8.7.
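The LSD computation above can be reproduced in a few lines. This sketch assumes SciPy; the inputs (MSW, group sizes, means, and df) are the values quoted in the text.

# LSD pairwise comparison of A and D: t, two-tailed p, and 95% CI.
from math import sqrt
from scipy.stats import t as t_dist

msw, n_a, n_d, df = 139.639, 12, 12, 44
diff = 131.83 - 144.58                   # mu_A - mu_D = -12.750
se = sqrt(msw * (1 / n_a + 1 / n_d))     # ~4.824
t = diff / se                            # ~-2.643
p = 2 * t_dist.sf(abs(t), df)            # ~0.011
tc = t_dist.ppf(0.975, df)               # t_{.025,44} = 2.0154
print(t, p, (diff - tc * se, diff + tc * se))  # CI ~(-22.47, -3.03)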

Finally, the “Means Plots” output given in Figure 8.1 is what you think it is. It plots the means of the four treatments/groups.

[Figure 8.1 here: a means plot with Treatment (A, B, C, D) on the horizontal axis and Mean of Systolic Blood Pressure (roughly 132 to 144) on the vertical axis.]

Figure 8.1: Means of Systolic Blood Pressure for Treatments A, B, C, and the Control D


Exercises

Click to go to solutions.

1. Answer the following questions using the data given in the table below.

   ANOVA (Systolic)
                     Sum of Squares   df   Mean Square   F   Sig.
   Between Groups    880.888          3    a             c   d
   Within Groups     6144.112         44   b
   Total             7025.000         47

(a) Compute the mean squares a and b.

(b) Compute the F-statistic c.

(c) Compute the F-value for α = .05 using Daniel Soper's online calculator.

(d) What can you conclude about the means at the α level of .05?

(e) Compute the significance level d using the online calculator.

2. Three groups of students, Group A, Group B, and Group C, were given different types of trainings to improve their reading skill. There was also Group D, a control group, that did not receive any training. After the training, they took a reading comprehension test. The scores are listed in Table 8.8. The results of one-way ANOVA are summarized in Table 8.9.

(a) Give the fractions that give the values of the within groups mean square and the between groups mean square.

(b) Show how you compute the F-value.

(c) What do you conclude from Table 8.9 at α = .01?


Table 8.8: Reading Comprehension Scores

   Trainings
   A    B    C    D
   36   76   56   23
   65   47   98   70
   31   44   67   17
   45   24   55   41
   82   85   40   34
   67   53   64   12
   42   62   51   68
   51   75   31   70
   93   58   89   10
   35   35   49   48
   87        71   5
   60        40   41
             84
             35

Table 8.9: ANOVA Table for Reading Comprehension

   ANOVA (Reading Comprehension Scores)
                     Sum of Squares   df   Mean Square   F       Sig.
   Between Groups    4141.576         3    1380.525      3.014   .040
   Within Groups     20154.340        44   458.053
   Total             24295.917        47


Chapter 9

Simple Linear Regression

A correlation coefficient is a measure of how data points (x, y) cluster around a straight line. However, it is not sensitive to the nature of that straight line. It gives no information as to the slope and intercept of the line. Indeed, we have seen that Corr(aX + b, cY + d) = ±Corr(X, Y) depending on whether ac is positive or negative. This means that we cannot distinguish among the relations Y = X, Y = 2X, Y = X + 1, and Y = 2X + 1, for example, because Corr(X, X) = Corr(X, 2X) = Corr(X, X + 1) = Corr(X, 2X + 1) = 1.

It would be useful if we could get the formula of the elusive straight line, if such a line actually exists. One way to accomplish this is linear regression, which gives us the best estimate for the slope and intercept of the straight-line relationship between the independent variable X and the dependent variable Y. This chapter is titled “Simple Linear Regression” as it deals with one independent variable. We will also deal with multiple linear regression in the next chapter, where we have more than one independent variable $X_1, X_2, \ldots, X_n$ and a relation $Y = a_1X_1 + a_2X_2 + \ldots + a_nX_n + b$.

9.1 The True Meaning of a Statistical Linear Relationship

In mathematics, a relationship like Y = a + bX is deterministic, and the value of Y is uniquely determined once the value of X is specified. However, statistics is a discipline that deals with probabilistic distributions, and linear regression is no exception. Here, we assume the Y values to be normally distributed about the value


a + bX. We will describe this situation by introducing an error term ε as follows.

Y = a + bX + ε.

If we have a sample $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we have a set of errors $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n\}$ such that $y_i = a + bx_i + \varepsilon_i$ for i = 1, 2, ..., n. The error term ε is a random variable with the following properties.

1. Unbiased: E[εi] = 0 (i.e. mean value of 0) for i = 1, 2, . . . , n.

2. Homoscedastic: $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for i = 1, 2, ..., n. The same variance σ² for any value of X.

3. Independent¹: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for i ≠ j.

4. Normal: εi ∼ N(0, σ2), i = 1, 2, . . . , n.

Due to Property 1 above, once the value of X is fixed at $x_0$, the expected value of the corresponding Y satisfies the relation below.

$$E[Y] = E[a + bx_0 + \varepsilon] = E[a] + E[bx_0] + E[\varepsilon] = a + bx_0.$$

Therefore, Y = a + bX specifies the mean value of Y.

$$\bar{Y} = a + bX \quad\text{or}\quad E[Y] = a + bX$$

Finally, it is comforting to know that simple linear regression is reasonably robust to mild violations of the four model assumptions. We need to be concerned only when there is a clear violation of the assumptions.

9.2 Sum of Squares, Sxx and Syy, and Sum of Cross Products, Sxy

We have already encountered $S_{xx}$, $S_{yy}$, and $S_{xy}$ when we defined and computed variance and covariance, without referring to them explicitly by these names. As simple linear regression is closely tied to these concepts, we will now revisit them and give them compact notations.

¹It is customary to use the term “independent” in this context. However, as explained in Appendix C, the errors are actually uncorrelated and not independent in the strictly statistical sense.


9.2.1 Sum of Squares: Sxx and Syy

Remember the definition of the variance. For example, the sample variance was given by

$$s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}.$$

By definition, the sum of squares for X, denoted by $S_{xx}$, is the numerator of this expression.

$$S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$$

In words, $S_{xx}$ is the sum of squared distances from the mean for X. Similarly, for Y, the sum of squares is given by

$$S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2.$$

In words, $S_{yy}$ is the sum of squared distances from the mean for Y.

$S_{xx}$ and $S_{yy}$ are each called a total sum of squares, denoted by SST, and provide quantitative measures of the total amounts of variation in the observed x and y values. If there were no variation in y, for example, all y values would be $\bar{y}$; hence it makes sense to measure the variation by the signed distance from the mean $\bar{y}$. We square the measured distances so that positive and negative signed distances will not cancel each other, rendering the sum smaller than it should be and making the total variation appear too small.

$S_{xx}$ satisfies the following identity.

$$S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n \left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \sum_{i=1}^n x_i^2 - 2\bar{x}\sum_{i=1}^n x_i + n\bar{x}^2$$
$$= \sum_{i=1}^n x_i^2 - 2\bar{x}(n\bar{x}) + n\bar{x}^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2 \quad\text{or}\quad \sum_{i=1}^n x_i^2 - \frac{\left(\sum_{i=1}^n x_i\right)^2}{n}$$

Similarly, $S_{yy}$ satisfies the following identity.

$$S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n \left(y_i^2 - 2y_i\bar{y} + \bar{y}^2\right) = \sum_{i=1}^n y_i^2 - 2\bar{y}\sum_{i=1}^n y_i + n\bar{y}^2$$
$$= \sum_{i=1}^n y_i^2 - 2\bar{y}(n\bar{y}) + n\bar{y}^2 = \sum_{i=1}^n y_i^2 - n\bar{y}^2 \quad\text{or}\quad \sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n}$$

9.2.2 Sum of Cross Products: Sxy

The definition of the sample covariance is

SX,Y =∑n

i=1(xi − x)(yi − y)n − 1

,

and the sum of cross products Sx,y is the numerator of this expression.

Sxy =n∑

i=1(xi − x)(yi − y)

The following identity holds.

$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i y_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}) = \sum_{i=1}^{n}x_i y_i - \bar{y}\sum_{i=1}^{n}x_i - \bar{x}\sum_{i=1}^{n}y_i + \sum_{i=1}^{n}\bar{x}\bar{y}$$
$$= \left(\sum_{i=1}^{n}x_i y_i\right) - \bar{y}(n\bar{x}) - \bar{x}(n\bar{y}) + n\bar{x}\bar{y} = \left(\sum_{i=1}^{n}x_i y_i\right) - n\bar{x}\bar{y} \quad \text{or} \quad \left(\sum_{i=1}^{n}x_i y_i\right) - \frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n}$$
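As a quick sanity check, the following minimal Python sketch computes $S_{xx}$, $S_{yy}$, and $S_{xy}$ from the definitions and confirms the shortcut identities above; the data are arbitrary simulated values.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = rng.normal(size=50)
n = len(x)

# Definitions: sums of squared (or cross) deviations from the means.
Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

# Shortcut forms derived above.
Sxx_alt = np.sum(x ** 2) - x.sum() ** 2 / n
Syy_alt = np.sum(y ** 2) - y.sum() ** 2 / n
Sxy_alt = np.sum(x * y) - x.sum() * y.sum() / n

assert np.allclose([Sxx, Syy, Sxy], [Sxx_alt, Syy_alt, Sxy_alt])
```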

9.3 Principle of Least Squares

In this chapter, we are after the best-fitting straight line $y = a + bx$² to a set of data points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. However, we have not yet discussed what “best-fitting” means in this context, and the criterion we adopt here is the principle of least squares, also known as the method of least squares.

Definition 9.1 (Principle of Least Squares) The vertical deviation $r_i$ of the point $(x_i, y_i)$ from the line $y = a + bx$ is simply

$$r_i = y_i - (a + bx_i).$$

²In order to draw a distinction between the observed $y$-value and the computed $y$-value, we will use $\hat{y}$ for the computed value.


So, the sum of squared vertical deviations, $f(a, b)$, from the points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ to the line $y = a + bx$ is given by

$$f(a, b) = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}\left[y_i - (a + bx_i)\right]^2.$$

The coefficients $a$ and $b$ are determined so that $f(a, b)$ is minimized for the given data set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. This is called the Principle of Least Squares³. Note that the $x_i$'s and $y_i$'s are fixed numbers, typically a randomly picked sample of $n$ pairs $(x_i, y_i)$, and $f$ is to be regarded as a function of the two variables $a$ and $b$ in this context.

In order to determine the values of $a$ and $b$ for which $f(a, b)$ assumes the minimum value, we take partial derivatives of $f(a, b)$ with respect to $a$ and $b$, set them equal to 0, and solve the equations for $a$ and $b$ in terms of $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$.

Taking the partial derivative of $f(a, b)$ with respect to $a$,

$$\frac{\partial f(a, b)}{\partial a} = \frac{\partial}{\partial a}\sum_{i=1}^{n}\left[y_i - (a + bx_i)\right]^2 = \sum_{i=1}^{n}\frac{\partial}{\partial a}\left[y_i - (a + bx_i)\right]^2 = \sum_{i=1}^{n}2\left[y_i - (a + bx_i)\right](-1) = 0$$
$$\iff \sum_{i=1}^{n}\left[y_i - (a + bx_i)\right] = 0 \iff \sum_{i=1}^{n}(a + bx_i) = \sum_{i=1}^{n}y_i \iff na + \left(\sum_{i=1}^{n}x_i\right)b = \sum_{i=1}^{n}y_i.$$

Now, with respect to b.

$$\frac{\partial f(a, b)}{\partial b} = \frac{\partial}{\partial b}\sum_{i=1}^{n}\left[y_i - (a + bx_i)\right]^2 = \sum_{i=1}^{n}\frac{\partial}{\partial b}\left[y_i - (a + bx_i)\right]^2 = \sum_{i=1}^{n}2\left[y_i - (a + bx_i)\right](-x_i) = 0$$
$$\iff \sum_{i=1}^{n}\left[y_i - (a + bx_i)\right]x_i = 0 \iff \sum_{i=1}^{n}(x_i y_i - ax_i - bx_i^2) = 0 \iff \left(\sum_{i=1}^{n}x_i\right)a + \left(\sum_{i=1}^{n}x_i^2\right)b = \sum_{i=1}^{n}x_i y_i$$

So, we have the following simultaneous equations, known as the normal equations.

³Other names include the Method of Least Squares, Least Squares Method, and Least Square Fitting.


$$na + \left(\sum_{i=1}^{n}x_i\right)b = \sum_{i=1}^{n}y_i$$
$$\left(\sum_{i=1}^{n}x_i\right)a + \left(\sum_{i=1}^{n}x_i^2\right)b = \sum_{i=1}^{n}x_i y_i$$

Sometimes, the following equivalent forms are more convenient.

$$na + n\bar{x}b = \sum_{i=1}^{n}y_i$$
$$n\bar{x}a + \left(\sum_{i=1}^{n}x_i^2\right)b = \sum_{i=1}^{n}x_i y_i$$

Provided that at least two of the $x_i$'s are different, these equations have a unique solution. Skipping the arithmetic details, the solutions are

$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$$

and

$$a = \frac{\sum_{i=1}^{n}y_i - b\sum_{i=1}^{n}x_i}{n} = \bar{y} - b\bar{x} = \bar{y} - \frac{S_{xy}}{S_{xx}}\bar{x}.$$

These are our least squares estimates for the slope and intercept. We can now see how these parameters are related to the sum of squares $S_{xx}$ and the sum of cross products $S_{xy}$. We can also see that the following fact holds true.

Fact 9.1 The least squares line goes through the point $(\bar{x}, \bar{y})$.
Proof: $\hat{y}(\bar{x}) = a + b\bar{x} = \bar{y} - b\bar{x} + b\bar{x} = \bar{y}$
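The least squares formulas are easy to verify numerically. The sketch below, using arbitrary simulated data, computes $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$, checks Fact 9.1, and cross-checks against NumPy's built-in fit.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=100)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b = Sxy / Sxx                 # slope
a = y.mean() - b * x.mean()   # intercept

# Fact 9.1: the fitted line passes through (x-bar, y-bar).
assert np.isclose(a + b * x.mean(), y.mean())

# Cross-check against NumPy's least squares fit.
b_np, a_np = np.polyfit(x, y, deg=1)
assert np.allclose([a, b], [a_np, b_np])
```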

You may have noticed that we did not need the other sum of squares $S_{yy}$. In order to see how and where this sum of squares shows up, we will now consider $x = a' + b'y$. A moment's thought may tell you that all you have to do in order to compute $a'$ and $b'$ is to switch $x$ and $y$ everywhere in the previous calculation for $y = a + bx$. If you do not see it right away, you have no need to be concerned as it most likely means that you are a very careful thinker, which is a very good thing. At any rate, if you go through the same computation as you did for $y = a + bx$, you will get the following.

$$f(a', b') = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n}\left[x_i - (a' + b'y_i)\right]^2$$


$$\frac{\partial f(a', b')}{\partial a'} = \frac{\partial}{\partial a'}\sum_{i=1}^{n}\left[x_i - (a' + b'y_i)\right]^2 = \sum_{i=1}^{n}\frac{\partial}{\partial a'}\left[x_i - (a' + b'y_i)\right]^2 = \sum_{i=1}^{n}2\left[x_i - (a' + b'y_i)\right](-1) = 0$$
$$\iff \sum_{i=1}^{n}\left[x_i - (a' + b'y_i)\right] = 0 \iff \sum_{i=1}^{n}(a' + b'y_i) = \sum_{i=1}^{n}x_i \iff na' + \left(\sum_{i=1}^{n}y_i\right)b' = \sum_{i=1}^{n}x_i$$

$$\frac{\partial f(a', b')}{\partial b'} = \frac{\partial}{\partial b'}\sum_{i=1}^{n}\left[x_i - (a' + b'y_i)\right]^2 = \sum_{i=1}^{n}\frac{\partial}{\partial b'}\left[x_i - (a' + b'y_i)\right]^2 = \sum_{i=1}^{n}2\left[x_i - (a' + b'y_i)\right](-y_i) = 0$$
$$\iff \sum_{i=1}^{n}\left[x_i - (a' + b'y_i)\right]y_i = 0 \iff \sum_{i=1}^{n}(x_i y_i - a'y_i - b'y_i^2) = 0 \iff \left(\sum_{i=1}^{n}y_i\right)a' + \left(\sum_{i=1}^{n}y_i^2\right)b' = \sum_{i=1}^{n}x_i y_i$$

So, we have the following simultaneous equations.

$$na' + \left(\sum_{i=1}^{n}y_i\right)b' = \sum_{i=1}^{n}x_i$$
$$\left(\sum_{i=1}^{n}y_i\right)a' + \left(\sum_{i=1}^{n}y_i^2\right)b' = \sum_{i=1}^{n}x_i y_i$$

Provided that at least two of the $y_i$'s are different, these equations have a unique solution.

$$b' = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{S_{xy}}{S_{yy}} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)}$$

and

$$a' = \frac{\sum_{i=1}^{n}x_i - b'\sum_{i=1}^{n}y_i}{n} = \bar{x} - b'\bar{y} = \bar{x} - \frac{S_{xy}}{S_{yy}}\bar{y}.$$

Applying the principle of least squares, we have now determined the four coefficients $a$, $b$, $a'$, and $b'$ such that

$$y = a + bx \quad \text{and} \quad x = a' + b'y$$

are the least squares fits for the sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.


In the next section we will take a small detour to explore the relationship between the sample correlation coefficient and our least squares estimates.

9.4 Correlation, Least Squares, and Explained Variance

9.4.1 Correlation Coefficient and the Slopes

By definition, the sample covariance $S_{X,Y}$ is given by

$$S_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1},$$

and the sum of cross products $S_{xy}$ is the numerator of this expression.

$$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Therefore,

$$S_{xy} = (n - 1)S_{X,Y}.$$

Next, the sample variances for $X$ and $Y$ are

$$s_X^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \quad \text{and} \quad s_Y^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1},$$

and the sums of squares $S_{xx}$ and $S_{yy}$ are

$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 \quad \text{and} \quad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2.$$

So, we have

$$S_{xx} = (n - 1)s_X^2 \quad \text{and} \quad S_{yy} = (n - 1)s_Y^2.$$

Now, remember the definition of the sample correlation coefficient $\mathrm{Corr}(X, Y)$ or $r_{X,Y}$.

$$\mathrm{Corr}(X, Y) = r_{X,Y} = \frac{S_{X,Y}}{s_X s_Y}$$

Comparing this with the product $bb'$ from the previous section, we can see that

$$bb' = \frac{S_{xy}}{S_{xx}}\cdot\frac{S_{xy}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx}S_{yy}} = \frac{(n-1)^2 S_{X,Y}^2}{(n-1)s_X^2\,(n-1)s_Y^2} = \left(\frac{S_{X,Y}}{s_X s_Y}\right)^2 = \mathrm{Corr}(X, Y)^2 \text{ or } r_{X,Y}^2.$$

The product of the least squares slopes $bb'$ is the square of the correlation coefficient between $X$ and $Y$; that is, $bb' = r_{X,Y}^2$. But, this is not the end of the story.
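The identity $bb' = r_{X,Y}^2$ can likewise be confirmed numerically; the sketch below fits both regressions on arbitrary simulated data.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=80)
y = 0.6 * x + rng.normal(scale=0.8, size=80)

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b = Sxy / Sxx        # slope of the regression of y on x
b_prime = Sxy / Syy  # slope of the regression of x on y

r = np.corrcoef(x, y)[0, 1]
assert np.isclose(b * b_prime, r ** 2)
```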


9.4.2 Explained Variance and the Coefficient of Determination

We saw that a quantitative measure of the total amount of variation in observed $y$ values is given by the total sum of squares $S_{yy}$. In this section, we will see that the square of the correlation coefficient, called the coefficient of determination, is a measure of what proportion of the total variance in $y$, represented by $S_{yy}$, is explained by the least squares model.

As the measure of total variation in $y$ is given by

$$SST = S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2,$$

it is reasonable to measure the variation in $\hat{y}_i = a + bx_i$⁴ by the sum of $(\hat{y}_i - \bar{y})^2$⁵, called the regression sum of squares and denoted by $SSR$.

$$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

On the other hand, the error in the least squares prediction formula is given by the difference $y_i - \hat{y}_i$ for each $1 \leq i \leq n$. So, the sum of the squares of these quantities, $(y_i - \hat{y}_i)^2$ for $1 \leq i \leq n$, gives a quantitative measure of the error. This is called the error sum of squares and represented by $SSE$.

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Proposition 9.1 The total sum of squares is the sum of the regression sum of squares and the error sum of squares; that is, $SST = SSR + SSE$.

⁴It is important to understand the difference between $y_i$ and $\hat{y}_i$ clearly. While $y_i$ is the observed value itself, $\hat{y}_i$ is the value computed by plugging $x_i$ into $\hat{y} = a + bx$.

⁵Note that $\bar{\hat{y}} = \bar{y}$, as $\sum_{i=1}^{n}\hat{y}_i = \sum_{i=1}^{n}(a + bx_i) = na + bn\bar{x} = n(\bar{y} - b\bar{x}) + bn\bar{x} = n\bar{y}$.


Proof

$$SST = S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}\left[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\right]^2$$
$$= \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = SSE + SSR + 2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$$

So, it suffices to show

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0.$$

First, note that

$$\hat{y}_i - \bar{y} = a + bx_i - \bar{y} = \bar{y} - b\bar{x} + bx_i - \bar{y} = b(x_i - \bar{x}).$$

Hence,

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n}(y_i - \hat{y}_i)\,b(x_i - \bar{x}) = b\sum_{i=1}^{n}(y_i - \hat{y}_i)(x_i - \bar{x}),$$

and we need only to show

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)(x_i - \bar{x}) = 0.$$

But, using $\sum_{i=1}^{n}y_i = n\bar{y}$ and $\sum_{i=1}^{n}\hat{y}_i = n\bar{y}$,

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)(x_i - \bar{x}) = \sum_{i=1}^{n}(x_i y_i - \bar{x}y_i - x_i\hat{y}_i + \bar{x}\hat{y}_i) = \sum_{i=1}^{n}x_i y_i - \sum_{i=1}^{n}\bar{x}y_i - \sum_{i=1}^{n}x_i\hat{y}_i + \sum_{i=1}^{n}\bar{x}\hat{y}_i$$
$$= \sum_{i=1}^{n}x_i y_i - \bar{x}n\bar{y} - \sum_{i=1}^{n}x_i\hat{y}_i + \bar{x}n\bar{y} = \sum_{i=1}^{n}x_i y_i - \sum_{i=1}^{n}x_i\hat{y}_i = \sum_{i=1}^{n}x_i y_i - \sum_{i=1}^{n}x_i(a + bx_i)$$
$$= \sum_{i=1}^{n}x_i y_i - a\sum_{i=1}^{n}x_i - b\sum_{i=1}^{n}x_i^2 = \sum_{i=1}^{n}x_i y_i - n\bar{x}a - \left(\sum_{i=1}^{n}x_i^2\right)b = 0,$$

where the last equality follows from the second of the normal equations.

In terms of how well the least squares model fits the data, we should regard each term of

$$SST = SSR + SSE$$

as follows.

• SST is the total variance in the data to be explained.

• SSR is the amount of the variance the model has successfully accounted for.

• SSE = SST − SSR is the remaining variance⁶, or the variance the model has failed to explain.

We will use the ratio $\frac{SSR}{SST}$ as a measure of the model fit, and call it the coefficient of determination.

Because $\hat{y}_i = a + bx_i$ and $a = \bar{y} - b\bar{x}$, we also have $\bar{y} = a + b\bar{x}$. So,

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \bar{y} + \bar{y} - \hat{y}_i)^2 = \sum_{i=1}^{n}\left[y_i - \bar{y} + (a + b\bar{x}) - (a + bx_i)\right]^2$$
$$= \sum_{i=1}^{n}\left[y_i - \bar{y} + b(\bar{x} - x_i)\right]^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2 + b^2\sum_{i=1}^{n}(x_i - \bar{x})^2 - 2b\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})$$
$$= S_{yy} + b^2 S_{xx} - 2bS_{xy} = S_{yy} + \left(\frac{S_{xy}}{S_{xx}}\right)^2 S_{xx} - 2\left(\frac{S_{xy}^2}{S_{xx}}\right) = S_{yy} + \left(\frac{S_{xy}^2}{S_{xx}}\right) - 2\left(\frac{S_{xy}^2}{S_{xx}}\right)$$
$$= S_{yy} - \left(\frac{S_{xy}^2}{S_{xx}}\right) = S_{yy}\left(1 - \frac{S_{xy}^2}{S_{xx}S_{yy}}\right) = SST(1 - r_{X,Y}^2).$$

We now have

$$\frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} = 1 - (1 - r_{X,Y}^2) = r_{X,Y}^2.$$

Therefore, the coefficient of determination is given by the square of the correlation coefficient, $r_{X,Y}^2$.

⁶$SSE$ is indeed a variance, as $\hat{y} = a + bx$ estimates the mean of $Y$ at each $x$ and $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \approx \sum_{i=1}^{n}(y_i - \bar{y}_i)^2$, where $\bar{y}_i$ denotes that mean at $x_i$.
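Numerically, the decomposition $SST = SSR + SSE$ and the identity $SSR/SST = r_{X,Y}^2$ can be verified with a few lines of Python; the data here are arbitrary simulated values.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, size=60)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=60)

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

r = np.corrcoef(x, y)[0, 1]
assert np.isclose(SST, SSR + SSE)
assert np.isclose(SSR / SST, r ** 2)
```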


9.4.3 Correlation Coefficient and Standardized Coefficient

Remember the difference between covariance and correlation. Correlation is a standardized, rescaled, and dimensionless version of covariance which makes comparisons among different data sets easy. We can introduce the same standardization scheme into linear regression and use the standardized versions of $x$ and $y$, denoted here by $u$ and $w$. Namely, we will use

$$u_i = \frac{x_i - \bar{x}}{s_x} \quad \text{and} \quad w_i = \frac{y_i - \bar{y}}{s_y},$$

where $s_x$ and $s_y$ are standard deviations, to conduct a linear regression analysis $w = a + bu$. From Section 9.3, we know

$$b = \frac{S_{xy}}{S_{xx}}.$$

So, for $w = a + bu$,

$$b = \frac{S_{uw}}{S_{uu}}.$$

Proposition 9.2 The slope $b$ is the correlation coefficient $r_{X,Y}$ if $x$ and $y$ are standardized as above, and linear regression is performed on the resulting variables.

Proof
Recall that the mean is 0 and the standard deviation is 1 for a standardized variable. Then,

$$S_{uu} = \sum_{i=1}^{n}(u_i - \bar{u})^2 = (n - 1)\mathrm{Var}(u) = n - 1 \quad \text{and}$$

$$S_{uw} = \sum_{i=1}^{n}(u_i - \bar{u})(w_i - \bar{w}) = \sum_{i=1}^{n}u_i w_i = \sum_{i=1}^{n}\frac{x_i - \bar{x}}{s_x}\cdot\frac{y_i - \bar{y}}{s_y} = (n - 1)\frac{S_{X,Y}}{s_x s_y} = (n - 1)r_{X,Y}.$$

Hence,

$$b = \frac{S_{uw}}{S_{uu}} = \frac{(n - 1)r_{X,Y}}{n - 1} = r_{X,Y}.$$

When $x$ and $y$ are standardized, the resulting slope $b$ in $w = a + bu$ is called the standardized coefficient or $\beta$ coefficient, and the intercept $a$ is always 0. So, we simply have $w = bu$.
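A short sketch with arbitrary simulated data confirms Proposition 9.2: after standardizing with the sample standard deviations, the fitted slope equals $r_{X,Y}$ and the intercept vanishes.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=70)
y = 0.4 * x + rng.normal(scale=1.2, size=70)

# Standardize using sample standard deviations (ddof=1).
u = (x - x.mean()) / x.std(ddof=1)
w = (y - y.mean()) / y.std(ddof=1)

beta, intercept = np.polyfit(u, w, deg=1)
r = np.corrcoef(x, y)[0, 1]

assert np.isclose(beta, r)
assert np.isclose(intercept, 0.0)
```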


The magnitude of a regression coefficient is not necessarily a good indicator of how good a variable $x$ is in predicting the $y$ value. In particular, the units of measurement have a large effect on the size of the regression coefficients. For example, if $x$ is the length measured in meters, the coefficient is 1,000 times larger than when $x$ is measured in millimeters, provided the unit of measurement for $y$ remains the same. Consider a very simple case where a person's weight $y$ is measured in kilograms and height $x$ in meters, and suppose that the relationship is $y = 10 + 30x$, so that $a = 10$ and $b = 30$. If $x = 1.7$, $y = 10 + (30)(1.7) = 61$, which happens to be the right weight to remain healthy. Now, suppose we change the unit of measurement for height from meters to millimeters. Then $x = 1.7$ meters $= 1700$ millimeters, and the formula has to be adjusted as below to generate the same $y$ value, namely 61.

$$y = 10 + 30 \times 1.7 \ \text{(in meters)} = 10 + \frac{30}{1000} \times 1700 \ \text{(in millimeters)}$$

So, $b$ is now 0.03, one 1,000th of the original value of 30.

One way to remedy this situation and make variables more comparable is the standardization introduced in this section.

9.5 Interpreting and Using Software Output

Computations involved in linear regression are very cumbersome even for a small sample and almost prohibitive for a large sample. Therefore, researchers ordinarily use some kind of software to do the computation. Typical software provides more output than just the values of the slope and the intercept. We will now look at such outputs and learn how they can help us.

As it is beyond the scope of this textbook, we will not discuss the mathematical details of how the numbers are obtained and focus mainly on how they can be interpreted. For mathematical details, see the books by Pardoe [Pardoe, 2012] and Devore and Berk [Devore and Berk, 2012].⁷

For demonstration purposes, we will show the output of SPSS, one of a few popular software packages. The output of other software is very similar.

⁷Details of how to conduct linear regression on statistical software are not provided in this book. Suffice it to say that the procedure is quite straightforward once you have your data in, for example, Excel format. Incidentally, if you are only interested in getting the slope, intercept, and coefficient of determination, Excel can do it for you.


The data we are using here are listening comprehension and reading comprehension scores on an English proficiency test taken by students at Kyoto University, located in Kyoto, Japan. Our dependent variable $Y$ is the listening comprehension score, denoted by MLLT, and the independent/predictor variable $X$ is the reading comprehension score, denoted by MRCL.⁸

9.5.1 Variables Entered and Removed

Table 9.1: Variables Entered/Removed

Model  Variables Entered  Variables Removed  Method
1      MRCL               (none)             Enter

Table 9.1 lists the independent variables used in the prediction of $Y$. The table is of no use in Simple Linear Regression as we have only one variable. However, we will later consider cases where there is more than one independent/predictor variable, called Multiple Linear Regression. There, we will choose the best subset of predictor variables from a set of candidates, and Table 9.1 is useful in keeping track of which variables are included and which are not. In the largely trial-and-error process of arriving at an optimal formula, the software tries entering and removing each variable, and this is the reason for the expression “Variables Entered/Removed”.

9.5.2 Model Summary

In Table 9.2, R is the absolute value of the correlation coefficient between MLLT (listening comprehension) and MRCL (reading comprehension), and R Square is the coefficient of determination, which is nothing but the correlation coefficient squared, representing the proportion of the total variance the model explains.

⁸You may wonder why these names are used. LLT stands for “Long Listening Comprehension” as the students listened for 3 to 5 minutes before answering questions, and RCL represents “Reading Comprehension of Listening Scripts” as the text they read was indeed the script of some listening comprehension material. This was done in order to set the difficulty levels equal, among other reasons. The first letter “M” is added to indicate the fact that the raw scores were subjected to Rasch analysis, which linearizes and rescales the scores to produce an induced “M”easure from the raw score so that further statistical analyses will become more accurate and meaningful.


Table 9.2: Model Summary

Model  R     R Square  Adjusted R Square  Standard Error of the Estimate
1      .756  .571      .567               .511060

A model always fits the sample from which it is calculated better than other samples, and this is problematic in predicting the population parameters such as the slope and intercept. In this sense, the R Square calculated from one particular sample almost always overestimates the amount of variance explained for the population if the same slope and intercept are used. Adjusted R Square is an attempt to correct for this overestimation by reducing the size of R Square. The formula is

$$\text{Adjusted R Square} = 1 - (1 - R^2)\frac{n - 1}{n - 2} \quad \text{or} \quad R^2 - (1 - R^2)\frac{1}{n - 2}.$$

Recall that

$$R^2 = 1 - \frac{SSE}{SST}.$$

We will replace $SSE$ with the unbiased estimator of the variance of the error term $\varepsilon$, given by $SSE/(n - 2)$ᵃ, and $SST$ with the unbiased estimator of the total variance, given by $SST/(n - 1)$ [contributors, nd]. This gives us the Adjusted R Square.

$$R^2 = 1 - \frac{SSE}{SST} \implies \text{Adjusted R Square} = 1 - \frac{SSE/(n - 2)}{SST/(n - 1)} = 1 - \left(\frac{SSE}{SST}\right)\left(\frac{n - 1}{n - 2}\right)$$
$$= 1 - \left(1 - \left(1 - \frac{SSE}{SST}\right)\right)\frac{n - 1}{n - 2} = 1 - (1 - R^2)\frac{n - 1}{n - 2}$$

ᵃWe lose two degrees of freedom as the two parameters $a$ and $b$ must first be estimated [Devore and Berk, 2012, p.631]. See this Web site [Xia, 2012] for a proof that $SSE/(n - 2)$ is unbiased.
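For instance, a tiny Python helper reproduces the Adjusted R Square in Table 9.2 from R Square $= .571$ and $n = 112$.

```python
def adjusted_r_square(r2: float, n: int) -> float:
    # 1 - (1 - R^2)(n - 1)/(n - 2) for simple linear regression.
    return 1 - (1 - r2) * (n - 1) / (n - 2)

print(round(adjusted_r_square(0.571, 112), 3))  # 0.567
```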

Standard Error of the Estimate is an estimate of the standard deviation of the error term in the model; that is, the standard deviation of $\varepsilon$ in $Y = a + bX + \varepsilon$ introduced at the beginning of this chapter. Recall our assumption that $\varepsilon$ has a normal distribution, and the mean as well as the standard deviation are the same for all values of $X$. The formula for the standard error of the estimate is

$$\sqrt{\frac{SSE}{n - 2}}.$$


9.5.3 Analysis of Variance Table

Table 9.3: ANOVA

Model          Sum of Squares  df   Mean Square  F        Sig.
1  Regression  38.284          1    38.284       146.579  .000
   Residual    28.730          110  .261
   Total       67.014          111

In Table 9.3, we will first look at the “Sum of Squares” column and the “df” (degrees of freedom) column.

In the “Sum of Squares” column, the first row, labeled Regression, is the regression sum of squares $SSR$ (38.284 here).

$$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

The second row, labeled Residual, is the error sum of squares we denoted by $SSE$ (28.730 here).

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Finally, the third row, labeled Total, lists the total amount of variation in the observed $y$ values.

$$SST = S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$

From Table 9.3, $SSR + SSE = 38.284 + 28.730 = 67.014 = SST$, as it should be.

Now, look at the “df” column. As $n = 112$ here, the second entry, which is the degrees of freedom associated with $SSE$ and denoted by $df_E$, is $n - 2 = 110$, and the third entry, the total degrees of freedom associated with $SST$ and denoted by $df_T$, is given by $n - 1 = 111$. Note that $df_T = df_R + df_E$, as it should be. These are consistent with our previous assertions. Because the notion of degrees of freedom is often elusive for learners and researchers alike, a brief semi-qualitative summary about the degrees of freedom is due.

Consider the Regression degrees of freedom, denoted by $df_R$. The corresponding sum of squares is what we called $SSR$ before.

$$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

Why is the degree of freedom 1? The ANOVA table is used to test the null hypothesis $H_0$ that the slope is 0 against the alternative hypothesis $H_a$ that the slope is nonzero. When $H_0$ holds, we are left with $y = \bar{y}$, and there is no coefficient to be estimated. On the other hand, $H_a$ involves an estimation of one coefficient, the slope $b$ in $y = a + bx$. Because the intercept is uniquely specified by $a = \bar{y} - b\bar{x}$ once the slope $b$ is given, the degrees of freedom is 1 and not 2; this is also the reason why leaving the intercept term out of our consideration should cause no discomfort. Therefore, there is $1 - 0 = 1$ degree of freedom for testing the null hypothesis. Here is another way to do the bookkeeping. The null hypothesis $H_0$ claims that the mean response is the same for all values of $x$, so we only need to estimate one common response, which is, of course, $\bar{y}$. The alternative hypothesis $H_a$ asserts that the model is of the form $y = a + bx$ with nonzero $b$, and we need to estimate two parameters, the slope and the intercept. Therefore, we have $2 - 1 = 1$ Regression degree of freedom.

Next, look at the Residual degrees of freedom. The corresponding sum of squares is $SSE$.

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Why do we have $n - 2$ degrees of freedom? There are $n$ observations and two parameters, the slope and the intercept, to be estimated. This leaves $n - 2$ degrees of freedom for the error variance.

More generally, in multiple regression with $p$ predictors, there are $n$ observations with $p + 1$ parameters to be estimated, one regression coefficient for each of the predictors plus the intercept. This leaves $n - p - 1$ degrees of freedom for error, which accounts for the error degrees of freedom in the ANOVA table.

The fourth column from the left, labeled “Mean Square”, is the mean of the sum of squares, defined as “the sum of squares divided by the degrees of freedom”. So, the first entry, denoted by $MSR$, is $38.284/1 = 38.284$, and the second entry, denoted by $MSE$, is $28.730/110 = 0.261$.

The next column of Table 9.3 is the value of the F-statistic for hypothesis testing, called the global/overall/model utility F-test for regression, and the rightmost column is the p-value for the test. This requires much explanation. First, the null hypothesis and the alternative hypothesis for our F-test are as follows.

$H_0$: the slope $b = 0$
$H_a$: the slope $b \neq 0$

Therefore, this is a test of a linear relation between $Y$ and $X$. When the null hypothesis holds, $Y$ takes the same value $a$ for all values of $X$, and the graph of $Y$ against $X$ is a horizontal line. This test is important as it saves much wasted effort in case the null hypothesis is supported, indicating no linear relation.

In Appendix G, the F-distribution $F_{m,n}$ is defined as

$$F_{m,n} = \frac{\chi_m^2/m}{\chi_n^2/n},$$

where $\chi_m^2$ and $\chi_n^2$ are $\chi^2$ random variables with degrees of freedom $m$ and $n$, respectively. Appendix H explains this in more detail, but it turns out that

$$\frac{SSR/df_R}{SSE/df_E} = MSR/MSE = F_{df_R, df_E}.$$

This is $F_{1,110}$ in our case. We can either use the F-table or a calculator available on the net, such as “Statistics Calculators” at http://www.danielsoper.com/statcalc3/default.aspx. However, the p-value of .000 is already given. As mentioned before, if you double-click on the cell, you can see many more digits. In this case we get $5.8671419926180005 \times 10^{-22}$, which means the probability of observing an F-value of the given 146.579 or greater is an extremely small $5.8671419926180005 \times 10^{-22}$ if $H_0$ were true. Hence, we reject $H_0$ either at the $p = 0.05$ or $0.01$ level, and it makes sense to proceed with our regression analysis.
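For reference, the F-statistic and its p-value can be recomputed from the ANOVA entries with SciPy; this is an added check, not part of the SPSS output.

```python
from scipy.stats import f

MSR = 38.284 / 1    # regression mean square
MSE = 28.730 / 110  # residual mean square
F = MSR / MSE       # about 146.6

# Survival function gives P[F(1, 110) >= F].
print(F, f.sf(F, dfn=1, dfd=110))  # p-value on the order of 1e-22
```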

Finally, recall the formula for the standard error of the estimate given in Table 9.2. Using the entries of Table 9.3, the formula

$$\sqrt{\frac{SSE}{n - 2}}$$

becomes

$$\sqrt{\frac{\text{Residual Sum of Squares}}{df_E}}.$$


Indeed,

$$\sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\text{Residual Sum of Squares}}{df_E}} = \sqrt{\frac{28.730100729711978}{110}} = 0.5110604014267516,$$

which agrees with the figure given in Table 9.2.

9.5.4 Table of Coefficients

Table 9.4: Coefficients

Model         B      Std. Error  Beta (Standardized)  t       Sig.
1 (Constant)  -.435  .081                             -5.400  .000
  MRCL        .450   .037        .756                 12.107  .000

It is self-explanatory that the listed coefficients indicate

$$\hat{y} = -.435 + .450x \quad \text{or} \quad \text{MLLT} = -.435 + .450\,\text{MRCL}$$

and its standardized version

$$\hat{w} = .756u \quad \text{or} \quad \text{MLLTS} = .756\,\text{MRCLS},$$

where MLLTS and MRCLS stand for the corresponding standardized variables. The slope of the standardized equation (.756) is the same as R, as Proposition 9.2 indicates. To be more precise, R equals the absolute value of the slope of the standardized equation, because R is the absolute value of the correlation coefficient.

The second column from the right lists the t-statistics that can be used for hypothesis testing. In this case, the null hypothesis is that the corresponding coefficient, either the intercept or the slope, is zero, and the alternative hypothesis is that it is nonzero. The standard error, denoted by Std. Error in Table 9.4, is the standard deviation of the sampling distribution of the statistic. Therefore, the t-value for the intercept is

$$\frac{-.435 - 0}{.081} = -5.370.$$


The reason why this is not exactly the same as the listed value of $-5.400$ is simply that there are rounding errors. Indeed, if you include more digits after the decimal point⁹, we get

$$t = \frac{-0.43501249081776366}{0.08056299820658243} = -5.399656175931904,$$

which is $-5.400$ if rounded to three digits after the decimal point. Likewise, for the slope,

$$t = \frac{0.45000977289686345}{0.0371694743122448} = 12.1069716810231 \approx 12.107.$$

I have already been calling it a t-statistic, but to be more precise, it only has an approximate t-distribution. The degrees of freedom in this case is $n - 2 = 112 - 2 = 110$. But, it is not necessary to use a t-table as the p-value is already provided in the rightmost column of Table 9.4. In this particular case, the p-values are both .000, and the intercept and slope are significantly different from 0. More precise values are $3.890471287011951 \times 10^{-7}$ for the intercept and $5.8671419926180005 \times 10^{-22}$ for the slope. It is obvious we can confidently reject the null hypotheses. Note that our choice of $H_0$ is very conservative as we are only claiming the value is nonzero, and hence, it is a strong indication of the inappropriateness of the model if the p-value is larger than 0.05 or 0.01, depending on our choice of the confidence level. On the other hand, rejection of the null hypothesis only tells us that the slope is not zero and there is some linear relation, which is not much, needless to say. Therefore, we should only regard this test as a preliminary examination of the potential linear association. If the slope in the population could be 0, it would suggest that the regression model contains very little useful information about the population association between $Y$ and $X$. Testing for this possibility saves much wasted effort to try to find and interpret a linear relation that does not exist to begin with.

By choosing the confidence interval option, we can get more information about the coefficients. In Table 9.5, 95% confidence intervals are shown, but you can get the confidence interval for any confidence level you like. Let us double-check, for the slope, that the intervals are indeed the same as the ones we get if the computation is done by hand. For 110 degrees of freedom, $t_{110,.025} = 1.98176528$¹⁰. Carrying as many digits as provided by the software for precision, we get

Unstandardized Coefficient $\pm$ ($t_{110,.025} \times$ Std. Error)

⁹Most statistical packages, including SPSS, allow you to see more digits if you double-click on the corresponding cell.
¹⁰A typical t-table does not have this entry. You should use an online calculator such as “Statistics Calculators” at http://www.danielsoper.com/statcalc3/default.aspx.


Table 9.5: Coefficients with 95% Confidence Intervals

Model         B      Std. Error  Beta   t       Sig.  95% CI Lower  95% CI Upper
1 (Constant)  -.435  .081               -5.400  .000  -.595         -.275
  MRCL        .450   .037        .756   12.107  .000  .376          .524

= $0.45000977289686345 \pm (1.98176528 \times 0.0371694743122448)$
= $0.45000977289686345 \pm 0.0736611736678586$
= $0.3763485992290048$ and $0.523670946564722$
$\approx .376$ and $.524$ if rounded to three digits after the decimal point.
This checks. Likewise, you can compute the endpoints of the confidence interval for the intercept.
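The same hand computation can be scripted; the sketch below uses SciPy's t quantile in place of the online calculator.

```python
from scipy.stats import t

b = 0.45000977289686345   # slope
se = 0.0371694743122448   # its standard error
t_crit = t.ppf(0.975, 110)  # t_{110,.025} = 1.98176528...

print(b - t_crit * se, b + t_crit * se)  # about 0.376 and 0.524
```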

9.6 Confidence Intervals for the Estimation of Y

Recall that our model is

$$Y = a + bX + \varepsilon.$$

So, we always have an error bar for our estimation of the value of $Y$ given some fixed $x$. Let us consider two estimations: one for the mean $\bar{Y}$ and the other for $Y$ itself.

9.6.1 Confidence Interval for the Population Mean $\bar{Y}$

This is a situation where we are estimating the mean value of $Y$, denoted by $\bar{Y}$ or $E[Y]$, at a particular $X$ value, which we denote by $x$, based on the linear association between $Y$ and $X$, which is in turn estimated from a sample $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. The standard deviation $s_{\bar{Y}}$ we use here is related to the standard error of the estimate, denoted by $s$, as follows.

$$s_{\bar{Y}} = \sqrt{\frac{SSE}{df_E}}\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} = s\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Though the derivation of this formula is beyond the scope of this book, the formula is very instructive. We can note the following characteristics of this expression.

1. If $x$ is close to the sample mean $\bar{x}$, $s_{\bar{Y}}$ is small, and our estimate is more accurate.

2. If the sample size $n$ is large, $s_{\bar{Y}}$ is small, and our estimate is more accurate.

3. If the standard error of the estimate $s$ is small, $s_{\bar{Y}}$ is small, and our estimate is more accurate.

4. In particular, when $x = \bar{x}$, we have

$$s_{\bar{Y}} = \frac{s}{\sqrt{n}},$$

which is the familiar approximation for the population standard deviation of the sampling distribution of $\bar{Y}$.

The distribution is a t-distribution with $n - 2$ degrees of freedom. In our case, $s = 0.5110604014267517$, $n = 112$, $\bar{x} = 1.7349107142857148$, and $\sum_{i=1}^{n}(x_i - \bar{x})^2 = 189.0477991071$ᵃ, which gives

$$s_{\bar{Y}} = 0.5110604014267517 \times \sqrt{\frac{1}{112} + \frac{(x - 1.7349107142857148)^2}{189.0477991071}}.$$

Let us compute the 95% confidence interval for $x = 2$. First, the predicted value of $Y$ is

$$a + bx = -0.43501249081776366 + 0.45000977289686345 \times 2 = 0.4650070550,$$

and

$$s_{\bar{Y}}\ (\text{at } x = 2) = 0.5110604014267517 \times \sqrt{\frac{1}{112} + \frac{(2 - 1.7349107142857148)^2}{189.0477991071}} = 0.0492856452.$$

So, the 95% confidence interval is

$$[0.4650070550 - t_{110,.025} \times s_{\bar{Y}},\ 0.4650070550 + t_{110,.025} \times s_{\bar{Y}}]$$
$$= [0.4650070550 - 1.98176528 \times 0.0492856452,\ 0.4650070550 + 1.98176528 \times 0.0492856452] = [0.36733447,\ 0.56267964].$$

While SPSS does not show the values of $s_{\bar{Y}}$ themselves, it provides an option of computing the confidence interval for $\bar{Y}$, and the values shown above agree with the results obtained by running SPSS with that option.ᵇ
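The interval above can be reproduced with a few lines of Python, using the sample quantities quoted in the text.

```python
import math
from scipy.stats import t

a, b = -0.43501249081776366, 0.45000977289686345
s, n = 0.5110604014267517, 112
x_bar, Sxx = 1.7349107142857148, 189.0477991071

x0 = 2.0
y_hat = a + b * x0
s_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / Sxx)
t_crit = t.ppf(0.975, n - 2)

print(y_hat - t_crit * s_mean, y_hat + t_crit * s_mean)
# approximately [0.36733, 0.56268]
```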

9.6.2 Confidence Interval for an Individual Y-value

This is when we are interested in predicting an individual $Y$-value given a particular $X$-value $x$. As it is more difficult to predict an individual $Y$-value than its average $\bar{Y}$, the standard deviation $s_Y$ in this context should be larger than $s_{\bar{Y}}$. In fact, we have

$$s_Y = s\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}},$$

where $s$ is the standard error of the estimate. The argument below is only heuristic, but it captures the essence of how $s_Y$ is arrived at. If you expand the square root in a series and keep only the linear term, you get

$$s_Y \approx s\left[1 + \frac{1}{2}\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)\right].$$

This is a good approximation when

$$\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

is small compared with 1, which is almost always the case. We have

$$s_Y \approx s + \frac{1}{2}\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)s.$$

The first term $s$ is a measure of uncertainty as measured from the mean $\bar{Y}$, while the second term $\frac{1}{2}\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)s$ is the uncertainty in estimating where the mean $\bar{Y}$ is. Hence, these two uncertainties add up to produce the overall uncertainty.


In our case,

$$s_Y = 0.5110604014267517 \times \sqrt{1 + \frac{1}{112} + \frac{(x - 1.7349107142857148)^2}{189.0477991071}}.$$

Let us compute the 95% confidence interval for $x = 2$.

$$s_Y\ (\text{at } x = 2) = 0.5110604014267517 \times \sqrt{1 + \frac{1}{112} + \frac{(2 - 1.7349107142857148)^2}{189.0477991071}} = 0.5134314061$$

This gives us

$$[0.4650070550 - t_{110,.025} \times s_Y,\ 0.4650070550 + t_{110,.025} \times s_Y]$$
$$= [0.4650070550 - 1.98176528 \times 0.5134314061,\ 0.4650070550 + 1.98176528 \times 0.5134314061] = [-0.55249348,\ 1.48250759].$$

This agrees with the interval provided by SPSS.

ᵃThis has to be computed by hand. However, it is a fairly straightforward task with any database software such as Excel.

ᵇThe resulting lower bound and upper bound are not shown in the output window, but are given as values of new variables, LMCI_1 (lower endpoint) and UMCI_1 (upper endpoint), in the original database window. This treatment of the lower and upper endpoints as variables makes sense as the confidence interval is a function of the X-value, changing from one value of X to another.
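Again, the prediction interval can be scripted with the same quantities; note the extra 1 under the square root compared with the interval for the mean.

```python
import math
from scipy.stats import t

a, b = -0.43501249081776366, 0.45000977289686345
s, n = 0.5110604014267517, 112
x_bar, Sxx = 1.7349107142857148, 189.0477991071

x0 = 2.0
y_hat = a + b * x0
# The leading 1 accounts for the variability of an individual Y.
s_pred = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / Sxx)
t_crit = t.ppf(0.975, n - 2)

print(y_hat - t_crit * s_pred, y_hat + t_crit * s_pred)
# approximately [-0.55249, 1.48251]
```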


Exercises

Click to go to solutions.

1. This problem concerns a linear regression of MLLT on MSLT as the independent variable.

(a) What do R Square and Adjusted R Square mean? The difference between R Square and Adjusted R Square should be clearly explained.

Table 9.6: Model Summary

Model  R     R Square  Adjusted R Square  Standard Error of the Estimate
1      .621  .386      .380               .611809

(b) Conduct the F test, and compute the Standard Error of the Estimate.

Table 9.7: ANOVA

Model          Sum of Squares  df   Mean Square  F       Sig.
1  Regression  25.840          1    25.840       69.033  .000
   Residual    41.174          110  .374
   Total       67.014          111

(c) Find the 95% confidence interval for the slope.

Table 9.8: Coefficients

Model         B     Std. Error  Beta  t      Sig.
1 (Constant)  .342  .058              5.912  .000
  MSLT        .583  .070        .621  8.309  .000


2. Consider simple linear regression where the dependent variable Y is listening comprehension and the independent variable X is reading comprehension. In one research project, a model summary given in Table 9.9 was obtained, along with an ANOVA table (Table 9.10) and a table of coefficients (Table 9.11).

(a) What do R Square and Adjusted R Square mean?
(b) Conduct the F-test using the entries in Table 9.10. Identify the F-value in the table and determine whether the result of this regression analysis is significant at α = .05.
(c) Give the expression for the standard error of the estimate using the entries in Table 9.10.

Table 9.9: Model Summary

Model  R     R Square  Adjusted R Square  Standard Error of the Estimate
1      .756  .571      .567               .511060

Table 9.10: ANOVA

Model          Sum of Squares  df   Mean Square  F        Sig.
1  Regression  38.284          1    38.284       146.579  .000
   Residual    28.730          110  .261
   Total       67.014          111

Table 9.11: Coefficients

Model         B      Std. Error  Beta  t       Sig.
1 (Constant)  -.435  .081              -5.400  .000
  MRCL        .450   .037        .756  12.107  .000


Chapter 10

Chi-Square Tests: Categorical Data Analysis¹

In this chapter, we will learn how to test whether our assumption about the distribution over multiple categories is supported by the sample. The aim is not to estimate numerical parameters, but to conduct hypothesis tests about assumed distributions. Let us look at examples without further ado.

10.1 Goodness-Of-Fit Tests

Example 10.1 Suppose there are four hospitals of a comparable size in the same city: Hospitals A, B, C, and D. We would like to test if our assumption that the four hospitals are equally popular among the residents is correct at the confidence level of 95%.

The null hypothesis $H_0$: The four hospitals are equally popular.
The alternative hypothesis $H_a$: The levels of popularity are not the same.

Our measure of popularity is the average number of outpatients per day. The data are shown below.

Hospital          A     B     C     D
# of Outpatients  3010  2870  2700  2960

¹The traditional notation χ² may be misleading as it is only suggestive of its construction as a sum of squares of terms. It is a single statistical variable and is not the square of another quantity χ. That is why some claim that the distribution should be called the chi-square distribution and not the chi squared distribution.


The total number of outpatients per day is 3010 + 2870 + 2700 + 2960 = 11540. So, each hospital should attract 11540 ÷ 4 = 2885 outpatients if they are equally popular. Now, consider the following sum.

$$\sum_{\text{all cells}}\frac{(\text{observed number} - 2885)^2}{2885} = \frac{(3010 - 2885)^2}{2885} + \frac{(2870 - 2885)^2}{2885} + \frac{(2700 - 2885)^2}{2885} + \frac{(2960 - 2885)^2}{2885}$$
$$= \frac{(125)^2 + (-15)^2 + (-185)^2 + (75)^2}{2885} = \frac{15625 + 225 + 34225 + 5625}{2885} = \frac{55700}{2885} = 19.30675910$$

This sum asymptotically follows the chi-square distribution with 3 degrees of freedom². Now, $P[\chi_3^2 \geq 7.81472790] = 0.05$, or $\chi_{3,0.05}^2 = 7.81472790$, which means that the probability of observing a $\chi_3^2$ value equal to or greater than 7.81472790 is 0.05 if the null hypothesis $H_0$ were true. Because 19.30675910 > 7.81472790, we reject the null hypothesis and conclude that the four hospitals are NOT equally popular.
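For reference, SciPy's chisquare function reproduces this test; when no expected frequencies are supplied, it assumes the uniform distribution used here.

```python
from scipy.stats import chisquare

observed = [3010, 2870, 2700, 2960]
result = chisquare(observed)  # expected defaults to the uniform 2885 each
print(result.statistic, result.pvalue)  # about 19.307 and 0.00024
```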

In Example 10.1, we assumed that the same numbers of outpatients visit the four hospitals. In other words, the probability of being chosen by a randomly picked outpatient is the same 1/4 for any hospital. However, it is not necessary for the probabilities to be the same. To demonstrate this, consider the distribution below.

Table 10.1: Observed and Expected Cell Counts

Category   i = 1  i = 2  ...  i = k  Row Total
Observed   o_1    o_2    ...  o_k    Σ o_i = N
Expected   e_1    e_2    ...  e_k    Σ e_i = N

Theorem 10.1 [Devore and Berk, 2012, p.726] Provided the number of observations in cell $i$, denoted by $o_i$ in Table 10.1, is greater than or equal to 5 for all $1 \leq i \leq k$, the variable

$$\sum_{i=1}^{k}\frac{(o_i - e_i)^2}{e_i}$$

has approximately and asymptotically a chi-squared distribution with $k - 1$ degrees of freedom. We lose one degree of freedom as $\sum_{i=1}^{k} o_i = N$.

²The degrees of freedom is “the number of hospitals − 1” as one degree of freedom is lost because the total number of outpatients is fixed at 11540.


Proof of this theorem is well beyond the scope of this book. However, Lebanon and Hunter provide concise notes that outline a proof [Lebanon, 2009, Hunter, 2002a, Hunter, 2006].

In Example 10.1, the null hypothesis assumed that all four hospitals are equally popular. However, as mentioned above, it is not necessary that the probabilities, and hence the expected values, are equal. Note that we have no equality condition on $e_1, e_2, \ldots, e_k$ in Table 10.1. Let us next consider an example with unequal probabilities; that is, the expected values are not the same.

Example 10.2 Suppose there are three universities in the same area: Universities A, B, and C. One thousand high school seniors were surveyed to find out which university they liked best. The same survey was run 10 years ago, and it was found that 30% liked University A, 50% liked University B, and the remaining 20% preferred University C. Our null hypothesis $H_0$ is that these figures are the same now. The table below shows the observed and expected values for the three universities. Based on these numbers, we will conduct a chi-square test for the null hypothesis at the significance level p = 0.05.

University        A    B    C    Total
Observed Numbers  330  490  180  1000
Expected Numbers  300  500  200  1000

Solution

$$\sum_{i=1}^{k}\frac{(o_i - e_i)^2}{e_i} = \sum_{i=1}^{3}\frac{(o_i - e_i)^2}{e_i} = \frac{(330 - 300)^2}{300} + \frac{(490 - 500)^2}{500} + \frac{(180 - 200)^2}{200} = \frac{900}{300} + \frac{100}{500} + \frac{400}{200} = 3 + 0.2 + 2 = 5.2$$

When the degrees of freedom is 3 − 1 = 2, the chi-square value of 5.2 gives a p-value of 0.07427358 > 0.05. So, we fail to reject the null hypothesis $H_0$ that the ratio is still 3 to 5 to 2 now, ten years later.

10.2 Tests Of Independence

We will go right to the first example.


Example 10.3 Some people are religious, while others are not. Some people are conservative, but others are not. Using these two dichotomies, we can divide the population into four groups: religious and conservative, religious and not conservative, not religious and conservative, and not religious and not conservative. In this rough classification, we can place each person in one of the cells of the table below, called a “contingency table”. Our aim here is to see if there is a statistically significant association between religiousness and conservatism; that is, if religious people are less or more conservative than non-religious people.

                Conservative  Not Conservative
Religious
Not Religious

Suppose we pick a random sample of 1000 voters. If the distribution is as follows, we can safely conclude that religious people are conservative and non-religious people are not conservative.

                Conservative  Not Conservative
Religious       680           20
Not Religious   10            290

Likewise, we can draw the opposite conclusion if we get the following distribution.

                Conservative  Not Conservative
Religious       20            680
Not Religious   290           10

However, the distribution is not always as clear as this. For example, our survey may result in the following distribution.

                Conservative  Not Conservative
Religious       500           200
Not Religious   190           110


All that we can possibly conclude or guess from this result is that there are more conservative people in the population than non-conservative people. However, it is not clear if there is a significant association between religiousness and conservatism.

Let our null hypothesis $H_0$ be that there is no association between religiousness and conservatism; that is, whether one is religious or not has no bearing on whether that person is conservative. The statistic we can use for this purpose is again

$$\sum_{\text{all cells}}\frac{(o_i - e_i)^2}{e_i}.$$

We can rewrite this as

$$\sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(o_{ij} - e_{ij})^2}{e_{ij}},$$

where $i$ means the $i$-th row, $j$ stands for the $j$-th column, $o_{ij}$ is the number in the cell where the $i$-th row and the $j$-th column intersect, and $e_{ij}$ is the expected count for the cell at the intersection of the $i$-th row and the $j$-th column, provided the null hypothesis is true. With this notation, we have $o_{11} = 500$, $o_{12} = 200$, $o_{21} = 190$, and $o_{22} = 110$. But, how do we compute the expected value $e_{ij}$? In preparation for this, we will add what are called marginal totals and the grand total of 1000 to the table as shown below (Table 10.2).

Table 10.2: Religion and Conservatism

                Conservative  Not Conservative  Totals
Religious       500           200               700
Not Religious   190           110               300
Totals          690           310               1000

You can see that the first two entries in the rightmost column are the sums of the entries in the corresponding rows, i.e., 500 + 200 = 700 and 190 + 110 = 300, and similarly for the first two entries of the bottom row. The 1000 in the lower right-hand corner is the grand total, which is the total number of people surveyed.


Let us first compute $e_{11}$ and $e_{21}$, the expected number of religious and conservative people and the expected number of nonreligious and conservative people under the null hypothesis of no association between religiousness and conservatism. We will proceed as follows.

1. Note that there is a total of 700 religious people.

2. Next, note that out of 1000 surveyed, 690 are conservative. The overall rate is 690/1000 = 0.69; that is, 69% of the people are conservative.

3. Our null hypothesis stipulates that whether one is religious or not religious does not affect his/her chance of being conservative. Therefore, the rate of conservative people is the same 0.69 both for the religious and non-religious populations.

4. This means

e₁₁ = (the expected number of religious and conservative people)
    = (the total number of religious people) × 0.69 = 700 × 0.69 = 483

and

e₂₁ = (the expected number of nonreligious and conservative people)
    = (the total number of nonreligious people) × 0.69 = 300 × 0.69 = 207

At this point, there are two ways to compute e₁₂ and e₂₂.

5. The first method is to repeat the same procedure as for e₁₁ and e₂₁.

e₁₂ = (the expected number of religious and non-conservative people)
    = (the total number of religious people) × 0.31 = 700 × 0.31 = 217

and

e₂₂ = (the expected number of non-religious and non-conservative people)
    = (the total number of nonreligious people) × 0.31 = 300 × 0.31 = 93

6. But, the easier method is to subtract e₁₁ and e₂₁ from the total numbers of religious and nonreligious people, respectively.

e₁₂ = 700 − e₁₁ = 700 − 483 = 217
e₂₂ = 300 − e₂₁ = 300 − 207 = 93


We now have the following table, where the entries in each cell are arranged in the form $o_{ij}$ ($e_{ij}$).

                Conservative  Not Conservative  Totals
Religious       500 (483)     200 (217)         700
Not Religious   190 (207)     110 (93)          300
Totals          690           310               1000

Fact 10.1 The claim of Theorem 10.1 applies to the more general case

$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

as well, where the contingency table has $r$ rows and $c$ columns. This sum has approximately and asymptotically a chi-squared distribution with $(r-1)(c-1)$ degrees of freedom. The degrees of freedom (df) is equal to the total number of cells that can be varied without changing the marginal totals.³

Let us compute the chi-square value.

$$\sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(500 - 483)^2}{483} + \frac{(200 - 217)^2}{217} + \frac{(190 - 207)^2}{207} + \frac{(110 - 93)^2}{93} = 6.434$$

The degrees of freedom is (2−1)(2−1) = 1, and the corresponding p-value is 0.0112. Therefore, we reject the null hypothesis $H_0$ that there is no association between religiousness and conservatism.
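SciPy's chi2_contingency performs exactly this computation, including the expected counts; correction=False disables the Yates continuity correction so the result matches the hand calculation.

```python
from scipy.stats import chi2_contingency

table = [[500, 200],
         [190, 110]]
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p, dof)  # about 6.434, 0.0112, 1
print(expected)      # [[483. 217.] [207.  93.]]
```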

As is obvious from Fact 10.1, this analysis can be applied to a contingency table of any size. Our next example has a contingency table with six rows and five columns, excluding the totals.

³For the first row, $(c-1)$ cells can be changed as the last cell can be used to adjust the marginal total for the row, and similarly for the second through the $(r-1)$-st rows. However, having varied the cells in the first $(r-1)$ rows, the entries in the last, $r$-th row are uniquely determined by the restriction on the marginal total for each column. Hence, a total of $(r-1)(c-1)$ cells can be varied without changing the marginal totals.


Example 10.4 Consider six age brackets: 10 to 19, 20 to 29, 30 to 39, 40 to 49, 50 to 59, and 60 to 69. Also consider five different flavors of ice cream: vanilla, chocolate, strawberry, peach, and coffee. Our rather adventurous (reckless?) null hypothesis $H_0$ is that there is no difference in the preference pattern for flavors despite the difference in the age brackets people belong to. In order to examine this hypothesis for the population, we took a random sample of 500 people from each bracket, for a total of 3000, and asked which flavor of the five they liked best. The following table summarizes the results.

Table 10.3: Favorite Ice Cream Flavors for Different Age Brackets

Age     Vanilla  Chocolate  Strawberry  Peach  Coffee  Totals
10-19   176      122        114         70     18      500
20-29   178      123        105         79     15      500
30-39   180      131        102         67     20      500
40-49   184      130        90          70     26      500
50-59   186      125        88          68     33      500
60-69   193      130        80          62     35      500
Totals  1101     761        575         416    147     3000

The observed counts $o_{11}$ through $o_{65}$ are shown in Table 10.3. The expected counts $\{e_{ij}\}$ can be computed as before. For example,

$$e_{11} = 500 \times \frac{1101}{3000} = 183.5,$$

and likewise for the other expected values. The computation of Pearson's chi-square is long and tedious, but the end result is

$$\sum_{i=1}^{6}\sum_{j=1}^{5}\frac{(o_{ij} - e_{ij})^2}{e_{ij}} = 25.936.$$

The degrees of freedom is (6−1)(5−1) = 20, and the p-value is 0.16793602. Therefore, we fail to reject the null hypothesis that there is no association between the age bracket and flavor preference.


Example 10.4 demonstrates the following characteristic of the chi-square test.

1. Chi-square is an “overall” or “global” statistic. It is a useful measure of independence as a whole, but it is not sensitive to a trend localized to one part of the contingency table.

For example, it may appear that the preference for coffee-flavored ice cream becomes stronger with age. We can check this by conducting the following goodness-of-fit test.

Age     10-19  20-29  30-39  40-49  50-59  60-69
Coffee  18     15     20     26     33     35

Under the null hypothesis that there is no association between age and flavor, the expected count is 24.5 for each cell. The degrees of freedom is 6 − 1 = 5, the chi-square value is 13.776, and the p-value is 0.01710008. This p-value leads to rejection of the null hypothesis at the significance level of 0.05. Note that the global chi-square statistic used in Example 10.4 was insignificant because it is insensitive to local trends.

There is another characteristic of a global chi-square statistic.

2. The chi-square statistic helps in deciding if a relationship exists, but it is not a good measure of the strength of association.


Exercises

Click to go to solutions.

1. Is birth order related to one's selection of a career? A study was conducted in 1993 about first-born and last-born women. One finding was that the first-born women were equally likely to become scientists, social reformers, performing artists, or writers. For the last-born women, the following table was obtained. Is the difference significant at the 95% level? Conduct a χ²-test and discuss the result.

Scientist  Social Reformer  Performing Artist  Writer  Total
23         13               34                 14      84

2. At a high school, a study was conducted to investigate the relation between the mother's attitude and enrollment in a foreign language class. Mothers were asked if they thought it was important for their children to learn a foreign language. It was a simple yes-no question. The results are summarized in the table below.

Mother's Attitude
                Yes  No
Enrolled        55   5
Did not enroll  15   75

(a) Below are the expected frequencies. Fill in the blank in the table. As usual, you have to show all your work.

                Yes  No
Enrolled             32
Did not enroll  42   48

(b) Write down the formula to compute χ² with the actual numbers filled in. You do not have to compute anything.
(c) Find df, the degree(s) of freedom.
(d) The χ² for this problem is 81.37. Would you conclude, with 99% confidence, that there is a significant correlation between a child's enrollment in a foreign language class and the attitude of the mother? Explain the procedure used to reach your conclusion.



4. At a high school, a study was conducted to investigate the relation betweenthe mother’s attitude and enrollment in a foreign language class. Motherswere asked if they thought it was important for their children to learn a foreignlanguage. It was a simple yes-no question. The results are summarized by thetable below.

Mother’s AttitudeYes No

Enrolled 180 90Did not enroll 60 170

(a) Below are the expected frequencies. Fill in the blank in the table. Asusual, you have to show all your work.


                   Yes      No
Enrolled                   140.4
Did not enroll     110.4   119.6

(b) Write down the formula to compute χ² with the actual numbers filled in. You do not have to compute anything.

(c) Find df, the degree(s) of freedom.
(d) The χ² for this problem is 81.93. Would you conclude, with 95% confidence, that there is a significant correlation between a child's enrollment in a foreign language class and the attitude of the mother? Briefly explain the procedure used to reach your conclusion. (Note: I took this data from http://www.upa.pdx.edu/IOA/newsom/da1/ho_chisq.doc.)

5. There are three brands of bread respectively called A, B, and C. The brand preferences of a random sample of 150 buyers are observed with the resulting count shown in the table below.

Brand          A    B    C
Buyer Count   61   53   36

(a) What is the expected frequency provided the three brands are equally liked? Show your computation.

(b) Write down the formula to compute χ² with the actual numbers filled in. You do not have to compute anything.

(c) Find df, the degree(s) of freedom.
(d) The χ² for this problem is 6.52. Would you conclude, with α = 0.05, that there is a customer preference for one or more of the brands of bread? Explain the procedure used to reach your conclusion.

6. At a high school, a study was conducted to investigate the relation between the mother's attitude and enrollment in a foreign language class. Mothers were asked if they thought it was important for their children to learn a foreign language. It was a simple yes-no question. The results are summarized by the table below.

                 Mother's Attitude
                   Yes     No
Enrolled            55      5
Did not enroll      15     75


(a) Below are the expected frequencies. Fill in the blank in the table. As usual, you have to show all your work.

                   Yes     No
Enrolled                   32
Did not enroll      42     48

(b) Write down the formula to compute χ² with the actual numbers filled in. You do not have to compute anything.

(c) Find df, the degree(s) of freedom.
(d) The χ² for this problem is 81.37. Would you conclude, with 99% confidence, that there is a significant correlation between a child's enrollment in a foreign language class and the attitude of the mother? Explain the procedure used to reach your conclusion.

7. There are three brands of bread respectively called A, B, and C. The brand preferences of a random sample of 150 buyers are observed with the resulting count shown in the table below.

Brand          A    B    C
Buyer Count   61   53   36

(a) What is the expected frequency? Show your computation.
(b) Write down the formula to compute χ² with the actual numbers filled in. You do not have to compute anything.
(c) Find df, the degree(s) of freedom.
(d) The χ² for this problem is 6.52. Would you conclude, with α = 0.05, that there is a customer preference for one or more of the brands of bread? Explain the procedure used to reach your conclusion.

8. Consider a survey of 110 males and 90 females about their opinions on raising sales tax. The results are summarized in the table below.

               Supports   Opposes   Undecided   Row total
Male               40        40         30          110
Female             25        45         20           90
Column total       65        85         50          200


(a) The table below shows the expected frequencies. What is the expected frequency for undecided females? Show your computation.

               Supports   Opposes   Undecided   Row total
Male             35.75      46.75      27.5         110
Female           29.25      38.25                    90
Column total     65         85         50           200

(b) Write down the formula to compute χ² with the actual numbers from the table filled in. You do not have to compute anything.

(c) Find df, the degree(s) of freedom.
(d) The χ² for this problem is 3.794. Would you conclude, with α = 0.05, that there is a difference between males and females? Explain the procedure used to reach your conclusion.

9. There are three brands of bread respectively called A, B, and C. The brand preferences of a random sample of 150 buyers are observed with the resulting count shown in the table below.

Brand          A    B    C
Buyer Count   61   53   36

(a) What is the expected frequency provided the three brands are equally liked? Show your computation.

(b) Write down the formula to compute χ² with the actual numbers filled in. You do not have to compute anything.

(c) Find df, the degree(s) of freedom.
(d) The χ² for this problem is 6.52. Would you conclude, with α = 0.05, that there is a customer preference for one or more of the brands of bread? Explain the procedure used to reach your conclusion.

10. We would like to test the interaction between the level of education and longevity. In our preliminary study, we considered people who did not go to college and those with a doctorate. For the sake of classification, we divided the participants into 3 groups according to their longevity x in years: SHORT (x < 70), AVERAGE (70 ≤ x < 80), and LONG (80 ≤ x). Our survey of 1,000 participants produced the contingency table given below (Table 10.4).


(a) Complete the table by computing the expected frequencies A, B, and C under the null hypothesis that there is no interaction between education and longevity. You have to show your work.

(b) Using the data given in Table 10.4 as well as the expected frequencies, write down the formula to compute the χ²-value.

(c) Find df, the degrees of freedom.
(d) The χ² for this problem is 11.28. Would you conclude, with 99% confidence, that there is a significant correlation/interaction between the level of education and longevity? Briefly explain the procedure you used to reach the conclusion.

Table 10.4: Education and Longevity

              SHORT      AVERAGE     LONG        Totals
No College    100 (90)   250 (235)   150 (175)     500
Doctorate      80 (A)    220 (B)     200 (C)       500
Totals        180        470         350          1000
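If you would like to check your answers to the contingency-table exercises above, the following Python sketch lets scipy do the bookkeeping; it is shown here with the observed counts of Table 10.4, and the variable names are our own.

    # A sketch for checking answers to the contingency-table exercises above;
    # scipy computes the expected frequencies, the chi-square statistic, the
    # degrees of freedom, and the p-value in one call. The counts below are
    # the observed frequencies from Table 10.4.
    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([[100, 250, 150],    # No College
                         [ 80, 220, 200]])   # Doctorate
    chi_sq, p_value, df, expected = chi2_contingency(observed)
    print(chi_sq, df, p_value)  # chi-square, (2-1)(3-1) = 2 degrees of freedom
    print(expected)             # expected counts under independence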


Chapter 11

Factor Analysis and Principal Components Analysis

11.1 Introduction

So far, we have encountered different types of statistical tools and techniques such as the t-test, hypothesis testing, and linear regression. The t-test is used to measure the difference between different sets of data, hypothesis testing is literally a test to determine whether a given hypothesis is supported by the data, and linear regression allows us to predict the value of the dependent variable from a set of independent variables. We will now add another tool with yet another purpose, called data reduction, to our toolbox; namely, factor analysis and principal components analysis. Factor analysis and principal components analysis (PCA) are statistical techniques applied to a set of variables to discover which variables in the set form coherent subsets that are independent of one another. Loosely speaking, each of such subsets corresponds to a factor or component. Unlike observed variables, a factor/component is a less visible conceptual entity¹, lurking behind the observables and not directly measurable. However, they are assumed to cause the measurable variables to take the observed values. Because we will replace a set of variables with a smaller set of factors/components, each explaining multiple observed variables, factor analysis and principal components analysis can lead to significant reduction in the number of variables we have to deal with.

Here is a trivial example: there are a large number of examinations that are purported to measure a nonnative speaker's English proficiency. While one can

¹ This is often referred to as a construct.


take hundreds of such examinations if so desired, that person's performance may be explained to a large extent by the four major language skills of reading, listening, speaking, and writing. So, the difference in the test scores can be explained by which of the four skills are/is tapped by different tests. Note that a direct measurement of the abstract construct of "reading comprehension", for example, would be difficult, and we will try to get to it indirectly from the test scores.

In the case of reading, listening, speaking, and writing, these four skills are known to be correlated. But, it will be very convenient if we can come up with a small number of uncorrelated components that can explain all observed values. This is analogous to specifying any vector v in the x, y-plane by its x- and y-components (a, b) such that v = ai + bj, where i and j are unit vectors in the x- and y-directions, respectively.

In sum, the goal of factor analysis and principal components analysis is twofold.

1. To explain as much variance as possible in the observables with as few factors/components as possible.

2. To find factors/components which are as distinct from each other as possible.

However, in the real world, it is not possible to have a small number of components explain all observed values completely. For example, principal components analysis can explain all variables completely, but only when as many components as the number of variables are included. As for factor analysis, it cannot account for all the variance in the data by construction². Still, it is often the case that a few of those components can explain sufficient amounts of the variance in observed variables, leading to a successful reduction of the variables we have to deal with. This is the path we will follow.

² The real meaning of this will become clear in Section ??.

Let us first be clear about three kinds of variances which play central roles in Chapter 11.

11.2 Three Types of Variance: Common, Specific, and Error Variances

Common variance is the portion of total variance, typically denoted by h², that is shared by a set of variables and explained by a set of common factors/components. This is the variance we will try to account for in Chapter 11. Our analyses are so designed that we will have 0 ≤ h² ≤ 1. This is accomplished by standardizing the variables such that the mean is 0 and the standard deviation is 1.

Specific variance means variance specific to one particular variable in the data. This variance is not shared with any other variable. Of course, this does not mean that the variable in question is completely unique and has no shared variance with any other variable at all. The term specific simply refers to the fact that no variable with shared variance with the variable in question is included in our data set.

Error variance represents the ever-existing measurement error.

It is the assumptions made about these variances, or the variances accounted for by the models, h², that separate principal components analysis from factor analysis. Principal components analysis assumes that the components can explain all the variances in the observed variables (h² = 1), while factor analysis assumes that parts of the variances will remain unexplained by the factors (h² < 1).

At this point, you may be itching to know details of the differences between factor analysis and principal components analysis. However, the true meaning of all this, as well as how and where the differences arise in the analysis, can be understood more easily if you first learn how principal components analysis works.

Principal components analysis uses the correlation matrix in a straightforward manner as the starting point, while we modify/decrease the diagonal entries of the correlation matrix in factor analysis. In these matrices, the diagonal entries are the explained variances denoted by h². As principal components analysis assumes h² = 1, you can see this is consistent with the use of the correlation matrix, whose diagonal entries are indeed 1s. It is because of this simplicity, as well as the simplicity of the ensuing analysis/computation, that we will examine principal components analysis in detail first. Mathematically speaking, principal components analysis is by far the cleaner of the two. Of course, this does not necessarily mean factor analysis is inferior.

11.3 Principal Components Analysis

11.3.1 Properties and Principles

Properties of Principal Components Analysis


1. It allows the observed variables to be explained as linear combinations of an underlying set of uncorrelated/orthogonal components.

2. It provides a unique solution from which the original data can be reconstructed with no error.³

3. The number of components in the full solution equals the number of variables⁴, though we usually retain only a few dominant components.

More specifically, our formulation of principal components analysis is as follows.

Let {X_1, X_2, . . . , X_n} be our observed variables, {Z_1, Z_2, . . . , Z_n} be their standard scores, and C_1, C_2, . . . , C_n be the components⁵. Our basic assumption is that the standard scores of the observables {Z_i} can be expressed as linear combinations of the uncorrelated components; namely, we have, for each 1 ≤ i ≤ n,

Z_i = Σ_{k=1}^{n} a_{ik} C_k,    (11.1)

or

Z_1 = a_{11} C_1 + a_{12} C_2 + . . . + a_{1,n−1} C_{n−1} + a_{1n} C_n
Z_2 = a_{21} C_1 + a_{22} C_2 + . . . + a_{2,n−1} C_{n−1} + a_{2n} C_n
⋮
Z_{n−1} = a_{n−1,1} C_1 + a_{n−1,2} C_2 + . . . + a_{n−1,n−1} C_{n−1} + a_{n−1,n} C_n
Z_n = a_{n1} C_1 + a_{n2} C_2 + . . . + a_{n,n−1} C_{n−1} + a_{nn} C_n.    (11.2)

In matrix form, this becomes

[ Z_1     ]   [ a_{11}     a_{12}     . . .  a_{1,n−1}     a_{1n}   ] [ C_1     ]
[ Z_2     ]   [ a_{21}     a_{22}     . . .  a_{2,n−1}     a_{2n}   ] [ C_2     ]
[  ⋮      ] = [   ⋮          ⋮         ⋱        ⋮            ⋮      ] [  ⋮      ]    (11.3)
[ Z_{n−1} ]   [ a_{n−1,1}  a_{n−1,2}  . . .  a_{n−1,n−1}  a_{n−1,n} ] [ C_{n−1} ]
[ Z_n     ]   [ a_{n1}     a_{n2}     . . .  a_{n,n−1}     a_{nn}   ] [ C_n     ]

³ This is when all the components are retained. As we will see later, some components are more important than others, and our aim is usually to find a few dominant components that can describe the data adequately.
⁴ There is some complication when what is called "multiplicity" is present. We will limit our attention to the cases without multiplicity for simplicity of explanation.
⁵ Note that the number of observables and the number of components are the same.


or more compactly,

Z = AC. (11.4)

Furthermore, we will require that the correlation coefficient between X_i and X_j, denoted by r_{ij}, be expressed as an "inner product" between Z_i and Z_j; that is, we require

r_{ij} = ⟨Z_i, Z_j⟩
       = ⟨a_{i1} C_1 + a_{i2} C_2 + . . . + a_{in} C_n, a_{j1} C_1 + a_{j2} C_2 + . . . + a_{jn} C_n⟩
       = Σ_{k,l=1}^{n} a_{ik} a_{jl} ⟨C_k, C_l⟩ = Σ_{k,l=1}^{n} a_{ik} a_{jl} δ_{kl} = Σ_{k=1}^{n} a_{ik} a_{jk};    (11.5)

where lack of any correlation among the components {C_i} is expressed by ⟨C_k, C_l⟩ = δ_{kl}. As you can see, this is completely analogous to the usual dot product for two vectors in Cartesian coordinates.

r_{ij} = (a_{i1}, a_{i2}, . . . , a_{in}) · (a_{j1}, a_{j2}, . . . , a_{jn}) = a_{i1} a_{j1} + a_{i2} a_{j2} + . . . + a_{in} a_{jn}    (11.6)

Now, it may not be immediately obvious if such a formulation is possible, but there is a straightforward derivation. Recall that we are interested in grouping related variables together in order to discover the latent components {C_i} causing the observed values of {X_i}. Because our canonical measure of "relatedness" between two variables is the correlation coefficient, it is reasonable to choose the correlation matrix R = [r_{ij}] as our starting point.

11.3.2 From Correlation Matrix To Components

As the correlation matrix R among the variables is real and symmetric, we can use the following theorem, Theorem 11.1, to find uncorrelated components. This is the same as Theorem J.1 in Appendix J.

Theorem 11.1 (Spectral Theorem for Real Symmetric Matrices) A real symmetric n by n matrix is diagonalizable. In fact, R^n has an orthonormal basis of eigenvectors for any real symmetric matrix A. If these are taken as the columns of an (orthogonal) matrix V, then V^T A V is a diagonal matrix with the eigenvalues of A on the diagonal.

The matrix V satisfies V V^T = V^T V = I as the transpose of an orthogonal matrix is its inverse; V^T = V^{−1}.
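A quick numerical illustration of Theorem 11.1 in Python may be helpful; the symmetric matrix A below is an arbitrary choice of ours, not data from this book.

    # A small numerical illustration of Theorem 11.1: numpy.linalg.eigh
    # returns the eigenvalues and an orthogonal matrix V of eigenvectors
    # for a real symmetric matrix, so V^T A V comes out diagonal.
    import numpy as np

    A = np.array([[2.0, 1.0, 0.5],
                  [1.0, 3.0, 1.5],
                  [0.5, 1.5, 1.0]])         # any real symmetric matrix
    eigenvalues, V = np.linalg.eigh(A)      # eigh is for symmetric matrices
    print(np.allclose(V @ V.T, np.eye(3)))  # V is orthogonal: V V^T = I
    print(np.round(V.T @ A @ V, 10))        # diagonal, eigenvalues on the diagonal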


Let us denote the diagonal matrix in Theorem 11.1 by L. Then, we get

L = V^T R V  or  V L V^T = R.    (11.7)

Without loss of generality, we can assume that the eigenvalues are arranged in descending order on the diagonal of L such that we have λ_1 > λ_2 > . . . > λ_{n−1} > λ_n for

L = diag(λ_1, λ_2, . . . , λ_{n−1}, λ_n).    (11.8)

Due to Fact 7.2, the correlation matrix is exactly the covariance matrix if the variables are first standardized. So, R is positive semidefinite according to Theorem J.4. This, in turn, implies that L is positive semidefinite as follows.

x^T L x = x^T (V^T R V) x = (V x)^T R (V x) ≥ 0  for any vector x ∈ R^n    (11.9)

In this case, we know that the eigenvalues of L are nonnegative due to Theorem J.2, and the positive square root of L defined in a straightforward manner by

√L = diag(√λ_1, √λ_2, . . . , √λ_{n−1}, √λ_n)    (11.10)

is a real n × n diagonal matrix. We are now ready to define A in (11.4). Let

A = [a_{ik}] = V √L,    (11.11)


and write

A^T = (V √L)^T = [a^T_{jk}],    (11.12)

where clearly a^T_{jk} = a_{kj}. Then,

R = [r_{ij}] = V L V^T = V √L √L V^T = V √L (√L)^T V^T = (V √L)(V √L)^T = [a_{ik}][a^T_{jk}]

  [ a_{11} . . . a_{1n} ] [ a^T_{11} . . . a^T_{1n} ]
= [   ⋮     ⋱     ⋮    ] [    ⋮      ⋱       ⋮     ]  =  [ Σ_{k=1}^{n} a_{ik} a^T_{kj} ]  =  [ Σ_{k=1}^{n} a_{ik} a_{jk} ].    (11.13)
  [ a_{n1} . . . a_{nn} ] [ a^T_{n1} . . . a^T_{nn} ]

Hence, we have

r_{ij} = Σ_{k=1}^{n} a_{ik} a_{jk},

which is nothing but (11.5).

Recall from linear algebra that √L can be regarded as a linear transformation operating in the vector space 𝓛 spanned by the eigenvectors of R, and expressed in terms of the orthonormal basis made up of the normalized eigenvectors of R. Likewise, V can be regarded as a linear map from 𝓛 to the original space 𝓞, connecting the orthonormal basis consisting of eigenvectors with the original orthonormal basisᵃ.

√L : 𝓛 → 𝓛  and  V : 𝓛 → 𝓞

Therefore,

V √L : 𝓛 → 𝓞.

Now refer back to (11.4)

Z = AC,

where Z ∈ 𝓞, A : 𝓛 → 𝓞, and C ∈ 𝓛. We can see that we should regard each normalized eigenvector of R spanning a one-dimensional subspace of 𝓛 as a


component. That is, by abuse of notation, we have

C_1 = (1, 0, . . . , 0)^T,  C_2 = (0, 1, . . . , 0)^T,  . . . ,  C_n = (0, 0, . . . , 1)^T;    (11.19)

where C_i is the i-th component/eigenvector.

Let us see how principal components analysis works in detail with a series of concrete examples.

ᵃ We know V : 𝓛 → 𝓞. But we can be more specific. Let {e_i} be the canonical orthonormal basis of 𝓞 and {u_i} be the orthonormal basis of 𝓛 comprised of normalized, hence orthonormal, eigenvectors of L. More visually appealing representations may be

e_1 = (1, 0, . . . , 0)^T_{e_i ⊂ 𝓞},  e_2 = (0, 1, . . . , 0)^T_{e_i ⊂ 𝓞},  . . . ,  e_n = (0, 0, . . . , 1)^T_{e_i ⊂ 𝓞}    (11.14)

and

u_1 = (1, 0, . . . , 0)^T_{u_i ⊂ 𝓛},  u_2 = (0, 1, . . . , 0)^T_{u_i ⊂ 𝓛},  . . . ,  u_n = (0, 0, . . . , 1)^T_{u_i ⊂ 𝓛};    (11.15)

where [ ]_{e_i ⊂ 𝓞} and [ ]_{u_i ⊂ 𝓛} indicate that the representations are in reference to the canonical basis of 𝓞 and the basis of 𝓛 consisting of eigenvectors of L, respectively. The basis vectors {u_i} can also be represented in terms of the other set of basis vectors {e_i}. Let us write

u_i = (u_{i1}, u_{i2}, . . . , u_{in})^T_{e_i ⊂ 𝓞}    (11.16)

for the components of the eigenvector u_i when expressed with respect to the canonical orthonormal basis {e_i} of 𝓞. Then, V is a bijection, a one-to-one and onto map, whose matrix representation is given by

V = [u_1 u_2 . . . u_n]_{u_i}→{e_i} = [ u_{11}  u_{21}  . . .  u_{n1} ]
                                      [ u_{12}  u_{22}  . . .  u_{n2} ]
                                      [   ⋮       ⋮      ⋱      ⋮    ]
                                      [ u_{1n}  u_{2n}  . . .  u_{nn} ]_{u_i}→{e_i};    (11.17)

where the notation [ ]_{u_i}→{e_i} signifies the fact that V effects the basis change from {u_i} to {e_i}. The j-th column of V is the j-th eigenvector of L expressed with respect to {e_i}. And, this is the necessary and sufficient condition for V : u_j (expressed in the basis {u_i}) ↦ u_j (expressed in the basis {e_i}) as shown below.

V u_k = [ u_{11}  u_{21}  . . .  u_{n1} ] [ 0 ]         [ u_{k1} ]
        [ u_{12}  u_{22}  . . .  u_{n2} ] [ ⋮ ]         [ u_{k2} ]
        [   ⋮       ⋮      ⋱      ⋮    ] [ 1 ] (k-th) = [   ⋮    ]    (11.18)
        [ u_{1n}  u_{2n}  . . .  u_{nn} ] [ ⋮ ]         [ u_{kn} ]
                                          [ 0 ]

Note that the column vector (0, . . . , 1, . . . , 0)^T with the 1 in the k-th slot is u_k expressed in {u_i} ⊂ 𝓛, and (u_{k1}, u_{k2}, . . . , u_{kn})^T is the same u_k expressed in {e_i} ⊂ 𝓞.

11.3.3 Two Variable Example

Our first example contains only two variables X and Y. The observed values are shown below. Each column corresponds to one pair of X- and Y-values (x, y)^T. Simply regard this data as a set of 10 points with respective x- and y-coordinates in the usual Cartesian plane.

X  2.50  0.50  2.20  1.90  3.10  1.30  3.00  1.00  1.50  1.10
Y  2.40  1.70  2.90  1.20  2.80  2.70  2.10  1.40  1.60  0.90

The correlation matrix R is

[ 1.000000  0.556505 ]
[ 0.556505  1.000000 ].
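The numbers quoted in this example can be reproduced with a few lines of Python; the sketch below, with our own variable names, computes R and its eigenvalues and eigenvectors from the data table above.

    # A sketch reproducing the numbers of this example; the correlation
    # matrix, its eigenvalues, and its eigenvectors should match the values
    # quoted in the text (up to sign and rounding).
    import numpy as np

    x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 1.3, 3.0, 1.0, 1.5, 1.1])
    y = np.array([2.4, 1.7, 2.9, 1.2, 2.8, 2.7, 2.1, 1.4, 1.6, 0.9])

    R = np.corrcoef(x, y)               # 2 x 2 correlation matrix
    eigenvalues, V = np.linalg.eigh(R)  # eigenvalues in ascending order
    print(R)                            # off-diagonal approximately 0.556505
    print(eigenvalues)                  # approximately 0.443495 and 1.556505
    print(V)                            # columns proportional to (1, -1) and (1, 1), each over sqrt(2)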


The two normalized, i.e. of unit length, eigenvectors are

(1/√2, 1/√2)^T  and  (−1/√2, 1/√2)^T

with associated eigenvalues of 1.556505 and 0.443495, respectively. These indeed form an orthonormal basis of R² as claimed in Theorem 11.1. Now, the V in Theorem 11.1 is

[ 1/√2  −1/√2 ]            [ 1  −1 ]
[ 1/√2   1/√2 ]  =  (1/√2) [ 1   1 ].

It is easy to check that V V^T = V^T V = I. With this V, we get

V^T R V = (1/√2) [  1  1 ] [ 1.000000  0.556505 ] (1/√2) [ 1  −1 ]
                 [ −1  1 ] [ 0.556505  1.000000 ]        [ 1   1 ]

        = (1/2) [  1  1 ] [ 1 + 0.556505  −1 + 0.556505 ]
                [ −1  1 ] [ 0.556505 + 1  −0.556505 + 1 ]

        = (1/2) [  1  1 ] [ 1.556505  −0.443495 ]  =  (1/2) [ 2 × 1.556505  0            ]
                [ −1  1 ] [ 1.556505   0.443495 ]           [ 0             2 × 0.443495 ]

        = [ 1.556505  0        ]
          [ 0         0.443495 ].

We have indeed recovered the eigenvalues on the diagonal as claimed by Theorem 11.1.

Now, let L denote this diagonal matrix with eigenvalues on the diagonal so that

L = [ 1.556505  0        ]
    [ 0         0.443495 ],

and consider V √L, where √L is simply

[ √1.556505  0         ]
[ 0          √0.443495 ].


Then,

V^T R V = L  ⟹  V (V^T R V) V^T = V L V^T  ⟹  R = V L V^T = V √L √L V^T = (V √L)(V √L)^T.

Recall from linear algebra that √L can be regarded as a linear transformation operating in the vector space 𝓛 spanned by the eigenvectors of R, and expressed in terms of the orthonormal basis made up of the normalized eigenvectors of R. Likewise, V can be regarded as a linear map from 𝓛 to the original space 𝓞, connecting the orthonormal basis consisting of eigenvectors with the original orthonormal basis.

√L : 𝓛 → 𝓛  and  V : 𝓛 → 𝓞

Therefore,

V √L : 𝓛 → 𝓞.

We regard each normalized eigenvector of R spanning a one-dimensional subspace of 𝓛 as a component. Therefore, the vectors

(1, 0)^T  and  (0, 1)^T

in 𝓛 correspond to the two components associated with the eigenvalues 1.556505 and 0.443495. We will label our components C1, C2, . . . in descending order of the associated eigenvalues λ1, λ2, . . ..ᵃ So, in this case,

C1 = (1, 0)^T associated with λ1 = 1.556505,

and

C2 = (0, 1)^T associated with λ2 = 0.443495.

According to Property 1 on p.157, X and Y are linear combinations of the components C1 and C2, and as explained on p.155, the components C1 and C2 are supposed to cause the measurable variables to take the observed values. We incorporate these features in our model by computing the images of C1 and C2 under V √L.

(V √L) C1 = (1/√2) [ 1  −1 ] [ √1.556505  0         ] [ 1 ]
                   [ 1   1 ] [ 0          √0.443495 ] [ 0 ]

          = (1/√2) [ 1  −1 ] [ √1.556505 ]  =  (1/√2) [ √1.556505 ]  =  [ 0.882186 ]
                   [ 1   1 ] [ 0         ]            [ √1.556505 ]     [ 0.882186 ]

(V √L) C2 = (1/√2) [ 1  −1 ] [ √1.556505  0         ] [ 0 ]
                   [ 1   1 ] [ 0          √0.443495 ] [ 1 ]

          = (1/√2) [ 1  −1 ] [ 0         ]  =  (1/√2) [ −√0.443495 ]  =  [ −0.470901 ]
                   [ 1   1 ] [ √0.443495 ]            [  √0.443495 ]     [  0.470901 ]

In these matrices, the first row represents X, the second row Y, and, of course, the first and only column represents C1 or C2. We interpret 0.882186 as the coefficient in front of C1 and ∓0.470901 as the coefficients for C2 in the expressions of X and Y as linear combinations of C1 and C2, as shown below.

X = 0.882186 C1 − 0.470901 C2
Y = 0.882186 C1 + 0.470901 C2

ᵃ We will see later that C1 explains the most variance of all the components and is called the principal component.

It is customary to refer to 0.882186 as the loading of X or Y on C1 and ∓0.470901 as the loadings of X and Y, respectively, on C2. Also, the following matrix is called the loading matrix.

[ 0.882186  −0.470901 ]
[ 0.882186   0.470901 ]

The correlation between X and Y can be computed as the inner product between X and Y, and likewise between X and X as well as between Y and Y. Let us check this, remembering that the inner product between C1 and C2 is zero; i.e. the orthogonality condition.

Corr(X, Y ) =< X, Y >=[

0.882186 −0.470901] [ 0.882186

0.470901

]

= (0.882186)2−(0.470901)2 = 0.556504

Corr(X, X) =< X, X >=[

0.882186 −0.470901] [ 0.882186

−0.470901

]

= (0.882186)2+(−0.470901)2 = 1.000000

Page 169: APPLIED STATISTICS WITH HELPFUL DETAILS - …aoitani.net/Applied_Statistics.pdf · APPLIED STATISTICS WITH HELPFUL DETAILS Fascinating, eye-opening, and even life-changing explanations

11.3. PRINCIPAL COMPONENTS ANALYSIS 167

Corr(Y, Y ) =< Y, Y >=[

0.882186 0.470901] [ 0.882186

0.470901

]

= (0.882186)2+(0.470901)2 = 1.000000

These agree with the previously computed numbers up to the rounding error, and this formulation works as it is nothing but a restatement of the identity

(V √L)(V √L)^T = V L V^T = R.

In order to see this, examine the components of V √L.

V √L = (1/√2) [ 1  −1 ] [ √1.556505  0         ]
              [ 1   1 ] [ 0          √0.443495 ]

     = (1/√2) [ √1.556505  −√0.443495 ]  =  [ 0.882186  −0.470901 ]
              [ √1.556505   √0.443495 ]     [ 0.882186   0.470901 ]

So, we have

(V √L)(V √L)^T = [ 0.882186  −0.470901 ] [  0.882186  0.882186 ]
                 [ 0.882186   0.470901 ] [ −0.470901  0.470901 ]

               = [ Corr(X, X)  Corr(X, Y) ]
                 [ Corr(Y, X)  Corr(Y, Y) ].

Comparing this with

X = 0.882186 C1 − 0.470901 C2
Y = 0.882186 C1 + 0.470901 C2,    (11.20)

it is clear that our inner product formulation is equivalent to the matrix multiplication.

Note that Properties 1, 2, and 3 on p.157 are indeed satisfied by this component structure. However, we can further improve on this result by invoking "rotation" of C1 and C2.


11.3.4 Orthogonal Rotation

Figure 11.1 shows the relationship between (X, Y) and (C1, C2). As they are located now, X and Y equally load on C1 and C2 up to the sign of the loadings, as you can see from the symmetric positioning of X and Y about the C2 axis. In other words, the components C1 and C2 are not doing a good job of distinguishing X from Y. In particular, C1 is contributing nothing towards discriminating between X and Y. If possible, we would like to have two components such that X loads mainly on one component while Y loads on the other.

Figure 11.1: The Relation Between (X, Y) And (C1, C2) Before Rotation

From Figure 11.1, we can see that rotating the C1 and C2 axes clockwise will lead to a better distinction. Figure 11.2 shows the result of such a rotation that transforms (C1, C2) to (C′1, C′2). The loading matrix has now become

      C′1        C′2
X  0.956777   0.290823
Y  0.290823   0.956777,    (11.21)


so that we have

X = 0.956777 C′1 + 0.290823 C′2
Y = 0.290823 C′1 + 0.956777 C′2.    (11.22)

The matrix (11.21) was computed by SPSS. It is clear that X is aligned more with C′1, and Y with C′2. As it turned out, this rotation of the axes is clockwise by π/4 radians or 45°.

Figure 11.2: The Relation Between (X, Y) And (C′1, C′2) After Rotation

If you have taken elementary college physics, you know that the 2-by-2 matrix that effects this rotation acting on (X, Y) isᵃ

[ cos(π/4)  −sin(π/4) ]            [ 1  −1 ]
[ sin(π/4)   cos(π/4) ]  =  (1/√2) [ 1   1 ].


Now,

[ cos(π/4)  −sin(π/4) ] [X] = (1/√2) [ 1  −1 ] [  0.882186 ]  =  (1/√2) [ 0.882186 + 0.470901 ]  =  [ 0.956777 ]
[ sin(π/4)   cos(π/4) ]              [ 1   1 ] [ −0.470901 ]            [ 0.882186 − 0.470901 ]     [ 0.290822 ]

and

[ cos(π/4)  −sin(π/4) ] [Y] = (1/√2) [ 1  −1 ] [ 0.882186 ]  =  (1/√2) [ 0.882186 − 0.470901 ]  =  [ 0.290822 ]
[ sin(π/4)   cos(π/4) ]              [ 1   1 ] [ 0.470901 ]            [ 0.882186 + 0.470901 ]     [ 0.956777 ].

The numbers agree up to the rounding error, and we have confirmed that this matrix indeed connects (C1, C2) and (C′1, C′2). A rotation like this is known as an orthogonal rotation because the resulting C′1 and C′2 remain orthogonal. Of many available algorithms for orthogonal rotation, varimax, quartimax, and equamax are commonly encountered, with the varimax rotation technique being easily the most commonly used. By abuse of notation, we redefine Component 1, C1, and Component 2, C2, as these rotated components.

ᵃ Here, we are rotating the axes clockwise. However, it is equivalent to rotating the points counterclockwise, which you may be more familiar with.
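The rotation just performed can be replicated numerically as follows; this sketch merely applies the 45° rotation quoted above to the unrotated loadings, rather than searching for the optimal angle as varimax software does.

    # A minimal sketch of the orthogonal rotation carried out above: the
    # unrotated loading matrix is multiplied by the 2-by-2 rotation matrix
    # for the 45-degree angle quoted in the text.
    import numpy as np

    theta = np.pi / 4                             # rotation angle from the text
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])

    loadings = np.array([[0.882186, -0.470901],   # X on C1, C2
                         [0.882186,  0.470901]])  # Y on C1, C2

    # Rotate each variable's loading vector (the rows of the loading matrix).
    rotated = loadings @ rotation.T
    print(rotated)  # approximately [[0.956777, 0.290822], [0.290822, 0.956777]]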

11.3.5 Communalities, Variance, and Covariance

11.3.5.1 Multivariate Variance

Suppose y_1, y_2, . . . , y_n are random variables with means µ_1, µ_2, . . . , µ_n, and form the column vectors

y = (y_1, y_2, . . . , y_n)^T  and  µ = (µ_1, µ_2, . . . , µ_n)^T.

We use the more compact notation E[y] = µ to describe E[y_i] = µ_i for i = 1, 2, . . . , n. Now let σ_{ij} denote the covariance between y_i and y_j, that is,


σ_{ij} = Cov(y_i, y_j) = E[(y_i − µ_i)(y_j − µ_j)] = E[y_i y_j − y_i µ_j − µ_i y_j + µ_i µ_j]
       = E[y_i y_j] − E[y_i] µ_j − µ_i E[y_j] + µ_i µ_j = E[y_i y_j] − µ_i µ_j.    (11.23)

Then, the variance-covariance matrix Σ is defined by

Σ = [ σ_{11}  σ_{12}  . . .  σ_{1n} ]
    [ σ_{21}  σ_{22}  . . .  σ_{2n} ]
    [   ⋮       ⋮      ⋱       ⋮   ]
    [ σ_{n1}  σ_{n2}  . . .  σ_{nn} ],    (11.24)

the diagonal entries of which are the variances of y_1, y_2, . . . , y_n.

Definition 11.1 (Total Variance and Generalized Variance) The quantity tr(Σ) = Σ_{i=1}^{n} σ_{ii} is called the total variance, and the determinant of Σ, often denoted by |Σ|, is referred to as the generalized variance. These are both overall measures of variability for the vector y.

11.3.5.2 Total Variance in the Data Set

In our analysis, the observed variables are standardizedᵃ, so that their means are 0 and the standard deviations are 1. Each observed variable therefore contributes one unit of variance to the total variance in the data set. Hence, the total variance in a principal components analysis of a correlation matrix is equal to the number of observed variables n. As the diagonal entries of any correlation matrix are 1's, this is consistent with Definition 11.1, which claims "the total variance = tr(R)".

In our case, we have

R = V L V^T  ⟹  tr(R) = tr(V L V^T) = tr(V^T V L) = tr(I L) = tr(L) = λ1 + λ2

by construction. So, the total variance is the sum of the eigenvalues. In our case, the total variance is 2, of which 1.556505 is explained by C1 and the remaining 0.443495 by C2. In this manner, the variance explained by each component is equal to the associated eigenvalue. This is why the component associated with the largest eigenvalue is the "principal component".


One concern you may have is that some eigenvalues can be negative while variances are nonnegative by definition, contradicting the claim above that each eigenvalue equals the amount of variance explained by the associated component. When this happens, you usually discard those eigenvalues and the associated "components" [Tabachnick and Fidell, 2001, p.593].

ᵃ Note that we use the correlation matrix, which is nothing but the variance-covariance matrix after standardization.

11.3.5.3 Communalities, Variance, and Covariance

As explained on p.166, loadings are the coefficients for components when observed variables are explained as linear combinations of the components. The two column entries to the right of the observed variables are the loadings for Component 1 and Component 2.

Communality for a variable is the variance accounted for by the components, and it is the sum of squared loadings (SSL) for a variable across components as shown in Table 11.1. On the other hand, the SSL of the column entries for each component is the variance explained by that component. So, SSL = (0.956777)² + (0.290823)² = 1.000000 for C1 and (0.290823)² + (0.956777)² = 1.000000 for C2.

Table 11.1: Loadings, Communalities, Variance, and Covariance

             Component 1   Component 2   Communalities (h²)
X            0.956777      0.290822      (0.956777)² + (0.290822)² = 1.000000
Y            0.290822      0.956777      (0.290822)² + (0.956777)² = 1.000000
SSL          1.000000      1.000000      2.000000
Variance     0.50          0.50
Covariance
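The entries of Table 11.1 can be recomputed directly from the loading matrix, as in the following sketch.

    # Row sums of squared loadings give the communalities; column sums give
    # the variance explained by each component (the SSL row of Table 11.1).
    import numpy as np

    loadings = np.array([[0.956777, 0.290822],   # X
                         [0.290822, 0.956777]])  # Y

    communalities = (loadings ** 2).sum(axis=1)  # per variable (rows)
    ssl = (loadings ** 2).sum(axis=0)            # per component (columns)
    print(communalities)  # approximately [1.0, 1.0]
    print(ssl)            # approximately [1.0, 1.0]; proportion of total: 0.5 each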

You may have noticed that one thing we did not accomplish with our first example was the reduction in the number of variables mentioned on p.155. In our two variable example, we retained both, and all, of the components and demonstrated that all the variance in the sample was accounted for. We will next see an example where the third eigenvalue is negligibly small, and see what effect the omission of the component associated with that eigenvalue has on our analysis. While retention of all the components automatically leads to a successful explanation of all the variance, as it is a mathematically provable fact, the art of including only the important


components and omitting insignificant ones is at the core of the principal components analysis technique.

11.4 Three Variable Example

We now add the third variable Z. The observed values are shown below. Each column corresponds to a set of X-, Y-, and Z-values (x, y, z)^T.

X  2.50  0.50  2.20  1.90  3.10  1.30  3.00  1.00  1.50  1.10
Y  2.40  1.70  2.90  1.20  2.80  2.70  2.10  1.40  1.60  0.90
Z  2.40  0.40  2.40  1.60  3.20  1.50  3.20  0.90  1.40  1.00

X and Y values are the same as before, and the correlation matrix R this time is

[ 1.000000  0.556505  0.986497 ]
[ 0.556505  1.000000  0.646934 ]
[ 0.986497  0.646934  1.000000 ].

11.4.1 Unrotated Solution

The three pairs of eigenvalues and eigenvectors of the correlation matrix R are

2.477241 and (−0.603327, −0.499532, −0.621662)^T,

0.515775 and (0.429445, −0.860349, 0.274548)^T,

and

0.00698405 and (−0.671992, −0.101327, 0.733594)^T.ᵃ

According to Section 11.3.5.2, the component associated with the eigenvalue 0.00698405 explains only 0.00698405/3 = 0.00232802, or about 0.2%, of the total variance. It may be wise to ignore this component. We will come back to this consideration later. But, first, let us proceed as before and subject this three variable case to the


same analysis as the two variable example.

We have

V = [ −0.603327   0.429445  −0.671992 ]
    [ −0.499532  −0.860349  −0.101327 ]
    [ −0.621662   0.274548   0.733594 ],

L = [ 2.477241  0         0          ]
    [ 0         0.515775  0          ]
    [ 0         0         0.00698405 ],

and

√L = [ 1.573925  0         0         ]
     [ 0         0.718175  0         ]
     [ 0         0         0.0835706 ].

Let us first check that V^T R V = L.

V^T R V = [ −0.603327  −0.499532  −0.621662 ] [ 1.000000  0.556505  0.986497 ] [ −0.603327   0.429445  −0.671992 ]
          [  0.429445  −0.860349   0.274548 ] [ 0.556505  1.000000  0.646934 ] [ −0.499532  −0.860349  −0.101327 ]
          [ −0.671992  −0.101327   0.733594 ] [ 0.986497  0.646934  1.000000 ] [ −0.621662   0.274548   0.733594 ]

        = [ 2.477241  0         0          ]
          [ 0         0.515775  0          ]  =  Lᵇ
          [ 0         0         0.00698405 ]

So, V^T R V = L is indeed satisfied up to rounding error. Next, we will compute V √L to express X, Y, and Z as linear combinations of components. In other words, we will compute the loadings of the variables on the three components.

V √L = [ −0.603327   0.429445  −0.671992 ] [ 1.573925  0         0         ]
       [ −0.499532  −0.860349  −0.101327 ] [ 0         0.718175  0         ]
       [ −0.621662   0.274548   0.733594 ] [ 0         0         0.0835706 ]

     = [ −0.949592   0.308417  −0.0561588 ]
       [ −0.786226  −0.617881  −0.0084680 ]
       [ −0.978450   0.197173   0.0613069 ]

Recall that this means

X = −0.949592 C1 + 0.308417 C2 − 0.0561588 C3
Y = −0.786226 C1 − 0.617881 C2 − 0.0084680 C3
Z = −0.978450 C1 + 0.197173 C2 + 0.0613069 C3;

where C1, C2, and C3 are the components associated with the eigenvalues 2.477241, 0.515775, and 0.00698405, respectively. Note that 6 of the 9 coefficients are negative. As we generally prefer positive coefficients, we will replace C1, C2, and C3 with −C1, −C2, and −C3 and still call them C1, C2, and C3 by the usual abuse of notation. Then, the loading matrix is

[ 0.949592  −0.308417   0.0561588 ]
[ 0.786226   0.617881   0.0084680 ]
[ 0.978450  −0.197173  −0.0613069 ].

Figure 11.3 shows the relationship between {X, Y, Z} and {C1, C2, C3}.

ᵃ These values were computed using "Eigenvalues and Eigenvectors Calculator" at http://www.akiti.ca/EigR12Solver.html. A more versatile online calculator is available at http://www.bluebit.gr/matrix-calculator/.
ᵇ Matrix multiplication was conducted using "Online Matrix Calculator" available at http://www.bluebit.gr/matrix-calculator/. Where possible, a precision level of 15 digits after the decimal point was used.

It is easy to see that all three points lie close to the plane defined by C1 and C2, reflecting the small values in the third column of the loading matrix. This takes us back to the issue of the significance of the third component raised on p.173. It is customary in principal components analysis to require the following conditions for a component to be included in the analysis.

1. The associated eigenvalue λ > 1.

2. The absolute value of at least one of the loadings > 0.3.

Though nothing is carved in stone, we will drop C3 as the loadings are all smaller than 0.07 < 0.3 and the eigenvalue of 0.00698405 is smaller than one hundredth of


Figure 11.3: The Relation Between (X, Y, Z) And (C1, C2, C3) Before Rotation


the critical value of 1.⁶

This amounts to the following changes in V and L.

V : [ −0.603327   0.429445  −0.671992 ]      [ −0.603327   0.429445 ]
    [ −0.499532  −0.860349  −0.101327 ]  →   [ −0.499532  −0.860349 ]
    [ −0.621662   0.274548   0.733594 ]      [ −0.621662   0.274548 ]

L : [ 2.477241  0         0          ]
    [ 0         0.515775  0          ]  →   [ 2.477241  0        ]
    [ 0         0         0.00698405 ]      [ 0         0.515775 ]

By abuse of notation and for symbolic consistency with previous computations, we will still use V and L to denote these new matrices.

Our new V and L still satisfy V^T R V = L, with the "sandwiching operation/operator" V^T ( ) V effectively cutting out the relevant portion from R and converting it to the new 2-by-2 version of L. Let us see how it does so.

V^T R V = [ −0.603327  −0.499532  −0.621662 ] [ 1.000000  0.556505  0.986497 ] [ −0.603327   0.429445 ]
          [  0.429445  −0.860349   0.274548 ] [ 0.556505  1.000000  0.646934 ] [ −0.499532  −0.860349 ]
                                              [ 0.986497  0.646934  1.000000 ] [ −0.621662   0.274548 ]

        = [ −1.494587  −1.237461  −1.540007 ] [ −0.603327   0.429445 ]
          [  0.221497  −0.443747   0.141605 ] [ −0.499532  −0.860349 ]
                                              [ −0.621662   0.274548 ]

        = [ 2.477241  0.000000 ]
          [ 0.000000  0.515775 ]  =  L

⁶ Errors in numerical computations often become problematic when the eigenvalue is small. This constitutes another reason for dropping such components.


And, the loadings can be found as in the two variable case.

V √L = [ −0.603327   0.429445 ] [ 1.573925  0.000000 ]
       [ −0.499532  −0.860349 ] [ 0.000000  0.718175 ]
       [ −0.621662   0.274548 ]

     = [ −0.949592   0.308417 ]
       [ −0.786226  −0.617881 ]
       [ −0.978450   0.197173 ]

As before, we have more negative entries than positive ones. So, we will again take −C1 and −C2 and call them C1 and C2 by abuse of notation for symbolic consistency. Then, we get

[ 0.949592  −0.308417 ]
[ 0.786226   0.617881 ]
[ 0.978450  −0.197173 ].

Recall that the first column lists the loadings of X, Y, and Z on C1 and the second column on C2. You can check it yourself, but this agrees with the output of a software package such as SPSS up to rounding error.

11.4.2 Rotated Solution

We will apply varimax rotation, the most popular orthogonal rotation algorithm, to the solution obtained in Section 11.4.1. While the unrotated initial solution can be calculated easily by hand, computation of the optimal rotation requires some software. Shown below are the new loadings after the rotation.

      C1        C2
X  0.961822  0.267852
Y  0.313683  0.949490
Z  0.924383  0.376507    (11.25)

X = 0.961822 C1 + 0.267852 C2
Y = 0.313683 C1 + 0.949490 C2
Z = 0.924383 C1 + 0.376507 C2    (11.26)


Here again, the new rotated axes C′1 and C′2 are simply denoted by C1 and C2 by abuse of notation. The angle of rotation this time for the C1- and C2-axes is 0.585642 radians or 33.554824 degrees clockwise. As this is equivalent to rotating X, Y, and Z counterclockwise by 0.585642 radians, the rotation matrix is

[ cos 0.585642  −sin 0.585642 ]  =  [ 0.833357  −0.552735 ]
[ sin 0.585642   cos 0.585642 ]     [ 0.552735   0.833357 ].

Indeed,

[ 0.833357  −0.552735 ] [  0.949592   0.786226   0.978450 ]      (columns: X, Y, Z)
[ 0.552735   0.833357 ] [ −0.308417   0.617881  −0.197173 ]

= [ 0.961822  0.313683  0.924383 ]      (row C1; columns: X, Y, Z)
  [ 0.267851  0.949490  0.376507 ]      (row C2).

Note that the pairs of loadings, on C1 and C2, are now presented in columns as the observed variables are expressed as column vectors. Figures 11.4 and 11.5 show the positions of the variables before and after the rotation.

As you can see, X and Z depend mainly on C1 and Y on C2 after the rotation, which is reflected in their respective loadings.

11.4.3 The Effect of Dropping A Component

Recall that we dropped the third minor component C3 from our analysis. While this had minimal effects on distinguishing X and Z from Y, we no longer have all the components for the principal components analysis. When all the components are retained:

1. The total variance in the observed variables is equal to the number of variables and is completely accounted for by the components.

2. The amount of variance each component explains is equal to its associated eigenvalue.


Figure 11.4: The Relation Between (X, Y, Z) And (C1, C2) Before Rotation

Figure 11.5: The Relation Between (X, Y, Z) And (C1, C2) After Rotation


Property 1 above does not hold as we have dropped C3. The total variance accounted for by the two components equals the sum of squared loadings from Section 11.3.5.3. The sums of squared loadings are

(0.961822)² + (0.267852)² = 0.996846,

(0.313683)² + (0.949490)² = 0.999928,

and

(0.924383)² + (0.376507)² = 0.996241,

respectively for X, Y, and Z. The total explained variance is therefore

(0.961822)² + (0.267852)² + (0.313683)² + (0.949490)² + (0.924383)² + (0.376507)² = 2.993016.

If you recall, the total variance is equal to the number of variables as each variable is standardized to have a unit variance. Hence, the total variance is 3 here, and we fell short of it by 0.006984. On the other hand, the variances explained by C1 and C2 are the same in size as their associated eigenvalues λ1 and λ2 due to Property 2, and so,

variance explained by C1 and C2 = λ1 + λ2 = 2.477241 + 0.515775 = 2.993016.

As you can see, our solution explained just as much variance as is carried by C1 and C2 and failed to account for the variance equal in size to the third eigenvalue; namely, 0.006984.
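This bookkeeping is easy to verify numerically, as in the sketch below.

    # With only two of the three components retained, the sum of all squared
    # loadings equals lambda_1 + lambda_2, and the shortfall from the total
    # variance of 3 is the third (dropped) eigenvalue.
    import numpy as np

    rotated_loadings = np.array([[0.961822, 0.267852],   # X
                                 [0.313683, 0.949490],   # Y
                                 [0.924383, 0.376507]])  # Z

    explained = (rotated_loadings ** 2).sum()
    print(explained)      # approximately 2.993016 = 2.477241 + 0.515775
    print(3 - explained)  # approximately 0.006984, the dropped eigenvalue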

11.5 Multi-Variable Example

So far, we have been examining abstract and simple cases in order to shed light on the inner workings of principal components analysis. However, many investigations that employ this technique include ten or more observed variables. In this section, we will return to the language learner example first encountered in Section 9.5 and consider all 12 variables examined in that study. You may have thought that the reduction in the number of variables did not contribute to simplifying and making more manageable the problem at hand. But, with 12 variables, you will clearly see the power and utility of principal components analysis. Though this is not a foreign language education class, some explanation of the 12 variables is due. The variables


we will deal with are as follows. Recall that the first letter M stands for the fact that Rasch analysis has been performed on the raw data. The only exception to this is ZDTN, which is the Z-score, as Rasch analysis was not possible for the data.

1. MSLT (Short Listening Test): Conversations lasting for about 10 seconds.

2. MLLT (Long Listening Test): Lectures and conversations lasting for 3 to 5 minutes.

3. MLCT (Listening Cloze Test): Gap filling while listening.

4. ZDTN (Dictation Test): Ordinary dictation.

5. MPHD (Phoneme Distinction Test): Distinguishing between words such as "late" and "rate".

6. MAWR (Aural Word Recognition Test): Listening to a word and writing it down. Word-level dictation.

7. MRCT (Reading Cloze Test): Gap filling while reading.

8. MRCL (Reading Comprehension of Listening Script Test): The full name in parentheses says it all.

9. MVST (Vocabulary Size Test): Multiple-choice vocabulary test.

10. MPVT (Productive Vocabulary Levels Test): Filling a gap in a sentence with the right word.

11. MGED (Grammatical Error Detection Test): Choosing an error out of four possibilities.

12. MGGT (GRE-TOEFL Gap Filling Test): Multiple choice gap filling test.

Of these, 1, 2, 3, 4, 5, and 6 measure listening skills directly, and 7 and 8 measure reading skills directly. However, even these are not genuine measures of listening and reading, and how the abilities measured by the other tests are related to listening and reading is difficult to assess. As one reason for our difficulty is the large number of observed variables, principal components analysis may contribute significantly to disentangling the seemingly complicated situation.


11.5.1 Correlation Matrix

The full correlation matrix is given below in Table 11.2. As you can see, what were purported to be measures of listening are significantly correlated with those for reading. For example, the correlation coefficient between MSLT (short listening) and MRCL (reading comprehension of listening scripts) is 0.645. If the listening measures and reading measures are both genuine measures of the respective skills and there is no association between them, the correlation matrix should have a block structure as shown in Table 11.3, where the entries in the cells with "." are 0.000. In a way, principal components analysis is an attempt to arrive at this kind of

            1      2      3      4      5      6      7      8      9     10     11     12
1. MSLT  1.000   .621   .664   .576   .409   .489   .584   .645   .571   .518   .556   .379
2. MLLT   .621  1.000   .705   .507   .295   .353   .666   .756   .555   .457   .603   .457
3. MLCT   .664   .705  1.000   .659   .375   .610   .658   .583   .570   .544   .616   .408
4. ZDTN   .576   .507   .659  1.000   .386   .478   .539   .507   .425   .470   .564   .400
5. MPHD   .409   .295   .375   .386  1.000   .417   .248   .223   .297   .335   .265   .249
6. MAWR   .489   .353   .610   .478   .417  1.000   .435   .321   .426   .506   .483   .284
7. MRCT   .584   .666   .658   .539   .248   .435  1.000   .719   .573   .611   .748   .542
8. MRCL   .645   .756   .583   .507   .223   .321   .719  1.000   .581   .527   .669   .544
9. MVST   .571   .555   .570   .425   .297   .426   .573   .581  1.000   .655   .538   .574
10. MPVT  .518   .457   .544   .470   .335   .506   .611   .527   .655  1.000   .691   .637
11. MGED  .556   .603   .616   .564   .265   .483   .748   .669   .538   .691  1.000   .689
12. MGGT  .379   .457   .408   .400   .249   .284   .542   .544   .574   .637   .689  1.000

Table 11.2: Full Correlation Matrix

block structure, in addition to reducing the number of variables to deal with, albeit only approximately.

            1      2      3      4      5      6      7      8
1. MSLT  1.000   .621   .664   .576   .409   .489    .      .
2. MLLT   .621  1.000   .705   .507   .295   .353    .      .
3. MLCT   .664   .705  1.000   .659   .375   .610    .      .
4. ZDTN   .576   .507   .659  1.000   .386   .478    .      .
5. MPHD   .409   .295   .375   .386  1.000   .417    .      .
6. MAWR   .489   .353   .610   .478   .417  1.000    .      .
7. MRCT    .      .      .      .      .      .     1.000   .719
8. MRCL    .      .      .      .      .      .      .719  1.000

Table 11.3: Full Correlation Matrix with a Block Structure


11.5.2 Stepwise Description Of The Procedure

With 12 variables, we have no other choice but to use a software package. What we should do, and what the software mostly does for us, is as follows step-by-step.

Steps Involved

1. Conduct the full rotated principal components analysis with as many components as the number of variables.⁷

2. Examine the scree plot and other data, such as eigenvalues, to decide on the number of components to retain.⁸ ⁹ ¹⁰

3. Run another rotated principal components analysis with the prescribed number of components.

4. Ignore the loadings smaller than 0.4. (Note: This is a rule of thumb. Adjust it upward or downward, and use numbers such as 0.3 and 0.5 if they work better.) Hair et al. recommend the following guidelines for practical significance:

±0.3: Minimal
±0.4: More Important
±0.5: Practically Significant

5. Interpret the meaning of each component based on the nature of the variables with high loadings on it.

6. Finally identify what component(s) is (are) behind each variable.

As an example of Step 6, one's short listening skill, the observed variable MSLT, may be a combination of two components: "aural processing", including sound capturing and word recognition, and "processing for meaning". We will see below that this appears to be the case for our data.
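For readers who prefer to see the first two steps above in code, here is a hedged Python sketch; the function and the randomly generated data are ours for illustration only, and the rotation of Step 3 is left to a dedicated software package.

    # A sketch of Steps 1 and 2, assuming the raw scores sit in a
    # (subjects x variables) array: eigendecompose the correlation matrix,
    # form the unrotated loadings A = V sqrt(L), and apply Kaiser's criterion.
    import numpy as np

    def pca_loadings(data):
        """Return descending eigenvalues and unrotated loadings of the correlation matrix."""
        R = np.corrcoef(data, rowvar=False)    # correlation matrix
        eigenvalues, V = np.linalg.eigh(R)
        order = np.argsort(eigenvalues)[::-1]  # sort eigenvalues in descending order
        eigenvalues, V = eigenvalues[order], V[:, order]
        loadings = V * np.sqrt(eigenvalues)    # A = V sqrt(L), column-wise scaling
        return eigenvalues, loadings

    # Hypothetical data: 100 subjects, 12 test scores.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 12))
    eigenvalues, loadings = pca_loadings(data)
    retain = np.sum(eigenvalues > 1)           # Kaiser's criterion
    print(eigenvalues, retain)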

⁷ Recall that all the variance in the observed variables is explained if the number of components equals that of the variables.
⁸ The usual practice is to include the components corresponding to the last eigenvalue before the plot begins to level off.
⁹ The word "scree" refers to the loose rubble that lies at the base of a cliff.
¹⁰ Kaiser's criterion of eigenvalues greater than 1, which recommends discarding the components associated with eigenvalues smaller than 1, is often adopted as a rule of thumb.


11.5.3 Interpreting and Using the Computer Output

We will follow the steps described on p.184 with our 12 variables. We will use varimax rotation in the following.

Step 1: Full Analysis

Let us check the communalities first (Table 11.4).

Communalities
        Initial   Extraction
MSLT     1.000      1.000
MLLT     1.000      1.000
MLCT     1.000      1.000
ZDTN     1.000      1.000
MPHD     1.000      1.000
MAWR     1.000      1.000
MRCT     1.000      1.000
MRCL     1.000      1.000
MVST     1.000      1.000
MPVT     1.000      1.000
MGED     1.000      1.000
MGGT     1.000      1.000

Table 11.4: Communalities for Full Analysis

Recall from p.172 that communality for a variable is the variance accounted for by the components. Because we are using the correlation matrix, which is the same as the covariance matrix for standardized variables, each variable has a unit variance of 1.000 as explained in Section 11.3.5.2, and all of it is accounted for by the model, as you see 1.000's all the way down to the bottom in the column labeled "Extraction".

The variances explained by all 12 components are summarized in Table 11.5. The total amount explained is 100% as it should be.

The coefficients {a_ij} in (11.2) are listed in a table titled Component Matrix, Table 11.6.

We can express what Table 11.6 means using matrix multiplication as in (11.3) and (11.4).

Z = AC (11.27)


Total Variance Explained
                 Initial Eigenvalues                 Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %   Total   % of Variance   Cumulative %
 1          6.769       56.408          56.408     6.769       56.408          56.408
 2          1.136        9.467          65.875     1.136        9.467          65.875
 3           .883        7.358          73.233      .883        7.358          73.233
 4           .651        5.424          78.657      .651        5.424          78.657
 5           .563        4.695          83.352      .563        4.695          83.352
 6           .442        3.687          87.039      .442        3.687          87.039
 7           .371        3.095          90.133      .371        3.095          90.133
 8           .318        2.649          92.782      .318        2.649          92.782
 9           .271        2.257          95.039      .271        2.257          95.039
10           .249        2.071          97.110      .249        2.071          97.110
11           .184        1.531          98.641      .184        1.531          98.641
12           .163        1.359         100.000      .163        1.359         100.000

Table 11.5: Total Variance Explained

Component Matrix
          Component
      1     2     3     4     5     6     7     8     9     10    11    12
MSLT  .785  .178 -.241  .112 -.176 -.143 -.396  .171 -.129 -.150  .020  .065
MLLT  .790 -.124 -.394  .173 -.055  .150  .245  .052 -.130  .141 -.062  .215
MLCT  .830  .210 -.224 -.176 -.048 -.018  .248 -.045 -.222 -.130  .084 -.217
ZDTN  .724  .256 -.145 -.131  .399 -.429  .040 -.052  .113  .090 -.011  .060
MPHD  .464  .635  .222  .527  .133  .162  .016 -.088  .032 -.012 -.022 -.037
MAWR  .632  .492  .177 -.426 -.160  .210  .027  .201  .157  .070  .031  .061
MRCT  .832 -.207 -.110 -.101  .079  .206 -.073 -.308  .161 -.208  .123  .099
MRCL  .805 -.292 -.276  .158 -.006  .090 -.119  .092  .198  .223  .079 -.199
MVST  .758 -.121  .185  .123 -.460 -.255  .151 -.079  .176 -.069 -.136 -.014
MPVT  .777 -.092  .431 -.088 -.099 -.043 -.142 -.214 -.223  .245  .077  .022
MGED  .840 -.215  .136 -.150  .244  .156 -.098  .036 -.058 -.061 -.318 -.071
MGGT  .690 -.370  .437  .144  .192 -.052  .136  .277 -.016 -.120  .154  .038

Table 11.6: Full Component Matrix


Here, Z is a 12×1 column vector representing the variables MSLT through MGGT, A = [a_ij] for 1 ≤ i, j ≤ 12 is the coefficient matrix whose k-th column is the column for the k-th component, C_k, in Table 11.6, and C is another 12×1 column vector representing the components C1 through C12.

\[
Z = \begin{pmatrix} MSLT \\ MLLT \\ MLCT \\ ZDTN \\ MPHD \\ MAWR \\ MRCT \\ MRCL \\ MVST \\ MPVT \\ MGED \\ MGGT \end{pmatrix},
\qquad
C = \begin{pmatrix} C_1 \\ C_2 \\ C_3 \\ C_4 \\ C_5 \\ C_6 \\ C_7 \\ C_8 \\ C_9 \\ C_{10} \\ C_{11} \\ C_{12} \end{pmatrix}
\tag{11.28}
\]

\[
A = \begin{pmatrix}
.785 & .178 & -.241 & .112 & -.176 & -.143 & -.396 & .171 & -.129 & -.150 & .020 & .065 \\
.790 & -.124 & -.394 & .173 & -.055 & .150 & .245 & .052 & -.130 & .141 & -.062 & .215 \\
.830 & .210 & -.224 & -.176 & -.048 & -.018 & .248 & -.045 & -.222 & -.130 & .084 & -.217 \\
.724 & .256 & -.145 & -.131 & .399 & -.429 & .040 & -.052 & .113 & .090 & -.011 & .060 \\
.464 & .635 & .222 & .527 & .133 & .162 & .016 & -.088 & .032 & -.012 & -.022 & -.037 \\
.632 & .492 & .177 & -.426 & -.160 & .210 & .027 & .201 & .157 & .070 & .031 & .061 \\
.832 & -.207 & -.110 & -.101 & .079 & .206 & -.073 & -.308 & .161 & -.208 & .123 & .099 \\
.805 & -.292 & -.276 & .158 & -.006 & .090 & -.119 & .092 & .198 & .223 & .079 & -.199 \\
.758 & -.121 & .185 & .123 & -.460 & -.255 & .151 & -.079 & .176 & -.069 & -.136 & -.014 \\
.777 & -.092 & .431 & -.088 & -.099 & -.043 & -.142 & -.214 & -.223 & .245 & .077 & .022 \\
.840 & -.215 & .136 & -.150 & .244 & .156 & -.098 & .036 & -.058 & -.061 & -.318 & -.071 \\
.690 & -.370 & .437 & .144 & .192 & -.052 & .136 & .277 & -.016 & -.120 & .154 & .038
\end{pmatrix}
\tag{11.29}
\]


To be more precise, we should use standardized scores and write

\[
Z = \begin{pmatrix} Z_{MSLT} \\ Z_{MLLT} \\ Z_{MLCT} \\ ZDTN \\ Z_{MPHD} \\ Z_{MAWR} \\ Z_{MRCT} \\ Z_{MRCL} \\ Z_{MVST} \\ Z_{MPVT} \\ Z_{MGED} \\ Z_{MGGT} \end{pmatrix}.^{11}
\tag{11.30}
\]

However, we will write MSLT for Z_MSLT, MLLT for Z_MLLT, and so forth for notational simplicity.

Recall from Section 11.3.5 that the communality of an observed variable, which is the total amount of variance explained by the extracted components, is the sum of squared loadings (SSL) for the variable. Indeed, we have

(.785)² + (.178)² + (−.241)² + (.112)² + (−.176)² + (−.143)² + (−.396)² + (.171)²
+ (−.129)² + (−.150)² + (.020)² + (.065)² = 1.000 (11.31)

for MSLT, and likewise for the other observed variables. Needless to say, these are consistent with Table 11.4.
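As a numerical check, (11.31) is a one-line computation. A minimal sketch in Python with NumPy, using the MSLT row of Table 11.6:

```python
import numpy as np

# Loadings of MSLT on the 12 components (first row of Table 11.6).
mslt = np.array([.785, .178, -.241, .112, -.176, -.143,
                 -.396, .171, -.129, -.150, .020, .065])

# Communality = sum of squared loadings; with all 12 components
# retained it should equal 1.000 up to rounding.
print(round(float(np.sum(mslt**2)), 3))  # -> 1.0
```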

According to Property 2 on p. 158, we should be able to reconstruct the original data with no error, and this is evident in Table 11.7, titled “Reproduced Correlations”, where the residual correlation matrix equals the original correlation matrix minus the reproduced correlation matrix, and E-16 and E-17 signify 10⁻¹⁶ and 10⁻¹⁷, respectively.
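In matrix form, the reproduced correlation matrix of Table 11.7 is simply AAᵀ. A minimal sketch of the check, with A typed in from Table 11.6:

```python
import numpy as np

# The full 12x12 component matrix A of (11.29), from Table 11.6.
A = np.array([
    [.785,  .178, -.241,  .112, -.176, -.143, -.396,  .171, -.129, -.150,  .020,  .065],
    [.790, -.124, -.394,  .173, -.055,  .150,  .245,  .052, -.130,  .141, -.062,  .215],
    [.830,  .210, -.224, -.176, -.048, -.018,  .248, -.045, -.222, -.130,  .084, -.217],
    [.724,  .256, -.145, -.131,  .399, -.429,  .040, -.052,  .113,  .090, -.011,  .060],
    [.464,  .635,  .222,  .527,  .133,  .162,  .016, -.088,  .032, -.012, -.022, -.037],
    [.632,  .492,  .177, -.426, -.160,  .210,  .027,  .201,  .157,  .070,  .031,  .061],
    [.832, -.207, -.110, -.101,  .079,  .206, -.073, -.308,  .161, -.208,  .123,  .099],
    [.805, -.292, -.276,  .158, -.006,  .090, -.119,  .092,  .198,  .223,  .079, -.199],
    [.758, -.121,  .185,  .123, -.460, -.255,  .151, -.079,  .176, -.069, -.136, -.014],
    [.777, -.092,  .431, -.088, -.099, -.043, -.142, -.214, -.223,  .245,  .077,  .022],
    [.840, -.215,  .136, -.150,  .244,  .156, -.098,  .036, -.058, -.061, -.318, -.071],
    [.690, -.370,  .437,  .144,  .192, -.052,  .136,  .277, -.016, -.120,  .154,  .038],
])

reproduced = A @ A.T  # reproduced correlation matrix of Table 11.7
print(round(float(reproduced[0, 1]), 3))  # MSLT-MLLT entry -> 0.621
```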

11From p.182, ZDTN is already a z-score.


Reproduced Correlations
Reproduced   1     2     3     4     5     6     7     8     9     10    11    12
1. MSLT    1.000  .621  .664  .576  .409  .489  .584  .645  .571  .518  .556  .379
2. MLLT     .621 1.000  .705  .507  .295  .353  .666  .756  .555  .457  .603  .457
3. MLCT     .664  .705 1.000  .659  .375  .610  .658  .583  .570  .544  .616  .408
4. ZDTN     .576  .507  .659 1.000  .386  .478  .539  .507  .425  .470  .564  .400
5. MPHD     .409  .295  .375  .386 1.000  .417  .248  .223  .297  .335  .265  .249
6. MAWR     .489  .353  .610  .478  .417 1.000  .435  .321  .426  .506  .483  .284
7. MRCT     .584  .666  .658  .539  .248  .435 1.000  .719  .573  .611  .748  .542
8. MRCL     .645  .756  .583  .507  .223  .321  .719 1.000  .581  .527  .669  .544
9. MVST     .571  .555  .570  .425  .297  .426  .573  .581 1.000  .655  .538  .574
10. MPVT    .518  .457  .544  .470  .335  .506  .611  .527  .655 1.000  .691  .637
11. MGED    .556  .603  .616  .564  .265  .483  .748  .669  .538  .691 1.000  .689
12. MGGT    .379  .457  .408  .400  .249  .284  .542  .544  .574  .637  .689 1.000

(Residual half of the table omitted here: every residual correlation is zero to within rounding error, with all entries of the order of 10⁻¹⁶ or 10⁻¹⁷.)

Table 11.7: Reproduced Full Correlation Matrix


If the matrix A in (11.4) is invertible, we have

Z = AC ⟹ C = A⁻¹Z, (11.32)

which expresses each component C_i as a linear combination of the standardized observed variables {Z_i}. In our case, we have

\[
A^{-1} = \begin{pmatrix}
.116 & .117 & .123 & .107 & .069 & .093 & .123 & .119 & .112 & .115 & .124 & .102 \\
.157 & -.109 & .184 & .225 & .559 & .433 & -.182 & -.257 & -.107 & -.081 & -.189 & -.326 \\
-.273 & -.446 & -.254 & -.165 & .252 & .201 & -.125 & -.312 & .210 & .488 & .153 & .495 \\
.172 & .265 & -.270 & -.202 & .810 & -.655 & -.155 & .242 & .189 & -.136 & -.230 & .221 \\
-.312 & -.097 & -.085 & .708 & .237 & -.285 & .140 & -.010 & -.817 & -.176 & .433 & .341 \\
-.324 & .340 & -.040 & -.970 & .365 & .474 & .466 & .202 & -.577 & -.098 & .352 & -.118 \\
-1.067 & .659 & .668 & .107 & .043 & .072 & -.196 & -.321 & .406 & -.382 & -.264 & .366 \\
.537 & .164 & -.140 & -.163 & -.276 & .632 & -.968 & .288 & -.250 & -.673 & .115 & .873 \\
-.477 & -.479 & -.821 & .417 & .119 & .581 & .593 & .730 & .649 & -.825 & -.215 & -.060 \\
-.604 & .567 & -.523 & .363 & -.048 & .283 & -.838 & .896 & -.279 & .986 & -.244 & -.484 \\
.108 & -.339 & .458 & -.058 & -.118 & .168 & .668 & .432 & -.741 & .421 & -1.732 & .838 \\
.396 & 1.318 & -1.332 & .371 & -.229 & .375 & .607 & -1.223 & -.086 & .133 & -.435 & .233
\end{pmatrix}.
\tag{11.33}
\]

The entries of (A⁻¹)ᵀ are listed in a table titled “Component Score Coefficient Matrix”, Table 11.8.

Component Score Coefficient Matrix
          Component
      1     2     3     4     5     6      7     8     9     10    11     12
MSLT  .116  .157 -.273  .172 -.312 -.324 -1.067  .537 -.477 -.604  .108   .396
MLLT  .117 -.109 -.446  .265 -.097  .340   .659  .164 -.479  .567 -.339  1.318
MLCT  .123  .184 -.254 -.270 -.085 -.040   .668 -.140 -.821 -.523  .458 -1.332
ZDTN  .107  .225 -.165 -.202  .708 -.970   .107 -.163  .417  .363 -.058   .371
MPHD  .069  .559  .252  .810  .237  .365   .043 -.276  .119 -.048 -.118  -.229
MAWR  .093  .433  .201 -.655 -.285  .474   .072  .632  .581  .283  .168   .375
MRCT  .123 -.182 -.125 -.155  .140  .466  -.196 -.968  .593 -.838  .668   .607
MRCL  .119 -.257 -.312  .242 -.010  .202  -.321  .288  .730  .896  .432 -1.223
MVST  .112 -.107  .210  .189 -.817 -.577   .406 -.250  .649 -.279 -.741  -.086
MPVT  .115 -.081  .488 -.136 -.176 -.098  -.382 -.673 -.825  .986  .421   .133
MGED  .124 -.189  .153 -.230  .433  .352  -.264  .115 -.215 -.244 -1.732  -.435
MGGT  .102 -.326  .495  .221  .341 -.118   .366  .873 -.060 -.484  .838   .233

Table 11.8: Component Score Coefficient Matrix

The component score coefficient matrix is used to compute the “component score”. Each case/subject has 12 test scores which characterize his/her proficiency. However, the proficiency of a subject can also be expressed in terms of the components. Suppose a subject obtained a set of test scores (observed variables) X1, X2, . . ., and X12 such that the representation of the subject’s proficiency in terms of the standardized scores is



\[
Z = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_{12} \end{pmatrix}.
\tag{11.34}
\]

Then, the column vector that expresses the same proficiency profile with respect to the components,

\[
C = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{12} \end{pmatrix},
\tag{11.35}
\]

is connected to Z as in (11.32):

\[
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{12} \end{pmatrix}
= A^{-1}
\begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_{12} \end{pmatrix}.
\tag{11.36}
\]

In other words, a person’s ability represented by a set of standardized test scores {Z_1, Z_2, . . . , Z_12} can be expressed as a linear combination of the components: c_1C_1 + c_2C_2 + . . . + c_12C_12. The coefficients c_1, c_2, . . . , c_12 are called component scores. Note that this is very useful because the C_i’s are uncorrelated while the Z_i’s are correlated.

If we denote the component scores and the standardized test scores for the i-th subject by {c^i_j} = c^i_1, c^i_2, …, c^i_12 and {Z^i_j} = Z^i_1, Z^i_2, …, Z^i_12, respectively, where the superscript i signifies the i-th subject and the subscript j runs over the components, 1 ≤ j ≤ 12, and if we write A⁻¹ in (11.36) as [A⁻¹_kl], where A⁻¹_kl is the k-th row and l-th column entry of A⁻¹, we can see from (11.36) that we now have

\[
\begin{pmatrix} c^i_1 \\ c^i_2 \\ \vdots \\ c^i_{12} \end{pmatrix}
= \left[ A^{-1}_{kl} \right]
\begin{pmatrix} Z^i_1 \\ Z^i_2 \\ \vdots \\ Z^i_{12} \end{pmatrix},
\tag{11.37}
\]

or, component-wise,

c^i_j = Σ_{l=1}^{12} A⁻¹_{jl} Z^i_l for 1 ≤ j ≤ 12. (11.38)


Now, many software packages, including SPSS, can compute the component scores for each subject. But it is straightforward to compute a component score by hand given (11.38). Let us compute our first subject’s component score for the first component C1.

\[
c^1_1 = \sum_{l=1}^{12} A^{-1}_{1l} Z^1_l
= \begin{bmatrix} .116 & .117 & .123 & .107 & .069 & .093 & .123 & .119 & .112 & .115 & .124 & .102 \end{bmatrix}
\begin{pmatrix} 0.910 \\ 2.863 \\ 2.221 \\ 0.937 \\ 0.354 \\ 1.696 \\ 0.513 \\ 0.448 \\ 3.062 \\ 0.888 \\ 0.740 \\ 1.775 \end{pmatrix}
= 1.829
\tag{11.39}
\]


This agrees with the component score computed by SPSS.
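The same dot product is easily reproduced in code. A minimal sketch; note that with the rounded three-decimal coefficients shown here the sum comes out as 1.830, which matches the full-precision 1.829 up to rounding:

```python
import numpy as np

# Row 1 of A^{-1} (= column 1 of Table 11.8) and the first
# subject's 12 standardized test scores, as in (11.39).
a_inv_row1 = np.array([.116, .117, .123, .107, .069, .093,
                       .123, .119, .112, .115, .124, .102])
z1 = np.array([0.910, 2.863, 2.221, 0.937, 0.354, 1.696,
               0.513, 0.448, 3.062, 0.888, 0.740, 1.775])

c11 = float(a_inv_row1 @ z1)  # component score for component 1
print(round(c11, 3))  # -> 1.830 (1.829 with unrounded coefficients)
```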


If you next look at the list of eigenvalues associated with the components (Table 11.9)12, you can see that only the two largest eigenvalues satisfy the λ > 1 criterion explained on p.175. This is reflected in the scree plot (Figure 11.6).
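Both observations are quick to verify in code. A minimal sketch using the eigenvalues of Table 11.9:

```python
import numpy as np

eigenvalues = np.array([6.769, 1.136, .883, .651, .563, .442,
                        .371, .318, .271, .249, .184, .163])

# Kaiser's criterion: keep components with eigenvalue > 1.
print(int(np.sum(eigenvalues > 1)))        # -> 2
# Sanity check: the eigenvalues sum to the number of variables.
print(round(float(eigenvalues.sum()), 3))  # -> 12.0
```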

Step 2: Determining the Number of Components to Retain

As mentioned above, only two eigenvalues are greater than 1. On the other hand, according to Footnote 8 of this chapter, you are supposed to include the components up to the one corresponding to the last eigenvalue before the scree plot begins to level off, and this seems to indicate inclusion of the first four components.

List of Eigenvalues
Components   1      2      3     4     5     6     7     8     9     10    11    12
Eigenvalues  6.769  1.136  .883  .651  .563  .442  .371  .318  .271  .249  .184  .163

Table 11.9: List of Eigenvalues

At this point the choice is more of an art than a science, and we have to make an integrative decision so that the result makes the most sense.

12If you compute the sum of all the eigenvalues, you get 6.769 + 1.136 + 0.883 + 0.651 + 0.563 + 0.442 + 0.371 + 0.318 + 0.271 + 0.249 + 0.184 + 0.163 = 12.000, and this confirms the prior assertion that the total variance equals the number of the observed variables.


[Figure: scree plot of the 12 eigenvalues in descending order against component number.]

Figure 11.6: Scree Plot of 12 Eigenvalues

Needless to say, the process involves an element of trial and error. I did the trial and error for you and found that the three-component model works best. So, we will retain the first three components.

Step 3: Principal Components Analysis with the Chosen Number of Components

In Step 2 we decided to keep three components. With these, the communalities, the amounts of variance in the observed variables explained by the components, are no longer 1.000, as shown in Table 11.10. While there is no general consensus as to what proportion of the variance a good model should explain, Habing [Habing, 2003] says, “it seems reasonable that any decent model should have at least 50% of the variance in the variables explained by the [components].” According to this criterion, our communalities are sufficiently large for the three-component model.

We will next look at the table of the total variance explained (Table 11.14). The first column lists the components, the next three columns with a group heading of “Initial Eigenvalues” are the variances explained by the components, the following three with a group heading of “Extraction Sums of Squared Loadings” are for the three components we retained, and the three rightmost columns, collectively labeled “Rotation Sums of Squared Loadings”, are the variances explained after rotation.


Communalities
        Initial   Extraction
MSLT    1.000     .706
MLLT    1.000     .795
MLCT    1.000     .782
ZDTN    1.000     .610
MPHD    1.000     .667
MAWR    1.000     .673
MRCT    1.000     .747
MRCL    1.000     .810
MVST    1.000     .624
MPVT    1.000     .798
MGED    1.000     .769
MGGT    1.000     .804

Table 11.10: Communalities with Three Components

Recall that this is for varimax rotation. The column labeled “% of Variance” shows what percentage of the total variance each component explains, and the column labeled “Cumulative %” shows what percentage of the total variance is explained by the components so far. For example, the entry of 73.233 for the third component under “Initial Eigenvalues” means:

(variance explained by the 1st component)
+ (variance explained by the 2nd component)
+ (variance explained by the 3rd component)
= 56.408 + 9.467 + 7.358 = 73.233.

While the first three rows under “Extraction Sums of Squared Loadings” are an exact replica of the corresponding rows in the “Initial Eigenvalues” group, representing the three components we retained, you can see that the values are different for the nine entries under “Rotation Sums of Squared Loadings”. This is because the values in this panel of the table represent the new distribution of the variance after the varimax rotation: varimax rotation tries to maximize the variance of each of the components, so that the total amount of variance accounted for by the model is redistributed over the three extracted components [Institute for Digital Research and Education UCLA, nd].


The loadings before rotation are given in Table 11.12, which SPSS calls the “component matrix”. This is the matrix of the coefficients in the linear combinations of C1, C2, and C3 for the observed variables, such that the following relations hold. Likewise for the rotated component matrix.

MSLT = .785C1 + .178C2 − .241C3
MLLT = .790C1 − .124C2 − .394C3
MLCT = .830C1 + .210C2 − .224C3
ZDTN = .724C1 + .256C2 − .145C3
MPHD = .464C1 + .635C2 + .222C3
MAWR = .632C1 + .492C2 + .177C3
MRCT = .832C1 − .207C2 − .110C3
MRCL = .805C1 − .292C2 − .276C3
MVST = .758C1 − .121C2 + .185C3
MPVT = .777C1 − .092C2 + .431C3
MGED = .840C1 − .215C2 + .136C3
MGGT = .690C1 − .370C2 + .437C3

In our view, MSLT through MGGT as well as {C1, C2, C3} are regarded as vectors, and each linear combination stems from the observed variable’s decomposition into its C1-, C2-, and C3-components; MSLT = (0.785, 0.178, −0.241), for example13. If we define a 12 by 1 column matrix [M] and a 3 by 1 column matrix [C], both with vector entries, by

\[
[M] = \begin{pmatrix} MSLT \\ MLLT \\ \vdots \\ MGED \\ MGGT \end{pmatrix}
\quad\text{and}\quad
[C] = \begin{pmatrix} C_1 \\ C_2 \\ C_3 \end{pmatrix}
\]

and the 12 by 3 matrix [L] of the loadings by

\[
[L] = \begin{pmatrix}
.785 & .178 & -.241 \\
.790 & -.124 & -.394 \\
\vdots & \vdots & \vdots \\
.840 & -.215 & .136 \\
.690 & -.370 & .437
\end{pmatrix},
\]

13This is exactly like the usual decomposition of a three-dimensional vector V into x-, y-, and z-Cartesian components, such that V = (a, b, c) means V = ai + bj + ck.


we have

[M] = [L][C].

Compare this with the rotated component matrix ([L′]) given in Table 11.13. The difference is caused by varimax rotation, which results in each component having a small number of large loadings and a large number of zero (or small) loadings. This simplifies the interpretation because, after a varimax rotation, each original variable tends to be associated with one (or a small number) of the components, and each component represents only a small number of variables [Abdi, 2003]. The simplification will be clearly visible in Step 4.
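The rotation itself is left to SPSS in this chapter, but for reference, the following is a sketch of the standard SVD-based varimax algorithm (this is not the author’s code, and column signs and order may differ from SPSS output, which is immaterial for interpretation):

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Return the varimax-rotated loading matrix L @ R, where R is the
    orthogonal rotation maximizing the variance of the squared loadings."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - (gamma / p) * Lr @ np.diag(np.sum(Lr**2, axis=0))))
        R = u @ vt
        d_new = np.sum(s)
        if d_new < d * (1 + tol):  # criterion stopped growing: converged
            break
        d = d_new
    return L @ R
```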

Before proceeding to the next step, we will look at the Component Transformation Matrix, which we denote by [T] (Table 11.11). The matrix [T] connects the unrotated

Component Transformation Matrix
Component    1′       2′       3′
1            0.690    0.584    0.427
2           -0.134   -0.477    0.869
3           -0.711    0.657    0.251

Table 11.11: Component Transformation Matrix

components {C1, C2, C3} and the rotated components {C′1, C′2, C′3} backwards, such that

\[
\begin{pmatrix} C_1 \\ C_2 \\ C_3 \end{pmatrix} = [T] \begin{pmatrix} C'_1 \\ C'_2 \\ C'_3 \end{pmatrix}
\quad\text{or}\quad [C] = [T][C'].
\]

Therefore, we have

[M] = [L][C] = [L][T][C′] = [L′][C′] ⟹ [L′] = [L][T].

Indeed,

\[
[L][T] =
\begin{pmatrix}
.785 & .178 & -.241 \\
.790 & -.124 & -.394 \\
\vdots & \vdots & \vdots \\
.840 & -.215 & .136 \\
.690 & -.370 & .437
\end{pmatrix}
\begin{pmatrix}
.690 & .584 & .427 \\
-.134 & -.477 & .869 \\
-.711 & .657 & .251
\end{pmatrix}
=
\begin{pmatrix}
0.690 & 0.215 & 0.429 \\
0.842 & 0.262 & 0.131 \\
\vdots & \vdots & \vdots \\
0.512 & 0.682 & 0.205 \\
0.215 & 0.867 & 0.083
\end{pmatrix}
= [L'].
\]
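This multiplication is easy to check numerically. A minimal sketch with [L] from Table 11.12 and [T] from Table 11.11; the product reproduces Table 11.13 up to rounding:

```python
import numpy as np

# Unrotated loadings [L] (Table 11.12) and transformation matrix [T] (Table 11.11).
L = np.array([
    [.785,  .178, -.241], [.790, -.124, -.394], [.830,  .210, -.224],
    [.724,  .256, -.145], [.464,  .635,  .222], [.632,  .492,  .177],
    [.832, -.207, -.110], [.805, -.292, -.276], [.758, -.121,  .185],
    [.777, -.092,  .431], [.840, -.215,  .136], [.690, -.370,  .437],
])
T = np.array([[ 0.690,  0.584, 0.427],
              [-0.134, -0.477, 0.869],
              [-0.711,  0.657, 0.251]])

L_rot = L @ T  # should reproduce the rotated component matrix (Table 11.13)
print(np.round(L_rot[0], 3))  # MSLT row -> [0.689 0.215 0.429], i.e. .690/.215/.429
```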

Component Matrix
        Component
      1     2     3
MSLT  .785  .178 -.241
MLLT  .790 -.124 -.394
MLCT  .830  .210 -.224
ZDTN  .724  .256 -.145
MPHD  .464  .635  .222
MAWR  .632  .492  .177
MRCT  .832 -.207 -.110
MRCL  .805 -.292 -.276
MVST  .758 -.121  .185
MPVT  .777 -.092  .431
MGED  .840 -.215  .136
MGGT  .690 -.370  .437

Table 11.12: Component Matrix

Step 4: Ignoring Small Loadings

Deciding where to draw the line of demarcation between dropping and retaining a loading is often more of an art than a science. However, 0.4 is a good rule of thumb, and you should typically adjust the critical number upward or downward starting from 0.4. When the loadings smaller than 0.4 are removed, we get Table 11.15 and Table 11.16, respectively, for unrotated and rotated components. Before the rotation, all the variables except for MPHD load mainly and often exclusively on Component 1. You will see in the next step that the components are easier to interpret after varimax rotation. Recall from p.195 that varimax rotation tries to maximize the variance of each of the components, so that the total amount of variance accounted for is redistributed over the three extracted components.

Step 5: Interpreting the Meaning of Each Component

The rotated component matrix (Table 11.16) will be used to interpret the meaning of each component. We will use both the variables with the highest loadings and the common characteristics among the variables with significant loadings (≥ 0.4) for this purpose.


Rotated Component Matrix
        Component
      1     2     3
MSLT  .690  .215  .429
MLLT  .842  .262  .131
MLCT  .704  .237  .480
ZDTN  .569  .205  .495
MPHD  .077  .114  .805
MAWR  .244  .251  .742
MRCT  .681  .512  .148
MRCL  .791  .429  .021
MVST  .408  .622  .265
MPVT  .243  .781  .360
MGED  .512  .682  .205
MGGT  .215  .867  .083

Table 11.13: Rotated Component Matrix

Total Variance Explained by Three Components
                 Initial Eigenvalues         Extraction Sums of Sq. Loadings   Rotation Sums of Sq. Loadings
Component   Total   % of Var.   Cum. %      Total   % of Var.   Cum. %        Total   % of Var.   Cum. %
 1          6.769    56.408     56.408      6.769    56.408     56.408        3.693    30.779     30.779
 2          1.136     9.467     65.875      1.136     9.467     65.875        2.948    24.563     55.342
 3           .883     7.358     73.233       .883     7.358     73.233        2.147    17.891     73.233
 4           .651     5.424     78.657
 5           .563     4.695     83.352
 6           .442     3.687     87.039
 7           .371     3.095     90.133
 8           .318     2.649     92.782
 9           .271     2.257     95.039
10           .249     2.071     97.110
11           .184     1.531     98.641
12           .163     1.359    100.000

Table 11.14: Total Variance Explained


Component Matrix
        Component
      1     2     3
MSLT  .785
MLLT  .790
MLCT  .830
ZDTN  .724
MPHD  .464  .635
MAWR  .632  .492
MRCT  .832
MRCL  .805
MVST  .758
MPVT  .777        .431
MGED  .840
MGGT  .690        .437

Table 11.15: Component Matrix (≥ 0.4)

Rotated Component Matrix
        Component
      1     2     3
MSLT  .690        .429
MLLT  .842
MLCT  .704        .480
ZDTN  .569        .495
MPHD              .805
MAWR              .742
MRCT  .681  .512
MRCL  .791  .429
MVST  .408  .622
MPVT        .781
MGED  .512  .682
MGGT        .867

Table 11.16: Rotated Component Matrix (≥ 0.4)


Variables with high loadings are called indicator variables for the component.

• Component 1 has the highest loadings for long listening and reading. Hence, we conclude this component is about “processing for meaning”.

• Component 2 has the highest loadings for the gap-filling test and the productive vocabulary test. However, this alone is not enough to identify the meaning of Component 2. It is more useful in this particular case to look for the common characteristics among the variables, rather than to rely on indicator variables. Because all the tests contain a written text, we associate this component with “written text intake” and call it the “visual front-end”.

• The highest loadings for Component 3 are exhibited by phoneme distinction and aural word recognition. In addition, all the tests with significant loadings are listening tests. So, we identify this component as “sound capturing” or the “aural front-end”.

Step 6: Identifying the Components Behind Each Variable

Having identified the nature of each component, we can now go backwards and identify what skills are behind each variable/test. For example, our vocabulary test MVST has significant loadings on Component 1 (“processing for meaning”) and Component 2 (“written text intake”, the “visual front-end”), with a higher loading on Component 2. Therefore, the vocabulary test MVST seems to require as much ability for written text intake as vocabulary knowledge. Similar analyses are possible for other variables, and this often helps us detect a hidden factor operating behind the scenes, such as the “visual front-end” for the vocabulary test.

11.6 Factor Analysis

11.6.1 Differences from Principal Components Analysis

Factor analysis differs from principal components analysis in at least two ways.

Differences between Factor Analysis and Principal Components Analysis

1. Only a part of the variance in each variable is explained by underlying factors. This means that the specific and error variance, collectively called the unique variance, of a particular item are separated out, and the factors will only account for the common variance. Necessarily, we have h² < 1.


2. Factors are not simple linear combinations of the variables, but hypothetical constructs estimated from the variables. We will employ an iterative procedure that continues until a satisfactory convergence of the two sets of input and output communalities is achieved.

Here is a brief sketch of the procedure.

11.6.2 How Does Factor Analysis Work?

We first have to decide what to put in the main diagonal of the correlation matrix; i.e., we need to find the numbers, smaller than one, that replace the 1s on the main diagonal of the correlation matrix. This is due to the first difference in Section 11.6.1, which claims h² < 1. By doing so, we eliminate the variance due to unique factors.14

Then, we need to come up with an algorithm to find the factors.

In a popular method known as principal axis factoring (PAF)15, one first places on the main diagonal the square of the multiple correlation coefficient (R²) resulting from the multiple regression of each variable on all the other variables. For example, the diagonal elements (R²) of the resulting initial matrix for the data of Section 11.5 are presented as Initial Communalities in Table 11.17. If we sum the Initial Communalities, we get

.605 + .696 + .722 + .514 + .278 + .490 + .684 + .713 + .581 + .638 + .738 + .589 = 7.248. (11.42)

As there are 12 standardized observed variables, the amount of total variance is 12. Therefore, this initial model reduced the variance to be accounted for from 12 to 7.248 (about 60.4%) by eliminating unique variance.
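The initial communalities do not require running 12 separate regressions: the squared multiple correlation of variable i on the others equals 1 − 1/(R⁻¹)ᵢᵢ, where R is the correlation matrix. A minimal sketch, assuming R is the 12×12 correlation matrix of Table 11.2 entered as a NumPy array:

```python
import numpy as np

def initial_communalities(R):
    """Squared multiple correlations: SMC_i = 1 - 1/(R^{-1})_{ii}."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# With R from Table 11.2, this reproduces the "Initial" column of
# Table 11.17: [.605, .696, .722, .514, .278, .490, ...].
```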

Next, starting with this modified correlation matrix, factors in principal axis factoring are extracted successively as in principal components analysis, and they are orthogonal to one another. Because our matrix is still real and symmetric, Theorem 11.1 still applies, and we can employ the same procedure that was used for principal components analysis. The first factor is extracted such that it accounts for the maximum amount of common variance.

14Recall that a variable’s communality is the sum of squared loadings across components or factors and is the amount of the variable’s variance accounted for by the components or factors. In factor analysis, we only deal with common factors, and so a variable’s communality in factor analysis is the amount of its common variance.

15This is also known as principal axis factor analysis and principal factor analysis.


Communalities
        Initial   Extraction
MSLT    .605      .763
MLLT    .696      .857
MLCT    .722      .923
ZDTN    .514      .646
MPHD    .278      .477
MAWR    .490      .669
MRCT    .684      .777
MRCL    .713      .875
MVST    .581      .750
MPVT    .638      .769
MGED    .738      .935
MGGT    .589      .783

Table 11.17: Initial and Final Communalities: Principal Axis Factoring

Then, the second factor is extracted from the residual correlation matrix obtained by factoring out the contribution of the first factor. The process is repeated until all common variance is explained. The factors extracted are fewer in number than for principal components analysis because the final factor would account only for unique variance. This is clear from Table 11.19: you can see that Factor 12 does not account for any common variance, as this factor derives from specific and error variances, which are not shared. Also note that the cumulative total explained variance is less than 100%, unlike in principal components analysis; this is a direct consequence of nonzero unique variance.

When one round of computation is completed, the resulting communalities are substituted into the main diagonal of the correlation matrix, and the above process is repeated to obtain a new set of communalities to be placed on the main diagonal of the correlation matrix. This iterative procedure terminates when the changes in communalities from one iteration to the next become negligible, i.e., when the iterative procedure converges. The resulting communalities are listed as Extraction in Table 11.17. These extractions form the main diagonal of the reproduced correlation matrix presented as Table 11.18. The sum of the communalities is now

.763 + .857 + .923 + .646 + .477 + .669 + .777 + .875 + .750 + .769 + .935 + .783 = 9.224, (11.43)


and the final converged model explains (9.224/12) × 100 = 76.9% of the variance as shared common variance. Naturally, the off-diagonal entries of the matrix in Table 11.18 are the same as those in the correlation matrix of Table 11.2. Only the diagonal 1s have been replaced.

             1     2     3     4     5     6     7     8     9     10    11    12
1. MSLT    .763  .621  .664  .576  .409  .489  .584  .645  .571  .518  .556  .379
2. MLLT    .621  .857  .705  .507  .295  .353  .666  .756  .555  .457  .603  .457
3. MLCT    .664  .705  .923  .659  .375  .610  .658  .583  .570  .544  .616  .408
4. ZDTN    .576  .507  .659  .646  .386  .478  .539  .507  .425  .470  .564  .400
5. MPHD    .409  .295  .375  .386  .477  .417  .248  .223  .297  .335  .265  .249
6. MAWR    .489  .353  .610  .478  .417  .669  .435  .321  .426  .506  .483  .284
7. MRCT    .584  .666  .658  .539  .248  .435  .777  .719  .573  .611  .748  .542
8. MRCL    .645  .756  .583  .507  .223  .321  .719  .875  .581  .527  .669  .544
9. MVST    .571  .555  .570  .425  .297  .426  .573  .581  .750  .655  .538  .574
10. MPVT   .518  .457  .544  .470  .335  .506  .611  .527  .655  .769  .691  .637
11. MGED   .556  .603  .616  .564  .265  .483  .748  .669  .538  .691  .935  .689
12. MGGT   .379  .457  .408  .400  .249  .284  .542  .544  .574  .637  .689  .783

Table 11.18: Full Correlation Matrix with Converged Communalities on the Diagonal

11.6.3 Interpreting and Using the Computer Output

The PAF equivalent of the Component Matrix is called the Factor Matrix, presented as Table 11.20. As stated already, we have only 11 factors in principal axis factoring compared with 12 components in principal components analysis, with the 12th factor accounting for unique variances. The communalities are calculated in exactly the same way as with principal components analysis. For example, we have

the communality of MSLT = the sum of squared loadings for MSLT
= (.768)² + (.216)² + (−.143)² + (.156)² + (.103)² + (−.195)²
+ (−.099)² + (−.140)² + (−.024)² + (.014)² + (−.045)² = .763; (11.44)

and likewise for the other variables.

In PAF, we cannot reproduce the correlation matrix exactly due to the unique variance. However, the sum of the products of the loadings for two variables still provides a good estimate of the correlation coefficient most of the time. For example, the correlation coefficient between MSLT and MLLT can be estimated from Table 11.20 as follows.

Corr(MSLT, MLLT) = (.768)(.787) + (.216)(−.036) + (−.143)(−.408) + (.156)(.072)
+ (.103)(.007) + (−.195)(.164) + (−.099)(.177) + (−.140)(−.039)
+ (−.024)(−.043) + (.014)(−.036) + (−.045)(.045) = .622¹⁶ (11.45)
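In code, the estimate is a dot product of two loading rows of Table 11.20. A minimal sketch; as footnote 16 explains, the three-decimal loadings give .621, and full precision is needed to reach .622:

```python
import numpy as np

# Loadings of MSLT and MLLT on the 11 factors (Table 11.20).
mslt = np.array([.768,  .216, -.143, .156, .103, -.195,
                 -.099, -.140, -.024, .014, -.045])
mllt = np.array([.787, -.036, -.408, .072, .007,  .164,
                 .177, -.039, -.043, -.036,  .045])

# Estimated correlation = sum of products of the loadings.
print(round(float(mslt @ mllt), 3))  # -> 0.621 (0.622 at full precision)
```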


Total Variance Explained by Factor Analysis
                 Initial Eigenvalues                 Extraction Sums of Squared Loadings
Factor      Total   % of Variance   Cumulative %   Total   % of Variance   Cumulative %
 1          6.769       56.408          56.408     6.565       54.709          54.709
 2          1.136        9.467          65.875      .834        6.954          61.663
 3           .883        7.358          73.233      .673        5.606          67.269
 4           .651        5.424          78.657      .366        3.046          70.315
 5           .563        4.695          83.352      .257        2.141          72.457
 6           .442        3.687          87.039      .197        1.644          74.101
 7           .371        3.095          90.133      .134        1.119          75.219
 8           .318        2.649          92.782      .074         .618          75.838
 9           .271        2.257          95.039      .063         .527          76.365
10           .249        2.071          97.110      .032         .271          76.635
11           .184        1.531          98.641      .025         .210          76.845
12           .163        1.359         100.000

Table 11.19: Total Variance Explained by Factor Analysis

Factor Matrix
          Factor
      1     2     3     4     5     6     7     8     9     10    11
MSLT  .768  .216 -.143  .156  .103 -.195 -.099 -.140 -.024  .014 -.045
MLLT  .787 -.036 -.408  .072  .007  .164  .177 -.039 -.043 -.036  .045
MLCT  .833  .322 -.130 -.114 -.210  .201 -.077 -.026  .016 -.013 -.067
ZDTN  .696  .235 -.010 -.141  .162  .057 -.211  .076 -.004 -.022  .078
MPHD  .432  .357  .185  .151  .284  .057  .124  .043 -.055  .023 -.019
MAWR  .609  .405  .260 -.090 -.116 -.093  .139  .027  .125  .018  .018
MRCT  .818 -.146 -.107 -.144 -.085 -.091  .022  .129 -.114  .087 -.031
MRCL  .804 -.239 -.338  .068  .085 -.137  .010  .089  .129 -.038 -.018
MVST  .741 -.101  .120  .355 -.193  .008 -.057 -.018  .003  .059  .069
MPVT  .762 -.126  .358  .073 -.095 -.073  .006  .035 -.082 -.126 -.018
MGED  .845 -.245  .132 -.336  .059 -.064  .053 -.134 -.011  .013  .037
MGGT  .679 -.413  .284  .050  .135  .200 -.036 -.008  .073  .033 -.047

Table 11.20: Full Factor Matrix



And this is how the Reproduced Correlation Matrix (Table 11.22) was obtained. You can see in Table 11.22 that the residuals are larger than those in Table 11.7. However, the convergence criterion for factor analysis is in terms of communalities and not in terms of the reproduced correlation matrix. The default criterion in SPSS is that the difference between the input and output communality for each variable is smaller than .001 [Pett et al., 2003][p.104]. When this convergence is not achieved, SPSS does not display any solution. Hence, you can be sure in this case that convergence was indeed reached.

The diagonal elements of the factor score covariance matrix (Table 11.21) are the R² between the factor and the observed variables. This is an indicator of the internal consistency of the solution. The cutoff is .70, and values smaller than .70 are regarded as undesirable. If there is no rotation, or if we apply an orthogonal rotation, the factor score covariance matrix is supposed to be diagonal. Though the matrix shown in Table 11.21 is not diagonal, strictly speaking, one can see that the off-diagonal elements are very small.

Factor Score Covariance Matrix
Factor    1      2      3      4      5      6      7      8      9      10     11
 1      .974  -.011  -.014  -.023  -.015   .007   .000  -.017   .006  -.005  -.005
 2     -.011   .792  -.028   .009  -.084   .048  -.049  -.005  -.008  -.008  -.030
 3     -.014  -.028   .779  -.037  -.009  -.017  -.032  -.035  -.021   .011   .008
 4     -.023   .009  -.037   .698  -.006   .002  -.019   .076   .016  -.016  -.004
 5     -.015  -.084  -.009  -.006   .499  -.071   .021  -.024   .025  -.001   .028
 6      .007   .048  -.017   .002  -.071   .540  -.004  -.013  -.005  -.004  -.024
 7      .000  -.049  -.032  -.019   .021  -.004   .353  -.020  -.024  -.013   .049
 8     -.017  -.005  -.035   .076  -.024  -.013  -.020   .300   .024  -.008  -.029
 9      .006  -.008  -.021   .016   .025  -.005  -.024   .024   .233  -.009  -.020
10     -.005  -.008   .011  -.016  -.001  -.004  -.013  -.008  -.009   .127   .005
11     -.005  -.030   .008  -.004   .028  -.024   .049  -.029  -.020   .005   .114

Table 11.21: Factor Score Covariance Matrix

16Note that you have to carry at least 5 to 6 digits after the decimal point in order to get .622. For example, the first term (.768)(.787) should actually be (0.768416774848077)(0.786715179038231).


Reproduced Correlations
Reproduced   1     2     3     4     5     6     7     8     9     10    11    12
1. MSLT    .763  .622  .663  .577  .409  .489  .584  .644  .571  .518  .556  .380
2. MLLT    .622  .857  .704  .508  .295  .354  .666  .755  .555  .457  .602  .458
3. MLCT    .663  .704  .923  .659  .375  .609  .658  .584  .570  .544  .617  .408
4. ZDTN    .577  .508  .659  .646  .386  .478  .539  .506  .425  .470  .564  .401
5. MPHD    .409  .295  .375  .386  .477  .416  .248  .224  .298  .335  .266  .249
6. MAWR    .489  .354  .609  .478  .416  .669  .436  .321  .426  .506  .483  .285
7. MRCT    .584  .666  .658  .539  .248  .436  .777  .718  .572  .611  .747  .543
8. MRCL    .644  .755  .584  .506  .224  .321  .718  .875  .581  .526  .670  .543
9. MVST    .571  .555  .570  .425  .298  .426  .572  .581  .750  .655  .539  .574
10. MPVT   .518  .457  .544  .470  .335  .506  .611  .526  .655  .769  .691  .637
11. MGED   .556  .602  .617  .564  .266  .483  .747  .670  .539  .691  .935  .688
12. MGGT   .380  .458  .408  .401  .249  .285  .543  .543  .574  .637  .688  .783

Residual       1      2      3      4        5      6      7        8      9      10       11     12
1. MSLT             -.001   .001   .000     .001  -.001   .000     .001   .000   .000     .001  -.001
2. MLLT     -.001           .001   .000     .001  -.001   .000     .001   .001   .000     .001  -.001
3. MLCT      .001   .001           .000    -.001   .001   .000    -.001  -.001   .000    -.001   .001
4. ZDTN      .000   .000   .000             .000   .000   .000     .000   .000 -5.089E-5  .000   .000
5. MPHD      .001   .001  -.001   .000             .001   .000    -.001  -.001   .000    -.001   .001
6. MAWR     -.001  -.001   .001   .000      .001           .000    .001   .001   .000     .001  -.001
7. MRCT      .000   .000   .000   .000      .000   .000            .000   .000 -5.271E-5  .000   .000
8. MRCL      .001   .001  -.001   .000     -.001   .001   .000           -.001   .000    -.001   .001
9. MVST      .000   .001  -.001   .000     -.001   .001   .000    -.001         8.855E-5 -.001   .001
10. MPVT     .000   .000   .000 -5.089E-5   .000   .000 -5.271E-5  .000 8.855E-5          .000   .000
11. MGED     .001   .001  -.001   .000     -.001   .001   .000    -.001  -.001   .000            .001
12. MGGT    -.001  -.001   .001   .000      .001  -.001   .000     .001   .001   .000     .001

Table 11.22: Reproduced Full Correlation Matrix (PAF)


Exercises

Click to go to solutions.

1. Judging from Table 11.23, how many components were retained?

Communalities
     Initial   Extraction
H    1.000     1.000
I    1.000     1.000
J    1.000     1.000
K    1.000     1.000
L    1.000     1.000
M    1.000     1.000
N    1.000     1.000
O    1.000     1.000
P    1.000     1.000
Q    1.000     1.000
R    1.000     1.000
S    1.000     1.000

Table 11.23: Communalities for Full Rotated Analysis

2. Suppose you are trying to conduct a principal components analysis on variables x1, x2, . . ., xn. What is the total variance you are aiming to account for? Why?

3. According to the scree plot in Figure 11.7, how many components should be retained?

4. Judging from the partial list of eigenvalues in descending order (Table 11.24), how many components should be retained?

List of Eigenvalues
Components   1     2     3     4     5     6
Eigenvalues  5.0   1.5   1.1   .46   .15   .083

Table 11.24: List of Eigenvalues

5. If we retain the first 3 components from Table 11.24, how much variance in the observed variables is explained?

6. Express the observed variable X1 as a linear combination of the three components, C1, C2, and C3 (Table 11.25).


[Figure: scree plot of eigenvalues against component number, components 1–12.]

Figure 11.7: Scree Plot

Rotated Component Matrix
      Component
      1     2     3
X1    .077  .114  .805
X2    .244  .251  .742
X3    .681  .512  .148
X4    .791  .429  .021
X5    .408  .622  .265

Table 11.25: Rotated Component Matrix


7. Table 11.26 shows loadings on three components. The first component C1 is general mathematical ability, C2 represents verbal skills in one’s mother tongue, and C3 is physical ability. The observed variables are the average placement success rates for manufacturers, the service industry, and the public sector. What can you conclude about the relations between these three skills and successful placement? You can only draw a qualitative conclusion.

Component Matrix
               Component
               1     2     3
Manufacturing  .70   .20   .30
Service        .10   .70   .50
Public         .50   .50   .50

Table 11.26: Component Matrix

8. Table 11.27 shows a component score coefficient matrix, and Table 11.28 shows the standardized scores Z1, Z2, . . ., Z5 for the fourth subject. Compute the component scores c⁴₁ and c⁴₅ for the fourth subject.

Component Score Coefficient Matrix
      Component
      1     2      3      4       5
Z1    .264  -.622   .367    .498    .939
Z2    .297  -.277   .097    .298  -1.357
Z3    .281  -.142  -.224  -1.247    .172
Z4    .255   .424  -.951    .513    .297
Z5    .226   .789   .794   -.003    .136

Table 11.27: Component Score Coefficient Matrix

Standardized Test Scores
                  Z1      Z2      Z3      Z4      Z5
Fourth Subject   -1.835  -1.143  -0.092  -1.060  -0.687

Table 11.28: Standardized Test Scores for the Fourth Subject


9. Consider a principal components analysis with six variables: X1, X2, X3, X4, X5, and X6.

(a) How many components would you retain based on the scree plot given in Figure 11.8?

[Figure: scree plot of eigenvalues against component number, components 1–6.]

Figure 11.8: Scree Plot

(b) If all six components are retained, we get the component matrix presented as Table 11.29.

i. Express X1 as a linear combination of C1 through C6.


ii. What proportion of the variance in X1 is accounted for by C1 and C2? Show the formula to compute the answer without actually carrying out the computation.

(c) Give the formula to compute the correlation coefficient between X1 and X2 using the entries of Table 11.29. You do not need to carry out the actual computation.

Component Matrix
        Component
      1     2     3     4     5     6
X1    .845 -.148 -.273 -.130  .362  .202
X2    .802  .224 -.271  .447 -.159  .086
X3    .579  .755  .248 -.118  .118 -.074
X4    .858 -.075  .029 -.332 -.362  .124
X5    .677 -.324  .628  .191  .065  .048
X6    .876 -.238 -.165 -.029  .023 -.383

Table 11.29: Full Component Matrix


Chapter 12

Structural Equation Modeling

Structural equation modeling (SEM) is a statistical technique for testing and estimating causal relations. The procedure consists of a combination of data and qualitative causal assumptions, and the output is a graphic representation of the interrelationships among the variables accompanied by various data tables1. When SEM is used for theory testing, it is called a confirmatory analysis, and when it is used for theory development, we conduct an exploratory analysis2.

One of the distinct differences between principal components analysis and structural equation modeling is that one starts with observed data alone in principal components analysis but with both the data and causal assumptions in structural equation modeling. However, principal components analysis is often helpful in arriving at potential causality assumptions.

The first step of structural equation modeling is a hypothesis represented by a causal model. There are two kinds of variables in SEM, known as latent variables and indicator variables. The indicator variables are the observed or directly measured variables, such as MSLT through MGGT in the first column of Table 11.16 on p.200. The latent variables are hypothesized to exist but are not directly measurable, such as the three components C1, C2, and C3 in the second row of Table 11.16. In the causal model, latent variables are assumed to cause the indicator variables to take the measured scores. It is customary to use an oval for a latent variable, a rectangle for an indicator variable, and an arrow to represent causality, as shown in Figure 12.1.

1We will discuss only the most typical concepts and uses of structural equation modeling.

2Structural equation modeling is all but impossible without some software package. Major packages include AMOS, LISREL, and EQS.


Figure 12.1: The Relation Between Latent and Indicator Variables

This is the simplest example of a structural equation model. But the diagram would be far more complicated for our 12-variable model, for example. The full refined model is shown in Figure 12.2.

12.1 Model Identification

When your observed data are not consistent with the causality assumptions, your software warns you, either by indicating that the model is not identifiable if the computation does not converge, or by indicating that it does not fit the data by using the χ²-statistic given below.

χ²_ML = (N − 1)[trace(SΣ⁻¹) − p + ln(|Σ|) − ln(|S|)]

In this formula, N is the sample size, p is the number of observed variables, S is the sample covariance matrix, Σ is the fitted model covariance matrix, and | · | means the determinant3. The null hypothesis is that the predicted matrix Σ has the specified model structure, and the alternative hypothesis is that Σ is unconstrained. In other words, χ² measures the discrepancy between the sample covariance matrix and the implied model covariance matrix, which is computed from the model structure and the model parameters. The degrees of freedom df is the difference between the number of elements in the lower half of the covariance matrix and the number of estimated parameters:

3The subscript “ML” signifies maximum likelihood, which is a very important concept but is beyond the scope of this book.


Figure 12.2: The Full Refined Model with 12 Indicators


df = p(p + 1)/2 − (the number of estimated parameters).
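A minimal sketch of the statistic and its degrees of freedom in Python with NumPy; S, Sigma, N, and the parameter count are placeholders to be supplied from your own data and fitted model:

```python
import numpy as np

def chi2_ml(S, Sigma, N):
    """chi^2_ML = (N-1) [trace(S Sigma^{-1}) - p + ln|Sigma| - ln|S|]."""
    p = S.shape[0]
    stat = (np.trace(S @ np.linalg.inv(Sigma)) - p
            + np.log(np.linalg.det(Sigma)) - np.log(np.linalg.det(S)))
    return (N - 1) * stat

def degrees_of_freedom(p, n_estimated_params):
    """df = p(p+1)/2 - (number of estimated parameters)."""
    return p * (p + 1) // 2 - n_estimated_params
```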

For example, the model given in Figure 12.3 does not fit the data while that in Figure 12.4 does. As the difference between the two models is the inclusion/exclusion of GED (Grammatical Error Detection Test), this may imply that grammatical knowledge possessed by Japanese college students is not contributing to their overall proficiency in English. Grammatical knowledge measured by GED does not play a role consistent with the rest of the test battery in measuring the overall proficiency4.

Figure 12.3: The Full Model with GED

4We should note, however, that our test battery emphasizes fluency backed by procedural knowledge and not declarative knowledge. As grammatical knowledge tends to depend significantly on declarative knowledge, this difference in focus may explain the insignificant role played by grammar knowledge.


Figure 12.4: The Full Model without GED


Exercises

Click to go to solutions.

1.


Chapter 13

Rasch Analysis

Having learned many techniques to transform the raw data, we have now come full circle back to the raw data and will deploy the Rasch method to pre-process them before subjecting them to further analyses. Two data types suitable for this treatment are dichotomous and polytomous data. In Chapter 12, our multi-variable example contained 12 multi-item tests. For each item/question, one either gets it right or wrong. This kind of data is called dichotomous data, and we usually denote right answers by 1 and wrong ones by 0. On the other hand, in a typical survey, the choices for each item may cover some range, for which successively higher integers indicate increasing levels of agreement, attainment, or competence. See the example below.

Q: Do you agree with the following statement? “Freedom is more important than wealth.”
A: 1. Strongly disagree  2. Disagree  3. Neither agree nor disagree  4. Agree  5. Strongly agree

Answers to questions like this form a polytomous dataset.

The Rasch model rescales and linearizes raw data to an interval scale on which item difficulty and person ability estimates are placed. Conventional statistics, including such fundamental parameters as the mean and standard deviation, are not legitimate for non-interval scales (Wright & Mok, 2000, p. 85). In order for measurements to be useful, they need to be linear (Wright & Mok, 2000, p. 86). In this chapter, we will only consider the dichotomous Rasch model and learn the basic ideas behind the model. Actual analyses are conducted with such software as Winsteps, but knowing the gist of the theory and principles behind it increases your confidence as a user of the software, contributing to an enhanced understanding of this technique as well as a higher comfort level.



13.1 Dichotomous Rasch Analysis1

The goal of Rasch modeling is to compute calibrated person abilities and item difficulties and generate an “item-person map” as shown in Figure 13.1. One of the hallmarks of Rasch analysis is to place the abilities of the examinees (called person abilities) and the levels of difficulty of the questions (called item difficulties) along the same vertical axis, as you can see in Figure 13.1. To the right of the vertical axis are the items/questions labeled “L1” through “L40”, and to the left are the examinees, each represented by one “x”. The capital letters “M”, “S”, and “T” to the immediate right of the vertical axis signify the mean, one standard deviation away from the mean, and two standard deviations away from the mean for item difficulty. Likewise for “M”, “S”, and “T” to the immediate left of the vertical axis for person ability.

In the following, we will not discuss the details of the mathematical and/or numerical procedures used to construct the Rasch model. Such is not a productive use of time and space for ordinary users of Rasch analysis. Instead, we will study the guiding concepts and the properties of the consequent model.

13.1.1 Ability and Difficulty: Symmetry and Common Scale

There are several guiding concepts behind the Rasch model.

Guiding Concept 1

1. The Rasch model places the persons and the items on the same footing and preserves symmetry between them, enabling the use of one unified scale for both. Just as a person’s ability is measured by his/her ability to answer items/questions correctly, an item’s difficulty level is measured by its “ability” to make the examinees answer the item incorrectly.

We first need a way to equate item difficulty with person ability, so that the common vertical axis can be drawn, and the positions of “M”, “S”, and “T” can be found on that common axis. To that end, we express the probability that person m answers item i correctly by Pmi.

1Our presentation and proofs in this section are largely heuristic and not necessarily exact or detailed. However, this is sufficient to understand how Rasch analysis generally works.


Figure 13.1: Item-Person Map


Then, 0 ≤ Pmi ≤ 1, and 1 − Pmi is the probability that person m answers item i incorrectly. We say person m’s ability matches the difficulty level of item i when he/she has as much chance of getting it right as getting it wrong, that is, when

Pmi = 1 − Pmi or Pmi = 0.5.

13.1.2 Rasch Measures as Relative Measures

Guiding Concept 2

2. Rasch person ability and item difficulty measures are relative measures in four senses, of which (1) and (2) are mirror images of each other, and (3) and (4) are also mirror images.

(1) Person abilities can be measured only relative to items with known difficulty levels, via Pmi.

(2) Item difficulties can be measured only relative to persons with known abilities. As we will see later, this is by way of 1 − Pmi.

(3) Only the difference in ability between two persons can be measured. There is no means of measuring the “intrinsic” ability of a person; i.e., a person ability measure is only relative to other persons.

(4) Only the difference in difficulty between two items can be measured. There is no means of measuring the “intrinsic” difficulty of an item; i.e., an item difficulty measure is only relative to other items.

Instead of intrinsic person ability, which should not depend on the item used to measure it, we will introduce the concept of “the ability of person m in reference to item i”, denoted by Bmi and defined as the natural logarithm of the “correct” to “incorrect” or “success” to “failure” ratio.2 3

Bmi = ln( Pmi / (1 − Pmi) ) (13.1)

This measure of person ability depends on which item is used for the measurement because Pmi ordinarily differs from item to item. In other words, Bmi is not invariant under a change of the reference item.
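Equation (13.1) is simply the logit of the success probability. A minimal sketch:

```python
import numpy as np

def ability_logit(p_correct):
    """B_mi = ln(P_mi / (1 - P_mi)), the log odds of a correct answer."""
    return np.log(p_correct / (1.0 - p_correct))

print(round(float(ability_logit(0.5)), 3))   # matched person and item -> 0.0
print(round(float(ability_logit(0.75)), 3))  # -> 1.099: ability above difficulty
```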

2It is true that Pmi = 1 is theoretically possible, making the denominator 0. However, such a possibility can be ignored for all practical purposes.

3 If P is a probability, the quotient P/(1 − P) is called the odds, and the natural logarithm of the odds, ln(P/(1 − P)), is called the log odds or logit. Logit plays a central role in a popular statistical analysis called logistic regression.



13.1.3 Invariance of Differences: An Interval Scale

However, Bmi − Bni, (the ability of person m in reference to item i) − (the ability of person n in reference to item i), is invariant under the change of the reference item i, as stipulated in Guiding Concept 3.

Guiding Concept 3

3. In the ideal model, consistency among the person measures in reference to different items is secured in the following sense.

(a) The person ability Bmi itself depends on which item i is used for the measurement.

(b) However, the difference between the ability measures for two persons, m and n, is invariant under switching from one reference item i to another item j.

Bmi − Bni = Bmj − Bnj

(c) This means that the origin of measurement changes as the reference item changes, but the difference/distance between the ability measures of any pair of persons does not depend on which item is used for the measurement.

(d) A good way to understand this is to imagine two points, corresponding to two persons, and a number line through them. Depending on the item, the position of the origin, and hence the coordinates of the points, change. But the distance between the two points remains the same.

As a corollary to Guiding Concept 3, we have an important property of the Rasch model called the linearity of measurement scale. Person abilities and item difficulties as defined in Rasch analysis constitute an interval scale with linearity. We will see all these shortly.

13.1.4 The Reference Pair, Abilities, and Difficulties

According to Guiding Concept 3(b), we have the following relation for any persons m and n and any items i and j.

ln(Pmi/(1 − Pmi)) − ln(Pni/(1 − Pni)) = ln(Pmj/(1 − Pmj)) − ln(Pnj/(1 − Pnj)) (13.2)



In order to measure the abilities of all the persons in a consistent and unified manner, we will consider a reference pair of person 0, denoted p0, and item 0, denoted i0, such that the ability of p0 equals the difficulty level of i0;4 i.e.,

P00 = 0.5 or equivalently P00/(1 − P00) = 1.

With this choice of the reference pair (p0, i0), the ability of an arbitrary person n (denoted by pn) is given by

ln(Pn0/(1 − Pn0)).

How about the difficulty level of item i? According to Guiding Concept 1, item difficulty is the item’s “ability” to make the examinees answer the item incorrectly. Consider item i and its level of difficulty in reference to person m. Because the probability that person m answers item i correctly is Pmi, the probability that person m misses item i, or the probability that item i succeeds in making person m answer incorrectly, is 1 − Pmi. Noting that Pmi is the probability that item i fails to make person m answer incorrectly, we get

ln((1 − Pmi)/Pmi)

as the natural logarithm of the “success” to “failure” ratio for item i in reference to person m, and this is the definition of the level of difficulty for item i in reference to person m. Again, this value depends on the choice of the reference person m. So, we use (p0, i0) as before and obtain

ln((1 − P0j)/P0j)

as the difficulty level of an arbitrary item j. It is customary to denote person ability by B and item difficulty by D. Let us summarize the results below.

Person Ability: Bn = ln(Pn0/(1 − Pn0))        Item Difficulty: Dj = ln((1 − P0j)/P0j)

We can see here how the symmetry between person ability and item difficulty arises in the Rasch model sense (Guiding Concept 1).

4 There may not be a pair that satisfies this condition in the data set. However, we can always consider a hypothetical person p0 whose ability equals the level of difficulty of the item, any item, chosen as i0.



13.1.5 Pmi in Terms of Bm − Di

Let us now express Pmi in terms of Bm and Di using the equality 13.2. We have

ln(Pmi/(1 − Pmi)) − ln(Pni/(1 − Pni)) = ln(Pmj/(1 − Pmj)) − ln(Pnj/(1 − Pnj))

for any foursome (m, n, i, j). Set n = 0 and j = 0 to obtain

ln(Pmi/(1 − Pmi)) − ln(P0i/(1 − P0i)) = ln(Pm0/(1 − Pm0)) − ln(P00/(1 − P00))
= ln(Pm0/(1 − Pm0)) − ln(0.5/(1 − 0.5)) = ln(Pm0/(1 − Pm0)) − ln 1 = ln(Pm0/(1 − Pm0))

=⇒ ln(Pmi/(1 − Pmi)) = ln(Pm0/(1 − Pm0)) + ln(P0i/(1 − P0i)) = ln(Pm0/(1 − Pm0)) − ln((1 − P0i)/P0i) = Bm − Di

=⇒ Pmi/(1 − Pmi) = exp(Bm − Di)

=⇒ Pmi = (1 − Pmi) exp(Bm − Di) = exp(Bm − Di) − Pmi exp(Bm − Di)

=⇒ [1 + exp(Bm − Di)] Pmi = exp(Bm − Di).

Therefore,

Pmi = exp(Bm − Di)/(1 + exp(Bm − Di)). (13.3)

In words, this means that the probability of person m answering item i correctly is a function of the difference between the person ability and the item difficulty, Bm − Di, whose form is given by Equation 13.3. The point here is that it is only a function of the difference Bm − Di and does not depend on what the values of Bm and Di are. This is what is expected of a linear measure. Equation 13.3 indicates that the larger the difference Bm − Di is, the larger becomes the probability Pmi of person m getting item i right. It is easier to see from the following alternative form of Pmi. In fact, we can show that Pmi approaches 1 from below as Bm − Di tends to +∞.

Pmi = 1 − 1/(1 + exp(Bm − Di)) −→ 1 as Bm − Di −→ +∞ (13.4)



On the other hand, as Bm − Di gets smaller and smaller, approaching −∞, Pmi becomes smaller and smaller, approaching 0 from above.

Pmi = exp(Bm − Di)/(1 + exp(Bm − Di)) < exp(Bm − Di) =⇒ Pmi −→ 0 as Bm − Di −→ −∞ (13.5)

Note that 0 ≤ Pmi ≤ 1 is satisfied for all pairs (m, i), as is required of any probability.
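As a numerical illustration of Equation 13.3, here is a minimal Python sketch; it is not part of the original development, and the function name rasch_probability and the sample ability/difficulty values are our own illustrative choices.

    import math

    def rasch_probability(b_m, d_i):
        # Probability that a person of ability b_m answers an item of
        # difficulty d_i correctly, per Equation 13.3.
        return math.exp(b_m - d_i) / (1.0 + math.exp(b_m - d_i))

    # When ability matches difficulty, the probability is exactly 0.5.
    print(rasch_probability(1.0, 1.0))    # 0.5
    # The probability depends only on the difference b_m - d_i:
    print(rasch_probability(2.0, 1.0))    # ~0.731
    print(rasch_probability(0.5, -0.5))   # ~0.731, same difference
    # It tends to 1 as b_m - d_i -> +infinity and to 0 as it -> -infinity.
    print(rasch_probability(10.0, 0.0))   # ~0.99995
    print(rasch_probability(-10.0, 0.0))  # ~0.000045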



Exercises

Click to go to solutions.

1.




Appendices



Appendix A

Scales of Measurement

There are four scales of measurement of data you encounter in statistics. They are the “nominal scale”, “ordinal scale”, “interval scale”, and “ratio scale”, defined as below.

Nominal Scale a discrete classification of data, in which data are neither measured nor ordered but subjects are merely allocated to distinct categories: for example, a record of students’ course choices constitutes nominal data which could be correlated with school results

Ordinal Scale a scale on which data is shown simply in order of magnitude since there is no standard of measurement of differences: for instance, a squash ladder is an ordinal scale since one can say only that one person is better than another, but not by how much

Interval Scale a scale of measurement of data according to which the differences between values can be quantified in absolute but not relative terms and for which any zero is merely arbitrary: for instance, dates are measured on an interval scale since differences can be measured in years, but no sense can be given to a ratio of times

Ratio Scale a scale of measurement of data which permits the comparison of differences of values; a scale having a fixed zero value. The distances traveled by a projectile, for instance, are measured on a ratio scale since it makes sense to talk of one projectile traveling twice as far as another





Appendix B

Chebyshev’s Inequality and Markov’s Inequality

There is a more general and probabilistic statement of Chebyshev’s Theorem, which is also commonly known as Chebyshev’s Inequality.

Theorem B.1 (Chebyshev’s Inequality) For a random variable X with mean µ and variance σ², let k > 0 be an arbitrary positive number. Then,

P(|X − µ| ≥ kσ) ≤ 1/k².

In words, this states that the probability of finding a value x for the variable X such that |x − µ| ≥ kσ is less than or equal to 1/k². This is equivalent to

P(|X − µ| < kσ) = P(X ∈ (µ − kσ, µ + kσ)) ≥ 1 − 1/k².

While this applies to all k > 0, it does not give any new information when k ≤ 1, as this means 1 − 1/k² ≤ 0.

One way to prove this is by way of Markov’s Inequality.

Theorem B.2 (Markov’s Inequality) For any random variable X and a > 0,

P(|X| ≥ a) ≤ E(|X|)/a.

Proof of Markov’s Inequality: Let IE be the indicator variable of event E; that is, IE = 1 if E occurs and 0 if E does not. Hence, if E is the event such that |X| ≥ a




for some positive number a,

I_{|X|≥a} = 1 when |X| ≥ a, and 0 otherwise.

Hence,

aI_{|X|≥a} = a × 1 = a when |X| ≥ a, and a × 0 = 0 otherwise,

and we have aI_{|X|≥a} ≤ |X|.

Therefore,

E[aI_{|X|≥a}] ≤ E[|X|] =⇒ aE[I_{|X|≥a}] ≤ E[|X|].

Now, let f be the probability density function of X. Then,

E[I_{|X|≥a}] = ∫_{−∞}^{+∞} I_{|X|≥a} f(x) dx = ∫_{|X|≥a} f(x) dx = P(|X| ≥ a);

where ∫_{|X|≥a} means the integral is over the support1 of I_{|X|≥a}. Needless to say, what makes the second equality above valid is the fact that an indicator variable, including I_{|X|≥a}, is 1 on its support2. We have

aP(|X| ≥ a) ≤ E[|X|] =⇒ P(|X| ≥ a) ≤ E(|X|)/a.

Proof of Chebyshev’s Inequality: This is a direct application of Markov’s Inequality with a = kσ.

P(|X − µ| ≥ kσ) = E[I_{|X−µ|≥kσ}] = E[I_{[(X−µ)/(kσ)]²≥1}] ≤ E[((X − µ)/(kσ))²]
= (1/k²) · E[(X − µ)²]/σ² = (1/(k²σ²)) · σ² = 1/k²

1 The support of a function is the set of points where the function is not zero, or the closure of that set.

2 If you are concerned about the closure, just remember that one point included or excluded does not matter.
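As a sanity check of Chebyshev’s Inequality, here is a small Python simulation sketch of ours; the choice of an Exponential(1) variable (mean 1, standard deviation 1) is our own, made to show that the bound holds even for a markedly non-normal distribution.

    import random

    random.seed(0)
    mu, sigma = 1.0, 1.0    # an Exponential(1) variable has mean 1 and sd 1
    n = 100_000
    sample = [random.expovariate(1.0) for _ in range(n)]

    for k in (1.5, 2.0, 3.0):
        frac = sum(abs(x - mu) >= k * sigma for x in sample) / n
        print(f"k = {k}: observed {frac:.4f} <= bound 1/k^2 = {1 / k**2:.4f}")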


Appendix C

Independence and Uncorrelatedness

We will give definitions of (statistical) independence and uncorrelatedness, followed by a discussion of the difference between them.

Definition C.1 (Independence: Events) Let Pr(A) stand for the probability that Event A occurs, and Pr(A|B) signify the conditional probability with which Event A occurs when Event B is co-occurring. Then, Event A is (statistically) independent of Event B if

Pr(A|B) = Pr(A).

This means whether B happens makes no difference to how often A happens. Since Pr(A ∩ B) = Pr(A|B)Pr(B), if A is independent of B,

Pr(A ∩ B) = Pr(A)Pr(B), and Pr(B|A) = Pr(A ∩ B)/Pr(A) = Pr(A)Pr(B)/Pr(A) = Pr(B),

and B is independent of A. Hence, we say A and B are independent.

Definition C.2 (Independence: Continuous Random Variables) Random variables X and Y are independent if and only if the events {X ≤ a} and {Y ≤ b} are independent events for all real numbers a and b. This is equivalent to fX,Y(x, y) = fX(x)fY(y) for all pairs (x, y); where fX and fY are the probability density functions for X and Y respectively, and fX,Y is the joint probability density function for the pair (X, Y).

Definition C.3 (Uncorrelatedness) Random variables X and Y are said to be uncorrelated if Cov(X, Y) = 0. As the name suggests, this also means the correlation




coefficient Corr(X, Y) = Cov(X, Y)/(σXσY) = 0, unless either σX or σY is zero, in which case Corr(X, Y) is undefined.

Recall that Cov(X, Y) = E[XY] − E[X]E[Y]. Hence, two random variables X and Y are uncorrelated if and only if E[XY] = E[X]E[Y].

Proposition C.1 If random variables X and Y are independent, they are uncorrelated. However, uncorrelatedness does not necessarily imply independence.

Proof
Suppose X and Y are independent, and recall that this is equivalent to fX,Y(x, y) = fX(x)fY(y). Then,

E[XY] = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} xy fX,Y(x, y) dx dy = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} xy fX(x)fY(y) dx dy
= (∫_{−∞}^{+∞} x fX(x) dx)(∫_{−∞}^{+∞} y fY(y) dy) = E[X]E[Y].

Therefore, X and Y are uncorrelated.

We will now consider an example of uncorrelated but dependent variables X and Y. Let X be the standard normal variable and Y = X². Note that the distribution of Y is well-defined as X = ±√Y, and the probability density function fX(x) = (1/√(2π)) e^{−x²/2} is an even function with fX(x) = fX(−x). Then,

• X and Y are not independent, as fX,Y(x, y) is nonzero only when y = x², while fX and fY are nonzero for all (x, y); namely, fX(x) = (1/√(2π)) e^{−x²/2} for −∞ < x < +∞ and fY(y) = (1/√(2π)) y^{−1/2} e^{−y/2} for y > 0.1

1 In order to see this, consider the following.

∫_{−∞}^{+∞} (1/√(2π)) e^{−x²/2} dx = ∫_{−∞}^{0} (1/√(2π)) e^{−x²/2} dx + ∫_{0}^{+∞} (1/√(2π)) e^{−x²/2} dx (∗)

As y = x² gives dy = 2x dx, we have dx = dy/(2x) = dy/(2√y) for x > 0 and dx = dy/(2(−√y)) = −dy/(2√y) for x < 0. Therefore,

(∗) = ∫_{+∞}^{0} (1/√(2π)) e^{−y/2} (−dy/(2√y)) + ∫_{0}^{+∞} (1/√(2π)) e^{−y/2} dy/(2√y) = 2 ∫_{0}^{+∞} (1/√(2π)) e^{−y/2} dy/(2√y) = ∫_{0}^{+∞} (1/√(2π)) y^{−1/2} e^{−y/2} dy.

We can check that this integral is indeed 1, as it should be. We start with ∫_{0}^{∞} t^{n−1} e^{−t} dt = Γ(n) for n > 0. Let y/2 = u; then dy = 2 du, e^{−y/2} = e^{−u}, and y^{−1/2} = (2u)^{−1/2} = (1/√2) u^{1/2−1}. So,

(1/√(2π)) ∫_{0}^{∞} y^{−1/2} e^{−y/2} dy = (1/√(2π)) ∫_{0}^{∞} (1/√2) u^{1/2−1} e^{−u} (2 du) = (1/√π) ∫_{0}^{∞} u^{1/2−1} e^{−u} du = (1/√π) Γ(1/2) = (1/√π) √π = 1.



• X and Y are uncorrelated because Cov(X, Y) = E[XY] − E[X]E[Y] = E[X³] − E[X]E[Y] = 0 − 0 · E[Y] = 0. Note that E[X³] and E[X] are 0 because the functions x³e^{−x²/2} and xe^{−x²/2} are both odd, and the range of integration is symmetric about the origin.

Note that covariance and correlation are measures of linear dependence (Y = aX + b). Therefore, a nonlinear relation such as Y = X² is not captured by the covariance or the correlation coefficient. That is why Cov(X, Y) was found to be 0 for Y = X².

In fact, if Y = 2X, which is a linear dependence, X and Y are indeed correlated, as shown below.

Cov(X, Y) = E[XY] − E[X]E[Y] = E[2X²] − E[X]E[Y] = E[2X²] − 0 · E[Y] = E[2X²] > 0

=⇒ X and Y are correlated.

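The following Python sketch, our own illustration rather than part of the original text, estimates the sample covariances for the two cases just discussed; the sample size and seed are arbitrary choices.

    import random

    random.seed(0)
    n = 200_000
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    ys = [x * x for x in xs]          # Y = X^2 is completely determined by X

    mx = sum(xs) / n
    my = sum(ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    print(f"sample Cov(X, X^2) = {cov_xy:+.4f}")   # near 0: uncorrelated

    zs = [2.0 * x for x in xs]        # Y = 2X, a linear dependence
    mz = sum(zs) / n
    cov_xz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / (n - 1)
    print(f"sample Cov(X, 2X)  = {cov_xz:+.4f}")   # near 2: correlated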




Appendix D

Properties of Correlation

We will prove the following propositions for the population correlation. Proofs for the sample correlation are simply a matter of changing the notation to that for the sample.

Proposition D.1 (Property 7: |ρ| ≤ 1 on p.88; Property 8: −1 ≤ rX,Y ≤ +1 on p.91)
Proof
Variance is nonnegative by definition. Therefore,

Var(X/σX + Y/σY) = Var(X/σX) + Var(Y/σY) + 2Cov(X/σX, Y/σY)
= (1/σX²)Var(X) + (1/σY²)Var(Y) + (2/(σXσY))Cov(X, Y) = 2 + (2/(σXσY))Cov(X, Y) ≥ 0

=⇒ ρX,Y = Cov(X, Y)/(σXσY) ≥ −1.

Also,

Var(X/σX − Y/σY) = Var(X/σX) + Var(Y/σY) − 2Cov(X/σX, Y/σY)
= (1/σX²)Var(X) + (1/σY²)Var(Y) − (2/(σXσY))Cov(X, Y) = 2 − (2/(σXσY))Cov(X, Y) ≥ 0

=⇒ ρX,Y = Cov(X, Y)/(σXσY) ≤ 1.

Proposition D.2 (Property 8 on p.88; Property 9 on p.91) Corr(X, Y) = 1 if and only if Y = a + bX for some constants a and b > 0. Likewise, Corr(X, Y) = −1




if and only if Y = a + bX for some constants a and b < 0.
Proof
First, suppose Y = a + bX with b < 0. Then,

Corr(X, Y) = Corr(X, a + bX) = −Corr(X, X) = −Cov(X, X)/(σXσX) = −σX²/σX² = −1.

Likewise for b > 0.

Corr(X, Y) = Corr(X, a + bX) = +Corr(X, X) = Cov(X, X)/(σXσX) = σX²/σX² = +1.

Next, suppose Corr(X, Y) = ±1. Consider Var(Y − bX) for an arbitrary constant b.

Var(Y − bX) = Var(Y) + Var(−bX) + 2Cov(Y, −bX) = Var(Y) + b²Var(X) − 2bCov(X, Y)
= σY² + b²σX² − 2bσXσY Corr(X, Y) = σY² + b²σX² ∓ 2bσXσY = (bσX ∓ σY)²

Now let b = ±σY/σX. Then,

(bσX ∓ σY)² = (±(σY/σX)σX ∓ σY)² = 0 =⇒ Var(Y − bX) = 0

⇐⇒ Y − bX = a for some constant a ⇐⇒ Y = a + bX.

Note that Corr(X, Y) = +1 is paired with b = +σY/σX and Corr(X, Y) = −1 is paired with b = −σY/σX, giving the desired results.


Appendix E

The t Distribution

The t distribution with ν degrees of freedom is a distribution described by the following probability density function.

f(t) = (1/√(πν)) · (Γ[(ν + 1)/2]/Γ(ν/2)) · 1/(1 + t²/ν)^{(ν+1)/2}, −∞ < t < +∞; (E.1)

where ν is any strictly positive real number, ν > 0, and the gamma function Γ(x) is defined by the following improper integral

Γ(x) = ∫_{0}^{∞} e^{−y} y^{x−1} dy. (E.2)

In particular, when n is a positive integer,

Γ(n) = ∫_{0}^{∞} e^{−y} y^{n−1} dy = (n − 1)!. (E.3)

This is a consequence of

Γ(z + 1) = zΓ(z), (E.4)

which can be proved by integration by parts, and

Γ(1) = 1. (E.5)

The probability density function f(t) above takes the following forms for ν = 1, 2, 3, and ∞.

ν = 1: f(t) = 1/(π(1 + t²)) (E.6)

ν = 2: f(t) = 1/(2 + t²)^{3/2} (E.7)

ν = 3: f(t) = 6√3/(π(3 + t²)²) (E.8)

ν = ∞: f(t) = (1/√(2π)) e^{−t²/2} (E.9)

The t variable shares many properties with the standard normal variable z. Roughly speaking, t is a more variable, spread-out version of z which approaches z as the degrees of freedom tend to ∞. It is also a standardized variable in the sense that

∫_{−∞}^{+∞} f(t) dt = ∫_{−∞}^{+∞} (1/√(πν)) · (Γ[(ν + 1)/2]/Γ(ν/2)) · 1/(1 + t²/ν)^{(ν+1)/2} dt = 1. (E.10)

Here is a list of the properties of the t distribution as contrasted with the standard normal distribution exhibited by z.

Properties of the t Distribution

1. The distribution is bell-shaped and symmetric, centered on t = 0.

2. The height of the peak at t = 0 is lower than that for z.

3. The variance is ν/(ν − 2) = 1 + 2/(ν − 2) > 1 (for ν > 2). Hence, t is more variable or spread out than z.

4. For large ν, t approaches z; that is, the t distribution approaches the standard normal distribution of z as ν −→ ∞, which we denote by t −→ z as ν −→ ∞.

5. Therefore, the variance of t approaches 1 as ν −→ ∞. This can be verified easily because the variance of t is 1 + 2/(ν − 2) from 3 above.

The connection between this distribution and the sampling distribution of Chapter 4 is Fact E.1.

Fact E.1: This is a theorem provided without a proof. [Devore and Berk, 2012, p.320]
If {x1, x2, . . . , xn} is a random sample from a normal distribution N(µ, σ²),

T = (x̄ − µ)/(s/√n) (E.11)



has the t distribution with (n − 1) degrees of freedom, t_{n−1}.

This can be restated as follows in our context.

Fact E.2 [Devore and Berk, 2012, p.401]
When x̄ and s are the mean and standard deviation of a random sample of size n, {x1, x2, . . . , xn}, from a normal distribution with mean µ, the random variable

T = (x̄ − µ)/(s/√n)

has the t distribution with n − 1 degrees of freedom (abbreviated as df).

Now, it is possible to take this as the definition of the t distribution. However, the following equivalent definition of the t distribution is more illuminating as it shows how the t distribution arose.

Definition E.1 The t distribution with ν degrees of freedom is defined to be the distribution of the following ratio T.

T = Z/√(X/ν); (E.12)

where Z is a standard normal random variable, and X is a χ²_ν random variable independent of Z.

Fact E.1 and Fact E.2 are direct consequences of Definition E.1 and Proposition F.6 in Appendix F. In order to see this, rewrite T as below.

T = (X̄ − µ)/(S/√n) = [(X̄ − µ)/(σ/√n)] / √{[(n − 1)S²/σ²]/(n − 1)} (E.13)

Here, the numerator is Z by definition, and the denominator has the form √(χ²_{n−1}/(n − 1)).




Appendix F

The Chi-Squared Distribution

For beginners of statistics, it is probably easier to take Proposition F.4 as the definition of the chi-squared distribution.

Definition F.1 (Chi-Squared Distribution) Let Z1, Z2, . . . , Zk for some k ∈ N be independent standard normal random variables. Then, the sum of the squares of these variables, denoted by Qk,

Qk = Σ_{i=1}^{k} Zi² (F.1)

is distributed according to the chi-squared distribution with k degrees of freedom. We express this by

Qk ∼ χ²_k. (F.2)

However, it is also possible to start with the distribution (F.3) and prove the relation (F.2). The following is advanced material that can be skipped with no ill effect on your understanding of the main text.

Advanced Material
For each ν = 1, 2, 3, . . ., called the degrees of freedom (df), the chi-squared distribution is characterized by the following distribution function.

f(x) = [1/(2^{ν/2} Γ(ν/2))] x^{(ν/2)−1} e^{−x/2} for x > 0, and f(x) = 0 for x ≤ 0 (F.3)




It is customary to use the notation χ²_ν for a chi-squared variable with ν degrees of freedom. It can be shown that the mean of the chi-squared distribution is ν, and the variance is 2ν.

µ_{χ²ν} = ν, σ²_{χ²ν} = 2ν (F.4)

It may seem that this chi-squared distribution appeared out of the blue, but, in actuality, it arises in a rather straightforward manner from the standard normal distribution.

Proposition F.1 [Devore and Berk, 2012, p.316] If Z has a standard normal distribution, Z² has a chi-squared distribution with 1 df. That is,

Z² ∼ χ²_1. (F.5)

Proposition F.2 If X1 ∼ χ²_{ν1}, X2 ∼ χ²_{ν2}, and they are independent, then

X1 + X2 ∼ χ²_{ν1+ν2}. (F.6)

Proposition F.3 If X3 = X1 + X2, X1 ∼ χ²_{ν1}, X3 ∼ χ²_{ν3}, ν3 > ν1, and X1 and X2 are independent, then

X2 ∼ χ²_{ν3−ν1}. (F.7)

Proposition F.4 [Devore and Berk, 2012, p.317] If Z1, Z2, . . . , Zn are independent standard normal variables, then

Z1² + Z2² + . . . + Zn² ∼ χ²_n. (F.8)

Now note the following identity.

Σ_{i=1}^{n} [(Xi − µ)/σ]² = Σ_{i=1}^{n} [(Xi − X̄)/σ]² + [(X̄ − µ)/(σ/√n)]² (F.9)

Verification of (F.9)
It suffices to show

Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)². (F.10)



But, we have

Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} [(Xi − X̄) + (X̄ − µ)]² = Σ_{i=1}^{n} [(Xi − X̄)² + 2(Xi − X̄)(X̄ − µ) + (X̄ − µ)²]
= Σ_{i=1}^{n} (Xi − X̄)² + 2 Σ_{i=1}^{n} (Xi − X̄)(X̄ − µ) + n(X̄ − µ)². (F.11)

So, we only need to show

Σ_{i=1}^{n} (Xi − X̄)(X̄ − µ) = 0. (F.12)

Indeed,

Σ_{i=1}^{n} (Xi − X̄)(X̄ − µ) = Σ_{i=1}^{n} [XiX̄ − Xiµ − X̄² + X̄µ]
= (Σ_{i=1}^{n} Xi)X̄ − (Σ_{i=1}^{n} Xi)µ − nX̄² + nX̄µ = (nX̄)X̄ − (nX̄)µ − nX̄² + nX̄µ
= nX̄² − nX̄µ − nX̄² + nX̄µ = 0 (F.13)

The equality (F.9) along with Proposition F.5 leads to Proposition F.6.

Proposition F.5 If X1, X2, . . . , Xn are a random sample from a normal distribution, then X̄ and S² are independent.

Proposition F.6 [Devore and Berk, 2012, p.320] If X1, X2, . . . , Xn are a random sample from a normal distribution, then

(n − 1)S²/σ² ∼ χ²_{n−1}. (F.14)

Proof
On the left-hand side of (F.9), (Xi − µ)/σ is a z-statistic, which we denote by Zi. Then, the left-hand side of (F.9) becomes

Σ_{i=1}^{n} [(Xi − µ)/σ]² = Σ_{i=1}^{n} Zi². (F.15)



The first term of (F.9) can be expressed in terms of n, S, and σ as below.

Σ_{i=1}^{n} [(Xi − X̄)/σ]² = [(n − 1)/σ²] · [Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)] = (n − 1)S²/σ² (F.16)

The fraction inside the second term on the right-hand side of (F.9), (X̄ − µ)/(σ/√n), is the z-statistic derived from the sampling distribution of the mean. Denote this by Z_X̄, so that we have

[(X̄ − µ)/(σ/√n)]² = Z²_X̄. (F.17)

Now, substituting (F.15), (F.16), and (F.17) into (F.9), we obtain

Σ_{i=1}^{n} Zi² = (n − 1)S²/σ² + Z²_X̄. (F.18)

By Proposition F.4, Σ_{i=1}^{n} Zi² ∼ χ²_n, and Proposition F.1 implies Z²_X̄ ∼ χ²_1. Furthermore, we know that X̄ and S² are independent due to Proposition F.5. Therefore, according to Proposition F.3, we get

(n − 1)S²/σ² ∼ χ²_{n−1}. (F.19)


Appendix G

The F-Distribution

Definition G.1 (F-Distribution) [DeGroot and Schervish, 2012, p.598] Let Y and W be independent random variables such that Y has the χ²-distribution with m degrees of freedom and W has the χ²-distribution with n degrees of freedom, where m and n are given positive integers. Define a new random variable X as follows:

X = (Y/m)/(W/n) = nY/(mW).

Then, the distribution of X is called the F-distribution with m and n degrees of freedom, denoted by F_{m,n}. Hence, we have

F_{m,n} = (χ²_m/m)/(χ²_n/n).

From this definition, one can see immediately that the reciprocal 1/X has the F-distribution with n and m degrees of freedom, F_{n,m}.

Proposition G.1 If Y has the t-distribution with n degrees of freedom, then Y² has the F-distribution with 1 and n degrees of freedom, F_{1,n}.
Proof
From Definition E.1,

Y = t_n = Z/√(χ²_n/n);

where Z is the standard normal variable and χ²_n is the chi-squared variable with n degrees of freedom which is independent of Z. Hence,

Y² = t²_n = Z²/(χ²_n/n) = (χ²_1/1)/(χ²_n/n) = F_{1,n}.





Appendix H

F-Tests for ANOVA and Linear Regression

H.1 The F-Test for ANOVA

This is a hypothesis test with the following hypotheses.

H0 : µ1 = µ2 = . . . = µk
Ha : At least one of the means is different.

We will first show that Relation (8.18) indeed holds. A key theorem in our proof is called Cochran’s Theorem. This is a very important theorem in analysis of variance and regression analysis as it allows us to decompose sums of squares into several quadratic forms1 and identify their distributions and establish their independence. The importance of the terms in the model is assessed via the distributions of their sums of squares. Because of its unique importance, we will first encounter several incarnations of Cochran’s Theorem.

Theorem H.1 (Cochran’s Theorem: general version) Given X ∼ N_p(0, I), suppose that XᵀX is decomposed into k quadratic forms, Qi = XᵀBiX, i = 1, 2, . . . , k, where the rank of Bi is ri and the Bi are positive semidefinite. Then, any one of the following conditions implies the other two.

(1) The ranks of the Qi add to p.

(2) Each Qi ∼ χ²_{ri}.




(3) All the Qi are mutually independent.

Theorem H.2 (Cochran’s Theorem: ANOVA version) [Roger, 2012, p.252] If n observations are obtained from the same normal population with mean µ and variance σ², and if the total sum of squares SST with n − 1 degrees of freedom is partitioned into m sums of squares SS1, SS2, . . . , SSm with degrees of freedom df1, df2, . . . , dfm, then the m terms SSj/σ² are independent χ² variables with dfj degrees of freedom if and only if df1 + df2 + . . . + dfm = n − 1.

Theorem H.3 (Cochran’s Theorem: an equivalent version for ANOVA) If all n observations come from the same normal distribution with mean µ and variance σ², and SST is decomposed into k sums of squares SSr, each with degrees of freedom dfr such that Σ_{r=1}^{k} dfr = n − 1, then each SSr/σ² is an independent χ²_{dfr} random variable.

Theorem H.4 (Cochran’s Theorem: multivariate version)

As we know from Appendix G that

F_{m,n} = (χ²_m/m)/(χ²_n/n);

where χ²_m and χ²_n are for independent random variables, we can now derive Relation (8.18) using Theorem H.2 or Theorem H.3. These theorems imply

SSB/σ² ∼ χ²_{dfB} (H.1)

and

SSW/σ² ∼ χ²_{dfW}, (H.2)

which in turn implies that

MSB/σ² = (SSB/σ²)/dfB ∼ χ²_{dfB}/dfB (H.3)

and

MSW/σ² = (SSW/σ²)/dfW ∼ χ²_{dfW}/dfW. (H.4)

Furthermore, Theorems H.2 and H.3 assure that SSB and SSW, and hence MSB and MSW, are independent. Therefore, since the factors of σ² cancel in the ratio,

MSB/MSW = (SSB/dfB)/(SSW/dfW) ∼ (χ²_{dfB}/dfB)/(χ²_{dfW}/dfW) = F_{dfB,dfW}. (H.5)



H.2 The F-Test for Linear Regression

Recall that this is a hypothesis test with the following hypotheses.

H0 : the slope b = 0

Ha : the slope b ≠ 0

We also know from Appendix G that

F_{m,n} = (χ²_m/m)/(χ²_n/n);

where χ²_m and χ²_n are for independent random variables.

The following facts assure that the ratio MSR/MSE on p.130 is indeed an F-statistic.

Fact H.1 It can be shown that if H0 is true, and the residuals are unbiased, homoscedastic, independent, and normal, the following holds.2

• SSE/σ² has a χ² distribution with dfE degrees of freedom. =⇒ SSE/σ² ∼ χ²_{dfE}

• SSR/σ² has a χ² distribution with dfR degrees of freedom. =⇒ SSR/σ² ∼ χ²_{dfR}

• SSE and SSR are independent random variables.

Therefore,

MSR/MSE = (SSR/dfR)/(SSE/dfE) = [(SSR/σ²)/dfR]/[(SSE/σ²)/dfE] = (χ²_{dfR}/dfR)/(χ²_{dfE}/dfE) = F_{dfR,dfE}.

It can be shown that under the null hypothesis H0 we have both

E[MSR] = σ²

and

E[MSE] = σ².

So, F should be near 1 if the null hypothesis H0 holds.

2 One way to prove these facts is to use Cochran’s Theorem, but that is beyond the scope of this book.
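For concreteness, here is a Python sketch of the regression F-statistic MSR/MSE for a simple linear regression; the (x, y) data are hypothetical values of ours, and dfR = 1, dfE = n − 2 for one predictor.

    # Hypothetical (x, y) data, for illustration only.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)         # least-squares slope
    a = ybar - b * xbar                          # least-squares intercept
    fitted = [a + b * x for x in xs]

    ssr = sum((f - ybar) ** 2 for f in fitted)           # regression SS
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error SS
    df_r, df_e = 1, n - 2
    print(f"F = MSR/MSE = {(ssr / df_r) / (sse / df_e):.2f} on ({df_r}, {df_e}) df")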




Appendix I

Moment Generating Function

The distribution of a random variable X can be characterized by its moment generating function. The moment generating function of X is a real function whose derivatives at zero are the moments of X.1 A probability distribution is uniquely specified by its moment generating function. Moment generating functions are often used to derive the distribution of a sum of two or more random variables. One problem with the moment generating function is that not all random variables have a moment generating function. When this happens, we have to use the characteristic function, which all random variables possess.

Definition I.1 (Moment Generating Function) Consider a random variable X. If the expected value E[exp(tX)] exists and is finite for all real numbers t ∈ [−a, +a], with a > 0, then we say that X possesses a moment generating function, MGF, and the function

MX(t) = E[exp (tX)] (I.1)

is called the moment generating function of X.

The reason for the naming “moment generating function” is the following proposition.

Proposition I.1 If a random variable X possesses an MGF, the n-th moment of X, denoted by E[Xⁿ], is finite for any n ∈ N and given by

E[Xⁿ] = dⁿMX(t)/dtⁿ |_{t=0}; (I.2)

where the right-hand side of (I.2) is the value of the n-th derivative of MX(t) at t = 0.

1 The n-th moment of X is the expectation value of Xⁿ, or E[Xⁿ].
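As a numerical illustration of Proposition I.1 (a sketch of ours, not from the original text), the following Python snippet approximates the first two moments of the standard normal variable by finite differences of its known MGF, MX(t) = exp(t²/2).

    import math

    def mgf_std_normal(t):
        # The standard normal MGF has the closed form M_X(t) = exp(t^2 / 2).
        return math.exp(t * t / 2.0)

    h = 1e-3
    # Central finite differences approximate the derivatives at t = 0.
    m1 = (mgf_std_normal(h) - mgf_std_normal(-h)) / (2 * h)
    m2 = (mgf_std_normal(h) - 2 * mgf_std_normal(0.0) + mgf_std_normal(-h)) / h ** 2
    print(f"first moment  ~ {m1:.6f} (exact 0)")
    print(f"second moment ~ {m2:.6f} (exact 1)")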




The significance and utility of the MGF is due to the following theorems.

Theorem I.1 A moment generating function is uniquely associated with a random variable X. In other words, an MGF uniquely determines the cumulative distribution function2, CDF, of the associated random variable X.

Theorem I.2 Let X and Y be two random variables. If FX(x) and FY(y) are their cumulative distribution functions and MX(t) and MY(t) are their moment generating functions, respectively, then

FX(x) = FY(x) for any x ⇐⇒ MX(t) = MY(t) for any t (I.4)

or

FX = FY ⇐⇒ MX = MY . (I.5)

In other words, X and Y have the same distribution if and only if they have the same moment generating functions.

There are explicit ways to recover the probability distribution function. However, they require contour integration and are beyond the scope of this book.

One problem with the moment generating function is that not all random variables possess a moment generating function. In order to circumvent this problem, we resort to the characteristic function. This is another transform with similar properties to those of the moment generating function.

Definition I.2 (Characteristic Function) Given a random variable X, the characteristic function ϕX : R −→ C is defined by

ϕX(t) = E[exp(itX)]. (I.6)

Proposition I.2 If the n-th moment E[Xⁿ] exists and is finite for some n ∈ N, ϕX(t) is n times continuously differentiable, and we have

E[Xⁿ] = (1/iⁿ) dⁿϕX(t)/dtⁿ |_{t=0}; (I.7)

where the right-hand side of (I.7) is the value of the n-th derivative of ϕX(t) at t = 0.

2 The cumulative distribution function, CDF, of a random variable X is a function FX such that

FX(x) = P(X ≤ x); (I.3)

where P(X ≤ x) is the probability that the random variable X takes a value which is less than or equal to x.



Proposition I.3 Consider two random variables X and Y and denote their cumulative distribution functions by FX and FY. Then, X and Y have the same distribution if and only if ϕX = ϕY, that is,

FX(x) = FY(x) for any x ⇐⇒ ϕX(t) = ϕY(t) for any t. (I.8)




Appendix J

Matrix Algebra

Definition J.1 (Unitary Matrix) A complex square matrix U is unitary if

U∗U = UU∗ = I;

where I is the identity matrix and U* is the conjugate transpose of U.

Properties of a Unitary Matrix

1. U preserves the inner product between any two vectors; that is,

⟨Ux, Uy⟩ = ⟨x, y⟩ .

2. U is normal by definition as U∗U = UU∗.

3. U is diagonalizable, and hence, U is unitarily similar to a diagonal matrix. U has a decomposition of the form

U = V DV ∗;

where V is unitary and D is diagonal and unitary.

4. |det U| = 1

5. The eigenspaces of U are orthogonal.

The following are equivalent.

1. U is unitary.




2. U∗ is unitary.

3. U is invertible with U−1 = U∗.

4. The columns of U form an orthonormal basis of Cⁿ with respect to the usual inner product.

5. The rows of U form an orthonormal basis of Cⁿ with respect to the usual inner product.

6. U is a normal matrix with eigenvalues lying on the unit circle.

Proof: Equivalence of 1 and 4
Consider an n-by-n square matrix U whose i-th column is an n-dimensional vector Vi = (v1i, v2i, . . . , vni)ᵀ ∈ Cⁿ, so that

U = [V1 V2 . . . Vn];

that is, the (j, i) entry of U is vji. Then Vi*, the conjugate transpose of Vi, is the row vector whose entries are the complex conjugates of v1i, v2i, . . . , vni, and U* is the matrix whose rows are V1*, V2*, . . . , Vn*. Multiplying, the (i, j) entry of U*U is the usual inner product ⟨Vj, Vi⟩. It is now clear that U*U = I if and only if ⟨Vi, Vj⟩ = δij, which in turn means that the column vectors {Vi} are orthonormal. This suffices, as a square matrix with either a left or a right inverse is invertible with a unique inverse, and U*U = I if and only if UU* = I.

https://onlinecourses.science.psu.edu/stat505/node/49

Definition J.2 (Orthogonal Matrix) An orthogonal matrix is a square matrix with real entries whose columns and rows are orthonormal vectors. Equivalently, a real square matrix A is orthogonal if its transpose is equal to its inverse, that is,

Aᵀ = A⁻¹, and so, AᵀA = AAᵀ = I.

Note that A is a “real” version of a unitary matrix, and all the properties of a unitary matrix hold for A. Therefore, any of the equivalent properties of a unitary matrix can be used as the definition of an orthogonal matrix.

Definition J.3 (Diagonalizable Matrix) A square matrix A is called diagonalizable if it is similar to a diagonal matrix; that is, if there exists an invertible matrix P such that P⁻¹AP is a diagonal matrix.

Definition J.4 (Symmetric Matrix) A symmetric matrix is a square matrix A that is equal to its transpose; i.e., A = Aᵀ. So, if we write A = [aij] in a component-wise description, A is symmetric if aij = aji.

The most important (real) symmetric matrices for our applications are covarianceand correlation matrices.

Lemma J.1 The eigenvalues of a real symmetric matrix are real.
Proof
Let A be a real symmetric matrix with an eigenvalue λ and eigenvector v, so that Av = λv. We will denote the transpose conjugate of matrix B by B*. Note that the notation w* makes sense for a vector w as it is a matrix with one column or one row.

Av = λv =⇒ v*Av = v*λv = λ∥v∥² =⇒ λ = v*Av/∥v∥²

Now,

(v*Av)* = v*A*(v*)* = v*Av =⇒ v*Av is real.

So, λ is indeed real.



Lemma J.2 If u and v are eigenvectors (column vectors) of a real symmetric matrix A corresponding to distinct eigenvalues λ and µ, u and v are orthogonal; that is, the inner product ⟨u, v⟩ = uᵀv = 0.

Proof

λuᵀv = (λu)ᵀv = (Au)ᵀv = uᵀAᵀv = uᵀAv = µuᵀv =⇒ (λ − µ)uᵀv = 0

But λ and µ are distinct, so λ − µ ≠ 0. Hence, we have uᵀv = ⟨u, v⟩ = 0.

Theorem J.1 (Spectral Theorem for Real Symmetric Matrices) A real symmetric n by n matrix is diagonalizable. In fact, Rⁿ has an orthonormal basis of eigenvectors for any real symmetric matrix A. If these are taken as the columns of an (orthogonal) matrix V, then VᵀAV is a diagonal matrix with the eigenvalues of A on the diagonal.

In lieu of a proof, we will look at a 3 × 3 example.

Example J.1 Consider

A = [[3, 1, −1], [1, 3, −1], [−1, −1, 5]].

Expanding the determinant along the first row,

det(A − λI) = (3 − λ)[(3 − λ)(5 − λ) − 1] − 1 · (5 − λ − 1) − 1 · [−1 + (3 − λ)]
= (3 − λ)(14 − 8λ + λ²) − 5 + λ + 1 + 1 − 3 + λ = (3 − λ)(14 − 8λ + λ²) − 6 + 2λ
= (3 − λ)(14 − 8λ + λ²) − 2(3 − λ) = (3 − λ)(12 − 8λ + λ²) = (3 − λ)(λ − 2)(λ − 6) = 0

=⇒ λ = 2, 3, 6

So, the eigenvalues are 2, 3, and 6. Let us find an eigenvector (a, b, c)ᵀ associated with λ = 2. The equation A(a, b, c)ᵀ = 2(a, b, c)ᵀ reads

3a + b − c = 2a, a + 3b − c = 2b, −a − b + 5c = 2c,

which reduces to

a + b − c = 0, a + b − c = 0, −a − b + 3c = 0 =⇒ c = 0 and a = −b


If you know Gaussian elimination, you can get the same result as follows.

A − 2I = [[1, 1, −1], [1, 1, −1], [−1, −1, 3]] −→ [[1, 1, −1], [0, 0, 0], [0, 0, 2]] −→ [[1, 1, −1], [0, 0, 2], [0, 0, 0]]

=⇒ c = 0 and a + b = 0

This suggests that the eigenvector associated with the eigenvalue 2 is a scalar multiple of (−1, 1, 0)ᵀ.

Similarly, we can show the eigenvectors for λ = 3 and λ = 6 are scalar multiples of (1, 1, 1)ᵀ and (1, 1, −2)ᵀ, respectively.

It is trivial to check these three vectors are orthogonal to each other; i.e., the inner products are zero between any pair of these three vectors. After adjusting the lengths to 1, we get

(−1/√2, 1/√2, 0)ᵀ, (1/√3, 1/√3, 1/√3)ᵀ, and (1/√6, 1/√6, −2/√6)ᵀ.

These eigenvectors form an orthonormal basis for R³. Putting the normalized eigenvectors in columns, we get the following orthogonal matrix V. This is the V in Theorem J.1.

V = [[−1/√2, 1/√3, 1/√6], [1/√2, 1/√3, 1/√6], [0, 1/√3, −2/√6]]

Let us compute VᵀAV. Since the columns of V are eigenvectors, AV has columns 2v1, 3v2, and 6v3:

VᵀAV = Vᵀ [[−2/√2, 3/√3, 6/√6], [2/√2, 3/√3, 6/√6], [0, 3/√3, −12/√6]]
= Vᵀ [[−√2, √3, √6], [√2, √3, √6], [0, √3, −2√6]] = [[2, 0, 0], [0, 3, 0], [0, 0, 6]]

So, it is diagonal with the eigenvalues on the diagonal as claimed in Theorem J.1.
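The computation in Example J.1 can be checked numerically; the following Python sketch (our own, using NumPy) recovers the eigenvalues 2, 3, 6 and verifies that VᵀAV is diagonal and V is orthogonal.

    import numpy as np

    A = np.array([[ 3.0,  1.0, -1.0],
                  [ 1.0,  3.0, -1.0],
                  [-1.0, -1.0,  5.0]])

    # eigh is meant for symmetric matrices; it returns the eigenvalues in
    # ascending order and the orthonormal eigenvectors as the columns of V.
    eigenvalues, V = np.linalg.eigh(A)
    print(eigenvalues)                  # [2. 3. 6.], as found in Example J.1
    print(np.round(V.T @ A @ V, 10))    # diag(2, 3, 6)
    print(np.round(V.T @ V, 10))        # the identity: V is orthogonal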

Definition J.5 (Positive Semidefinite/Definite Matrices) A symmetric n × n matrix A is called positive semidefinite if xᵀAx ≥ 0 for all x ∈ Rⁿ, and is called positive definite if xᵀAx > 0 for all nonzero x ∈ Rⁿ.

Definition J.6 (Principal Submatrix and Principal Minor) Let A be an n × n matrix. A k × k submatrix of A formed by deleting n − k rows of A, and the same n − k columns of A, is called a principal submatrix of A. The determinant of a principal submatrix of A is called a principal minor of A. Note that the definition does not specify which n − k rows and columns to delete, only that their indices must be the same.

Theorem J.2 (Positive Semidefiniteness) The following are equivalent:

1. The symmetric matrix A is positive semidefinite.

2. All eigenvalues of A are nonnegative.

3. All the principal minors of A are nonnegative.

4. There exists B such that A = BᵀB.

Definition J.7 (Leading Principal Submatrix/Minor) Let A be an n × n matrix. The k-th order principal submatrix of A obtained by deleting the last n − k rows and columns of A is called the k-th order leading principal submatrix of A, and its determinant is called the k-th order leading principal minor of A.

Theorem J.3 (Positive Definiteness) The following are equivalent:

1. The symmetric matrix A is positive definite.

2. All eigenvalues of A are positive.

3. All the leading principal minors of A are positive.



4. There exists a nonsingular square matrix B such that A = BᵀB.

One example of a positive semidefinite matrix is a covariance matrix.

Theorem J.4 (Covariance Matrix is Positive Semidefinite) A variance-covariance matrix Σ is positive semidefinite.

Proof
From Definition J.5, an n × n matrix A is positive semidefinite if and only if xᵀAx ≥ 0 for all x ∈ Rⁿ. As Fact 7.3 gives

Σ = E[(X − E[X])(X − E[X])ᵀ], (J.1)

for any pair u, v ∈ Rⁿ, we have

uᵀΣv = uᵀE[(X − E[X])(X − E[X])ᵀ]v = E[uᵀ(X − E[X])(X − E[X])ᵀv] = E[(u · (X − E[X]))((X − E[X]) · v)]. (J.2)

Therefore, for u ∈ Rⁿ, we have

uᵀΣu = E[(u · (X − E[X]))((X − E[X]) · u)] = E[(u · (X − E[X]))²] ≥ 0. (J.3)

This proves that the variance-covariance matrix Σ is positive semidefinite.
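A quick numerical illustration of Theorems J.2 and J.4, as a sketch of ours with arbitrary simulated data: the eigenvalues of a sample covariance matrix should all be nonnegative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 4))   # 500 observations of 4 variables
    Sigma = np.cov(X, rowvar=False)     # the 4-by-4 sample covariance matrix

    # Theorem J.2: positive semidefinite <=> all eigenvalues are nonnegative.
    print(np.linalg.eigvalsh(Sigma))    # every entry should be >= 0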




Appendix K

Different Rotation Schemes

A rotation is a linear transformation that is performed on the initial solution of principal components analysis for the purpose of making the solution easier to interpret. There are two kinds of rotation technique: orthogonal rotation and oblique rotation. Orthogonal rotation keeps the components uncorrelated while making the meaning of each component clearer and easier to interpret. Lack of correlation is equivalent to zero inner product for every pair of components in our vector view of the components. Geometrically speaking, the components are kept at right angles to one another. This is often a preferred method due to its mathematical simplicity. However, researchers sometimes need to relax the orthogonality condition to facilitate interpretation of the components. Oblique rotation allows the rotated components to be correlated. Some of the representative rotation schemes are listed below.

Orthogonal Rotation

Varimax Varimax rotation focuses on the columns of the loading matrix and tries to maximize the variance of the loadings on each component. This has the effect of making large loadings larger and small loadings smaller within each component. It tries to make all the loadings either large or near zero, in an attempt to explain each component by a small number of variables, while possibly the same variables can be relevant to explaining two or more factors. In actuality, such an ideal cannot be realized easily. However, this rotation makes the interpretation of the component structure easier.

Quartimax Quartimax rotation focuses on the rows of the loading matrix and tries to maximize the variance of the loadings on each variable. This has the




effect of making large loadings larger and small loadings smaller within each variable. As a result, each variable is expected to be explained by a reduced number of components, while one component may need to be interpreted by a large number of variables.

Equamax Equamax rotation is a compromise that attempts to simplify both components and variables. It is a hybrid equidistant from varimax and quartimax.

Orthomax Orthomax rotation is also a hybrid between varimax and quartimax, where the relative weight of varimax versus quartimax is determined by the researcher.

Parsimax Parsimax rotation is a special case of orthomax rotation. The relative weights of varimax and quartimax are determined based on the number of observed variables and components.

Oblique Rotation

Orthoblique In orthoblique rotation, quartimax rotation is applied to rescaled component loadings.

Procrustes A target loading matrix is hypothesized, and the rotation matrix is found that best approximates the target matrix. Hence, the Procrustes rotation scheme serves as a test of the hypothesized loadings.

Promax Promax rotation starts from an orthogonal solution and tries to make smaller loadings approach zero by further oblique rotation. Orthogonal loadings are raised to powers in the process.

By far the most commonly encountered rotation in social science and humanities is varimax rotation. So, let us briefly look into the mathematics behind this technique.

Varimax Rotation in More Detail
Suppose there are p variables and m components. The first step of varimax rotation is to rescale each loading lij by dividing by the corresponding communality hi. Note that lij is the loading of the i-th variable on the j-th component, and hi is the communality of the i-th variable given by

hi = Σ_{j=1}^{m} l²ij.



Let us denote the scaled loading by l̃ij, so that

l̃ij = lij/hi.

Then, the varimax process selects the rotation that maximizes the sample variances of the standardized loadings for each factor, summed over all the factors, given by

(1/p) Σ_{j=1}^{m} [ Σ_{i=1}^{p} (l̃ij)⁴ − (1/p)(Σ_{i=1}^{p} (l̃ij)²)² ].




Appendix L

Miscellaneous

Definition L.1 (Exponential Random Variable) A variable X is an exponential random variable of parameter λ > 0 when its probability distribution function f(x) is given by

f(x) = λe^{−λx} for x ≥ 0, and f(x) = 0 for x < 0. (L.1)





Bibliography

[Abdi, 2003] Abdi, H. (2003). Factor rotations in factor analyses, pages 978–982. Sage Publications, Thousand Oaks, CA.

[Bureau, 2012] Bureau, U. C. (2012). Statistical abstract of the united states: 2012- section 7 elections.

[contributors, nd] contributors, W. (n.d.). Coefficient of determination.

[Dallal, 2012] Dallal, G. E. (2012). The little handbook of statistical practice.

[DeGroot and Schervish, 2012] DeGroot, M. H. and Schervish, M. J. (2012). Probability and statistics. Pearson, Boston, fourth edition.

[Devore and Berk, 2012] Devore, J. L. and Berk, K. N. (2012). Modern mathematical statistics with applications. Springer, New York/Heidelberg, second edition.

[Habing, 2003] Habing, B. (2003). Exploratory factor analysis.

[Hogg et al., 2012] Hogg, R. V., McKean, J. W., and Craig, A. T. (2012). Introduction to mathematical statistics. Pearson, Boston, seventh edition.

[Hunter, 2002a] Hunter, D. (2002a). 22. pearson’s chi-square statistic.

[Hunter, 2002b] Hunter, D. (2002b). Statistics 597a asymptotic tools fall 2002 lecture notes 2002.

[Hunter, 2006] Hunter, D. (2006). Chapter 7: Pearson’s chi-square test.

[Institute for Digital Research and Education UCLA, nd] Institute for Digital Research and Education UCLA (n.d.). Annotated spss output: Factor analysis.

[Kootstra, 2004] Kootstra, G. J. (2004). Exploratory factor analysis: Theory and application.




[Lebanon, 2009] Lebanon, G. (2009). Pearson’s chi-square.

[Norušis, 2008] Norušis, M. J. (2008). SPSS 16.0 statistical procedures companion. Prentice Hall, Upper Saddle River, NJ.

[O’Connor, 2012] O’Connor, B. (2012). Ai and social science: Cosine similarity,pearson correlation, and ols coefficients.

[of Technology Institute of Statistics, nd] of Technology Institute of Statistics, G. U.(n.d.). Multivariate statistical analysis.

[Pardoe, 2012] Pardoe, I. (2012). Applied regression modeling. John Wiley & Sons,Hoboken, NJ, second edition.

[Pett et al., 2003] Pett, M. A., Lackey, N. R., and Sullivan, J. J. (2003). Making sense of factor analysis: The use of factor analysis for instrument development in health care research. SAGE, Thousand Oaks, CA.

[Preacher, ] Preacher, K. J. Calculation for the chi-square test: An interactive calculation tool for chi-square tests of goodness of fit and independence.

[Roger, 2012] Roger, K. E. (2012). Experimental design: Procedures for the behavioral sciences. SAGE Publications, Los Angeles, fourth edition.

[Siegrist, 2012a] Siegrist, K. (2012a). 9. sample covariance and correlation.

[Siegrist, 2012b] Siegrist, K. (2012b). Virtual laboratories in probability and statistics: 3. expected value 3. covariance and correlation.

[Siegrist, 2012c] Siegrist, K. (2012c). Virtual laboratories in probability and statistics: 5. random samples 3. the sample variance.

[Siegrist, 2012d] Siegrist, K. (2012d). Virtual laboratories in probability and statistics: 5. random samples 7. sample covariance, correlation, and regression.

[Siegrist, 2012e] Siegrist, K. (2012e). Virtual laboratories in probability and statistics: Basic statistics b. random samples 9. sample covariance and correlation.

[Soper, nd] Soper, D. (n.d.). Statistics calculators.

[Tabachnick and Fidell, 2001] Tabachnick, B. G. and Fidell, L. S. (2001). Usingmultivariate statistics. Allyn & Bacon, Needham Heights, MA, 4th edition.



[Taboga, 2012a] Taboga, M. (2012a). Statlect: Fundamentals of probability: Independent random variables.

[Taboga, 2012b] Taboga, M. (2012b). Statlect: Fundamentals of probability: Random variables.

[Trek, 2012] Trek, S. (2012). Sampling distributions.

[Walker, 1940] Walker, H. M. (1940). Degrees of freedom. Journal of Educational Psychology, 31(4):253–269.

[Weisstein, 2012a] Weisstein, E. W. (2012a). Wolfram mathworld: Correlation coef-ficient.

[Weisstein, 2012b] Weisstein, E. W. (2012b). Wolfram mathworld: Statistical corre-lation.

[Xia, 2012] Xia, Y. (2012). Chapter 1 simple linear regression (part 2).

[Zhang, 2011a] Zhang, D. (2011a). Chapter 1: Linear regression with one predictor.

[Zhang, 2011b] Zhang, D. (2011b). Chapter 2: Inferences in simple linear regression.




Answers to Exercises

Chapter 2

Click to go back to the problems.

1.

2.

3. (a) $\frac{4+4}{2} = 4$
   (b) 2, 4, 6
   (c) $\frac{1+2+2+3+4+4+5+6+6+7}{10} = 4$
   (d) $\sqrt{\frac{(1-4)^2+(2-4)^2+(2-4)^2+(3-4)^2+(4-4)^2+(4-4)^2+(5-4)^2+(6-4)^2+(6-4)^2+(7-4)^2}{9}}$

4. (a) $\frac{4+4}{2} = 4$
   (b) 3, 4, 5
   (c) $\frac{1+2+3+3+4+4+5+5+6+7}{10} = 4$
   (d) $\sqrt{\frac{(1-4)^2+(2-4)^2+(3-4)^2+(3-4)^2+(4-4)^2+(4-4)^2+(5-4)^2+(5-4)^2+(6-4)^2+(7-4)^2}{9}}$

5. (a) $\frac{4+4}{2} = 4$
   (b) 2, 4, 6
   (c) $\frac{(1-4)^2+(2-4)^2+(2-4)^2+(3-4)^2+(4-4)^2+(4-4)^2+(5-4)^2+(6-4)^2+(6-4)^2+(7-4)^2}{9}$

6. (a) $\frac{4+4}{2} = 4$
   (b) 3, 4, 5
   (c) $\frac{1+2+3+3+4+4+5+5+6+7}{10} = 4$
   (d) $\sqrt{\frac{(1-4)^2+(2-4)^2+(3-4)^2+(3-4)^2+(4-4)^2+(4-4)^2+(5-4)^2+(5-4)^2+(6-4)^2+(7-4)^2}{9}}$

7. (a) $\frac{4+4}{2} = 4$
   (b) 2, 4, 6
   (c) $\frac{1+2+2+3+4+4+5+6+6+7}{10} = 4$
   (d) $\sqrt{\frac{(1-4)^2+(2-4)^2+(2-4)^2+(3-4)^2+(4-4)^2+(4-4)^2+(5-4)^2+(6-4)^2+(6-4)^2+(7-4)^2}{9}}$

8. (a) $\frac{4+4}{2} = 4$
   (b) 2, 4, 6
   (c) $\frac{1+2+2+3+4+4+5+6+6+7}{10} = 4$
   (d) $\sqrt{\frac{(1-4)^2+(2-4)^2+(2-4)^2+(3-4)^2+(4-4)^2+(4-4)^2+(5-4)^2+(6-4)^2+(6-4)^2+(7-4)^2}{9}}$

9. (a) $\frac{4+4}{2} = 4$
   (b) 3, 4, 5
   (c) $\frac{1+2+3+3+4+4+5+5+6+7}{10} = 4$
   (d) $\sqrt{\frac{(1-4)^2+(2-4)^2+(3-4)^2+(3-4)^2+(4-4)^2+(4-4)^2+(5-4)^2+(5-4)^2+(6-4)^2+(7-4)^2}{10}}$

10. (a) $\frac{4+4}{2} = 4$
    (b) 2, 4, 6
    (c) $\frac{1+2+2+3+4+4+5+6+6+7}{10} = 4$
    (d) $\sqrt{\frac{(1-4)^2+(2-4)^2+(2-4)^2+(3-4)^2+(4-4)^2+(4-4)^2+(5-4)^2+(6-4)^2+(6-4)^2+(7-4)^2}{10}}$

11. (a) $(6 + 3 + 5 + 4 + 7 + 4 + 5 + 6)/8 = 40/8 = 5$
    (b) $\frac{(6-5)^2+(3-5)^2+(5-5)^2+(4-5)^2+(7-5)^2+(4-5)^2+(5-5)^2+(6-5)^2}{8-1}$
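As a quick numerical check of answers 3 and 11, here is a minimal Python sketch. Python is not part of this book; the data sets 1, 2, 2, 3, 4, 4, 5, 6, 6, 7 and 6, 3, 5, 4, 7, 4, 5, 6 are simply read off the formulas above.

    import statistics

    data3 = [1, 2, 2, 3, 4, 4, 5, 6, 6, 7]
    print(statistics.median(data3))     # (4 + 4) / 2 = 4.0
    print(statistics.multimode(data3))  # [2, 4, 6]  (needs Python 3.8+)
    print(statistics.mean(data3))       # 4
    print(statistics.stdev(data3))      # sample SD: sqrt(36 / 9) = 2.0

    data11 = [6, 3, 5, 4, 7, 4, 5, 6]
    print(statistics.mean(data11))      # 40 / 8 = 5
    print(statistics.variance(data11))  # sample variance, divides by 8 - 1

Note that statistics.stdev and statistics.variance divide by $n - 1$, matching the sample formulas used in the answers.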

Chapter 3

Click to go back to the problems.

1.



2.

3.

4. (a) 0.4772
   (b) $0.5 - 0.4495 = 0.0505$
   (c) $P(|z| > 1.96) = 2 \times (0.5 - 0.4750) = 0.05$

5. At $x = 40$: $z = \frac{40-50}{10} = -1$, while at $x = 55$: $z = \frac{55-50}{10} = 0.5$. Therefore, we get $P(-1 < z < 0.5) = 0.3413 + 0.1915 = 0.5328$.

6. (a) 0.3413
   (b) $0.5 - 0.4505 = 0.0495$
   (c) $P(|z| \le 2) = 2 \times 0.4772 = 0.9544$

7. At $x = 90$: $z = \frac{90-100}{20} = -0.5$, while at $x = 125$: $z = \frac{125-100}{20} = 1.25$. Therefore, we get $P(-0.5 < z < 1.25) = 0.1915 + 0.3944 = 0.5859$.

8. (a) 0.3413
   (b) $0.3413 + 0.4505 = 0.7918$
   (c) $P(|z| \le 1) = 2 \times 0.3413 = 0.6826$

9. (a) 0.3413
   (b) $0.5 - 0.4505 = 0.0495$
   (c) $P(|z| \le 2) = 2 \times 0.4772 = 0.9544$

10. At $x = 90$: $z = \frac{90-100}{20} = -0.5$, while at $x = 125$: $z = \frac{125-100}{20} = 1.25$. Therefore, we get $P(-0.5 < z < 1.25) = 0.1915 + 0.3944 = 0.5859$.

11. (a) $P(z > 1.65) = 0.5 - 0.4505 = 0.0495$
    (b) $P(|z| \le 2) = 2 \times 0.4772 = 0.9544$

12. At $x = 40$: $z = \frac{40-50}{10} = -1$, while at $x = 55$: $z = \frac{55-50}{10} = 0.5$. Therefore, we get $P(-1 < z < 0.5) = 0.3413 + 0.1915 = 0.5328$.

13. (a) 0.3413
    (b) $0.3413 + 0.4505 = 0.7918$
    (c) $P(|z| \le 1) = 2 \times 0.3413 = 0.6826$

14. At $x = 40$: $z = \frac{40-50}{10} = -1$, while at $x = 55$: $z = \frac{55-50}{10} = 0.5$. Therefore, we get $P(-1 < z < 0.5) = 0.3413 + 0.1915 = 0.5328$.

15. (a) 0.3413
    (b) $0.5 - 0.4505 = 0.0495$
    (c) $P(|z| \le 2) = 2 \times 0.4772 = 0.9544$

16. $z = \frac{80-50}{15} = 2$, and the table gives $P(0 < z < 2) = 0.4772 \Longrightarrow 1000 \times 0.4772 = 477.2 \Longrightarrow 477$ students.

17. (a) 0.3413
    (b) $0.3413 + 0.4505 = 0.7918$
    (c) $P(|z| \le 1) = 2 \times 0.3413 = 0.6826$

18. (a) .3849
    (b) $P(1 < x < 2) = P\left(\frac{1-1}{0.5} < z < \frac{2-1}{0.5}\right) = P(0 < z < 2) = .4772$

19. At $x = 90$: $z = \frac{90-100}{20} = -0.5$, while at $x = 125$: $z = \frac{125-100}{20} = 1.25$. Therefore, we get $P(-0.5 < z < 1.25) = 0.1915 + 0.3944 = 0.5859$.
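The z-table look-ups above can be double-checked with the standard normal CDF. A minimal sketch, assuming SciPy is available (SciPy is not used by the text, which works from tables):

    from scipy.stats import norm

    # Answer 5: P(40 < x < 55) with mu = 50, sigma = 10
    print(norm.cdf(0.5) - norm.cdf(-1.0))          # about 0.5328

    # Answer 7: P(90 < x < 125) with mu = 100, sigma = 20
    print(norm.cdf(1.25) - norm.cdf(-0.5))         # about 0.5859

    # Answer 16: 1000 * P(0 < z < 2)
    print(1000 * (norm.cdf(2.0) - norm.cdf(0.0)))  # about 477.2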

Chapter 4

Click to go back to the problems.

1.

Chapter 5

Click to go back to the problems.

1.

2. $z = \frac{515-500}{80/\sqrt{64}} = 1.5$
   $P(\bar{x} \le 515) = P(z \le 1.5) = 0.5 + P(0 < z \le 1.5) = 0.5 + 0.4332 = 0.9332$

3. $z = \frac{\bar{x}-60}{\sqrt{100}/\sqrt{25}} = \pm 1.645 \Longrightarrow \bar{x} = 60 \pm 2 \times 1.645 = 60 \pm 3.29 \Longrightarrow [56.71, 63.29]$



4. (a) $t = \frac{17.1-20}{s/\sqrt{10}}$
   (b) 1% from the table

5.

6. $z = \frac{1010-1000}{100/\sqrt{100}} = 1.0$
   $P(\bar{x} \le 1010) = P(z \le 1.0) = 0.5 + P(0 < z \le 1.0) = 0.5 + 0.3413 = 0.8413$

7. (a) $z = \frac{\bar{x}-12}{0.5/\sqrt{100}} = -3.0$
   (b) $z_{0.01/2} = 2.575$, and this is smaller than 3.0 in absolute value. Hence, the machine appears to be underfilling the box on average.

8. (a) $\mu_{\bar{x}} = \mu = 35$
   (b) $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{30}{\sqrt{225}} = 2$
   (c) $z = \frac{41-35}{2} = 3$, so the probability is $0.5 - 0.4987 = 0.0013$.

9. $\sigma = 2$, $n = 100$, $\bar{x} = 10$, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{100}} = 0.2$
   (a) $\bar{x} \pm z\sigma_{\bar{x}} = 10 \pm 1.64 \times 0.2 \Longrightarrow (9.672, 10.328)$
   (b) $n = \left(\frac{z\sigma}{w/2}\right)^2 = \left(\frac{1.96 \times 2}{0.5}\right)^2 = 61.4656 \;\therefore\; 62$

10. $n = 16$, $\bar{x} = 2.50$, $s_{\bar{x}} = \frac{0.05}{\sqrt{16}} = 0.0125$, df $= 15$
    (a) $\bar{x} \pm t_{df}\, s_{\bar{x}} = 2.50 \pm 2.947 \times 0.0125 \Longrightarrow (2.4632, 2.5368)$
    (b) $2.5 + t_{15} \times 0.0125 = 2.522 \Longrightarrow t_{15} = \frac{0.022}{0.0125} = 1.76 \Longrightarrow \alpha = 0.05$

11. $z = \frac{1010-1000}{100/\sqrt{100}} = 1.0$
    $P(\bar{x} \le 1010) = P(z \le 1.0) = 0.5 + P(0 < z \le 1.0) = 0.5 + 0.3413 = 0.8413$

12. (a) $z = \frac{\bar{x}-12}{0.5/\sqrt{100}} = -3.0$
    (b) $z_{0.01/2} = 2.575$, and this is smaller than 3.0 in absolute value. Hence, the machine appears to be underfilling the box on average.

13. (a) $t = \frac{732-750}{38/\sqrt{17}}$
    (b) The corresponding t-value from the table is 2.12. They cannot conclude that the prices are different.

14. (a) $\mu_{\bar{x}} = \mu = 35$
    (b) $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{30}{\sqrt{225}} = 2$
    (c) $z = \frac{41-35}{2} = 3$, so the probability is $0.5 - 0.4987 = 0.0013$.

15. $\sigma = 2$, $n = 100$, $\bar{x} = 10$, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{100}} = 0.2$
    (a) $\bar{x} \pm z\sigma_{\bar{x}} = 10 \pm 1.64 \times 0.2 \Longrightarrow (9.672, 10.328)$
    (b) $n = \left(\frac{z\sigma}{w/2}\right)^2 = \left(\frac{1.96 \times 2}{0.5}\right)^2 = 61.4656 \;\therefore\; 62$

16. (a) $t_{0.05} = 2.12$
    (b) $t = \frac{\bar{x}-\mu}{s_{\bar{x}}} \Longrightarrow -2.12 < \frac{\bar{x}-\mu}{s_{\bar{x}}} < 2.12 \Longrightarrow \bar{x} - 2.12\, s_{\bar{x}} < \mu < \bar{x} + 2.12\, s_{\bar{x}} \Longrightarrow 75.77 < \mu < 92.23$

17. (a) $\sigma_{\bar{x}} \approx \frac{s}{\sqrt{N}} = \frac{9000}{\sqrt{225}} = 600 \rightarrow \sigma_{\bar{x}} \approx \$600$
    (b) $\frac{24{,}400 - 25{,}000}{600} = \frac{-600}{600} = -1$ and $\frac{26{,}200 - 25{,}000}{600} = \frac{1{,}200}{600} = 2 \rightarrow$ the table gives 0.3413 for $z = 1$ and 0.4772 for $z = 2 \rightarrow 0.3413 + 0.4772 = 0.8185$ [2 pts]

18. (a) $\sigma_{\bar{x}} \approx \frac{s}{\sqrt{N}} = \frac{30}{\sqrt{9}} = 10$ and df $= 9 - 1 = 8$ [2 pts]
    (b) $\left[\bar{x} - t_{.025}\frac{s}{\sqrt{N}},\; \bar{x} + t_{.025}\frac{s}{\sqrt{N}}\right] = [105 - 2.306 \times 10,\; 105 + 2.306 \times 10] = [81.94, 128.06]$

19. $z_{0.025} = 1.96$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{8.3}{\sqrt{85}} = 0.900261 \Longrightarrow \bar{x} \pm z_{\alpha/2}\sigma_{\bar{x}} = 66.3 \pm (1.96 \times 0.900261) \approx 66.3 \pm 1.76 \Longrightarrow (64.54, 68.06)$

20. $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{3.23}{\sqrt{16}} = 0.8075$, $\nu = n - 1 = 16 - 1 = 15$, $t_{0.05/2,\,15} = 2.131 \Longrightarrow \bar{x} \pm t_{\alpha/2,\nu}\, s_{\bar{x}} = 27.9 \pm (2.131 \times 0.8075) = 27.9 \pm 1.72 \Longrightarrow (26.18, 29.62)$

21. (a) $z_{.025} = 1.96$
    (b) $z_{.025} = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{\bar{x} - 90000}{20000/\sqrt{400}} = \frac{\bar{x} - 90000}{20000/20} = \frac{\bar{x} - 90000}{1000} \Longrightarrow \bar{x} = 90000 + 1.96 \times 1000 = 91960$
    (c) You can get the upper and lower limits using the answer to 21b as well as the symmetry between $\mu$ and $\bar{x}$ inherent in this problem. However, a full solution would be as follows.
        $z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{90000 - \mu}{20000/\sqrt{400}} = \frac{90000 - \mu}{20000/20} = \frac{90000 - \mu}{1000} \Longrightarrow -1.96 < \frac{90000 - \mu}{1000} < 1.96 \Longrightarrow -1.96 \times 1000 < 90000 - \mu < 1.96 \times 1000 \Longrightarrow -90000 - 1960 < -\mu < -90000 + 1960 \Longrightarrow 90000 - 1960 < \mu < 90000 + 1960 \Longrightarrow 88040 < \mu < 91960$
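Answers 9 and 20 can be reproduced in a few lines. A minimal sketch, assuming SciPy is available; note that the text rounds $z_{0.05}$ to 1.64 where SciPy gives 1.645, so the interval in answer 9 shifts slightly.

    from scipy.stats import norm, t

    # Answer 9(a): large-sample 90% CI, xbar = 10, sigma = 2, n = 100
    se_z = 2 / 100 ** 0.5                             # 0.2
    z = norm.ppf(0.95)                                # about 1.645
    print(10 - z * se_z, 10 + z * se_z)               # about (9.671, 10.329)

    # Answer 20: small-sample 95% CI, xbar = 27.9, s = 3.23, n = 16
    se_t = 3.23 / 16 ** 0.5                           # 0.8075
    tcrit = t.ppf(0.975, df=15)                       # about 2.131
    print(27.9 - tcrit * se_t, 27.9 + tcrit * se_t)   # about (26.18, 29.62)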



Chapter 6

Click to go back to the problems.

1.

2.

3.

4.

5. (a) $t = \frac{732-750}{38/\sqrt{17}}$
   (b) The corresponding t-value from the table is 2.12. They cannot conclude that the prices are different.

6. (a) $z = \frac{\bar{x}-12}{0.5/\sqrt{100}} = -3.0$
   (b) $z_{0.01/2} = 2.575$, and this is smaller than 3.0 in absolute value. Hence, the machine appears to be underfilling the box on average.

7. (a) $t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{23-20}{3/\sqrt{9}} = 3.0$
   (b) df $= 8$ and $t_{0.01} = 2.896$
   (c) If the mean for the engine were 20 (or less), the probability of observing a t-value greater than 2.896 would be (at most) 0.01. So, it is unlikely that the new engine meets the pollution standard.

8. The null hypothesis $H_0$ is that there is no effect of malnutrition. Under this hypothesis, we still have $\mu = 25$ and $\sigma = 6$ even if the mother is undernourished. Since this is a large-sample case, we use the z-statistic. $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{22-25}{6/\sqrt{100}} = \frac{-3}{0.6} = -5$. The probability that a z-value of $-5$ or smaller would result is $P(z \le -5) < 0.0013 < 0.05$. Therefore, we reject the null hypothesis and conclude that there is a negative effect of malnutrition on the birth weight.

9. $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{3.35-3.2}{0.6/\sqrt{36}} = 1.5$. But $P(1.5 \le z) = 0.067$. So, we fail to reject the null hypothesis.

10. $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{0.270-0.275}{0.02/\sqrt{25}} = \frac{-0.005}{0.02/5} = -1.25$. As this is a two-tailed test, $P(1.25 \le |z|) = 2P(z > 1.25) = 2(0.1056) = 0.2112 > 0.05$ indicates that we cannot reject the null hypothesis. Hence, it is not clear whether the steroid actually affects the performance.



11. $|t| = \left|\frac{\bar{x} - \mu}{s/\sqrt{n}}\right| = \left|\frac{9.6-10}{0.5/\sqrt{9}}\right| = \frac{0.4}{0.5/3} = 2.4$. The degrees of freedom df $= 9 - 1 = 8$. Now, $t_{8,0.01} = 2.896$ and $t_{8,0.05} = 1.860$. Therefore, we reject $H_0$ at $\alpha = 0.05$ and fail to reject at $\alpha = 0.01$.

12. (a) $z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{2{,}460 - 2{,}400}{200/\sqrt{50}} = 2.12$
    (b) $z_{0.05} = 1.645$ and $2.12 > 1.645$. If $\mu$ were $2{,}400$, a z-value this large would occur with probability less than 5%. So, we can conclude that $\mu > 2{,}400$ with confidence greater than 0.95.

13. (a) $n = 9$, df $= 8$, $\bar{x} = \frac{95+99+98+97+95+101+98+93+97}{9} = 97$
    (b) $t = \frac{97-100}{2.4/\sqrt{9}} = \frac{-3}{0.8} = -3.75$
    (c) $t_{8,.05} = 1.86$ and $-3.75 < -1.86$. Hence, the null hypothesis is rejected, and we conclude $\mu < 100$.
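Answers 8 and 13 can be checked directly. A minimal sketch, assuming SciPy; note that the z-table used in the text bottoms out at 0.0013, while the exact lower-tail probability at $z = -5$ is far smaller.

    from scipy.stats import norm, ttest_1samp

    # Answer 8: one-sided z-test, H0: mu = 25, sigma = 6, n = 100, xbar = 22
    z = (22 - 25) / (6 / 100 ** 0.5)    # -5.0
    print(norm.cdf(z))                  # about 2.9e-07, well below 0.05

    # Answer 13: one-sided t-test, H0: mu = 100
    data = [95, 99, 98, 97, 95, 101, 98, 93, 97]
    t_stat, p_two_sided = ttest_1samp(data, popmean=100)
    print(t_stat)                       # about -3.75
    print(p_two_sided / 2)              # one-sided p, well below 0.05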

Chapter 7

Click to go back to the problems.

1. $r = \frac{(9-11)(10-17)+(11-11)(22-17)+(13-11)(19-17)}{\sqrt{[(9-11)^2+(11-11)^2+(13-11)^2]\,[(10-17)^2+(22-17)^2+(19-17)^2]}}$

2. $r = \frac{(60-70)(10-20)+(70-70)(30-20)+(80-70)(20-20)}{\sqrt{[(60-70)^2+(70-70)^2+(80-70)^2]\,[(10-20)^2+(30-20)^2+(20-20)^2]}}$

3. (a) A: the number of tropical fish found in the ocean in the temperate zone. B: the amount of ultraviolet light that reaches the surface in the Antarctic. (Of course, this is just one example. There are many others.)
   (b) They are both caused by global warming. Neither causes the other.

4. $r = \frac{(60-70)(10-20)+(70-70)(30-20)+(80-70)(20-20)}{\sqrt{[(60-70)^2+(70-70)^2+(80-70)^2]\,[(10-20)^2+(30-20)^2+(20-20)^2]}}$

5. (a) A: the number of tropical fish found in the ocean in the temperate zone. B: the amount of ultraviolet light that reaches the surface in the Antarctic. (Of course, this is just one example. There are many others.)
   (b) They are both caused by global warming. Neither causes the other.

6. (a) $r = \frac{(9-11)(10-17)+(11-11)(22-17)+(13-11)(19-17)}{\sqrt{[(9-11)^2+(11-11)^2+(13-11)^2]\,[(10-17)^2+(22-17)^2+(19-17)^2]}}$
   (b) To give a wacky example, A: family size and B: population of crickets may be such a pair. They are both negatively correlated with human population.

7. $r = \frac{(60-70)(10-20)+(70-70)(30-20)+(80-70)(20-20)}{\sqrt{[(60-70)^2+(70-70)^2+(80-70)^2]\,[(10-20)^2+(30-20)^2+(20-20)^2]}}$

8. (a) $r = \frac{(9-11)(10-17)+(11-11)(22-17)+(13-11)(19-17)}{\sqrt{[(9-11)^2+(11-11)^2+(13-11)^2]\,[(10-17)^2+(22-17)^2+(19-17)^2]}}$
   (b) To give a wacky example, A: family size and B: population of crickets may be such a pair. They are both negatively correlated with human population density.
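The formulas in answers 1 and 2 are just Pearson's r. A minimal check, assuming NumPy, with the paired data (9, 11, 13)/(10, 22, 19) and (60, 70, 80)/(10, 30, 20) read off the answers:

    import numpy as np

    x1 = np.array([9, 11, 13])
    y1 = np.array([10, 22, 19])
    print(np.corrcoef(x1, y1)[0, 1])   # about 0.72, matching answer 1's formula

    x2 = np.array([60, 70, 80])
    y2 = np.array([10, 30, 20])
    print(np.corrcoef(x2, y2)[0, 1])   # 0.5, matching answer 2's formula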

Chapter 8

Click to go back to the problems.

1.

2. (a) Within groups mean square: $\frac{20154.344}{44} = 458.053$. Between groups mean square: $\frac{4141.576}{3} = 1380.525$.
   (b) $F = \frac{1380.525}{458.053}$
   (c) Because the significance level is $.04 > .01$, we fail to reject the null hypothesis that the means are all the same. Therefore, we cannot conclude that any of the training methods is effective, at least based on these data.
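The F-ratio in answer 2 and its significance can be reproduced as follows. A sketch assuming SciPy; the degrees of freedom 3 and 44 are inferred from the divisors in part (a), not stated elsewhere in this answer.

    from scipy.stats import f

    ms_between = 4141.576 / 3      # 1380.525
    ms_within = 20154.344 / 44     # about 458.05
    F = ms_between / ms_within     # about 3.01
    print(f.sf(F, dfn=3, dfd=44))  # about 0.04, the significance level cited in (c)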

Chapter 9

Click to go back to the problems.

1.

2. (a) “R square” is the amount/proportion of variance explained by the model. “Adjusted R square” includes a correction for the overestimation of the population R square that occurs because a particular model tends to fit the sample better than the entire population.
   (b) $F = 146.579$ from the table. Because the significance level is smaller than .05, the slope is not zero at $\alpha = .05$. (Additional comment: In SPSS, any significance level $< .0005$ is reported as .000. Due to the rounding scheme, anything greater than .0005 will be reported as a significance level $\ge .001$.)
   (c) Standard error of the estimate $= \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\text{Residual Sum of Squares}}{df_E}} = \sqrt{\frac{28.730}{110}}$
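Part (c) is a one-line computation; a sketch in pure Python, with the SSE and error df read off the formula above:

    sse, df_error = 28.730, 110
    print((sse / df_error) ** 0.5)   # standard error of the estimate, about 0.511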

Chapter 10

Click to go back to the problems.

1.

2. (a) $e_{11} = \frac{(55+15) \times 60}{150}$ or $\frac{(55+5) \times 70}{150} = 28$; the full table of expected frequencies is

           Yes   No
           28    32
           42    48

   (b) $\chi^2 = \frac{(55-28)^2}{28} + \frac{(5-32)^2}{32} + \frac{(15-42)^2}{42} + \frac{(75-48)^2}{48}$
   (c) $(r-1)(c-1) = (2-1)(2-1) = 1$
   (d) When df $= 1$, the 99% value for $\chi^2$ is 6.63490. Since our value 81.37 exceeds this, we conclude that there is a significant relationship between the mother’s attitude and the child’s enrollment in a foreign language class.

3. (a) $e_{11} = \frac{(55+15) \times 60}{150}$ or $\frac{(55+5) \times 70}{150} = 28$; the full table of expected frequencies is

           Yes   No
           28    32
           42    48

   (b) $\chi^2 = \frac{(55-28)^2}{28} + \frac{(5-32)^2}{32} + \frac{(15-42)^2}{42} + \frac{(75-48)^2}{48}$
   (c) $(r-1)(c-1) = (2-1)(2-1) = 1$
   (d) When df $= 1$, the 99% value for $\chi^2$ is 6.63490. Since our value 81.37 exceeds this, we conclude that there is a significant relationship between the mother’s attitude and the child’s enrollment in a foreign language class.

4. (a) $e_{11} = \frac{(180+90) \times 240}{500}$ or $\frac{(180+60) \times 270}{500} = 129.6$; the full table of expected frequencies is

           Yes     No
           129.6   140.4
           110.4   119.6

   (b) $\chi^2 = \frac{(180-129.6)^2}{129.6} + \frac{(90-140.4)^2}{140.4} + \frac{(60-110.4)^2}{110.4} + \frac{(170-119.6)^2}{119.6}$
   (c) $(r-1)(c-1) = (2-1)(2-1) = 1$
   (d) When df $= 1$, the 95% value for $\chi^2$ is 3.84. Since our value 81.93 exceeds this, we conclude that there is a significant relationship between the mother’s attitude and the child’s enrollment in a foreign language class.



5. (a) $\frac{150}{3} = 50$
   (b) $\chi^2 = \frac{(61-50)^2}{50} + \frac{(53-50)^2}{50} + \frac{(36-50)^2}{50}$
   (c) df $= k - 1 = 3 - 1 = 2$
   (d) When df $= 2$, the 95% value of $\chi^2$ is 5.99147. Since our value 6.52 exceeds this, we conclude that there is a customer preference for one or more of the brands of bread.

6. (a) $e_{11} = \frac{(55+15) \times 60}{150}$ or $\frac{(55+5) \times 70}{150} = 28$; the full table of expected frequencies is

           Yes   No
           28    32
           42    48

   (b) $\chi^2 = \frac{(55-28)^2}{28} + \frac{(5-32)^2}{32} + \frac{(15-42)^2}{42} + \frac{(75-48)^2}{48}$
   (c) $(r-1)(c-1) = (2-1)(2-1) = 1$
   (d) When df $= 1$, the 99% value for $\chi^2$ is 6.63490. Since our value 81.37 exceeds this, we conclude that there is a significant relationship between the mother’s attitude and the child’s enrollment in a foreign language class.

7. (a) $\frac{150}{3} = 50$
   (b) $\chi^2 = \frac{(61-50)^2}{50} + \frac{(53-50)^2}{50} + \frac{(36-50)^2}{50}$
   (c) df $= k - 1 = 3 - 1 = 2$
   (d) When df $= 2$, the 95% value of $\chi^2$ is 5.99147. Since our value 6.52 exceeds this, we conclude that there is a customer preference for one or more of the brands of bread.

8. (a) The expected frequencies are

                      Supports   Opposes   Undecided             Row total
        Male          35.75      46.75     27.5                  110
        Female        29.25      38.25     (50 × 90)/200 = 22.5  90
        Column total  65         85        50                    200

   (b) $\chi^2 = \frac{(40-35.75)^2}{35.75} + \frac{(40-46.75)^2}{46.75} + \frac{(30-27.5)^2}{27.5} + \frac{(25-29.25)^2}{29.25} + \frac{(45-38.25)^2}{38.25} + \frac{(20-22.5)^2}{22.5}$
   (c) $(r-1)(c-1) = (2-1)(3-1) = 2$
   (d) When df $= 2$, the 95% value for $\chi^2$ is 5.991. Since our value 3.794 does not exceed this, we conclude that there is not a significant difference between males and females in their attitude about raising the sales tax.

9. (a) $\frac{150}{3} = 50$
   (b) $\chi^2 = \frac{(61-50)^2}{50} + \frac{(53-50)^2}{50} + \frac{(36-50)^2}{50}$
   (c) df $= k - 1 = 3 - 1 = 2$
   (d) When df $= 2$, the 95% value of $\chi^2$ is 5.99147. Since our value 6.52 exceeds this, we conclude that there is a customer preference for one or more of the brands of bread.

10. (a) $A = \frac{500 \times 180}{1000} \left(= \frac{180 \times 500}{1000}\right) = 90$, $B = \frac{500 \times 470}{1000} \left(= \frac{470 \times 500}{1000}\right) = 235$, $C = \frac{500 \times 350}{1000} \left(= \frac{350 \times 500}{1000}\right) = 175$
    (b) $\sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(o_{ij}-e_{ij})^2}{e_{ij}} = \frac{(100-90)^2}{90} + \frac{(250-235)^2}{235} + \frac{(150-175)^2}{175} + \frac{(80-90)^2}{90} + \frac{(220-235)^2}{235} + \frac{(200-175)^2}{175}$
    (c) df $= (r-1)(c-1) = (2-1)(3-1) = 1 \cdot 2 = 2$
    (d) From the table, $\chi^2_{2,.01} = 9.210 < 11.28$. So, we reject the null hypothesis that there is no interaction between education and longevity.
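The contingency-table answers (e.g., answer 2) and the goodness-of-fit answers (e.g., answer 5) can be checked in a few lines. A sketch assuming SciPy; correction=False disables the Yates continuity correction so that the result matches the hand computation.

    import numpy as np
    from scipy.stats import chi2_contingency, chisquare

    # Answer 2: 2x2 test of independence, observed counts
    observed = np.array([[55, 5], [15, 75]])
    chi2, p, df, expected = chi2_contingency(observed, correction=False)
    print(expected)      # [[28. 32.] [42. 48.]]
    print(chi2, df, p)   # about 81.36, df = 1, p far below 0.01

    # Answer 5: goodness of fit, expected count 50 in each of 3 categories
    stat, p_gof = chisquare([61, 53, 36])
    print(stat, p_gof)   # about 6.52, p just under 0.05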

Chapter 11

Click to go back to the problems.

1.

2. Because principal components analysis uses standardized variables, each variable contributes one unit of variance. Hence, the total variance to be explained is n.

3.

4.

5.

6.

7.

8. $c_{41} = (-1.835)(.264) + (-1.143)(.297) + (-0.092)(.281) + (-1.060)(.255) + (-0.687)(.226) = -1.276$
   $c_{45} = (-1.835)(.939) + (-1.143)(-1.357) + (-0.092)(.172) + (-1.060)(.297) + (-0.687)(.136) = -0.595$

Page 291: APPLIED STATISTICS WITH HELPFUL DETAILS - …aoitani.net/Applied_Statistics.pdf · APPLIED STATISTICS WITH HELPFUL DETAILS Fascinating, eye-opening, and even life-changing explanations

CHAPTER 12 289

9. (a) 4 components
   (b) i. $X_1 = .845C_1 - .148C_2 - .273C_3 - .130C_4 + .362C_5 + .202C_6$
       ii. $(.845)^2 + (-.148)^2$ (Additional comment: No need to compute, but this gives .736.)
   (c) $(.845)(.802) + (-.148)(.224) + (-.273)(-.271) + (-.130)(.447) + (.362)(-.159) + (.202)(.086)$
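Answer 8 is just a dot product of one case's standardized scores with the component score coefficients. A minimal sketch, assuming NumPy, with all numbers read off the answer:

    import numpy as np

    z4 = np.array([-1.835, -1.143, -0.092, -1.060, -0.687])  # standardized scores, case 4
    w1 = np.array([0.264, 0.297, 0.281, 0.255, 0.226])       # coefficients, component 1
    w5 = np.array([0.939, -1.357, 0.172, 0.297, 0.136])      # coefficients, component 5
    print(z4 @ w1)   # about -1.275, matching -1.276 up to rounding
    print(z4 @ w5)   # about -0.596, matching -0.595 up to rounding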

Chapter 12

Click to go back to the problems.

1.

Chapter 13

Click to go back to the problems.

1.




Subject Index

This is a work in progress. A very slow progress it is. Chapter 1 is done.

A

ascending order, 16

C

central tendency, 15

D

decision making, 13
degrees of freedom, 18
descending order, 16
descriptive statistics, 13

G

graphical methods, 15

H

histogram, 15

I

inferential statistics, 13

M

mean, 16
measures of variability, 18
median, 16
mode(s), 17

N

numerical methods, 15
numerical summary, 13

P

population mean, 16
population: definition, 13

R

range, 18

S

sample mean, 16
sample variance, 18
sample: definition, 13
skewed, 17




squared distances, 18
statistical inference: definition, 13
sum of squares, 18

V

variability, 15variable: definition, 13