Numerical Descriptive Techniques

Numerical Descriptive Techniques

Chapter 4

4-22007 會計資訊系統計學 ( 一 ) 上課投影片

Introduction

Recall Chapter 2, where we used graphical techniques to describe data:

While this histogram provides some new insight, other interesting questions (e.g. what is the class average? what is the mark spread?) go unanswered.


Numerical Descriptive Techniques

Measures of Central Location （中央位置） Mean, Median, Mode

Measures of Variability （離散程度） Range, Standard Deviation, Variance, Coefficient of Variation

Measures of Relative Standing （相對位置） Percentiles, Quartiles

Measures of Linear Relationship （線性關係） Covariance, Correlation, Least Squares Line


4.1 Measures of Central Location

Usually, we focus our attention on two types of measures when describing population characteristics: Central location (e.g. average) Variability or spread

The measure of central location reflects the locations of all the actual data points.


With one data pointclearly the central location is at the pointitself.

4.1 Measures of Central Location

The measure of central location reflects the locations of all the actual data points.

How?

But if the third data point appears on the left hand-sideof the midrange, it should “pull”the central location to the left.

With two data points,the central location should fall in the middlebetween them (in order to reflect the location ofboth of them).


Sum of the observationsNumber of observations

Mean =

This is the most popular and useful measure of central location

The Arithmetic Mean（算術平均數）


Notation

When referring to the number of observations in a population, we use uppercase letter N

When referring to the number of observations in a sample, we use lower case letter n

The arithmetic mean for a population is denoted with Greek letter “mu”: （母體平均數）

The arithmetic mean for a sample is denoted with an “x-bar”. （樣本平均數）


Statistics is a pattern language

Population

Sample

Size N n

Mean


nx

x in

1i

Sample mean Population mean

Nx i

N1i

Sample size Population size

nx

x in

1i

The Arithmetic Mean



Population

Sample

Size N n

Mean


10

...

101021

101 xxxx

x ii

• Example 4.1The reported time on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

00 77 222211.011.0

• Example 4.2

Suppose the telephone bills of Example 2.1 representthe population of measurements. The population mean is

200x...xx

200x 20021i

2001i 42.1942.19 38.4538.45 45.7745.77

43.5943.59

The Arithmetic Mean


When many of the measurements have the same value, the measurement can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees were recorded as follows:

Number of children per family 0 1 2 3Number of families 3 4 7 2

• Additional Example

16 employees

+ + +

5.116

)3(2)2(7)1(4)0(316

x...xx16

xx 1621i

161i

The Arithmetic Mean


The Arithmetic Mean

…is appropriate for describing measurement data, e.g. heights of people, marks of student papers, etc.

…is seriously affected by extreme values called “outliers”. E.g. as soon as a billionaire moves into a neighborhood, the average household income increases beyond what it was previously!


Odd number of observations

0, 0, 5, 7, 8 9, 12, 14, 220, 0, 5, 7, 8, 9, 12, 14, 22, 330, 0, 5, 7, 8, 9, 12, 14, 22, 33

Even number of observations

Example 4.3

Find the median of the time on the internetfor the 10 adults of example 4.1

The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.

The Median（中位數）

Suppose only 9 adults were sampled (exclude, say, the longest time (33))

Comment

8.5, 8


The Mode of a set of observations is the value that occurs most frequently.

Set of data may have one mode (or modal class), or two or more modes.

The modal classFor large data setsthe modal class is much more relevant than a single-value mode.

The Mode（眾數）


Example 4.5Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22

Solution

All observation except “0” occur once. There are two “0”. Thus, the mode is zero.

Is this a good measure of central location? The value “0” does not reside at the center of this set

(compare with the mean = 11.0 and the mode = 8.5).

The Mode


Additional example The manager of a men’s store observes the waist size (in

inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.

The mode of this data set is 34 in.

This information seems to be valuable (for example, for the design of a new display in the store), much more than “ the median is 33.5 in.”

The Mode


Measures of Central Location

The mode of a set of observations is the value that occurs most frequently.

A set of data may have one mode (or modal class), or two, or more modes.

Mode is a useful for all data types, though mainly used for nominal data.

For large data sets the modal class is much more relevant than a single-value mode.

※ Sample and population modes are computed the same way.


=MODE(range) in Excel

Note: if you are using Excel for your data analysis and your data is multi-modal (i.e. there is more than one mode), Excel only calculates the smallest one.

You will have to use other techniques (i.e. histogram) to determine if your data is bimodal, trimodal, etc.


Additional example A professor of statistics wants to report the results of a midterm exam, taken by 100 students.

• The mean of the test marks is 73.90• The median of the test marks is 81• The mode of the test marks is 84

Describe the information each one provides.

The mean provides informationabout the over-all performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.

The Median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his mark relative to other students in the class.

The mode must be used when data are nominal If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

The mode must be used when data are nominal If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

The Mean, Median and Mode


Relationship among Mean, Median, and Mode

If a distribution is symmetrical, the mean, median and mode coincide

If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.A positively skewed distribution

(“skewed to the right”)

MeanMedian

Mode


If a distribution is symmetrical, the mean, median and mode coincide

If a distribution is non symmetrical, and skewed to the left or to the right, the three measures differ.

A positively skewed distribution(“skewed to the right”)

MeanMedian

Mode MeanMedian

Mode

A negatively skewed distribution(“skewed to the left”)

Relationship among Mean, Median, and Mode


Mean, Median, Mode

If data are symmetric, the mean, median, and mode will be approximately the same.

If data are multimodal, report the mean, median and/or mode for each subgroup.

If data are skewed, report the median.


Mean, Median, & Modes for Ordinal & Nominal Data

For ordinal and nominal data the calculation of the mean is not valid.

Median is appropriate for ordinal data.

For nominal data, a mode calculation is useful for determining highest frequency but not “central location”.


This is a measure of the average growth rate.Let Ri denote the the rate of return in period i

(i=1,2…,n). The geometric mean of the returns R1, R2, …,Rn is the constant Rg that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.

The Geometric Mean（幾何平均數）


If the rate of return was Rg in everyperiod, the nth period return wouldbe calculated by:

ng )R1( )R1)...(R1)(R1( n21

For the given series of rate of returns the nth period return iscalculated by:

Rg is selected such that…

1)R1)...(R1)(R1(R nn21g 1)R1)...(R1)(R1(R n

n21g

The Geometric Mean


Finance Example

Suppose a 2-year investment of $1,000 grows by 100% to $2,000 in the first year, but loses 50% from $2,000 back to the original $1,000 in the second year. What is your average return?

Using the arithmetic mean, we have

This would indicate we should have $1,250 at the end of our investment, not $1,000.

Solving for the geometric mean yields a rate of 0%, which is correct.

The upper case Greek Letter “Pi” represents a product of terms…


Additional Example A firm’s sales were $1,000,000 three years ago. Sales have grown annually by 20%, 10%, -5%. Find the geometric mean rate of growth in sales.

Solution Since Rg is the geometric mean

(1+Rg)3 = (1+.2)(1+.1)(1-.05)= 1.2540

Thus, %.84.7or,0784.1)05.1)(1.1)(2.1(R 3

g

The Geometric Mean


Measures of Central Location ： Summary Compute the Mean to Describe the central location of a single set of interval data

Compute the Median to Describe the central location of a single set of interval or ordinal data

Compute the Mode to Describe a single set of nominal data

Compute the Geometric Mean to Describe a single set of interval data based on growth rates


4.2 Measures of variability

Measures of central location fail to tell the whole story about the distribution.

A question of interest still remains unanswered:

How much are the observations spread outaround the mean value?



Observe two hypothetical data sets:

The average value provides a good representation of theobservations in the data set.

Small variability

This data set is now changing to...



Observe two hypothetical data sets:

The average value provides a good representation of theobservations in the data set.

Small variability

Larger variability

The same average value does not provide as good representation of theobservations in the data set as before.


Measures of Variability

Measures of central location fail to tell the whole story about the distribution; that is, how much are the observations spread out around the mean value?

For example, two sets of class grades are shown. The mean (=50) is the same in each case…

But, the red class has greater variability than the blue class.


The range of a set of observations is the difference between the largest and smallest observations.

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.

? ? ?

But, how do all the observations spread out?

Smallestobservation

Largestobservation

The range cannot assist in answering this questionRange

The range（全距）


Range

The range is the simplest measure of variability, calculated as:

Range = Largest observation – Smallest observation

E.g. Data: {4, 4, 4, 4, 50} Range = 46 Data: {4, 8, 15, 24, 39, 50} Range = 46 The range is the same in both cases, but the data sets have very different distributions…


Range

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.

Hence we need a measure of variability that incorporates all the data and not just two observations. Hence…


Variance （變異數）

Variance and its related measure, standard deviation, are arguably the most important statistics. Used to measure variability, they also play a vital role in almost all statistical inference procedures.

Population variance is denoted by （母體變異數） (Lower case Greek letter “sigma” squared)

Sample variance is denoted by （樣本變異數） (Lower case “S” squared)



Population

Sample

Size N n

Mean

Variance


This measure reflects the dispersion of all the observations

The variance of a population of size N x1, x2,…,xN

whose mean is is defined as

The variance of a sample of n observationsx1, x2, …,xn whose mean is is defined asx

N

)x( 2i

N1i2

N

)x( 2i

N1i2

1n

)xx(s

2i

n1i2

1n

)xx(s

2i

n1i2

The Variance


Example 4.7 The following sample consists of the number of jobs six

students applied for: 17, 15, 23, 7, 9, 13. Finds its mean and variance

Solution

2

2222

in

1i2

jobs2.33

)1413...()1415()1417(16

11n

)xx(s

jobs146

846

13972315176

xx i

61i

The Variance


2

2222

2i

n1i2

i

n

1i

2

jobs2.33

6

13...151713...1517

16

1

n

)x(x

1n

1s

The Variance – Shortcut method


Why not use the sum of deviations?

Consider two small populations:

1098

74 10

11 12

13 16

8-10= -2

9-10= -111-10= +1

12-10= +2

4-10 = - 6

7-10 = -3

13-10 = +3

16-10 = +6

Sum = 0

Sum = 0

The mean of both populations is 10...

…but measurements in Bare more dispersedthen those in A.

A measure of dispersion Should agrees with this observation.

Can the sum of deviationsBe a good measure of dispersion?

A

B

The sum of deviations is zero for both populations, therefore, is not a good measure of dispersion.


Let us calculate the variance of the two populations

185

)1016()1013()1010()107()104( 222222B

25

)1012()1011()1010()109()108( 222222A

Why is the variance defined as the average squared deviation?Why not use the sum of squared deviations as a measure of variation instead?

After all, the sum of squared deviations increases in magnitude when the variationof a data set increases!!

The Variance


Which data set has a larger dispersion?Which data set has a larger dispersion?

1 3 1 32 5

A B

Data set Bis more dispersedaround the mean

Let us calculate the sum of squared deviations for both data sets.

The Variance


1 3 1 32 5

A B

SumA = (1-2)2 +…+(1-2)2 +(3-2)2 +… +(3-2)2= 10SumB = (1-3)2 + (5-3)2 = 8

SumA > SumB. This is inconsistent with the observation that set B is more dispersed.

The Variance


1 3 1 32 5

A B

However, when calculated on “per observation” basis (variance), the data set dispersions are properly ranked.

A2 = SumA/N = 10/10 = 1

B2 = SumB/N = 8/2 = 4

The Variance


The standard deviation of a set of observations is the square root of the variance .

2

2

:deviationandardstPopulation

ss:deviationstandardSample

2

2

:deviationandardstPopulation

ss:deviationstandardSample

Standard Deviation（標準差）



Population

Sample

Size N nMean

Variance

Standard Deviation


Example 4.8 To examine the consistency of shots for a new innovative

golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.

The distances were recorded. Which 7-iron is more consistent?

Standard Deviation


Example 4.8 – solution

Standard Deviation

Excel printout, from the “Descriptive Statistics” sub-menu.

Current Innovation

Mean 150.5467 Mean 150.1467Standard Error 0.668815 Standard Error 0.357011Median 151 Median 150Mode 150 Mode 149Standard Deviation 5.792104 Standard Deviation 3.091808Sample Variance 33.54847 Sample Variance 9.559279Kurtosis 0.12674 Kurtosis -0.88542Skewness -0.42989 Skewness 0.177338Range 28 Range 12Minimum 134 Minimum 144Maximum 162 Maximum 156Sum 11291 Sum 11261Count 75 Count 75

The innovation club is more consistent, and because the means are close, is considered a better club


Additional Example • Rates of return over the past 10 years for two mutual funds are

shown below. Which one have a higher level of risk?

Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05

Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4

Standard Deviation


SolutionLet us use the Excel printout that is run from the “Descriptive statistics” sub-menu.

Fund A Fund B

Mean 16 Mean 12Standard Error 5.295 Standard Error 3.152Median 14.6 Median 11.75Mode #N/A Mode #N/AStandard Deviation 16.74 Standard Deviation 9.969Sample Variance 280.3 Sample Variance 99.37Kurtosis -1.34 Kurtosis -0.46Skewness 0.217 Skewness 0.107Range 49.1 Range 30.6Minimum -6.2 Minimum -2.8Maximum 42.9 Maximum 27.8Sum 160 Sum 120Count 10 Count 10

Fund A Fund B

Mean 16 Mean 12Standard Error 5.295 Standard Error 3.152Median 14.6 Median 11.75Mode #N/A Mode #N/AStandard Deviation 16.74 Standard Deviation 9.969Sample Variance 280.3 Sample Variance 99.37Kurtosis -1.34 Kurtosis -0.46Skewness 0.217 Skewness 0.107Range 49.1 Range 30.6Minimum -6.2 Minimum -2.8Maximum 42.9 Maximum 27.8Sum 160 Sum 120Count 10 Count 10

Fund A should be considered riskier because its standard deviation is larger

Standard Deviation


Interpreting Standard Deviation

The standard deviation can be used to compare the variability of several distributions make a statement about the general shape of a distribution.

The empirical rule （經驗法則） : If a sample of observations has a bell-shaped distribution, the interval

tsmeasuremen the of 68%ely approximat contains )sx,sx(

tsmeasuremen the of 95%ely approximat contains )s2x,s2x( tsmeasuremen the of 99.7%ely approximat contains )s3x,s3x(


The Empirical Rule

Approximately 68% of all observations fallwithin one standard deviation of the mean.

Approximately 95% of all observations fallwithin two standard deviations of the mean.

Approximately 99.7% of all observations fallwithin three standard deviations of the mean.


Example 4.9A statistics practitioner wants to describe the way returns on investment are distributed. The mean return = 10% The standard deviation of the return = 8% The histogram is bell shaped.



Example 4.9 – solution The empirical rule can be applied (bell shaped histogram) Describing the return distribution

Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)]

Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]

Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]



Chebysheff’s Theorem （柴比氏定理）A more general interpretation of the standard deviation is

derived from Chebysheff’s Theorem, which applies to all shapes of histograms (not just bell shaped).

The proportion of observations in any sample that liewithin k standard deviations of the mean is at least:

For k=2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a “lower bound” compared to Empirical Rule’s approximation (95%).


The proportion of observations in any sample that lie within k standard deviations of the mean is at least 1-1/k2 for k > 1.

This theorem is valid for any set of measurements (sample, population) of any shape!!

K Interval Chebysheff Empirical Rule1 at least 0% approximately 68%2 at least 75% approximately 95%3 at least 89% approximately

99.7%

s2x,s2x sx,sx

s3x,s3x

The Chebysheff’s Theorem

(1-1/12)

(1-1/22)

(1-1/32)


Example 4.10 The annual salaries of the employees of a chain of computer

stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000,respectively. What can you say about the salaries at this chain?

SolutionAt least 75% of the salaries lie between $22,000 and $34,000

28000 – 2(3000) 28000 + 2(3000)

At least 88.9% of the salaries lie between $19,000 and $37,000 28000 – 3(3000) 28000 + 3(3000)

The Chebysheff’s Theorem


Coefficient of Variation （變異係數）The coefficient of variation of a set of observations

is the standard deviation of the observations divided by their mean, that is:

Population coefficient of variation = CV =

Sample coefficient of variation = cv =



Population

Sample

Size N nMean

VarianceStandard Deviation S

Coefficient of Variation CV cv


Coefficient of Variation

This coefficient provides a proportionate measure of variation, e.g.

A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.


Measures of Variability

If data are symmetric, with no serious outliers, use range and standard deviation.

If comparing variation across two data sets, use coefficient of variation.

The measures of variability introduced in this section can be used only for interval data.


Your score

4.3 Measures of Relative Standing and Box Plots

Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set.

Percentile（百分位數） The pth percentile of a set of measurements is the value for which

• p percent of the observations are less than that value• (100-p) percent of all the observations are greater than that value.

Example• Suppose your score is the 60% percentile of a SAT test. Then

60% of all the scores lie here 40%


Quartiles （四分位數）We have special names for the 25th, 50th, and 75th

percentiles, namely quartiles.

The first or lower quartile is labeled Q1 = 25th percentile.

The second quartile, Q2 = 50th percentile (which is also the median).

The third or upper quartile, Q3 = 75th percentile.

We can also convert percentiles into quintiles (fifths) and deciles (tenths).


Commonly Used Percentiles

First (lower) decile = 10th percentileFirst (lower) quartile, Q1, = 25th percentile

Second (middle)quartile,Q2, = 50th percentile

Third quartile, Q3, = 75th percentileNinth (upper) decile = 90th percentile

Note: If your exam mark places you in the 80th percentile, that doesn’t mean you scored 80% on the exam – it means that 80% of your peers scored lower than you on the exam; its about your position relative to others.


Quartiles

Example

Find the quartiles of the following set of measurements 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8


Sort the observations

2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30

At most (.25)(15) = 3.75 observations should appear below the first quartile.Check the first 3 observations on the left hand side.

At most (.25)(15) = 3.75 observations should appear below the first quartile.Check the first 3 observations on the left hand side.

At most (.75)(15)=11.25 observations should appear above the first quartile.Check 11 observations on the right hand side.

At most (.75)(15)=11.25 observations should appear above the first quartile.Check 11 observations on the right hand side.

The first quartileThe first quartile

Comment:If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

Comment:If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

15 observations

Quartiles ： Solution


Find the location of any percentile using the formula

Example 4.11Calculate the 25th, 50th, and 75th percentile of the data in Example 4.1

Location of Percentiles

percentilePtheoflocationtheisLwhere100P

)1n(L

thP

P

percentilePtheoflocationtheisLwhere100P

)1n(L

thP

P


2 3

0 5

1

0

LocationLocation

Values

Location 3

Example 4.11 – solution After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.

3.75

The 2.75th locationTranslates to the value(.75)(5 – 0) = 3.75

75.210025

)110(L25 2.75



Example 4.11 – solution continued

The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.


5.510050

)110(L50


Example 4.11 – solution continued

The 75th percentile is one quarter of the distance between the eighth and ninth observation that is14+.25(22 – 14) = 16.


25.810075

)110(L75

Eighth observation

Ninthobservation



Please remember…

Lp determines the position in the data set where the percentile value lies, not the value of the percentile itself.

0 0 | 5 7 8 9 12 14 | 22 33

16

3.75position

8.25

position2.75


Quartiles and Variability

Quartiles can provide an idea about the shape of a histogram

Q1 Q2 Q3

Positively skewedhistogram

Q1 Q2 Q3

Negatively skewedhistogram


The quartiles can be used to create another measure of variability, the interquartile range, which is defined as follows:

This is a measure of the spread of the middle 50% of the observations

Large value indicates a large spread of the observations

Interquartile range = Q3 – Q1

Interquartile Range（四分位距）


1.5(Q3 – Q1) 1.5(Q3 – Q1)

This is a pictorial display that provides the main descriptive measures of the data set:

• L - the largest observation• Q3 - The upper quartile

• Q2 - The median

• Q1 - The lower quartile• S - The smallest observation

S Q1 Q2 Q3 LWhisker Whisker

Box Plot（箱形圖、盒鬚圖）


Example 4.14 (Xm02-01)

Box Plot

Bills42.1938.4529.2389.35118.04110.46...

Smallest = 0Q1 = 9.275Median = 26.905Q3 = 84.9425Largest = 119.63IQR = 75.6675Outliers = ()

Left hand boundary = 9.275–1.5(IQR)= -104.226Right hand boundary=84.9425+ 1.5(IQR)=198.4438

9.2750 84.9425 198.4438119.63-104.22626.905

No outliers are found


Additional Example - GMAT scoresCreate a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS)

Box Plot

GMAT512531461515...

Smallest = 449Q1 = 512Median = 537Q3 = 575Largest = 788IQR = 63Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675, )

537512449 575417.5512-1.5(IQR) 575+1.5(IQR)

669.5 788


Interpreting the box plot results• The scores range from 449 to 788.• About half the scores are smaller than 537, and about half are larger

than 537.• About half the scores lie between 512 and 575.• About a quarter lies below 512 and a quarter above 575.

Q1

512Q2

537Q3

575

25% 50% 25%

449 669.5

Box Plot

GMAT - continued


50%

25% 25%

The histogram is positively skewed

Q1

512Q2

537Q3

575

25% 50% 25%

449 669.5

Box Plot

GMAT - continued


Example 4.15 (Xm04-15) A study was organized to compare the quality of service

in 5 drive through restaurants. Interpret the results

Example 4.15 – solution Minitab box plot

Box Plot


100 200 300

1

2

3

4

5

C6

C7

Wendy’s service time appears to be the shortest and most consistent.

Hardee’s service time variability is the largest

Jack in the box is the slowest in service

Box Plot

Jack in the Box

Hardee’s

McDonalds

Wendy’s

Popeyes


100 200 300

1

2

3

4

5

C6

C7

Popeyes

Wendy’s

Hardee’s

Jack in the Box

Wendy’s service time appears to be the shortest and most consistent.

McDonalds

Hardee’s service time variability is the largest

Jack in the box is the slowest in service

Box Plot

Times are positively skewed

Times are symmetric


4.4 Measures of Linear RelationshipWe now present two numerical measures of linear

relationship that provide information as to the strength & direction of a linear relationship between two variables (if one exists).

They are the covariance and the coefficient of correlation.

Covariance（共變數） - is there any pattern to the way two variables move together?

Coefficient of correlation （相關係數） - how strong is the linear relationship between two variables?


N

)y)((xY)COV(X,covariance Population yixi

N

)y)((xY)COV(X,covariance Population yixi

x (y) is the population mean of the variable X (Y).N is the population size.

1-n)yy)(x(x

y) cov(x,covariance Sample ii

1-n)yy)(x(x

y) cov(x,covariance Sample ii

Covariance（共變數）

x (y) is the sample mean of the variable X (Y).n is the sample size.


Covariance

In much the same way there was a “shortcut” for calculating sample variance without having to calculate the sample mean, there is also a shortcut for calculating sample covariance without having to first calculate the mean:



Population Sample

Size N nMean

Variance S2

Standard Deviation S

Coefficient of Variation CV cvCovariance Sxy


Compare the following three sets

Covariance

xi yi (x – x) (y – y) (x – x)(y – y)267

132027

-312

-707

21014

x=5 y =20 Cov(x,y)=17.5

xi yi (x – x) (y – y) (x – x)(y – y)267

272013

-312

70-7

-210-14

x=5 y =20 Cov(x,y)=-17.5

xi yi

267

202713

Cov(x,y) = -3.5

x=5 y =20


Covariance Illustrated

Consider the following three sets of data (textbook §4.5)

In each set, the values of X are the same, and the value for Y are the same; the only thing that’s changed is the order of the Y’s.

In set #1, as X increases so does Y; Sxy is large & positive

In set #2, as X increases, Y decreases; Sxy is large & negative

In set #3, as X increases, Y doesn’t move in any particular way; Sxy is “small”


Covariance (Generally speaking)

When two variables move in the same direction (both increase or both decrease), the covariance will be a large positive number.

When two variables move in opposite directions, the covariance is a large negative number.

When there is no particular pattern, the covariance is a small number （ close to zero ） .


Covariance

Y

X

ⅠⅡ

Ⅲ Ⅳ

( )x X 0

( )x X 0

( )y Y 0

( )y Y 0( )y Y 0

( )x X 0

( )x X 0( )y Y 0

Y

X

COV(X, Y) ＞ 0

COV(X, Y) ＞ 0

COV(X, Y) ＜ 0

COV(X, Y) ＜ 0


Covariance

COV(X, Y) ＞ 0 COV(X, Y) ＜ 0 COV(X, Y) 0


This coefficient answers the question: How strong is the association between X and Y.

yx

)Y,X(COV

ncorrelatio oft coefficien Population

yx

)Y,X(COV

ncorrelatio oft coefficien Population

yxss)Y,Xcov(

r

ncorrelatio oft coefficien Sample

yxss

)Y,Xcov(r

ncorrelatio oft coefficien Sample

The coefficient of correlation（相關係數）

Greek letter “rho”



Population Sample

Size N nMean

Variance S2

Standard Deviation SCoefficient of Variation CV cv

Covariance Sxy

Coefficient of Correlation r


Coefficient of Correlation

The advantage of the coefficient of correlation over covariance is that it has fixed range from -1 to +1, thus:

If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).

If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).

No straight line relationship is indicated by a coefficient close to zero.


COV(X,Y)=0 or r =

+1

0

-1

Strong positive linear relationship

No linear relationship

Strong negative linear relationship

or

COV(X,Y)>0

COV(X,Y)<0





Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.

Solution We believe GMAT affects GPA. Thus

• GMAT is labeled X• GPA is labeled Y

The coefficient of correlation and the covariance Example 4.16


1 599 9.6 358801 92.16 5750.4

2 689 8.8 474721 77.44 6063.2

3 584 7.4 341056 54.76 4321.6

4 631 10 398161 100 6310

11 593 8.8 351649 77.44 5218.4

12 683 8 466489 64 5464

Total 7,587 106.4 4,817,755 957.2 67,559.2

Student x y x2 y2 xy

………………………………………………….

n

xx

ns

n

yxyx

n

yx

i

iiii

222

1

1

1

1

),cov(

FormulasShortcut

The coefficient of correlation and the covariance Example 4.16

cov(x,y)=(1/12-1)[67,559.2-(7587)(106.4)/12]=26.16

Sx = {(1/12-1)[4,817,755-(7587)2/12)]}.5=43.56Sy = (similar to Sx ) = 1.12

r = cov(x,y)/SxSy = 26.16/(43.56)(1.12) = .5362


Use the Covariance option in Data Analysis If your version of Excel returns the population covariance

and variances, multiply each one by n/n-1 to obtain the corresponding sample values.

Use the Correlation option to produce the correlation matrix.

The coefficient of correlation and the covariance Example 4.16 – Excel

GPA GMAT

GPA 1.15

GMAT 23.98 1739.52

GPA GMAT

GPA 1.25

GMAT 26.16 1897.66

Variance-Covariance MatrixPopulation values

Sample values

Population values

Sample values

1212-1


Interpretation The covariance (26.16) indicates that GMAT score and

performance in the MBA program are positively related. The coefficient of correlation (.5365) indicates that there

is a moderately strong positive linear relationship between GMAT and MBA GPA.

The coefficient of correlation and the covariance Example 4.16 – Excel


Least Squares Method （最小平方法）Recall, the slope-intercept equation for a line is expressed in these terms:

y = mx + b

Where:m is the slope of the lineb is the y-intercept.

If we’ve determined there is a linear relationship between two variables with covariance and the coefficient of correlation, can we determine a linear function of the relationship?


The Least Squares Method

…produces a straight line drawn through the points so that the sum of squared deviations between the points and the line is minimized. This line is represented by the equation:

b0 (“b” naught) is the y-intercept, b1 is the slope, and (“y” hat) is the value of y determined by the line.


The least Squares Method

Different lines generate different errors, thus different sum of squares of errors. There is a line that minimizes the sum of squared errors.

Errors

X

YErrors


We are seeking a line that best fits the data when two variables are (presumably) related to one another.

We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized.

2ii

n

1i)yy(Minimize

The actual y value of point i The y value of point icalculated from the equation i10i xbby



The least Squares Method

The coefficients b0 and b1 of the line that minimizes the sum of squares of errors are calculated from the data.

n

x

xandn

y

ywhere

xbyb

xx

yyxx

s

yxb

n

ii

n

ii

n

ii

n

iii

x

11

10

1

2

121 ,

)(

))((),cov(

n

x

xandn

y

ywhere

xbyb

xx

yyxx

s

yxb

n

ii

n

ii

n

ii

n

iii

x

11

10

1

2

121 ,

)(

))((),cov(


145.)25.632)(0138(.87.8

87.812

4.106

25.63212

587,7

0138.2.1897

16.26),cov(

10

21

xbybn

yy

n

xx

s

yxb

i

i

x

145.)25.632)(0138(.87.8

87.812

4.106

25.63212

587,7

0138.2.1897

16.26),cov(

10

21

xbybn

yy

n

xx

s

yxb

i

i

xScatter Diagram

y = 0.1496 + 0.0138x

6

8

10

12

500 600 700 800

Scatter Diagram

y = 0.1496 + 0.0138x

6

8

10

12

500 600 700 800

Example 4.17 Find the least squares line for Example 4.16 (Xm04-16.xls)


Documents

Numerical Descriptive Techniques