107
Numerical Descriptive Techniques Chapter 4

Numerical Descriptive Techniques

Embed Size (px)

DESCRIPTION

Numerical Descriptive Techniques. Chapter 4. Introduction. Recall Chapter 2, where we used graphical techniques to describe data:. While this histogram provides some new insight, other interesting questions (e.g. what is the class average? what is the mark spread?) go unanswered. - PowerPoint PPT Presentation

Citation preview

Page 1: Numerical Descriptive Techniques

Numerical Descriptive Techniques

Chapter 4

Page 2: Numerical Descriptive Techniques

4-22007 會計資訊系統計學 ( 一 ) 上課投影片

Introduction

Recall Chapter 2, where we used graphical techniques to describe data:

While this histogram provides some new insight, other interesting questions (e.g. what is the class average? what is the mark spread?) go unanswered.

Page 3: Numerical Descriptive Techniques

4-32007 會計資訊系統計學 ( 一 ) 上課投影片

Numerical Descriptive Techniques

Measures of Central Location (中央位置) Mean, Median, Mode

Measures of Variability (離散程度) Range, Standard Deviation, Variance, Coefficient of Variation

Measures of Relative Standing (相對位置) Percentiles, Quartiles

Measures of Linear Relationship (線性關係) Covariance, Correlation, Least Squares Line

Page 4: Numerical Descriptive Techniques

4-42007 會計資訊系統計學 ( 一 ) 上課投影片

4.1 Measures of Central Location

Usually, we focus our attention on two types of measures when describing population characteristics: Central location (e.g. average) Variability or spread

The measure of central location reflects the locations of all the actual data points.

Page 5: Numerical Descriptive Techniques

4-52007 會計資訊系統計學 ( 一 ) 上課投影片

With one data pointclearly the central location is at the pointitself.

4.1 Measures of Central Location

The measure of central location reflects the locations of all the actual data points.

How?

But if the third data point appears on the left hand-sideof the midrange, it should “pull”the central location to the left.

With two data points,the central location should fall in the middlebetween them (in order to reflect the location ofboth of them).

Page 6: Numerical Descriptive Techniques

4-62007 會計資訊系統計學 ( 一 ) 上課投影片

Sum of the observationsNumber of observations

Mean =

This is the most popular and useful measure of central location

The Arithmetic Mean(算術平均數)

Page 7: Numerical Descriptive Techniques

4-72007 會計資訊系統計學 ( 一 ) 上課投影片

Notation

When referring to the number of observations in a population, we use uppercase letter N

When referring to the number of observations in a sample, we use lower case letter n

The arithmetic mean for a population is denoted with Greek letter “mu”: (母體平均數)

The arithmetic mean for a sample is denoted with an “x-bar”. (樣本平均數)

Page 8: Numerical Descriptive Techniques

4-82007 會計資訊系統計學 ( 一 ) 上課投影片

Statistics is a pattern language

Population

Sample

Size N n

Mean

Page 9: Numerical Descriptive Techniques

4-92007 會計資訊系統計學 ( 一 ) 上課投影片

nx

x in

1i

Sample mean Population mean

Nx i

N1i

Sample size Population size

nx

x in

1i

The Arithmetic Mean

Page 10: Numerical Descriptive Techniques

4-102007 會計資訊系統計學 ( 一 ) 上課投影片

Statistics is a pattern language

Population

Sample

Size N n

Mean

Page 11: Numerical Descriptive Techniques

4-112007 會計資訊系統計學 ( 一 ) 上課投影片

10

...

101021

101 xxxx

x ii

• Example 4.1The reported time on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

00 77 222211.011.0

• Example 4.2

Suppose the telephone bills of Example 2.1 representthe population of measurements. The population mean is

200x...xx

200x 20021i

2001i 42.1942.19 38.4538.45 45.7745.77

43.5943.59

The Arithmetic Mean

Page 12: Numerical Descriptive Techniques

4-122007 會計資訊系統計學 ( 一 ) 上課投影片

When many of the measurements have the same value, the measurement can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees were recorded as follows:

Number of children per family 0 1 2 3Number of families 3 4 7 2

• Additional Example

16 employees

+ + +

5.116

)3(2)2(7)1(4)0(316

x...xx16

xx 1621i

161i

The Arithmetic Mean

Page 13: Numerical Descriptive Techniques

4-132007 會計資訊系統計學 ( 一 ) 上課投影片

The Arithmetic Mean

…is appropriate for describing measurement data, e.g. heights of people, marks of student papers, etc.

…is seriously affected by extreme values called “outliers”. E.g. as soon as a billionaire moves into a neighborhood, the average household income increases beyond what it was previously!

Page 14: Numerical Descriptive Techniques

4-142007 會計資訊系統計學 ( 一 ) 上課投影片

Odd number of observations

0, 0, 5, 7, 8 9, 12, 14, 220, 0, 5, 7, 8, 9, 12, 14, 22, 330, 0, 5, 7, 8, 9, 12, 14, 22, 33

Even number of observations

Example 4.3

Find the median of the time on the internetfor the 10 adults of example 4.1

The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.

The Median(中位數)

Suppose only 9 adults were sampled (exclude, say, the longest time (33))

Comment

8.5, 8

Page 15: Numerical Descriptive Techniques

4-152007 會計資訊系統計學 ( 一 ) 上課投影片

The Mode of a set of observations is the value that occurs most frequently.

Set of data may have one mode (or modal class), or two or more modes.

The modal classFor large data setsthe modal class is much more relevant than a single-value mode.

The Mode(眾數)

Page 16: Numerical Descriptive Techniques

4-162007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.5Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22

Solution

All observation except “0” occur once. There are two “0”. Thus, the mode is zero.

Is this a good measure of central location? The value “0” does not reside at the center of this set

(compare with the mean = 11.0 and the mode = 8.5).

The Mode

Page 17: Numerical Descriptive Techniques

4-172007 會計資訊系統計學 ( 一 ) 上課投影片

Additional example The manager of a men’s store observes the waist size (in

inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.

The mode of this data set is 34 in.

This information seems to be valuable (for example, for the design of a new display in the store), much more than “ the median is 33.5 in.”

The Mode

Page 18: Numerical Descriptive Techniques

4-182007 會計資訊系統計學 ( 一 ) 上課投影片

Measures of Central Location

The mode of a set of observations is the value that occurs most frequently.

A set of data may have one mode (or modal class), or two, or more modes.

Mode is a useful for all data types, though mainly used for nominal data.

For large data sets the modal class is much more relevant than a single-value mode.

※ Sample and population modes are computed the same way.

Page 19: Numerical Descriptive Techniques

4-192007 會計資訊系統計學 ( 一 ) 上課投影片

=MODE(range) in Excel

Note: if you are using Excel for your data analysis and your data is multi-modal (i.e. there is more than one mode), Excel only calculates the smallest one.

You will have to use other techniques (i.e. histogram) to determine if your data is bimodal, trimodal, etc.

Page 20: Numerical Descriptive Techniques

4-202007 會計資訊系統計學 ( 一 ) 上課投影片

Additional example A professor of statistics wants to report the results of a midterm exam, taken by 100 students.

• The mean of the test marks is 73.90• The median of the test marks is 81• The mode of the test marks is 84

Describe the information each one provides.

The mean provides informationabout the over-all performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.

The Median indicates that half of the class received a grade below 81%, and half of the class received a grade above 81%. A student can use this statistic to place his mark relative to other students in the class.

The mode must be used when data are nominal If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

The mode must be used when data are nominal If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.

The Mean, Median and Mode

Page 21: Numerical Descriptive Techniques

4-212007 會計資訊系統計學 ( 一 ) 上課投影片

Relationship among Mean, Median, and Mode

If a distribution is symmetrical, the mean, median and mode coincide

If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.A positively skewed distribution

(“skewed to the right”)

MeanMedian

Mode

Page 22: Numerical Descriptive Techniques

4-222007 會計資訊系統計學 ( 一 ) 上課投影片

If a distribution is symmetrical, the mean, median and mode coincide

If a distribution is non symmetrical, and skewed to the left or to the right, the three measures differ.

A positively skewed distribution(“skewed to the right”)

MeanMedian

Mode MeanMedian

Mode

A negatively skewed distribution(“skewed to the left”)

Relationship among Mean, Median, and Mode

Page 23: Numerical Descriptive Techniques

4-232007 會計資訊系統計學 ( 一 ) 上課投影片

Mean, Median, Mode

If data are symmetric, the mean, median, and mode will be approximately the same.

If data are multimodal, report the mean, median and/or mode for each subgroup.

If data are skewed, report the median.

Page 24: Numerical Descriptive Techniques

4-242007 會計資訊系統計學 ( 一 ) 上課投影片

Mean, Median, & Modes for Ordinal & Nominal Data

For ordinal and nominal data the calculation of the mean is not valid.

Median is appropriate for ordinal data.

For nominal data, a mode calculation is useful for determining highest frequency but not “central location”.

Page 25: Numerical Descriptive Techniques

4-252007 會計資訊系統計學 ( 一 ) 上課投影片

This is a measure of the average growth rate.Let Ri denote the the rate of return in period i

(i=1,2…,n). The geometric mean of the returns R1, R2, …,Rn is the constant Rg that produces the same terminal wealth at the end of period n as do the actual returns for the n periods.

The Geometric Mean(幾何平均數)

Page 26: Numerical Descriptive Techniques

4-262007 會計資訊系統計學 ( 一 ) 上課投影片

If the rate of return was Rg in everyperiod, the nth period return wouldbe calculated by:

ng )R1( )R1)...(R1)(R1( n21

For the given series of rate of returns the nth period return iscalculated by:

Rg is selected such that…

1)R1)...(R1)(R1(R nn21g 1)R1)...(R1)(R1(R n

n21g

The Geometric Mean

Page 27: Numerical Descriptive Techniques

4-272007 會計資訊系統計學 ( 一 ) 上課投影片

Finance Example

Suppose a 2-year investment of $1,000 grows by 100% to $2,000 in the first year, but loses 50% from $2,000 back to the original $1,000 in the second year. What is your average return?

Using the arithmetic mean, we have

This would indicate we should have $1,250 at the end of our investment, not $1,000.

Solving for the geometric mean yields a rate of 0%, which is correct.

The upper case Greek Letter “Pi” represents a product of terms…

Page 28: Numerical Descriptive Techniques

4-282007 會計資訊系統計學 ( 一 ) 上課投影片

Additional Example A firm’s sales were $1,000,000 three years ago. Sales have grown annually by 20%, 10%, -5%. Find the geometric mean rate of growth in sales.

Solution Since Rg is the geometric mean

(1+Rg)3 = (1+.2)(1+.1)(1-.05)= 1.2540

Thus, %.84.7or,0784.1)05.1)(1.1)(2.1(R 3

g

The Geometric Mean

Page 29: Numerical Descriptive Techniques

4-292007 會計資訊系統計學 ( 一 ) 上課投影片

Measures of Central Location : Summary Compute the Mean to Describe the central location of a single set of interval data

Compute the Median to Describe the central location of a single set of interval or ordinal data

Compute the Mode to Describe a single set of nominal data

Compute the Geometric Mean to Describe a single set of interval data based on growth rates

Page 30: Numerical Descriptive Techniques

4-302007 會計資訊系統計學 ( 一 ) 上課投影片

4.2 Measures of variability

Measures of central location fail to tell the whole story about the distribution.

A question of interest still remains unanswered:

How much are the observations spread outaround the mean value?

Page 31: Numerical Descriptive Techniques

4-312007 會計資訊系統計學 ( 一 ) 上課投影片

4.2 Measures of variability

Observe two hypothetical data sets:

The average value provides a good representation of theobservations in the data set.

Small variability

This data set is now changing to...

Page 32: Numerical Descriptive Techniques

4-322007 會計資訊系統計學 ( 一 ) 上課投影片

4.2 Measures of variability

Observe two hypothetical data sets:

The average value provides a good representation of theobservations in the data set.

Small variability

Larger variability

The same average value does not provide as good representation of theobservations in the data set as before.

Page 33: Numerical Descriptive Techniques

4-332007 會計資訊系統計學 ( 一 ) 上課投影片

Measures of Variability

Measures of central location fail to tell the whole story about the distribution; that is, how much are the observations spread out around the mean value?

For example, two sets of class grades are shown. The mean (=50) is the same in each case…

But, the red class has greater variability than the blue class.

Page 34: Numerical Descriptive Techniques

4-342007 會計資訊系統計學 ( 一 ) 上課投影片

The range of a set of observations is the difference between the largest and smallest observations.

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.

? ? ?

But, how do all the observations spread out?

Smallestobservation

Largestobservation

The range cannot assist in answering this questionRange

The range(全距)

Page 35: Numerical Descriptive Techniques

4-352007 會計資訊系統計學 ( 一 ) 上課投影片

Range

The range is the simplest measure of variability, calculated as:

Range = Largest observation – Smallest observation

E.g. Data: {4, 4, 4, 4, 50} Range = 46 Data: {4, 8, 15, 24, 39, 50} Range = 46 The range is the same in both cases, but the data sets have very different distributions…

Page 36: Numerical Descriptive Techniques

4-362007 會計資訊系統計學 ( 一 ) 上課投影片

Range

Its major advantage is the ease with which it can be computed.

Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.

Hence we need a measure of variability that incorporates all the data and not just two observations. Hence…

Page 37: Numerical Descriptive Techniques

4-372007 會計資訊系統計學 ( 一 ) 上課投影片

Variance (變異數)

Variance and its related measure, standard deviation, are arguably the most important statistics. Used to measure variability, they also play a vital role in almost all statistical inference procedures.

Population variance is denoted by (母體變異數) (Lower case Greek letter “sigma” squared)

Sample variance is denoted by (樣本變異數) (Lower case “S” squared)

Page 38: Numerical Descriptive Techniques

4-382007 會計資訊系統計學 ( 一 ) 上課投影片

Statistics is a pattern language

Population

Sample

Size N n

Mean

Variance

Page 39: Numerical Descriptive Techniques

4-392007 會計資訊系統計學 ( 一 ) 上課投影片

This measure reflects the dispersion of all the observations

The variance of a population of size N x1, x2,…,xN

whose mean is is defined as

The variance of a sample of n observationsx1, x2, …,xn whose mean is is defined asx

N

)x( 2i

N1i2

N

)x( 2i

N1i2

1n

)xx(s

2i

n1i2

1n

)xx(s

2i

n1i2

The Variance

Page 40: Numerical Descriptive Techniques

4-402007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.7 The following sample consists of the number of jobs six

students applied for: 17, 15, 23, 7, 9, 13. Finds its mean and variance

Solution

2

2222

in

1i2

jobs2.33

)1413...()1415()1417(16

11n

)xx(s

jobs146

846

13972315176

xx i

61i

The Variance

Page 41: Numerical Descriptive Techniques

4-412007 會計資訊系統計學 ( 一 ) 上課投影片

2

2222

2i

n1i2

i

n

1i

2

jobs2.33

6

13...151713...1517

16

1

n

)x(x

1n

1s

The Variance – Shortcut method

Page 42: Numerical Descriptive Techniques

4-422007 會計資訊系統計學 ( 一 ) 上課投影片

Why not use the sum of deviations?

Consider two small populations:

1098

74 10

11 12

13 16

8-10= -2

9-10= -111-10= +1

12-10= +2

4-10 = - 6

7-10 = -3

13-10 = +3

16-10 = +6

Sum = 0

Sum = 0

The mean of both populations is 10...

…but measurements in Bare more dispersedthen those in A.

A measure of dispersion Should agrees with this observation.

Can the sum of deviationsBe a good measure of dispersion?

A

B

The sum of deviations is zero for both populations, therefore, is not a good measure of dispersion.

Page 43: Numerical Descriptive Techniques

4-432007 會計資訊系統計學 ( 一 ) 上課投影片

Let us calculate the variance of the two populations

185

)1016()1013()1010()107()104( 222222B

25

)1012()1011()1010()109()108( 222222A

Why is the variance defined as the average squared deviation?Why not use the sum of squared deviations as a measure of variation instead?

After all, the sum of squared deviations increases in magnitude when the variationof a data set increases!!

The Variance

Page 44: Numerical Descriptive Techniques

4-442007 會計資訊系統計學 ( 一 ) 上課投影片

Which data set has a larger dispersion?Which data set has a larger dispersion?

1 3 1 32 5

A B

Data set Bis more dispersedaround the mean

Let us calculate the sum of squared deviations for both data sets.

The Variance

Page 45: Numerical Descriptive Techniques

4-452007 會計資訊系統計學 ( 一 ) 上課投影片

1 3 1 32 5

A B

SumA = (1-2)2 +…+(1-2)2 +(3-2)2 +… +(3-2)2= 10SumB = (1-3)2 + (5-3)2 = 8

SumA > SumB. This is inconsistent with the observation that set B is more dispersed.

The Variance

Page 46: Numerical Descriptive Techniques

4-462007 會計資訊系統計學 ( 一 ) 上課投影片

1 3 1 32 5

A B

However, when calculated on “per observation” basis (variance), the data set dispersions are properly ranked.

A2 = SumA/N = 10/10 = 1

B2 = SumB/N = 8/2 = 4

The Variance

Page 47: Numerical Descriptive Techniques

4-472007 會計資訊系統計學 ( 一 ) 上課投影片

The standard deviation of a set of observations is the square root of the variance .

2

2

:deviationandardstPopulation

ss:deviationstandardSample

2

2

:deviationandardstPopulation

ss:deviationstandardSample

Standard Deviation(標準差)

Page 48: Numerical Descriptive Techniques

4-482007 會計資訊系統計學 ( 一 ) 上課投影片

Statistics is a pattern language

Population

Sample

Size N nMean

Variance

Standard Deviation

Page 49: Numerical Descriptive Techniques

4-492007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.8 To examine the consistency of shots for a new innovative

golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.

The distances were recorded. Which 7-iron is more consistent?

Standard Deviation

Page 50: Numerical Descriptive Techniques

4-502007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.8 – solution

Standard Deviation

Excel printout, from the “Descriptive Statistics” sub-menu.

Current Innovation

Mean 150.5467 Mean 150.1467Standard Error 0.668815 Standard Error 0.357011Median 151 Median 150Mode 150 Mode 149Standard Deviation 5.792104 Standard Deviation 3.091808Sample Variance 33.54847 Sample Variance 9.559279Kurtosis 0.12674 Kurtosis -0.88542Skewness -0.42989 Skewness 0.177338Range 28 Range 12Minimum 134 Minimum 144Maximum 162 Maximum 156Sum 11291 Sum 11261Count 75 Count 75

The innovation club is more consistent, and because the means are close, is considered a better club

Page 51: Numerical Descriptive Techniques

4-512007 會計資訊系統計學 ( 一 ) 上課投影片

Additional Example • Rates of return over the past 10 years for two mutual funds are

shown below. Which one have a higher level of risk?

Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05

Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4

Standard Deviation

Page 52: Numerical Descriptive Techniques

4-522007 會計資訊系統計學 ( 一 ) 上課投影片

SolutionLet us use the Excel printout that is run from the “Descriptive statistics” sub-menu.

Fund A Fund B

Mean 16 Mean 12Standard Error 5.295 Standard Error 3.152Median 14.6 Median 11.75Mode #N/A Mode #N/AStandard Deviation 16.74 Standard Deviation 9.969Sample Variance 280.3 Sample Variance 99.37Kurtosis -1.34 Kurtosis -0.46Skewness 0.217 Skewness 0.107Range 49.1 Range 30.6Minimum -6.2 Minimum -2.8Maximum 42.9 Maximum 27.8Sum 160 Sum 120Count 10 Count 10

Fund A Fund B

Mean 16 Mean 12Standard Error 5.295 Standard Error 3.152Median 14.6 Median 11.75Mode #N/A Mode #N/AStandard Deviation 16.74 Standard Deviation 9.969Sample Variance 280.3 Sample Variance 99.37Kurtosis -1.34 Kurtosis -0.46Skewness 0.217 Skewness 0.107Range 49.1 Range 30.6Minimum -6.2 Minimum -2.8Maximum 42.9 Maximum 27.8Sum 160 Sum 120Count 10 Count 10

Fund A should be considered riskier because its standard deviation is larger

Standard Deviation

Page 53: Numerical Descriptive Techniques

4-532007 會計資訊系統計學 ( 一 ) 上課投影片

Interpreting Standard Deviation

The standard deviation can be used to compare the variability of several distributions make a statement about the general shape of a distribution.

The empirical rule (經驗法則) : If a sample of observations has a bell-shaped distribution, the interval

tsmeasuremen the of 68%ely approximat contains )sx,sx(

tsmeasuremen the of 95%ely approximat contains )s2x,s2x( tsmeasuremen the of 99.7%ely approximat contains )s3x,s3x(

Page 54: Numerical Descriptive Techniques

4-542007 會計資訊系統計學 ( 一 ) 上課投影片

The Empirical Rule

Approximately 68% of all observations fallwithin one standard deviation of the mean.

Approximately 95% of all observations fallwithin two standard deviations of the mean.

Approximately 99.7% of all observations fallwithin three standard deviations of the mean.

Page 55: Numerical Descriptive Techniques

4-552007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.9A statistics practitioner wants to describe the way returns on investment are distributed. The mean return = 10% The standard deviation of the return = 8% The histogram is bell shaped.

Interpreting Standard Deviation

Page 56: Numerical Descriptive Techniques

4-562007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.9 – solution The empirical rule can be applied (bell shaped histogram) Describing the return distribution

Approximately 68% of the returns lie between 2% and 18% [10 – 1(8), 10 + 1(8)]

Approximately 95% of the returns lie between -6% and 26% [10 – 2(8), 10 + 2(8)]

Approximately 99.7% of the returns lie between -14% and 34% [10 – 3(8), 10 + 3(8)]

Interpreting Standard Deviation

Page 57: Numerical Descriptive Techniques

4-572007 會計資訊系統計學 ( 一 ) 上課投影片

Chebysheff’s Theorem (柴比氏定理)A more general interpretation of the standard deviation is

derived from Chebysheff’s Theorem, which applies to all shapes of histograms (not just bell shaped).

The proportion of observations in any sample that liewithin k standard deviations of the mean is at least:

For k=2 (say), the theorem states that at least 3/4 of all observations lie within 2 standard deviations of the mean. This is a “lower bound” compared to Empirical Rule’s approximation (95%).

Page 58: Numerical Descriptive Techniques

4-582007 會計資訊系統計學 ( 一 ) 上課投影片

The proportion of observations in any sample that lie within k standard deviations of the mean is at least 1-1/k2 for k > 1.

This theorem is valid for any set of measurements (sample, population) of any shape!!

K Interval Chebysheff Empirical Rule1 at least 0% approximately 68%2 at least 75% approximately 95%3 at least 89% approximately

99.7%

s2x,s2x sx,sx

s3x,s3x

The Chebysheff’s Theorem

(1-1/12)

(1-1/22)

(1-1/32)

Page 59: Numerical Descriptive Techniques

4-592007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.10 The annual salaries of the employees of a chain of computer

stores produced a positively skewed histogram. The mean and standard deviation are $28,000 and $3,000,respectively. What can you say about the salaries at this chain?

SolutionAt least 75% of the salaries lie between $22,000 and $34,000

28000 – 2(3000) 28000 + 2(3000)

At least 88.9% of the salaries lie between $19,000 and $37,000 28000 – 3(3000) 28000 + 3(3000)

The Chebysheff’s Theorem

Page 60: Numerical Descriptive Techniques

4-602007 會計資訊系統計學 ( 一 ) 上課投影片

Coefficient of Variation (變異係數)The coefficient of variation of a set of observations

is the standard deviation of the observations divided by their mean, that is:

Population coefficient of variation = CV =

Sample coefficient of variation = cv =

Page 61: Numerical Descriptive Techniques

4-612007 會計資訊系統計學 ( 一 ) 上課投影片

Statistics is a pattern language

Population

Sample

Size N nMean

VarianceStandard Deviation S

Coefficient of Variation CV cv

Page 62: Numerical Descriptive Techniques

4-622007 會計資訊系統計學 ( 一 ) 上課投影片

Coefficient of Variation

This coefficient provides a proportionate measure of variation, e.g.

A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.

Page 63: Numerical Descriptive Techniques

4-632007 會計資訊系統計學 ( 一 ) 上課投影片

Measures of Variability

If data are symmetric, with no serious outliers, use range and standard deviation.

If comparing variation across two data sets, use coefficient of variation.

The measures of variability introduced in this section can be used only for interval data.

Page 64: Numerical Descriptive Techniques

4-642007 會計資訊系統計學 ( 一 ) 上課投影片

Your score

4.3 Measures of Relative Standing and Box Plots

Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set.

Percentile(百分位數) The pth percentile of a set of measurements is the value for which

• p percent of the observations are less than that value• (100-p) percent of all the observations are greater than that value.

Example• Suppose your score is the 60% percentile of a SAT test. Then

60% of all the scores lie here 40%

Page 65: Numerical Descriptive Techniques

4-652007 會計資訊系統計學 ( 一 ) 上課投影片

Quartiles (四分位數)We have special names for the 25th, 50th, and 75th

percentiles, namely quartiles.

The first or lower quartile is labeled Q1 = 25th percentile.

The second quartile, Q2 = 50th percentile (which is also the median).

The third or upper quartile, Q3 = 75th percentile.

We can also convert percentiles into quintiles (fifths) and deciles (tenths).

Page 66: Numerical Descriptive Techniques

4-662007 會計資訊系統計學 ( 一 ) 上課投影片

Commonly Used Percentiles

First (lower) decile = 10th percentileFirst (lower) quartile, Q1, = 25th percentile

Second (middle)quartile,Q2, = 50th percentile

Third quartile, Q3, = 75th percentileNinth (upper) decile = 90th percentile

Note: If your exam mark places you in the 80th percentile, that doesn’t mean you scored 80% on the exam – it means that 80% of your peers scored lower than you on the exam; its about your position relative to others.

Page 67: Numerical Descriptive Techniques

4-672007 會計資訊系統計學 ( 一 ) 上課投影片

Quartiles

Example

Find the quartiles of the following set of measurements 7, 8, 12, 17, 29, 18, 4, 27, 30, 2, 4, 10, 21, 5, 8

Page 68: Numerical Descriptive Techniques

4-682007 會計資訊系統計學 ( 一 ) 上課投影片

Sort the observations

2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30

At most (.25)(15) = 3.75 observations should appear below the first quartile.Check the first 3 observations on the left hand side.

At most (.25)(15) = 3.75 observations should appear below the first quartile.Check the first 3 observations on the left hand side.

At most (.75)(15)=11.25 observations should appear above the first quartile.Check 11 observations on the right hand side.

At most (.75)(15)=11.25 observations should appear above the first quartile.Check 11 observations on the right hand side.

The first quartileThe first quartile

Comment:If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

Comment:If the number of observations is even, two observations remain unchecked. In this case choose the midpoint between these two observations.

15 observations

Quartiles : Solution

Page 69: Numerical Descriptive Techniques

4-692007 會計資訊系統計學 ( 一 ) 上課投影片

Find the location of any percentile using the formula

Example 4.11Calculate the 25th, 50th, and 75th percentile of the data in Example 4.1

Location of Percentiles

percentilePtheoflocationtheisLwhere100P

)1n(L

thP

P

percentilePtheoflocationtheisLwhere100P

)1n(L

thP

P

Page 70: Numerical Descriptive Techniques

4-702007 會計資訊系統計學 ( 一 ) 上課投影片

2 3

0 5

1

0

LocationLocation

Values

Location 3

Example 4.11 – solution After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.

3.75

The 2.75th locationTranslates to the value(.75)(5 – 0) = 3.75

75.210025

)110(L25 2.75

Location of Percentiles

Page 71: Numerical Descriptive Techniques

4-712007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.11 – solution continued

The 50th percentile is halfway between the fifth and sixth observations (in the middle between 8 and 9), that is 8.5.

Location of Percentiles

5.510050

)110(L50

Page 72: Numerical Descriptive Techniques

4-722007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.11 – solution continued

The 75th percentile is one quarter of the distance between the eighth and ninth observation that is14+.25(22 – 14) = 16.

Location of Percentiles

25.810075

)110(L75

Eighth observation

Ninthobservation

Page 73: Numerical Descriptive Techniques

4-732007 會計資訊系統計學 ( 一 ) 上課投影片

Location of Percentiles

Please remember…

Lp determines the position in the data set where the percentile value lies, not the value of the percentile itself.

0 0 | 5 7 8 9 12 14 | 22 33

16

3.75position

8.25

position2.75

Page 74: Numerical Descriptive Techniques

4-742007 會計資訊系統計學 ( 一 ) 上課投影片

Quartiles and Variability

Quartiles can provide an idea about the shape of a histogram

Q1 Q2 Q3

Positively skewedhistogram

Q1 Q2 Q3

Negatively skewedhistogram

Page 75: Numerical Descriptive Techniques

4-752007 會計資訊系統計學 ( 一 ) 上課投影片

The quartiles can be used to create another measure of variability, the interquartile range, which is defined as follows:

This is a measure of the spread of the middle 50% of the observations

Large value indicates a large spread of the observations

Interquartile range = Q3 – Q1

Interquartile Range(四分位距)

Page 76: Numerical Descriptive Techniques

4-762007 會計資訊系統計學 ( 一 ) 上課投影片

1.5(Q3 – Q1) 1.5(Q3 – Q1)

This is a pictorial display that provides the main descriptive measures of the data set:

• L - the largest observation• Q3 - The upper quartile

• Q2 - The median

• Q1 - The lower quartile• S - The smallest observation

S Q1 Q2 Q3 LWhisker Whisker

Box Plot(箱形圖、盒鬚圖)

Page 77: Numerical Descriptive Techniques

4-772007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.14 (Xm02-01)

Box Plot

Bills42.1938.4529.2389.35118.04110.46...

Smallest = 0Q1 = 9.275Median = 26.905Q3 = 84.9425Largest = 119.63IQR = 75.6675Outliers = ()

Left hand boundary = 9.275–1.5(IQR)= -104.226Right hand boundary=84.9425+ 1.5(IQR)=198.4438

9.2750 84.9425 198.4438119.63-104.22626.905

No outliers are found

Page 78: Numerical Descriptive Techniques

4-782007 會計資訊系統計學 ( 一 ) 上課投影片

Additional Example - GMAT scoresCreate a box plot for the data regarding the GMAT scores of 200 applicants (see GMAT.XLS)

Box Plot

GMAT512531461515...

Smallest = 449Q1 = 512Median = 537Q3 = 575Largest = 788IQR = 63Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675, )

537512449 575417.5512-1.5(IQR) 575+1.5(IQR)

669.5 788

Page 79: Numerical Descriptive Techniques

4-792007 會計資訊系統計學 ( 一 ) 上課投影片

Interpreting the box plot results• The scores range from 449 to 788.• About half the scores are smaller than 537, and about half are larger

than 537.• About half the scores lie between 512 and 575.• About a quarter lies below 512 and a quarter above 575.

Q1

512Q2

537Q3

575

25% 50% 25%

449 669.5

Box Plot

GMAT - continued

Page 80: Numerical Descriptive Techniques

4-802007 會計資訊系統計學 ( 一 ) 上課投影片

50%

25% 25%

The histogram is positively skewed

Q1

512Q2

537Q3

575

25% 50% 25%

449 669.5

Box Plot

GMAT - continued

Page 81: Numerical Descriptive Techniques

4-812007 會計資訊系統計學 ( 一 ) 上課投影片

Example 4.15 (Xm04-15) A study was organized to compare the quality of service

in 5 drive through restaurants. Interpret the results

Example 4.15 – solution Minitab box plot

Box Plot

Page 82: Numerical Descriptive Techniques

4-822007 會計資訊系統計學 ( 一 ) 上課投影片

100 200 300

1

2

3

4

5

C6

C7

Wendy’s service time appears to be the shortest and most consistent.

Hardee’s service time variability is the largest

Jack in the box is the slowest in service

Box Plot

Jack in the Box

Hardee’s

McDonalds

Wendy’s

Popeyes

Page 83: Numerical Descriptive Techniques

4-832007 會計資訊系統計學 ( 一 ) 上課投影片

100 200 300

1

2

3

4

5

C6

C7

Popeyes

Wendy’s

Hardee’s

Jack in the Box

Wendy’s service time appears to be the shortest and most consistent.

McDonalds

Hardee’s service time variability is the largest

Jack in the box is the slowest in service

Box Plot

Times are positively skewed

Times are symmetric

Page 84: Numerical Descriptive Techniques

4-842007 會計資訊系統計學 ( 一 ) 上課投影片

4.4 Measures of Linear RelationshipWe now present two numerical measures of linear

relationship that provide information as to the strength & direction of a linear relationship between two variables (if one exists).

They are the covariance and the coefficient of correlation.

Covariance(共變數) - is there any pattern to the way two variables move together?

Coefficient of correlation (相關係數) - how strong is the linear relationship between two variables?

Page 85: Numerical Descriptive Techniques

4-852007 會計資訊系統計學 ( 一 ) 上課投影片

N

)y)((xY)COV(X,covariance Population yixi

N

)y)((xY)COV(X,covariance Population yixi

x (y) is the population mean of the variable X (Y).N is the population size.

1-n)yy)(x(x

y) cov(x,covariance Sample ii

1-n)yy)(x(x

y) cov(x,covariance Sample ii

Covariance(共變數)

x (y) is the sample mean of the variable X (Y).n is the sample size.

Page 86: Numerical Descriptive Techniques

4-862007 會計資訊系統計學 ( 一 ) 上課投影片

Covariance

In much the same way there was a “shortcut” for calculating sample variance without having to calculate the sample mean, there is also a shortcut for calculating sample covariance without having to first calculate the mean:

Page 87: Numerical Descriptive Techniques

4-872007 會計資訊系統計學 ( 一 ) 上課投影片

Statistics is a pattern language

Population Sample

Size N nMean

Variance S2

Standard Deviation S

Coefficient of Variation CV cvCovariance Sxy

Page 88: Numerical Descriptive Techniques

4-882007 會計資訊系統計學 ( 一 ) 上課投影片

Compare the following three sets

Covariance

xi yi (x – x) (y – y) (x – x)(y – y)267

132027

-312

-707

21014

x=5 y =20 Cov(x,y)=17.5

xi yi (x – x) (y – y) (x – x)(y – y)267

272013

-312

70-7

-210-14

x=5 y =20 Cov(x,y)=-17.5

xi yi

267

202713

Cov(x,y) = -3.5

x=5 y =20

Page 89: Numerical Descriptive Techniques

4-892007 會計資訊系統計學 ( 一 ) 上課投影片

Covariance Illustrated

Consider the following three sets of data (textbook §4.5)

In each set, the values of X are the same, and the value for Y are the same; the only thing that’s changed is the order of the Y’s.

In set #1, as X increases so does Y; Sxy is large & positive

In set #2, as X increases, Y decreases; Sxy is large & negative

In set #3, as X increases, Y doesn’t move in any particular way; Sxy is “small”

Page 90: Numerical Descriptive Techniques

4-902007 會計資訊系統計學 ( 一 ) 上課投影片

Covariance (Generally speaking)

When two variables move in the same direction (both increase or both decrease), the covariance will be a large positive number.

When two variables move in opposite directions, the covariance is a large negative number.

When there is no particular pattern, the covariance is a small number ( close to zero ) .

Page 91: Numerical Descriptive Techniques

4-912007 會計資訊系統計學 ( 一 ) 上課投影片

Covariance

Y

X

ⅠⅡ

Ⅲ Ⅳ

( )x X 0

( )x X 0

( )y Y 0

( )y Y 0( )y Y 0

( )x X 0

( )x X 0( )y Y 0

Y

X

COV(X, Y) > 0

COV(X, Y) > 0

COV(X, Y) < 0

COV(X, Y) < 0

Page 92: Numerical Descriptive Techniques

4-922007 會計資訊系統計學 ( 一 ) 上課投影片

Covariance

COV(X, Y) > 0 COV(X, Y) < 0 COV(X, Y) 0

Page 93: Numerical Descriptive Techniques

4-932007 會計資訊系統計學 ( 一 ) 上課投影片

This coefficient answers the question: How strong is the association between X and Y.

yx

)Y,X(COV

ncorrelatio oft coefficien Population

yx

)Y,X(COV

ncorrelatio oft coefficien Population

yxss)Y,Xcov(

r

ncorrelatio oft coefficien Sample

yxss

)Y,Xcov(r

ncorrelatio oft coefficien Sample

The coefficient of correlation(相關係數)

Greek letter “rho”

Page 94: Numerical Descriptive Techniques

4-942007 會計資訊系統計學 ( 一 ) 上課投影片

Statistics is a pattern language

Population Sample

Size N nMean

Variance S2

Standard Deviation SCoefficient of Variation CV cv

Covariance Sxy

Coefficient of Correlation r

Page 95: Numerical Descriptive Techniques

4-952007 會計資訊系統計學 ( 一 ) 上課投影片

Coefficient of Correlation

The advantage of the coefficient of correlation over covariance is that it has fixed range from -1 to +1, thus:

If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).

If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).

No straight line relationship is indicated by a coefficient close to zero.

Page 96: Numerical Descriptive Techniques

4-962007 會計資訊系統計學 ( 一 ) 上課投影片

COV(X,Y)=0 or r =

+1

0

-1

Strong positive linear relationship

No linear relationship

Strong negative linear relationship

or

COV(X,Y)>0

COV(X,Y)<0

Coefficient of Correlation

Page 97: Numerical Descriptive Techniques

4-972007 會計資訊系統計學 ( 一 ) 上課投影片

Coefficient of Correlation

Page 98: Numerical Descriptive Techniques

4-982007 會計資訊系統計學 ( 一 ) 上課投影片

Compute the covariance and the coefficient of correlation to measure how GMAT scores and GPA in an MBA program are related to one another.

Solution We believe GMAT affects GPA. Thus

• GMAT is labeled X• GPA is labeled Y

The coefficient of correlation and the covariance Example 4.16

Page 99: Numerical Descriptive Techniques

4-992007 會計資訊系統計學 ( 一 ) 上課投影片

1 599 9.6 358801 92.16 5750.4

2 689 8.8 474721 77.44 6063.2

3 584 7.4 341056 54.76 4321.6

4 631 10 398161 100 6310

11 593 8.8 351649 77.44 5218.4

12 683 8 466489 64 5464

Total 7,587 106.4 4,817,755 957.2 67,559.2

Student x y x2 y2 xy

………………………………………………….

n

xx

ns

n

yxyx

n

yx

i

iiii

222

1

1

1

1

),cov(

FormulasShortcut

The coefficient of correlation and the covariance Example 4.16

cov(x,y)=(1/12-1)[67,559.2-(7587)(106.4)/12]=26.16

Sx = {(1/12-1)[4,817,755-(7587)2/12)]}.5=43.56Sy = (similar to Sx ) = 1.12

r = cov(x,y)/SxSy = 26.16/(43.56)(1.12) = .5362

Page 100: Numerical Descriptive Techniques

4-1002007 會計資訊系統計學 ( 一 ) 上課投影片

Use the Covariance option in Data Analysis If your version of Excel returns the population covariance

and variances, multiply each one by n/n-1 to obtain the corresponding sample values.

Use the Correlation option to produce the correlation matrix.

The coefficient of correlation and the covariance Example 4.16 – Excel

GPA GMAT

GPA 1.15

GMAT 23.98 1739.52

GPA GMAT

GPA 1.25

GMAT 26.16 1897.66

Variance-Covariance MatrixPopulation values

Sample values

Population values

Sample values

1212-1

Page 101: Numerical Descriptive Techniques

4-1012007 會計資訊系統計學 ( 一 ) 上課投影片

Interpretation The covariance (26.16) indicates that GMAT score and

performance in the MBA program are positively related. The coefficient of correlation (.5365) indicates that there

is a moderately strong positive linear relationship between GMAT and MBA GPA.

The coefficient of correlation and the covariance Example 4.16 – Excel

Page 102: Numerical Descriptive Techniques

4-1022007 會計資訊系統計學 ( 一 ) 上課投影片

Least Squares Method (最小平方法)Recall, the slope-intercept equation for a line is expressed in these terms:

y = mx + b

Where:m is the slope of the lineb is the y-intercept.

If we’ve determined there is a linear relationship between two variables with covariance and the coefficient of correlation, can we determine a linear function of the relationship?

Page 103: Numerical Descriptive Techniques

4-1032007 會計資訊系統計學 ( 一 ) 上課投影片

The Least Squares Method

…produces a straight line drawn through the points so that the sum of squared deviations between the points and the line is minimized. This line is represented by the equation:

b0 (“b” naught) is the y-intercept, b1 is the slope, and (“y” hat) is the value of y determined by the line.

Page 104: Numerical Descriptive Techniques

4-1042007 會計資訊系統計學 ( 一 ) 上課投影片

The least Squares Method

Different lines generate different errors, thus different sum of squares of errors. There is a line that minimizes the sum of squared errors.

Errors

X

YErrors

Page 105: Numerical Descriptive Techniques

4-1052007 會計資訊系統計學 ( 一 ) 上課投影片

We are seeking a line that best fits the data when two variables are (presumably) related to one another.

We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized.

2ii

n

1i)yy(Minimize

The actual y value of point i The y value of point icalculated from the equation i10i xbby

The Least Squares Method

Page 106: Numerical Descriptive Techniques

4-1062007 會計資訊系統計學 ( 一 ) 上課投影片

The least Squares Method

The coefficients b0 and b1 of the line that minimizes the sum of squares of errors are calculated from the data.

n

x

xandn

y

ywhere

xbyb

xx

yyxx

s

yxb

n

ii

n

ii

n

ii

n

iii

x

11

10

1

2

121 ,

)(

))((),cov(

n

x

xandn

y

ywhere

xbyb

xx

yyxx

s

yxb

n

ii

n

ii

n

ii

n

iii

x

11

10

1

2

121 ,

)(

))((),cov(

Page 107: Numerical Descriptive Techniques

4-1072007 會計資訊系統計學 ( 一 ) 上課投影片

145.)25.632)(0138(.87.8

87.812

4.106

25.63212

587,7

0138.2.1897

16.26),cov(

10

21

xbybn

yy

n

xx

s

yxb

i

i

x

145.)25.632)(0138(.87.8

87.812

4.106

25.63212

587,7

0138.2.1897

16.26),cov(

10

21

xbybn

yy

n

xx

s

yxb

i

i

xScatter Diagram

y = 0.1496 + 0.0138x

6

8

10

12

500 600 700 800

Scatter Diagram

y = 0.1496 + 0.0138x

6

8

10

12

500 600 700 800

Example 4.17 Find the least squares line for Example 4.16 (Xm04-16.xls)

The Least Squares Method