
Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析


Page 1: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Chapter 2

Descriptive statistics for quantitative data

定量资料的描述性统计分析

Page 2: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Review: Types of data

Numerical data:

--- continuous

--- discrete

Categorical data:

--- nominal

--- ordinal

Page 3: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Review

Statistics:

It is a branch of applied mathematics concerned with the collection and interpretation of data, and with the evaluation of the reliability of the conclusions based on the data.

Page 4: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Types of statistical analysis

Descriptive analysis :

---Data collection

---Data interpretation

Inferential analysis :

---Evaluate the reliability of the conclusions

Page 5: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Contents

Frequency distribution ★

Central tendency ★

Dispersion (measures of variability) ★

Tables and graphs

Page 6: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

New words

• Frequency 频数

• Proportion 比例

• Percentage 百分数

• Histogram 直方图

• Polygon 折线图

• Distribution 分布

• Frequency distribution 频数分布

Page 7: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

• Cumulative frequency 累积频数

• Cumulative proportion 累积比例

• Central tendency 集中趋势

• Dispersion 离散程度

• Mean 均数

• Arithmetic mean 算术均数

• Geometric mean 几何均数

Page 8: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

• Median 中位数

• Mode 众数

• Skewness 偏度

• Kurtosis 峰度

• Descriptive analysis 描述分析

• Inferential analysis 推断分析

Page 9: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

1. Frequency distribution

Id sex age

1 m 6

2 m 8

3 f 13

4 m 16

5 f 16

6 f 15

7 f 23

8 m 19

9 f 25

10 f 21

11 m 13

12 f 19

13 f 9

14 f 10

15 f 14

Frequency ( 频数 ):

For a given variable, the number of times a value occurs is called its frequency.

Frequency table of sex

Sex   Label    Frequency
m     Male      5
f     Female   10

Page 10: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Frequency table of sex

Sex     Label    Frequency   Percentage (%)
--------------------------------------------------
m       Male      5           33.33
f       Female   10           66.67
--------------------------------------------------
Total   m+f      15          100.00

Proportion or percentage ( 比例或百分数 ): the ratio of a frequency to the total frequency; the percentage is this ratio multiplied by 100.
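The frequency table above can be reproduced in a few lines. This is a minimal sketch, assuming the 15 sex values listed under "1. Frequency distribution"; the function name `frequency_table` is illustrative and not part of the slides.

```python
from collections import Counter

# Sex of the 15 subjects, in id order, as listed in the data table.
sex = ["m", "m", "f", "m", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "f"]


def frequency_table(values):
    """Return {value: (frequency, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {v: (c, round(100 * c / total, 2)) for v, c in counts.items()}


table = frequency_table(sex)
for value, (freq, pct) in table.items():
    print(value, freq, pct)
```

Each percentage is the frequency divided by the total frequency (15) and multiplied by 100, matching the definition on the slide.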

Page 11: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Frequency distribution:

A table or a graph that lists all the distinct values of a variable together with the frequency and proportion with which each value occurs.

Freq distribution of sex

Sex   Frequency   Percentage
m      5          33.33
f     10          66.67

[Figure: bar chart "Frequency distribution of sex"; x-axis: Sex (male, female); y-axis: Frequency]

Page 12: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Method of displaying frequency distribution of categorical data

1. Nominal data

2. Ordinal data

Page 13: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Freq distribution of nominal data

Freq distribution of sex

Sex   Frequency   Percentage
m      5          33.33
f     10          66.67

[Figure: bar chart "Frequency distribution of sex"; x-axis: Sex (male, female); y-axis: Frequency]

Data:

Id   sex   eyesight   age
 1   m     1           6
 2   m     2           8
 3   f     3          13
 4   m     3          16
 5   f     4          16
 6   f     4          15
 7   f     5          23
 8   m     6          19
 9   f     6          25
10   f     6          21
11   m     7          13
12   f     7          19
13   f     8           9
14   f     9          10
15   f     9          14

Page 14: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Freq distribution of ordinal data

Freq distribution of eyesight

Eyesight   Frequency   Percentage
1-3        4           26.67
4-6        6           40.00
7-9        5           33.33

[Figure: bar chart "Frequency distribution of eyesight"; x-axis: Eyesight (1-3, 4-6, 7-9); y-axis: Frequency]

Page 15: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Method of displaying frequency distribution of numerical data

• First divide the whole range into several non-overlapping subintervals;

• count how many observations lie in each subinterval to make a frequency table;

• take the midpoint of each subinterval as the x-axis label and draw a histogram ( 直方图 ) or a polygon ( 折线图 ).

Page 16: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Freq distribution of numerical data

Freq distribution of age

Age        Midpoint   Frequency
[0-10)      5         3
[10-20)    15         9
[20-30]    25         3

[Figure: histogram "Frequency distribution of age"; x-axis: Age (midpoints 5, 15, 25); y-axis: Frequency]
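The grouping steps for numerical data can be sketched as follows. This is an illustrative sketch, assuming the 15 ages from the data table and the classes [0-10), [10-20), [20-30]; the helper name `grouped_frequency` and its parameters are hypothetical.

```python
# Ages of the 15 subjects, in id order.
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]


def grouped_frequency(values, lower, width, n_classes):
    """Count observations in equal-width, non-overlapping classes.

    Each class is closed on the left and open on the right, except the
    last class, which also includes its upper bound.
    Returns rows of (class lower bound, class upper bound, midpoint, frequency).
    """
    upper = lower + width * n_classes
    counts = [0] * n_classes
    for v in values:
        if v < lower or v > upper:
            raise ValueError(f"value {v} lies outside [{lower}, {upper}]")
        index = min(int((v - lower) // width), n_classes - 1)
        counts[index] += 1
    rows = []
    for k, count in enumerate(counts):
        lo = lower + k * width
        rows.append((lo, lo + width, lo + width / 2, count))
    return rows


age_table = grouped_frequency(ages, lower=0, width=10, n_classes=3)
for lo, hi, midpoint, freq in age_table:
    print(lo, hi, midpoint, freq)
```

The midpoints returned here are the values used as x-axis labels when the histogram or polygon is drawn.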

Page 17: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Histogram and polygon

[Figure, left panel (histogram): "Frequency distribution of age"; x-axis: Age (midpoints 5, 15, 25); y-axis: Frequency]

[Figure, right panel (polygon): "Frequency polygon for age"; x-axis: Age (0 to 30); y-axis: Frequency]

Page 18: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

[Figure panels: "Frequency distribution of sex" (nominal data); "Frequency distribution of eyesight" (ordinal data); "Frequency distribution of age" and "Frequency polygon for age" (numerical data)]

Page 19: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Cumulative frequency and cumulative proportion

Frequency table of age

Age     Midpoint   Frequency   Proportion (%)   Cumulative frequency   Cumulative proportion (%)
0-10     5         3           20.0              3                      20.0
10-20   15         9           60.0             12                      80.0
20-30   25         3           20.0             15                     100.0

Cumulative frequency ( 累计频数 ): the sum of the frequencies from the lowest category up to and including a given category.

Cumulative proportion ( 累计比例 ): the sum of the proportions from the lowest category up to and including a given category.
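Cumulative frequencies and cumulative proportions are running sums, which can be sketched as follows. The class frequencies are those of the age table; the variable names are illustrative only.

```python
from itertools import accumulate

# Class frequencies for age: 0-10, 10-20, 20-30.
frequencies = [3, 9, 3]
total = sum(frequencies)

# Running sum of the frequencies from the lowest class upward.
cumulative_frequency = list(accumulate(frequencies))

# Cumulative proportion, expressed in percent of the total frequency.
cumulative_proportion = [100 * c / total for c in cumulative_frequency]

print(cumulative_frequency)
print(cumulative_proportion)
```

The last cumulative frequency always equals the total frequency, and the last cumulative proportion is always 100%.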

Page 20: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The plot of cumulative frequency and cumulative proportion

Page 21: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Central tendency ( 集中趋势 ) and Dispersion ( 离散程度 )

The major measures of the characteristics of observations for a numerical variable

[Figure: "Frequency distribution of red blood cells"; x-axis: Red blood cells (classes 420- to 640-); y-axis: Frequency]

Page 22: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

2. Central tendency

Central tendency ( 集中趋势 ):

It describes how the values of a variable concentrate near the middle of their range.

The major measures of central tendency are: mean, median, mode.

Page 23: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The mean

The mean ( 均数 ):

It is a measure of the average level of all observations in a variable. It is defined as follows:

--------- Arithmetic mean ( 算术均数 )

population mean:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$

sample mean:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Page 24: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg1a: Estimate the mean

The data listed below are haemoglobin ( 血色素 ) contents (g/L). Estimate the mean.

Data: n=12

id   x      id   x
1    121     7   116
2    118     8   124
3    130     9   127
4    120    10   129
5    122    11   125
6    118    12   132

Solution:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i = (121+118+\dots+125+132)/12 = 123.5$$

So, the estimated mean of the haemoglobin is 123.5 g/L.
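The calculation in Eg1a can be checked with a short script. This is a minimal sketch using the 12 haemoglobin values from the data table; the variable names are illustrative.

```python
from statistics import mean

# Haemoglobin content (g/L) of the 12 subjects, in id order.
haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]

# Sample mean: the sum of the observations divided by their number.
sample_mean = sum(haemoglobin) / len(haemoglobin)

# The standard library gives the same value.
library_mean = mean(haemoglobin)

print(sample_mean, library_mean)
```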

Page 25: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Another formula for mean

Data:

x      freq
x1     f1
x2     f2
……     ……
xk     fk
Total  n

If x has k different values, and fi is the frequency with which the i-th value xi occurs in the sample, then the sample mean can be estimated as follows:

Formula:

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i$$

Page 26: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg1b: Estimate the mean

The following data are serum cholesterol ( 血清胆固醇 ) measurements from 101 men aged 30-49. Estimate the mean.

Data:

Serum cholesterol   Midpoint   Freq.
2.5 ~               3.0         9
3.5 ~               4.0        32
4.5 ~               5.0        42
5.5 ~               6.0        15
6.5 ~               7.0         3
Total                          101

Solution:

n=101,

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = (3\times 9+4\times 32+5\times 42+6\times 15+7\times 3)/101 = 4.71 \text{ (mmol/L)}$$
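The frequency-weighted mean of Eg1b can be sketched in code. The class midpoints and frequencies come from the serum cholesterol table; the function name `grouped_mean` is illustrative.

```python
# Class midpoints (mmol/L) and frequencies from the serum cholesterol table.
midpoints = [3.0, 4.0, 5.0, 6.0, 7.0]
frequencies = [9, 32, 42, 15, 3]


def grouped_mean(values, freqs):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i)."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / n


cholesterol_mean = grouped_mean(midpoints, frequencies)
print(round(cholesterol_mean, 2))
```

Because every observation in a class is replaced by the class midpoint, the result is an approximation to the mean of the raw measurements.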

Page 27: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The median

The median ( 中位数 ): It is the middle value among the ordered values of all observations in a variable. It is defined as below:

population median: $M = X_{(N+1)/2}$

sample median: $m = x_{(n+1)/2}$

in which $X_1, X_2, \ldots, X_N$ are the ordered values in the population and $x_1, x_2, \ldots, x_n$ are the ordered values in the sample.

Page 28: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The method of estimating the median:

1) Order all values of the observations in a variable from smallest to largest;

2) If n is odd, find the single middle observation; this value is the required median;

3) If n is even, find the two middle observations; the average of these two values is the required median.

eg, if n=9, then m = x((9+1)/2) = x(5) = x5

if n=10, then m = x((10+1)/2) = x(5.5) = (x5+x6)/2
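The three steps above can be sketched as a small function. It is applied here to the haemoglobin values of Eg1a as an illustration; the function name `sample_median` is hypothetical.

```python
def sample_median(values):
    """Median by the ordering rule: middle value (n odd) or mean of the two middle values (n even)."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)           # step 1: order the observations
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                     # step 2: n odd, single middle value
        return ordered[middle]
    # step 3: n even, average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]
haemoglobin_median = sample_median(haemoglobin)
print(haemoglobin_median)
```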

Page 29: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg2a: Estimate the median

The data listed below are haemoglobin contents (g/L). Estimate the median.

Data:

id   x      id   x
1    121     7   116
2    118     8   124
3    130     9   127
4    120    10   129
5    122    11   125
6    118    12   132

Solution:

The ordered values are:

116, 118, 118, 120, 121, 122, 124, 125, 127, 129, 130, 132.

n=12 is even, therefore

med = (122+124)/2 = 123

So, the median of the haemoglobin is 123 g/L.

Page 30: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg2b: Estimate the median

The following data are serum cholesterol measurements (mmol/L) from 101 men aged 30-49. Estimate the median.

Data:

Serum cholesterol   Midpoint   Freq.
2.5 ~               3.0         9
3.5 ~               4.0        32
4.5 ~               5.0        42
5.5 ~               6.0        15
6.5 ~               7.0         3

Solution:

Since n=101 is an odd number, the median is the single middle value, that is, the 51st ordered value. The first two classes hold 9+32=41 values, so the 51st value lies in the class 4.5 ~, whose midpoint is 5.0, ie, the median M=5.0. A more accurate value of M, interpolating within that class (frequency 42, with 51-41=10), is 4.5+(5.5-4.5)/42×10=4.74.
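The interpolation used for the "more accurate" median can be sketched as below. This follows the slide's choice of the rank (n+1)/2 = 51; other textbooks interpolate at n/2, so the function `grouped_median` should be read as one possible convention rather than the only one.

```python
# Lower class bounds (mmol/L), common class width, and class frequencies.
lower_bounds = [2.5, 3.5, 4.5, 5.5, 6.5]
class_width = 1.0
frequencies = [9, 32, 42, 15, 3]


def grouped_median(bounds, width, freqs):
    """Linear interpolation of the median inside the class that contains it."""
    n = sum(freqs)
    rank = (n + 1) / 2                 # rank of the middle value (51 when n = 101)
    below = 0                          # cumulative frequency of the classes already passed
    for lower, f in zip(bounds, freqs):
        if below + f >= rank:
            # Move (rank - below) observations into a class holding f of them.
            return lower + width * (rank - below) / f
        below += f
    raise ValueError("frequencies must be positive")


cholesterol_median = grouped_median(lower_bounds, class_width, frequencies)
print(round(cholesterol_median, 2))
```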

Page 31: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Frequency distribution about mean and median

[Figure: "Central tendency of serum cholesterol"; x-axis: Serum cholesterol (3 to 7); y-axis: Frequency; marked values: Mean=4.71, Median=5.0]

Page 32: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Skewed distribution

[Figure, first panel: positive or right skewed distribution; the median lies to the left of the mean]

[Figure, second panel: negative or left skewed distribution; the mean lies to the left of the median]

Page 33: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Comparing mean and median

                     mean                             median
information          more (actual values)             less (ranks)
data available       not available for ordinal data   available for ordinal and numerical data
size in magnitude    symmetric:  Mean=median
                     + skewed:   Mean>median
                     - skewed:   Mean<median

Page 34: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The definition of median

The median is a value for which no more than half the data are smaller than it and no more than half the data are larger than it.

eg, 12, 14, 14, 15, 16, 16, 16, 17, 18.

M=16, for which four values < M and two values > M.

Page 35: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The geometric mean

When the distribution of a variable is not symmetric, or the data have no upper or lower bound, the geometric mean is often a better measure of central tendency than the arithmetic mean.

Page 36: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg3. The following data are 10 patients’ white blood cell counts(×1000): 11, 9, 35, 5, 9, 8, 3, 10, 12, 8. Estimate the arithmetic mean and geometric mean.
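Eg3 can be worked through with a short script. The slides do not show the formula, so this sketch assumes the standard definition of the geometric mean, the n-th root of the product of n positive values, computed here as the exponential of the mean logarithm; the function names are illustrative.

```python
import math

# White blood cell counts (x1000) of the 10 patients in Eg3.
wbc = [11, 9, 35, 5, 9, 8, 3, 10, 12, 8]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed on the log scale (values must be positive)."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean requires positive values")
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))


wbc_arithmetic = arithmetic_mean(wbc)
wbc_geometric = geometric_mean(wbc)
print(wbc_arithmetic, round(wbc_geometric, 2))
```

The single large count (35) pulls the arithmetic mean upward, while the geometric mean stays closer to the bulk of the observations.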

Page 37: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The mode

The mode ( 众数 ):

It is defined as the most frequently occurring value (or values) in a set of data.

• It marks a point of relatively great concentration.

• If a data set consists of the values: 6,7,7,8,8,8,8,9,10,11,11,12,12,12,12,13

then the modes are 8 and 12.
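Finding the mode amounts to counting each value and keeping those with the highest count. A minimal sketch on the slide's example data, with an illustrative function name:

```python
from collections import Counter

data = [6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13]


def modes(values):
    """Return every value that occurs with the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


data_modes = modes(data)
print(data_modes)
```

Returning a list rather than a single number handles data such as this example, where two values share the highest frequency.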

Page 38: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Summary

• Frequency distribution

• Histogram & polygon

• Measures of central tendency

• Measures of dispersion

Page 39: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Note: When the widths of the subintervals are not equal, or the data have no upper or lower bound, a polygon is more suitable than a histogram.

[Figure: "Frequency distribution of birthweight"; x-axis: weight (ounces), 0 to 180; y-axis: frequency]

Page 40: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

New words

• Dispersion 离散程度

• Range 全距

• Deviation 离均差

• Variance 方差

• Standard deviation 标准差

• Coefficient of variation 变异系数

Page 41: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

New words

• Quartile 四分位数

• Percentile 百分位数

• Inter-quartile interval 四分位间距

Page 42: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

§3. Dispersion

Dispersion ( 离散程度 ):

The indication of the spread of measurements around the center of a variable's distribution.

The major measures of dispersion are: range, variance, standard deviation, inter-quartile range, coefficient of variation, etc.

Page 43: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The range

The range ( 全距 ):

It measures the length of the interval over which the data are spread.

Population range: Range = max - min

Sample range: Range = max - min

* It is a simple measure, and it has the same unit as the original data.

# It uses little information (only max & min);

# The sample range underestimates the population range, so it is biased and inefficient;

# It conveys no information about the middle of the distribution.

Page 44: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The quartiles

The first-quartile ( 第一四分位数 ) Q1:

It is a value, for which no more than 25% of observed values are less than it, and no more than 75% of observed values are greater than it.

[Diagram: number line from X1 to Xn; ≤ 25% of the values lie below Q1 and ≤ 75% lie above it]

Page 45: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The second-quartile ( 第二四分位数 ) Q2=M:

It is a value, for which no more than 50% of observed values are less than it, and no more than 50% of observed values are greater than it.

[Diagram: number line from X1 to Xn with M marked; ≤ 50% of the values lie below M and ≤ 50% lie above it]

Page 46: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The third-quartile ( 第三四分位数 ) Q3:

It is a value, for which no more than 75% of observed values are less than it, and no more than 25% of observed values are greater than it.

[Diagram: number line from X1 to Xn; ≤ 75% of the values lie below Q3 and ≤ 25% lie above it]

Page 47: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

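Location of quartiles

[Diagram: number line from X1 to Xn with Q1, Q2=M and Q3 marked; ≤ 50% of the values lie on each side of M, and ≤ 25% lie in each of the four segments between X1, Q1, Q2, Q3 and Xn]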

Page 48: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The method of estimating the quartiles

If the subscript is not an integer or half-integer, then it is rounded up to the nearest integer or half-integer.

Page 49: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg1: Estimate the quartiles

Data:

A     B
34    34
36    36
37    37
39    39
40    40
41    41
42    42
43    43
79    44
      45
--------------
n=9   n=10

Page 50: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The inter-quartile range ( 四分位数间距 ):

It is the difference between Q3 and Q1: Q3-Q1.

[Diagram: number line from X1 to Xn with Q1, M and Q3 marked; Q1 and Q3 enclose the middle 50% of the values]

Page 51: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg2: Estimate the interquartile range

Data:

A     B
34    34
36    36
37    37
39    39
40    40
41    41
42    42
43    43
79    44
      45
--------------
n=9   n=10

Interquartile range of A=42.5-36.5=6.0

Interquartile range of B=43.5-37.0=6.5
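The quartile values quoted in Eg2 can be reproduced with the sketch below. The subscript formulas are not shown in this copy of the slides, so the code assumes Q1 = x((n+1)/4) and Q3 = x(3(n+1)/4), with the subscript rounded up to the nearest integer or half-integer and a half-integer subscript meaning the average of its two neighbours. That assumption matches the numbers in Eg2, but other conventions exist and give different results.

```python
import math

data_a = [34, 36, 37, 39, 40, 41, 42, 43, 79]
data_b = [34, 36, 37, 39, 40, 41, 42, 43, 44, 45]


def value_at(ordered, position):
    """Value at a 1-based integer or half-integer position in ordered data."""
    lower = math.floor(position)
    upper = math.ceil(position)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) under the assumed (n+1)k/4 subscript rule."""
    ordered = sorted(values)
    n = len(ordered)
    # Round the subscript up to the nearest integer or half-integer.
    position = math.ceil(2 * k * (n + 1) / 4) / 2
    position = min(max(position, 1), n)   # keep the subscript inside the data
    return value_at(ordered, position)


def interquartile_range(values):
    return quartile(values, 3) - quartile(values, 1)


iqr_a = interquartile_range(data_a)
iqr_b = interquartile_range(data_b)
print(iqr_a, iqr_b)
```

Note that the extreme value 79 in data set A has no effect on its interquartile range, whereas it would dominate the range.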

Page 52: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The percentiles

The αth percentile (α 百分位数 ) Pα:

It is a value for which no more than α% of the data are less than it, and no more than (100-α)% are larger than it, where 0 ≤ α ≤ 100.

• P0=min, P100=max

• P25=Q1, P50=Q2=M, P75=Q3.

Page 53: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The method of estimating the percentiles

If the subscript is not an integer or half-integer, then it is rounded up to the nearest integer or half-integer.

Page 54: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg3: Estimate the percentiles

Data:

A     B
34    34
36    36
37    37
39    39
40    40
41    41
42    42
43    43
79    44
      45
--------------
n=9   n=10

For data A:

P0=34, P10=34, P20=36, P30=37, …, P90=79, P100=79.

For data B:

P0=34, P10=34, P20=36, P30=37, …, P90=44, P100=45.

Note: there are many ways to estimate percentiles, so the results are not unique.

Page 55: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

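The variance

The variance (Var, 方差 ):

It measures the average dispersion of the data about the mean.

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i-\mu)^2$$

Sample variance:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$$

Note: the degrees of freedom are not the same: N and n-1.

* It conveys information about the spread around the middle of the distribution.

* S² is an unbiased estimate of σ²; both are non-negative values;

# The unit is not the same as that of the original data (it is the squared unit).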

Page 56: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

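Simplified formulas of variance

Population variance:

$$\sigma^2 = \frac{1}{N}\left[\sum_{i=1}^{N}X_i^2-\frac{\left(\sum_{i=1}^{N}X_i\right)^2}{N}\right]$$

Sample variance:

$$S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}X_i^2-\frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}\right]$$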

Page 57: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Proof of the simplified formula
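One standard way to obtain the simplified sample formula, sketched here with the symbols $X_i$, $\bar{X}$ and $n$ used for the sample mean, is to expand the square and use $\sum_{i=1}^{n} X_i = n\bar{X}$:

$$
\begin{aligned}
\sum_{i=1}^{n}(X_i-\bar{X})^2
&= \sum_{i=1}^{n}X_i^2 - 2\bar{X}\sum_{i=1}^{n}X_i + n\bar{X}^2 \\
&= \sum_{i=1}^{n}X_i^2 - 2n\bar{X}^2 + n\bar{X}^2 \\
&= \sum_{i=1}^{n}X_i^2 - n\bar{X}^2 \\
&= \sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}
\end{aligned}
$$

Dividing both sides by $n-1$ gives the simplified sample variance; the same steps with $\mu$, $N$ and division by $N$ give the population version.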

Page 58: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg4a: Estimate the variance

Data:

id   x    x*x
1    1     1
2    2     4
3    3     9
4    4    16
5    5    25
-------------------
∑    15   55

Solution:

S² = (∑x² - (∑x)²/n)/(n-1) = (55 - 15²/5)/(5-1) = 2.5
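The sums in the Eg4a table are all that the simplified formula needs. A minimal sketch, with an illustrative function name, that also cross-checks the result against the standard library:

```python
from statistics import variance

x = [1, 2, 3, 4, 5]


def sample_variance(values):
    """Sample variance from the two column sums: (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    sum_x = sum(values)                      # 15 for the Eg4a data
    sum_x2 = sum(v * v for v in values)      # 55 for the Eg4a data
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


s2 = sample_variance(x)
print(s2, variance(x))
```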

Page 59: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

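Another formula for variance

Data:

x      freq
x1     f1
x2     f2
……     ……
xk     fk
Total  n

$$S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{k}f_i x_i^2-\frac{\left(\sum_{i=1}^{k}f_i x_i\right)^2}{n}\right], \qquad n=\sum_{i=1}^{k}f_i$$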

Page 60: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg4b: Estimate the variance

Data:

id   x    f    f*x   f*x*x
1    1    3     3     3
2    2    3     6    12
3    3    2     6    18
4    4    1     4    16
5    5    2    10    50
-----------------------------
∑    15   11   29    99

Solution:

S² = (∑f·x² - (∑f·x)²/n)/(n-1) = (99 - 29²/11)/(11-1) ≈ 2.25
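The frequency-weighted variance for Eg4b can be sketched as follows. The formula used is the grouped analogue of the simplified sample variance, with n equal to the sum of the frequencies; the function name is illustrative, and the result is cross-checked by expanding the table back into raw observations.

```python
from statistics import variance

values = [1, 2, 3, 4, 5]
freqs = [3, 3, 2, 1, 2]


def grouped_variance(xs, fs):
    """Sample variance from a frequency table: (sum f*x^2 - (sum f*x)^2 / n) / (n - 1)."""
    n = sum(fs)                                         # 11 for the Eg4b data
    sum_fx = sum(f * x for x, f in zip(xs, fs))         # 29
    sum_fx2 = sum(f * x * x for x, f in zip(xs, fs))    # 99
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


s2_grouped = grouped_variance(values, freqs)

# Cross-check: repeat each value f times and use the ordinary sample variance.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]
s2_expanded = variance(expanded)

print(round(s2_grouped, 4), round(s2_expanded, 4))
```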

Page 61: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

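The standard deviation

The standard deviation (sd, SD, 标准差 ):

It measures the average dispersion of the data about the mean, as the square root of the variance.

Population sd: $\sigma = \sqrt{\sigma^2}$

Sample sd: $S = \sqrt{S^2}$

* It conveys information about the spread around the mean of the distribution.

* S is the usual estimate of σ (S² is unbiased for σ², although S itself is slightly biased for σ); both are non-negative values;

* The unit is the same as that of the original data.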

Page 62: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg5: Estimate the SD

Data:

id   x    x*x
1    1     1
2    2     4
3    3     9
4    4    16
5    5    25
-------------------
∑    15   55

Solution:

S = √[(55 - 15²/5)/(5-1)] = √2.5 ≈ 1.58

Page 63: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

The coefficient of variation

The coefficient of variation (cv, CV, 变异系数 ): It measures the variation relative to the mean.

Population cv: $CV = \dfrac{\sigma}{\mu}\times 100\%$

Sample cv: $CV = \dfrac{S}{\bar{X}}\times 100\%$

* It measures relative variability or relative dispersion.

* Its value does not depend on the unit of the variable, unlike the variance or the standard deviation, which carry units.

* It can be used to compare the variation of variables measured in different units.

Page 64: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Eg6: Estimate the CV

Data: age

id   x
1    1
2    2
3    3
4    4
5    5

sum: 15   mean: 3   var: 2.5   sd: 1.58   cv: 52.70

Data: weight

id   y
1    11
2    12
3    13
4    14
5    15

sum: 65   mean: 13   var: 2.5   sd: 1.58   cv: 12.16

Data: weight

id   y
1    110
2    120
3    130
4    140
5    150

sum: 650   mean: 130   var: 250   sd: 15.8   cv: 12.16

Coding effects:

(1) + -: S is unchanged;

(2) × ÷: CV is unchanged.
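The coding effects can be checked numerically on the three data sets of Eg6. A minimal sketch, assuming CV is the sample standard deviation divided by the sample mean, in percent; the function name `cv_percent` is illustrative.

```python
from statistics import mean, stdev

age = [1, 2, 3, 4, 5]
weight = [11, 12, 13, 14, 15]              # age + 10: a shift
weight_x10 = [110, 120, 130, 140, 150]     # weight * 10: a change of scale


def cv_percent(values):
    """Coefficient of variation in percent: 100 * sample sd / sample mean."""
    return 100 * stdev(values) / mean(values)


for name, data in [("age", age), ("weight", weight), ("weight x10", weight_x10)]:
    print(name, round(stdev(data), 2), round(cv_percent(data), 2))
```

Adding a constant leaves the standard deviation at about 1.58 but changes the CV, while multiplying by a constant changes the standard deviation and leaves the CV unchanged.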

Page 65: Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析

Summary

1. Measures of central tendency:

mean, median, mode.

2. Measures of dispersion:

variance, standard deviation, range, inter-quartile range, CV.