Chapter 2 Descriptive statistics for quantitative data 定量资料的描述性统计分析


Review: Types of data

Numerical data:
--- continuous
--- discrete

Categorical data:
--- nominal
--- ordinal

Review: Statistics

Statistics is a branch of applied mathematics that deals with the collection and interpretation of data, and with the evaluation of the reliability of the conclusions based on the data.

Types of statistical analysis

Descriptive analysis:
--- Data collection
--- Data interpretation

Inferential analysis:
--- Evaluate the reliability of the conclusions

Contents

• Frequency distribution ★
• Central tendency ★
• Dispersion (measures of variability) ★
• Tables and graphs

New words

• Frequency 频数
• Proportion 比例
• Percentage 百分数
• Histogram 直方图
• Polygon 折线图
• Distribution 分布
• Frequency distribution 频数分布
• Cumulative frequency 累积频数
• Cumulative proportion 累积比例
• Central tendency 集中趋势
• Dispersion 离散程度
• Mean 均数
• Arithmetic mean 算术均数
• Geometric mean 几何均数
• Median 中位数
• Mode 众数
• Skewness 偏度
• Kurtosis 峰度
• Descriptive analysis 描述分析
• Inferential analysis 推断分析

1. Frequency distribution

Id sex age

1 m 6

2 m 8

3 f 13

4 m 16

5 f 16

6 f 15

7 f 23

8 m 19

9 f 25

10 f 21

11 m 13

12 f 19

13 f 9

14 f 10

15 f 14

Frequency (频数): For a given variable, the number of times a value occurs is called its frequency.

Frequency table of sex

Sex     Label     Frequency   Proportion (%)
--------------------------------------------------
m       Male      5           33.33
f       Female    10          66.67
--------------------------------------------------
Total   m+f       15          100.00

Proportion or percentage (比例或百分数): the ratio of a frequency to the total frequency; multiplied by 100 it is expressed as a percentage (e.g. 5/15 = 33.33%).
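The frequency table of sex can be reproduced with a few lines of Python. This is only a sketch: the list `sex` is typed in from the data table above, and the variable names are illustrative.

```python
from collections import Counter

# Sex of the 15 subjects in Id order, copied from the data table above.
sex = ["m", "m", "f", "m", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "f"]

freq = Counter(sex)                  # value -> frequency
total = sum(freq.values())           # total frequency, n = 15

# Percentage = frequency / total frequency * 100, rounded as in the table.
percent = {value: round(100 * count / total, 2) for value, count in freq.items()}

for value in ("m", "f"):
    print(f"{value}: frequency={freq[value]}, percentage={percent[value]}")
```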

[Figure: Frequency distribution of sex. Bar chart; x-axis: Sex (male, female); y-axis: Frequency (0 to 15).]

Frequency distribution: a table or a graph that lists all the distinct values of a variable together with the frequency and proportion with which each value occurs.


Methods of displaying the frequency distribution of categorical data

1. Nominal data
2. Ordinal data


Frequency distribution of nominal data: sex

Sex    Frequency    Percentage
m      5            33.33
f      10           66.67

[Figure: bar chart of the frequency distribution of sex; x-axis: Sex (male, female); y-axis: Frequency.]

Id   sex   eyesight   age
1    m     1          6
2    m     2          8
3    f     3          13
4    m     3          16
5    f     4          16
6    f     4          15
7    f     5          23
8    m     6          19
9    f     6          25
10   f     6          21
11   m     7          13
12   f     7          19
13   f     8          9
14   f     9          10
15   f     9          14

Freq distribution of ordinal data



Eyesight    Frequency    Percentage
1-3         4            26.67
4-6         6            40.00
7-9         5            33.33

[Figure: Frequency distribution of eyesight. Bar chart; x-axis: Eyesight (1-3, 4-6, 7-9); y-axis: Frequency (0 to 8).]

Method of displaying the frequency distribution of numerical data

• First, divide the whole range into several non-overlapping subintervals.
• Count how many observations lie in each subinterval to make a frequency table.
• Take the midpoint of each subinterval as the x-axis label, and draw a histogram (直方图) or a polygon (折线图).
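The first two steps can be sketched in Python as follows. The ages are copied from the data table in this chapter; the helper name `bin_counts` and the choice of edges 0, 10, 20, 30 (with the last subinterval closed on the right) are assumptions made for illustration.

```python
# Ages of the 15 subjects in Id order, copied from the data table.
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]

# Subinterval edges: [0, 10), [10, 20), [20, 30] (last one closed on the right).
edges = [0, 10, 20, 30]

def bin_counts(values, edges):
    """Return (midpoint, frequency) pairs for non-overlapping subintervals."""
    rows = []
    last = len(edges) - 2
    for k in range(len(edges) - 1):
        lo, hi = edges[k], edges[k + 1]
        if k == last:
            count = sum(1 for v in values if lo <= v <= hi)  # closed last bin
        else:
            count = sum(1 for v in values if lo <= v < hi)   # half-open bin
        rows.append(((lo + hi) / 2, count))
    return rows

table = bin_counts(ages, edges)
for midpoint, count in table:
    print(midpoint, count)
```

The midpoints returned here are the values that would label the x-axis of the histogram or polygon.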

Frequency distribution of age

Age       Interval    Midpoint    Frequency
0～       [0, 10)     5           3
10～      [10, 20)    15          9
20～30    [20, 30]    25          3

[Figure: Frequency distribution of age. Histogram; x-axis: Age (midpoints 5, 15, 25); y-axis: Frequency (0 to 10).]


Frequency distribution of numerical data: histogram and polygon

[Figure: histogram of age (x-axis: Age, midpoints 5, 15, 25; y-axis: Frequency) and frequency polygon for age (x-axis: Age, 0 to 30; y-axis: Frequency).]


[Figures: frequency distributions by type of data. Nominal data: bar chart of sex. Ordinal data: bar chart of eyesight (1-3, 4-6, 7-9). Numerical data: histogram of age and frequency polygon for age.]

Cumulative frequency and cumulative proportion

Frequency table of age

Age      Midpoint   Frequency   Proportion (%)   Cumulative frequency   Cumulative proportion (%)
0-10     5          3           20.0             3                      20.0
10-20    15         9           60.0             12                     80.0
20-30    25         3           20.0             15                     100.0

Cumulative frequency (累计频数): the sum of the frequencies from the lowest category up to and including a given category.

Cumulative proportion (累计比例): the sum of the proportions from the lowest category up to and including a given category.
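As a small illustration of these two definitions, the cumulative columns of the age table can be computed as running totals. The frequencies 3, 9, 3 are taken from the table; the variable names are illustrative.

```python
from itertools import accumulate

labels = ["0-10", "10-20", "20-30"]
freq = [3, 9, 3]          # frequencies from the age table
n = sum(freq)             # total frequency, 15

# Running total of the frequencies, from the lowest category upwards.
cum_freq = list(accumulate(freq))

# The same running total expressed as a percentage of n.
cum_prop = [round(100 * c / n, 1) for c in cum_freq]

for label, cf, cp in zip(labels, cum_freq, cum_prop):
    print(label, cf, cp)
```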

[Figure: plot of cumulative frequency and cumulative proportion.]

The major measures of the characteristics of the observations of a numerical variable:

• Central tendency (集中趋势)
• Dispersion (离散程度)

[Figure: Frequency distribution of red blood cells. Histogram; x-axis: Red blood cells (classes 420- to 640- in steps of 20); y-axis: Frequency (0 to 30).]

2. Central tendency

Central tendency (集中趋势): the description of the concentration of values near the middle of the range of all values of a variable.

The major measures of central tendency are the mean, the median, and the mode.

The mean

The mean (均数), or arithmetic mean (算术均数): a measure of the average level of all observations of a variable. It is defined as follows.

Population mean:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$

Sample mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Eg1a: Estimate the mean

The data listed below are haemoglobin (血色素) contents (g/L). Estimate the mean.

Data (n = 12):

id   x      id   x
1    121    7    116
2    118    8    124
3    130    9    127
4    120    10   129
5    122    11   125
6    118    12   132

Solution:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{121 + 118 + \dots + 125 + 132}{12} = 123.5$$

So the estimated mean haemoglobin content is 123.5 g/L.
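The same calculation can be checked in Python. This is a minimal sketch; the list is typed in from the haemoglobin data of Eg1a.

```python
from statistics import mean

# Haemoglobin contents (g/L) for ids 1..12, copied from Eg1a.
haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]

# Sample mean: sum of the observations divided by n.
x_bar = sum(haemoglobin) / len(haemoglobin)

print(x_bar)              # 123.5
print(mean(haemoglobin))  # the standard library gives the same value
```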

Another formula for the mean

Data:

x       freq
x1      f1
x2      f2
……      ……
xk      fk
Total   n

If x takes k different values, and f_i is the frequency with which the i-th value x_i occurs in the sample, then the sample mean can be estimated as follows.

Formula:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i$$

Eg1b: Estimate the mean

The following data are serum cholesterol (血清胆固醇) measurements from 101 men aged 30-49. Estimate the mean.

Data:

Serum cholesterol   Midpoint   Freq.
2.5～               3.0        9
3.5～               4.0        32
4.5～               5.0        42
5.5～               6.0        15
6.5～               7.0        3
Total                          101

Solution: n = 101, so

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{3 \times 9 + 4 \times 32 + 5 \times 42 + 6 \times 15 + 7 \times 3}{101} = 4.71 \text{ (mmol/L)}$$
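The frequency-weighted mean of Eg1b can be sketched in Python as below. Midpoints and frequencies are copied from the serum cholesterol table; the variable names are illustrative.

```python
midpoints = [3.0, 4.0, 5.0, 6.0, 7.0]   # class midpoints (mmol/L)
freqs = [9, 32, 42, 15, 3]              # class frequencies

n = sum(freqs)                                               # total frequency
weighted_sum = sum(f * x for f, x in zip(freqs, midpoints))  # sum of f_i * x_i
grouped_mean = weighted_sum / n

print(n, weighted_sum, round(grouped_mean, 2))
```

Because each observation is replaced by its class midpoint, the result is an approximation of the mean of the raw measurements.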

The median

The median (中位数): the middle value among the ordered values of all observations of a variable. It is defined as follows.

Population median:

$$M = X_{\left(\frac{N+1}{2}\right)}$$

Sample median:

$$m = x_{\left(\frac{n+1}{2}\right)}$$

where $X_1, X_2, \dots, X_N$ are the ordered values in the population and $x_1, x_2, \dots, x_n$ are the ordered values in the sample.

E.g., if n = 9, then m = x((9+1)/2) = x(5) = x5;
if n = 10, then m = x((10+1)/2) = x(5.5) = (x5 + x6)/2.

The method of estimating the median:

1) Order all observed values of the variable from smallest to largest;
2) If n is odd, find the middle observation; its value is the required median;
3) If n is even, find the middle two observations; the average of these two values is the required median.

Eg2a: Estimate the median

The data listed below are haemoglobin contents (g/L). Estimate the median.

Data:

id   x      id   x
1    121    7    116
2    118    8    124
3    130    9    127
4    120    10   129
5    122    11   125
6    118    12   132

Solution:

The ordered values are:
116, 118, 118, 120, 121, 122, 124, 125, 127, 129, 130, 132.

n = 12 is even, therefore

med = (122 + 124)/2 = 123

So the median haemoglobin content is 123 g/L.
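The three-step method for the median can be written as a short function. This is a sketch; `sample_median` is a name chosen here for illustration, and the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

def sample_median(values):
    """Median by the three steps: order, then middle one (odd n) or mean of middle two (even n)."""
    ordered = sorted(values)          # step 1: order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd n, single middle observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: even n, average of middle two

# Haemoglobin contents (g/L) from Eg2a.
haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]

m = sample_median(haemoglobin)
print(m)                    # 123.0
print(median(haemoglobin))  # same value from the standard library
```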

Eg2b: Estimate the median

The following data are serum cholesterol (mmol/L) measurements from 101 men aged 30-49. Estimate the median.

Data:

Serum cholesterol   Midpoint   Freq.
2.5～               3.0        9
3.5～               4.0        32
4.5～               5.0        42
5.5～               6.0        15
6.5～               7.0        3

Solution:

Since n = 101 is odd, the median is the middle value, i.e. the 51st ordered value. The first two classes contain 9 + 32 = 41 observations, so the 51st value falls in the class 4.5～, whose midpoint is 5.0; that is, the median M = 5.0. A more accurate value, obtained by interpolating to the 10th of the 42 observations in that class, is

M = 4.5 + (5.5 - 4.5) / 42 × 10 = 4.74
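The interpolation in Eg2b can be sketched as follows. The class lower bounds, the common width of 1.0 and the use of rank (n + 1)/2 for odd n follow the worked solution; the variable names are illustrative, and other textbooks use slightly different interpolation rules.

```python
lower_bounds = [2.5, 3.5, 4.5, 5.5, 6.5]   # lower bound of each class (mmol/L)
freqs = [9, 32, 42, 15, 3]                 # class frequencies
width = 1.0                                # common class width

n = sum(freqs)
rank = (n + 1) // 2        # rank of the middle value for odd n (51 here)

cum = 0                    # cumulative frequency below the current class
for lower, f in zip(lower_bounds, freqs):
    if cum + f >= rank:    # the middle value lies in this class
        grouped_median = lower + width * (rank - cum) / f
        break
    cum += f

print(rank, cum, round(grouped_median, 2))
```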

Frequency distribution about mean and median

[Figure: Central tendency of serum cholesterol. Frequency distribution with the mean (4.71) and the median (5.0) marked; x-axis: Serum cholesterol (3 to 7); y-axis: Frequency (0 to 60).]

Skewed distribution

[Figure: two frequency curves (x-axis: 0 to 140; y-axis: frequency, 0 to 100). First panel: positive or right skewed, with the median to the left of the mean. Second panel: negative or left skewed, with the mean to the left of the median.]

Comparing mean and median

                     mean                             median
information          more (actual values)             less (ranks)
data available       not available for ordinal data   available for ordinal and numerical data

size in magnitude
  symmetric          Mean = median
  + skewed           Mean > median
  - skewed           Mean < median

The definition of median

The median is a value for which no more than half the data are smaller than it and no more than half the data are larger than it.

E.g., 12, 14, 14, 15, 16, 16, 16, 17, 18. M = 16, for which four values are smaller than M and two values are larger than M.

The geometric mean

The geometric mean (几何均数): when the distribution of a variable is not symmetric, or the data have no upper or lower bound, the geometric mean is often a better measure of central tendency than the arithmetic mean.

Eg3. The following data are the white blood cell counts (×1000) of 10 patients: 11, 9, 35, 5, 9, 8, 3, 10, 12, 8. Estimate the arithmetic mean and the geometric mean.
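A possible way to work Eg3 in Python is sketched below. It assumes the usual definition of the geometric mean as the n-th root of the product of the values, computed here as the exponential of the mean of the logarithms; the slides do not show the solution, so the code is an illustration rather than the author's worked answer.

```python
import math

# White blood cell counts (x1000) of the 10 patients in Eg3.
wbc = [11, 9, 35, 5, 9, 8, 3, 10, 12, 8]
n = len(wbc)

arithmetic_mean = sum(wbc) / n

# Geometric mean = (x1 * x2 * ... * xn) ** (1/n), computed via logarithms.
geometric_mean = math.exp(sum(math.log(x) for x in wbc) / n)

print(arithmetic_mean, round(geometric_mean, 2))
```

The single large value 35 pulls the arithmetic mean upwards more than it pulls the geometric mean, which is why the geometric mean comes out smaller for these right-skewed data.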

The mode

The mode (众数): the most frequently occurring value (or values) in a set of data.

• It marks a point of relatively great concentration.
• If a data set consists of the values 6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13, then the modes are 8 and 12.
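The example with two modes can be checked with the standard library function `statistics.multimode` (available in Python 3.8 and later), which returns every value that attains the highest frequency.

```python
from statistics import multimode

data = [6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13]

# All values tied for the highest frequency, in order of first appearance.
modes = multimode(data)
print(modes)  # [8, 12]
```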

Summary

• Frequency distribution

• Histogram & polygon

• Measures of central tendency

• Measures of dispersion

Note: When the widths of the subintervals are not equal, or the data have no upper or lower bound, a polygon is more suitable than a histogram.

[Figure: Frequency distribution of birthweight; x-axis: Weight (ounces), 0 to 180; y-axis: Frequency, 0 to 25.]

New words

• Dispersion 离散程度
• Range 全距
• Deviation 离均差
• Variance 方差
• Standard deviation 标准差
• Coefficient of variation 变异系数
• Quartile 四分位数
• Percentile 百分位数
• Inter-quartile interval 四分位间距

3. Dispersion

Dispersion (离散程度): an indication of the spread of measurements around the center of the distribution of a variable.

The major measures of dispersion are: range, variance, standard deviation, inter-quartile range, coefficient of variation, etc.

The range

The range (全距): it measures the length of the interval over which the data are distributed.

Population range: Range = max - min
Sample range: Range = max - min

* It is a simple measure, and it has the same unit as the original data.
# It uses little information (only the max and the min).
# The sample range underestimates the population range: it is biased and inefficient.
# It conveys no information about the middle of the distribution.

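The quartiles

The first quartile (第一四分位数) Q1: a value for which no more than 25% of the observed values are less than it, and no more than 75% of the observed values are greater than it.

The second quartile (第二四分位数) Q2 = M: the median; no more than 50% of the observed values are less than it, and no more than 50% are greater than it.

The third quartile (第三四分位数) Q3: a value for which no more than 75% of the observed values are less than it, and no more than 25% are greater than it.

Location of quartiles: Q1, Q2 and Q3 divide the ordered data from X1 to Xn into four parts, each containing no more than 25% of the observations.

[Figure: number line from X1 to Xn with Q1, Q2 (= M) and Q3 marked.]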

The method of estimating the quartiles

If the subscript is not an integer or half-integer, then it is rounded up to the nearest integer or half-integer.

Eg1: Estimate the quartiles

Data:

A      B
34     34
36     36
37     37
39     39
40     40
41     41
42     42
43     43
79     44
       45
--------------
n=9    n=10

The inter-quartile range (四分位数间距): the difference between Q3 and Q1, that is, Q3 - Q1. It covers the middle 50% of the data.

[Figure: number line from X1 to Xn with Q1, M and Q3 marked; the interval from Q1 to Q3 contains the middle 50%.]

Eg2: Estimate the interquartile range (data sets A and B as in Eg1)

Interquartile range of A = 42.5 - 36.5 = 6.0
Interquartile range of B = 43.5 - 37.0 = 6.5
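The quartile values used in Eg2 can be reproduced with the sketch below. The subscript formula is not shown in the slides, so the rule used here is an assumption: the k-th quartile is taken at position k(n + 1)/4 in the ordered data, rounded up to the nearest integer or half-integer, and a half-integer position is read as the average of its two neighbours. The function names `order_stat` and `quartile` are illustrative.

```python
import math

def order_stat(ordered, pos):
    """Value at 1-based position pos, where pos is an integer or a half-integer."""
    lower = math.floor(pos)
    if pos == lower:
        return float(ordered[lower - 1])
    # Half-integer position: average of the two neighbouring ordered values.
    return (ordered[lower - 1] + ordered[lower]) / 2

def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) under the assumed position rule k(n + 1)/4."""
    ordered = sorted(values)
    pos = k * (len(ordered) + 1) / 4
    pos = math.ceil(2 * pos) / 2      # round up to the nearest multiple of 0.5
    return order_stat(ordered, pos)

A = [34, 36, 37, 39, 40, 41, 42, 43, 79]
B = [34, 36, 37, 39, 40, 41, 42, 43, 44, 45]

iqr_A = quartile(A, 3) - quartile(A, 1)
iqr_B = quartile(B, 3) - quartile(B, 1)
print(iqr_A, iqr_B)
```

Note that the extreme value 79 in data set A does not affect its inter-quartile range, whereas it would dominate the range.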

The percentiles

The αth percentile (α 百分位数) Pα: a value for which no more than α% of the data are less than it, and no more than (100 - α)% of the data are larger than it, where 0 ≤ α ≤ 100.

• P0 = min, P100 = max
• P25 = Q1, P50 = Q2 = M, P75 = Q3.

The method of estimating the percentiles

If the subscript is not an integer or half-integer, then it is rounded up to the nearest integer or half-integer.

Eg3: Estimate the percentiles

Data:

A      B
34     34
36     36
37     37
39     39
40     40
41     41
42     42
43     43
79     44
       45
--------------
n=9    n=10

For data A:
P0=34, P10=34, P20=36, P30=37, …, P90=79, P100=79.

For data B:
P0=34, P10=34, P20=36, P30=37, …, P90=44, P100=45.

Note: there are many ways to estimate percentiles; the results are not unique.
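The remark that percentile estimates are not unique can be illustrated with Python's `statistics.quantiles`, which offers two conventions. Neither is claimed to be the rule used in these slides; the example only shows that two reasonable methods give different quartiles for data set A.

```python
from statistics import quantiles

A = [34, 36, 37, 39, 40, 41, 42, 43, 79]

# "exclusive" places the p-th quantile at position p(n + 1);
# "inclusive" places it at position p(n - 1) + 1 (positions counted from 1).
q_exclusive = quantiles(A, n=4, method="exclusive")
q_inclusive = quantiles(A, n=4, method="inclusive")

print(q_exclusive)  # [36.5, 40.0, 42.5]
print(q_inclusive)  # [37.0, 40.0, 42.0]
```

Both methods agree on the median (40) but differ on Q1 and Q3, so a reported quartile or percentile should state which method was used.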

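The variance

The variance (Var, 方差): it measures the average dispersion of the data about the mean.

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$

Sample variance:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Note: the degrees of freedom are not the same: N and n - 1.
* It conveys information about the middle of the distribution.
* S² is an unbiased estimate of σ²; both are non-negative values.
# The unit is not the same as that of the original data (it is the squared unit).

Simplified formulas of variance

Population variance:

$$\sigma^2 = \frac{1}{N}\left[\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}\right]$$

Sample variance:

$$S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$

Proof of the simplified formula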
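The proof itself is not preserved in the slides. One standard derivation for the sample case is sketched below; it uses only the definition of the sample mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, so that $\sum x_i = n\bar{x}$.

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
\end{aligned}
```

Dividing both sides by n - 1 gives the simplified formula for the sample variance. The population case follows the same steps with N in place of n and the population mean in place of the sample mean, dividing by N at the end.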

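Eg4a: Estimate the variance

Data:

id    x     x*x
1     1     1
2     2     4
3     3     9
4     4     16
5     5     25
-------------------
∑     15    55

Solution: with n = 5, ∑x = 15 and ∑x² = 55, the simplified formula gives

$$S^2 = \frac{55 - 15^2/5}{5 - 1} = \frac{55 - 45}{4} = 2.5$$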
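As a check on Eg4a, the sample variance can be computed both from the definition and from the column totals. This is a sketch with illustrative variable names; `statistics.variance` uses the same n - 1 divisor.

```python
from statistics import variance

x = [1, 2, 3, 4, 5]
n = len(x)
x_bar = sum(x) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Simplified formula: needs only the totals sum(x) = 15 and sum(x*x) = 55.
s2_shortcut = (sum(xi * xi for xi in x) - sum(x) ** 2 / n) / (n - 1)

print(s2_definition, s2_shortcut, variance(x))
```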

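Another formula for the variance

Data:

x       freq
x1      f1
x2      f2
……      ……
xk      fk
Total   n

If x takes k different values with frequencies f_1, …, f_k and n = ∑ f_i, the sample variance can be estimated as

$$S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n}\right]$$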

Eg4b: Estimate the variance

Data:

id    x     f     f*x    f*x*x
1     1     3     3      3
2     2     3     6      12
3     3     2     6      18
4     4     1     4      16
5     5     2     10     50
-----------------------------
∑     15    11    29     99

Solution: with n = ∑f = 11, ∑fx = 29 and ∑fx² = 99,

$$S^2 = \frac{99 - 29^2/11}{11 - 1} \approx 2.25$$
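The frequency form of the variance in Eg4b can be checked as follows. The sketch computes the totals of the table and then cross-checks the result by expanding the table back into 11 raw observations; the variable names are illustrative.

```python
from statistics import variance

values = [1, 2, 3, 4, 5]
freqs = [3, 3, 2, 1, 2]

n = sum(freqs)                                             # total frequency
sum_fx = sum(f * x for f, x in zip(freqs, values))         # sum of f*x
sum_fx2 = sum(f * x * x for f, x in zip(freqs, values))    # sum of f*x*x

# Frequency form of the simplified sample variance.
s2_grouped = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

# Cross-check: repeat each value f times and use the raw-data formula.
raw = [x for x, f in zip(values, freqs) for _ in range(f)]
s2_raw = variance(raw)

print(n, sum_fx, sum_fx2, round(s2_grouped, 2))
```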

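The standard deviation

The standard deviation (sd, SD, 标准差): the square root of the variance. Like the variance, it measures the average dispersion of the data about the mean.

Population sd:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2}$$

Sample sd:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

* It conveys information about the mean of the distribution.
* s is the usual estimate of σ (S² is unbiased for σ², although s itself is slightly biased for σ); both are non-negative values.
* The unit is the same as that of the original data.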

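Eg5: Estimate the SD

Data:

id    x     x*x
1     1     1
2     2     4
3     3     9
4     4     16
5     5     25
-------------------
∑     15    55

Solution: the sample variance of these data is S² = (55 - 15²/5)/(5 - 1) = 2.5, so

$$s = \sqrt{2.5} \approx 1.58$$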

The coefficient of variation

The coefficient of variation (cv, CV, 变异系数): it measures the variation relative to the mean.

Population CV:

$$CV = \frac{\sigma}{\mu} \times 100\%$$

Sample cv:

$$cv = \frac{s}{\bar{x}} \times 100\%$$

* It measures relative variability, or relative dispersion.
* Its value does not depend on the unit of the variable, unlike the variance and the standard deviation, which carry units.
* It can be used to compare the variation of variables measured in different units.

Eg6: Estimate the CV

Data: age

id    x
1     1
2     2
3     3
4     4
5     5

sum: 15   mean: 3   var: 2.5   sd: 1.58   cv: 52.70

Data: weight

id    y
1     11
2     12
3     13
4     14
5     15

sum: 65   mean: 13   var: 2.5   sd: 1.58   cv: 12.16

Data: weight

id    y
1     110
2     120
3     130
4     140
5     150

sum: 650   mean: 130   var: 250   sd: 15.8   cv: 12.16

Coding effects:
(1) Adding or subtracting a constant (＋－): S is unchanged;
(2) Multiplying or dividing by a constant (×÷): CV is unchanged.
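The two coding effects can be verified on the three data sets of Eg6. This is a sketch: the helper `cv` is a name introduced here and returns the sample coefficient of variation in percent.

```python
import math
from statistics import mean, stdev

def cv(values):
    """Sample coefficient of variation in percent: s / mean * 100."""
    return 100 * stdev(values) / mean(values)

age = [1, 2, 3, 4, 5]
weight = [x + 10 for x in age]            # add a constant: 11, ..., 15
weight_x10 = [10 * y for y in weight]     # multiply by a constant: 110, ..., 150

# (1) Adding a constant leaves the standard deviation unchanged.
print(round(stdev(age), 2), round(stdev(weight), 2))

# (2) Multiplying by a constant leaves the CV unchanged.
print(round(cv(weight), 2), round(cv(weight_x10), 2))
```

Adding a constant does change the CV (52.70 for age against 12.16 for weight), because the mean in the denominator shifts while s stays the same.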

Summary

1. Measures of central tendency:

mean, median, mode.

2. Measures of dispersion:

variance, standard deviation, range, inter-quartile range, CV.
