kory-mcbride
Chapter 2
Descriptive statistics for quantitative data
定量资料的描述性统计分析
Review: Types of data
Numerical data:
--- continuous
--- discrete
Categorical data:
--- nominal
--- ordinal
Statistics :
It is a branch of applied mathematics concerned with the collection and interpretation of data, and with the evaluation of the reliability of the conclusions based on the data.
review
Types of statistical analysis
Descriptive analysis :
---Data collection
---Data interpretation
Inferential analysis :
---Evaluate the reliability of the
conclusions
Contents
• Frequency distribution
• Central tendency
• Dispersion ( measures of variability )
• Tables and graphs
New words
• Frequency 频数
• Proportion 比例
• Percentage 百分数
• Histogram 直方图
• Polygon 折线图
• Distribution 分布
• Frequency distribution 频数分布
• Cumulative frequency 累积频数
• Cumulative proportion 累积比例
• Central tendency 集中趋势
• Dispersion 离散程度
• Mean 均数
• Arithmetic mean 算术均数
• Geometric mean 几何均数
• Median 中位数
• Mode 众数
• Skewness 偏度
• Kurtosis 峰度
• Descriptive analysis 描述分析
• Inferential analysis 推断分析
1. Frequency distribution
Id sex age
1 m 6
2 m 8
3 f 13
4 m 16
5 f 16
6 f 15
7 f 23
8 m 19
9 f 25
10 f 21
11 m 13
12 f 19
13 f 9
14 f 10
15 f 14
Frequency ( 频数 ):
For a given variable, the
number of times a value
occurs is called its
frequency.
Frequency table of sex
Sex    Label     Frequency
m      Male      5
f      Female    10

Frequency table of sex
Sex    Label     Frequency   Proportion (%)
--------------------------------------------------
m      Male      5           33.33
f      Female    10          66.67
--------------------------------------------------
Total  m+f       15          100.00
Proportion or percent ( 比例或百分数 ): The ratio of a frequency to the total frequency; multiplied by 100, it is expressed as a percentage.
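The frequency table of sex can be reproduced with a few lines of Python. This is a minimal sketch: the list `sex` is copied from the 15-subject data table, and the helper name `frequency_table` is made up for this illustration.

```python
from collections import Counter

# Sex column of the 15-subject example table (ids 1..15).
sex = ["m", "m", "f", "m", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "f"]

def frequency_table(values):
    """Return {value: (frequency, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: (c, round(100 * c / total, 2)) for v, c in counts.items()}

table = frequency_table(sex)
for value, (freq, pct) in table.items():
    print(f"{value}  {freq:2d}  {pct:6.2f}")
```

The percentages should agree with the table above: 33.33 for m and 66.67 for f.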
[Figure: bar chart "Frequency distribution of sex"; x-axis: Sex (male, female); y-axis: Frequency (0 to 15)]
Frequency distribution:
A table or a graph that lists all the distinct values of a variable together with the frequency and proportion with which each value occurs.
Freq distribution of sex
Sex    Frequency   Percentage
m      5           33.33
f      10          66.67
Method of displaying frequency distribution of categorical data
1. Nominal data
2. Ordinal data
Freq distribution of nominal data
[Figure: bar chart "Frequency distribution of sex"; x-axis: Sex (male, female); y-axis: Frequency (0 to 15)]
Freq distribution of sex
Sex    Frequency   Percentage
m      5           33.33
f      10          66.67
Freq distribution of ordinal data
Id  sex  eyesight  age
1   m    1         6
2   m    2         8
3   f    3         13
4   m    3         16
5   f    4         16
6   f    4         15
7   f    5         23
8   m    6         19
9   f    6         25
10  f    6         21
11  m    7         13
12  f    7         19
13  f    8         9
14  f    9         10
15  f    9         14
Freq distribution of eyesight
Eyesight   Frequency   Percentage
1-3        4           26.67
4-6        6           40.00
7-9        5           33.33
[Figure: bar chart "Frequency distribution of eyesight"; x-axis: Eyesight (1-3, 4-6, 7-9); y-axis: Frequency (0 to 8)]
Method of displaying frequency distribution of numerical data
• first divide the whole interval into several non-overlapping subintervals,
• count how many observations lie in each subinterval to make a frequency table,
• take the midpoint of each subinterval as the x-axis label and draw a histogram ( 直方图 ) or a polygon ( 折线图 ).
Freq distribution of age
Age        Midpoint   Frequency
[0-10)     5          3
[10-20)    15         9
[20-30]    25         3
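The binning steps above can be sketched in Python as follows. The function name `grouped_frequencies` and its parameters are hypothetical, introduced only for this illustration; the ages are copied from the 15-subject data table, and the intervals follow the table's convention of [a, b) with the last interval closed on the right.

```python
# Age column of the 15-subject example table (ids 1..15).
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]

def grouped_frequencies(values, lower, width, n_bins):
    """Count values in n_bins equal-width intervals starting at `lower`.

    Intervals are [a, b), except the last one, which also includes its
    upper end point. Returns a list of (midpoint, frequency) pairs.
    """
    upper = lower + width * n_bins
    counts = [0] * n_bins
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"value {v} lies outside [{lower}, {upper}]")
        # The top end point falls into the last interval.
        index = min(int((v - lower) // width), n_bins - 1)
        counts[index] += 1
    midpoints = [lower + width * (i + 0.5) for i in range(n_bins)]
    return list(zip(midpoints, counts))

age_table = grouped_frequencies(ages, lower=0, width=10, n_bins=3)
for midpoint, freq in age_table:
    print(midpoint, freq)
```

The midpoints and counts returned here are what a histogram or frequency polygon of age would plot.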
[Figure: histogram "Frequency distribution of age"; x-axis: Age (midpoints 5, 15, 25); y-axis: Frequency (0 to 10)]
Freq distribution of numerical data
Histogram and polygon
[Figure, histogram: "Frequency distribution of age"; x-axis: Age (midpoints 5, 15, 25); y-axis: Frequency (0 to 10)]
[Figure, polygon: "Frequency polygon for age"; x-axis: Age (0 to 30); y-axis: Frequency (0 to 10)]
Nominal data: [Figure: bar chart "Frequency distribution of sex"; x-axis: Sex; y-axis: Frequency]
Ordinal data: [Figure: bar chart "Frequency distribution of eyesight"; x-axis: Eyesight (1-3, 4-6, 7-9); y-axis: Frequency]
Numerical data: [Figures: histogram "Frequency distribution of age" and "Frequency polygon for age"; x-axis: Age; y-axis: Frequency]
Cumulative frequency and cumulative proportion
Frequency table of age
Age      Midpoint   Frequency   Proportion   Cumulative frequency   Cumulative proportion
0-10     5          3           20.0         3                      20.0
10-20    15         9           60.0         12                     80.0
20-30    25         3           20.0         15                     100.0
Cumulative frequency ( 累计频数 ): the sum of the frequencies from the lowest category up to and including a given category.
Cumulative proportion ( 累计比例 ): the sum of the proportions from the lowest category up to and including a given category.
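As a small illustration, the cumulative columns of the age table can be obtained with a running sum. The variable names are chosen for this sketch only; the frequencies are those of the age table.

```python
from itertools import accumulate

# Frequencies of the age classes 0-10, 10-20, 20-30.
frequencies = [3, 9, 3]
n = sum(frequencies)

# Proportions in percent, then running totals from the lowest class upward.
proportions = [100 * f / n for f in frequencies]
cum_freq = list(accumulate(frequencies))
cum_prop = [100 * c / n for c in cum_freq]

for row in zip(frequencies, proportions, cum_freq, cum_prop):
    print(*row)
```

The last cumulative frequency always equals n, and the last cumulative proportion equals 100.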
The plot of cumulative frequency
and cumulative proportion
The major measures of the characteristics of observations for a numerical variable:
• Central tendency ( 集中趋势 )
• Dispersion ( 离散程度 )
[Figure: histogram "Frequency distribution of red blood cells"; x-axis: Red blood cells (classes 420- to 640-, width 20); y-axis: Frequency (0 to 30)]
2. Central tendency
Central tendency( 集中趋势 ) :
The description of the concentration
near the middle of the range of all
values in a variable.
The major measures of central tendency are: mean, median, mode.
The mean
The mean ( 均数 ) :
It is a measure of the average level of all observations of a variable. It is defined as follows:
---------Arithmetic mean ( 算术均数 )
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Eg1a: Estimate the mean
The data listed below are haemoglobin contents (g/L) ( 血色素 ); estimate the mean.
Data: n=12
id   x      id   x
1    121    7    116
2    118    8    124
3    130    9    127
4    120    10   129
5    122    11   125
6    118    12   132
Formula:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Solution:
$\bar{x}$ = (121+118+…+125+132)/12 = 123.5
So, the estimated mean of the haemoglobin is 123.5 g/L.
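The arithmetic of Eg1a can be checked with a short Python sketch. The helper `sample_mean` is a name made up here; it is compared against `statistics.mean` from the standard library.

```python
from statistics import mean

# Haemoglobin values (g/L) for ids 1..12, copied from the Eg1a data table.
haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]

def sample_mean(values):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(values) / len(values)

x_bar = sample_mean(haemoglobin)
print(x_bar)  # 123.5
```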
Another formula for mean
Data:
x      freq
x1     f1
x2     f2
……     ……
xk     fk
Total  n
If x has k different values, and fi is the frequency with which the i-th value xi occurs in the sample, then the sample mean can be estimated as follows:
Formula:
$\bar{X} = \left(\sum_{i=1}^{k} f_i x_i\right) / \left(\sum_{i=1}^{k} f_i\right) = \left(\sum_{i=1}^{k} f_i x_i\right) / n$
Eg1b: Estimate the mean
The following data are serum cholesterol ( 血清胆固醇 ) measurements from 101 men aged 30-49. Estimate the mean.
Data:
Serum cholest.   Midpoint   Freq.
2.5 ~            3.0        9
3.5 ~            4.0        32
4.5 ~            5.0        42
5.5 ~            6.0        15
6.5 ~            7.0        3
Total                       101
Formula:
$\bar{x} = \left(\sum_{i=1}^{k} f_i x_i\right) / \left(\sum_{i=1}^{k} f_i\right)$
Solution:
n=101,
$\bar{x}$ = (3×9+4×32+5×42+6×15+7×3) / 101 = 4.71 (mmol/L)
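The frequency-weighted mean of Eg1b can be sketched in Python as below. The function name `grouped_mean` is hypothetical; the midpoints and frequencies come from the serum cholesterol table.

```python
# Class midpoints (mmol/L) and frequencies from the serum cholesterol table.
midpoints = [3.0, 4.0, 5.0, 6.0, 7.0]
freqs = [9, 32, 42, 15, 3]

def grouped_mean(xs, fs):
    """Mean of grouped data: sum of f_i * x_i divided by n = sum of f_i."""
    n = sum(fs)
    return sum(f * x for x, f in zip(xs, fs)) / n

chol_mean = grouped_mean(midpoints, freqs)
print(round(chol_mean, 2))  # 4.71
```

Because each observation is replaced by its class midpoint, the result is an approximation to the mean of the raw data.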
The median
The median ( 中位数 ):
It is the middle value among the ordered values of all observations of a variable. It is defined as below:
Population median: $M = X_{((N+1)/2)}$
Sample median: $m = x_{((n+1)/2)}$
In which $X_{(1)}, X_{(2)}, \dots, X_{(N)}$ are the ordered values in the population and $x_{(1)}, x_{(2)}, \dots, x_{(n)}$ are the ordered values in the sample.
eg, if n=9, then $m = x_{((9+1)/2)} = x_{(5)}$;
if n=10, then $m = x_{((10+1)/2)} = x_{(5.5)} = (x_{(5)} + x_{(6)})/2$
The method of estimating the median:
1) Order all observed values of the variable from smallest to largest;
2) If n is odd, find the single middle observation; this value is the required median;
3) If n is even, find the two middle observations; the average of these two values is the required median.
Eg2a: Estimate the median
The data listed below are haemoglobin contents (g/L); estimate the median.
Data:
id   x      id   x
1    121    7    116
2    118    8    124
3    130    9    127
4    120    10   129
5    122    11   125
6    118    12   132
Solution:
The ordered values are:
116,118,118,120,121,122,
124,125,127,129,130,132.
n=12 is even, therefore
med = (122+124)/2 = 123
So, the median of the haemoglobin is 123 g/L.
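The three-step method for the median translates directly into code. This is a minimal sketch with a made-up helper `sample_median`, checked against `statistics.median`; the data are the haemoglobin values of Eg2a.

```python
from statistics import median

# Haemoglobin values (g/L) for ids 1..12, copied from the Eg2a data table.
haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]

def sample_median(values):
    """Order the values; take the middle one (n odd) or average the middle two (n even)."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

med = sample_median(haemoglobin)
print(med)  # 123.0
```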
Eg2b: Estimate the median
The following data are serum cholesterol (mmol/L) measurements from 101 men aged 30-49. Estimate the median.
Data:
Serum cholest.   Midpoint   Freq.
2.5 ~            3.0        9
3.5 ~            4.0        32
4.5 ~            5.0        42
5.5 ~            6.0        15
6.5 ~            7.0        3
Solution:
Since n=101 is an odd number, the median is the single middle value, that is, the 51st ordered value. The first two classes hold 9+32=41 values, so the 51st value falls in the class 4.5 ~ with midpoint 5.0, ie, the median M=5.0. A more accurate value of M, obtained by interpolating within that class, is 4.5+(5.5-4.5) / 42×10=4.74, where 10=51-41 is the rank of the median within the class.
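The interpolation used for the more accurate median can be written as a small function. This sketch follows the arithmetic of the solution above (rank (n+1)/2 = 51 and linear interpolation inside the class); the name `grouped_median` and its parameters are assumptions made for the illustration, and other textbooks use slightly different conventions such as rank n/2.

```python
# Lower class limits (mmol/L), common class width, and class frequencies.
lower_limits = [2.5, 3.5, 4.5, 5.5, 6.5]
width = 1.0
freqs = [9, 32, 42, 15, 3]

def grouped_median(lowers, class_width, fs):
    """Interpolated median of grouped data, using rank (n + 1) / 2."""
    n = sum(fs)
    position = (n + 1) / 2          # 51 for n = 101
    cumulative = 0                  # observations in the classes below
    for lower, f in zip(lowers, fs):
        if cumulative + f >= position:
            # Move into the class in proportion to the rank within it.
            return lower + class_width * (position - cumulative) / f
        cumulative += f
    raise ValueError("frequency table is empty")

chol_median = grouped_median(lower_limits, width, freqs)
print(round(chol_median, 2))  # 4.74
```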
Frequency distribution about mean and median
[Figure: "Central tendency of serum cholesterol"; x-axis: Serum Cholesterol (3 to 7); y-axis: Frequency (0 to 60); marked values: Mean=4.71, Median=5.0]
Skewed distribution
[Figure: two frequency curves. Positive or right skewed: the median lies to the left of the mean. Negative or left skewed: the mean lies to the left of the median.]
Comparing mean and median
                     mean                              median
information          more (actual values)              less (ranks)
data available       not available for ordinal data    available for any data
size in magnitude:
  symmetric          Mean=median
  + skewed           Mean>median
  - skewed           Mean<median
The definition of median
The median is a value for which no more than half the data are smaller than it and no more than half the data are larger than it.
eg, 12, 14, 14, 15, 16, 16, 16, 17, 18.
M=16, for which four values are < M and two values are > M.
The Geometric mean
When the distribution of a variable is not symmetric, or the data have no upper or lower bound, the geometric mean is often a better measure of central tendency than the arithmetic mean.
Eg3. The following data are 10 patients’ white blood cell counts(×1000): 11, 9, 35, 5, 9, 8, 3, 10, 12, 8. Estimate the arithmetic mean and geometric mean.
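A possible way to work Eg3 in Python is sketched below. The geometric mean is taken as the n-th root of the product of the values, computed through logarithms; the helper name `geometric_mean` is chosen for this illustration, and the values must all be positive.

```python
import math

# White blood cell counts (x1000) of the 10 patients in Eg3.
wbc = [11, 9, 35, 5, 9, 8, 3, 10, 12, 8]

def geometric_mean(values):
    """n-th root of the product of the values, via the mean of the logs."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

arith = sum(wbc) / len(wbc)
geom = geometric_mean(wbc)
print(arith, round(geom, 2))  # 11.0 9.03
```

The single large count of 35 pulls the arithmetic mean up to 11.0, while the geometric mean stays close to the bulk of the data.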
The mode
The mode ( 众数 ):
It is defined as the most frequently occurring value (or values) in a set of data.
• It indicates where the values are most concentrated.
• If a data set consists of the values 6,7,7,8,8,8,8,9,10,11,11,12,12,12,12,13, then the modes are 8 and 12.
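Finding the mode amounts to counting each value and keeping those with the highest count. The sketch below uses a hypothetical helper `modes` that returns every most frequent value, so data with two modes are handled as well.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13]
print(modes(data))  # [8, 12]
```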
Summary
• Frequency distribution
• Histogram & polygon
• Measures of central
tendency
• Measures of dispersion
Note: When the widths of the subintervals are not equal, or the data have no upper or lower bound, a polygon is more suitable than a histogram.
[Figure: "Frequency distribution of birthweight"; x-axis: weight in ounces (0 to 180); y-axis: frequency (0 to 25)]
New words
• Dispersion 离散程度
• Range 全距
• Deviation 离均差
• Variance 方差
• Standard deviation 标准差
• Coefficient of variation 变异系数
• Quartile 四分位数
• Percentile 百分位数
• Inter-quartile interval 四分位间距
§3. Dispersion
Dispersion ( 离散程度 ) :
The indication of a spread of
measurements around the center of a
variable distribution
The major measures of dispersion are:
range, variance, standard deviation, inter-
quartile range, coefficient of variation, etc.
The range
The range ( 全距 ):
It measures the length of the interval over which the data are spread.
Population range: Range = max - min
Sample range: Range = max - min
* It is a simple measure, and it has the same unit as the original data.
# It uses little information (only max & min);
# The sample range underestimates the population range: it is biased and inefficient;
# It conveys no information about the middle of the distribution.
The quartiles
The first quartile ( 第一四分位数 ) Q1:
It is a value for which no more than 25% of the observed values are less than it and no more than 75% of the observed values are greater than it.
[Diagram: ordered data from X1 to Xn; ≤25% of the values below Q1, ≤75% above]
The second quartile ( 第二四分位数 ) Q2=M:
It is a value for which no more than 50% of the observed values are less than it and no more than 50% of the observed values are greater than it.
[Diagram: ordered data from X1 to Xn; ≤50% of the values below M, ≤50% above]
The third quartile ( 第三四分位数 ) Q3:
It is a value for which no more than 75% of the observed values are less than it and no more than 25% of the observed values are greater than it.
[Diagram: ordered data from X1 to Xn; ≤75% of the values below Q3, ≤25% above]
Location of quartiles
[Diagram: Q1, Q2 and Q3 split the ordered data into four parts, each holding ≤25% of the values]
The method of estimating the quartiles
If the subscript is not an integer or half-integer, it is rounded up to the nearest integer or half-integer.
Eg1: Estimate the quartiles
Data:
A     B
34    34
36    36
37    37
39    39
40    40
41    41
42    42
43    43
79    44
      45
--------------
n=9   n=10
The inter-quartile range ( 四分位数间距 ) :
It is the difference between Q3 and Q1: Q3-Q1.
[Diagram: ordered data from X1 to Xn; the interval from Q1 to Q3 around M holds the middle 50% of the values]
Eg2: Estimate the interquartile range
Data: the sets A (n=9) and B (n=10) of Eg1.
Interquartile range of A=42.5-36.5=6.0
Interquartile range of B=43.5-37.0=6.5
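The interquartile ranges of A and B can be reproduced with the sketch below. The position formula did not survive in the slides, so the rule used here is an assumption: the k-th quartile is taken at the ordered position k(n+1)/4, rounded up to the nearest integer or half-integer, and a half-integer position means the average of its two neighbours. That rule was picked because it gives the worked values 36.5, 42.5, 37.0 and 43.5; the function names are made up for the illustration.

```python
import math

A = [34, 36, 37, 39, 40, 41, 42, 43, 79]
B = [34, 36, 37, 39, 40, 41, 42, 43, 44, 45]

def order_statistic(sorted_values, position):
    """Value at a 1-based position that is an integer or a half-integer."""
    lower = math.floor(position)
    upper = math.ceil(position)
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the assumed position rule k(n+1)/4."""
    xs = sorted(values)
    n = len(xs)
    result = []
    for k in (1, 2, 3):
        # Round k(n+1)/4 up to the nearest multiple of 0.5.
        position = math.ceil(2 * k * (n + 1) / 4) / 2
        # Keep the position inside 1..n for very small samples.
        position = min(max(position, 1), n)
        result.append(order_statistic(xs, position))
    return tuple(result)

def interquartile_range(values):
    q1, _, q3 = quartiles(values)
    return q3 - q1

print(quartiles(A), interquartile_range(A))
print(quartiles(B), interquartile_range(B))
```

Note that the outlier 79 in A has no effect on its interquartile range, whereas it dominates the range (79-34=45).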
The percentiles
The αth percentile (α 百分位数 ) Pα :
It is a value for which no more than α% of the data are less than it and no more than (100-α)% of the data are larger than it, where 0 ≤ α ≤ 100.
• P0=min, P100=max
• P25= Q1, P50= Q2=M, P75= Q3.
The method of estimating the percentiles
If the subscript is not an integer or half-integer, it is rounded up to the nearest integer or half-integer.
Eg3: Estimate the percentiles
Data:
A     B
34    34
36    36
37    37
39    39
40    40
41    41
42    42
43    43
79    44
      45
--------------
n=9   n=10
For data A:
P0=34, P10=34, P20=36, P30=37, …, P90=79, P100=79.
For data B:
P0=34, P10=34, P20=36, P30=37, …, P90=44, P100=45.
Note: there are many ways to estimate percentiles; the results are not unique.
The variance
The variance (Var, 方差 ):
It measures the average dispersion of the data about the mean.
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i-\mu)^2$
Sample variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$
Note: the degrees of freedom are not the same: N and n-1.
* It conveys information about the middle of the distribution.
* S² is an unbiased estimate of σ²; both are non-negative values;
# The unit is not the same as that of the original data (it is the squared unit).
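Simplified formulas of variance
Population variance: $\sigma^2 = \frac{1}{N}\left[\sum_{i=1}^{N} X_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} X_i\right)^2\right]$
Sample variance: $S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2\right]$
Proof of the simplified formula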
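The proof itself is not preserved in the slides. The standard argument for the sample case, given here as a sketch with $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, expands the square and uses $\sum_{i=1}^{n} X_i = n\bar{X}$:

$$
\sum_{i=1}^{n}(X_i-\bar{X})^2
= \sum_{i=1}^{n} X_i^2 - 2\bar{X}\sum_{i=1}^{n} X_i + n\bar{X}^2
= \sum_{i=1}^{n} X_i^2 - n\bar{X}^2
= \sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2 .
$$

Dividing both sides by n-1 gives the simplified formula for the sample variance; the population case is identical with N and μ in place of n and $\bar{X}$, and division by N.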
Eg4a: Estimate the variance
Data:
id    x     x*x
1     1     1
2     2     4
3     3     9
4     4     16
5     5     25
-------------------
∑     15    55
Solution:
S² = (55 - 15²/5) / (5-1) = 2.5
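A short check of Eg4a in Python, computing the sample variance both from the definition (squared deviations about the mean, divided by n-1) and from the sums of x and x*x in the table. The two helper names are made up for this sketch; `statistics.variance` is the standard-library sample variance.

```python
from statistics import variance

x = [1, 2, 3, 4, 5]

def sample_variance(values):
    """Definition: sum of squared deviations about the mean, over n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

def sample_variance_shortcut(values):
    """Shortcut: (sum of x^2 - (sum of x)^2 / n) over n - 1."""
    n = len(values)
    s1 = sum(values)                  # 15 in the table
    s2 = sum(v * v for v in values)   # 55 in the table
    return (s2 - s1 * s1 / n) / (n - 1)

print(sample_variance(x), sample_variance_shortcut(x))  # 2.5 2.5
```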
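Another formula for variance
Data:
x      freq
x1     f1
x2     f2
……     ……
xk     fk
Total  n
Formula:
$S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{k} f_i x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{k} f_i x_i\right)^2\right]$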
Eg4b: Estimate the variance
Data:
id    x     f     f*x    f*x*x
1     1     3     3      3
2     2     3     6      12
3     3     2     6      18
4     4     1     4      16
5     5     2     10     50
-----------------------------
∑     15    11    29     99
Solution:
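The worked solution of Eg4b is not preserved here. As a hedged sketch, the variance can be computed from the column sums of the table (n=11, ∑f*x=29, ∑f*x*x=99), assuming the frequency-weighted version of the shortcut formula with divisor n-1; the function name `grouped_variance` is hypothetical.

```python
# Distinct values and their frequencies from the Eg4b table.
x = [1, 2, 3, 4, 5]
f = [3, 3, 2, 1, 2]

def grouped_variance(xs, fs):
    """Sample variance from a frequency table, shortcut form with divisor n - 1."""
    n = sum(fs)                                        # 11
    sum_fx = sum(fi * xi for xi, fi in zip(xs, fs))    # 29
    sum_fx2 = sum(fi * xi * xi for xi, fi in zip(xs, fs))  # 99
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

s2 = grouped_variance(x, f)
print(round(s2, 4))
```

Expanding the table into its 11 individual observations and applying the ordinary sample variance gives the same value.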
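The standard deviation
The standard deviation (sd, SD, 标准差 ):
It measures the average dispersion of the data about the mean, as the square root of the variance.
Population sd: $\sigma = \sqrt{\sigma^2}$
Sample sd: $s = \sqrt{S^2}$
* It conveys information about the spread around the mean of the distribution.
* s is the usual estimate of σ (S² is unbiased for σ², although s itself is slightly biased for σ); both are non-negative values;
* The unit is the same as that of the original data.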
Eg5: Estimate the SD
Data:
id    x     x*x
1     1     1
2     2     4
3     3     9
4     4     16
5     5     25
-------------------
∑     15    55
Solution:
S² = 2.5, so s = √2.5 ≈ 1.58
The coefficient of variation
The coefficient of variation (cv, CV, 变异系数 ): It measures the relative variation about the mean.
Population cv: $CV = \frac{\sigma}{\mu} \times 100\%$
Sample cv: $CV = \frac{s}{\bar{X}} \times 100\%$
* It measures relative variability or relative dispersion.
* Its value does not depend on the unit of the variable, unlike the variance or the standard deviation, which carry units.
* It can be used to compare the variation of variables measured in different units.
Eg6: Estimate the CV
Data: age
id    x
1     1
2     2
3     3
4     4
5     5
sum: 15   mean: 3   var: 2.5   sd: 1.58   cv: 52.70
Data: weight
id    y
1     11
2     12
3     13
4     14
5     15
sum: 65   mean: 13   var: 2.5   sd: 1.58   cv: 12.16
Data: weight
id    y
1     110
2     120
3     130
4     140
5     150
sum: 650   mean: 130   var: 250   sd: 15.8   cv: 12.16
Coding effects:
(1) Adding or subtracting a constant (+-): S is unchanged;
(2) Multiplying or dividing by a positive constant (×÷): CV is unchanged.
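The three data sets of Eg6 illustrate both coding effects, which can be verified with the standard library. In this sketch `cv_percent` is a made-up helper that returns the sample CV in percent; the lists are copied from the Eg6 tables.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Sample coefficient of variation in percent: 100 * s / mean."""
    return 100 * stdev(values) / mean(values)

age = [1, 2, 3, 4, 5]
weight = [11, 12, 13, 14, 15]            # age + 10
weight_x10 = [110, 120, 130, 140, 150]   # weight * 10

for name, data in [("age", age), ("weight", weight), ("weight x10", weight_x10)]:
    print(name, round(stdev(data), 2), round(cv_percent(data), 2))
```

Adding 10 leaves the standard deviation at about 1.58 but changes the CV, while multiplying by 10 scales the standard deviation to about 15.8 and leaves the CV at about 12.16.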
Summary
1. Measures of central tendency:
mean, median, mode.
2. Measures of dispersion:
variance, standard deviation, range, inter-quartile range, CV.