8/2/2019 Bio Stat 600
Introduction to Biostatistics
BIOSTATISTICS 600
Instructor: T. E. Raghunathan (Raghu)
E-mail: [email protected]
About the course
To provide an overview of basic concepts in the Design and Analysis of Biostatistical Investigations.
Unify the thought process, as many students take courses under different circumstances at various times
Kindle imaginations for your course work and refresh memory
Grade
Letter grade will be based on a multiple-choice test on the last day of the lecture
What is biostatistics?
Biostatistics, as a field of science, is concerned with:
The design and conduct of experiments (or studies) to collect observations (or data).
Displaying and analyzing the data to infer about the population
Duly acknowledging the uncertainty in the stated conclusions while inferring about the population.
R. A. Fisher defined statistics as
Study of populations
Study of variations
Study of methods to reduce the data
Key Concepts
The goal of statistical analysis is to draw inferences or conclusions in an unbiased fashion
The target population should be clearly defined (that is, the population for which inference is drawn should be clearly stated)
It is important to recognize the variations in the population, and these should be reflected in the inferences.
The same experiment conducted on two different populations may yield different results (systematic component of variation)
Two different conditions or experiments conducted on the same population may yield different results (systematic component of variation)
The experiment replicated under the same conditions on the same population may yield different results (random component of variation).
Always find ways to succinctly describe the results using graphical and numerical summaries
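Two Displays
Back-to-back stem-and-leaf display (Control leaves on the left, Treatment leaves on the right):
Control | Stem | Treatment
        |  1   | 4466
   5431 |  2   | 0446888
    331 |  3   | 012
        |  4   |
      4 |  5   |
Stem-and-leaf display:
1 | 4466
2 | 0134456888
3 | 011233
4 |
5 | 4
Such displays are useful to visually inspect the extent to which distributions overlap as well as the magnitude of the differences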
Displays for qualitative variables
Distribution of Cause of Death
Cause                  Count
Circulatory system   137,165
Neoplasms             69,948
Respiratory system    33,223
Injury/Poisoning       6,427
Digestive system      10,779
Nervous system         5,990
Others                30,695
Total                294,227
[Figures: Dot chart and Bar chart of the cause-of-death distribution]
You can create these and other types of plots using PROC GPLOT in SAS or in Excel. These graphs were produced using a freeware package called R, which can be downloaded from www.r-project.org
We will discuss some more graphical displays after introducing some numerical summaries
Numerical Summaries
Central tendency: Represents a typical or middle value
Spread: Extent of variability across observations
Shape: The structure of the distribution
Central tendency
Mean: Sum all the observations and divide by the number of observations being summed (arithmetic mean)
Mean survival time of guinea pigs in the control group
(21+23+24+25+31+33+33+54)/8 = 244/8 = 30.5
Median: The number such that 50% of the observations are less than the number and 50% are greater than the number
Median survival time of guinea pigs in the control group
21, 23, 24, 25, 31, 33, 33, 54
Technically, any number between 25 and 31. Usually the average (25+31)/2 = 28 is used.
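As a quick check of the arithmetic above, here is a minimal sketch using Python's standard `statistics` module (the variable name `control` is only illustrative). For an even number of observations, `median` returns the average of the two middle values, which is the convention mentioned above.

```python
from statistics import mean, median

# Control-group survival times quoted in the slide
control = [21, 23, 24, 25, 31, 33, 33, 54]

print("mean   =", mean(control))    # sum of observations / number of observations
print("median =", median(control))  # average of the two middle values (25 and 31)
```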
Histogram
Divide the data into groups by choosing intervals (mutually exclusive, equal width and exhaustive)
Some rules: n/3 or n1/2log10(n)
Count the observations in each group
Draw a bar chart with area of the bar proportional to the count
If you fix the width of the bar to 1 then the height of the bar is proportional to the count
Shape of the distribution:
Symmetric: Mean, Median and Mode are the same. Values on either side of the mean are equally likely
Skewed: Mean is larger or smaller than the median
Normal distribution
Characterized by the mean and standard deviation (SD):
68% of observations lie within 1 SD of the mean
95% of observations lie within 2 SD of the mean
99.7% of observations lie within 3 SD of the mean
Key Concepts
Population: A collection of units in which we are interested
All people living in the United States
In the study of treatment for diabetes, all people with diabetes
Blood pressure for a person: All possible measurements of blood pressure in that person
Generally, it is impossible to measure each and every unit in the population (if we could, it is called a Census).
A practical approach: A sample is usually a very small subset of units in the population. The sample is measured and studied to draw conclusions about the population.
The method used to draw the sample is the key step in a biostatistical investigation.
The sample should be representative of the population (probability or random sampling designs assure such unbiased representativeness)
Due to sampling from the population there is uncertainty in the inferences. Statistical analysis expresses these uncertainties in terms of probabilistic statements
Probability
Meaning of probability is controversial.
Empirical definition: The probability of an event A is the relative frequency with which the event occurs in a long sequence of trials in which A is one of the outcomes.
It only makes sense to talk about the probability when the event in question can be thought of as a result of an experiment that could be performed repeatedly
Tossing a coin or throwing a die
Suppose the median height in the population is 168 cm. Suppose we keep drawing one individual at a time and measuring height. Over the long run, as the sample size gets large, half the people in our sample will have heights below 168 cm.
A random person chosen from this population will have height below 168 cm with probability 1/2
Probability (contd.)
Subjective interpretation of probability
It is a degree of belief expressing the certainty with which the event is expected to occur.
This broader definition allows probabilistic statements without necessarily contemplating a series of trials
Anything that is not known to you means that you are uncertain about it. The probability is simply an expression of that uncertainty.
Statistical inference based on the empirical definition of probability is called frequentist or repeated sampling inference
Statistical inference based on the subjective interpretation of probability is called Bayesian inference.
Fortunately, for large samples the numerical results under both systems of inference are very similar, but the interpretations differ.
Frequentist inference is the focus of this course
Key Concepts
The collection of all possible outcomes of an experiment is called the sample space
Tossing a coin: S = {H, T}
Study on health insurance: Drawing a random sample of n subjects and assessing how many have health insurance: S = {0, 1, 2, ..., n}
An event is a subset of the sample space
Tossing a coin: E = {H}
Study on health insurance: None have insurance (E = {0})
At least 60% have health insurance: $E = \{X \in S \mid X \ge 0.6\,n\}$
An experiment may involve measuring a continuous variable on an individual
Sample space: An interval on the real line
Assume $(0, \infty)$ or $(-\infty, \infty)$, with almost zero mass outside the appropriate interval (mathematical convenience)
Example: X = Systolic blood pressure
An event is a subset of the real (or positive real) line
E = {X > 140}
Probability Distribution
Rule for assigning probability to all possible events
Probability mass function: Probability assignment to each individual element of the sample space (Discrete Sample Space)
$$\Pr(X = x) = f(x),\ x \in S; \qquad \Pr(X = x) = 0,\ x \notin S$$
Probability density function: Probability assignment to an arbitrarily small interval around each potential value of a continuous variable (Continuous Sample Space)
$$\Pr(X \in dx) = f(x)\,dx$$
Distribution function
$$\Pr(X \le u) = F(u) = \int_a^u f(x)\,dx$$
Rules of Probability
A: an event
$0 \le \Pr(A) \le 1$
$\Pr(A) = 0$: A will not occur in the entire sequence of experiments
$\Pr(A) = 1$: Only A will occur in the entire sequence of experiments
$A^c$ (or $\bar{A}$): Complement of A, or "Not A"
$\Pr(A^c) = 1 - \Pr(A)$
Two events A and B are mutually exclusive when occurrence of A rules out the occurrence of B in a trial
Pr(A or B) = Pr(A) + Pr(B)
Two events A and B are independent when occurrence of A has no bearing on the occurrence or non-occurrence of B
Pr(A and B) = Pr(A)*Pr(B)
Example 1: Median height of the population is 168 cm. Two individuals are chosen at random independently. What is the probability that the height of both is less than 168 cm?
A = First person's height is less than 168 cm
B = Second person's height is less than 168 cm
A and B = Both have height less than 168 cm
Because of independence, Pr(A and B) = Pr(A)*Pr(B) = 1/2 * 1/2 = 1/4
Example 2: Suppose that 10% of the population has height exceeding 180 cm. What is the probability that exactly one person's height exceeds 180 cm?
Two possible scenarios (here A and B denote the heights of the first and second person):
C1: A ≤ 180 and B > 180
C2: A > 180 and B ≤ 180
The two scenarios are mutually exclusive, so Pr(C1 or C2) = 0.9*0.1 + 0.1*0.9 = 0.18
Properties of the Binomial distribution
In a typical sample of size n, you may expect nP subjects to have the disease
If you take several samples of size n from this population and note down the number of diseased subjects, the variance among these numbers will be nP(1-P).
Inferential problem: Given n and x, how do we infer about P?
Intuitively the estimate of P is x/n. This estimate turns out to be a really good estimate.
How do we decide that it is good? We will see later.
Poisson Distribution
The Poisson distribution is a close cousin of the Binomial distribution.
If P is very small (rare disease) and n is very large, then the probability that in a sample of size n you will find x diseased subjects is
$$\frac{(nP)^x e^{-nP}}{x!} = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \text{where } \lambda = nP$$
$\lambda$ = Expected number of diseased people in the population
If you take a large number of very large samples and count the number of diseased people in each sample, the variance among these numbers will be approximately $\lambda$
Inferential problem: Given x, how do we draw inference about $\lambda$?
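To see how close the two distributions are when P is small and n is large, the sketch below evaluates both probability formulas side by side. The values n = 10,000 and P = 0.0003 are hypothetical, chosen only for illustration; the function names are likewise illustrative.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    """Binomial probability of exactly x diseased subjects in a sample of size n."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """Poisson probability of exactly x diseased subjects, with lam = n*P."""
    return lam**x * exp(-lam) / factorial(x)

# Hypothetical rare-disease setting (not from the lecture)
n, p = 10_000, 0.0003
lam = n * p

for x in range(6):
    print(x, round(binom_pmf(x, n, p), 5), round(poisson_pmf(x, lam), 5))
```

The two columns should agree to several decimal places, which is the sense in which the Poisson is an approximation to the Binomial here.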
Normal Distribution
A popular model for many continuous variables
It is a symmetric bell-shaped curve characterized by two parameters: Mean and Standard Deviation
Mean: Center of the distribution (the same as the median and mode)
90% of observations lie between mean-1.64*SD and mean+1.64*SD
95% of observations lie between mean-1.96*SD and mean+1.96*SD
Conditional Probability
How likely is it that an event A will happen given that the event B has occurred?
$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}$$
If A and B are independent then
$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(A)\Pr(B)}{\Pr(B)} = \Pr(A)$$
Inverse Problem (Bayes Rule)
$$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} = \frac{\Pr(A \mid B)\Pr(B)}{\Pr(A)}$$
Measures used in Diagnostic tests
Diagnostic test indicates T+ or T-
True state of the disease: D+ or D-
Properties of diagnostic tests
Sensitivity: Pr(T+|D+)
Specificity: Pr(T-|D-)
Usefulness or value of diagnostic tests
Positive Predictive Value (PPV): Pr(D+|T+)
Negative Predictive Value (NPV): Pr(D-|T-)
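The predictive values follow from sensitivity, specificity and the disease prevalence Pr(D+) through Bayes rule. The sketch below shows one way to carry out that calculation; the prevalence, sensitivity and specificity used are hypothetical numbers, and the function name is illustrative rather than anything from the lecture.

```python
def predictive_values(sens, spec, prev):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence via Bayes rule."""
    # Pr(T+) = Pr(T+|D+)Pr(D+) + Pr(T+|D-)Pr(D-)
    p_pos = sens * prev + (1 - spec) * (1 - prev)
    ppv = sens * prev / p_pos              # Pr(D+|T+)
    npv = spec * (1 - prev) / (1 - p_pos)  # Pr(D-|T-)
    return ppv, npv

# Hypothetical test: 90% sensitive, 95% specific, disease prevalence 1%
ppv, npv = predictive_values(sens=0.90, spec=0.95, prev=0.01)
print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")
```

With a rare disease, even a fairly accurate test can have a low PPV, because most positive results come from the much larger disease-free group.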
Key Concept in statistical inference: Sampling Distribution
How do we judge whether the estimator (x/n), the sample proportion, is a good estimator of the population proportion P?
Imagine that you draw several samples each of size n. Each will give you a different estimate. Variation in the estimates from sample to sample is called the sampling variance. The square root of the sampling variance is called the standard error.
Two important criteria:
You would want the estimates to be the same as the estimand, on the average. Such estimates are called unbiased
The sample-to-sample variation in the estimates should be as small as possible. That is, the standard error should be as small as possible
The most desirable estimate: An unbiased estimate that has the smallest sampling variance.
In this sense the sample proportion is the most desirable estimate of the population proportion
Sample Proportion
The sampling variance of the sample proportion, p, is approximately p(1-p)/n.
The standard error: $SE = \sqrt{p(1-p)/n}$
Instead of using a single value to estimate the population proportion, sometimes it is desired to provide a range of plausible values for the unknown population proportion with a reasonable degree of confidence.
A confidence interval is a summary measure that provides such a set of plausible values. Usually confidence levels are 90%, 95% or even 99%.
An approximate 95% confidence interval for the unknown population proportion is
$$p \pm 1.96 \times SE = p \pm 1.96\sqrt{p(1-p)/n}$$
Confidence intervals
90% confidence intervals: $p \pm 1.64 \times SE$
99% confidence intervals: $p \pm 2.57 \times SE$
Example: In a random sample of 2,837 children in the State of Michigan, 118 said they usually coughed first thing in the morning. What can you infer about the prevalence of this condition in the entire state?
Sample prevalence is p = 118/2837 = 0.0416, the estimated prevalence rate for the entire state.
The uncertainty in the estimate is
$$SE = \sqrt{0.0416 \times (1 - 0.0416)/2837} = 0.0037$$
95% confidence interval:
$$0.0416 \pm 1.96 \times 0.0037 = (0.034, 0.049)$$
With reasonable confidence one could conclude that the population prevalence rate is between 3.4% and 4.9%
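The cough-prevalence calculation can be reproduced with a few lines of code. This is a minimal sketch of the large-sample interval described above; the function name `proportion_ci` is illustrative.

```python
from math import sqrt

def proportion_ci(x, n, z=1.96):
    """Sample proportion, its standard error, and an approximate CI (z = 1.96 for 95%)."""
    p = x / n
    se = sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)

# Michigan example: 118 of 2,837 children reported coughing in the morning
p, se, (lo, hi) = proportion_ci(118, 2837)
print(f"p = {p:.4f}, SE = {se:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Passing `z=1.64` or `z=2.57` gives the 90% and 99% intervals.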
Example
The following data were collected in a study of plasma magnesium in diabetic patients. The diabetic subjects were all insulin-dependent subjects attending a diabetic clinic over a 5-month period. The non-diabetic controls were a mixture of blood donors and people attending day centers for the elderly, to give a wide age distribution. Plasma magnesium follows a Normal distribution very closely.
The summary data are as follows:
Number of diabetic subjects = 227
Mean plasma magnesium = 0.719
Standard deviation = 0.068
Number of non-diabetic controls = 140
Mean plasma magnesium = 0.810
Standard deviation = 0.057
Questions of Interest
Calculate an interval which would include 95% of plasma magnesium measurements from the control population. This is called the reference interval. It gives information about the distribution of plasma magnesium in the population.
Given that the distribution of plasma magnesium is normal, the mean and standard deviation completely specify the distribution. Thus we would expect 95% of the observations to lie between 0.810-1.96*0.057 and 0.810+1.96*0.057. That is, between 0.698 and 0.922.
What proportion of diabetic subjects do we expect to lie in the reference interval?
The plasma magnesium level for diabetic subjects is normal with mean 0.719 and standard deviation 0.068. What is the area under this normal curve between 0.698 and 0.922?
$$\Pr(0.698 \le X \le 0.922) = \Pr\left(\frac{0.698 - 0.719}{0.068} \le Z \le \frac{0.922 - 0.719}{0.068}\right)$$
$$= \Pr(-0.31 \le Z \le 2.99) = \Pr(Z \le 2.99) - \Pr(Z \le -0.31) = 0.9986 - 0.3783 = 0.6203$$
Only about 62% of diabetic patients will lie in the reference interval.
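The normal-curve area used above can also be computed without a table. The sketch below builds the standard normal distribution function from `math.erf` (the helper name `norm_cdf` is illustrative) and repeats the reference-interval calculation. Because it does not round the interval endpoints or z-values, the result may differ from the table-based figure in the third decimal place.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal distribution function Pr(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# 95% reference interval from the control (non-diabetic) population
mu_c, sd_c = 0.810, 0.057
lo, hi = mu_c - 1.96 * sd_c, mu_c + 1.96 * sd_c

# Proportion of the diabetic population expected inside that interval
mu_d, sd_d = 0.719, 0.068
prop = norm_cdf((hi - mu_d) / sd_d) - norm_cdf((lo - mu_d) / sd_d)

print(f"reference interval = ({lo:.3f}, {hi:.3f})")
print(f"proportion of diabetic subjects inside = {prop:.3f}")
```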
What are the estimates of the population mean of plasma magnesium for diabetic and non-diabetic populations?
Estimate of the population mean for diabetic subjects is 0.719 mmol/liter
Estimate of the population mean for non-diabetic subjects is 0.810 mmol/liter.
What are the standard errors of the population mean estimates?
Sample-to-sample variation in the estimated mean for the population of diabetic subjects is 0.0045 and for the control population it is 0.0048.
Diabetic population: $n_1 = 227$, $s_1 = 0.068$; $SE_1 = s_1/\sqrt{n_1} = 0.068/\sqrt{227} = 0.0045$
Non-diabetic population: $n_2 = 140$, $s_2 = 0.057$; $SE_2 = s_2/\sqrt{n_2} = 0.057/\sqrt{140} = 0.0048$
Find a 95% confidence interval for the population mean for the control population.
$\bar{x}_2 = 0.810$, $SE_2 = 0.0048$
95% confidence interval:
$$(\bar{x}_2 - 1.96 \times SE_2,\ \bar{x}_2 + 1.96 \times SE_2) = (0.810 - 1.96 \times 0.0048,\ 0.810 + 1.96 \times 0.0048) = (0.801, 0.819)$$
How does the confidence interval differ from the 95% reference interval? Why are they different?
Find the standard error of the difference in the mean plasma magnesium between the diabetic and non-diabetic populations.
Estimated difference: $0.719 - 0.810 = -0.091$
$$SE(\text{diff}) = \sqrt{SE_1^2 + SE_2^2} = \sqrt{0.0045^2 + 0.0048^2} = 0.0066$$
Find a 95% confidence interval for the difference in the means between the diabetic and non-diabetic populations.
$$(-0.091 - 1.96 \times 0.0066,\ -0.091 + 1.96 \times 0.0066) = (-0.104, -0.078)$$
We are more than 95% confident that the difference in the population means is negative. That is, the mean magnesium level for diabetic subjects is smaller than the mean magnesium level for non-diabetic subjects.
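The standard errors and the interval for the difference in means can be checked directly from the summary data. This is a minimal sketch with illustrative variable names; it uses the large-sample multiplier 1.96 as in the slides.

```python
from math import sqrt

# Summary data: (sample size, mean, standard deviation)
n1, m1, s1 = 227, 0.719, 0.068   # diabetic
n2, m2, s2 = 140, 0.810, 0.057   # non-diabetic controls

se1, se2 = s1 / sqrt(n1), s2 / sqrt(n2)
se_diff = sqrt(se1**2 + se2**2)   # standard error of the difference in means
diff = m1 - m2
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

print(f"SE1 = {se1:.4f}, SE2 = {se2:.4f}, SE(diff) = {se_diff:.4f}")
print(f"diff = {diff:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```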
Would plasma magnesium be a good diagnostic test for diabetes?
The method discussed so far can be used to compare two population proportions. Note that the proportion is simply the average of 0s and 1s. The proportion is the mean of the binary variable.
Example: A study was conducted to determine to what extent children with bronchitis in infancy get more respiratory symptoms in later life than others. 273 children who had bronchitis before age 5 (group 1) were compared to 1046 children who did not (group 2). The outcome was whether or not these children coughed during the day or night at age 14.
26 of 273 reported coughing in group 1 and 44 of 1046 reported coughing in group 2.
$p_1 = 26/273 = 0.095$
$p_2 = 44/1046 = 0.042$
$p_1 - p_2 = 0.095 - 0.042 = 0.053$
$$SE(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = \sqrt{\frac{0.095 \times (1 - 0.095)}{273} + \frac{0.042 \times (1 - 0.042)}{1046}} = 0.0188$$
95% confidence interval:
$$0.053 \pm 1.96 \times 0.0188 = (0.016, 0.090)$$
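The same steps, written as a short sketch for the bronchitis example (variable names are illustrative; unrounded proportions are used, so intermediate values can differ slightly from the rounded ones on the slide):

```python
from math import sqrt

x1, n1 = 26, 273     # group 1: bronchitis before age 5
x2, n2 = 44, 1046    # group 2: no bronchitis

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, diff = {diff:.3f}, SE = {se:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```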
Adjustments for small samples
When the sample size is large, the central limit theorem applies and the sample mean has an approximately normal distribution regardless of the original distribution of the outcome variables.
When the sample size is small, the distribution of the sample mean standardized by its estimated standard error is not normal, even if the distribution of the outcome is normal.
Adjustment is needed when the sample size is small
Example: Does increasing the amount of calcium in our diet reduce blood pressure? In a randomized experiment 10 black men were given a calcium supplement for 12 weeks and 11 black men received a placebo that appeared identical. The experiment was double blind. The outcome was the change in blood pressure over a 12-week period.
Data
Calcium group: n = 10, mean = 5 and standard deviation = 8.743
Placebo group: n = 11, mean = -0.273 and standard deviation = 5.901
Two situations
Suppose the population standard deviations in the two populations are the same
Pooled standard deviation
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{9 \times (8.743)^2 + 10 \times (5.901)^2}{9 + 10} = 54.536$$
$$s_p = \sqrt{54.536} = 7.385$$
$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = 7.385\sqrt{1/10 + 1/11} = 3.227$$
Degrees of freedom = 9 + 10 = 19, t = 2.093
95% confidence interval: $(5.273 \pm t \times 3.227)$
$$(5.273 - 2.093 \times 3.227,\ 5.273 + 2.093 \times 3.227) = (-1.48, 12.027)$$
Given the considerable uncertainty, no difference in the population means between the calcium and placebo groups is plausible (the interval includes 0). Based on this data, we are confident that the mean difference is between -1.5 and 12.0
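The pooled-variance interval can be reproduced from the summary data as sketched below. The t multiplier 2.093 (19 degrees of freedom) is taken from the slide rather than computed, since the Python standard library has no t quantile function; variable names are illustrative.

```python
from math import sqrt

n1, m1, s1 = 10, 5.0, 8.743      # calcium group
n2, m2, s2 = 11, -0.273, 5.901   # placebo group

# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = sqrt(sp2)
se = sp * sqrt(1 / n1 + 1 / n2)

df = n1 + n2 - 2
t_crit = 2.093                   # t multiplier for df = 19, as quoted in the slide
diff = m1 - m2
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"sp^2 = {sp2:.3f}, sp = {sp:.3f}, SE = {se:.3f}, df = {df}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```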
What if the two population standard deviations are not the same?
$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{(8.743)^2}{10} + \frac{(5.901)^2}{11}} = 3.29$$
$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2} = \frac{[(8.743)^2/10 + (5.901)^2/11]^2}{((8.743)^2/10)^2/9 + ((5.901)^2/11)^2/10}$$
$$= \frac{(7.64 + 3.16)^2}{(7.64)^2/9 + (3.16)^2/10} = \frac{116.64}{6.49 + 1.00} = 15.57$$
t multiplier, interpolating between 15 and 16 degrees of freedom:
$$t = 2.131 + (2.120 - 2.131) \times 0.57 = 2.125$$
95% confidence interval:
$$(5.273 - 2.125 \times 3.29,\ 5.273 + 2.125 \times 3.29) = (-1.72, 12.26)$$
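A sketch of the unequal-variance calculation follows. It evaluates the standard error and the approximate degrees of freedom at full precision, so the degrees of freedom come out slightly different from the slide's value, which was computed from rounded intermediate terms. The multiplier 2.125 is the interpolated value quoted on the slide, not computed here.

```python
from math import sqrt

n1, m1, s1 = 10, 5.0, 8.743      # calcium group
n2, m2, s2 = 11, -0.273, 5.901   # placebo group

v1, v2 = s1**2 / n1, s2**2 / n2  # squared standard errors of each sample mean
se = sqrt(v1 + v2)

# Approximate degrees of freedom when the variances are not assumed equal
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

t_crit = 2.125                   # interpolated t multiplier quoted in the slide
diff = m1 - m2
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"SE = {se:.2f}, df = {df:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```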
How can we check whether the population standard deviations are the same?
We will discuss this in the context of hypothesis testing.
It can be argued that the equality of population standard deviations can never be empirically verified, especially if the sample size is small. One should always, therefore, use the procedure which does not assume equality of population standard deviations.
How to make the decision?
General principle: Check whether the data are consistent with the null hypothesis.
We will answer the question by assessing how likely it is that the observed data would have been generated if the null hypothesis were true.
A simple procedure
If the null hypothesis were true then the differences between pronethalol and baseline values will be on the average 0, with roughly half of them positive and the rest of them negative. That is, under the hypothesis a negative sign will occur with probability 0.5. But only one negative value has occurred.
Probability of observing 1 negative and 11 positives is
$$\binom{12}{1} 0.5^{1} \times 0.5^{11} = 0.00293$$
An even more extreme observation would be 0 negatives and 12 positives, which has the probability
$$\binom{12}{0} 0.5^{0} \times 0.5^{12} = 0.00024$$
Test statistic: Number of negative values
P-value: The probability of observing a test statistic which is as extreme or more extreme than the observed, if the null hypothesis were true
P-value = 0.00293 + 0.00024 = 0.00317
The above test is called the sign test. We performed a one-sided test because only smaller values of the number of negative values were considered.
If one were to find a large number of negative values, say 11 or 12, then that is also evidence against the null hypothesis
The probability of obtaining 11 or 12 negatives is also 0.00317 (verify!).
The two-sided p-value is 0.00317 + 0.00317 = 0.00634
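The sign-test probabilities above are binomial probabilities with n = 12 and success probability 0.5, so they are easy to verify. A minimal sketch (the helper name `binom_pmf` is illustrative):

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k negative signs among n differences."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, observed = 12, 1   # 12 differences, 1 negative sign observed

# One-sided p-value: as few or fewer negatives than observed
p_one_sided = sum(binom_pmf(k, n) for k in range(observed + 1))
# The distribution is symmetric when p = 0.5, so the other tail (11 or 12 negatives) is equal
p_upper = sum(binom_pmf(k, n) for k in range(n - observed, n + 1))
p_two_sided = p_one_sided + p_upper

print(f"one-sided p = {p_one_sided:.5f}, two-sided p = {p_two_sided:.5f}")
```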
How do we define extreme values that constitute evidence against the null hypothesis?
A Tale of Two Errors
Type 1 error: Rejecting the null hypothesis when it is actually true.
Type 2 error: Failure to reject the null hypothesis when it is actually false. (Equivalently, accepting the null hypothesis when it is actually false)
The chance of making a type 1 error is called the significance level and is denoted by $\alpha$
The chance of making a type 2 error is denoted by $\beta$.
$1 - \beta$ is called power. Power is the chance of rejecting the null hypothesis when the alternative is true
The objective is to control the chances of making either type of error
Strategy: For a fixed significance level, we will define the extreme values.
Revisiting the pronethalol example
Suppose we specify that the chance of making a type 1 error is 0.05
For two-sided alternatives: The extreme values are determined by choosing values c and d so that the probability that the number of negative values is less than or equal to c or greater than or equal to d is 0.05.
For one-sided alternatives: The extreme value is determined by choosing a value c so that the probability that the number of negative values is less than or equal to c is 0.05.
Look at the binomial table on Page T-9 (last column, n=12). If we choose c = 2 and d = 10, the probability of a type 1 error is Pr(X ≤ 2) + Pr(X ≥ 10) = 0.0193 + 0.0193 = 0.0386. It is not possible to determine c and d to achieve a significance level of exactly 0.05.
For a one-sided hypothesis, choose c = 3. (It gives a significance level slightly larger than the specified 0.05.)
Usually power is calculated instead of the chance of making a type 2 error.
We need a specific alternative value which is the truth. Suppose that 1/12 (approximately 8%) is indeed the true probability of a negative value. The chance of rejecting the null hypothesis (with c = 2 and d = 10) is: 0.3677+0.3837+0.1835+0.0000+0.0000+0.0000 = 0.9349
Try calculating power for alternatives 0.05, 0.10, 0.20 etc.
A power curve is a plot of power against the alternative values.
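The suggested exercise can be sketched as below, assuming the two-sided rejection region "number of negative values ≤ 2 or ≥ 10" with n = 12 discussed above. The function name `reject_prob` is illustrative. Evaluating it at p = 0.5 gives the type 1 error of this rejection region, and evaluating it at other values of p traces out the power curve.

```python
from math import comb

def reject_prob(p, n=12, c=2, d=10):
    """Probability that the number of negative values is <= c or >= d,
    when each difference is negative with probability p."""
    def pmf(k):
        return comb(n, k) * p**k * (1 - p) ** (n - k)
    return sum(pmf(k) for k in range(n + 1) if k <= c or k >= d)

# Power at several alternatives; p = 0.5 is the null hypothesis value
for p in (0.05, 0.08, 0.10, 0.20, 0.50):
    print(f"p = {p:.2f}: probability of rejecting = {reject_prob(p):.4f}")
```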
The sign test so far has considered only the sign of the difference and not the magnitude of the difference. Let us consider some alternatives.
Suppose that the differences can be assumed to be normally distributed. The estimate of the mean difference in the population is 7.7 and the standard deviation is 15.1.
The standard error is 4.4.
If the null hypothesis is true then the sample mean should be distributed around 0. The extent to which the observed sample mean is different from 0 is the evidence against the null hypothesis.
One way to measure the distance between the observed sample mean and the null hypothesis value is in terms of standard error units.
$$t = \frac{\bar{x} - 0}{SE(\bar{x})}$$
Calculated value of t-statistic is 7.7/4.4 = 1.75
What is the probability that one would observe this extreme or even more extreme values of the test statistic under the null hypothesis?
If the null hypothesis were true then the statistic has a t-distribution with 11 degrees of freedom.
[Figure: t density with 11 degrees of freedom, shaded below -1.75 and above 1.75]
Shaded area: 0.1079 (computed using a computer)
For a fixed significance level, say 0.05, the value of the test statistic considered to be large is 2.201
Two-Sample Tests
Revisit the plasma magnesium example
The following data were collected in a study of plasma magnesium in diabetic patients. The diabetic subjects were all insulin-dependent subjects attending a diabetic clinic over a 5-month period. The non-diabetic controls were a mixture of blood donors and people attending day centers for the elderly, to give a wide age distribution. Plasma magnesium follows a Normal distribution very closely.
The summary data are as follows:
Number of diabetic subjects = 227
Mean plasma magnesium = 0.719
Standard deviation = 0.068
Number of non-diabetic controls = 140
Mean plasma magnesium = 0.810
Standard deviation = 0.057
Frame the question as the testing of a statistical hypothesis: Are the means of plasma magnesium in the two populations (diabetic and non-diabetic) the same?
Diabetic: $n_1 = 227$, $\bar{x}_1 = 0.719$, $s_1 = 0.068$
Non-diabetic: $n_2 = 140$, $\bar{x}_2 = 0.810$, $s_2 = 0.057$
Mean for Diabetic population: $\mu_1$
Mean for Non-diabetic population: $\mu_2$
Null Hypothesis: $H_0: \mu_1 = \mu_2$
Alternative Hypothesis: $H_A: \mu_1 \ne \mu_2$
$\bar{x}_1 - \bar{x}_2$ is an estimate of $\mu_1 - \mu_2$
If the null hypothesis were true then $\bar{x}_1 - \bar{x}_2$ should be distributed around mean 0. The extent to which it is away from 0 is evidence against the null hypothesis.
Test statistic:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{-0.091}{0.0066} = -13.78$$
The sampling distribution is normal given the large sample sizes from each population. If the null hypothesis were true, 68% of the samples should result in a value of the test statistic between -1 and 1, 90% of the samples between -1.64 and 1.64, and 95% of samples between -1.96 and 1.96. What we have observed is very unlikely under the null hypothesis. Therefore, the null hypothesis is suspect.
Small sample example revisited
Two situations: Population variances are equal or unequal
Example: Does increasing the amount of calcium in our diet reduce blood pressure? In a
randomized experiment 10 black men were given a calcium supplement for 12 weeks
and 11 black men received placebo that appeared identical. The experiment was double
blind. The outcome was the change in the blood pressure over a 12 week period.
Data
Calcium group: sample size =10, mean=5 and standard deviation =8.743
Placebo group: sample size = 11, mean = -0.273 and standard deviation = 5.901
Population mean if everybody in the population were given the calcium supplement: $\mu_1$
Population mean if everybody in the population were given only placebo: $\mu_2$
Null hypothesis: $H_0: \mu_1 = \mu_2$
Alternative hypothesis: $H_A: \mu_1 > \mu_2$
A large positive mean difference $\bar{x}_1 - \bar{x}_2$ is evidence against the null hypothesis in favor of the alternative hypothesis
Alternative hypothesis: $H_A: \mu_1 \ne \mu_2$
A large positive or negative mean difference is evidence against the null hypothesis in favor of the alternative hypothesis
Variances are equal
Pooled standard deviation
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{9 \times (8.743)^2 + 10 \times (5.901)^2}{9 + 10} = 54.536$$
$$s_p = \sqrt{54.536} = 7.385$$
Standard error of the difference in the means
$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = 7.385\sqrt{1/10 + 1/11} = 3.227$$
Test statistic
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{5.273}{3.227} = 1.63$$
Degrees of freedom = 9 + 10 = 19
Sampling distribution: t with 19 degrees of freedom
P-value
One-sided alternative: area above 1.63
From Table D on page T-11, the shaded area is between 0.05 and 0.10
Computer software: 0.0598
Two-sided alternative: area below -1.63 plus area above 1.63
P-value = 2*0.0598 = 0.1196
Variances are unequal
$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{(8.743)^2}{10} + \frac{(5.901)^2}{11}} = 3.29$$
$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2} = \frac{[(8.743)^2/10 + (5.901)^2/11]^2}{((8.743)^2/10)^2/9 + ((5.901)^2/11)^2/10}$$
$$= \frac{(7.64 + 3.16)^2}{(7.64)^2/9 + (3.16)^2/10} = \frac{116.64}{6.49 + 1.00} = 15.57$$
Test statistic
$$t = \frac{5.273}{3.29} = 1.603$$
P-value:
One-sided: 0.065
Two-sided: 0.13
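The P-values quoted for the t-tests come from tail areas of the t distribution. The Python standard library has no t distribution function, so the sketch below integrates the t density numerically with Simpson's rule; this is an illustrative approach (function names are assumptions), not necessarily what the software used for the slides did. Results should agree with the quoted values to about two or three decimal places.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom (df may be fractional)."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Pr(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

print("pooled,  t=1.63,  df=19    one-sided:", round(t_upper_tail(1.63, 19), 4))
print("unequal, t=1.603, df=15.57 one-sided:", round(t_upper_tail(1.603, 15.57), 4))
print("paired,  t=1.75,  df=11    two-sided:", round(2 * t_upper_tail(1.75, 11), 4))
```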
Blood Pressure Example
$n_1 = n_2 = n$, $\sigma = 7.4$
Power (%) by sample size per group (n) and true difference in means:
n      2.5     5     7.5
30    25.6   74.4   97.5
40    32.7   85.6   99.4
50    39.3   92.2   99.9
60    45.6   95.9   100
70    51.5   97.9   100
100   66.6   99.8   100
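The numbers in the table above are consistent with the power of a two-sided, 5% level, two-sample comparison, under the assumptions that 7.4 is the common standard deviation, that 2.5, 5 and 7.5 are the true differences in means, and that n is the sample size per group. The sketch below uses a normal approximation under exactly those assumptions (the labels are inferred, and the function names are illustrative), so small discrepancies from the tabulated values are expected.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power_two_sample(delta, sigma, n, z_crit=1.96):
    """Approximate power of a two-sided test comparing two means,
    n subjects per group, common SD sigma, true difference delta."""
    se = sigma * sqrt(2 / n)     # standard error of the difference in means
    shift = delta / se
    return norm_cdf(shift - z_crit) + norm_cdf(-shift - z_crit)

for n in (30, 40, 50, 60, 70, 100):
    row = [100 * power_two_sample(d, 7.4, n) for d in (2.5, 5, 7.5)]
    print(n, [round(v, 1) for v in row])
```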
Analysis of Variance
Suppose now that we want to compare more than 2 populations
One could do pair-wise comparisons. This is cumbersome and is not easy to summarize when the number of populations compared is large.
The analysis of variance is used by framing the question in terms of an in-depth investigation of variations in the observed data
Analysis of variance basically partitions the overall variability into one or more assignable causes or reasons.
What is left unassigned is called residual variability.
Based on the partition of the variability, the relative merits of assignable causes are investigated.
Generally, the variation due to an assignable cause relative to the residual variability is used as a yardstick for judging the importance of the assignable cause.
The assignable causes can be carefully planned or manipulated through an experimental design
The assignable causes are based on substantive reasoning in an observational study design
Example:
A randomized study was conducted to test the generality of the observation that stimulation of the walking and placing reflexes in the newborn promotes increased walking and placing (Zelazo, Zelazo and Kolb (1972, Science, pages 314-315)). A total of 29 one-week old males were randomized to four groups. 1: Active exercise, 2: Passive exercise, 3: No-exercise and 4: 8-week control group. Age of infants walking alone (in months) was the outcome variable of interest.
The assignable cause is the level of exercise
Is the variation caused by this assignable cause substantial?
Data
Active Exercise   Passive Exercise   No Exercise   Control
 9.00             11.00              11.50         13.25
 9.50             10.00              12.00         11.50
 9.75             10.00               9.00         12.00
10.00             11.75              11.50         13.50
13.00             10.50              13.25         11.50
 9.50             15.00              13.00
$y_{ij}$ = Observation for subject j in group i
$j = 1, 2, \ldots, n_i$
$i = 1, 2, \ldots, k$
$\bar{y}_{++}$ = Overall mean
$$\text{Total variation} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{++})^2$$
Example:
$k = 4$
$n_1 = n_2 = n_3 = 6$, $n_4 = 5$
$\bar{y}_{++} = 261/23 = 11.34$
Total variation = 58.47
ANOVA
$$(y_{ij} - \bar{y}_{++}) = (y_{ij} - \bar{y}_{i+}) + (\bar{y}_{i+} - \bar{y}_{++})$$
Overall deviation = Within Groups (or Between-subjects nested within groups) + Between Groups
$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{++})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i+})^2 + \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{y}_{i+} - \bar{y}_{++})^2$$
Total SS = Within SS + Between SS
58.47 = 43.69 + 14.78
Degrees of freedom: Number of independent statistics used to compute the sum of squares
An alternative Expression
$$y_{ij} = \bar{y}_{++} + (\bar{y}_{i+} - \bar{y}_{++}) + (y_{ij} - \bar{y}_{i+}) = \mu + \alpha_i + \varepsilon_{ij}$$
$\mu$: Overall mean
$\alpha_i$: Deviation of the group i mean from the overall mean
$\varepsilon_{ij}$: Residual, $\varepsilon_{ij} = s_{j(i)}$ = Effect of subject j nested within group i
Df for Total SS = 22 (Every observation is used but the sum of deviations is zero)
Df for Within SS = 19 (Every observation is used but the sum of deviations within each group is zero)
Df for Between SS = 3 (Four means are used but the sum of deviations from the overall mean is zero)
To compare the sums of squares, differences in the degrees of freedom have to be taken into account.
Mean square = Sum of squares / Degrees of freedom
$N = \sum_{i=1}^{k} n_i$ = Total sample size
Df(Total SS) = $N - 1$
Df(Between SS) = $k - 1$
Df(Within SS) = $N - k$
ANOVA Example
MS(Between) = 14.78/3 = 4.93
MS(Within) = 43.69/19 = 2.30
MS(Between)/MS(Within) = 2.14
Is 2.14 large?
Use the F-distribution (numerator df = 3, denominator df = 19) to determine how likely an F of 2.14 or even larger is when in actuality there are no differences among the four groups.
P-value: 0.1228
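The sums of squares and the F ratio can be recomputed from the raw data. The sketch below assumes the 23 listed values belong to the groups in the order given (the first six to Active Exercise, the next six to Passive Exercise, the next six to No Exercise, and the last five to Control), which matches the stated group sizes. The P-value is not computed because the standard library has no F distribution function.

```python
# Age (months) at which infants walked alone; grouping assumed as described above
groups = {
    "Active":  [9.00, 9.50, 9.75, 10.00, 13.00, 9.50],
    "Passive": [11.00, 10.00, 10.00, 11.75, 10.50, 15.00],
    "None":    [11.50, 12.00, 9.00, 11.50, 13.25, 13.00],
    "Control": [13.25, 11.50, 12.00, 13.50, 11.50],
}

all_obs = [y for g in groups.values() for y in g]
N, k = len(all_obs), len(groups)
grand_mean = sum(all_obs) / N

# Partition the total sum of squares into between-group and within-group parts
total_ss = sum((y - grand_mean) ** 2 for y in all_obs)
between_ss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
within_ss = sum((y - sum(g) / len(g)) ** 2 for g in groups.values() for y in g)

ms_between = between_ss / (k - 1)
ms_within = within_ss / (N - k)
F = ms_between / ms_within

print(f"Total SS = {total_ss:.2f}, Between SS = {between_ss:.2f}, Within SS = {within_ss:.2f}")
print(f"MS(Between) = {ms_between:.2f}, MS(Within) = {ms_within:.2f}, F = {F:.2f}")
```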
Regression Analysis
Bulk of scientific investigations are concerned with relationships.
Causal relationship: If one changes the variable X by a certain amount, how much does the variable Y change?
Association or correlational relationship: Do subjects with different values of X also tend to have different values of Y?
What is the nature of these relationships in the population?
How do you quantify these relationships in the population?
How do you estimate the quantities describing these population relationships?
How accurate are those estimates? How much uncertainty is there in assessing these relationships?
The two-sample tests and ANOVA also fit into this category
Are the population means related to the treatments assigned or the observed grouping?
We will later see that the two-sample t-tests and ANOVA F-tests are particular cases of the general regression framework.
Terminology
X = Independent variable.
A variable that an investigator can change in an experiment
A variable amenable to intervention in an observational study
Or simply the variable whose impact is to be assessed.
There can be more than one independent variable of interest
Other names: Predictors, correlates, right-hand-side variables, exogenous variables
Y = Dependent variable of interest.
The variable for which you want to assess the effect of X.
Other names: Outcome, endogenous variable, left-hand-side variable
The impact of different values of X on differences in Y, expressed in some meaningful terms, is of interest
Example
The following table gives data collected by a group of medical students in a physiology class. The objective is to assess the association between height and FEV1.
Height   FEV1
164.0    3.54
167.0    3.54
170.4    3.19
171.2    2.85
171.2    3.42
171.3    3.20
172.0    3.60
172.0    3.78
174.0    4.32
176.0    3.75
177.0    3.09
177.0    4.05
177.0    5.43
177.4    3.60
178.0    2.98
180.7    4.80
181.0    3.96
183.1    4.78
183.6    4.56
183.7    4.68
Scatter plot
Scatter plot: A graphical device to assess the type of relationship.
Each point is a pair (X, Y)
Dependent variable on the vertical axis
Independent variable on the horizontal axis
Inspection of the graph suggests a linear relationship
Linear relationship
Representation:
$$y = a + b\,x$$
$$(y_1 - y_2) \propto (x_1 - x_2)$$
Clearly, not every observation will satisfy this equation:
$$y_i = \underbrace{a + b\,x_i}_{\text{line-value or the expected value}} + \underbrace{e_i}_{\text{residual}}$$
How to determine a and b?
Method of Least Squares:
Find a and b that minimize the residual sum of squares:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b\,x_i)^2$$
$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
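Simplified formulas
Slope:
$$b = \frac{\sum_i x_i y_i - \sum_i x_i \sum_i y_i / n}{\sum_i x_i^2 - \left(\sum_i x_i\right)^2 / n}$$
Intercept:
$$a = \sum_i y_i / n - b \sum_i x_i / n$$
Needed quantities:
$$\sum_i x_i, \quad \sum_i y_i, \quad \sum_i x_i y_i, \quad \sum_i x_i^2$$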
Example: y = FEV1, x = Height
$$\sum_i x_i = 3{,}507.6, \quad \sum_i y_i = 77.12$$
$$\sum_i x_i y_i = 13{,}568.18, \quad \sum_i x_i^2 = 615{,}739.24$$
$$b = \frac{13{,}568.18 - 3{,}507.6 \times 77.12 / 20}{615{,}739.24 - (3{,}507.6)^2 / 20} = 0.074389$$
$$a = 77.12 / 20 - 0.074389 \times 3{,}507.6 / 20 = -9.19$$
Prediction equation
$$\widehat{FEV1} = -9.19 + 0.0744 \times Height$$
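The simplified formulas can be applied directly to the 20 height/FEV1 pairs from the example table. The following Python sketch does this; the function name `least_squares` and the variable names are illustrative, not part of the course material.

```python
# Height and FEV1 pairs from the example table (n = 20)
heights = [164.0, 167.0, 170.4, 171.2, 171.2, 171.3, 172.0, 172.0, 174.0, 176.0,
           177.0, 177.0, 177.0, 177.4, 178.0, 180.7, 181.0, 183.1, 183.6, 183.7]
fev1 = [3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
        3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68]


def least_squares(x, y):
    """Intercept and slope from the 'needed quantities' (sums and cross-products)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


a, b = least_squares(heights, fev1)
print(f"FEV1 = {a:.2f} + {b:.4f} * Height")
```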
Interpretation
Slope
$b$ = Expected difference in $y$ for a unit positive difference in $x$
Two individuals
Individual 1: $x = h$
Individual 2: $x = h + 1$
Expected or line-value for individual 1: $a + b\,h$
Expected or line-value for individual 2: $a + b\,(h + 1)$
Difference $= b$
Interpretation (Contd.)
Intercept
Expected value of y when x = 0.
It is not very interpretable in this particular problem: the value
of FEV1 when Height is 0!
Modification: Centering
$$y = c + d\,(x - \bar{x})$$
$d = b$
$c$ = Expected value of $y$ for average height
Residual
Residuals from the estimated line:
$$e_i = y_i - a - b\,x_i$$
Residuals represent deviations from the expected value. Large residuals reflect unreliability or uncertainty.
One way to measure this uncertainty is through the variance of the residuals (or the standard deviation of the residuals):
$$s_e^2 = \frac{\sum_i e_i^2}{n - 2}$$
Computational formulas
$$\sum_i e_i^2 = \sum_i (y_i - \bar{y})^2 - b^2 \sum_i (x_i - \bar{x})^2$$
$$s_e^2 = \frac{(n - 1)\,(s_y^2 - b^2 s_x^2)}{n - 2}$$
$s_y$ = SD of $y$'s
$s_x$ = SD of $x$'s
$$b = \frac{s_{xy}}{s_x^2}$$
Covariance:
$$s_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Example
$$s_x = 5.51, \quad s_y = 0.71$$
$$s_e^2 = \frac{19\,(0.71^2 - 0.0744^2 \times 5.51^2)}{18} = 0.35$$
Correlation Coefficient: Another measure of the strength of the linear relationship
$$r = \frac{s_{xy}}{s_x s_y} = \frac{2.26}{5.51 \times 0.71} = 0.58$$
A measure of linear association between x and y.
How useful is x in predicting y?
Total variance $= s_y^2 = 0.71^2 = 0.504$
(Residual variance from a horizontal line)
Degrees of freedom $= n - 1 = 19$
Residual variance $= s_e^2 = 0.35$
(Residual variance from the regression line on x)
Degrees of freedom $= n - 2 = 18$
$$R^2 = \frac{(n - 1)\,s_y^2 - (n - 2)\,s_e^2}{(n - 1)\,s_y^2} = \frac{19 \times 0.50 - 18 \times 0.35}{19 \times 0.50} = 0.34$$
That is, 34% (in percentage)
Another Form of $R^2$
$$R^2 = \frac{\text{Variance}(\hat{y}\text{'s})}{\text{Variance}(y\text{'s})} = \frac{s_{\hat{y}}^2}{s_y^2}$$
R-square is a simple measure of how much of the
variability in y is explained by the variation in x.
Large values of R-square indicate that substantial
variation in y is due to variation in x. A small R-square
indicates the opposite.
This measure also has disadvantages, and we will discuss those when we consider multiple predictors.
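As a check on the residual variance, correlation and R-square quoted above, these quantities can be recomputed from the 20 height/FEV1 pairs. This is a minimal sketch; the function name `fit_summary` and its return keys are illustrative assumptions. Because it works from unrounded sums, small differences from the slide's rounded intermediate values are to be expected.

```python
import math

# Height and FEV1 pairs from the example table (n = 20)
heights = [164.0, 167.0, 170.4, 171.2, 171.2, 171.3, 172.0, 172.0, 174.0, 176.0,
           177.0, 177.0, 177.0, 177.4, 178.0, 180.7, 181.0, 183.1, 183.6, 183.7]
fev1 = [3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
        3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68]


def fit_summary(x, y):
    """Residual variance, correlation and R-square for a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)            # (n - 1) * s_x^2
    syy = sum((yi - mean_y) ** 2 for yi in y)            # (n - 1) * s_y^2
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    residual_ss = syy - slope ** 2 * sxx                 # sum of squared residuals
    return {
        "residual_variance": residual_ss / (n - 2),      # s_e^2
        "correlation": sxy / math.sqrt(sxx * syy),       # r
        "r_squared": 1 - residual_ss / syy,              # R^2
    }


result = fit_summary(heights, fev1)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```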
Inference
How much do the slope and intercept estimates vary from sample to sample?
Standard error of the estimates:
$$SE(b) = \sqrt{\frac{s_e^2}{(n - 1)\,s_x^2}}$$
95% confidence interval:
$$b \pm t_{0.025,\,n-2}\; SE(b)$$
Test the hypothesis $H_0$: b is equal to 0 versus $H_A$: b is not equal to zero
$$t = \frac{b - b_0}{SE(b)}, \quad \text{where } b_0 \text{ is the hypothesized value of the slope}$$
Under the null,
$$t = \frac{b}{SE(b)}$$
Estimated Line Value
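Suppose x = f, and it is not one of the observed values in the data set.
What would one expect y to be on average?
$$\hat{y}_f = a + b\,f$$
$$SE(\hat{y}_f) = \sqrt{s_e^2 \left[\frac{1}{n} + \frac{(f - \bar{x})^2}{(n - 1)\,s_x^2}\right]}$$
Example:
$$f = 175$$
$$\hat{y}_f = -9.19 + 0.0744 \times 175 = 3.83$$
$$SE(\hat{y}_f) = \sqrt{0.35 \left[\frac{1}{20} + \frac{(175 - 175.38)^2}{19 \times 5.51^2}\right]} = 0.133$$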
Prediction Interval
This refers to a confidence interval for a single observation on the outcome variable for a given value of the independent variable x = f.
$$\hat{y}_f = a + b\,f$$
Prediction SE:
$$PSE(\hat{y}_f) = \sqrt{s_e^2 \left[1 + \frac{1}{n} + \frac{(f - \bar{x})^2}{(n - 1)\,s_x^2}\right]}$$
$$\hat{y}_f \pm t_{1-\alpha/2,\,n-2}\; PSE$$
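The two standard errors differ only by the extra "1" inside the bracket. The Python sketch below makes that explicit using the rounded summary values quoted in the example. The function name `line_value_se` and its `new_observation` flag are illustrative assumptions, and the t quantile 2.101 (0.975 quantile, 18 degrees of freedom) is taken from standard t tables rather than from the slides.

```python
import math


def line_value_se(f, n, mean_x, s_x, s_e2, new_observation=False):
    """SE of the estimated line value at x = f; adds 1 for a single new observation."""
    extra = 1.0 if new_observation else 0.0
    return math.sqrt(s_e2 * (extra + 1.0 / n + (f - mean_x) ** 2 / ((n - 1) * s_x ** 2)))


# Rounded summary values quoted in the height/FEV1 example
n, mean_x, s_x, s_e2 = 20, 175.38, 5.51, 0.35
a, b = -9.19, 0.0744
f = 175

y_hat = a + b * f
se_line = line_value_se(f, n, mean_x, s_x, s_e2)
se_pred = line_value_se(f, n, mean_x, s_x, s_e2, new_observation=True)

T_CRIT = 2.101  # assumed: 0.975 quantile of t with n - 2 = 18 df, from standard tables
print(f"line value at x = {f}: {y_hat:.2f}")
print(f"95% CI for the line value: {y_hat - T_CRIT * se_line:.2f} to {y_hat + T_CRIT * se_line:.2f}")
print(f"95% prediction interval:   {y_hat - T_CRIT * se_pred:.2f} to {y_hat + T_CRIT * se_pred:.2f}")
```

The prediction interval is much wider than the confidence interval for the line value because it also carries the variability of an individual observation around the line.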
Discrete Predictors
In the example considered so far, the independent variable was a continuous variable.
Suppose now the independent variable is binary, coded as x = 0 or 1.
Interpretation of regression coefficients:
$$E(y \mid x) = a + b\,x$$
$$E(y \mid x = 0) = a$$
$$E(y \mid x = 1) = a + b$$
$$b = E(y \mid x = 1) - E(y \mid x = 0)$$
a = Mean for the reference group, defined as subjects with x = 0
b = Difference in the means between the two groups, x = 1 versus x = 0
The test for significance of b is identical to the two-sample t-test
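The interpretation of a and b for a binary predictor can be verified numerically. In the Python sketch below the eight data values are hypothetical, made up purely for illustration, and the function name `least_squares` is an illustrative choice.

```python
def least_squares(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: x = 0 is the reference group, x = 1 the comparison group
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [3.1, 2.9, 3.4, 3.0, 3.9, 4.2, 3.8, 4.1]

a, b = least_squares(x, y)
group0 = [yi for xi, yi in zip(x, y) if xi == 0]
group1 = [yi for xi, yi in zip(x, y) if xi == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print(f"a = {a:.3f}, mean of group x=0 = {mean0:.3f}")
print(f"b = {b:.3f}, difference in means = {mean1 - mean0:.3f}")
```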
Multiple Predictors
Often in practice several variables might influence the dependent variable.
Some common examples
[Diagram: C is related to both X and Y]
Part of the X, Y relationship is due to a common relationship with C
$$E(Y \mid X) = a + bX$$
b ignores the influence of C on X and Y
$$E(Y \mid X, C) = c + dX + eC$$
$$E(Y \mid X = x + 1, C = f) - E(Y \mid X = x, C = f) = [c + d(x + 1) + ef] - [c + dx + ef] = d$$
d = Difference in the expected values of Y associated with one
positive unit difference in X, holding C constant.
b = Unadjusted estimate
d = Estimate adjusted for C; usually d will be smaller than b
(but it can be larger than b)
C = Confounding variable
[Diagram: X acts on I, and I acts on Y]
$$E(Y \mid X) = a + bX$$
b represents the effect of what is actually the variable I
I: Intervening variable
$$E(Y \mid X, I) = c + dX + eI, \qquad d \approx 0$$
[Diagram: X acts on I, I acts on Y, and X also acts directly on Y]
X may act through I as well as act independently on Y
The statistical effect of I or C on the regression
coefficient of X will be the same. Conceptual understanding is needed to
distinguish between confounding and intervening variables.
[Figure: Y against X, with separate lines for M = 0 and M = 1]
The effect of X depends on M. The effect of X is
modified by the presence or absence of M.
$$E(Y \mid X, M) = a + bX + cM + dX \cdot M$$
$$E(Y \mid X, M = 0) = a + bX$$
$$E(Y \mid X, M = 1) = (a + c) + (b + d)X$$
d = The extent to which the effect of X is modified by the
presence of M (that is, M = 1)
d = 0, c arbitrary: Parallel lines for the two groups M = 1 and M = 0
d = 0, c = 0: Coincident lines
c = 0, d arbitrary: Same intercept; the lines for M = 0 and M = 1 fan out
[Figure panels: d = 0, c arbitrary; d, c arbitrary; d = 0, c = 0; c = 0, d arbitrary]
Analysis of Cross-classified Data
So far we have concentrated on analyzing relationships between a continuous dependent variable and continuous or discrete (or categorical) independent variables.
Regression
ANOVA
Many times the dependent and independent variables are both discrete.
Qualitative categories (such as gender, race/ethnicity, geographical location, type of health insurance)
Quantitative or ordered categories
low, medium and high socioeconomic status
none, very low, low, medium, high doses of environmental exposure
Other combinations are also possible
Discrete dependent (Yes/No for a disease), continuous independent (age, BMI, etc.)
Logistic regression
Number of events as dependent (number of seizures among epileptic patients over a fixed or variable period of time)
Poisson regression
Continuous dependent but truncated (or censored). For example, failure time, time to death, time to symptoms. These may be known for some individuals, and for others it is only known that they exceed some known value.
Survival analysis
The chi-square statistic is one of the distance measures:
$$T = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
$r$ = Number of rows
$c$ = Number of columns
$O_{ij}$ = Observed frequency in row $i$ and column $j$
$E_{ij}$ = Expected frequency in row $i$ and column $j$
T has a chi-square distribution with df = (r-1)(c-1) degrees of freedom.
$$T = \frac{(50 - 61.7)^2}{61.7} + \frac{(849 - 837.3)^2}{837.3} + \frac{(29 - 17.7)^2}{17.7} + \dots + \frac{(3 - 2.7)^2}{2.7} + \frac{(36 - 36.3)^2}{36.3} = 10.5$$
r = 5, c = 2; df = 4.
The critical value at significance level 0.05 is 9.49. The data are not consistent with the hypothesis of no association between housing tenure and time of delivery. That is, there is good evidence of an association between housing tenure and time of delivery.
The chi-square statistic is not a measure of association. If we double the frequencies in each cell, the association will remain unchanged but chi-square will double.
Chi-square is a large-sample test and is questionable if any expected frequency is less than 5. Alternatives are
Yates' correction
Fisher's exact test
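The doubling property is easy to see numerically. In the Python sketch below, the 2 x 2 table is hypothetical (not the housing tenure data) and the function name `chi_square_statistic` is an illustrative choice; expected frequencies are computed as row total times column total divided by the grand total.

```python
def chi_square_statistic(table):
    """Chi-square distance T and its degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 2 table of counts, used only to illustrate the doubling property
table = [[30, 20], [20, 30]]
doubled = [[2 * count for count in row] for row in table]

t_original, df = chi_square_statistic(table)
t_doubled, _ = chi_square_statistic(doubled)
print(f"T = {t_original:.2f} (df = {df}); after doubling every cell, T = {t_doubled:.2f}")
```

The cell proportions, and hence the association, are identical in the two tables; only the sample size differs.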
Yates' Correction
$$T_Y = \sum \frac{(|O - E| - 0.5)^2}{E}$$
Fisher's exact test. It is based on computing the probability of observing a particular contingency table or tables that are more inconsistent with the hypothesis of no association. It is a complicated algorithm and is usually performed using a computer.
Example: The following table is from a trial investigating the efficacy of streptomycin for the treatment of pulmonary tuberculosis. The data are for the subgroup of patients with an initial temperature of 100-100.9 °F. The two variables are the radiological assessment of the disease 6 months later and the treatment.
Radiological assessment   Streptomycin   Control
Improvement               13             5
Deterioration             2              7
Death                     0              5
Pooled table
Radiological assessment   Streptomycin   Control
Improvement               13 (a)         5 (b)
Deterioration or Death    2 (c)          12 (d)
Odds of improvement in the streptomycin group = 13/2 = a/c
Odds of improvement in the control group = 5/12 = b/d
Odds ratio = (13/2)/(5/12) = 15.6 = (ad)/(bc)
Log(OR) = 2.75
var(log(OR)) = 1/a + 1/b + 1/c + 1/d = 1/13 + 1/5 + 1/2 + 1/12 = 0.86
SE = 0.93
Confidence interval:
exp[Log(OR) - z*SE(log(OR))] to exp[Log(OR) + z*SE(log(OR))]
95% confidence interval for log-odds-ratio:
(2.75 - 1.96*0.93, 2.75 + 1.96*0.93)
= (0.93, 4.57)
95% confidence interval for odds-ratio:
(2.53, 96.54)
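The odds ratio calculation for the pooled table can be sketched in Python as follows. The function name `odds_ratio_ci` is an illustrative choice, and the cell labels a, b, c, d follow the pooled table. The code works with unrounded intermediate values, so its interval can differ slightly from one computed from the rounded figures 2.75 and 0.93.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio, SE of its log, and a confidence interval built on the log scale."""
    odds_ratio = (a / c) / (b / d)          # equals (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, se_log_or, lower, upper


# Pooled streptomycin table: a = 13, b = 5, c = 2, d = 12
odds_ratio, se_log_or, lower, upper = odds_ratio_ci(13, 5, 2, 12)
print(f"OR = {odds_ratio:.1f}, SE(log OR) = {se_log_or:.2f}")
print(f"95% CI for OR: {lower:.2f} to {upper:.2f}")
```

The interval is very wide because two of the four cells are small, which makes the variance of the log odds ratio large.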
The analysis so far involved only two variables. What do we do if we have more than two variables? For example, suppose we want to adjust for age and other confounding variables while assessing the association between treatment and outcome (or home ownership and time of delivery).
The technique is called logistic regression
Matched study: Binary Outcome
A questionnaire was administered to 1,319 school children at ages 12 and 14. One question of interest was whether the prevalence of reported symptoms was different at the two ages. The following two-by-two table gives the result.
                          Severe colds at age 14
Severe colds at age 12    Yes    No
Yes                       212    144
No                        256    707
As in the paired t-test example, we want to exploit the fact that the same children were asked at age 12 and then again at age 14.
The concordant pairs are (yes, yes) and (no, no).
The discordant pairs are (yes, no) and (no, yes).
$H_0$: The prevalence of reported symptoms is the same at the two ages
Under the null hypothesis, the proportion of subjects answering (yes, no) should
be the same as the proportion answering (no, yes)
Observed:
$$f_{yn} = 144, \quad f_{ny} = 256$$
Expected:
$$\frac{f_{yn} + f_{ny}}{2} = 200$$
$$\chi^2 = \frac{\left(f_{yn} - \frac{f_{yn} + f_{ny}}{2}\right)^2}{\frac{f_{yn} + f_{ny}}{2}} + \frac{\left(f_{ny} - \frac{f_{yn} + f_{ny}}{2}\right)^2}{\frac{f_{yn} + f_{ny}}{2}} = \frac{(f_{yn} - f_{ny})^2}{f_{yn} + f_{ny}} = 31.4$$
$$df = 1$$
The chi-square value is highly significant. The proportions at the two ages are not the same.
The odds of transition from No to Yes are 256/144 = 1.78
Conditional analysis (conditional on transition)
This is called McNemar's test
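The statistic uses only the discordant pairs, and its two algebraic forms agree. The Python sketch below checks this on the severe-colds table; the function name `mcnemar_statistic` is an illustrative choice.

```python
def mcnemar_statistic(f_yn, f_ny):
    """McNemar chi-square (1 df) from the two discordant counts, in both algebraic forms."""
    expected = (f_yn + f_ny) / 2    # expected count in each discordant cell under H0
    long_form = (f_yn - expected) ** 2 / expected + (f_ny - expected) ** 2 / expected
    short_form = (f_yn - f_ny) ** 2 / (f_yn + f_ny)
    return long_form, short_form


# Discordant pairs from the severe-colds table: (yes, no) = 144, (no, yes) = 256
long_form, short_form = mcnemar_statistic(144, 256)
print(f"chi-square = {short_form:.1f} on 1 df")
print(f"odds of transition from No to Yes = {256 / 144:.2f}")
```

The concordant counts (212 and 707) do not enter the calculation, which is why the analysis is described as conditional on transition.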
Nonparametric Approaches
Most statistical approaches discussed so far assume some distribution for the population (mostly normal).
Approaches such as one- and two-sample t-tests, linear regression, etc. are valid unless the departure from normality is very severe.
Nevertheless, it is useful to have a set of techniques that can be applied without any distributional assumptions.
The sign test discussed earlier in the course is an example of a nonparametric test. However, this procedure can have low power because it uses only the signs and not the magnitudes.
An alternative is to use the magnitudes in some way while still maintaining the nonparametric nature of the tests.
Rank-based procedures are quite popular
Paired Designs
Revisit the husband-wife pair example
Wilcoxon signed rank procedure
Step 1: Rank the absolute values of the differences
Step 2: Take the difference in the sums of the ranks of the positive and negative differences
Pair   Difference (d)   Rank (r)
1       2.3             7
2      -1.1             4.5
3       0.8             3
4       2.4             8
5      -3.1             9
6      -3.2             10
7      -0.6             1.5
8       0.6             1.5
9      -1.1             4.5
10     -1.5             6
$$w = (7 + 3 + 8 + 1.5) - (4.5 + 9 + 10 + 1.5 + 4.5 + 6) = 19.5 - 35.5 = -16$$
$$E(w) = 0; \quad Var(w) = n(n + 1)(2n + 1)/6 = 10 \times 11 \times 21 / 6 = 385$$
$$z = (w - E(w)) / \sqrt{Var(w)} = -16 / \sqrt{385} = -0.82$$
Null hypothesis: The median of the distribution of differences is zero.
All nonparametric procedures formulate hypotheses in terms
of medians rather than means
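The two steps of the signed rank procedure can be sketched in Python as follows. The ranks are recomputed from the ten tabulated differences, with tied absolute values sharing their average rank. The helper name `average_ranks` is an illustrative choice, and no tie correction is applied to the variance, matching the formula above.

```python
import math


def average_ranks(values):
    """Rank values from smallest to largest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1    # average of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


# Differences for the ten pairs, as tabulated
differences = [2.3, -1.1, 0.8, 2.4, -3.1, -3.2, -0.6, 0.6, -1.1, -1.5]

# Step 1: rank the absolute values of the differences
ranks = average_ranks([abs(d) for d in differences])

# Step 2: sum of ranks of positive differences minus sum of ranks of negative ones
positive_sum = sum(r for d, r in zip(differences, ranks) if d > 0)
negative_sum = sum(r for d, r in zip(differences, ranks) if d < 0)
w = positive_sum - negative_sum

n = len(differences)
variance = n * (n + 1) * (2 * n + 1) / 6
z = w / math.sqrt(variance)
print(f"ranks = {ranks}")
print(f"w = {w}, Var(w) = {variance}, z = {z:.2f}")
```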
Two sample nonparametric tests
These are analogs of two-sample t-tests
Mann-Whitney-Wilcoxon test
Sample of size n from population 1
Sample of size m from population 2
Rank the (n+m) units regardless of population
Sum the ranks of the subjects in sample 1 and call it T.
Define U = T - n(n+1)/2
Alternatively, one can sum the ranks of the subjects in sample 2 and then replace n by m
$$z = [U - mn/2] / \sqrt{mn(m + n + 1)/12}$$
Null hypothesis: The distributions in the two populations are the same
Example: The following table gives biceps skinfold measurements for 20 patients with Crohn's disease and 9 patients with coeliac disease. The objective is to assess whether the distributions of the biceps measurements are the same.
Crohn's disease:
1.8, 2.8, 4.2, 6.2, 2.2, 3.2, 4.4, 6.6, 2.4, 3.6,
4.8, 7.0, 2.5, 3.8, 5.6, 10.0, 2.8, 4.0, 6.0, 10.4
Coeliac disease:
1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6
Rank all 29 observations
Value  Rank    Value  Rank    Value  Rank
1.8*   1.5     3.0*   11      5.4*   21
1.8    1.5     3.2    12      5.6    22
2.0*   4       3.6    13      6.0    23
2.0*   4       3.8*   14.5    6.2    24
2.0*   4       3.8    14.5    6.6    25
2.2    6       4.0    16      7.0    26
2.4    7       4.2*   17.5    7.6*   27
2.5    8       4.2    17.5    10.0   28
2.8    9.5     4.4    19      10.6   29
2.8    9.5     4.8    20
Values marked * are from Sample 2 (coeliac disease)
Rank sum = 104.5
$$U = 104.5 - 9 \times 10 / 2 = 59.5$$
$$z = (59.5 - 9 \times 20 / 2) / \sqrt{9 \times 20 \times (9 + 20 + 1)/12} = -1.44$$
Two-sided p-value = 0.15
This is very similar to the result one obtains using the two-sample t-test
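The rank sum, U and z for the skinfold example can be recomputed with the Python sketch below. It uses the measurements as listed for the two groups, averages the ranks of tied values, and gets the two-sided p-value from the normal approximation through `math.erfc`. The helper name `average_ranks` is an illustrative choice, and no tie correction is applied to the variance, as in the formula above.

```python
import math


def average_ranks(values):
    """Rank values from smallest to largest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


crohn = [1.8, 2.8, 4.2, 6.2, 2.2, 3.2, 4.4, 6.6, 2.4, 3.6,
         4.8, 7.0, 2.5, 3.8, 5.6, 10.0, 2.8, 4.0, 6.0, 10.4]
coeliac = [1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6]

m, n = len(coeliac), len(crohn)           # sample 2 (m = 9) and sample 1 (n = 20)
ranks = average_ranks(coeliac + crohn)    # rank all 29 observations together
rank_sum = sum(ranks[:m])                 # sum of the ranks of the coeliac sample
u = rank_sum - m * (m + 1) / 2
z = (u - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)
p_two_sided = math.erfc(abs(z) / math.sqrt(2))   # normal approximation

print(f"rank sum = {rank_sum}, U = {u}, z = {z:.2f}, two-sided p = {p_two_sided:.2f}")
```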
Generalizations
What if you have more than two groups?
Rank all the observations regardless of group and
then perform a one-way analysis of variance of the ranks.
The null hypothesis: The distributions for the various populations defined by the groups are the same.
You can get ranks by using PROC RANK in SAS. See the handout for an example.