Bio Stat 600

8/2/2019 Bio Stat 600


Introduction to Biostatistics

BIOSTATISTICS 600

Instructor: T. E. Raghunathan (Raghu)

E-mail: [email protected]

About the course

To provide an overview of basic concepts in the Design and Analysis of Biostatistical Investigations.

Unify the thought process, as many students take courses under different circumstances at various times.

Kindle imagination for your course work and refresh memory.

Grade

The letter grade will be based on a multiple choice test on the last day of the lecture.

What is biostatistics?

Biostatistics, as a field of science, is concerned with:

The design and conduct of experiments (or studies) to collect observations (or data).

Displaying and analyzing the data to draw inferences about the population.

Duly acknowledging the uncertainty in the stated conclusions while inferring about the population.

R. A. Fisher defined statistics as

Study of populations

Study of variations

Study of methods to reduce the data

Key Concepts

The goal of statistical analysis is to draw inferences or conclusions in an unbiased fashion.

The target population should be clearly defined (that is, the population for which inference is drawn should be clearly stated).

It is important to recognize the variations in the population, and these should be reflected in the inferences.

The same experiment conducted on two different populations may yield different results (systematic component of variation).

Two different conditions or experiments conducted on the same population may yield different results (systematic component of variation).

The experiment replicated under the same conditions on the same population may yield different results (random component of variation).

Always find ways to succinctly describe the results using graphical and numerical summaries.


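Two Displays

Back-to-back stem-and-leaf display (Control leaves on the left, Treatment leaves on the right):

Control | Stem | Treatment
        |  1   | 4466
   5431 |  2   | 0446888
    331 |  3   | 012
        |  4   |
      4 |  5   |

Combined stem-and-leaf display:

1 | 4466
2 | 0134456888
3 | 011233
4 |
5 | 4

Such displays are useful to visually inspect the extent to which distributions overlap as well as the magnitude of the differences.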

Displays for qualitative variables

Distribution of Cause of Death

Circulatory system 137,165

Neoplasms 69,948

Respiratory system 33,223

Injury/Poisoning 6,427

Digestive system 10,779

Nervous system 5,990

Others 30,695

Total 294,227

Dot chart

Bar chart

You can create these and other types of plots using PROC GPLOT in SAS or in Excel. These graphs were produced using a freeware package called R, which can be downloaded from www.r-project.org

We will discuss some more graphical displays after introducing some numerical summaries

Numerical Summaries

Central tendency: Represents a typical or middle value

Spread: Extent of variability across observations

Shape: The structure of the distribution

Central tendency

Mean: Sum all the observations and divide by the number of observations being summed (arithmetic mean).

Mean survival time of guinea pigs in the control group:

(21+23+24+25+31+33+33+54)/8 = 244/8 = 30.5

Median: The number such that 50% of the observations are less than the number and 50% are greater than the number.

Median survival time of guinea pigs in the control group:

21, 23, 24, 25, 31, 33, 33, 54

Technically, any number between 25 and 31. Sometimes the average (25+31)/2 = 28 is used.
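The two summaries above can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative and not part of the course material.

```python
from statistics import mean, median

# Survival times of guinea pigs in the control group (from the slide)
control = [21, 23, 24, 25, 31, 33, 33, 54]

control_mean = mean(control)      # 244 / 8
control_median = median(control)  # average of the two middle values, 25 and 31

print(control_mean, control_median)
```

With an even number of observations, `median` follows the "average of the two middle values" convention mentioned above.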


Histogram

Divide the data into groups by choosing intervals (mutually exclusive, equal width, and exhaustive)

Some rules: n/3 or n1/2log10(n)

Count the observations in each group

Draw a bar chart with area of the bar proportional to the count

If you fix the width of the bar to 1, then the height of the bar is proportional to the count

Shape of the distribution:

Symmetric: Mean, median and mode are the same. Values on either side of the mean are equally likely.

Skewed: Mean is larger or smaller than the median.

Normal distribution

Characterized by the mean and standard deviation (SD):

68% of observations lie within 1 SD of the mean

95% of observations lie within 2 SD of the mean

99.7% of observations lie within 3 SD of the mean

Key Concepts

Population: A collection of units in which we are interested

All people living in the United States

In the study of a treatment for diabetes, all people with diabetes

Blood pressure for a person: All possible measurements of blood pressure in that person

Generally, it is impossible to measure each and every unit in the population (if we could, it would be called a census).

A practical approach: A sample is usually a very small subset of units in the population. The sample is measured and studied to draw conclusions about the population.

The method used to draw the sample is the key step in a biostatistical investigation.

The sample should be representative of the population (probability or random sampling designs assure such unbiased representativeness).

Because we sample from the population, there is uncertainty in the inferences. Statistical analysis expresses these uncertainties in terms of probabilistic statements.

Probability

Meaning of probability is controversial.

Empirical definition: The probability of an event A is the relative frequency with which the event occurs in a long sequence of trials in which A is one of the outcomes.

It only makes sense to talk about probability when the event in question can be thought of as the result of an experiment that could be performed repeatedly.

Tossing a coin or throwing a die

Suppose the median height in the population is 168 cm. Suppose we keep drawing one individual at a time and measuring height. Over the long run, as the sample size gets large, half the people in our sample will have heights below 168 cm.

A random person chosen from this population will have height below 168 cm with probability 1/2.


Probability (contd.)

Subjective interpretation of probability

It is a degree of belief expressing the certainty with which the event is expected to occur.

This broader definition allows probabilistic statements without necessarily contemplating a series of trials.

Anything that is not known to you means that you are uncertain about it. The probability is simply an expression of that uncertainty.

Statistical inference based on the empirical definition of probability is called frequentist or repeated sampling inference.

Statistical inference based on the subjective interpretation of probability is called Bayesian inference.

Fortunately, for large samples the numerical results under both systems of inference are very similar, but the interpretations differ.

Frequentist inference is the focus of this course

Key Concepts

The collection of all possible outcomes of an experiment is called the sample space.

Tossing a coin: S = {H, T}

Study on health insurance: Taking a random sample of n subjects and assessing how many have health insurance: S = {0, 1, 2, ..., n}

An event is a subset of the sample space.

Tossing a coin: E = {H}

Study on health insurance: None have insurance (E = {0})

At least 60% have health insurance:

$$E = \{X \in S \mid X \ge 0.6\,n\}$$

An experiment may involve measuring a continuous variable on an individual.

Sample space: An interval on the real line

Assume $(0, \infty)$ or $(-\infty, \infty)$ with almost zero mass outside the appropriate interval (mathematical convenience)

Example: X = Systolic blood pressure

An event is a subset of the real (or positive real) line

E = {X > 140}

Probability Distribution

Rule for assigning probability to all possible events

Probability mass function: Probability assignment to each individual element of the sample space (discrete sample space)

$$\Pr(X = x) = f(x), \quad x \in S; \qquad \Pr(X = x) = 0, \quad x \notin S$$

Probability density function: Probability assignment to an arbitrarily small interval around each potential value of a continuous variable (continuous sample space)

$$\Pr(X \in dx) = f(x)\,dx$$

Distribution function

$$\Pr(X \le u) = F(u) = \int_a^u f(x)\,dx$$


Rules of Probability

A = Event

$0 \le \Pr(A) \le 1$

$\Pr(A) = 0$: A will not occur in the entire sequence of experiments

$\Pr(A) = 1$: Only A will occur in the entire sequence of experiments

$A^c$ or $\bar{A}$ = Complement of A or Not A

$\Pr(A^c) = 1 - \Pr(A)$

Two events A and B are mutually exclusive when the occurrence of A rules out the occurrence of B in a trial

Pr(A or B) = Pr(A) + Pr(B)

Two events A and B are independent when the occurrence of A has no bearing on the occurrence or non-occurrence of B

Pr(A and B) = Pr(A) * Pr(B)

Example 1: The median height of the population is 168 cm. Two individuals are chosen at random, independently. What is the probability that the height of both is less than 168 cm?

A = First person's height is less than 168 cm

B = Second person's height is less than 168 cm

A and B = Both have height less than 168 cm

Because of independence, Pr(A and B) = Pr(A) * Pr(B) = 1/2 * 1/2 = 1/4

Example 2: Suppose that 10% of the population has height exceeding 180 cm. What is the probability that exactly one person's height exceeds 180 cm?

Two possible scenarios (here A and B denote the heights of the first and second person):

C1: $A \le 180$ and $B > 180$

C2: $A > 180$ and $B \le 180$
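Example 2 can be worked through with the addition and multiplication rules. The sketch below assumes, as in Example 1, that the two individuals are chosen independently, and reads the two scenarios as "only the second person exceeds 180 cm" and "only the first person exceeds 180 cm". Variable names are illustrative.

```python
p_tall = 0.10  # Pr(height > 180 cm), from Example 2

# Within each scenario the two people are independent, so probabilities multiply.
p_c1 = (1 - p_tall) * p_tall   # first person <= 180 cm, second person > 180 cm
p_c2 = p_tall * (1 - p_tall)   # first person > 180 cm, second person <= 180 cm

# The two scenarios are mutually exclusive, so their probabilities add.
p_exactly_one = p_c1 + p_c2

print(round(p_exactly_one, 2))
```

Under these assumptions the probability that exactly one of the two heights exceeds 180 cm is 2 * 0.1 * 0.9.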


Properties of the Binomial distribution

In a typical sample of size n, you may expect nP subjects to have the disease.

If you take several samples of size n from this population and note down the number of diseased subjects, the variance among these numbers will be nP(1-P).

Inferential problem: Given n and x, how do we infer about P?

Intuitively, the estimate of P is x/n. This estimate turns out to be a really good estimate.

How do we decide that it is good? We will see later.

Poisson Distribution

The Poisson distribution is a close cousin of the Binomial distribution.

If P is very small (rare disease) and n is very large, then the probability that in a sample of size n you will find x diseased subjects is

$$\frac{(nP)^x e^{-nP}}{x!} = \frac{\lambda^x e^{-\lambda}}{x!}, \quad \text{where } \lambda = nP$$

$\lambda$ = Expected number of diseased people in a sample of size n

If you take a large number of very large samples and count the number of diseased people in each sample, the variance among these numbers will be approximately $\lambda$.

Inferential problem: Given x, how do we draw inference about $\lambda$?
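To see how close the Poisson approximation is to the Binomial probabilities for a rare disease, the following sketch compares the two. The values n = 10,000 and P = 0.0003 are hypothetical and chosen only for illustration; the function names are not from the course material.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """Pr(X = x) for a Binomial(n, p) count."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Pr(X = x) for a Poisson count with mean lam."""
    return lam**x * exp(-lam) / factorial(x)

n, p = 10_000, 0.0003   # hypothetical: very large sample, rare disease
lam = n * p             # expected number of diseased subjects in the sample

for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, lam), 5))
```

The two columns should agree to several decimal places, which is the sense in which the Poisson is a "close cousin" of the Binomial when P is small and n is large.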

Normal Distribution

A popular model for many continuous variables

It is a symmetric bell-shaped curve characterized by two parameters: mean and standard deviation (SD)

Mean: Center of the distribution (the same as the median and mode)

90% of observations lie between mean-1.64*SD and mean+1.64*SD

95% of observations lie between mean-1.96*SD and mean+1.96*SD

Conditional Probability

How likely is it that an event A will happen given that the event B has occurred?

$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}$$

If A and B are independent then

$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(A)\Pr(B)}{\Pr(B)} = \Pr(A)$$

Inverse Problem (Bayes Rule)

$$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} = \frac{\Pr(A \mid B)\Pr(B)}{\Pr(A)}$$


Measures used in Diagnostic tests

Diagnostic test indicates T+ or T-

True state of the disease: D+ or D-

Properties of diagnostic tests

Sensitivity: Pr(T+|D+)

Specificity: Pr(T-|D-)

Usefulness or value of diagnostic tests

Positive Predictive Value (PPV): Pr(D+|T+)

Negative Predictive Value (NPV): Pr(D-|T-)
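The predictive values follow from sensitivity, specificity and the disease prevalence Pr(D+) through Bayes rule. The sketch below illustrates this; the function name and the numerical inputs (90% sensitivity, 95% specificity, 1% prevalence) are hypothetical and chosen only to show the calculation.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) computed with Bayes rule."""
    # Pr(T+) = Pr(T+|D+) Pr(D+) + Pr(T+|D-) Pr(D-)
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    ppv = sensitivity * prevalence / p_pos              # Pr(D+|T+)
    npv = specificity * (1 - prevalence) / (1 - p_pos)  # Pr(D-|T-)
    return ppv, npv

# Hypothetical test characteristics and prevalence
ppv, npv = predictive_values(sensitivity=0.90, specificity=0.95, prevalence=0.01)
print(round(ppv, 3), round(npv, 3))
```

With a rare disease, even a test with high sensitivity and specificity can have a modest PPV, because most positive results come from the much larger disease-free group.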

Key Concept in statistical inference: Sampling Distribution

How do we judge whether the estimator x/n (the sample proportion) is a good estimator of the population proportion P?

Imagine that you draw several samples, each of size n. Each will give you a different estimate. The variation in the estimates from sample to sample is called the sampling variance. The square root of the sampling variance is called the standard error.

Two important criteria:

You would want the estimates to be the same as the estimand, on average. Such estimates are called unbiased.

The sample-to-sample variation in the estimates should be as small as possible. That is, the standard error should be as small as possible.

The most desirable estimate: An unbiased estimate that has the smallest sampling variance.

In this sense, the sample proportion is the most desirable estimate of the population proportion.

Sample Proportion

The sampling variance of the sample proportion, p, is approximately p(1-p)/n.

The standard error:

$$SE = \sqrt{p(1-p)/n}$$

Instead of using a single value to estimate the population proportion, sometimes it is desired to provide a range of plausible values for the unknown population proportion with a reasonable degree of confidence.

A confidence interval is a summary measure that provides such a set of plausible values. Usually confidence levels are 90%, 95% or even 99%.

An approximate 95% confidence interval for the unknown population proportion is

$$p \pm 1.96 \times SE = p \pm 1.96\sqrt{p(1-p)/n}$$

Confidence intervals

90% confidence intervals: $p \pm 1.64 \times SE$

99% confidence intervals: $p \pm 2.57 \times SE$

Example: In a random sample of 2,837 children in the State of Michigan, 118 said they usually coughed first thing in the morning. What can you infer about the prevalence of this condition in the entire state?

The sample prevalence is 118/2837 = 0.0416, the estimated prevalence rate for the entire state.

The uncertainty in the estimate is

$$SE = \sqrt{0.0416 \times (1 - 0.0416)/2837} = 0.0037$$

95% confidence interval:

$$0.0416 \pm 1.96 \times 0.0037 = (0.034, 0.049)$$

With reasonable confidence one could conclude that the population prevalence rate is between 3.4% and 4.9%.
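The Michigan cough example can be reproduced in a few lines. This is a sketch of the large-sample (normal approximation) interval described above; the function name `proportion_ci` is illustrative.

```python
from math import sqrt

def proportion_ci(x, n, z=1.96):
    """Sample proportion, its standard error, and an approximate confidence interval."""
    p = x / n
    se = sqrt(p * (1 - p) / n)
    return p, se, (p - z * se, p + z * se)

# 118 of 2,837 children reported coughing first thing in the morning
p, se, (low, high) = proportion_ci(118, 2837)
print(round(p, 4), round(se, 4), round(low, 3), round(high, 3))
```

Passing `z=1.64` or `z=2.57` gives the 90% and 99% intervals, respectively.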


Example

The following data were collected in a study of plasma magnesium in diabetic patients. The diabetic subjects were all insulin-dependent subjects attending a diabetic clinic over a 5-month period. The non-diabetic controls were a mixture of blood donors and people attending day centers for the elderly, to give a wide age distribution. Plasma magnesium follows a Normal distribution very closely.

The summary data are as follows:

Number of diabetic subjects=227

Mean plasma magnesium=0.719

Standard deviation=0.068

Number of non-diabetic controls=140

Mean plasma magnesium=0.810

Standard deviation=0.057

Questions of Interest

Calculate an interval which would include 95% of plasma magnesium measurements from the control population. This is called a reference interval. It gives information about the distribution of plasma magnesium in the population.

Given that the distribution of plasma magnesium is normal, the mean and standard deviation completely specify the distribution. Thus we would expect 95% of the observations to lie between 0.810-1.96*0.057 and 0.810+1.96*0.057. That is, between 0.698 and 0.922.

What proportion of diabetic subjects do we expect to lie in the reference interval?

The plasma magnesium level for diabetic subjects is normal with mean 0.719 and standard deviation 0.068. What is the area under this normal curve between 0.698 and 0.922?

$$
\begin{aligned}
\Pr(0.698 \le X \le 0.922) &= \Pr\left(\frac{0.698 - 0.719}{0.068} \le Z \le \frac{0.922 - 0.719}{0.068}\right) \\
&= \Pr(-0.31 \le Z \le 2.99) \\
&= \Pr(Z \le 2.99) - \Pr(Z \le -0.31) \\
&= 0.9986 - 0.3783 \\
&= 0.6203
\end{aligned}
$$

Only about 62% of diabetic patients will lie in the reference interval.
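The reference interval and the share of diabetic subjects falling inside it can also be computed directly. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and treats the sample means and standard deviations as if they were the population values, as the slide does.

```python
from statistics import NormalDist

control = NormalDist(mu=0.810, sigma=0.057)   # non-diabetic controls
diabetic = NormalDist(mu=0.719, sigma=0.068)  # diabetic subjects

z = NormalDist().inv_cdf(0.975)               # about 1.96
low = control.mean - z * control.stdev
high = control.mean + z * control.stdev

# Area under the diabetic curve between the reference limits
share_inside = diabetic.cdf(high) - diabetic.cdf(low)

print(round(low, 3), round(high, 3), round(share_inside, 2))
```

Because the code does not round the limits or the z-values before looking up the areas, the result can differ from the table-based 0.6203 in the third decimal place.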

What are the estimates of the population mean of plasma magnesium for the diabetic and non-diabetic populations?

The estimate of the population mean for diabetic subjects is 0.719 mmol/liter.

The estimate of the population mean for non-diabetic subjects is 0.810 mmol/liter.

What are the standard errors of the population mean estimates?

The sample-to-sample variation (standard error) in the estimated mean for the population of diabetic subjects is 0.0045 and for the control population it is 0.0048.

Diabetic population:

$$n_1 = 227, \quad s_1 = 0.068; \qquad SE_1 = s_1/\sqrt{n_1} = 0.068/\sqrt{227} = 0.0045$$

Non-diabetic population:

$$n_2 = 140, \quad s_2 = 0.057; \qquad SE_2 = s_2/\sqrt{n_2} = 0.057/\sqrt{140} = 0.0048$$


Find a 95% confidence interval for the population mean of the control population.

$$\bar{x}_2 = 0.810, \qquad SE_2 = 0.0048$$

95% confidence interval:

$$(\bar{x}_2 - 1.96 \times SE_2,\ \bar{x}_2 + 1.96 \times SE_2) = (0.810 - 1.96 \times 0.0048,\ 0.810 + 1.96 \times 0.0048) = (0.801, 0.819)$$

How does the confidence interval differ from the 95% reference interval? Why are they different?

Find the standard error of the difference in mean plasma magnesium between the diabetic and non-diabetic populations.

Estimated difference: $\bar{x}_1 - \bar{x}_2 = 0.719 - 0.810 = -0.091$

$$SE(\text{diff}) = \sqrt{SE_1^2 + SE_2^2} = \sqrt{0.0045^2 + 0.0048^2} = 0.0066$$

Find a 95% confidence interval for the difference in the means between the diabetic and non-diabetic populations.

$$(-0.091 - 1.96 \times 0.0066,\ -0.091 + 1.96 \times 0.0066) = (-0.104, -0.078)$$

We are more than 95% confident that the difference in the population means is negative. That is, the mean magnesium level for diabetic subjects is smaller than the mean magnesium level for non-diabetic subjects.

Would plasma magnesium be a good diagnostic test for diabetes?

The method discussed so far can be used to compare two population proportions. Note that a proportion is simply the average of 0s and 1s. The proportion is the mean of a binary variable.

Example: A study was conducted to determine to what extent children with bronchitis in infancy get more respiratory symptoms in later life than others. 273 children who had bronchitis before age 5 (group 1) were compared to 1046 children who did not (group 2). The outcome was whether or not these children coughed during the day or night at age 14.

26 of 273 reported coughing in group 1 and 44 of 1046 reported coughing in group 2.

$$p_1 = 26/273 = 0.095, \qquad p_2 = 44/1046 = 0.042, \qquad p_1 - p_2 = 0.095 - 0.042 = 0.053$$

$$SE(p_1 - p_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} = \sqrt{\frac{0.095 \times (1-0.095)}{273} + \frac{0.042 \times (1-0.042)}{1046}} = 0.0188$$

95% confidence interval:

$$0.053 \pm 1.96 \times 0.0188 = (0.016, 0.090)$$
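The bronchitis example can be reproduced as follows. This is a sketch of the large-sample interval for a difference of two proportions; the function name `two_proportion_ci` is illustrative.

```python
from math import sqrt

def two_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Difference in sample proportions, its standard error, and an approximate CI."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, se, (diff - z * se, diff + z * se)

# Group 1: bronchitis before age 5; group 2: no bronchitis
diff, se, (low, high) = two_proportion_ci(26, 273, 44, 1046)
print(round(diff, 3), round(se, 4), round(low, 3), round(high, 3))
```

The interval excludes 0, which suggests that coughing at age 14 is more common among children who had bronchitis in infancy.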


Adjustments for small samples

When the sample size is large, the central limit theorem applies and the sample mean has an approximately normal distribution regardless of the original distribution of the outcome variable.

When the sample size is small, the distribution of the standardized sample mean (using the estimated standard deviation) is not normal even if the distribution of the outcome is normal.

An adjustment is needed when the sample size is small.

Example: Does increasing the amount of calcium in our diet reduce blood pressure? In a randomized experiment, 10 black men were given a calcium supplement for 12 weeks and 11 black men received a placebo that appeared identical. The experiment was double blind. The outcome was the change in blood pressure over a 12-week period.

Data

Calcium group: n=10, mean=5 and standard deviation=8.743

Placebo group: n=11, mean=-0.273 and standard deviation=5.901

Two situations

Suppose the population standard deviations in the two populations are the same

Pooled standard deviation

$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{(n_1-1)+(n_2-1)} = \frac{9 \times (8.743)^2 + 10 \times (5.901)^2}{9+10} = 54.536$$

$$s_p = \sqrt{54.536} = 7.385$$

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = 7.385\sqrt{1/10 + 1/11} = 3.227$$

95% confidence interval: $5.273 \pm t \times 3.227$

Degrees of freedom = 9 + 10 = 19, t = 2.093

$$(5.273 - 2.093 \times 3.227,\ 5.273 + 2.093 \times 3.227) = (-1.48, 12.03)$$

Given the considerable uncertainty, no difference in the population means between the calcium and placebo groups is plausible. Based on these data, we are confident that the mean difference is between -1.5 and 12.0.
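The pooled-variance interval can be reproduced step by step. In this sketch the t critical value 2.093 (19 degrees of freedom) is taken from the slide rather than computed, so only the Python standard library is needed; variable names are illustrative.

```python
from math import sqrt

n1, mean1, s1 = 10, 5.0, 8.743      # calcium group
n2, mean2, s2 = 11, -0.273, 5.901   # placebo group
t_crit = 2.093                      # t table value for 19 degrees of freedom (from the slide)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = sqrt(pooled_var * (1 / n1 + 1 / n2))

diff = mean1 - mean2
ci = (diff - t_crit * se, diff + t_crit * se)

print(round(pooled_var, 3), round(se, 3), tuple(round(v, 2) for v in ci))
```

The interval contains 0, which is why no difference between calcium and placebo remains plausible.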

What if the two population standard deviations are not the same?

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{(8.743)^2}{10} + \frac{(5.901)^2}{11}} = \sqrt{7.65 + 3.17} = 3.29$$

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{s_2^2}{n_2}\right)^2} = \frac{\left[(8.743)^2/10 + (5.901)^2/11\right]^2}{\left((8.743)^2/10\right)^2/9 + \left((5.901)^2/11\right)^2/10} = \frac{(7.64 + 3.16)^2}{(7.64)^2/9 + (3.16)^2/10} = \frac{116.64}{6.49 + 1.00} = 15.57$$

The t value for 15.57 degrees of freedom is obtained by interpolating between the table values for 15 and 16 degrees of freedom:

$$t = 2.131 + (2.120 - 2.131) \times 0.57 = 2.125$$

95% confidence interval:

$$(5.273 - 2.125 \times 3.29,\ 5.273 + 2.125 \times 3.29) = (-1.72, 12.26)$$
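The unequal-variance standard error, the approximate degrees of freedom, and the interpolated t value can be computed as in the sketch below. The table values 2.131 and 2.120 (15 and 16 degrees of freedom) are taken from the slide; variable names are illustrative.

```python
from math import sqrt

n1, s1 = 10, 8.743   # calcium group
n2, s2 = 11, 5.901   # placebo group

v1, v2 = s1**2 / n1, s2**2 / n2   # squared standard errors of the two means
se = sqrt(v1 + v2)

# Approximate degrees of freedom when the variances are not assumed equal
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Linear interpolation between the t table values for 15 and 16 degrees of freedom
t15, t16 = 2.131, 2.120
t_crit = t15 + (t16 - t15) * (df - 15)

print(round(se, 2), round(df, 2), round(t_crit, 3))
```

Carrying full precision gives degrees of freedom of about 15.6; the slide's 15.57 comes from rounding the intermediate values to two decimals. The critical value is essentially unchanged.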


How can we check whether the population standard deviations are the same?

We will discuss this in the context of hypothesis testing.

It can be argued that the equality of population standard deviations can never be empirically verified, especially if the sample size is small. One should, therefore, always use the procedure which does not assume equality of population standard deviations.

How to make the decision?

General principle: Check whether the data are consistent with the null hypothesis.

We will answer the question by assessing how likely it is that the observed data would have been generated if the null hypothesis were true.

A simple procedure

If the null hypothesis were true, then the differences between the pronethalol and baseline values would be 0 on average, with roughly half of them positive and the rest negative. That is, under the null hypothesis each difference is negative with probability 0.5. But only one negative value has occurred.

The probability of observing 1 negative and 11 positives is

$$\binom{12}{1}\, 0.5^{1} \times 0.5^{11} = 0.00293$$

An even more extreme observation would be 0 negatives and 12 positives, which has the probability

$$\binom{12}{0}\, 0.5^{0} \times 0.5^{12} = 0.00024$$

Test statistic: Number of negative values

P-value: The probability of observing a test statistic that is as extreme as or more extreme than the observed value, if the null hypothesis were true

P-value = 0.00293 + 0.00024 = 0.00317

The above test is called the sign test. We performed a one-sided test because only smaller values of the number of negative values were considered.

If one were to find a large number of negative values, say 11 or 12, then that is also evidence against the null hypothesis.

The probability of obtaining 11 or 12 negatives is also 0.00317 (verify!).

The two-sided p-value is 0.00317+0.00317=0.00634
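The sign test p-values can be verified with exact binomial probabilities. This sketch uses only `math.comb` from the standard library; the function and variable names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Pr(exactly k negatives out of n) when each difference is negative with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, observed_negatives = 12, 1

# One-sided: 1 or fewer negatives
one_sided = sum(binom_pmf(k, n) for k in range(observed_negatives + 1))

# The mirror-image tail: 11 or 12 negatives
upper_tail = sum(binom_pmf(k, n) for k in range(n - observed_negatives, n + 1))

two_sided = one_sided + upper_tail
print(round(one_sided, 5), round(upper_tail, 5), round(two_sided, 5))
```

Because the null distribution is symmetric (probability 0.5 for each sign), the two tails are equal, which is the "verify!" step above.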

How do we define extreme values that constitute evidence against the null hypothesis?


A Tale of Two Errors

Type 1 error: Rejecting the null hypothesis when it is actually true.

Type 2 error: Failure to reject the null hypothesis when it is actually false. (Equivalently, accepting the null hypothesis when it is actually false.)

The chance of making a type 1 error is called the significance level and is denoted by $\alpha$.

The chance of making a type 2 error is denoted by $\beta$.

$1-\beta$ is called power. Power is the chance of rejecting the null hypothesis when the alternative is true.

The objective is to control the chances of making either type of error.

Strategy: For a fixed significance level, we will define the extreme values.

Revisiting the pronethalol example

Suppose we specify that the chance of making a type 1 error is 0.05.

For two-sided alternatives: The extreme values are determined by choosing values c and d so that the probability that the number of negative values is less than or equal to c or greater than or equal to d is 0.05.

For one-sided alternatives: The extreme values are determined by choosing a value c so that the probability that the number of negative values is less than or equal to c is 0.05.

Look at the binomial table on Page T-9 (last column, n=12). If we choose c = 2 and d = 10, the probability of a type 1 error is 0.0193+0.0193=0.0386, where 0.0193 = 0.0002+0.0029+0.0161 is the probability of 0, 1 or 2 negatives. It is not possible to determine c and d to achieve a significance level of exactly 0.05.

For a one-sided hypothesis, choose c=3. (This gives a level slightly larger than the specified significance level of 0.05.)

Usually power is calculated instead of the chance of making a type 2 error.

We need a specific alternative value which is the truth. Suppose that 1/12 (approximately 8%) is indeed the true value. The chance of rejecting the null hypothesis is: 0.3677+0.3837+0.1835+0.0000+0.0000+0.0000=0.9349

Try calculating power for alternatives 0.05, 0.10, 0.20, etc. A power curve is a plot of power against the alternative values.
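The suggested power calculations can be done with exact binomial probabilities. The sketch below assumes the two-sided rejection region "2 or fewer, or 10 or more, negatives out of 12" and uses 0.08 for the alternative discussed above (the tabulated terms correspond to 0.08 rather than exactly 1/12). Function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(exactly k negatives out of n) when each difference is negative with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def sign_test_power(p_negative, n=12, c=2, d=10):
    """Chance of rejecting: number of negatives <= c or >= d, for a given true p_negative."""
    return sum(binom_pmf(k, n, p_negative) for k in range(n + 1) if k <= c or k >= d)

# Points on the power curve for a few alternative values
for p_alt in (0.05, 0.08, 0.10, 0.20):
    print(p_alt, round(sign_test_power(p_alt), 4))
```

Plotting these values against the alternative gives the power curve; power falls as the alternative moves closer to the null value of 0.5.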

The sign test so far has considered only the sign of the difference and not the magnitude of the difference. Let us consider some alternatives.

Suppose that the differences can be assumed to be normally distributed. The estimate of the mean difference in the population is 7.7 and the standard deviation is 15.1.

The standard error is 4.4.

If the null hypothesis is true, then the sample mean should be distributed around 0. The extent to which the observed sample mean is different from 0 is the evidence against the null hypothesis.

One way to measure the distance between the observed sample mean and the null hypothesis value is in terms of standard error units.

$$t = \frac{\bar{x} - 0}{SE(\bar{x})}$$


The calculated value of the t-statistic is 7.7/4.4 = 1.75

What is the probability that one would observe a value of the test statistic this extreme or even more extreme under the null hypothesis?

If the null hypothesis were true, then the statistic has a t-distribution with 11 degrees of freedom.

[Figure: t-distribution with 11 degrees of freedom, with the areas below -1.75 and above 1.75 shaded]

Shaded area: 0.1079 (computed using a computer)

For a fixed significance level, say 0.05, the value of the test statistic considered to be large is 2.201

Two-Sample Tests

Revisit the plasma magnesium example

The following data were collected in a study of plasma magnesium in diabetic patients. The diabetic subjects were all insulin-dependent subjects attending a diabetic clinic over a 5-month period. The non-diabetic controls were a mixture of blood donors and people attending day centers for the elderly, to give a wide age distribution. Plasma magnesium follows a Normal distribution very closely.

The summary data are as follows:

Number of diabetic subjects=227

Mean plasma magnesium=0.719

Standard deviation=0.068

Number of non-diabetic controls=140

Mean plasma magnesium=0.810

Standard deviation=0.057

Frame the question as testing of a statistical hypothesis: Are the means of plasma magnesium in the two populations (diabetic and non-diabetic) the same?

Diabetic: $n_1 = 227$, $\bar{x}_1 = 0.719$, $s_1 = 0.068$

Non-diabetic: $n_2 = 140$, $\bar{x}_2 = 0.810$, $s_2 = 0.057$

Mean for diabetic population: $\mu_1$

Mean for non-diabetic population: $\mu_2$

Null hypothesis: $H_0: \mu_1 = \mu_2$

Alternative hypothesis: $H_A: \mu_1 \ne \mu_2$

$\bar{x}_1 - \bar{x}_2$ is an estimate of $\mu_1 - \mu_2$

If the null hypothesis were true, then $\bar{x}_1 - \bar{x}_2$ should be distributed around mean 0. The extent to which it is away from 0 is evidence against the null hypothesis.

Test statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{-0.091}{0.0066} = -13.78$$

[Figure: sampling distribution of the test statistic, with -13.78 and 13.78 marked]

The sampling distribution is normal given the large sample sizes from each population. If the null hypothesis were true, 68% of the samples should result in a value of the test statistic between -1 and 1, 90% of the samples between -1.64 and 1.64, and 95% of the samples between -1.96 and 1.96. What we have observed is very unlikely under the null hypothesis. Therefore, the null hypothesis is suspect.

Small sample example revisited

Two situations: Population variances are equal or unequal


Example: Does increasing the amount of calcium in our diet reduce blood pressure? In a randomized experiment, 10 black men were given a calcium supplement for 12 weeks and 11 black men received a placebo that appeared identical. The experiment was double blind. The outcome was the change in blood pressure over a 12-week period.

Data

Calcium group: sample size=10, mean=5 and standard deviation=8.743

Placebo group: sample size=11, mean=-0.273 and standard deviation=5.901

Population mean if everybody in the population were given the calcium supplement: $\mu_1$

Population mean if everybody in the population were given only placebo: $\mu_2$

Null hypothesis: $H_0: \mu_1 = \mu_2$

Alternative hypothesis (one sided): $H_A: \mu_1 > \mu_2$

A large positive mean difference $\bar{x}_1 - \bar{x}_2$ is evidence against the null hypothesis in favor of the alternative hypothesis.

Alternative hypothesis (two sided): $H_A: \mu_1 \ne \mu_2$

A large positive or negative mean difference is evidence against the null hypothesis in favor of the alternative hypothesis.

Variances are equal

Pooled standard deviation

$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{(n_1-1)+(n_2-1)} = \frac{9 \times (8.743)^2 + 10 \times (5.901)^2}{9+10} = 54.536$$

$$s_p = \sqrt{54.536} = 7.385$$

Standard error of the difference in the means

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} = 7.385\sqrt{1/10 + 1/11} = 3.227$$

Test statistic

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{5.273}{3.227} = 1.63$$

Degrees of freedom = 9 + 10 = 19

Sampling distribution: t with 19 degrees of freedom

P-value

One-sided alternative: From Table D on page T-11, the shaded area (above 1.63) is between 0.05 and 0.10. Computer software: 0.0598

Two-sided alternative: P-value = 2*0.0598 = 0.1196 (area below -1.63 plus area above 1.63)

Variances are unequal

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{(8.743)^2}{10} + \frac{(5.901)^2}{11}} = \sqrt{7.65 + 3.17} = 3.29$$

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1-1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2-1}\left(\frac{s_2^2}{n_2}\right)^2} = \frac{(7.64 + 3.16)^2}{(7.64)^2/9 + (3.16)^2/10} = \frac{116.64}{6.49 + 1.00} = 15.57$$

Test statistic

$$t = \frac{5.273}{3.29} = 1.603$$

P-value:

One sided: 0.065

Two sided: 0.13


Blood Pressure Example

$n_1 = n_2 = n$, $\sigma = 7.4$

Power (%) by sample size per group (n) and difference in means:

n      2.5     5       7.5
30     25.6    74.4    97.5
40     32.7    85.6    99.4
50     39.3    92.2    99.9
60     45.6    95.9    100
70     51.5    97.9    100
100    66.6    99.8    100
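The tabulated values are consistent with the power of a two-sided 5% level test comparing two means with equal group sizes. The sketch below rests on that reading, which is an assumption: the value 7.4 is taken to be the common standard deviation, the three columns to be differences in means of 2.5, 5 and 7.5, and a normal approximation is used. The function name is illustrative.

```python
from statistics import NormalDist

def two_sample_power(n, delta, sigma=7.4, alpha=0.05):
    """Approximate power of a two-sided z-test with n subjects per group."""
    z = NormalDist()
    se = sigma * (2 / n) ** 0.5          # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    shift = delta / se
    # Probability that the test statistic falls in either rejection tail
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (30, 40, 50, 60, 70, 100):
    print(n, [round(100 * two_sample_power(n, d), 1) for d in (2.5, 5, 7.5)])
```

Small discrepancies in the last digit are to be expected if the original table was produced with a slightly different method.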

Analysis of Variance

Suppose now that we want to compare more than 2 populations.

One could do pair-wise comparisons. This is cumbersome and is not easy to summarize when the number of populations compared is large.

The analysis of variance frames the question in terms of an in-depth investigation of the variations in the observed data.

Analysis of variance basically partitions the overall variability into one or more assignable causes.

What is left unassigned is called residual variability.

Based on the partition of the variability, the relative merits of the assignable causes are investigated.

Generally, the variation due to an assignable cause relative to the residual variability is used as a yardstick for judging the importance of the assignable cause.

The assignable causes can be carefully planned or manipulated through an experimental design.

The assignable causes are based on substantive reasoning in an observational study design.

Example:

A randomized study was conducted to test the generality of the observation that stimulation of the walking and placing reflexes in the newborn promotes increased walking and placing (Zelazo, Zelazo and Kolb (1972, Science, pages 314-315)). A total of 29 one-week-old males were randomized to four groups. 1: Active exercise, 2: Passive exercise, 3: No exercise and 4: 8-week control group. Age at which the infants walked alone (in months) was the outcome variable of interest.

The assignable cause is the level of exercise.

Is the variation caused by this assignable cause substantial?

Data

Active Exercise   Passive Exercise   No Exercise   Control
9.00              11.00              11.50         13.25
9.50              10.00              12.00         11.50
9.75              10.00              9.00          12.00
10.00             11.75              11.50         13.50
13.00             10.50              13.25         11.50
9.50              15.00              13.00

$y_{ij}$ = Observation for subject $j$ in group $i$, with $j = 1, 2, \ldots, n_i$ and $i = 1, 2, \ldots, k$

$\bar{y}_{++}$ = Overall mean

$$\text{Total variation} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{++})^2$$

Example:

$k = 4$; $n_1 = n_2 = n_3 = 6$, $n_4 = 5$; $\bar{y}_{++} = 261/23 = 11.35$

Total variation = 58.47


ANOVA

$$(y_{ij} - \bar{y}_{++}) = (y_{ij} - \bar{y}_{i+}) + (\bar{y}_{i+} - \bar{y}_{++})$$

Overall deviation = Within groups (between subjects nested within groups) + Between groups

$$\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{++})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i+})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n_i}(\bar{y}_{i+} - \bar{y}_{++})^2$$

TotalSS = WithinSS + BetweenSS

58.47 = 43.69 + 14.78

Degrees of freedom: Number of independent statistics used to compute the sum of squares

An alternative expression

$$y_{ij} = \bar{y}_{++} + (\bar{y}_{i+} - \bar{y}_{++}) + (y_{ij} - \bar{y}_{i+}) = \mu + \alpha_i + \varepsilon_{ij}$$

$\mu$: Overall mean

$\alpha_i$: Deviation of the group $i$ mean from the overall mean

$\varepsilon_{ij}$: Residual

$\varepsilon_{ij} = s_{j(i)}$ = Effect of subject $j$ nested within group $i$

Df for Total SS = 22 (every observation is used, but the sum of deviations is zero)

Df for Within SS = 19 (every observation is used, but the sum of deviations within each group is zero)

Df for Between SS = 3 (four means are used, but the sum of deviations from the overall mean is zero)

To compare the sums of squares, differences in the degrees of freedom have to be taken into account.

Mean square = Sum of squares / Degrees of freedom

$$N = \sum_{i=1}^{k} n_i = \text{Total sample size}$$

$$Df(\text{TotalSS}) = N - 1$$

$$Df(\text{BetweenSS}) = k - 1$$

$$Df(\text{WithinSS}) = N - k$$

ANOVA Example

$$MS(\text{Between}) = 14.78 / 3 = 4.93$$

$$MS(\text{Within}) = 43.69 / 19 = 2.30$$

$$F = MS(\text{Between}) / MS(\text{Within}) = 2.14$$

Is 2.14 large?

Use the F-distribution with (numerator df = 3, denominator df = 19) to determine how likely an F of 2.14 or larger is when in actuality there are no differences among the four groups.

P-value: 0.1228
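The sums of squares, mean squares and F ratio above can be checked with a few lines of code. The following is a minimal sketch in plain Python, not part of the original lecture; it assumes the 23 data values belong to the four groups in the order they are listed (6, 6, 6 and 5 observations), and the variable names are illustrative only. It stops at the F ratio; the P-value would still come from an F table or statistical software.

```python
# One-way ANOVA decomposition for the walking-age data (months).
# Assumption: values are assigned to groups in listing order (6, 6, 6, 5).
groups = {
    "active": [9.00, 9.50, 9.75, 10.00, 13.00, 9.50],
    "passive": [11.00, 10.00, 10.00, 11.75, 10.50, 15.00],
    "no_exercise": [11.50, 12.00, 9.00, 11.50, 13.25, 13.00],
    "control": [13.25, 11.50, 12.00, 13.50, 11.50],
}

all_values = [y for ys in groups.values() for y in ys]
N = len(all_values)          # total sample size
k = len(groups)              # number of groups
grand_mean = sum(all_values) / N

# Total SS: deviations of every observation from the overall mean.
total_ss = sum((y - grand_mean) ** 2 for y in all_values)

# Within SS: deviations of every observation from its own group mean.
within_ss = sum(
    sum((y - sum(ys) / len(ys)) ** 2 for y in ys) for ys in groups.values()
)

# Between SS: deviations of group means from the overall mean, weighted by n_i.
between_ss = sum(
    len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in groups.values()
)

ms_between = between_ss / (k - 1)
ms_within = within_ss / (N - k)
f_ratio = ms_between / ms_within

print(f"TotalSS={total_ss:.2f} WithinSS={within_ss:.2f} BetweenSS={between_ss:.2f}")
print(f"F({k - 1}, {N - k}) = {f_ratio:.2f}")
```

Computing the three sums of squares separately, rather than obtaining one by subtraction, gives a built-in check that TotalSS = WithinSS + BetweenSS.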

Regression Analysis

The bulk of scientific investigations are concerned with relationships.

Causal relationship: If one changes the variable X by a certain amount, how much does the variable Y change?

Association or correlational relationship: Do subjects with different values of X also tend to have different values of Y?

What is the nature of these relationships in the population?

How do you quantify these relationships in the population?

How do you estimate the quantities describing these population relationships?

How accurate are those estimates? How much uncertainty is there in assessing these relationships?

The two-sample tests and ANOVA also fit into this category

Are the population means related to the treatments assigned or the observed grouping?

We will later see that the two-sample t-tests and ANOVA F-tests are particular cases of the general regression framework.

Terminology

X = Independent variable.

A variable that an investigator can change in an experiment

Amenable to intervention in an observational study

Simply the variable whose impact is to be assessed.

It is possible that there can be more than one independent variable of interest

Other names: Predictors, correlates, right-hand-side variables, exogenous variables

Y = Dependent variable of interest.

Variable for which you want to assess the effect of X.

Other names: Outcome, endogenous variables, left-hand-side variables

The impact of different values of X on differences in Y, expressed in some meaningful terms, is of interest

Example

The following table gives data collected by a group of medical students in a physiology class. The objective is to assess the association between height and FEV1.

Height   FEV1
164.0    3.54
167.0    3.54
170.4    3.19
171.2    2.85
171.2    3.42
171.3    3.20
172.0    3.60
172.0    3.78
174.0    4.32
176.0    3.75
177.0    3.09
177.0    4.05
177.0    5.43
177.4    3.60
178.0    2.98
180.7    4.80
181.0    3.96
183.1    4.78
183.6    4.56
183.7    4.68

Scatter plot

A graphical device to assess the type of relationship.

Each point is a pair (X, Y)

Dependent variable on the vertical axis

Independent variable on the horizontal axis

Inspection of the graph suggests a linear relationship

Linear relationship

Representation

$$(y_1 - y_2) \propto (x_1 - x_2)$$

$$y = a + b x$$

Clearly not every observation (possibly none) will satisfy this equation:

$$y_i = a + b x_i + e_i$$

where $a + b x_i$ is the line-value or the expected value and $e_i$ is the residual.

How to determine a and b?

Method of Least Squares: Find a and b that minimize the residual sum of squares:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

$$a = \bar{y} - b \bar{x}$$

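Simplified formulas

Slope

$$b = \frac{\sum_i x_i y_i - \sum_i x_i \sum_i y_i / n}{\sum_i x_i^2 - \left(\sum_i x_i\right)^2 / n}$$

Intercept

$$a = \sum_i y_i / n - b \sum_i x_i / n$$

Needed quantities

$$\sum_i x_i, \quad \sum_i y_i, \quad \sum_i x_i y_i, \quad \sum_i x_i^2$$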

Example: y = FEV1, x = Height

$$\sum_i x_i = 3{,}507.6, \quad \sum_i y_i = 77.12$$

$$\sum_i x_i y_i = 13{,}568.18, \quad \sum_i x_i^2 = 615{,}739.24$$

$$b = \frac{13{,}568.18 - 3{,}507.6 \times 77.12 / 20}{615{,}739.24 - (3{,}507.6)^2 / 20} = 0.074389$$

$$a = 77.12 / 20 - 0.074389 \times 3{,}507.6 / 20 = -9.19$$

Prediction equation

$$\widehat{FEV1} = -9.19 + 0.0744 \times Height$$
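The slope and intercept for the height and FEV1 example can be reproduced directly from the 20 data pairs using the simplified formulas. This is a minimal sketch in plain Python under the assumption that the table rows are paired as listed; the variable names are illustrative. Note that the intercept comes out negative.

```python
# Least-squares line for FEV1 (y) on height (x) using the simplified formulas.
heights = [164.0, 167.0, 170.4, 171.2, 171.2, 171.3, 172.0, 172.0, 174.0, 176.0,
           177.0, 177.0, 177.0, 177.4, 178.0, 180.7, 181.0, 183.1, 183.6, 183.7]
fev1 = [3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
        3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68]

n = len(heights)

# The four "needed quantities".
sum_x = sum(heights)
sum_y = sum(fev1)
sum_xy = sum(x * y for x, y in zip(heights, fev1))
sum_x2 = sum(x * x for x in heights)

# Slope and intercept.
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
a = sum_y / n - b * sum_x / n

print(f"sum_x={sum_x:.1f} sum_y={sum_y:.2f} sum_xy={sum_xy:.2f} sum_x2={sum_x2:.2f}")
print(f"b={b:.6f} a={a:.2f}")
```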

Interpretation

Slope

$b$ = Expected difference in $y$ for a unit positive difference in $x$

Two individuals

Individual 1: $x = h$

Individual 2: $x = h + 1$

Expected or line-value for individual 1: $a + b h$

Expected or line-value for individual 2: $a + b (h + 1)$

Difference = $b$

Interpretation (Contd.)

Intercept

Expected value of y when x = 0.

It is not very interpretable in this particular problem: the value of FEV1 when Height is 0!

Modification: Centering

$$y = c + d (x - \bar{x})$$

$d = b$

$c$ = Expected value of $y$ for average height

Residual

Residuals from the estimated line:

$$e_i = y_i - a - b x_i$$

Residuals represent deviations from the expected value. Large residuals reflect unreliability or uncertainty.

One way to measure this uncertainty is through the variance of the residuals (or the standard deviation of the residuals).

$$s_e^2 = \frac{\sum_i e_i^2}{n - 2}$$

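Computational formulas

$$s_e^2 = \frac{\sum_i e_i^2}{n - 2} = \frac{\sum_i (y_i - \bar{y})^2 - b^2 \sum_i (x_i - \bar{x})^2}{n - 2} = \frac{(n - 1)(s_y^2 - b^2 s_x^2)}{n - 2}$$

$s_y$ = SD of $y$'s

$s_x$ = SD of $x$'s

$$b = \frac{s_{xy}}{s_x^2}$$

$$s_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \quad \text{(Covariance)}$$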

Example

$$s_x = 5.51, \quad s_y = 0.71$$

$$s_e^2 = \frac{19 \, (0.71^2 - 0.0744^2 \times 5.51^2)}{18} = 0.35$$

Correlation Coefficient: Another measure of strength of linear relationship

Measure of linear association between x and y.

$$r = \frac{s_{xy}}{s_x s_y} = \frac{2.26}{5.51 \times 0.71} = 0.58$$

How useful is x in predicting y?

Total variance = $s_y^2 = 0.71^2 = 0.504$ (residual variance from a horizontal line)

Degrees of freedom = $n - 1 = 19$

Residual variance = $s_e^2 = 0.35$ (residual variance from the regression line on x)

Degrees of freedom = $n - 2 = 18$

$$R^2 = \frac{(n - 1) s_y^2 - (n - 2) s_e^2}{(n - 1) s_y^2} = \frac{19 \times 0.50 - 18 \times 0.35}{19 \times 0.50} = 0.34$$

= 34% (in percentage)

Another Form of $R^2$

$$R^2 = \frac{\text{Variance}(\hat{y}\text{'s})}{\text{Variance}(y\text{'s})} = \frac{s_{\hat{y}}^2}{s_y^2}$$

R-square is a simple measure to assess how much variability in y is explained by the variation in x.

Large values of R-square indicate that substantial variation in y is due to variation in x. Small values of R-square indicate the opposite.

This measure also has disadvantages, and we will discuss those when we consider multiple predictors.
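The residual variance, the correlation coefficient and R-square for the height and FEV1 data can be computed together, which also shows that R-square equals the square of r in simple linear regression. This is a minimal sketch in plain Python; the variable names are illustrative, and the results are unrounded, so they may differ in the last digit from values built up from rounded intermediate quantities.

```python
import math

heights = [164.0, 167.0, 170.4, 171.2, 171.2, 171.3, 172.0, 172.0, 174.0, 176.0,
           177.0, 177.0, 177.0, 177.4, 178.0, 180.7, 181.0, 183.1, 183.6, 183.7]
fev1 = [3.54, 3.54, 3.19, 2.85, 3.42, 3.20, 3.60, 3.78, 4.32, 3.75,
        3.09, 4.05, 5.43, 3.60, 2.98, 4.80, 3.96, 4.78, 4.56, 4.68]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(fev1) / n

# Corrected sums of squares and cross-products.
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_yy = sum((y - y_bar) ** 2 for y in fev1)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, fev1))

# Fitted line and residuals e_i = y_i - a - b x_i.
b = s_xy / s_xx
a = y_bar - b * x_bar
residuals = [y - (a + b * x) for x, y in zip(heights, fev1)]

s_e2 = sum(e * e for e in residuals) / (n - 2)   # residual variance
s_y2 = s_yy / (n - 1)                            # total variance of y
r = s_xy / math.sqrt(s_xx * s_yy)                # correlation coefficient
r_squared = ((n - 1) * s_y2 - (n - 2) * s_e2) / ((n - 1) * s_y2)

print(f"s_e^2={s_e2:.2f} r={r:.2f} R^2={r_squared:.2f}")
```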

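Inference

How much do the slope and intercept estimates vary from sample to sample?

Standard error of the estimates

$$SE(\hat{b}) = \sqrt{\frac{s_e^2}{(n - 1) s_x^2}}$$

95% confidence interval

$$\hat{b} \pm t_{0.025,\, n-2} \, SE(\hat{b})$$

Here $\hat{b}$ denotes the slope estimated from the sample and $b$ the population slope.

Test the hypothesis $H_0$: $b$ is equal to 0 versus $H_A$: $b$ is not equal to zero

$$t = \frac{\hat{b} - b}{SE(\hat{b})}, \qquad \text{under the null} \quad t = \frac{\hat{b}}{SE(\hat{b})}$$

Estimated Line Value

Suppose x = f and it is not one of the observed values in the data set.

What would one expect y to be on the average?

$$\hat{y}_f = a + b f$$

$$SE(\hat{y}_f) = \sqrt{s_e^2 \left[ \frac{1}{n} + \frac{(f - \bar{x})^2}{(n - 1) s_x^2} \right]}$$

Example:

$$f = 175$$

$$\hat{y}_f = -9.19 + 0.0744 \times 175 = 3.83$$

$$SE(\hat{y}_f) = \sqrt{0.35 \left[ \frac{1}{20} + \frac{(175 - 175.38)^2}{19 \times 5.51^2} \right]} = 0.133$$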
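The standard error of the slope, its 95% confidence interval, the t statistic and the standard error of the estimated line value at a height of 175 can all be computed from the rounded summary statistics of the FEV1 example. This is a minimal sketch in plain Python with two assumptions: the critical value 2.101 for the t distribution with 18 degrees of freedom is taken from a standard t table rather than from the lecture, and the intercept is entered as -9.19.

```python
import math

# Rounded summary statistics from the FEV1 example.
n = 20
a, b = -9.19, 0.0744       # intercept (negative) and slope
s_e2 = 0.35                # residual variance
s_x = 5.51                 # SD of heights
x_bar = 175.38             # mean height, 3507.6 / 20
t_crit = 2.101             # assumption: t_{0.025, 18} from a standard t table

sxx = (n - 1) * s_x ** 2   # sum of squared deviations of x

# Slope: standard error, 95% confidence interval and t statistic under H0.
se_b = math.sqrt(s_e2 / sxx)
ci_b = (b - t_crit * se_b, b + t_crit * se_b)
t_stat = b / se_b

# Estimated line value at x = f and its standard error.
f = 175.0
y_hat_f = a + b * f
se_y_hat_f = math.sqrt(s_e2 * (1 / n + (f - x_bar) ** 2 / sxx))

print(f"SE(b)={se_b:.4f} CI=({ci_b[0]:.3f}, {ci_b[1]:.3f}) t={t_stat:.2f}")
print(f"y_hat({f:.0f})={y_hat_f:.2f} SE={se_y_hat_f:.3f}")
```

Because the confidence interval for the slope excludes zero, the t statistic exceeds the critical value, which is the same conclusion reached two ways.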

Prediction Interval

This refers to a confidence interval for a single observation on the outcome variable for a given value of the independent variable x = f.

$$\hat{y}_f = a + b f$$

$$\text{Prediction } SE(\hat{y}_f) = PSE = \sqrt{s_e^2 \left[ 1 + \frac{1}{n} + \frac{(f - \bar{x})^2}{(n - 1) s_x^2} \right]}$$

$$\hat{y}_f \pm t_{1 - \alpha/2,\, n-2} \, PSE$$
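As an illustration, the prediction interval at a height of 175 can be computed from the same rounded summary statistics used in the FEV1 example. This is a minimal sketch in plain Python; the critical value 2.101 for 18 degrees of freedom is an assumption taken from a standard t table, and the interval itself is not a result quoted in the lecture.

```python
import math

n = 20
a, b = -9.19, 0.0744       # fitted intercept (negative) and slope
s_e2 = 0.35                # residual variance
s_x = 5.51                 # SD of heights
x_bar = 175.38             # mean height
t_crit = 2.101             # assumption: t_{0.975, 18} from a standard t table

f = 175.0
y_hat_f = a + b * f

# The extra "1 +" term accounts for the variability of a single new observation.
pse = math.sqrt(s_e2 * (1 + 1 / n + (f - x_bar) ** 2 / ((n - 1) * s_x ** 2)))
lower, upper = y_hat_f - t_crit * pse, y_hat_f + t_crit * pse

print(f"PSE={pse:.3f} 95% prediction interval=({lower:.2f}, {upper:.2f})")
```

The prediction standard error is several times larger than the standard error of the estimated line value (0.133), because most of the uncertainty about one individual comes from the residual variance rather than from estimating the line.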

Discrete Predictors

In the example considered so far, the independent variable was a continuous variable.

Suppose now the independent variable is binary, coded as x = 0 or 1.

Interpretation of regression coefficients:

$$E(y \mid x) = a + b x$$

$$E(y \mid x = 0) = a$$

$$E(y \mid x = 1) = a + b$$

$$b = E(y \mid x = 1) - E(y \mid x = 0)$$

a = Mean for the reference group, defined as subjects with x = 0

b = Difference in the mean between the two groups, x = 1 versus x = 0

The test for significance of b is identical to the two-sample t-test
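The interpretation of a and b for a binary predictor can be verified numerically. The sketch below, in plain Python, reuses two groups from the walking-age data as an illustration chosen here (it is not an analysis from the lecture): the 8-week control group is coded x = 0 and the active exercise group x = 1, and the least-squares formulas are applied unchanged.

```python
# Regression on a 0/1 predictor reproduces the group means.
control = [13.25, 11.50, 12.00, 13.50, 11.50]        # x = 0, reference group
active = [9.00, 9.50, 9.75, 10.00, 13.00, 9.50]      # x = 1

x = [0] * len(control) + [1] * len(active)
y = control + active
n = len(y)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least-squares slope and intercept.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

mean_control = sum(control) / len(control)
mean_active = sum(active) / len(active)

print(f"a={a:.3f} (control mean {mean_control:.3f})")
print(f"b={b:.3f} (difference in means {mean_active - mean_control:.3f})")
```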

Multiple Predictors

Often in practice several variables might influence the dependent variable.

Some common examples

[Diagram: X and Y, each related to C]

Part of the X, Y relationship is due to a common relationship with C

$$E(Y \mid X) = a + b X$$

b ignores the influence of C on X and Y

$$E(Y \mid X, C) = c + d X + e C$$

$$E(Y \mid X = x + 1, C = f) - E(Y \mid X = x, C = f) = [c + d(x + 1) + e f] - [c + d x + e f] = d$$

d = Difference in the expected values of Y associated with one positive unit difference in X, holding C constant.

b = Unadjusted estimate

d = Estimate adjusted for C; usually d will be smaller than b (but it can be larger than b)

C = Confounding variable

[Diagram: X acts on I, and I acts on Y]

$$E(Y \mid X) = a + b X$$

b = representing what is actually the effect of the variable I

I: Intervening variable

$$E(Y \mid X, I) = c + d X + e I, \qquad d = 0$$

[Diagram: X acts on I and I acts on Y, with X also acting directly on Y]

X may act through I as well as act independently on Y

The statistical effect of I or C on the regression coefficient of X will be the same. The conceptual understanding has to distinguish between confounding and intervening variables.

[Diagram: X leads to outcome Y0 when M=0 and to outcome Y1 when M=1]

The effect of X depends on M. The effect of X is modified by the presence or absence of M.

$$E(Y \mid X, M) = a + b X + c M + d X \times M$$

$$E(Y \mid X, M = 0) = a + b X$$

$$E(Y \mid X, M = 1) = (a + c) + (b + d) X$$

d = The extent to which the effect of X is modified by the presence of M (that is, M=1)

d=0, c arbitrary: Parallel lines for the two groups M=1 and M=0

d=0, c=0: Coincident lines

c=0, d arbitrary: Same intercept; the lines for M=0 and M=1 fan out

[Figure: four panels showing the lines for M=0 and M=1, labeled "d=0, c arbitrary", "d, c arbitrary", "d=0, c=0" and "c=0, d arbitrary"]

Analysis of Cross-classified Data

So far we have concentrated on analyzing relationships between a continuous dependent variable and continuous or discrete (or categorical) independent variables.

Regression

ANOVA

Many times the dependent and independent variables are both discrete.

Qualitative categories (such as gender, race/ethnicity, geographical location, type of health insurance)

Quantitative or ordered categories

low, medium and high socioeconomic status

none, very low, low, medium, high doses of environmental exposure

Other combinations are also possible

Discrete dependent (Yes/No for a disease), continuous independent (age, BMI, etc.)

Logistic regression

Number of events as dependent (number of seizures among epileptic patients over a fixed or variable period of time)

Poisson regression

Continuous dependent but truncated (or censored). For example, failure time, time to death, time to symptoms. These may be known for some individuals, and for others it is only known to exceed some known value.

Survival analysis

The chi-square statistic is one of the distance measures:

$$T = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

$r$ = Number of rows

$c$ = Number of columns

$O_{ij}$ = Observed frequency in row $i$ and column $j$

$E_{ij}$ = Expected frequency in row $i$ and column $j$

T has a chi-square distribution with df = (r-1)(c-1) degrees of freedom.

$$T = \frac{(50 - 61.7)^2}{61.7} + \frac{(849 - 837.3)^2}{837.3} + \frac{(29 - 17.7)^2}{17.7} + \cdots + \frac{(3 - 2.7)^2}{2.7} + \frac{(36 - 36.3)^2}{36.3} = 10.5$$

r = 5, c = 2; df = 4.

The critical value at the 0.05 significance level is 9.49. The data are not consistent with the hypothesis of no association between housing tenure and time of delivery. That is, there is good evidence of an association between housing tenure and time of delivery.

The chi-square statistic is not a measure of association. If we double the frequencies in each cell, the association will remain unchanged but chi-square will double.

Chi-square is a large-sample test and is questionable if any expected frequency is less than 5. Alternatives are

Yates correction

Fisher's exact test

Yates Correction

$$T_Y = \sum \frac{(|O - E| - 0.5)^2}{E}$$

Fisher's exact test. It is based on computing the probability of observing a particular contingency table or tables that are more inconsistent with the hypothesis of no association. It is a complicated algorithm and usually is performed using a computer.

Example: The following table is from a trial investigating the efficacy of streptomycin for the treatment of pulmonary tuberculosis. The data are for the subgroup of patients with an initial temperature of 100-100.9°F. The two variables are radiological assessment of the disease 6 months later and treatment.

Radiological assessment Streptomycin Control

Improvement 13 5

Deterioration 2 7

Death 0 5
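The streptomycin table makes a convenient test case for the chi-square statistic and its caveats. The sketch below, in plain Python, is an illustration added here rather than a calculation from the lecture: it computes the expected frequencies and T for the 3 by 2 table, shows that doubling every cell doubles T, and flags the small expected frequencies that make the large-sample test questionable. The function name `chi_square` is hypothetical.

```python
# Rows: improvement, deterioration, death. Columns: streptomycin, control.
observed = [[13, 5], [2, 7], [0, 5]]


def chi_square(table):
    """Return (T, df, expected) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under no association: row total * column total / N.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected


t_stat, df, expected = chi_square(observed)

# Doubling every frequency leaves the association unchanged but doubles T.
doubled = [[2 * o for o in row] for row in observed]
t_doubled, _, _ = chi_square(doubled)

smallest_expected = min(min(row) for row in expected)

print(f"T={t_stat:.2f} df={df} T(doubled)={t_doubled:.2f}")
print(f"smallest expected frequency={smallest_expected:.2f}")
```

Several expected frequencies fall below 5 here, which is the situation in which pooling categories or using an exact test is preferred.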

Pooled table

Radiological assessment   Streptomycin   Control
Improvement               13 (a)         5 (b)
Deterioration or Death    2 (c)          12 (d)

Odds of improvement in the streptomycin group = 13/2 = a/c

Odds of improvement in the control group = 5/12 = b/d

Odds ratio = (13/2)/(5/12) = 15.6 = (ad)/(bc)

Log(OR) = 2.75

var(log(OR)) = 1/a + 1/b + 1/c + 1/d = 1/13 + 1/5 + 1/2 + 1/12 = 0.86

SE(log(OR)) = 0.93

Confidence interval:

(exp[Log(OR) - z*SE(log(OR))], exp[Log(OR) + z*SE(log(OR))])

95% confidence interval for log-odds-ratio:

(2.75 - 1.96*0.93, 2.75 + 1.96*0.93) = (0.93, 4.57)

95% confidence interval for odds-ratio:

(2.53, 96.54)
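The odds ratio and its confidence interval for the pooled table can be reproduced in a few lines. This is a minimal sketch in plain Python using the cell labels a, b, c, d from the pooled table; because it carries unrounded intermediate values, the upper limit comes out close to, but not exactly equal to, the figure obtained from the rounded log odds ratio and standard error.

```python
import math

# Pooled 2 x 2 table: rows = improvement / deterioration or death,
# columns = streptomycin / control.
a, b = 13, 5
c, d = 2, 12

odds_strep = a / c
odds_control = b / d
odds_ratio = odds_strep / odds_control      # equals (a * d) / (b * c)

log_or = math.log(odds_ratio)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

z = 1.96                                    # 95% normal critical value
ci_log = (log_or - z * se_log_or, log_or + z * se_log_or)
ci_or = (math.exp(ci_log[0]), math.exp(ci_log[1]))

print(f"OR={odds_ratio:.1f} log(OR)={log_or:.2f} SE={se_log_or:.2f}")
print(f"95% CI for log(OR)=({ci_log[0]:.2f}, {ci_log[1]:.2f})")
print(f"95% CI for OR=({ci_or[0]:.2f}, {ci_or[1]:.2f})")
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around 15.6.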

The analysis so far involved only two variables. What to do if we have more than two variables? For example, suppose we want to adjust for age and other confounding variables while assessing the association between treatment and outcome (or home ownership and time of delivery).

The technique is called logistic regression

Matched study: Binary Outcome

A questionnaire was administered to 1,319 school children at ages 12 and 14. One question of interest was whether the prevalence of reported symptoms was different at the two ages. The following two-by-two table gives the result.

Severe colds at age 12    Severe colds at age 14
                          Yes     No
Yes                       212     144
No                        256     707

As in the paired t-test example, we want to exploit the fact that the same children were asked at age 12 and then again when they were 14.

The concordant pairs are (yes, yes) and (no, no).

The discordant pairs are (yes, no) and (no, yes).

$H_0$: The prevalence of reported symptoms is the same at the two ages

Under the null hypothesis, the proportion of subjects answering (yes, no) should be the same as the proportion of subjects answering (no, yes).

Observed: $f_{yn} = 144$, $f_{ny} = 256$

Expected: $(f_{yn} + f_{ny})/2 = 200$ in each discordant cell

$$\chi^2 = \frac{\left(f_{yn} - \frac{f_{yn} + f_{ny}}{2}\right)^2}{\frac{f_{yn} + f_{ny}}{2}} + \frac{\left(f_{ny} - \frac{f_{yn} + f_{ny}}{2}\right)^2}{\frac{f_{yn} + f_{ny}}{2}} = \frac{(f_{yn} - f_{ny})^2}{f_{yn} + f_{ny}} = 31.4$$

$$df = 1$$

The chi-square value is highly significant. The proportions at the two ages are not the same.

The odds of transition from No to Yes are 256/144 = 1.78

Conditional analysis (conditional on transition)

This is called McNemar's test
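The McNemar calculation uses only the two discordant cells, and the long and short forms of the statistic agree. This is a minimal sketch in plain Python with illustrative variable names.

```python
# Discordant pairs from the severe-colds table.
f_yn = 144   # yes at age 12, no at age 14
f_ny = 256   # no at age 12, yes at age 14

# Under H0 the discordant pairs split evenly between the two cells.
expected = (f_yn + f_ny) / 2

# Long form: sum of (observed - expected)^2 / expected over the two cells.
chi2_long = (f_yn - expected) ** 2 / expected + (f_ny - expected) ** 2 / expected

# Short form: (f_yn - f_ny)^2 / (f_yn + f_ny).
chi2_short = (f_yn - f_ny) ** 2 / (f_yn + f_ny)

odds_no_to_yes = f_ny / f_yn

print(f"expected={expected:.0f} chi-square={chi2_long:.2f} (df=1)")
print(f"odds of transition from No to Yes={odds_no_to_yes:.2f}")
```

The concordant counts (212 and 707) never enter the calculation, which is what "conditional on transition" means in practice.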

Nonparametric Approaches

Most statistical approaches discussed so far assume some distribution for the population (mostly normal).

The approaches such as one- and two-sample t-tests, linear regression, etc. are valid unless the departure from normality is very severe.

Nevertheless, it will be useful to have a set of techniques that can be applied without any distributional assumptions.

The sign test discussed earlier in the course is an example of a nonparametric test. However, this procedure can have low power because it uses only the signs and not the magnitudes.

An alternative is to use the magnitudes in some way while still maintaining the nonparametric nature of the tests.

Rank-based procedures are quite popular

Paired Designs

Revisit the husband-wife pair example

Wilcoxon signed rank procedure

Step 1: Rank the absolute values of the differences

Step 2: Take the difference in the sums of the ranks of the positive and negative differences

Pair   Difference (d)   Rank (r)
1       2.3             7
2      -1.1             4.5
3       0.8             3
4       2.4             8
5      -3.1             9
6      -3.2             10
7      -0.6             1.5
8       0.6             1.5
9      -1.1             4.5
10     -1.5             6

$$w = (7 + 3 + 8 + 1.5) - (4.5 + 9 + 10 + 1.5 + 4.5 + 6) = 19.5 - 35.5 = -16$$

$$E(w) = 0; \quad Var(w) = n(n + 1)(2n + 1)/6 = 10 \times 11 \times 21 / 6 = 385$$

$$z = (w - E(w)) / \sqrt{Var(w)} = -16 / \sqrt{385} = -0.82$$

Null hypothesis: The median of the distribution of differences is zero.

All nonparametric procedures formulate hypotheses in terms of medians rather than means
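The two steps of the signed rank procedure can be sketched in code using the ten tabulated differences. This is a minimal illustration in plain Python; the helper `average_ranks` is a hypothetical name introduced here, and it assigns tied values the average of the ranks they occupy, as in the Rank column of the table.

```python
import math


def average_ranks(values):
    """Rank values from smallest to largest; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1      # average of 1-based positions pos..end
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


differences = [2.3, -1.1, 0.8, 2.4, -3.1, -3.2, -0.6, 0.6, -1.1, -1.5]

# Step 1: rank the absolute values of the differences.
ranks = average_ranks([abs(d) for d in differences])

# Step 2: rank sum of positive differences minus rank sum of negative ones.
positive_sum = sum(r for d, r in zip(differences, ranks) if d > 0)
negative_sum = sum(r for d, r in zip(differences, ranks) if d < 0)
w = positive_sum - negative_sum

n = len(differences)
var_w = n * (n + 1) * (2 * n + 1) / 6
z = w / math.sqrt(var_w)               # E(w) = 0 under the null hypothesis

print(f"ranks={ranks}")
print(f"w={w} Var(w)={var_w} z={z:.2f}")
```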

Two sample nonparametric tests

These are analogs of two-sample t-tests

Mann-Whitney-Wilcoxon test

Sample of size n from population 1

Sample of size m from population 2

Rank the (n+m) units regardless of the populations

Sum the ranks of subjects in sample 1 and call it T.

Define U = T - n(n+1)/2

Alternatively, one can sum the ranks of subjects in sample 2 and then replace n by m

$$z = \frac{U - mn/2}{\sqrt{mn(m + n + 1)/12}}$$

Null hypothesis: The distributions in the two populations are the same

Example: The following table gives biceps skinfold measurements for 20 patients with Crohn's disease and 9 patients with coeliac disease. The objective is to assess whether the distributions of the biceps measurements are the same.

Crohn's disease: 1.8, 2.8, 4.2, 6.2, 2.2, 3.2, 4.4, 6.6, 2.4, 3.6, 4.8, 7.0, 2.5, 3.8, 5.6, 10.0, 2.8, 4.0, 6.0, 10.4

Coeliac disease: 1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6

Rank all 29 observations (values marked * are from Sample 2, coeliac disease)

Value   Rank     Value   Rank     Value   Rank
1.8*    1.5      3.0*    11       5.4*    21
1.8     1.5      3.2     12       5.6     22
2.0*    4        3.6     13       6.0     23
2.0*    4        3.8*    14.5     6.2     24
2.0*    4        3.8     14.5     6.6     25
2.2     6        4.0     16       7.0     26
2.4     7        4.2*    17.5     7.6*    27
2.5     8        4.2     17.5     10.0    28
2.8     9.5      4.4     19       10.4    29
2.8     9.5      4.8     20

Rank sum for Sample 2 = 104.5

$$U = 104.5 - 9 \times 10 / 2 = 59.5$$

$$z = (59.5 - 9 \times 20 / 2) / \sqrt{9 \times 20 \times (9 + 20 + 1)/12} = -1.44$$

Two-sided p-value = 0.15

This is very similar to the result one obtains using the two-sample t-test
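The rank sum, U and z for the skinfold example can be reproduced from the two data lists. This is a minimal sketch in plain Python; `average_ranks` is a hypothetical helper that gives tied values their average rank, and the two-sided p-value uses the normal approximation through `math.erfc` from the standard library.

```python
import math


def average_ranks(values):
    """Rank values from smallest to largest; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


crohns = [1.8, 2.8, 4.2, 6.2, 2.2, 3.2, 4.4, 6.6, 2.4, 3.6,
          4.8, 7.0, 2.5, 3.8, 5.6, 10.0, 2.8, 4.0, 6.0, 10.4]
coeliac = [1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 7.6]

m, n = len(coeliac), len(crohns)       # m = 9 (sample 2), n = 20 (sample 1)

# Rank all 29 observations regardless of group; coeliac values come first.
ranks = average_ranks(coeliac + crohns)

t_sum = sum(ranks[:m])                 # rank sum for the coeliac sample
u = t_sum - m * (m + 1) / 2
z = (u - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)

# Two-sided p-value from the standard normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"T={t_sum} U={u} z={z:.2f} p={p_two_sided:.2f}")
```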

Generalizations

What if you have more than two groups?

Rank all the observations regardless of group and then perform the one-way analysis of variance of the ranks.

The null hypothesis: The distributions for the various populations defined by the groups are the same.

You can get ranks by using PROC RANK in SAS. See the handout for an example
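For readers without SAS, the rank-then-ANOVA recipe can be sketched in plain Python. This is a stand-in for PROC RANK followed by a one-way ANOVA, not the SAS procedure itself; it reuses the four-group walking-age data purely as an illustration, assumes the values belong to the groups in listing order, and uses a hypothetical helper `average_ranks` for tied values.

```python
def average_ranks(values):
    """Rank values from smallest to largest; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


groups = [
    [9.00, 9.50, 9.75, 10.00, 13.00, 9.50],       # active exercise
    [11.00, 10.00, 10.00, 11.75, 10.50, 15.00],   # passive exercise
    [11.50, 12.00, 9.00, 11.50, 13.25, 13.00],    # no exercise
    [13.25, 11.50, 12.00, 13.50, 11.50],          # 8-week control
]

# Rank all observations regardless of group, then split the ranks back.
ranks = average_ranks([y for g in groups for y in g])
ranked_groups, start = [], 0
for g in groups:
    ranked_groups.append(ranks[start:start + len(g)])
    start += len(g)

# One-way ANOVA applied to the ranks instead of the raw values.
N, k = len(ranks), len(groups)
grand_mean = sum(ranks) / N
between_ss = sum(len(r) * (sum(r) / len(r) - grand_mean) ** 2 for r in ranked_groups)
within_ss = sum(sum((v - sum(r) / len(r)) ** 2 for v in r) for r in ranked_groups)
f_ratio = (between_ss / (k - 1)) / (within_ss / (N - k))

print(f"F on ranks with ({k - 1}, {N - k}) df = {f_ratio:.2f}")
```

The F ratio on the ranks is referred to the same F distribution as in the ordinary one-way ANOVA.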
