The usual course of events for conducting scientific work
“The Scientific Method”
1. Observation
2. Develop a Working Hypothesis
3. Conduct an experiment or a series of controlled systematic observations
4. Appropriate statistical tests
5. Confirm or reject hypothesis
6. Reformulate or extend hypothesis
Observation: In the intertidal zone, algae seem to be confined to specific areas
Hypothesis: There will be a positive correlation of algal abundance and tide height
Experiment: Measure tide heights and count the number of algae at each
Statistical test: Product-moment correlation
Result: There is a positive correlation of tide height and algal abundance
Extended hypothesis: Algae will grow higher on the shore in areas of high wave action
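As a rough illustration of the product-moment (Pearson) correlation named above, here is a minimal sketch. The tide heights and algal counts are hypothetical numbers invented for the example, not data from the slides, and `pearson_r` is just an illustrative helper name.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements (assumption, for illustration only)
tide_height_m = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
algae_count = [2, 5, 4, 9, 11, 14]

r = pearson_r(tide_height_m, algae_count)
print(f"r = {r:.3f}")  # a value near +1 indicates a strong positive correlation
```

A value of r close to +1 would be consistent with the hypothesis; whether it is statistically significant still depends on the sample size.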
Imagine that you are collecting samples (i.e. individuals) from a population of little ball creatures - Critterus sphericales
Little ball creatures come in 3 sizes: Small, Medium, Large
You take a total of five samples (sample 1 to sample 5)
[Figure: the real population (all the little ball creatures that exist) next to your samples]
Each sample is a representation of the population
BUT
No single sample can be expected to accurately represent the whole population
So……………
To be statistically valid, each sample must be:
1) Random:
Thrown quadrat?? Guppies netted from an aquarium?
To be truly random:
Divide the area into a numbered grid (20 × 15 = 300 cells)
Choose numbers randomly from 1 to 300
[Figure: grid with cells numbered 1, 2, 3, ...]
Assign numbers from a random number table
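In place of a printed random number table, a pseudo-random number generator can pick the cells. The sketch below assumes the 300 numbered cells mentioned above; the number of quadrats (15) and the function name `choose_cells` are assumptions made for the example.

```python
import random

N_CELLS = 300    # cells are numbered 1 to 300
N_SAMPLES = 15   # hypothetical number of quadrats to place

def choose_cells(n_cells=N_CELLS, n_samples=N_SAMPLES, seed=None):
    """Pick distinct cell numbers at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return sorted(rng.sample(range(1, n_cells + 1), n_samples))

cells = choose_cells(seed=1)
print(cells)
```

Sampling without replacement means no cell is chosen twice; recording the seed lets someone else reproduce the same layout.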
To be statistically valid, each sample must be:
2) Replicated:
Bark samples for levels of cadmium
[Figure: two sampling layouts for the bark samples]
Pseudoreplicated: 10 samples from the same tree, sample size (n) = 1
Not pseudoreplicated: 10 samples from 10 different trees, sample size (n) = 10
IF YOUR DATA ARE:
1. Continuous data
2. Ratio or interval
3. Approximately normal distribution
4. Equal variance (F-test)
5. Conclusions about population based on sample (inductive)
6. Sample size > 10
(sample → population)
CHARACTERIZING DATA
Variables
-dependent – in any experiment, the dependent variable is the one being measured by the experimenter
-also known as a response or test variable
-independent – in any experiment, the independent variable is the one being changed by the experimenter
-also known as a factor
Nominal data (nominal scales, nominal variables)
- data are in categories
- e.g. Drosophila genetic traits, species, sex
Look at the distribution of lizards in the forests
- Both the dependent and independent variables are nominal/categorical

Lizard species   Habitat
                 Ground   Tree trunk   Tree branch   Species totals
Species A           9         0            15              24
Species B           9         0            12              21
Species C           9         5             0              14
Species D           9        10             3              22
Totals             36        15            30              81
Ordinal data (ordinal scales, ordinal variables)
- data are in categories
- categories are ranked
- e.g. grades, surveys, behavioural responses
Interval data (interval scales, interval variables)
- constant size interval
- no true zero point (zero point depends on the scale used)
- e.g. temperature
- values can be treated arithmetically (only +, -) to give a meaningful result
Ratio data (or ratio scales or ratio variables)
- constant size interval
- a zero point with some reality
- e.g. height, weight, time
- values can be treated arithmetically (+, -, ×, ÷) to give a meaningful result
- can be continuous
- or discrete: counts, “number of …..”
Kinds of Variables
Assignment as a discrete (= categorical) or continuous variable can depend on the method of measurement
[Figure: the same variable recorded either as a continuous measurement or in discrete (= categorical) classes: Dappled, Full, Open]
The kind of data you are dealing with is one determining factor in the kind of statistical test you will use.
Two ways of arriving at a conclusion
1. Deductive inference: population → sample
2. Inductive inference: sample → population
Imagine the following experiment:
2 groups of crickets
Group 1 – fed a diet with extra supplements
Group 2 – fed a diet with no supplements
Weights
Group 1 (extra supplements):
12.1 13.9 13.0 12.1
14.9 12.2 12.9 14.9
13.6 12.0 13.5 13.6
12.0 15.9 12.4 12.0
10.9 12.1 11.0 10.9
Mean = 12.8
Group 2 (no supplements):
9.1 8.9 11.0 10.1
9.9 9.2 8.0 11.9
8.6 9.0 8.5 9.6
10.0 10.9 9.4 8.0
11.9 7.1 10.0 8.9
Mean = 9.49
What you’re doing here is comparing two samples that, because you’ve not violated any of the assumptions we saw before, should represent populations that look like this:
[Figure: two frequency distributions of weight, with means at 9.49 and 12.8]
Are the means of these populations different??
To answer this question – use a statistical test
A statistical test is just a mathematical method of deciding whether you can answer ‘yes’ or ‘no’ to this question at a chosen level of probability
What test should I use??
IF YOU HAVEN’T VIOLATED ANY OF THE ASSUMPTIONS WE MENTIONED BEFORE……
Number of groups compared?
- 2: t-test (Are the means of two populations the same?)
  - Direction of difference specified? Yes: one-tailed. No: two-tailed.
  - Does each data point in one data set (population) have a corresponding one in the other data set? Yes: paired t-test. No: unpaired t-test.
- Other than 2: ANOVA (Are the means of more than two populations the same?)
  - Number of factors being tested?
    - 1: Does each data point in one data set (population) have a corresponding one in the other data sets? Yes: Repeated Measures ANOVA. No: One way ANOVA.
    - 2: Two way ANOVA
    - >2: Other tests
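One reading of the flow chart above can be written as a small lookup function. This is only a sketch: the function name and its parameters are invented for illustration, and it assumes that data with corresponding (matched) points lead to the paired t-test or Repeated Measures ANOVA.

```python
def choose_parametric_test(n_groups, paired=False, n_factors=1):
    """Return the test suggested by the flow chart (all assumptions met).

    n_groups  -- number of groups being compared
    paired    -- True if each data point has a corresponding one in the
                 other data set(s)
    n_factors -- number of factors being tested (only used for ANOVA)
    """
    if n_groups < 2:
        raise ValueError("need at least two groups to compare means")
    if n_groups == 2:
        return "Paired t-test" if paired else "Unpaired t-test"
    # More than two groups: some form of ANOVA
    if n_factors == 1:
        return "Repeated Measures ANOVA" if paired else "One way ANOVA"
    if n_factors == 2:
        return "Two way ANOVA"
    return "Other tests"

# The cricket experiment: two independent groups
print(choose_parametric_test(n_groups=2, paired=False))
```

Whether the t-test is one-tailed or two-tailed is a separate choice that depends on whether the direction of the difference was specified in advance.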
A simple t-test
1. State hypotheses
Ho – there is no difference between the means of the two populations of crickets (i.e. the extra nutrients had no effect on weight)
H1 – there is a difference between the means of the two populations of crickets (i.e. the extra nutrients had an effect on weight)
2. Calculate a t-value (any stats program does this for you; the formula is only for the truly masochistic)
3. Use a probability table for the test you used to determine the probability that corresponds to the t-value that was calculated.
Data → Test statistic → Probability

Unpaired t test
Do the means of Nutrient fed and No nutrient differ significantly?
P value: The two-tailed P value is < 0.0001, considered extremely significant.
t = 7.941 with 38 degrees of freedom.
95% confidence interval: Mean difference = -3.307 (Mean of No nutrient minus mean of Nutrient fed). The 95% confidence interval of the difference: -4.150 to -2.464
Assumption test: Are the standard deviations equal?
The t test assumes that the columns come from populations with equal SDs. The following calculations test that assumption.
F = 1.192. The P value is 0.7062. This test suggests that the difference between the two SDs is not significant.
Assumption test: Are the data sampled from Gaussian distributions?
The t test assumes that the data are sampled from populations that follow Gaussian distributions. This assumption is tested using the method of Kolmogorov and Smirnov:

Group          KS       P Value   Passed normality test?
Nutrient fed   0.1676   >0.10     Yes
No nutrient    0.1279   >0.10     Yes
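For readers who want to see where the t-value comes from, here is a minimal sketch of an unpaired (pooled-variance) t-test using only the Python standard library. It uses the cricket weights as they appear in this transcript; the transcribed values may not match the analysed data exactly, so the result comes out close to, but not necessarily identical with, the t = 7.941 quoted above. The names `supplemented`, `unsupplemented` and `unpaired_t` are illustrative.

```python
import math
from statistics import mean, variance

supplemented = [12.1, 13.9, 13.0, 12.1, 14.9, 12.2, 12.9, 14.9, 13.6, 12.0,
                13.5, 13.6, 12.0, 15.9, 12.4, 12.0, 10.9, 12.1, 11.0, 10.9]
unsupplemented = [9.1, 8.9, 11.0, 10.1, 9.9, 9.2, 8.0, 11.9, 8.6, 9.0,
                  8.5, 9.6, 10.0, 10.9, 9.4, 8.0, 11.9, 7.1, 10.0, 8.9]

def unpaired_t(a, b):
    """Return (t, degrees of freedom) for a pooled-variance t-test."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance assumes both populations have equal variance
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / std_err, df

t_value, df = unpaired_t(supplemented, unsupplemented)
# Ratio of the larger to the smaller sample variance (the F in the F-test)
f_ratio = max(variance(supplemented), variance(unsupplemented)) / \
          min(variance(supplemented), variance(unsupplemented))

print(f"t = {t_value:.3f}, df = {df}, F = {f_ratio:.3f}")
```

The t-value is then looked up in a t table at 38 degrees of freedom, as in step 3.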
Interpretation of p < .0001?
This means that, if the two samples came from the same population (Ho true), there is less than 1 chance in 10,000 of getting a difference between the means at least this large.
In the world of statistics, that is too small a chance to have happened randomly and so the Ho is rejected and the H1 accepted
For all statistical tests that you’ll use, the convention is a cut-off of 5%, or p = .05: if p is below .05, the difference is considered too large for the two samples to be from the same population
What happens if you violate any of the assumptions?
Step 1 - Panic
Step 2 - It depends on what assumptions have been violated.
Assumption                Other tests      Another solution?
1. Continuous data        Yes
2. Ratio/interval         Yes
3. Normal distribution    Yes              Transform the data
4. Equal variance         Yes - Welch’s
5. Sample → Population    Yes
6. N < 10                 Yes              Take more samples
Nonparametric Tests
These tests are used when the assumptions of t-tests and ANOVA have been violated
They are called “nonparametric” because there is no estimation of parameters (means, standard deviations or variances) involved.
Several kinds:
1) Goodness-of-Fit tests - when you calculate an expected value
2) Non-parametric equivalents of parametric tests
SUMMARY
Problem - trying to determine the expected frequencies of any result in a particular experiment
Type of data: Discrete
- 2 categories & Bernoulli process: use a Binomial model to calculate expected frequencies
- > 2 categories: use a Poisson distribution to calculate expected frequencies
Consider the following problem:
Sampling earthworms
25 plots

Quadrat   # of worms
1         3
2         4
3         1
4         1
5         3
6         0
7         0
8         1
9         2
10        3
11        4
12        5
13        0
14        1
15        3
16        5
17        5
18        2
19        6
20        3
21        1
22        1
23        1
24        0
25        1

N = 25
Mean (X̄) = 2.24 worms/quadrat
What is the expected number of worms/quadrat?
OR
What is the probability of x worms being in a particular quadrat?
Use a Poisson distribution
- > 2 mutually exclusive categories
- N is relatively large and p is relatively small
The distribution of worms in space is expected to be random
Formula for a Poisson distribution
$$P_X = \frac{e^{-\mu}\,\mu^{X}}{X!}$$
$P_X$ = probability of observing X individuals in a category (here, X worms in a quadrat)
$e$ = base of natural logarithms (= 2.71828….)
$\mu$ = true mean of the population, approximated by the sample mean (here, $\mu = \bar{X} = 2.24$)
$X$ = an integer: the number of individuals (here, number of worms)
# of worms (X)   Calculation                                                  Probability of finding X worms in a quadrat
0                $P_0 = e^{-\mu}(\mu^{0}/0!) = e^{-2.24}$                      .1065
1                $P_1 = e^{-\mu}(\mu^{1}/1!) = e^{-2.24}(2.24/1)$              .2385
2                $P_2 = e^{-\mu}(\mu^{2}/2!) = e^{-2.24}(2.24^{2}/2)$          .2671
3                $P_3 = e^{-\mu}(\mu^{3}/3!) = e^{-2.24}(2.24^{3}/6)$          .1994
4                $P_4 = e^{-\mu}(\mu^{4}/4!) = e^{-2.24}(2.24^{4}/24)$         .1117
5                $P_5 = e^{-\mu}(\mu^{5}/5!)$                                  .05
6                $P_6 = e^{-\mu}(\mu^{6}/6!)$                                  .0187
7                $P_7 = e^{-\mu}(\mu^{7}/7!)$                                  .006
Could go on forever or to ∞ - whichever comes first!
Practically….
$P_0 + P_1 + P_2 + P_3 + P_4 + P_5 + P_6 + P_7 = .998$
And
$P_8 + P_9 + \dots = .002$
For convenience, lump everything from 8 worms upward into one class: $P_{\geq 8} = .002$
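The Poisson probabilities above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; `poisson_pmf` is an illustrative helper name, and the worm counts are the 25 quadrat values from the earthworm example.

```python
import math

def poisson_pmf(x, mu):
    """Probability of observing exactly x individuals when the mean is mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

# Worms counted in each of the 25 quadrats
worms = [3, 4, 1, 1, 3, 0, 0, 1, 2, 3, 4, 5, 0,
         1, 3, 5, 5, 2, 6, 3, 1, 1, 1, 0, 1]
mu = sum(worms) / len(worms)  # sample mean stands in for the true mean

probs = [poisson_pmf(x, mu) for x in range(8)]  # P0 to P7
tail = 1 - sum(probs)                           # everything from 8 upward

for x, p in enumerate(probs):
    print(f"P{x} = {p:.4f}")
print(f"P(8 or more) = {tail:.4f}")
```

Multiplying each probability by N = 25 would give the expected number of quadrats containing that many worms.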
Other kinds of Poisson problems
1. Cell counts in a hemocytometer
2. Number of parasitic mites per fly in a population
3. Number of fish per seine
4. Number of animals in a particular subdivision of the habitat
Poisson Distributions are very common in biological work!
Goodness-of-Fit Tests
Use with nominal scale data
e.g. results of genetic crosses
Also, you’re using the population to deduce what the sample should look like
Classic example - genetic crosses
Do they conform to an “expected” Mendelian ratio?
Back to our little ball creatures - Critterus sphericales
Phenotypes:
A_B_
A_bb
aaB_
aabb
Mendelian inheritance - predict a 9:3:3:1 ratio
-sampled 320 animals

                   A_B_     A_bb     aaB_     aabb
Observed (o)       194      53       67       6
Expected (e)       180      60       60       20
o - e              14       -7       7        -14
$(o - e)^2$        196      49       49       196
$(o - e)^2 / e$    1.089    .817     .817     9.8

$$\chi^2 = \sum \frac{(o - e)^2}{e} = 1.089 + .817 + .817 + 9.8 = 12.52$$
df = number of classes - 1 = 3
$\chi^2$ = 12.52. Critical value for 3 degrees of freedom at .05 level is 7.82
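[Table: $\chi^2$ critical values]
Conclusion: $\chi^2$ = 12.52 exceeds the critical value of 7.82, so the probability of getting data like these from a population with the expected 9:3:3:1 distribution is < .05; therefore reject the hypothesis that they are from a Mendelian population
The actual probability for $\chi^2$ = 12.52 and df = 3 is .01 > p > .001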
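The goodness-of-fit arithmetic for the little ball creatures can be reproduced with a short script. This is a sketch using only the standard library; the critical value is typed in from a chi-square table rather than computed, and the variable names are illustrative.

```python
observed = {"A_B_": 194, "A_bb": 53, "aaB_": 67, "aabb": 6}
ratio = {"A_B_": 9, "A_bb": 3, "aaB_": 3, "aabb": 1}  # Mendelian 9:3:3:1

total = sum(observed.values())   # 320 animals
parts = sum(ratio.values())      # 16
# Expected counts if the population follows the 9:3:3:1 ratio
expected = {k: total * ratio[k] / parts for k in observed}

# Chi-square statistic: sum of (o - e)^2 / e over all classes
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1

CRITICAL_05_DF3 = 7.815  # table value for df = 3 at the .05 level
reject_h0 = chi2 > CRITICAL_05_DF3

print(f"chi-square = {chi2:.2f}, df = {df}, reject Ho: {reject_h0}")
```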
A little $\chi^2$ wrinkle - the Yates correction
Formula is
$$\chi^2 = \sum \frac{(o - e)^2}{e}$$
Except if df = 1 (i.e. you’re using two categories of data)
Then the formula becomes
$$\chi^2 = \sum \frac{(|o - e| - 0.5)^2}{e}$$
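To see what the Yates correction does numerically, here is a minimal sketch that computes the statistic with and without it. The counts are hypothetical (100 offspring tested against a 3:1 ratio), invented only for this example, and `chi_square` is an illustrative helper name.

```python
def chi_square(observed, expected, yates=False):
    """Chi-square statistic; with yates=True, subtract 0.5 from each |o - e|."""
    correction = 0.5 if yates else 0.0
    return sum((abs(o - e) - correction) ** 2 / e
               for o, e in zip(observed, expected))

# Hypothetical two-category data (df = 1): 100 offspring, expected 3:1
observed = [84, 16]
expected = [75, 25]

plain = chi_square(observed, expected)
corrected = chi_square(observed, expected, yates=True)

print(f"uncorrected = {plain:.3f}, Yates-corrected = {corrected:.3f}")
```

The correction always pulls the statistic downward, which makes the test slightly more conservative when there are only two categories.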
Summary!
Type of data   Number of samples   Are data related?   Test to use
Nominal        2                   Yes                 McNemar
Nominal        2                   No                  Fisher’s Exact
Nominal        >2                  Yes                 Cochran’s Q
Ordinal        1                   No                  Kolmogorov-Smirnov
Ordinal+       2                   Yes                 Wilcoxon (paired t-test analogue)
Ordinal+       2                   No                  Mann-Whitney U (unpaired t-test analogue)
Ordinal+       >2                  No                  Kruskal-Wallis (analogue of one-way ANOVA)
Ordinal        >2                  Yes                 Friedman two-way ANOVA
All of the parametric tests (remember the big flow chart!) have non-parametric equivalents (or analogues)