TREES
[Figure: the same tree drawn with different leaf orders: Chimp Human Gorilla = Human Chimp Gorilla = Chimp Gorilla Human = Human Gorilla Chimp.]
Trees
Same thing…
[Figure: two drawings of the same tree on s1 to s5.]
Terminology
A branch = an edge
External node = leaf
Internal nodes
The root
[Figure: tree of Human, Chimp, Chicken and Gorilla with these terms labeled.]
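Exercise
Which of the following statements is correct, with reference to the tree above?
a. The human and the gorilla are closer to each other than the chimpanzee and the gorilla.
b. The human is equally close to the chicken and to the duck.
c. The chicken is closer to the gorilla than the human is.
d. a + b.
e. a + c.
f. b + c.
g. a + b + c.
h. None of the answers is correct.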
The maximum parsimony principle.
Tree building
Genes: 0 = absence, 1 = presence
species  g1  g2  g3  g4  g5  g6
s1        1   0   0   1   1   0
s2        0   0   1   0   0   0
s3        1   1   0   0   0   0
s4        1   1   0   1   1   1
s5        0   0   1   1   1   0
Tree building
s1 s4 s3 s2 s5
Evaluate this tree…
Tree building
s1 s4 s3 s2 s5
Gene number 1
1 1 1 0 0
10
1
Tree building
s1 s4 s3 s2 s5
Gene number 1, Option number 1.
1 1 1 0 0
1
0
1
1
Tree building
s1 s4 s3 s2 s5
Gene number 1, Option number 2.
Number of changes for gene 1 (character 1) = 1
1 1 1 0 0
1
0
0
1
Tree building
s1 s4 s3 s2 s5
Gene number 2, Option number 1.
0 1 1 0 0
1
0
0
1
Tree building
s1 s4 s3 s2 s5
Gene number 2, Option number 2.
0 1 1 0 0
1
0
1
1
Tree building
s1 s4 s3 s2 s5
Gene number 2, Option number 3.
0 1 1 0 0
0
0
0
0
Number of changes for gene 2 (character 2) = 2
Tree building
s1 s4 s3 s2 s5
Gene number 3, Option number 1.
0 0 0 1 1
0
1
0
0
Tree building
s1 s4 s3 s2 s5
Gene number 3, Option number 2.
0 0 0 1 1
0
1
1
0
Number of changes for gene 3 (character 3) = 1
Tree building
s1 s4 s3 s2 s5
Gene number 4, Option number 1.
1 1 0 0 1
1
1
1
1
Tree building
s1 s4 s3 s2 s5
Gene number 4, Option number 2.
1 1 0 0 1
0
0
0
1
Number of changes for gene 4 (character 4) = 2
Tree building
Gene number 5 is the same as Gene number 4
Number of changes for gene 5 (character 5) = 2
Tree building
s1 s4 s3 s2 s5
Gene number 6, 1 option only:
0 1 0 0 0
0
0
0
0
Number of changes for gene 6 (character 6) = 1
Tree building
Sum of changes
Number of changes for gene 1 (character 1) = 1
Number of changes for gene 2 (character 2) = 2
Number of changes for gene 3 (character 3) = 1
Number of changes for gene 4 (character 4) = 2
Number of changes for gene 5 (character 5) = 2
Number of changes for gene 6 (character 6) = 1
Sum of changes for this tree topology = 9
Can we do better ???
Tree building
s1 s4 s3 s2 s5
The MP (most parsimonious) tree:
Sum of changes for this tree topology = 8
Tree building
How to efficiently compute the MP score of a tree
The Fitch algorithm (1971):
[Figure: tree of Human, Chimp, Chicken, Gorilla and Duck with leaf states A, G, C, C, A and internal-node sets {A,G}, {A,C,G}, {A,C}, {A,C}.]
Postorder tree scan. At each internal node, look at the sets of its two children: if their intersection is empty, apply the union operator (this counts as one change); otherwise, apply the intersection.
Number of changes
Total number of changes = number of union operators => 3 in this case.
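The Fitch scan can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the nested-tuple tree encoding and the function name `fitch` are choices made here, and the topology `(((s1,s4),s3),(s2,s5))` is an assumption (the slides only give the leaf order s1 s4 s3 s2 s5). The data are the presence/absence table for genes g1 to g6.

```python
def fitch(tree, states):
    """Return (state_set, changes) for a rooted binary tree.

    tree   -- a leaf name (str) or a pair (left, right) of subtrees
    states -- dict mapping each leaf name to its character state
    """
    if isinstance(tree, str):              # leaf: the set is the observed state
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)     # postorder: children first
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:                             # non-empty intersection: no change here
        return common, left_changes + right_changes
    # empty intersection: take the union and count one change
    return left_set | right_set, left_changes + right_changes + 1


# Presence/absence table from the slides (columns are genes g1..g6).
table = {
    "s1": "100110",
    "s2": "001000",
    "s3": "110000",
    "s4": "110111",
    "s5": "001110",
}

# Assumed topology; only the leaf order is given in the slides.
tree = ((("s1", "s4"), "s3"), ("s2", "s5"))

per_gene = [
    fitch(tree, {species: row[g] for species, row in table.items()})[1]
    for g in range(6)
]
print(per_gene, sum(per_gene))  # [1, 2, 1, 2, 2, 1] 9
```

Under this assumed topology the per-gene counts agree with the ones worked out by hand in the tree-building slides (1, 2, 1, 2, 2, 1, for a total of 9).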
Exercise
Find the minimum number of changes.
Human Chimp Chicken Gorilla Duck
GACA GGGA CAAG GCGA GAAA
Chimp    AAAAT
Gorilla  ACTAG
Human    ACAAC

Leaf order Chimp, Human, Gorilla:
Position  States  Changes
1         A A A   0
2         A C C   1
3         A A T   1
4         A A A   0
5         T C G   2
Total number of changes = 4

[Figure: the same count repeated on the three rooted trees of Chimp, Human and Gorilla; each tree needs one union (U) at positions 2 and 3 and two at position 5, for a total of 4.]
These 3 trees will ALWAYS get the same score
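A quick way to check this claim is to score all three rooted trees on the three-species alignment. The sketch below assumes Chimp = AAAAT, Human = ACAAC, Gorilla = ACTAG (the assignment read off the position rows of the slides); the function names and the nested-tuple encoding are illustrative choices.

```python
def fitch_changes(tree, states):
    """Fitch count for one character; returns (state_set, changes)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_changes(tree[0], states)
    right_set, right_changes = fitch_changes(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def tree_score(tree, seqs):
    """MP score of a tree: sum of the Fitch counts over all positions."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


seqs = {"Chimp": "AAAAT", "Human": "ACAAC", "Gorilla": "ACTAG"}

# The three rooted trees on three taxa.
rooted_trees = [
    (("Chimp", "Human"), "Gorilla"),
    (("Chimp", "Gorilla"), "Human"),
    (("Human", "Gorilla"), "Chimp"),
]

scores = [tree_score(tree, seqs) for tree in rooted_trees]
print(scores)  # [4, 4, 4]
```

All three trees are rootings of the same unrooted three-taxon tree, which is why the scores cannot differ.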
The unrooted tree represents a set of rooted trees
[Figure: an unrooted tree with labels 1, 2 and 3, and a rooted tree obtained from it.]
A general observation: the position of the root does not affect the MP score.
[Figure: an unrooted tree on taxa A to E and rooted versions of it; the tree on s1, s4, s3, s2, s5 with character states 1 1 1 0 0.]
Intuition as to why rooting does not change the score.
The change will always be on the same branch, no matter where the root is positioned…
1
Exercise
Which is not a rooted version of this tree?
[Figure: an unrooted tree on taxa A to E and three candidate rooted trees, T1, T2 and T3.]
Gorilla gorilla (gorilla)
Homo sapiens (human)
Pan troglodytes (Chimpanzee)
Gallus gallus (chicken)
Evaluate all 3 possible UNROOTED trees:
(Human, Chimp) | (Chicken, Gorilla)
(Human, Gorilla) | (Chimp, Chicken)
(Human, Chicken) | (Chimp, Gorilla)
MP tree
Rooting based on a priori knowledge:
Human
Chimp
Chicken
Gorilla
[Figure: the rooted tree of Human, Chimp, Chicken and Gorilla.]
Ingroup / Outgroup:
[Figure: the same tree with the INGROUP and the OUTGROUP marked.]
Subtrees
[Figure: tree of Human, Chimp, Chicken, Gorilla and Duck with one subtree highlighted.]
Monophyletic groups
[Figure: tree of Human, Chimp, Chicken and Gorilla.]
The Gorilla+Human+Chimp are monophyletic.
A clade is a monophyletic group.
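Checking whether a group is monophyletic on a rooted tree amounts to asking whether some subtree contains exactly that group. The sketch below is illustrative only: the nested-tuple encoding, the function names and the topology `(Chicken, (Gorilla, (Human, Chimp)))` are assumptions made for the example.

```python
def leaves(tree):
    """Set of leaf names under a node of a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def is_monophyletic(tree, group):
    """True if some subtree has exactly `group` as its set of leaves."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return is_monophyletic(tree[0], group) or is_monophyletic(tree[1], group)


# Assumed rooted topology, with Chicken as the outgroup.
tree = ("Chicken", ("Gorilla", ("Human", "Chimp")))

print(is_monophyletic(tree, {"Gorilla", "Human", "Chimp"}))  # True
print(is_monophyletic(tree, {"Human", "Gorilla"}))           # False
```

Recomputing `leaves` at every node is wasteful for large trees, but it keeps the definition of a clade visible in the code.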
Paraphyletic = Non-monophyletic groups
[Figure: tree of Whale, Chimp, Drosophila and Zebrafish.]
The Zebrafish+Whale are paraphyletic.
[Figure: unrooted tree of Human, Chimp, Gorilla, Rat and Chicken.]
Chicken + Rat seem to be monophyletic, but they are not, since the root of the tree lies between Chicken and the rest.
Human and Gorilla are not monophyletic no matter where the root is…
When an unrooted tree is given, you cannot know which groups are monophyletic. You can only say which are not.
HOW MANY TREES
How many rooted trees
TR = “TREE ROOTED”
N=2, TR(2) = 1
[Figure: the single rooted tree (a,b).]
N=3, TR(3) = 3
[Figure: the three rooted trees ((a,b),c), ((a,c),b), ((b,c),a).]
N=4, TR(4) = 15
[Figure: the 15 rooted trees on a, b, c, d.]
How many rooted trees
TR = “TREE ROOTED”
[Figure: the rooted tree (a,b) with the places where “c” can be attached, then the places where “d” can be attached.]
2 branches. 3 possible places to add “c”
4 branches. 5 possible places to add “d”
6 branches. 7 possible places to add “e”
The number of branches increases by 2 each time, so it forms an arithmetic series: 0, 2, 4, 6, 8, …
A(n) = A(1) + (n-1)d, with A(1) = 0 and d = 2, so A(n) = (n-1)*2 = 2n-2.
Each time we can add a new branch in Br(n)+1 places: on any existing branch, or above the root.
[Br(n) = number of branches; TR(n) = number of rooted trees with n sequences]
TR(n+1) = TR(n)*(Br(n)+1) = TR(n)*(2n-1)
TR(5) = TR(4)*7 = TR(3)*5*7 = TR(2)*3*5*7 = 1*3*5*7
TR(n) = 1*3*5*7*…*(2n-3)
n! = 1*2*3*4*5*6*…*n = n factorial.
\[
\begin{aligned}
TR(n) &= 1\cdot 3\cdot 5\cdot 7\cdots(2n-3) \\
      &= \frac{1\cdot 2\cdot 3\cdot 4\cdot 5\cdot 6\cdot 7\cdots(2n-3)}{2\cdot 4\cdot 6\cdot 8\cdots(2n-4)} \\
      &= \frac{(2n-3)!}{(2\cdot 1)(2\cdot 2)(2\cdot 3)(2\cdot 4)\cdots(2(n-2))} \\
      &= \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!} \\
      &= (2n-3)!!
\end{aligned}
\]
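The recurrence and the closed form can be checked against each other numerically. This is a small sketch; the function names are illustrative, and the closed form used is (2n-3)! / (2^(n-2) (n-2)!).

```python
from math import factorial


def rooted_trees_recursive(n):
    """TR(n) via TR(k+1) = TR(k) * (2k - 1), starting from TR(2) = 1."""
    count = 1
    for k in range(2, n):
        count *= 2 * k - 1
    return count


def rooted_trees_closed_form(n):
    """TR(n) = (2n-3)! / (2^(n-2) * (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in range(2, 11):
    assert rooted_trees_recursive(n) == rooted_trees_closed_form(n)
    print(n, rooted_trees_recursive(n))
```

For n = 2, 3, 4 this reproduces TR = 1, 3, 15, and already for n = 10 the count is 34,459,425, which is the motivation for heuristic search.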
HEURISTIC SEARCH
There are many trees…
We cannot go over all the trees, so we use a heuristic search for the best tree. The solutions found are approximate…
Finding the maximum is the same thing as finding the minimum
Say we have a computer procedure that, given a function, finds its minimum, and we want to find the maximum of a function f(x). We can find the minimum of -f(x): the maximum of f(x) is minus that minimum, and it is attained at the same x.
Example.
f(0) = 3; f(1) = 7; f(2) = -5; f(3) = 0; max f(x) = 7; argmax f(x) = 1.
-f(0) = -3; -f(1) = -7; -f(2) = 5; -f(3) = 0; min(-f(x)) = -7; argmin(-f(x)) = 1.
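The same example in code, as a small sketch. `find_min` is a hypothetical stand-in for "a procedure that finds the minimum"; here the function is simply a dictionary from x to f(x).

```python
def find_min(values):
    """Stand-in minimizer: returns (argmin, min) of a dict mapping x -> value."""
    x_best = min(values, key=values.get)
    return x_best, values[x_best]


f = {0: 3, 1: 7, 2: -5, 3: 0}
neg_f = {x: -y for x, y in f.items()}      # the function -f(x)

argmin_neg, min_neg = find_min(neg_f)      # minimize -f
argmax_f, max_f = argmin_neg, -min_neg     # same x, opposite sign

print(argmax_f, max_f)  # 1 7
```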
[Figure: a walk through tree space in which each visited tree is scored (scores such as 1410, 1695, 1700, 1710, 1800, 1825, 1828, 1910, 2100); the search ends at Max score = 2900.]
Problem number 1: local maximum
[Figure: the search has stopped at a local max (Score = 2900), while the global max has Score = 3100.]
This algorithm is “greedy” – it seizes the first improvement encountered.
One way to avoid local maxima is to start from many random starting points
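The effect of the starting point can be seen on a toy one-dimensional landscape. Everything in this sketch is hypothetical: the list of scores (with a local maximum of 2900 and a global maximum of 3100), the neighbor definition (the positions to the left and right) and the function names. In a real tree search the positions would be trees and the neighbors would come from rearrangements.

```python
import random


def hill_climb(scores, start):
    """Greedy climb on a 1-D landscape: move to the better neighbor until none improves."""
    current = start
    while True:
        neighbors = [i for i in (current - 1, current + 1) if 0 <= i < len(scores)]
        if not neighbors:
            return current
        best = max(neighbors, key=lambda i: scores[i])
        if scores[best] <= scores[current]:
            return current                 # no neighbor improves: a (local) maximum
        current = best


def climb_with_restarts(scores, n_starts, rng):
    """Run the climb from several random starting points and keep the best end point."""
    starts = [rng.randrange(len(scores)) for _ in range(n_starts)]
    ends = [hill_climb(scores, s) for s in starts]
    return max(ends, key=lambda i: scores[i])


# Toy landscape: local max 2900 at index 3, global max 3100 at index 8.
scores = [1700, 1825, 2100, 2900, 1800, 1710, 1910, 2500, 3100, 1695]

print(hill_climb(scores, 0))   # 3: stuck on the local maximum
print(hill_climb(scores, 5))   # 8: reaches the global maximum
best = climb_with_restarts(scores, 20, random.Random(0))
print(best, scores[best])
```

More starting points raise the chance that at least one of them lies in the basin of the global maximum, but they do not guarantee it.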
Several options to define a neighbor.
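[Figure: two ways of defining a neighbor, labeled Option 1 and Option 2.]
Nearest-neighbor interchange
[Figure: the three possible arrangements of the subtrees A, B, C and D around one internal branch.]
Each internal branch defines two neighbors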
How many neighbors do we check each time?
For unrooted trees of n taxa, we have 2n-3 branches. However, only internal branches are interesting, thus we have n-3. Each defines two neighbors, thus the total number of neighbors in each NNI cycle is 2n-6.
A
BC
D
E
Internal branches
External branches
NNI is possible only in internal branches
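The two statements above (two neighbors per internal branch, 2n-6 neighbors in total) can be written down directly. This is a sketch under simple assumptions: the four subtrees around an internal branch are passed in explicitly as `a, b, c, d`, with `(a, b)` on one side and `(c, d)` on the other, and the function names are illustrative.

```python
def nni_neighbors(a, b, c, d):
    """Two NNI rearrangements around the internal branch of ((a, b), (c, d)).

    Each argument may be a leaf name or a whole subtree.
    """
    return [((a, c), (b, d)), ((a, d), (b, c))]


def nni_neighbor_count(n_taxa):
    """Unrooted binary tree: 2n-3 branches, n of them external, 2 neighbors per internal one."""
    total_branches = 2 * n_taxa - 3
    internal_branches = total_branches - n_taxa
    return 2 * internal_branches


print(nni_neighbors("A", "B", "C", "D"))
# [(('A', 'C'), ('B', 'D')), (('A', 'D'), ('B', 'C'))]
print(nni_neighbor_count(5))  # 4
```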
I am greedy
(1)Most greedy: Start searching your neighbors. If you find something better – move there, and start the search again.
(2)Just greedy: Check ALL your neighbors. Move to the one that is the highest.
(3)Smart greedy: Try all NNI of trees that are tied for the best score.
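The first two variants differ only in how the next move is chosen. The sketch below is a toy illustration: the search space (integer positions with a hypothetical score list), the neighbor function and all names are assumptions, and the third variant (following trees tied for the best score) is not implemented.

```python
def first_improvement(current, neighbors, score):
    """'Most greedy': move to the first neighbor that scores higher."""
    for candidate in neighbors(current):
        if score(candidate) > score(current):
            return candidate
    return None


def best_improvement(current, neighbors, score):
    """'Just greedy': check all neighbors and move to the highest-scoring one."""
    best = max(neighbors(current), key=score, default=None)
    if best is not None and score(best) > score(current):
        return best
    return None


def search(start, step, neighbors, score):
    """Repeat the chosen step rule until no neighbor improves the score."""
    current = start
    while (nxt := step(current, neighbors, score)) is not None:
        current = nxt
    return current


# Toy search space: positions 0..4, neighbors at offsets -1, +1, +2.
values = [1, 2, 5, 0, 9]

def score(i):
    return values[i]

def neighbors(i):
    return [j for j in (i - 1, i + 1, i + 2) if 0 <= j < len(values)]


print(search(0, first_improvement, neighbors, score))  # 4, via 0 -> 1 -> 2 -> 4
print(search(0, best_improvement, neighbors, score))   # 4, via 0 -> 2 -> 4
```

Here both rules end at the same position but take different paths; on other landscapes they can end at different local maxima.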
Greedy variants
There are many other variants of the greedy search that will not be discussed in this course.
Parsimony has many shortcomings. To name a few:
(1) All changes are counted the same, which is not true for biological systems (Leu->Ile is much more likely than Leu->His).
(2) Cannot take biological context into account (secondary structures, dependencies among sites, evolutionary distances between the analyzed organisms, etc).
(3) Statistical basis questionable.
Alternative:
MAXIMUM-LIKELIHOOD METHOD.
Maximum likelihood uses a probabilistic model of evolution
Each amino acid has a certain probability to change and this probability depends on the evolutionary distances.
Evolutionary distances are inferred from the entire set of sequences.
Evolutionary distances
Positions can be conserved for two reasons: either because of functional constraints, or because of short evolutionary time.
5 replacements in 10 positions between 2 chimps is considered very variable. 5 replacements between human and cucumber is not considered that variable…
Maximum likelihood takes this information into account.
Maximum Parsimony              Maximum Likelihood
All changes counted the same   Different probabilities for the different types of substitutions
Statistically questionable     Statistically robust
Ignores biological context     Accounts for biological context
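The likelihood computations
[Figure: tree with root X. Branch t1 leads from X to the first leaf and t2 from X to the internal node Y; t3 leads from Y to the second leaf and t4 from Y to the internal node Z; t5 and t6 lead from Z to the third and fourth leaves. For the position shown, the leaves are K, C, M and A.]
\[
\begin{aligned}
P(\text{Data}) ={}& \sum_{X}\sum_{Y}\sum_{Z}\big[P(X)\,P_{X\to K}(t_1)\,P_{X\to Y}(t_2)\,P_{Y\to C}(t_3)\,P_{Y\to Z}(t_4)\,P_{Z\to M}(t_5)\,P_{Z\to A}(t_6)\big] \\
\times{}& \sum_{X}\sum_{Y}\sum_{Z}\big[P(X)\,P_{X\to L}(t_1)\,P_{X\to Y}(t_2)\,P_{Y\to C}(t_3)\,P_{Y\to Z}(t_4)\,P_{Z\to T}(t_5)\,P_{Z\to A}(t_6)\big] \\
\times{}& \sum_{X}\sum_{Y}\sum_{Z}\big[P(X)\,P_{X\to G}(t_1)\,P_{X\to Y}(t_2)\,P_{Y\to C}(t_3)\,P_{Y\to Z}(t_4)\,P_{Z\to E}(t_5)\,P_{Z\to F}(t_6)\big]
\end{aligned}
\]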
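The triple sum over the unobserved states X, Y and Z can be evaluated by brute force for a tree this small. The sketch below rests on several assumptions that are not in the slides: an equal-rates substitution model over the 20 amino acids (every replacement equally likely, branch length measured in expected substitutions per site), a uniform P(X), hypothetical branch lengths, and three alignment columns (K, C, M, A), (L, C, T, A), (G, C, E, F) for the four leaves.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def p_change(a, b, t):
    """Assumed equal-rates model: probability that state a ends as b after a branch of length t."""
    decay = math.exp(-K * t / (K - 1))
    if a == b:
        return 1 / K + (K - 1) / K * decay
    return 1 / K - decay / K


def site_likelihood(leaves, t):
    """Sum over X, Y, Z for the tree
    X -t1-> leaf1, X -t2-> Y, Y -t3-> leaf2, Y -t4-> Z, Z -t5-> leaf3, Z -t6-> leaf4."""
    leaf1, leaf2, leaf3, leaf4 = leaves
    t1, t2, t3, t4, t5, t6 = t
    total = 0.0
    for x, y, z in product(AMINO_ACIDS, repeat=3):
        total += (
            (1 / K)                        # P(X): uniform (assumption)
            * p_change(x, leaf1, t1)
            * p_change(x, y, t2)
            * p_change(y, leaf2, t3)
            * p_change(y, z, t4)
            * p_change(z, leaf3, t5)
            * p_change(z, leaf4, t6)
        )
    return total


branch_lengths = [0.1, 0.2, 0.1, 0.3, 0.1, 0.1]   # hypothetical values
columns = [("K", "C", "M", "A"), ("L", "C", "T", "A"), ("G", "C", "E", "F")]

# Positions are treated as independent, so P(Data) is a product over columns.
likelihood = math.prod(site_likelihood(col, branch_lengths) for col in columns)
print(likelihood)
```

Brute force costs 20^3 terms per position here and grows exponentially with the number of internal nodes, so it is only practical for very small trees.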
We can infer the phylogenetic tree using maximum likelihood. This is more accurate than maximum parsimony.
Maximum likelihood tree reconstruction
This is incredibly difficult (and challenging) from the computational point of view, but efficient algorithms to find approximate solutions were developed.
HIV evolution – an example of using phylogeny tools
The virus = HIV
The disease = AIDS (Acquired Immunodeficiency Syndrome)
First recognized clinically in 1981
By 1992, it had become the major cause of death in individuals 25-44 years of age in the United States.
Human Immunodeficiency Virus (HIV)
By December 2007, 25 million people had died of AIDS (20 million in 2002).
People living with HIV/AIDS in 2007: 33.2 million.
Africa has 12 million AIDS orphans (2007). In some areas, 1 out of 3 children has lost at least one parent.
HIV Statistics
HIV is a lentivirus
Species = HIV
Genus = Lentiviruses
Family = Retroviridae
Lentiviruses have long incubation time, and are thus called “slow viruses”.
In 1986, a distinct type of HIV prevalent in certain regions of West Africa was discovered and was termed HIV type 2.
Individuals infected with type 2 also had AIDS, but with a longer incubation time and lower morbidity (# of cases/population size).
HIV-1 and HIV-2
HIV subtypes
[Figure: HIV subtypes, published by the International AIDS Vaccine Initiative.]
Five lines of evidence have been used to substantiate zoonotic transmission of primate lentivirus:
1. Similarities in viral genome organization;
2. Phylogenetic relatedness;
3. Prevalence in the natural host;
4. Geographic coincidence;
5. Plausible routes of transmission.
For HIV-2, a virus (SIVsm) that is genomically indistinguishable and closely related phylogenetically was found in substantial numbers of wild-living sooty mangabeys, whose natural habitat coincides with the epicenter of the HIV-2 epidemic.
Mangabey: a long-tailed monkey of the genus Cercocebus, found in the forest regions of Africa.
Close contact between sooty mangabeys and humans is common because these monkeys are hunted for food and kept as pets.
No fewer than six independent transmissions of SIVsm to humans have been proposed.
The origin of HIV-1 is much less certain.
HIV and SIV tree based on maximum parsimony
1990
This tree can be explained by co-evolution of virus and host.
Virus A
Primate B
Primate C
Primate A
Virus C
Virus B
Host-pathogen co-evolution in other SIV
1999
There are at least two different HIV-1 clades, and two different SIVcpz clades
Phylogenetic tree
2006. Nature
“We tested 378 chimpanzees and 213 gorilla fecal samples from remote forest regions in Cameroon for HIV-1 cross-reactive antibodies”
“Surprisingly, 6 of 213 fecal samples from wild-living gorillas also gave a positive HIV-1 signal”
The origin of HIV-O
Bayesian analysis
HIV-1 O is a sister clade of SIV from Gorilla!
It seems that either chimpanzees transmitted SIV to gorillas and gorillas transmitted it to humans (type O), or
chimpanzees transmitted it both to gorillas and to humans (type O).
Note: gorillas and chimpanzees rarely interact, and gorillas are herbivores.
Thank you…