A Study on Measuring Distance between Two Trees
Advisor: Prof. 阮夙姿
Presenter: 林陳輝
CSIE, National Chi Nan University 2
Outline
• Introduction
• Problem definition
• Related work
• The metric and algorithms
    • Mixture distance: basic algorithm, modified algorithm
    • Mixture-matching distance
• Conclusions and future work
Introduction
Evolutionary tree
Comparing trees
Comparing trees is not easy
- Phylogenetic tree, Wikipedia
Mixture tree
[Figure: a mixture tree, with the taxa along one axis and time along the other.]
S.-C. Chen and B. G. Lindsay, “Building Mixture Trees from Binary Sequence Data,” Biometrika, 2006.
Problem definition
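[Figure: a mixture tree with leaves A, B, C, D, E, F, G, H and internal nodes v1 to v7. The time parameters are 11 at the root, 9 and 8 on the second level, and 1, 3, 5, 7 on the lowest internal level.]
• The leaves are associated with taxa.
• Every internal node has a time parameter.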
Related work
Path difference metric
$d_p(T_1, T_2) = \lVert d(T_1) - d(T_2) \rVert_2$
$d(T_i)$ is the vector of distances between all pairs of leaves of $T_i$.
M. A. Steel and D. Penny, “Distributions of Tree Comparison Metrics – Some New Results,” Syst. Biol. 42(2):126-141, 1993.
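To make the definition concrete, here is a minimal sketch of the path difference metric. It assumes unit edge lengths (the distance between two leaves is the number of edges on the path between them) and a hypothetical tree representation in which a tree is a nested tuple of children and a leaf is a string. Neither assumption comes from the slides.

```python
from math import sqrt

def pairwise_path_lengths(tree):
    """Return {frozenset({x, y}): number of edges between leaves x and y}."""
    lengths = {}

    def visit(node):
        # Returns {leaf: number of edges from `node` down to that leaf}.
        if isinstance(node, str):
            return {node: 0}
        seen = {}
        for child in node:
            below = {leaf: d + 1 for leaf, d in visit(child).items()}
            # Leaves that come from different children meet at `node`.
            for x, dx in seen.items():
                for y, dy in below.items():
                    lengths[frozenset((x, y))] = dx + dy
            seen.update(below)
        return seen

    visit(tree)
    return lengths

def path_difference(t1, t2):
    """Euclidean norm of the difference of the two pairwise-distance vectors."""
    d1, d2 = pairwise_path_lengths(t1), pairwise_path_lengths(t2)
    return sqrt(sum((d1[pair] - d2[pair]) ** 2 for pair in d1))

# Illustrative trees (not taken from the slides).
balanced = (("A", "B"), ("C", "D"))
caterpillar = ("A", ("B", ("C", "D")))
print(path_difference(balanced, caterpillar))
```

Both trees must carry the same leaf labels; the sketch does not check this.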
Related work
Nodal metric
$\text{Distance} = \sum_{x, y} \lvert D_{T_1}(x, y) - D_{T_2}(x, y) \rvert$, where $x, y$ are leaves.
In full binary trees, the complexity is $O(n^3)$. In complete binary trees, the complexity is $O(n^2 \log n)$.
John Bluis and Dong-Guk Shin, “Nodal Distance Algorithm: Calculating a Phylogenetic Tree Comparison Metric,” Proc. of the 3rd IEEE Symposium on BioInformatics and BioEngineering, 87-94, 2003.
Related work
Matching distance
P. W. Diaconis and S. P. Holmes, “Matchings and Phylogenetic Trees,” Proc. Natl. Acad. Sci. USA, Vol. 95, No. 25, pp. 14600-14602, 1998.
The algorithm for matching distance
G. Valiente, “A Fast Algorithmic Technique for Comparing Large Phylogenetic Trees,” SPIRE, pp. 370-375, 2005.
Matching Representation
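[Figure: a rooted binary tree with leaves labeled 1 to 6 and internal nodes labeled 7 to 11, the root being 11. Each pair of sibling nodes forms one pair of the matching.]
Matching: {1,2} {5,6} {3,7} {4,8} {9,10}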
Matching distance
T1: {1,2} {5,6} {3,7} {4,8} {9,10}
T2: {1,3} {4,6} {2,7} {5,8} {9,10}
The distance is 2.
[Figure: the trees T1 and T2 that correspond to the two matchings, each with leaves 1 to 6 and internal nodes 7 to 11.]
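The value 2 in this example can be reproduced with a short sketch. It assumes that the matching distance counts the minimum number of transpositions (swaps of two labels) needed to turn one perfect matching into the other. Under that assumption, the distance equals the number of pairs minus the number of cycles formed when the pairs of both matchings are overlaid. This counting rule is a general property of perfect matchings, not necessarily the procedure used in the cited algorithm, and the function name is hypothetical.

```python
def matching_distance(m1, m2):
    """Minimum number of label swaps that turn matching m1 into matching m2.

    Assumption: m1 and m2 are perfect matchings on the same labels,
    given as iterables of 2-element pairs.
    """
    partner1, partner2 = {}, {}
    for a, b in m1:
        partner1[a], partner1[b] = b, a
    for a, b in m2:
        partner2[a], partner2[b] = b, a
    if partner1.keys() != partner2.keys():
        raise ValueError("both matchings must cover the same labels")

    # Walk alternately along a pair of m1 and a pair of m2.
    # Every closed walk is one cycle; a shared pair is a cycle by itself.
    seen = set()
    cycles = 0
    for start in partner1:
        if start in seen:
            continue
        cycles += 1
        node = start
        while node not in seen:
            seen.add(node)
            mate = partner1[node]
            seen.add(mate)
            node = partner2[mate]
    return len(partner1) // 2 - cycles

t1 = [(1, 2), (5, 6), (3, 7), (4, 8), (9, 10)]
t2 = [(1, 3), (4, 6), (2, 7), (5, 8), (9, 10)]
print(matching_distance(t1, t2))  # 2
```

Each label is visited once, so the sketch runs in time linear in the number of labels.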
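Mixture distance and algorithms
Definition:
$p_{T_i}(x, y)$ is the time parameter of the lowest common ancestor (LCA) of leaves $x$ and $y$ in $T_i$.
$\text{Distance} = \sum_{x, y} \lvert p_{T_1}(x, y) - p_{T_2}(x, y) \rvert$, where $x, y$ are leaves.
[Figure: two mixture trees on leaves A, B, C, D with internal nodes v1, v2, v3. The time parameters are 9 at the root and 1, 3 below it in the first tree, and 9 at the root and 2, 3 below it in the second tree.]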
Distance conditions
The distance from an object to itself is zero.
The distance from A to B is the same as the distance from B to A.
The Triangle Inequality holds true.
- J. Felsenstein, Inferring phylogenies. Sunderland, MA: Sinauer Associates, 2004.
Algorithm
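There are C(n, 2) pairs of leaves to compare.
Algorithmic idea: grouping
Full binary tree
[Figure: two mixture trees on leaves A, B, C, D, each with internal nodes v1, v2, v3. In one tree the root has time parameter 9 and its two children have time parameters 1 (above A, B) and 3 (above C, D). In the other tree the time parameters are 8, 4, 1 from the root downward: the LCA of A and any other leaf has time 8, the LCA of B and either C or D has time 4, and the LCA of C and D has time 1.]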
AB: |8 – 1| = 7
AC: |8 – 9| = 1
AD: |8 – 9| = 1
BC: |4 – 9| = 5
BD: |4 – 9| = 5
CD: |1 – 3| = 2
Distance = 21
$\text{Distance} = \sum_{x, y} \lvert p_{T_1}(x, y) - p_{T_2}(x, y) \rvert$, where $x, y$ are leaves.
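The pairwise computation above can be checked with a direct sketch that visits all C(n, 2) leaf pairs. The tree representation is an assumption made for illustration: an internal node is a tuple `(time, left, right)` and a leaf is a string.

```python
def lca_times(tree):
    """Return {frozenset({x, y}): time parameter of the LCA of leaves x, y}."""
    times = {}

    def visit(node):
        # Returns the list of leaves below `node`.
        if isinstance(node, str):
            return [node]
        time, left, right = node
        left_leaves, right_leaves = visit(left), visit(right)
        # A leaf from the left and a leaf from the right meet at `node`.
        for x in left_leaves:
            for y in right_leaves:
                times[frozenset((x, y))] = time
        return left_leaves + right_leaves

    visit(tree)
    return times

def mixture_distance(t1, t2):
    """Sum of |p_T1(x, y) - p_T2(x, y)| over all pairs of leaves."""
    p1, p2 = lca_times(t1), lca_times(t2)
    return sum(abs(p1[pair] - p2[pair]) for pair in p1)

# The two four-leaf trees of the example.
caterpillar = (8, "A", (4, "B", (1, "C", "D")))
balanced = (9, (1, "A", "B"), (3, "C", "D"))
print(mixture_distance(caterpillar, balanced))  # 21
```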
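Algorithm
[Figure: T1 has leaves A, B, C, D, E, F, G, H, internal nodes v1 to v7, and time parameters 9 (root), 7 and 8 (second level), and 2, 3, 4, 5 (lowest internal level). T2 has its leaves drawn in the order H, G, F, A, B, C, D, E, internal nodes v1 to v7, and time parameters 9 (root), 6 and 8 (second level), and 1, 3, 4, 5 (lowest internal level).]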
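[Figure: processing the root v1 of T1. The leaves below one child of v1 are colored red and the leaves below the other child are colored green. Every internal node of T2 is annotated with the numbers of red and green leaves it receives from its children.]
$\lvert p_{T_1}(v_1) - p_{T_2}(v_6) \rvert \times (1 \times 1 + 0 \times 0) = \lvert 9 - 4 \rvert \times (1 \times 1 + 0 \times 0) = 5$
$\lvert p_{T_1}(v_1) - p_{T_2}(v_7) \rvert \times (0 \times 0 + 1 \times 1) = \lvert 9 - 5 \rvert \times (0 \times 0 + 1 \times 1) = 4$
$\lvert p_{T_1}(v_1) - p_{T_2}(v_3) \rvert \times (1 \times 1 + 1 \times 1) = \lvert 9 - 8 \rvert \times (1 \times 1 + 1 \times 1) = 2$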
[Figure: processing v2 of T1. The leaves below one child of v2 are colored red and the leaves below the other child are colored green. Every internal node of T2 is annotated with the numbers of red and green leaves it receives from its children.]
$\lvert p_{T_1}(v_2) - p_{T_2}(v_2) \rvert \times (2 \times 0 + 0 \times 0) = \lvert 7 - 6 \rvert \times (2 \times 0 + 0 \times 0) = 0$
$\lvert p_{T_1}(v_2) - p_{T_2}(v_3) \rvert \times (0 \times 1 + 0 \times 1) = \lvert 7 - 8 \rvert \times (0 \times 1 + 0 \times 1) = 0$
$\lvert p_{T_1}(v_2) - p_{T_2}(v_1) \rvert \times (2 \times 2 + 0 \times 0) = \lvert 7 - 9 \rvert \times (2 \times 2 + 0 \times 0) = 8$
Complexity analysis
For every internal node of T1, coloring all leaves takes O(n) time.
Counting the distance contribution in T2 takes O(n) time.
T1 has O(n) internal nodes, so the total time complexity is $O(n^2)$.
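The coloring procedure analyzed above can be sketched as follows. This is an illustrative reading of the slides, not the author's implementation. It assumes binary trees in which an internal node is a tuple `(time, left, right)` and a leaf is a string, and all function names are hypothetical. For each internal node of T1, the leaves of its left subtree are colored red and those of its right subtree green. A single bottom-up pass over T2 then counts, at every internal node w, the red-green pairs whose LCA is w, namely red(left) × green(right) + green(left) × red(right).

```python
def leaves(node, out=None):
    """Collect the leaf labels below `node` in linear time."""
    out = [] if out is None else out
    if isinstance(node, str):
        out.append(node)
    else:
        leaves(node[1], out)
        leaves(node[2], out)
    return out

def internal_nodes(node):
    """Yield every internal node (time, left, right) of the tree."""
    if not isinstance(node, str):
        yield node
        yield from internal_nodes(node[1])
        yield from internal_nodes(node[2])

def colored_contribution(t2, time1, red, green):
    """Sum of |time1 - p_T2(w)| * (number of red-green pairs with LCA w)."""
    total = 0

    def visit(node):
        # Returns (red count, green count) of the leaves below `node`.
        nonlocal total
        if isinstance(node, str):
            return int(node in red), int(node in green)
        time2, left, right = node
        r_left, g_left = visit(left)
        r_right, g_right = visit(right)
        pairs = r_left * g_right + g_left * r_right
        total += abs(time1 - time2) * pairs
        return r_left + r_right, g_left + g_right

    visit(t2)
    return total

def mixture_distance_by_coloring(t1, t2):
    distance = 0
    for time1, left, right in internal_nodes(t1):
        red, green = set(leaves(left)), set(leaves(right))
        distance += colored_contribution(t2, time1, red, green)
    return distance

# The four-leaf example from the "Algorithm" slide.
caterpillar = (8, "A", (4, "B", (1, "C", "D")))
balanced = (9, (1, "A", "B"), (3, "C", "D"))
print(mixture_distance_by_coloring(caterpillar, balanced))  # 21
```

Every pair of leaves is separated at exactly one internal node of T1, its LCA, so each pair is counted once.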
The modified algorithm
Speed up the basic algorithm
The basic algorithm carries too much empty color information
Empty color information
[Figure: the coloring step for v2 of T1, repeated from the basic algorithm. Most internal nodes of T2 receive a zero count for at least one color.]
The terms for v2 and v3 of T2 are 0. Only the term for v1 of T2 is nonzero: $\lvert p_{T_1}(v_2) - p_{T_2}(v_1) \rvert \times (2 \times 2 + 0 \times 0) = \lvert 7 - 9 \rvert \times (2 \times 2 + 0 \times 0) = 8$
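[Figure: T2 (leaves drawn in the order H, G, F, A, B, C, D, E; time parameters 9, then 6 and 8, then 1, 3, 4, 5) next to a reduced version of T2 that keeps only the leaves A, B, C, D and the internal nodes v1, v3, v4, with time parameters 9, 8, 1.]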
The modified algorithm
Finding the LCA in constant time with O(n) preprocessing
M. A. Bender and M. Farach-Colton, “The LCA Problem Revisited,” Proc. LATIN, 2000.
2-way merge problem
R. C. T. Lee, S. S. Tseng, R. C. Chang and Y. T. Tsai, Introduction to the Design and Analysis of Algorithms. McGraw-Hill Education, 2005.
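As an illustration of constant-time LCA queries, the sketch below uses an in-order traversal and a sparse table. This is a simpler variant than the cited method: queries take O(1) time, but preprocessing takes O(n log n) rather than O(n). The tree representation (internal node `(time, left, right)`, leaf as a string) and the name `build_lca` are assumptions made for this example.

```python
def build_lca(tree):
    """Preprocess a binary tree; return lca_time(x, y) for distinct leaves."""
    depths, times, first = [], [], {}

    def visit(node, depth):
        # In-order traversal: the LCA of two leaves is the shallowest
        # entry between their positions in the sequence.
        if isinstance(node, str):
            first[node] = len(depths)
            depths.append(depth)
            times.append(None)
            return
        time, left, right = node
        visit(left, depth + 1)
        depths.append(depth)
        times.append(time)
        visit(right, depth + 1)

    visit(tree, 0)

    # table[k][i] = index of the minimum depth in positions i .. i + 2**k - 1.
    n = len(depths)
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if depths[a] <= depths[b] else b)
        table.append(row)
        k += 1

    def lca_time(x, y):
        lo, hi = sorted((first[x], first[y]))
        k = (hi - lo + 1).bit_length() - 1
        # Two overlapping blocks of length 2**k cover the range lo .. hi.
        a, b = table[k][lo], table[k][hi - (1 << k) + 1]
        return times[a if depths[a] <= depths[b] else b]

    return lca_time

# Illustrative tree (not taken from the slides).
query = build_lca((8, "A", (4, "B", (1, "C", "D"))))
print(query("B", "C"))  # 4
```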
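[Figure: T1 (leaves A to H; time parameters 9, then 6 and 8, then 1, 3, 4, 5) with its 15 nodes numbered 1 to 15 in postorder, so that its leaves receive the numbers 1, 2, 4, 5, 8, 9, 11, 12. T2 (time parameters 9, then 7 and 8, then 2, 3, 4, 5) has each leaf annotated with that leaf's number in T1.]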
[Figure: T1 with its nodes numbered 1 to 15, and T2 with each leaf annotated by its T1 number. The lowest internal nodes of T2 hold the sorted leaf-number lists (1, 2), (11, 12), (5, 8), (4, 9).]
For v4: $\lvert 1 - 2 \rvert \times (1 \times 1 + 0 \times 0) = 1$
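[Figure: the sorted leaf-number lists are merged bottom-up in T2.]
Lowest internal nodes: (1, 2), (11, 12), (5, 8), (4, 9)
Second level: (1, 2, 11, 12), (4, 5, 8, 9)
Root: (1, 2, 4, 5, 8, 9, 11, 12)
[Figure: the subtree of T1 reconstructed from the leaf numbers 1, 2, 11, 12 (leaves A, B, G, H). It contains v4 (node 3, time parameter 1), v7 (node 13, time parameter 5), and v1 (node 15, time parameter 9).]
$\lvert 9 - 7 \rvert \times (2 \times 2 + 0 \times 0) = 8$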
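The list merging shown on this slide is the classic 2-way merge of two sorted lists. A minimal sketch follows, with a hypothetical function name; it reproduces the leaf-number lists above.

```python
def merge_sorted(left, right):
    """Merge two sorted lists into one sorted list in linear time."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of the two lists still has elements left.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

second_level = [merge_sorted([1, 2], [11, 12]), merge_sorted([5, 8], [4, 9])]
root = merge_sorted(*second_level)
print(second_level)  # [[1, 2, 11, 12], [4, 5, 8, 9]]
print(root)          # [1, 2, 4, 5, 8, 9, 11, 12]
```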
Complexity analysis
Reconstructing a subtree of T1 takes linear time.
Counting the distance in a reconstructed subtree takes O(m) time, where m is the size of that subtree.
The height of a complete binary tree is O(log n).
The total complexity is O(n log n) for complete binary trees.
Mixture-matching distance
$\text{Distance} = 1 - P_{T_n} / P_{T_m} + i$, for $P_{T_n} \le P_{T_m}$ and $n, m \in \{1, 2\}$
$i$ is the matching distance between $T_1$ and $T_2$.
$P_{T_m}$ denotes the product of all time parameters in $T_m$.
[Figure: the two eight-leaf trees T1 and T2 of the running example, with all 15 nodes of each tree numbered for the matching representation.]
T1: {1, 2} {3, 4} {5, 6} {7, 8} {9, 10} {11, 12} {13, 14}
T2: {1, 2} {3, 6} {4, 5} {7, 8} {9, 12} {10, 11} {13, 14}
$P_{T_1} = 60480$
$P_{T_2} = 25920$
Distance = 1 - (25920 / 60480) + 2 ≈ 2.571
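The arithmetic of this example can be reproduced with a short sketch. It assumes positive time parameters and takes the matching distance i as an input; the function name is hypothetical. The two lists are the time parameters of the eight-leaf example trees, whose products are 60480 and 25920.

```python
from math import prod

def mixture_matching_distance(times_a, times_b, matching_dist):
    """1 - (smaller product / larger product) + matching distance.

    Assumption: all time parameters are positive, so the ratio lies in (0, 1].
    """
    p_a, p_b = prod(times_a), prod(times_b)
    smaller, larger = min(p_a, p_b), max(p_a, p_b)
    return 1 - smaller / larger + matching_dist

distance = mixture_matching_distance(
    [9, 7, 8, 2, 3, 4, 5],  # product 60480
    [9, 6, 8, 1, 3, 4, 5],  # product 25920
    2,                      # matching distance i from the two matchings
)
print(round(distance, 3))  # 2.571
```

With identical time parameters and i = 0 the sketch returns 0.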
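[Figure: the range of the distance on a scale from 0 to ∞. The distance is 0 when the trees are the same, stays below 1 when no leaves differ, and grows by i for i transpositions.]
$\text{Distance} = 1 - P_{T_n} / P_{T_m} + i$, for $P_{T_n} \le P_{T_m}$ and $n, m \in \{1, 2\}$
Distance = 1 - (25920 / 60480) + 2 ≈ 2.571
The time complexity is O(n).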
Conclusions
Metric                      Consideration                   Time complexity        Time complexity
                                                            (full binary tree)     (complete binary tree)
Path difference metric      Structure                       N/A                    N/A
Nodal distance              Structure                       O(n^3)                 O(n^2 log n)
Mixture distance            Structure and time parameter    O(n^2)                 O(n log n)
Matching distance           Structure                       O(n)                   O(n)
Mixture-matching distance   Structure and time parameter    O(n)                   O(n)
Future work
Improve the time complexity
Extend to k-ary trees
Add mutation points
Thank you for listening.