Greedy method for inferring tandem duplication history

Louxin Zhang, Bin Ma, Lusheng Wang and Ying Xu. BIOINFORMATICS 2003

References:

1. Elemento, O. (2002) Reconstructing the duplication history of tandemly repeated genes, Mol. Biol. Evol.

2. Tang, M., Waterman, M. (2001) Zinc finger gene clusters and tandem gene duplication, RECOMB

Presenters: r92922054 李明翰, b88506020 黃寶萱, b90902020 蔡明潔

Outline

Duplication model

Constructing duplication model from phylogeny
  • Double duplication model
  • Arbitrary duplication model

Discussion

Duplication

A duplication replaces a stretch of DNA containing several repeats with two identical and adjacent copies of itself.

If the stretch contains k repeats, the duplication is called a k-duplication.
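
As an illustration only, a k-duplication can be sketched on a plain Python list of repeat labels. The function name `k_duplicate` and the 0-based `start` parameter are assumptions of this sketch, not notation from the paper.

```python
def k_duplicate(repeats, start, k):
    """Return a new list in which the stretch repeats[start:start+k]
    is replaced by two identical, adjacent copies of itself."""
    stretch = repeats[start:start + k]
    if k < 1 or len(stretch) != k:
        raise ValueError("stretch must contain exactly k repeats")
    return repeats[:start] + stretch + stretch + repeats[start + k:]


# A 2-duplication of the stretch (b, c) in the array a b c d.
print(k_duplicate(["a", "b", "c", "d"], 1, 2))
```

The copy is placed directly after the original stretch, so the result stays a tandem array.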

DM (duplication model)

A duplication model M for tandemly repeated sequences is a directed graph.

A duplication model contains nodes, edges and blocks.

Phylogeny & DM

Node & Edge

A node in DM represents a repeat.

A directed edge (u, v) indicates that v is a child of u. It also means that u is an ancestor of v.

Nodes are of three kinds: root, leaf, and internal node.

Block

A block in DM represents a duplication.

Each internal node appears in a unique block.

No node is an ancestor of another in a block.

We draw a block representing a k-duplication only when k ≥ 2.

Block (Cont.)

lc(v) denotes the left child of v; rc(v) denotes the right child of v.

If a block corresponds to a k-duplication, then it contains k nodes v1, v2, …, vk from left to right.

Their 2k children then appear from left to right in the order

lc(v1), lc(v2), …, lc(vk), rc(v1), rc(v2), …, rc(vk)

Cont.

Hence, for any i and j with 1 ≤ i < j ≤ k, the edges (vi, rc(vi)) and (vj, lc(vj)) cross each other.

The left-to-right order of leaves in the model is identical to the order of the sequences on a chromosome.

Example

lc(v1),lc(v3),lc(v4),rc(v1),rc(v3),rc(v4).

An ordered phylogenetic tree for sequences {1, 2, …, n} is a rooted phylogeny whose leaves are listed from left to right in increasing order.

LEMMA 1:

l*c(u) and r*c(u) denote the leftmost and the rightmost leaf, respectively, in the subtree TM(u) rooted at u.

For each internal node u in TM,

r*c(u) > r*c(lc(u)) and l*c(u) < l*c(rc(u)).

Hence l*c(u) and r*c(u) are the smallest and the biggest labels in the subtree TM(u).

Constructing a duplication model from a phylogeny

Features:

A duplication model M has a unique associated phylogeny TM.

A phylogeny is not necessarily associated with a duplication model.

Problem: Reconstruct the duplication model M in linear time

Input: a phylogeny T

Output: the duplication model M with which T is associated, if one exists

Problem (Cont.)

Note: To represent a duplication model, we only need to list all non-single duplication blocks on the associated phylogeny:

[V1, V3, V4] [V5, V6] [V7, V8]

Double duplication models

Given a phylogeny T on a sequence family F = {1, 2, …, n}, associate a pair (Lv, Rv) of indices with each node v in T as follows:

1. The i th leaf node: (Lv,Rv) = (i, i)

2. The internal node: (Lv,Rv) = (l*c(v), r *c(v))

[Figure: example phylogeny on leaves 1 to 10 with each node labeled (Lv, Rv): r (1,10), V1 (1,6), V2 (2,10), V3 (2,9), V4 (3,10), V5 (2,4), V6 (3,5), V7 (7,9), V8 (8,10); each leaf i is labeled (i, i).]

Computing (Lv, Rv) bottom up

Lv = min {L_lc(v), L_rc(v)}

Rv = max {R_lc(v), R_rc(v)}

The pairs are computed recursively, bottom up. Since T contains 2n − 1 nodes, this takes linear time.

Constructing DDM from phylogeny

Double duplication model (DDM): a duplication model in which every duplication is a 1-duplication or a 2-duplication.

By Lemma 1, the leftmost and rightmost leaves in T are 1 and n, respectively.

Where is leaf 2 located? Leaf 2 must be right next to leaf 1 in the DDM.

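Let v_0 = r, v_1, v_2, …, v_{p−1}, v_p = 1 be the leftmost path from the root r to leaf 1.

Let u_1 = rc(v_i), u_2, …, u_{q−1}, u_q = 2 be the leftmost path leading to leaf 2, starting at u_1 = rc(v_i), where q ≥ p − i.

LEMMA 2. M must contain p − i − 1 double duplications

[v_{i+1}, u_{j_1}], [v_{i+2}, u_{j_2}], …, [v_{p−1}, u_{j_{p−i−1}}].

In the figure's example: i = 2, p = 5, q = 6.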

LEMMA 2. (Cont.)

Since 1 ≤ j_1 < j_2 < … < j_{p−i−1} ≤ q − 1, we have q ≥ p − i.

PROOF. If vi+k does not belong to a double duplication block in M, the leaf labeled with 2 cannot be placed before the leftmost leaf in the subtree rooted at rc(vi+k), contradicting the fact that 2 is right next to 1 in M. Hence, vi+k must appear in a double duplication block for each k, 1 ≤ k ≤ p − i − 1. This finishes the proof.

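Note: R_{u_1} > R_{u_2} > … > R_{u_{q−1}} > R_{u_q} and

R_{v_{i+1}} > R_{v_{i+2}} > … > R_{v_{p−1}}.

For each block [v_{i+k}, u_{j_k}], R_{v_{i+k}} lies between R_{u_{j_k}} and R_{u_{j_k+1}}.

Hence we can determine all the u_{j_k}'s with p − i + q ≤ 2q comparisons, by merging the two decreasing lists.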
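
The merge step can be sketched with two pointers over the two decreasing lists of R values. The function name `locate_partners` and its list-based interface are assumptions of this sketch, and it checks only the R-value condition stated on this slide, not every requirement of the full algorithm.

```python
def locate_partners(r_v, r_u):
    """Find the indices j_1 < j_2 < ... (1-based) for the double blocks.

    r_v: R values of v_{i+1}, ..., v_{p-1}, strictly decreasing.
    r_u: R values of u_1, ..., u_q, strictly decreasing.
    For each k we need R_{u_{j_k}} > R_{v_{i+k}} > R_{u_{j_k + 1}}.
    Returns the list of j_k, or None if some v has no valid partner.
    """
    partners = []
    j = 0                                    # 0-based position in r_u
    for r in r_v:
        # Slide down the u-path while the next u still has a larger R.
        while j + 1 < len(r_u) and r_u[j + 1] > r:
            j += 1
        if j + 1 >= len(r_u) or not (r_u[j] > r > r_u[j + 1]):
            return None
        partners.append(j + 1)               # report the 1-based index j_k
        j += 1                               # each u joins at most one block
    return partners


# R values read off the example figure: v-path node V1 has R = 6;
# the u-path V2, V3, V5, leaf 2 has R = 10, 9, 4, 2.
print(locate_partners([6], [10, 9, 4, 2]))
```

The pointer `j` only moves forward, so the total work is proportional to the combined length of the two lists, in line with the p − i + q bound. In the example, V1 is paired with u_2 = V3.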

After all the duplication blocks [vi+k , u jk ] are placed on T , the leaf 2 should be right next to the leaf 1

Derive a rooted binary tree T’’ from the subtree T(u_1) by

inserting a new node v’_k in the edge (u_{j_k}, u_{j_k+1}) for each 1 ≤ k ≤ p − i − 1, and

assigning the subtree T(rc(v_{i+k})) rooted at rc(v_{i+k}) as the right subtree of v’_k.

Note: the left child of v’_k is now u_{j_k+1}.

Then form the new phylogeny T’ from T by replacing the subtree T(v_i) with T’’.

Linear time (Analysis)

Since we can charge the comparisons made in different recursive steps to disjoint left paths in the input tree T, the whole algorithm takes at most 2 × 2n comparisons to determine all the duplication blocks. Hence it is a linear-time algorithm.

Each internal node is compared once while it lies on the u-path (the path next to the leftmost path, of length q) and once while it lies on the v-path (the leftmost path, of length p), each time using its (Lv, Rv) pair. Therefore, at most 2 × 2n comparisons are needed.

Arbitrary duplication models

Now, we generalize the above algorithm into arbitrary duplication models.

Again, we assume that the leftmost paths leading to leaf 1 and leaf 2 in T are v0 = r, v1, …, vp = 1 and u1 = rc(vi), u2, …, uq = 2, respectively, as in the double duplication case.

Observation:

Assume a phylogeny T is associated with a duplication model M. Then, there exist p − i − 1 double duplication blocks [vi+k , ujk ] (1≤k≤ p − i − 1) such that, after these duplications are placed in T , the leaf 2 is right next to the leaf 1. But, these double duplication blocks may not be in M.

Recall that there are two types of nodes on the leftmost path of T’. Some nodes are original ones in the input tree T ; some are inserted due to duplication blocks we have examined so far.

To extend the existing duplication blocks to larger ones, we associate a flag to each original node on the leftmost path of T’ , which indicates whether the node is in an existing duplication block or not.

Let x be an original node on the leftmost path P of T’ appearing in a duplication block [x1, x2, · · · , xt , x] of size t + 1 so far, then, there are t inserted nodes x’i right below x on the path P, which correspond to xi for i ≤ t.

To determine whether [x1, x2, · · · , xt , x] can be extended to a larger duplication block in the model with which the original tree T is associated, we need to consider x and all the x’i (1 ≤ i ≤ t) simultaneously.

For this purpose, we introduce the concept of hyper-double (duplication) blocks.

We say that x and y form a hyper-double block [x, y] in T’ if the following three conditions hold:

(i) x is a node in some non-single duplication block that we have obtained so far;

(ii) neither x nor y is an ancestor of the other;

(iii) the block [x1, x2, · · · , xt , x] can be extended to a block [x1, x2, · · · , xt , x, y] of size t + 2 in the original tree T .

Hence, when we place a hyper-double block [x, y] in the current tree T’, the edge (y, lc(y)) crosses not only the edge (x, rc(x)), but also the edges (x’i, rc(x’i)), 1 ≤ i ≤ t.

So, we have that a phylogeny T is associated with a model if and only if:

(i) there exist p − i − 1 double duplication blocks [vi+k , ujk ] (1≤k≤p − i − 1) in T such that, after these duplication blocks are placed in T, leaf 2 is right next to leaf 1, and

(ii) T’ constructed above is associated with ‘a duplication model’ in which hyper-double duplication blocks are allowed.

To make the algorithm run in linear time, we refine the algorithm in two aspects.

First, we assign a pair (R’x, R”x) of indices to a node x on the leftmost path of T in each recursive step: if x is in a duplication block [x1, x2, · · · , xt , x] at the current stage, we set R’x = Rx1 and R”x = Rx, where the R values are as defined in Section 2.2.1 of the paper. Since R’x < Rxi < R”x for 2 ≤ i ≤ t, only R’x and R”x need to be examined to determine whether x is in a hyper-double block in the next step.

Secondly, if the duplication block [x1, x2, · · · , xt , x] is extended into a larger hyper-double block [x1, x2, · · · , xt , x, y] in a step, the binary tree T’ for the next step is constructed by inserting the right subtrees of the xi’s and x into the edge between y and its left child lc(y).

To do these insertions, we need to point the left child of x1 to lc(y), and then point the left child of y to x.

In this way, we are able to insert all the subtrees in only two pointer operations.

DS: [v1,v2][v3,v5][v8,v6]

DS: [v1,v2][v3,v5,v4][v8,v6]

DS: [v1,v2][v3,v5,v4,v7][v8,v6]