56
Dynamic-Programming Strateg ies for Analyzing Biomolecu lar Sequences Kun-Mao Chao ( 趙趙趙 ) Department of Computer Scienc e and Information Engineering National Taiwan University, T aiwan E-mail: [email protected] WWW: http://www.csie.ntu.edu.tw/~k mchao

Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

  • Upload
    dinesh

  • View
    30

  • Download
    0

Embed Size (px)

DESCRIPTION

Dynamic-Programming Strategies for Analyzing Biomolecular Sequences. Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: [email protected] WWW: http://www.csie.ntu.edu.tw/~kmchao. Dynamic Programming. - PowerPoint PPT Presentation

Citation preview

Page 1: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

Dynamic-Programming Strategies for Analyzing Biomolecular SequencesKun-Mao Chao (趙坤茂 )

Department of Computer Science and Information EngineeringNational Taiwan University, Taiwan

E-mail: [email protected]: http://www.csie.ntu.edu.tw/~kmchao

Page 2: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

2

Dynamic Programming• Dynamic programming is a class of

solution methods for solving sequential decision problems with a compositional cost structure.

• Richard Bellman was one of the principal founders of this approach.

Page 3: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

3

Two key ingredients• Two key ingredients for an optimization

problem to be suitable for a dynamic-programming solution:

Each substructure is optimal.

(Principle of optimality)

1. optimal substructures

2. overlapping subproblems

Subproblems are dependent.(otherwise, a divide-and-conquer approach is the choice.)

Page 4: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

4

Three basic components• The development of a dynamic-programming algorithm has three basic components:

– The recurrence relation (for defining the value of an optimal solution);– The tabular computation (for computing the value of an optimal solution);– The traceback (for delivering an optimal solution).

Page 5: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

5

Fibonacci numbers

.for 21

11

00

i>1iFiFiFFF

The Fibonacci numbers are defined by the following recurrence:

Page 6: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

6

How to compute F10?

F10

F9

F8

F8

F7

F7

F6

……

Page 7: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

7

Tabular computation• The tabular computation can avoid recompuation.

F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10

0 1 1 2 3 5 8 13 21 34 55

Page 8: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

8

Maximum-sum interval• Given a sequence of real numbers

a1a2…an , find a consecutive subsequence with the maximum sum.9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n2) time.

Page 9: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

9

O-notation: an asymptotic upper bound

• f(n) = O(g(n)) iff there exist two positive constant c and n0 such that 0 f(n) cg(n) for all n n0

cg(n)

f(n)

n0

Page 10: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

10

How functions grow?

30n 92n log n 26n2 0.68n3 2n

100 0.003 sec. 0.003 sec. 0.0026

sec. 0.68 sec. 4 x 1016

yr.

100,000 3.0 sec. 2.6 min. 3.0 days 22 yr.

For large data sets, algorithms with a complexity greater than O(n log n) are often impractical!

n

function

(Assume one million operations per second.)

Page 11: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

11

Maximum-sum interval(The recurrence relation)

• Define S(i) to be the maximum sum of the intervals ending at position i.

0

)1(max)(

iSaiS i

ai

If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.

Page 12: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

12

Maximum-sum interval(Tabular computation)

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum sum

Page 13: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

13

Maximum-sum interval(Traceback)9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7

The maximum-sum interval: 6 -2 8 4

Page 14: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

14

Two fundamental problems we recently solved (I) (joint work with Lin and Jiang)• Given a sequence of real numbers

of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm.U = 3

9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9

Page 15: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

15

Joint work with Huang, Jiang and Lin• Xiaoqiu Huang (黃曉秋 )Iowa State University, USA• Tao Jiang (姜濤 ) University of California Riverside, USA• Yaw-Ling Lin (林耀鈴 )Providence University, Taiwan

Page 16: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

16

Two fundamental problems we recently solved (II) (joint work with Lin and Jiang)• Given a sequence of real numbers

of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. --- an O(n log L)-time algorithm.

L = 4

3 2 14 6 6 2 10 2 6 6 14 2 1

Page 17: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

17

Another exampleGiven a sequence as follows:

2, 6.6, 6.6, 3, 7, 6, 7, 2and L = 2, the highest-average

interval is the squared area, which has the average value 20/3.

2, 6.6, 6.6, 3, 7, 6, 7, 2

Page 18: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

18

C+G rich regions• Our method can be used to locate

a region of length at least L with the highest C+G ratio in O(n log L) time.

ATGACTCGAGCTCGTCA

00101011011011010

Search for an interval of length at least L with the highest average.

Page 19: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

19

After I opened this problem in Academia Sinica in Aug. 2001 ...• JCSS (2002) – Lin, Jiang, Chao

– A linear-time algorithm for a maximum-sum segment with a lower bound L and an upper bound U.

– An O(n log L) time algorithm for a maximum-average segment with a lower bound L.

– Two open problems raised:• O(n log L) O(n)?• average (with an upper bound U)?

Page 20: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

20

After I opened this problem in Academia Sinica in Aug. 2001 ...• Comb. Math. (2002) – Yeh et. al

– An expected O(nL)-time algorithm• Comb. Math. (2002) – Lu

– A linear time algorithm for a maximum-average segment with a lower bound L in a binary sequence.

Page 21: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

21

After I opened this problem in Academia Sinica in Aug. 2001 ...• WABI (2002) & JCSS – Goldwasswer, Kao,

and Lu– O(n) for a lower bound L only– O(n log (U-L+1)) for L and U

• IPL (2003) – Kim– O(n) for a lower bound L only

(Reformulated as a Computational Geometry Problem)

Page 22: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

22

After I opened this problem in Academia Sinica in Aug. 2001 ...• Bioinformatics (2003) – Lin, Huang, Jiang,

Chao– A tool for finding k-best maximum-average seg

ments• Bioinformatics (2003) – Wang, Xu

– A tool for locating a maximum-sum segment with a lower bound L and an upper bound U

Page 23: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

23

After I opened this problem in Academia Sinica in Aug. 2001 ...• CIAA(2003) – Lu et al.

– An alternative linear-time algorithm for a maximum-sum segment (online; for finding repeats)

• ESA (2003) – Chung, Lu– O(n log (U-L+1)) O(n) for L and U

• ISAAC (2003) – Lin (R.R.), Kuo, Chao– O(nL) for a maximum-average path in a tree– O(n) for a complete tree

• Some were omitted here...• Some are coming soon...

Page 24: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

24

Length-unconstrained version

• Maximum-average interval

3 2 14 6 6 2 10 2 6 6 14 2 1

The maximum element is the answer. It can be done in O(n) time.

Page 25: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

25

Q: computing density in O(1) time?

• prefix-sum(i) = S[1]+S[2]+…+S[i],– all n prefix sums are computable in O(n) time.

• sum(i, j) = prefix-sum(j) – prefix-sum(i-1)• density(i, j) = sum(i, j) / (j-i+1)

prefix-sum(j)

i j

prefix-sum(i-1)

Page 26: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

26

A naive algorithm• A simple shift algorithm can compute the

highest-average interval of a fixed length in O(n) time

• Try L, L+1, L+2, ..., n. In total, O(n2).

Page 27: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

27

An O(nL)-time algorithm (Huang, CABIOS’ 94)• Observing that there exists an optimal interv

al of length bounded by 2L, we immediately have an O(nL)-time algorithm.

We can bisect a region of length >= 2L into two segments, where each of them is of length >= L.

Page 28: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

28

Good partners• Finding good partner g(i) for each i=1, 2, …, n-L+1.

L

g(i)

maximing avg[i, g(i)]

i + L

Page 29: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

29

Right-Skew Decomposition• Partition S into substrings S1,S2,…,Sk such that

– each Si is a right-skew substring of S• the average of any prefix is always less than or equal to

the average of the remaining suffix.– density(S1) > density(S2) > … > density(Sk)

• [Lin, Jiang, Chao]– Unique– Computable in linear time.

Page 30: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

30

Right-Skew Decomposition

1 1 0 1 1 0 1 0 1 1 0 0 1

1 2/3 3/5 1/3> >>

Lu’s illustration

Page 31: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

31

An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS’02)• Decreasingly right-skew decomposition

(O(n) time)

9 7 8 1 9 8 3 7 2 8

97.5 6

5

8 9 8 75

8

Page 32: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

32

Right-Skew pointers p[ ]

9 7 8 1 9 8 3 7 2 8

97.5 6

5

8 9 8 75

8

1 2 3 4 5 6 7 8 9 10

p[ ] 1 3 3 6 5 6 10 8 10 10

Page 33: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

33

Illustration

L

i+L g(i)

L

i+L

Lu’s illustration

Page 34: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

34

An O(n log L)-time algorithm

• Jumping tables that allows binary search.– O(log L) time for each position to find its good

partner, therefore the total running time is O(n log L).

• We also implemented a program that linearly scans right-skew segments for the good partnership. Our empirical tests showed that it ran in linear time empirically.

Page 35: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

35

Pointer-jumping tables

9 7 8 1 9 8 3 7 2 8

97.5 6

5

8 9 8 75

8

1 2 3 4 5 6 7 8 9 10

p(0)[ ] 1 3 3 6 5 6 10 8 10 10

p(1)[ ] 3 6 6 10 6 10 10 10 10 10

p(2)[ ] 10 10 10 10 10 10 10 10 10 10

}],1][[min{][][][

)()()1(

)0(

nippipipip

kkk

Page 36: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

36

Goldwasser, Kao, and Lu’s progress

• An O(n) time algorithm for the problem– Appeared in Proceedings of the Secon

d Workshop on Algorithms in Bioinformatics (WABI), Rome, Itay, Sep. 17-21, 2002.

– JCSS, accepted.

Page 37: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

37

A new important observation

• i < j < g(j) < g(i) implies– density(i, g(i)) is no more than density(j, g(j))

i g(i)j g(j)

Lu’s illustration

Page 38: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

38

i g(i)j g(j)

Lu’s illustration

Page 39: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

39

Searching for all g(i) in linear time

Lu’s illustration

Page 40: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

40

k non-overlapping maximum-average segments (Lin, Huang, Jiang, Chao, Bioinformatics)• Maintain all candidates in a priority queue.

• A new k-best algorithm + algorithms on trees (Lin and Chao (2003))

Page 41: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

41

Longest increasing subsequence(LIS)• The longest increasing subsequence is to find a

longest increasing subsequence of a given sequence of distinct integers a1a2…an .e.g. 9 2 5 3 7 11 8 10 13 6

2 3 7

5 7 10 13

9 7 11

3 5 11 13

are increasing subsequences.

are not increasing subsequences.

We want to find a longest one.

Page 42: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

42

A naive approach for LIS• Let L[i] be the length of a longest increasing

subsequence ending at position i.L[i] = 1 + max j = 0..i-1{L[j] | aj < ai}(use a dummy a0 = minimum, and L[0]=0)

9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 ?

Page 43: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

43

A naive approach for LIS

9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 4 5 6 3

L[i] = 1 + max j = 0..i-1 {L[j] | aj < ai}

The maximum length

The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence.

This method runs in O(n2) time.

Page 44: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

44

Binary search• Given an ordered sequence x1x2 ... xn, where

x1<x2< ... <xn, and a number y, a binary search finds the largest xi such that xi< y in O(log n) time.

n ...

n/2n/4

Page 45: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

45

Binary search• How many steps would a binary search

reduce the problem size to 1?n n/2 n/4 n/8 n/16 ... 1

How many steps? O(log n) steps.

nsn s

2log12/

Page 46: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

46

An O(n log n) method for LIS• Define BestEnd[k] to be the smallest number of an

increasing subsequence of length k.9 2 5 3 7 11 8 10 13 69 2 2

523

237

23711

2378

2378

10

2378

1013

BestEnd[1]

BestEnd[2]

BestEnd[3]

BestEnd[4]

BestEnd[5]

BestEnd[6]

Page 47: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

47

An O(n log n) method for LIS• Define BestEnd[k] to be the smallest number of an

increasing subsequence of length k.9 2 5 3 7 11 8 10 13 69 2 2

523

237

23711

2378

2378

10

2378

1013

23681013

BestEnd[1]

BestEnd[2]

BestEnd[3]

BestEnd[4]

BestEnd[5]

BestEnd[6]

For each position, we perform a binary search to update BestEnd. Therefore, the running time is O(n log n).

Page 48: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

48

Longest Common Subsequence (LCS)

• A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent.

• The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.

Page 49: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

49

LCSFor instance, Sequence 1: president Sequence 2: providence Its LCS is priden.

president

providence

Page 50: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

50

LCSAnother example: Sequence 1: algorithm Sequence 2: alignmentOne of its LCS is algm.

a l g o r i t h m

a l i g n m e n t

Page 51: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

51

How to compute LCS?• Let A=a1a2…am and B=b1b2…bn .• len(i, j): the length of an LCS between

a1a2…ai and b1b2…bj

• With proper initializations, len(i, j)can be computed as follows.

,. and 0, if)),1(),1,(max(

and 0, if1)1,1(,0or 0 if0

),(

ji

ji

bajijilenjilenbajijilen

jijilen

Page 52: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

52

p r o c e d u r e L C S - L e n g t h ( A , B ) 1 . f o r i ← 0 t o m d o l e n ( i , 0 ) = 0 2 . f o r j ← 1 t o n d o l e n ( 0 , j ) = 0 3 . f o r i ← 1 t o m d o 4 . f o r j ← 1 t o n d o

5 . i f ji ba t h e n

" "),(

1)1,1(),(jiprev

jilenjilen

6 . e l s e i f )1,(),1( jilenjilen

7 . t h e n

" "),(

),1(),(jiprev

jilenjilen

8 . e l s e

" "),(

)1,(),(jiprev

jilenjilen

9 . r e t u r n l e n a n d p r e v

Page 53: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

53

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e

0 0 0 0 0 0 0 0 0 0 0 0

1 p 2

0 1 1 1 1 1 1 1 1 1 1

2 r 0 1 2 2 2 2 2 2 2 2 2

3 e 0 1 2 2 2 2 2 3 3 3 3

4 s 0 1 2 2 2 2 2 3 3 3 3

5 i 0 1 2 2 2 3 3 3 3 3 3

6 d 0 1 2 2 2 3 4 4 4 4 4

7 e 0 1 2 2 2 3 4 5 5 5 5

8 n 0 1 2 2 2 3 4 5 6 6 6

9 t 0 1 2 2 2 3 4 5 6 6 6

Page 54: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

54

p r o c e d u r e O u tp u t - L C S ( A , p r e v , i , j )

1 i f i = 0 o r j = 0 t h e n r e t u r n

2 i f p r e v ( i , j ) = ” “ t h e n

iajiprevALCSOutput

print )1,1,,(

3 e l s e i f p r e v ( i , j ) = ” “ t h e n O u tp u t - L C S ( A , p r e v , i - 1 , j ) 4 e l s e O u tp u t - L C S ( A , p r e v , i , j - 1 )

Page 55: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

55

i j 0 1 p

2 r

3 o

4 v

5 i

6 d

7 e

8 n

9 c

10 e

0 0 0 0 0 0 0 0 0 0 0 0

1 p 2

0 1 1 1 1 1 1 1 1 1 1

2 r 0 1 2 2 2 2 2 2 2 2 2

3 e 0 1 2 2 2 2 2 3 3 3 3

4 s 0 1 2 2 2 2 2 3 3 3 3

5 i 0 1 2 2 2 3 3 3 3 3 3

6 d 0 1 2 2 2 3 4 4 4 4 4

7 e 0 1 2 2 2 3 4 5 5 5 5

8 n 0 1 2 2 2 3 4 5 6 6 6

9 t 0 1 2 2 2 3 4 5 6 6 6

Output: priden

Page 56: Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

56

Now try this example on your own

Sequence 1: algorithmSequence 2: alignmentTheir LCS?Recall that ...

a l g o r i t h m

a l i g n m e n t