Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

Embed Size (px)

DESCRIPTION

Dynamic-Programming Strategies for Analyzing Biomolecular Sequences. Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan E-mail: kmchao@csie.ntu.edu.tw WWW: http://www.csie.ntu.edu.tw/~kmchao. Dynamic Programming. - PowerPoint PPT Presentation

Text of Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

  • Dynamic-Programming Strategies for Analyzing Biomolecular SequencesKun-Mao Chao ()Department of Computer Science and Information EngineeringNational Taiwan University, Taiwan

    E-mail: kmchao@csie.ntu.edu.twWWW: http://www.csie.ntu.edu.tw/~kmchao

  • Dynamic ProgrammingDynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure. Richard Bellman was one of the principal founders of this approach.

  • Two key ingredientsTwo key ingredients for an optimization problem to be suitable for a dynamic-programming solution:Each substructure is optimal.(Principle of optimality)1. optimal substructures2. overlapping subproblemsSubproblems are dependent.(otherwise, a divide-and-conquer approach is the choice.)

  • Three basic componentsThe development of a dynamic-programming algorithm has three basic components:The recurrence relation (for defining the value of an optimal solution);The tabular computation (for computing the value of an optimal solution);The traceback (for delivering an optimal solution).

  • Fibonacci numbers The Fibonacci numbers are defined by the following recurrence:

  • How to compute F10

  • Tabular computationThe tabular computation can avoid recompuation.

    F0F1F2F3F4F5F6F7F8F9F10011235813213455

  • Maximum-sum intervalGiven a sequence of real numbers a1a2an , find a consecutive subsequence with the maximum sum.

    9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9

    For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n2) time.

  • O-notation: an asymptotic upper boundf(n) = O(g(n)) iff there exist two positive constant c and n0 such that 0 f(n) cg(n) for all n n0cg(n)f(n)n0

  • How functions grow?For large data sets, algorithms with a complexity greater than O(n log n) are often impractical!nfunction(Assume one million operations per second.)

    30n92n log n26n20.68n32n1000.003 sec.0.003 sec.0.0026 sec.0.68 sec.4 x 1016 yr.100,0003.0 sec.2.6 min.3.0 days22 yr.

  • Maximum-sum interval(The recurrence relation)Define S(i) to be the maximum sum of the intervals ending at position i.

    If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.

  • Maximum-sum interval(Tabular computation)9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7The maximum sum

  • Maximum-sum interval(Traceback)9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7The maximum-sum interval: 6 -2 8 4

  • Two fundamental problems we recently solved (I) (joint work with Lin and Jiang)Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm.U = 39 3 1 7 15 2 3 4 2 7 6 2 8 4 -9

  • Joint work with Huang, Jiang and LinXiaoqiu Huang () Iowa State University, USA Tao Jiang () University of California Riverside, USA

    Yaw-Ling Lin () Providence University, Taiwan

  • Two fundamental problems we recently solved (II) (joint work with Lin and Jiang)Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. --- an O(n log L)-time algorithm.L = 43 2 14 6 6 2 10 2 6 6 14 2 1

  • Another exampleGiven a sequence as follows: 2, 6.6, 6.6, 3, 7, 6, 7, 2and L = 2, the highest-average interval is the squared area, which has the average value 20/3.2, 6.6, 6.6, 3, 7, 6, 7, 2

  • C+G rich regionsOur method can be used to locate a region of length at least L with the highest C+G ratio in O(n log L) time.ATGACTCGAGCTCGTCA00101011011011010Search for an interval of length at least L with the highest average.

  • After I opened this problem in Academia Sinica in Aug. 2001 ...JCSS (2002) Lin, Jiang, ChaoA linear-time algorithm for a maximum-sum segment with a lower bound L and an upper bound U.An O(n log L) time algorithm for a maximum-average segment with a lower bound L.Two open problems raised:O(n log L) O(n)?average (with an upper bound U)?

  • After I opened this problem in Academia Sinica in Aug. 2001 ...Comb. Math. (2002) Yeh et. alAn expected O(nL)-time algorithmComb. Math. (2002) LuA linear time algorithm for a maximum-average segment with a lower bound L in a binary sequence.

  • After I opened this problem in Academia Sinica in Aug. 2001 ...WABI (2002) & JCSS Goldwasswer, Kao, and LuO(n) for a lower bound L onlyO(n log (U-L+1)) for L and UIPL (2003) KimO(n) for a lower bound L only (Reformulated as a Computational Geometry Problem)

  • After I opened this problem in Academia Sinica in Aug. 2001 ...Bioinformatics (2003) Lin, Huang, Jiang, ChaoA tool for finding k-best maximum-average segmentsBioinformatics (2003) Wang, XuA tool for locating a maximum-sum segment with a lower bound L and an upper bound U

  • After I opened this problem in Academia Sinica in Aug. 2001 ...CIAA(2003) Lu et al.An alternative linear-time algorithm for a maximum-sum segment (online; for finding repeats)ESA (2003) Chung, LuO(n log (U-L+1)) O(n) for L and UISAAC (2003) Lin (R.R.), Kuo, ChaoO(nL) for a maximum-average path in a treeO(n) for a complete treeSome were omitted here...Some are coming soon...

  • Length-unconstrained versionMaximum-average interval3 2 14 6 6 2 10 2 6 6 14 2 1The maximum element is the answer. It can be done in O(n) time.

  • Q: computing density in O(1) time?prefix-sum(i) = S[1]+S[2]++S[i],all n prefix sums are computable in O(n) time.sum(i, j) = prefix-sum(j) prefix-sum(i-1)density(i, j) = sum(i, j) / (j-i+1)prefix-sum(j)ijprefix-sum(i-1)

  • A naive algorithmA simple shift algorithm can compute the highest-average interval of a fixed length in O(n) time

    Try L, L+1, L+2, ..., n. In total, O(n2).

  • An O(nL)-time algorithm (Huang, CABIOS 94)Observing that there exists an optimal interval of length bounded by 2L, we immediately have an O(nL)-time algorithm.We can bisect a region of length >= 2L into two segments, where each of them is of length >= L.

  • Good partnersFinding good partner g(i) for each i=1, 2, , n-L+1.Lg(i)maximing avg[i, g(i)]i + L

  • Right-Skew DecompositionPartition S into substrings S1,S2,,Sk such thateach Si is a right-skew substring of Sthe average of any prefix is always less than or equal to the average of the remaining suffix.density(S1) > density(S2) > > density(Sk)[Lin, Jiang, Chao]UniqueComputable in linear time.

  • Right-Skew Decomposition12/33/51/3>>>Lus illustration

  • An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS02)Decreasingly right-skew decomposition (O(n) time) 9 7 8 1 9 8 3 7 2 897.565898758

  • Right-Skew pointers p[ ] 9 7 8 1 9 8 3 7 2 897.565898758 1 2 3 4 5 6 7 8 9 10p[ ] 1 3 3 6 5 6 10 8 10 10

  • IllustrationLi+Lg(i)Li+LLus illustration

  • An O(n log L)-time algorithmJumping tables that allows binary search.O(log L) time for each position to find its good partner, therefore the total running time is O(n log L). We also implemented a program that linearly scans right-skew segments for the good partnership. Our empirical tests showed that it ran in linear time empirically.

  • Pointer-jumping tables 9 7 8 1 9 8 3 7 2 897.565898758 1 2 3 4 5 6 7 8 9 10p(0)[ ] 1 3 3 6 5 6 10 8 10 10p(1)[ ] 3 6 6 10 6 10 10 10 10 10 p(2)[ ] 10 10 10 10 10 10 10 10 10 10

  • Goldwasser, Kao, and Lus progressAn O(n) time algorithm for the problemAppeared in Proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI), Rome, Itay, Sep. 17-21, 2002.JCSS, accepted.

  • A new important observationi < j < g(j) < g(i) impliesdensity(i, g(i)) is no more than density(j, g(j))Lus illustration

  • ig(i)jg(j)Lus illustration

  • Searching for all g(i) in linear timeLus illustration

  • k non-overlapping maximum-average segments (Lin, Huang, Jiang, Chao, Bioinformatics)Maintain all candidates in a priority queue. A new k-best algorithm + algorithms on trees (Lin and Chao (2003))

  • Longest increasing subsequence(LIS)The longest increasing subsequence is to find a longest increasing subsequence of a given sequence of distinct integers a1a2an .

    e.g. 9 2 5 3 7 11 8 10 13 63 77 10 137 113 5 11 13are increasing subsequences.are not increasing subsequences.We want to find a longest one.

  • A naive approach for LISLet L[i] be the length of a longest increasing subsequence ending at position i.L[i] = 1 + max j = 0..i-1{L[j] | aj < ai} (use a dummy a0 = minimum, and L[0]=0)9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 ?

  • A naive approach for LISL[i] = 1 + max j = 0..i-1 {L[j] | aj < ai}9 2 5 3 7 11 8 10 13 6L[i] 1 1 2 2 3 4 4 5 6 3The maximum lengthThe subsequence 2, 3, 7, 8, 10, 13 is a longe