A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices Maxime Crochemore Gad M. Landau Michal Ziv-Ukelson

  • A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices. Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson

  • Presentation: R89922024, B86202049, R90725054, R90922001, R90922091

  • Outline: Introduction and preliminaries. LZ-78. The basic concept. Global alignment. Local alignment. Proof for LZ-76. Proof for SMAWK algorithm.

  • LZ-78: aacgacga

  • Sample of LZ-78

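    ctacgaga: phrases 1 to 5 are c | t | a | cg | ag, followed by a trailing a

    aacgacga: phrases 1 to 4 are a | ac | g | acg, followed by a trailing a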
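The parsing shown on this slide can be sketched in a few lines. This is a minimal illustration, assuming the usual LZ-78 rule (each phrase is the shortest prefix of the remaining text that is not yet in the dictionary); the function name `lz78_parse` is hypothetical and not from the presentation.

```python
def lz78_parse(s):
    """Split s into LZ-78 phrases.

    Each phrase is the shortest prefix of the remaining text that has not
    appeared as an earlier phrase; a leftover at the end may repeat one.
    """
    seen = set()
    phrases = []
    current = ""
    for ch in s:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        phrases.append(current)  # trailing, possibly repeated, phrase
    return phrases


print(lz78_parse("ctacgaga"))
print(lz78_parse("aacgacga"))
```

Each pair of phrases, one from each sequence, defines one block of the alignment graph used in the following slides.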

  • Basic Concept

    Block grid: columns 0 to 4 over aacgacga (a | ac | g | acg | a); rows 1 to 5 over ctacgaga (c | t | a | cg | ag, then a)

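  • Basic Concept: I/O Propagation Across G; DIST matrix

    Columns: 0 1 2 3 4 5
    I0 = 0; DIST row: 0 -1 -2 -3
    I1 = 0; DIST row: -1 -1 -2 -1 -3
    I2 = 0; DIST row: -2 0 0 1 -1 -3
    I3 = 0; DIST row: -2 -2 0 -2 -2
    I4 = 0; DIST row: -2 0 -1 -1
    I5 = 0; DIST row: -2 -1 0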

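  • Basic Concept: Monge Property; DIST matrix. Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays.

    Columns: 0 1 2 3 4 5
    I0 = 0; DIST row: 0 -1 -2 -3
    I1 = 0; DIST row: -1 -1 -2 -1 -3
    I2 = 0; DIST row: -2 0 0 1 -1 -3
    I3 = 0; DIST row: -2 -2 0 -2 -2
    I4 = 0; DIST row: -2 0 -1 -1
    I5 = 0; DIST row: -2 -1 0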

  • Basic Concept: Totally Monotone. An important property of Monge arrays is that of being totally monotone. Both DIST and OUT matrices are totally monotone by the concave condition. Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which can compute in O(n) time all row and column maxima of an n × n totally monotone matrix, by querying only O(n) elements of the array.
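To make the Monge property concrete, here is a brute-force check, written as a sketch. It assumes the standard quadrangle inequality over all row pairs a < b and column pairs c < d; the function name `is_monge` and the labels "concave" (sum on the main diagonal is at least the sum on the anti-diagonal) and "convex" (at most) are assumptions about naming, not taken from the slides.

```python
from itertools import combinations


def is_monge(M, concave=True):
    """Brute-force test of the quadrangle inequality.

    concave=True  checks M[a][c] + M[b][d] >= M[a][d] + M[b][c]
    concave=False checks M[a][c] + M[b][d] <= M[a][d] + M[b][c]
    for every pair of rows a < b and every pair of columns c < d.
    """
    for a, b in combinations(range(len(M)), 2):
        for c, d in combinations(range(len(M[0])), 2):
            lhs = M[a][c] + M[b][d]
            rhs = M[a][d] + M[b][c]
            if concave and lhs < rhs:
                return False
            if not concave and lhs > rhs:
                return False
    return True


# Toy matrix (not from the slides): entries fall off with distance from the diagonal.
M = [[-(i - j) ** 2 for j in range(4)] for i in range(4)]
print(is_monge(M, concave=True), is_monge(M, concave=False))
```

The check costs O(n⁴) and is only meant for small examples; it plays no part in the sub-quadratic algorithm itself.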

  • DIST matrix

    I0 = 1; DIST row: 0 -1 -2 -3
    I1 = 2; DIST row: -1 -1 -2 -1 -2
    I2 = 3; DIST row: -2 0 0 1 -1 -3
    I3 = 2; DIST row: -2 -2 0 -2 -2
    I4 = 1; DIST row: -2 0 -1 -1
    I5 = 3; DIST row: -2 -1 0

  • OUT matrix

    10-1-2--1101-1-133420-1200200-13-13-1100-14-14-14123
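The relation between the input row I, the DIST matrix and the OUT matrix can be sketched as follows. This is a naive quadratic illustration, assuming OUT[i][j] = I[i] + DIST[i][j] and that output j is the maximum of column j of OUT; the function name `propagate`, the −∞ padding and the toy data are assumptions for illustration and are not the values on the slide. In the actual algorithm the column maxima are found by SMAWK rather than by scanning every entry.

```python
NEG_INF = float("-inf")


def propagate(I, DIST):
    """Return (OUT, O) where OUT[i][j] = I[i] + DIST[i][j]
    and O[j] is the maximum of column j of OUT."""
    t = len(I)
    width = len(DIST[0])
    OUT = [[I[i] + DIST[i][j] for j in range(width)] for i in range(t)]
    O = [max(OUT[i][j] for i in range(t)) for j in range(width)]
    return OUT, O


# Hypothetical 3 x 3 block; -inf marks input/output pairs with no connecting path.
I = [1, 2, 3]
DIST = [
    [0, -1, NEG_INF],
    [-1, 0, -1],
    [NEG_INF, -1, 0],
]
OUT, O = propagate(I, DIST)
print(OUT)
print(O)
```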

  • OUT matrix concave monotonicity:

    No new column maximum :(n + i + 1) * k-

    10-1-2--1101-1-133420-1200200-13-13-1100-14-14-14123

  • The New block

    Block grid: columns 0 to 4 over aacgacga (a | ac | g | acg | a); rows 1 to 5 over ctacgaga (c | t | a | cg | ag, then a)

  • Corresponding matrices

  • Maintaining Direct Access to DIST Columns: SMAWK; OUT matrix; DIST input; constant time; OUT matrix space: column data structure

  • Data Structure

    DIST(5,4)

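    -3 -1 1 0 0 -2

    -2 -2 0 -2 -2

    -1 -1 0 -2

    0 -1 -2

    -2 -1 -2 -1 -1

    -3 -2 -1 0

    Block grid: columns 0 to 4 over aacgacga (a | ac | g | acg | a); rows 1 to 5 over ctacgaga (c | t | a | cg | ag, then a)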

  • Construction

    DIST(5,4)

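    -3 -1 1 0 0 -2

    -2 -2 0 -2 -2

    -1 -1 0 -2

    0 -1 -2

    -2 -1 -2 -1 -1

    -3 -2 -1 0

    Block grid: columns 0 to 4 over aacgacga (a | ac | g | acg | a); rows 1 to 5 over ctacgaga (c | t | a | cg | ag, then a)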

  • Time and Space Complexity: new column; DIST vector (DIST matrix column)

    SMAWK: DIST (input), output maxima

    O(t)

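  • Total complexity

    O(hn²/log n), where h ≤ 1 is the entropy of the sequence

    Block grid: columns 0 to 4 over aacgacga (a | ac | g | acg | a); rows 1 to 5 over ctacgaga (c | t | a | cg | ag, then a)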

  • Sub-Quadratic Local Alignment. Eric, Yu En Lu, Information Management Dept., National Taiwan University

  • Sub-Quadratic Global Alignment: Exploits the redundancy (self-repetition) within the sequences that Lempel-Ziv compression exposes to obtain the sub-quadratic bound.

  • Sub-Quadratic Local Alignment: Requires additional knowledge of where a locally optimal alignment starts and ends. Because the algorithm works on a per-block basis, we have to compute additional information specific to each block and then use it as the cue to the final score.

  • Additional Information: I, S[i], C, E[k]. F = max{ max over i = 0..t of (I[i] + E[i]), C }
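The formula for F can be read off directly in code. The sketch below assumes I[i] is the best score arriving at input i of a block, E[i] is the best score of a path that enters at input i and ends inside the block, and C is the best score of a path contained entirely in the block; the function name `block_best_score` and the sample numbers are hypothetical.

```python
def block_best_score(I, E, C):
    """F = max( max_i (I[i] + E[i]), C ) for one block."""
    best_through_input = max(i_val + e_val for i_val, e_val in zip(I, E))
    return max(best_through_input, C)


# Hypothetical values for a block with three inputs.
print(block_best_score(I=[0, 2, 1], E=[3, 0, 4], C=4))
```

The overall local alignment score is then the largest F over all blocks.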

  • Algorithm Body

    Given: DIST of G
    Encoding: compute values of E; compute values of S; compute values of C
    Propagation: compute values of O (modified from the O in global alignment); compute F
    Seek Highest Score: find the highest score F

  • Back-Tracking the Exact Path: Global Alignment; Local Alignment. Given the block with the max F value, we trace its path by recursively following the max{lp, tp, dia} block until the score reaches 0.

  • Time/Space Analysis

    Encoding
      E: max{E[i]lp, E[i]tp, DIST[i, lc]}: O(t)
      S: (all others can be copied, except..) S[lr,lc] = max{S[lr-1,lc] + W, S[lr,lc-1] + W, S[lr-1,lc-1] + W}: O(t)
      C: max{Clp, Ctp, S[lc]}: O(1)
    Propagation
      O[i] = max{O[i], S[i]}: O(t)
      F = max{ max over i = 0..t of (I[i] + E[i]), C }: O(t)
    Find F: O(hn²/log² n)
    Total Complexity: O(hn²/log n)

  • Further Improvements: Efficient alignment storage algorithm, conditioned on discrete weights. Gives a minimal encoding of DIST (O(t) → O(1)) for G. Thus we obtain O(hn²/(log n)²) storage complexity in the Global-Alignment problem, while the time complexity is O(hn²/log n).

  • Now, we are going to have presentations on SMAWK & LZ-76. Thank you!

  • The Maximum Number of Distinct Words. Speaker: Emory Chang. Date: 2002/1/31

  • What is a Distinct Word? Ex. (LZ78): A = {0,1}, α = |A| = 2, S = 0101000, n = |S| = 7. S is parsed as 0 | 1 | 01 | 00 | 0.

    We have four distinct words, and five steps to generate the sequence.

  • Notation

    A: the alphabet
    α: the size of the alphabet
    S: a sequence over A
    n: the length of S
    C(S): production complexity of S
    N: the maximum possible number of distinct words

  • The upper bound: Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct words. For every :

  • Special case (1/2): Let N denote the maximum possible number of distinct words. Clearly C(S) < N+1 (with the possible exception of the last word). Consider the special case in which the sequence is formed by all distinct words of length one, two, …, k. Ex: S = 0 1 00 01 10 11 (k = 2), so n = 1·2 + 2·2² = 10.
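The special case is easy to reproduce for small parameters. The sketch below builds the sequence made of every word of length 1 through k over a given alphabet and reports its length n and the number of distinct words N; the function name `all_words_sequence` is hypothetical.

```python
from itertools import product


def all_words_sequence(alphabet, k):
    """Concatenate every word of length 1..k over `alphabet`, shortest first.

    Returns the sequence S and the list of words it was built from.
    """
    words = [
        "".join(w)
        for length in range(1, k + 1)
        for w in product(alphabet, repeat=length)
    ]
    return "".join(words), words


S, words = all_words_sequence("01", 2)
print(S, len(S), len(words))
```

For the binary alphabet and k = 2 this gives the sequence on the slide: six distinct words and n = 1·2 + 2·4 = 10 symbols. In general n is the sum of i·αⁱ and N the sum of αⁱ for i = 1..k.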

  • Special case(2/2)The length of symbols at level iThe number of nodes at level i

  • General CaseThe length of level k+1=k+1Level k+1

  • Proof:SinceWe haveThereforefromfrom

  • SMAWK: A Linear Time Algorithm for the Maximum Problem on Wide Totally Monotone Matrices

  • Definition: Let A be an n×m matrix with real entries. A^j denotes the jth column of A and A_i denotes the ith row of A. A[i1,…,ik; j1,…,jk] denotes the submatrix of A formed by those rows and columns. Let j(i) be the smallest column index j such that A(i,j) equals the maximum value in A_i.

  • Example: i = 2, j(i) = 3

    111-12-13-14

    0130-13-14

    -1030-1-14

    -214211

    -12002

    0003

  • Definition: An n×m matrix A is monotone if for 1 ≤ i1 < i2 ≤ n, j(i1) ≤ j(i2). A is totally monotone if every submatrix of A is monotone.
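Both definitions can be checked by brute force on small matrices. The sketch below takes j(i) to be the leftmost column holding the row maximum, as in the earlier definition, and tests total monotonicity on 2×2 submatrices only, which is enough because any violation in a larger submatrix already shows up in the 2×2 submatrix of the two offending rows and columns. All function names and example matrices are hypothetical.

```python
from itertools import combinations


def leftmost_argmax(row):
    """Smallest column index holding the maximum of `row`."""
    best = 0
    for j, value in enumerate(row):
        if value > row[best]:
            best = j
    return best


def is_monotone(A):
    """j(i1) <= j(i2) whenever i1 < i2."""
    js = [leftmost_argmax(row) for row in A]
    return all(js[i] <= js[i + 1] for i in range(len(js) - 1))


def is_totally_monotone(A):
    """Every 2x2 submatrix (rows a < b, columns c < d) is monotone."""
    for a, b in combinations(range(len(A)), 2):
        for c, d in combinations(range(len(A[0])), 2):
            # Violation: the upper row prefers d while the lower row prefers c.
            if A[a][d] > A[a][c] and A[b][c] >= A[b][d]:
                return False
    return True


A = [[3, 1, 0], [2, 4, 1], [0, 3, 5]]
B = [[1, 2, 0], [2, 1, 3]]  # monotone as a whole, but columns 0 and 1 alone are not
print(is_monotone(A), is_totally_monotone(A))
print(is_monotone(B), is_totally_monotone(B))
```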

  • Another Definition: In the previous paper, the definition of totally monotone is: a matrix M[0..m, 0..n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0..n; c, d = 0..m: 1. Convex condition: M[a,c] ≥ M[b,c] ⟹ M[a,d] ≥ M[b,d] for all a

  • Comparison: Now we want to compare these two definitions. The definition in the SMAWK paper is called Ds; the definition in this paper is called Dc (we need a transpose to match the rows and columns of these two definitions).

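  • Comparison (cont.): To prove Dc ⟹ Ds. Suppose Dc holds on the matrix A[0..n, 0..m]. Take any submatrix of A and two of its rows i1 < i2, with j1 = j(i1) and j2 = j(i2); we must show j1 ≤ j2. In the figure, row i1 has entries a, b, c, d and row i2 has entries e, f, g, h. So j1 ≤ j2.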

    a b c d

    e f g h

  • Comparison (cont.): Ds does not imply Dc. The matrix satisfies Ds but not Dc.

    Dc is stronger.

  • Lemma 1: We say an entry A[i,j] is dead if j ≠ j(i). Let A be a totally monotone n×m matrix and let 1 ≤ j1 < j2 ≤ m. If A(r, j1) ≥ A(r, j2), then the entries in {A(i, j2) : 1 ≤ i ≤ r} are dead. If A(r, j1) < A(r, j2), then the entries in {A(i, j1) : r ≤ i ≤ n} are dead.

    [Figure: columns j1 and j2 compared at row r; the dead entries lie in rows i above r in one column and in rows i below r in the other.]

  • REDUCE(A)

    C = A; k = 1
    While C has more than n columns do
      case
        C(k,k) ≥ C(k,k+1) and k < n : k = k+1
        C(k,k) ≥ C(k,k+1) and k = n : delete column C^(k+1)
        C(k,k) < C(k,k+1) : delete column C^k; if k > 1 then k = k−1
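The REDUCE procedure translates almost line by line into runnable code. The sketch below uses 0-based indices and, as an implementation choice not stated on the slide, keeps a list of surviving column indices instead of copying the matrix; the name `reduce_columns` and the test matrix are hypothetical.

```python
def reduce_columns(A):
    """Return the indices of at most n columns of the n x m matrix A that
    still contain the leftmost maximum of every row.

    Assumes A is totally monotone and m >= n.
    """
    n = len(A)
    cols = list(range(len(A[0])))  # surviving columns, in original order
    k = 0                          # 0-based version of the slide's k
    while len(cols) > n:
        if A[k][cols[k]] >= A[k][cols[k + 1]]:
            if k < n - 1:
                k += 1             # case 1: move down the diagonal
            else:
                del cols[k + 1]    # case 2: column k+1 is dead in every row
        else:
            del cols[k]            # case 3: column k is dead from row k down
            if k > 0:
                k -= 1
    return cols


# Hypothetical totally monotone 3 x 5 matrix; row maxima are in columns 0, 2, 4.
A = [[-(j - 2 * i) ** 2 for j in range(5)] for i in range(3)]
print(reduce_columns(A))
```

Every loop iteration either advances k or deletes a column, which is what the counting argument on the time-complexity slide relies on.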

  • Time Complexity: Case 2 + Case 3 = m − n. Case 1 at most n + (m − n) − 1. Totally 2m − n − 1 = O(m).

  • MAXCOMPUTE(A)

    B = REDUCE(A)
    If n = 1 then output the maximum and return
    C = B[2, 4, …, 2⌊n/2⌋; 1, 2, …, n]
    MAXCOMPUTE(C)
    From the known positions of maxima in the even rows of B, find the maxima in its odd rows.
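Putting REDUCE and MAXCOMPUTE together gives the whole SMAWK recursion. The following is a sketch under stated assumptions rather than the authors' code: it works on lists of row and column indices, writes REDUCE in its common stack form (equivalent to the case analysis above), and returns the leftmost maximum column of each row. The names `smawk_row_maxima`, `reduce_cols` and `solve` are hypothetical.

```python
def smawk_row_maxima(A):
    """For a totally monotone matrix A, return the leftmost column index of
    the maximum of every row."""
    result = [0] * len(A)

    def reduce_cols(rows, cols):
        # Stack form of REDUCE: keep at most len(rows) candidate columns.
        kept = []
        for c in cols:
            while kept:
                r = rows[len(kept) - 1]
                if A[r][kept[-1]] >= A[r][c]:
                    break
                kept.pop()          # top of the stack is dead from row r down
            if len(kept) < len(rows):
                kept.append(c)
        return kept

    def solve(rows, cols):
        if not rows:
            return
        cols = reduce_cols(rows, cols)
        solve(rows[1::2], cols)     # the slide's even rows (1-based)
        # Fill in the remaining rows by scanning between known neighbours.
        start = 0
        for idx in range(0, len(rows), 2):
            r = rows[idx]
            last = result[rows[idx + 1]] if idx + 1 < len(rows) else cols[-1]
            best, j = cols[start], start
            while True:
                if A[r][cols[j]] > A[r][best]:
                    best = cols[j]
                if cols[j] == last:
                    break
                j += 1
            result[r] = best
            start = j

    solve(list(range(len(A))), list(range(len(A[0]))))
    return result


# Hypothetical totally monotone 3 x 5 matrix.
A = [[-(j - 2 * i) ** 2 for j in range(5)] for i in range(3)]
print(smawk_row_maxima(A))
```

The scan for the skipped rows touches each surviving column a constant number of times per level, which gives the c2·n term in the recurrence on the next slide.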

  • IDEA

  • Time Complexity: T(n,m) = c1·m + c2·n + T(n/2, n) = c1·m + (c1+c2)·n + c2·n/2 + T(n/4, n/2). Summing the geometric series, T(n,m) ≤ 2(c1+c2)·n + c1·m = O(m), since m ≥ n.
