A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices Maxime Crochemore Gad M. Landau Michal Ziv-Ukelson


Presentation: R89922024, B86202049, R90725054, R90922001, R90922091

Outline
Introduction and preliminaries.
LZ-78.
The basic concept.
Global alignment.
Local alignment.
Proof for LZ-76.
Proof for SMAWK algorithm.

LZ-78
aacgacga

Sample of LZ-78

[Figure: LZ-78 parse of ctacgaga into phrases 1 to 5 and of aacgacga into phrases 1 to 4.]
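The parse on this slide can be reproduced with a minimal sketch of the LZ-78 rule: each new phrase is the shortest prefix of the remaining input that has not been seen as a phrase before. The function name `lz78_parse` and the handling of a leftover suffix are assumptions made for illustration; they are not taken from the slides.

```python
def lz78_parse(text):
    """Split text into LZ-78 phrases.

    Each phrase is the shortest prefix of the remaining input that is not
    already in the dictionary of earlier phrases.
    """
    seen = set()
    phrases = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            # New phrase: an earlier phrase extended by one character.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Leftover suffix that repeats an earlier phrase (assumed handling).
        phrases.append(current)
    return phrases


print(lz78_parse("ctacgaga"))  # ['c', 't', 'a', 'cg', 'ag', 'a']
print(lz78_parse("aacgacga"))  # ['a', 'ac', 'g', 'acg', 'a']
```

Under this sketch, ctacgaga yields five new phrases and aacgacga yields four, each followed by a one-character leftover.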

Basic Concept

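[Figure: alignment graph of aacgacga (phrases 1 to 4) against ctacgaga (phrases 1 to 5), partitioned into blocks by the LZ-78 phrases.]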

Basic Concept: I/O Propagation Across G
DIST matrix

          0   1   2   3   4   5
I0 = 0    0  -1  -2  -3
I1 = 0   -1  -1  -2  -1  -3
I2 = 0   -2   0   0   1  -1  -3
I3 = 0       -2  -2   0  -2  -2
I4 = 0           -2   0  -1  -1
I5 = 0               -2  -1   0

Basic Concept: Monge Property
DIST matrix
Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays.
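As a hedged illustration of what "Monge" means here, the sketch below checks the usual quadrangle inequality on a fully defined matrix. The inequality direction labelled "concave" (M[a][c] + M[b][d] >= M[b][c] + M[a][d] for a < b, c < d) is an assumption based on the standard definition, and the function name `is_monge` is hypothetical.

```python
from itertools import combinations


def is_monge(M, concave=True):
    """Brute-force Monge check on a rectangular matrix of finite numbers.

    concave=True  tests M[a][c] + M[b][d] >= M[b][c] + M[a][d]
    concave=False tests M[a][c] + M[b][d] <= M[b][c] + M[a][d]
    for every pair of rows a < b and every pair of columns c < d.
    """
    rows, cols = range(len(M)), range(len(M[0]))
    for a, b in combinations(rows, 2):        # a < b
        for c, d in combinations(cols, 2):    # c < d
            main = M[a][c] + M[b][d]
            anti = M[b][c] + M[a][d]
            violated = (main < anti) if concave else (main > anti)
            if violated:
                return False
    return True


# Toy matrix (not from the slides): M[i][j] = i * j satisfies the concave form.
toy = [[i * j for j in range(4)] for i in range(3)]
print(is_monge(toy, concave=True))
print(is_monge(toy, concave=False))
```

The check is quadratic in both dimensions and is meant only for small examples.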


Basic Concept: Totally Monotone
An important property of Monge arrays is that they are totally monotone.
Both DIST and OUT matrices are totally monotone by the concave condition.
Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which computes all row and column maxima of an n × n totally monotone matrix in O(n) time by querying only O(n) elements of the array.

DIST matrix

I0 = 1    0  -1  -2  -3
I1 = 2   -1  -1  -2  -1  -2
I2 = 3   -2   0   0   1  -1  -3
I3 = 2       -2  -2   0  -2  -2
I4 = 1           -2   0  -1  -1
I5 = 3               -2  -1   0

OUT matrix

  1    0   -1   -2   -∞   -∞
  1    1    0    1   -1   -∞
  1    3    3    4    2    0
-12    0    0    2    0    0
-13  -13   -1    1    0    0
-14  -14  -14    1    2    3
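The propagation step can be sketched naively as follows: OUT[i][j] = I[i] + DIST[i][j], and each output value is the maximum of its OUT column. This is a quadratic-time illustration only (SMAWK replaces the column scan in the actual algorithm). The I values and DIST entries are transcribed from the DIST matrix slide; treating every undefined entry as negative infinity and the function name `propagate` are assumptions of this sketch.

```python
NEG_INF = float("-inf")
U = NEG_INF  # undefined DIST entry (no path between that input and output)


def propagate(inputs, dist):
    """Return the output vector O with O[j] = max over i of inputs[i] + dist[i][j]."""
    t = len(inputs)
    out = [[inputs[i] + dist[i][j] for j in range(t)] for i in range(t)]
    return [max(out[i][j] for i in range(t)) for j in range(t)]


I = [1, 2, 3, 2, 1, 3]
DIST = [
    [ 0, -1, -2, -3,  U,  U],
    [-1, -1, -2, -1, -2,  U],
    [-2,  0,  0,  1, -1, -3],
    [ U, -2, -2,  0, -2, -2],
    [ U,  U, -2,  0, -1, -1],
    [ U,  U,  U, -2, -1,  0],
]

print(propagate(I, DIST))  # [1, 3, 3, 4, 2, 3]
```

The printed vector agrees with the column maxima of the OUT matrix on the slide.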

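OUT matrix concave monotonicity:
The undefined entries are padded so that no new column maximum is introduced: entries in the lower-left corner of row i are set to $-(n + i + 1) \cdot k$, and entries in the upper-right corner are set to $-\infty$.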


The New block

[Figure: alignment graph of aacgacga against ctacgaga, showing the new block.]

Corresponding matrices

Maintaining Direct Access to DIST Columns
SMAWK: OUT matrix = DIST + input; OUT matrix entries must be available in constant time
Space: column data structure

Data Structure

DIST(5,4)
-3  -1   1   0   0  -2
-2  -2   0  -2  -2
-1  -1   0  -2
 0  -1  -2
-2  -1  -2  -1  -1
-3  -2  -1   0


Construction

DIST(5,4)
-3  -1   1   0   0  -2
-2  -2   0  -2  -2
-1  -1   0  -2
 0  -1  -2
-2  -1  -2  -1  -1
-3  -2  -1   0


Time and Space Complexity
Each block stores only its new column as a DIST vector (one column of the DIST matrix).
SMAWK takes DIST plus the input values and outputs the maxima in $O(t)$ per block.
Total complexity: $O(hn^2/\log n)$


Sub-Quadratic Local Alignment
Eric, Yu En Lu
Information Management Dept.
National Taiwan University

Sub-Quadratic Global Alignment
Exploits the redundancy (self-repetition) within the sequences, as exposed by Lempel-Ziv compression, to obtain the sub-quadratic bound.

Sub-Quadratic Local Alignment
Requires additional knowledge of where a locally optimal alignment starts and ends.
However, because the algorithm works block by block, additional information specific to each block has to be computed
and then used as the cue to the final score.

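Additional Information
I: the input values of the block
S[i]: the best score of a path that starts inside the block and ends at output i
C: the best score of a path contained entirely in the block
E[k]: the best score of a path that enters at input k and ends inside the block
$F = \max\{\max_{i=0}^{t}\{I[i] + E[i]\},\ C\}$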
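The formula for F can be written out directly. The sketch below is a literal transcription of F = max{ max over i of (I[i] + E[i]), C }; the function name `block_best_score` and the sample numbers are hypothetical and are not taken from the slides.

```python
def block_best_score(I, E, C):
    """Best local-alignment score ending inside one block.

    I[i] + E[i] scores a path that enters the block at input i and ends
    inside it; C scores a path that lies entirely inside the block.
    """
    best_entering = max(i_val + e_val for i_val, e_val in zip(I, E))
    return max(best_entering, C)


# Hypothetical values, for illustration only.
print(block_best_score(I=[0, 2, 1], E=[3, 0, 4], C=4))  # 5
print(block_best_score(I=[0, 2, 1], E=[3, 0, 4], C=7))  # 7
```

Both the scan over the t + 1 border entries and the final comparison are linear in t, which matches the O(t) cost listed for F in the analysis slide.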

Algorithm Body
Given: DIST, G
Encoding
  Compute values of E
  Compute values of S
  Compute values of C
Propagation
  Compute values of O (modified from the O in global alignment)
  Computing F
Seek Highest Score
  Find the highest score F

Back-Tracking the Exact Path
Global Alignment
Local Alignment
Given the block with the maximum F value,
we recover its path by recursively following the max{lp, tp, dia} block until the score reaches 0.

Time/Space Analysis
Encoding
  E: $\max\{E_{lp}[i],\ E_{tp}[i],\ DIST[i, lc]\}$, $O(t)$
  S: (all others can be copied, except ..) $S_{lr,lc} = \max\{S_{lr-1,lc} + W,\ S_{lr,lc-1} + W,\ S_{lr-1,lc-1} + W\}$, $O(t)$
  C: $\max\{C_{lp},\ C_{tp},\ S[lc]\}$, $O(1)$
Propagation
  $O[i] = \max\{O[i],\ S[i]\}$, $O(t)$
  $F = \max\{\max_{i=0}^{t}\{I[i] + E[i]\},\ C\}$, $O(t)$
Find F: $O(hn^2/\log^2 n)$
Total complexity: $O(hn^2/\log n)$

Further Improvements
Efficient alignment storage algorithm
Conditioned on discrete weights
Gives a minimal encoding of DIST ($O(t) \to O(1)$) for G
Thus we obtain $O(hn^2/(\log n)^2)$ storage complexity for the global alignment problem,
while the time complexity remains $O(hn^2/\log n)$.

Now, we are going to have presentations on SMAWK & LZ-76.
Thank you!

The Maximum Number of Distinct Words
Speaker: Emory Chang
Date: 2002/1/31

What is a Distinct Word?
EX: (LZ78)
A = {0,1}, a = |A| = 2
S = 0101000, n = |S| = 7
0 1 0 1 0 0 0

We have four distinct words, and five steps to generate the sequence.

Notation
A: the alphabet
a: the size of the alphabet
S: a sequence over A
n: the length of S
C(S): production complexity of S
N: the maximum possible number of distinct words

The upper bound
Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct words.
For every :

Special case (1/2)
Let N denote the maximum possible number of distinct words. Clearly C(S) < N + 1 (with the possible exception of the last word).
Consider the special case: the sequence is formed by all distinct words of length one, two, ..., k.
ex: S = 0 1 0 0 0 1 1 0 1 1
[Figure: tree of the distinct words of S, with nodes numbered 0 to 6.]
$n = 1 \cdot 2^1 + 2 \cdot 2^2$

Special case (2/2)
The length of symbols at level i: $i \cdot a^i$
The number of nodes at level i: $a^i$
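The special case can be checked numerically with a small sketch. It builds the sequence made of all words of length 1 through k and compares its length and word count with the level-by-level sums, assuming that level i contributes a^i words of length i (which holds when every word of each length appears exactly once). The function names are hypothetical.

```python
from itertools import product


def special_case_words(alphabet, k):
    """All words over the alphabet of length 1, 2, ..., k, shortest first."""
    return ["".join(p) for i in range(1, k + 1) for p in product(alphabet, repeat=i)]


def special_case_counts(a, k):
    """Return (n, N): total length and number of words, summed level by level."""
    n = sum(i * a**i for i in range(1, k + 1))
    N = sum(a**i for i in range(1, k + 1))
    return n, N


words = special_case_words("01", 2)
print("".join(words))             # 0100011011
print(special_case_counts(2, 2))  # (10, 6)
```

For the binary alphabet with k = 2 this reproduces the example sequence of the previous slide, with n = 10 symbols and N = 6 distinct words.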

General Case
The length of each word at level k+1 is k+1.
[Figure: level k+1.]

Proof:
Since ...
We have ...
Therefore ...
from ...
from ...

SMAWK
A Linear Time Algorithm for the Maximum Problem on Wide Totally Monotone Matrices

Definition
Let A be an $n \times m$ matrix with real entries.
$A^j$ denotes the jth column of A and $A_i$ denotes the ith row of A.
$A[i_1, \ldots, i_k;\ j_1, \ldots, j_k]$ denotes the submatrix of A formed by rows $i_1, \ldots, i_k$ and columns $j_1, \ldots, j_k$.
Let j(i) be the smallest column index j such that A(i,j) equals the maximum value in $A_i$.

Example: i = 2, j(i) = 3

  1    1    1  -12  -13  -14
  0    1    3    0  -13  -14
 -1    0    3    0   -1  -14
 -2    1    4    2    1    1
      -1    2    0    0    2
            0    0    0    3

Definition
An $n \times m$ matrix A is monotone if for $1 \le i_1 < i_2 \le n$, $j(i_1) \le j(i_2)$.
A is totally monotone if every submatrix of A is monotone.
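These two definitions can be tested by brute force on small matrices. The sketch below computes j(i) as the leftmost row maximum and, for total monotonicity, checks every 2 × 2 submatrix; relying on 2 × 2 submatrices is the standard equivalent formulation and is an assumption of this sketch rather than something stated on the slide. The function names and the toy matrices are hypothetical.

```python
from itertools import combinations


def leftmost_max_col(row):
    """0-based index of the leftmost maximum of a row, i.e. j(i) - 1."""
    return row.index(max(row))


def is_monotone(A):
    """True if the leftmost-maximum column never moves left going down the rows."""
    cols = [leftmost_max_col(row) for row in A]
    return all(x <= y for x, y in zip(cols, cols[1:]))


def is_totally_monotone(A):
    """True if every 2 x 2 submatrix (rows i1 < i2, columns j1 < j2) is monotone."""
    n, m = len(A), len(A[0])
    for i1, i2 in combinations(range(n), 2):
        for j1, j2 in combinations(range(m), 2):
            sub = [[A[i1][j1], A[i1][j2]], [A[i2][j1], A[i2][j2]]]
            if not is_monotone(sub):
                return False
    return True


good = [[3, 2, 1], [2, 3, 2], [1, 2, 3]]
bad = [[0, 1, 5], [1, 0, 5]]  # monotone, but columns 0 and 1 alone are not
print(is_monotone(good), is_totally_monotone(good))
print(is_monotone(bad), is_totally_monotone(bad))
```

The second toy matrix shows why the "every submatrix" clause matters: its row maxima are monotone, yet deleting the last column exposes a violation.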

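Another Definition
In the previous paper, the definition of totally monotone is:
A matrix M[0..m, 0..n] is totally monotone if either condition 1 or 2 below holds for all a, b = 0..n; c, d = 0..m:
1. Convex condition: $M[a,c] \ge M[b,c] \Rightarrow M[a,d] \ge M[b,d]$ for all $a < b$ and $c < d$.
2. Concave condition: $M[a,c] \le M[b,c] \Rightarrow M[a,d] \le M[b,d]$ for all $a < b$ and $c < d$.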

Comparison
Now we want to compare these two definitions.
The definition in the SMAWK paper is called Ds; the definition in this paper is called Dc (a transpose is needed to match the rows and columns of the two definitions).

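Comparison (cont.)
To prove $Dc \Rightarrow Ds$:
Suppose Dc holds on the matrix A[0..n, 0..m].
Let A[i..i', j..j'] be a submatrix of A with $i \le i_1 < i_2 \le i'$, $j_1 = j(i_1)$, $j_2 = j(i_2)$. We want $j_1 \le j_2$.

        ...  j1
  i1:  a b c  d
  i2:  e f g  h

$a, b, c \le d \Rightarrow e, f, g \le h$
So $j_1 \le j_2$.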

Comparison (cont.)
To show that Ds does not imply Dc:
The matrix satisfies Ds but not Dc.

Dc is stronger.

Lemma 1
We say that an entry A[i,j] is dead if $j \ne j(i)$.
Lemma 1: Let A be a totally monotone $n \times m$ matrix and let $1 \le j_1 < j_2 \le m$.
If $A(r, j_1) \ge A(r, j_2)$, then the entries in $\{A(i, j_2) : 1 \le i \le r\}$ are dead.
If $A(r, j_1) < A(r, j_2)$, then the entries in $\{A(i, j_1) : r \le i \le n\}$ are dead.

[Figure: row r crossing columns j1 and j2, with the dead entries of column j2 above row r and of column j1 below row r.]

REDUCE(A)
  C = A; k = 1
  while C has more than n columns do
    case
      C(k,k) ≥ C(k,k+1) and k < n: k = k + 1
      C(k,k) ≥ C(k,k+1) and k = n: delete column C^(k+1)
      C(k,k) < C(k,k+1): delete column C^k; if k > 1 then k = k - 1
  return C

REDUCE(A)

Time Complexity
Case 2 + Case 3 = $m - n$
Case 1: at most $n + (m - n) - 1$
Total: at most $2m - n - 1$, i.e. $O(m)$

MAXCOMPUTE(A)
  B = REDUCE(A)
  if n = 1 then output the maximum and return
  C = B[2, 4, ..., 2⌊n/2⌋; 1, 2, ..., n]
  MAXCOMPUTE(C)
  From the known positions of the maxima in the even rows of B, find the maxima in its odd rows.

Time Complexity
$T(n,m) = c_1 m + c_2 n + T(n/2, n) = c_1 m + (c_1 + c_2) n + c_2 n/2 + T(n/4, n/2)$
$T(n,m) \le 2(c_1 + c_2) n + c_1 m = O(m)$
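REDUCE and MAXCOMPUTE can be combined into one short recursive routine. The following is a sketch under stated assumptions rather than the presenters' implementation: it works on row and column index lists instead of copying submatrices, replaces the counter k of REDUCE with a stack of surviving columns, and returns the 0-based column of the leftmost maximum of each row. The name `smawk_row_maxima` and the test matrix are hypothetical.

```python
def smawk_row_maxima(A):
    """Leftmost-maximum column (0-based) of every row of a totally monotone matrix."""
    result = [0] * len(A)

    def reduce_cols(rows, cols):
        # REDUCE: keep at most len(rows) columns that may still hold a maximum.
        stack = []
        for c in cols:
            while stack:
                r = rows[len(stack) - 1]       # row k paired with stack top (column k)
                if A[r][stack[-1]] >= A[r][c]:
                    break                      # top survives for now
                stack.pop()                    # top is dead from row k downwards
            if len(stack) < len(rows):
                stack.append(c)                # otherwise column c is dead: drop it
        return stack

    def solve(rows, cols):
        if not rows:
            return
        cols = reduce_cols(rows, cols)
        solve(rows[1::2], cols)                # even rows (1-based numbering)
        pos = 0                                # position in cols of the last known maximum
        for idx in range(0, len(rows), 2):     # odd rows (1-based numbering)
            r = rows[idx]
            # The maximum of row r lies no further right than that of the next even row.
            stop = result[rows[idx + 1]] if idx + 1 < len(rows) else cols[-1]
            best = cols[pos]
            while cols[pos] != stop:
                pos += 1
                if A[r][cols[pos]] > A[r][best]:
                    best = cols[pos]
            result[r] = best

    if A:
        solve(list(range(len(A))), list(range(len(A[0]))))
    return result


# Hypothetical wide test matrix: A[i][j] = -(2i - j)^2 is totally monotone.
A = [[-(2 * i - j) ** 2 for j in range(9)] for i in range(4)]
print(smawk_row_maxima(A))  # [0, 2, 4, 6]
```

Working with index lists avoids materialising B and C, so each level of the recursion only touches the entries it compares, in line with the linear bound derived above.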
