A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
Maxime Crochemore, Gad M. Landau, Michal Ziv-Ukelson
Presentation: R89922024, B86202049, R90725054, R90922001, R90922091
Outline
Introduction and preliminaries.
LZ-78.
The basic concept.
Global alignment.
Local alignment.
Proof for LZ-76.
Proof for SMAWK algorithm.
LZ-78
aacgacga
Sample of LZ-78
[Figure: LZ-78 parses of ctacgaga (phrases 1-5) and aacgacga (phrases 1-4).]
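The parse on this slide can be reproduced with a minimal sketch of LZ-78 phrase splitting. The function name `lz78_parse` is a hypothetical helper, not something from the slides, and the sketch assumes the plain dictionary-free formulation in which each new phrase is a previously seen phrase plus one character.

```python
def lz78_parse(text):
    """Split text into LZ-78 phrases.

    Each phrase is the shortest prefix of the remaining text that has
    not been seen as a phrase before (an earlier phrase plus one symbol).
    """
    seen = set()
    phrases = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Trailing remainder: it repeats an earlier phrase.
        phrases.append(current)
    return phrases


print(lz78_parse("ctacgaga"))  # ['c', 't', 'a', 'cg', 'ag', 'a']
print(lz78_parse("aacgacga"))  # ['a', 'ac', 'g', 'acg', 'a']
```

The five full phrases of ctacgaga and the four full phrases of aacgacga match the phrase indices in the figure; the final "a" in each string is a leftover that repeats an earlier phrase.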
Basic Concept
[Figure: block grid; columns 0-4 over aacgacga, rows 1-5 over ctacgaga (1 c, 2 t, 3 a, 4 cg, 5 ag, a).]
Basic Concept: I/O Propagation Across G
DIST matrix

         0    1    2    3    4    5
I0=0     0   -1   -2   -3
I1=0    -1   -1   -2   -1   -3
I2=0    -2    0    0    1   -1   -3
I3=0         -2   -2    0   -2   -2
I4=0              -2    0   -1   -1
I5=0                   -2   -1    0
Basic Concept: Monge Property
DIST matrix
Aggarwal and Park, and Schmidt, observed that DIST matrices are Monge arrays.
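A short sketch can check the Monge claim on the DIST matrix shown on these slides. It assumes the concave (maximisation) form of the condition, M[i][j] + M[i+1][j+1] >= M[i][j+1] + M[i+1][j], tested on adjacent 2 x 2 minors, and it assumes that the empty cells outside the band stand for minus infinity. The name `is_concave_monge` is hypothetical.

```python
NEG = float("-inf")  # assumption: empty band cells mean "no path"

# DIST matrix from the slide (rows I0..I5, columns 0..5).
DIST = [
    [0,   -1,  -2,  -3,  NEG, NEG],
    [-1,  -1,  -2,  -1,  -3,  NEG],
    [-2,  0,   0,   1,   -1,  -3],
    [NEG, -2,  -2,  0,   -2,  -2],
    [NEG, NEG, -2,  0,   -1,  -1],
    [NEG, NEG, NEG, -2,  -1,  0],
]


def is_concave_monge(m):
    """True if every adjacent 2 x 2 minor satisfies the concave Monge inequality."""
    for i in range(len(m) - 1):
        for j in range(len(m[0]) - 1):
            if m[i][j] + m[i + 1][j + 1] < m[i][j + 1] + m[i + 1][j]:
                return False
    return True


print(is_concave_monge(DIST))  # True
```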
Basic Concept: Totally Monotone
An important property of Monge arrays is that they are totally monotone.
Both DIST and OUT matrices are totally monotone by the concave condition.
Aggarwal et al. gave a recursive algorithm, nicknamed SMAWK, which computes in O(n) time all row and column maxima of an n x n totally monotone matrix by querying only O(n) elements of the array.
DIST matrix

           0    1    2    3    4    5
I0 = 1     0   -1   -2   -3
I1 = 2    -1   -1   -2   -1   -3
I2 = 3    -2    0    0    1   -1   -3
I3 = 2         -2   -2    0   -2   -2
I4 = 1              -2    0   -1   -1
I5 = 3                   -2   -1    0
OUT matrix

   1    0   -1   -2   -∞   -∞
   1    1    0    1   -1   -∞
   1    3    3    4    2    0
 -12    0    0    2    0    0
 -13  -13   -1    1    0    0
 -14  -14  -14    1    2    3
OUT matrix concave monotonicity:
The lower-left padding entries are set to -(n + i + 1) * k, so they introduce no new column maximum.
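The relation between the two matrices can be sketched as OUT[i][j] = I[i] + DIST[i][j], with the block's output vector taken as the column maxima of OUT. The sketch below uses the input values I0..I5 = 1, 2, 3, 2, 1, 3 and the DIST values from the slides, treats every cell outside the band as minus infinity (an assumption; the slides pad the lower-left cells with finite values instead), and scans each column naively. It is only meant to show what SMAWK computes, not how it computes it. The names `out_matrix` and `column_maxima` are hypothetical.

```python
NEG = float("-inf")  # assumption: cells outside the band mean "no path"

I = [1, 2, 3, 2, 1, 3]
DIST = [
    [0,   -1,  -2,  -3,  NEG, NEG],
    [-1,  -1,  -2,  -1,  -3,  NEG],
    [-2,  0,   0,   1,   -1,  -3],
    [NEG, -2,  -2,  0,   -2,  -2],
    [NEG, NEG, -2,  0,   -1,  -1],
    [NEG, NEG, NEG, -2,  -1,  0],
]


def out_matrix(inputs, dist):
    """OUT[i][j] = inputs[i] + dist[i][j]."""
    return [[inputs[i] + d for d in row] for i, row in enumerate(dist)]


def column_maxima(matrix):
    """Naive O(t^2) column maxima; SMAWK finds the same values in O(t)."""
    return [max(col) for col in zip(*matrix)]


OUT = out_matrix(I, DIST)
print(column_maxima(OUT))  # [1, 3, 3, 4, 2, 3]
```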
The New Block
[Figure: the block grid; columns 0-4 over aacgacga, rows 1-5 over ctacgaga (1 c, 2 t, 3 a, 4 cg, 5 ag, a).]
Corresponding matrices
Maintaining Direct Access to DIST Columns
SMAWK queries the OUT matrix; each entry is computed from DIST and the input in constant time, so the OUT matrix itself needs no extra space.
Solution: a column data structure.
Data Structure
DIST(5,4), stored by column:
column 0:  0 -1 -2
column 1: -1 -1  0 -2
column 2: -2 -2  0 -2 -2
column 3: -3 -1  1  0  0 -2
column 4: -3 -1 -2 -1 -1
column 5: -3 -2 -1  0
Construction
[Figure: the DIST(5,4) columns and the block grid, repeated from the Data Structure slide.]
Time and Space Complexity
Each new block adds one new column to DIST, stored as a vector (a column of the DIST matrix).
SMAWK takes DIST and the input and outputs the maxima in O(t).
Total complexity: O(h n^2 / log n)
Sub-Quadratic Local Alignment
Eric, Yu En Lu
Information Management Dept.
National Taiwan University
Sub-Quadratic Global Alignment
Exploits the redundancy (self-repetition) within the sequences, exposed by Lempel-Ziv compression, to obtain the sub-quadratic bound.
Sub-Quadratic Local Alignment
Requires additional knowledge of where a locally optimal string starts and ends.
Because the algorithm works on a per-block basis, we have to compute additional information specific to each block
and then use it as the cue to the final score.
Additional Information
I
S[i]
C
E[k]
F = max{ max_{i=0..t} {I[i] + E[i]}, C }
Algorithm Body
Given: DIST, G
Encoding
  Compute values of E
  Compute values of S
  Compute values of C
Propagation
  Compute values of O (modified from the O in global alignment)
  Compute F
Seek Highest Score
  Find the highest score F
Back-Tracking the Exact Path
Global Alignment
Local Alignment
Given the block with the maximum F value, we trace its path by recursively following its max{lp, tp, dia} block until the score reaches 0.
Time/Space Analysis
Encoding
  E: max{E_lp[i], E_tp[i], DIST[i, lc]}   O(t)
  S: (all others can be copied, except ...) S_{lr,lc} = max{S_{lr-1,lc} + W, S_{lr,lc-1} + W, S_{lr-1,lc-1} + W}   O(t)
  C: max{C_lp, C_tp, S[lc]}   O(1)
Propagation
  O[i] = max{O[i], S[i]}   O(t)
  F = max{ max_{i=0..t} {I[i] + E[i]}, C }   O(t)
Find F: O(h n^2 / log^2 n)
Total complexity: O(h n^2 / log n)
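The propagation step for one block can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes E[i] is the best score of a path that enters at input i and ends inside the block, S[j] is the best score of a path that starts inside the block and leaves at output j, and C is the best score of a path contained in the block. The function name `propagate_block` and all numeric values are hypothetical, and the column maxima are scanned naively where the algorithm would use SMAWK.

```python
def propagate_block(I, DIST, E, S, C):
    """Return (O, F) for one block of the local-alignment propagation.

    O[j] = max( max_i(I[i] + DIST[i][j]), S[j] )
    F    = max( max_i(I[i] + E[i]), C )
    """
    t = len(I)
    O = []
    for j in range(t):
        through = max(I[i] + DIST[i][j] for i in range(t))  # global-style value
        O.append(max(through, S[j]))                         # or start inside block
    F = max(max(I[i] + E[i] for i in range(t)), C)
    return O, F


# Hypothetical toy block with three inputs and three outputs.
I = [0, 1, 0]
DIST = [[0, -1, -2],
        [-1, 0, -1],
        [-2, -1, 0]]
E = [2, 0, 1]
S = [1, 0, 2]
C = 1

O, F = propagate_block(I, DIST, E, S, C)
print(O, F)  # [1, 1, 2] 2
```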
Further Improvements
Efficient alignment storage algorithm
Conditioned on discrete weights
Gives a minimal encoding of DIST (O(t) -> O(1)) for G
Thus we obtain O(h n^2 / (log n)^2) storage complexity for the global alignment problem,
while the time complexity is O(h n^2 / log n).
Now, we are going to have presentations on SMAWK & LZ-76.
Thank you!
The Maximum Number of Distinct Words
Speaker: Emory Chang
Date: 2002/1/31
What is a Distinct Word?
EX: (LZ78)
A = {0,1}, a = |A| = 2
S = 0101000, n = |S| = 7
Parse: 0 | 1 | 01 | 00 | 0
We have four distinct words, and five steps to generate the sequence.
Notation
A: the alphabet (the set of symbols)
a: the number of symbols in A
S: a sequence over A
n: the length of S
C(S): production complexity of S
N: the maximum possible number of distinct words
The upper bound
Any sequential encoding procedure employs a parsing rule by which a long string of data is broken down into words that are individually mapped into distinct words.
For every :
Special case (1/2)
Let N denote the maximum possible number of distinct words. Clearly C(S) < N + 1 (with the possible exception of the last word).
Consider the special case: the sequence is formed by all distinct words of length one, two, ..., k.
ex: S = 0 1 00 01 10 11
[Figure: parse tree of S.]
n = 1 * 2^1 + 2 * 2^2
Special case (2/2)
The length of symbols at level i: i * a^i
The number of nodes at level i: a^i
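The special case can be checked numerically with a small sketch. It builds the sequence made of all words of length 1 to k over the alphabet and compares its length and word count with the per-level sums, assuming level i holds a^i words of length i. The helper name `all_words_sequence` is hypothetical.

```python
from itertools import product


def all_words_sequence(alphabet, k):
    """Concatenate every word of length 1..k over the alphabet, shortest first."""
    words = ["".join(p)
             for length in range(1, k + 1)
             for p in product(alphabet, repeat=length)]
    return words, "".join(words)


alphabet, k = "01", 2
a = len(alphabet)
words, S = all_words_sequence(alphabet, k)

n = sum(i * a**i for i in range(1, k + 1))  # total length over all levels
N = sum(a**i for i in range(1, k + 1))      # total number of distinct words

print(S, len(S), n)   # 0100011011 10 10
print(len(words), N)  # 6 6
```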
General Case
The length of a word at level k+1 is k+1.
[Figure: parse tree extended to level k+1.]
Proof:
Since ... We have ... Therefore ... from ... from ...
SMAWK
A Linear Time Algorithm for the Maximum Problem on Wide Totally Monotone Matrices
Definition
Let A be an n x m matrix with real entries. A^j denotes the jth column of A and A_i denotes the ith row of A.
A[i1, ..., ik; j1, ..., jk] denotes the submatrix of A formed by those rows and columns.
Let j(i) be the smallest column index j such that A(i, j) equals the maximum value in A_i.
Example: i = 2, j(i) = 3

   1    1    1  -12  -13  -14
   0    1    3    0  -13  -14
  -1    0    3    0   -1  -14
  -2    1    4    2    1    1
       -1    2    0    0    2
             0    0    0    3
Definition
An n x m matrix A is monotone if for 1 <= i1 < i2 <= n, j(i1) <= j(i2).
A is totally monotone if every submatrix of A is monotone.
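Both definitions can be tested directly with a brute-force sketch. It relies on the standard fact that a matrix is totally monotone exactly when every 2 x 2 submatrix is monotone, and it uses the first four (fully populated) rows of the example matrix above with 0-based indices. The function names are hypothetical.

```python
from itertools import combinations


def leftmost_row_maxima(A):
    """j(i) for every row: index of the leftmost maximum (0-based)."""
    return [row.index(max(row)) for row in A]


def is_monotone(A):
    j = leftmost_row_maxima(A)
    return all(j[i] <= j[i + 1] for i in range(len(j) - 1))


def is_totally_monotone(A):
    """Check every 2 x 2 submatrix; O(n^2 m^2), for illustration only."""
    for r1, r2 in combinations(range(len(A)), 2):
        for c1, c2 in combinations(range(len(A[0])), 2):
            # Upper row prefers the right column while the lower row
            # prefers (or ties on) the left column: not monotone.
            if A[r1][c2] > A[r1][c1] and A[r2][c1] >= A[r2][c2]:
                return False
    return True


A = [
    [1,  1, 1, -12, -13, -14],
    [0,  1, 3, 0,   -13, -14],
    [-1, 0, 3, 0,   -1,  -14],
    [-2, 1, 4, 2,   1,   1],
]

print(leftmost_row_maxima(A))  # [0, 2, 2, 2]
print(is_monotone(A), is_totally_monotone(A))  # True True
```

With 1-based indices the second row gives j(2) = 3, as in the example.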
Comparison
Now we want to compare these two definitions.
The definition in the SMAWK paper is called Ds; the definition in this paper is called Dc (we need a transpose to match the rows and columns of these two definitions).
Comparison (cont.)
To prove Dc => Ds:
Dc holds on matrix A[0..n, 0..m].
Let A[i..i', j..j'] be a submatrix of A, with i <= i1 < i2 <= i', j1 = j(i1), j2 = j(i2). Claim: j1 <= j2.
[Figure: rows i1 and i2 with entries a, b, c, d and e, f, g, h; column j1 marked.]
So j1 <= j2.
Comparison (cont.)
To show that Ds does not imply Dc:
The example matrix satisfies Ds but not Dc.
Dc is stronger.
Lemma 1
We say an entry A[i, j] is dead if j != j(i).
Lemma 1: Let A be a totally monotone n x m matrix and let 1 <= j1 < j2 <= m.
If A(r, j1) >= A(r, j2), then the entries in {A(i, j2) : 1 <= i <= r} are dead.
If A(r, j1) < A(r, j2), then the entries in {A(i, j1) : r <= i <= n} are dead.
[Figure: columns j1 and j2, row r, and the rows i above and below r.]
REDUCE(A)
  C = A; k = 1
  while C has more than n columns do
    case
      C(k,k) >= C(k,k+1) and k < n : k = k+1
      C(k,k) >= C(k,k+1) and k = n : delete column C^{k+1}
      C(k,k) < C(k,k+1) : delete column C^k; if k > 1 then k = k-1
  return C
REDUCE(A)
Time Complexity
Case 2 + Case 3 = m - n
Case 1: at most n + (m - n) - 1
Total: 2m - n - 1
O(m)
MAXCOMPUTE(A)
  B = REDUCE(A)
  if n = 1 then output the maximum and return
  C = B[2, 4, ..., 2*floor(n/2); 1, 2, ..., n]
  MAXCOMPUTE(C)
  From the known positions of the maxima in the even rows of B, find the maxima in its odd rows.
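REDUCE and MAXCOMPUTE can be combined into one runnable sketch. It is a common stack-based formulation rather than a line-by-line transcription of the pseudocode: the stack `kept` plays the role of columns 1..k of C, a pop corresponds to the case that deletes column C^k, and skipping a column when the stack is full corresponds to deleting C^{k+1}. The function name `smawk_row_maxima` is hypothetical, indices are 0-based, and the input is assumed to be totally monotone with at least as many columns as rows.

```python
def smawk_row_maxima(A):
    """Leftmost-maximum column of every row of a totally monotone matrix A."""
    result = {}

    def solve(rows, cols):
        if not rows:
            return
        # REDUCE: keep at most len(rows) columns that may still hold a maximum.
        kept = []
        for c in cols:
            while kept and A[rows[len(kept) - 1]][kept[-1]] < A[rows[len(kept) - 1]][c]:
                kept.pop()                 # top of stack is dead from this row down
            if len(kept) < len(rows):
                kept.append(c)             # otherwise the new column is dead
        # Recurse on every second row (the even rows in 1-based numbering).
        solve(rows[1::2], kept)
        # Fill in the remaining rows between the known neighbouring maxima.
        lo = 0
        for idx in range(0, len(rows), 2):
            r = rows[idx]
            if idx + 1 < len(rows):
                hi = kept.index(result[rows[idx + 1]], lo)
            else:
                hi = len(kept) - 1
            best = max(range(lo, hi + 1), key=lambda x: (A[r][kept[x]], -x))
            result[r] = kept[best]
            lo = hi

    solve(list(range(len(A))), list(range(len(A[0]))))
    return [result[r] for r in range(len(A))]


A = [
    [1,  1, 1, -12, -13, -14],
    [0,  1, 3, 0,   -13, -14],
    [-1, 0, 3, 0,   -1,  -14],
    [-2, 1, 4, 2,   1,   1],
]
print(smawk_row_maxima(A))  # [0, 2, 2, 2]
```

The matrix is the fully populated upper part of the example matrix from the definition slide; the result agrees with a direct scan of each row.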
IDEA
Time Complexity
T(n, m) = c1 m + c2 n + T(n/2, n)
        = c1 m + (c1 + c2) n + c2 n/2 + T(n/4, n/2)
T(n, m) <= 2 (c1 + c2) n + c1 m = O(m)