84
Fast Algorithm for String Fast Algorithm for String Matching with Matching with k k Mismatches Mismatches by Amihood Amir, Moshe Lewenstein, and E by Amihood Amir, Moshe Lewenstein, and E ly Porat, ly Porat, Journal of Algorithms Journal of Algorithms , to appear, 20 , to appear, 20 03/2004 03/2004 Speaker: R92921097 李李 R92921084 李李 R92921083 李李

Fast Algorithm for String Matching with k Mismatches by Amihood Amir, Moshe Lewenstein, and Ely Porat, Journal of Algorithms, to appear, 2003/2004 Speaker:

Embed Size (px)

Citation preview

Fast Algorithm for String Fast Algorithm for String Matching with Matching with kk Mismatches Mismatches

by Amihood Amir, Moshe Lewenstein, and Ely Porat, by Amihood Amir, Moshe Lewenstein, and Ely Porat,

Journal of AlgorithmsJournal of Algorithms, to appear, 2003/2004, to appear, 2003/2004

Speaker:R92921097 李宜益R92921084 何明彥R92921083 余宗恩Advisor: 呂學一 老師

Speaker: R92921097 Speaker: R92921097 李宜益李宜益

General CaseGeneral Case

OutlineOutlineIntroductionIntroductionProblem Definition and Problem Definition and PreliminariesPreliminariesLarge and Small AlphabetsLarge and Small AlphabetsGeneral AlphabetsGeneral Alphabets

IntroductionIntroduction

Two types of matching problemsTwo types of matching problems Generalized matching problemGeneralized matching problem Approximate matching problemApproximate matching problem

Previous researchPrevious research Landau and Vishkin : O( )Landau and Vishkin : O( ) Abrahamson : O( )Abrahamson : O( )

nklogn m m

IntroductionIntroduction

Complexity : Complexity : O( )O( ) Contribution :Contribution :

The fastest known algorithm for string The fastest known algorithm for string matching with k mismatches.matching with k mismatches.

Identifying and exploiting a new Identifying and exploiting a new technique that has been implicitly used technique that has been implicitly used in some recent papers – counting.in some recent papers – counting.

logn k k

Problem Definition and Problem Definition and PreliminariesPreliminaries

Let a, b Let a, b . Define. Define

Let be two strings over alphLet be two strings over alphabet abet . Then the hamming distance betwee. Then the hamming distance between X and Y (n X and Y (ham(X, Y)ham(X, Y)) is defined as) is defined as

1 , ;( , )

0 , .def if a b

neq a bif a b

1

0

( , ) ( , )n

defi i

i

ham X Y neq x y

0 1 -1 0 1 -1X = ... and Y = y ...n nx x x y y

Problem Definition and Problem Definition and PreliminariesPreliminaries

The The The String matching with k mismatchesThe String matching with k mismatches Pr Problem is defined as follows:oblem is defined as follows:InputInput : Text : Text TT = = tt00…t…tn-1n-1, pattern , pattern PP = = pp00…p…pm-1m-1,,

where where ti, pj,, i = = 0,…n-1; ; j = =0,…m-1, , and a natural number and a natural number k . .

OutputOutput : All pairs < : All pairs <i, ham(P,Ti, ham(P,T((ii)))>, where )>, where ii is is a text location for which a text location for which hamham((P,TP,T((ii))) )

kk, where , where TT((ii)) = = ttiitti+1i+1…t…ti+m-1i+m-1

Lager and Small AlphabetsLager and Small Alphabets

Large alphabets Large alphabets Number of different alphabets in the patterNumber of different alphabets in the patter

n exceeds n exceeds 2k2k Small alphabetsSmall alphabets

Number of different alphabets in the patterNumber of different alphabets in the pattern less than n less than 2 k

Large Alphabets(1)Large Alphabets(1)

Two stagesTwo stages Marking stageMarking stage

Identifying the potential starts of the pattern.Identifying the potential starts of the pattern. Verification stageVerification stage

Verifying which of the potential candidates is indVerifying which of the potential candidates is indeed a pattern occurrence.eed a pattern occurrence.

Large Alphabets(2)Large Alphabets(2)

The Marking StageThe Marking Stage Let {Let {aa11,…,a,…,a2k2k} be } be 2k2k different alphabet symb different alphabet symb

ols appearing in the text and let ols appearing in the text and let iijj be the sm be the smallest index in the pattern where allest index in the pattern where aajj appears, appears, jj = 1,..., = 1,...,2k2k

a1 a2 a3 aj …… a2k

a1 a1 a2 a3 aj

11 23 35 ij

Text

Pattern

Large Alphabets(3)Large Alphabets(3)

M.1. for every symbol M.1. for every symbol ttii; if ; if ttii = = aajj then mar then mark text location k text location ii – – jj

M.2. discard every text location that is mM.2. discard every text location that is marked less than arked less than kk marks marks

Time: O(n)Time: O(n)

ExampleExample

Text Text : aabcabc: aabcabc PatternPattern : abc : abc K K : 1: 1

a a b c a b c1 3 0 0 3 0 0# of

marks

Lemma 1Lemma 1 After the marking stage, there are at most undiscAfter the marking stage, there are at most undisc

arded locationsarded locationsproof:proof:

Total marks = n and every undiscarded locations has at least marks

there are at most locations undiscarded.

k

n

k

n

k

Verification StageVerification Stage

Using Using suffix treesuffix tree and and Lowest Common ALowest Common Ancestorncestor to check whether a location exis to check whether a location exists a matching that is less than k mismatcts a matching that is less than k mismatches.hes.

takes O(k) for each candidate takes O(k) for each candidate Total time : O( )Total time : O( )

nk

k

Small AlphabetsSmall Alphabets

Using convolutions, as introduced by FisUsing convolutions, as introduced by Fischer and Patersoncher and Paterson

DefineDefine

String S = String S = ss00…s…sn-1n-1, then, then

SSRR is the reverse of the string is the reverse of the string ssn-1n-1…s…s00

1 ( )

0

if xX x

if x

1 ( )

0

if xX x

if x

ExampleExample

TextText((TT)) PatternPattern((PP)) KK

00110110011011 11011011101101 1110110 1110110 001001 010010 100100

X ( )b

T

2X ( )

aT

X ( )c

T X ( )R

a P X ( )R

b P X ( )R

c P

aabcabcabc

ExampleExample

X ( ) X ( )Raa

T P 000011011X ( ) X ( )R

bbT P 011011010

X ( ) X ( )Rcc

T P 111011000

000011011

011011010

+111011000

122033021

Time complexityTime complexity

Each multiplication takes O (Each multiplication takes O (nlogmnlogm) using ) using FFTFFT

We do multiplicationsWe do multiplications Can be solved in O (Can be solved in O (n logmn logm))

k

k

General AlphabetsGeneral Alphabets

Cases which the size of the pattern alphaCases which the size of the pattern alphabet is between 2 and 2bet is between 2 and 2kk

DefinitionDefinition A symbol that appears in the pattern at least A symbol that appears in the pattern at least

2 times is called2 times is called frequent. frequent. A symbol that is A symbol that is not frequent is callednot frequent is called rare rare..

k

k

Many Frequent SymbolsMany Frequent Symbols

More than frequent symbols More than frequent symbols Lemma 2Lemma 2

Let be frequent symbols. Then there Let be frequent symbols. Then there exist in the text at most locations where exist in the text at most locations where there is a pattern occurrence with no more than there is a pattern occurrence with no more than k errors.k errors.

proof:proof:Choose 2 occurrences of every frequent symbol in Choose 2 occurrences of every frequent symbol in pattern and call them relevant occurrences pattern and call them relevant occurrences

The total number of marks is at most The total number of marks is at most There are at most There are at most

2n

k

k

1,..., ka a

k

2n k2 2n k n

k k

Finding the Potential Finding the Potential LocationsLocations

Example : k = 2Example : k = 2

Frequent symbols : Frequent symbols :

a a b c a b c1 2 3 4 5 6 7

a a b c a b d8 9 1011121314

a bfrequent symbols :

Finding the Potential Finding the Potential LocationsLocations

: don’t care: don’t care Using the “Using the “less than matching with “dless than matching with “d

on’t care” problemon’t care” problem” proposed by Am” proposed by Amir et. This can be done in O( )ir et. This can be done in O( )

a a b c a b c1 2 3 4 5 6 7

a a b c a b d8 9 1011121314

logn k m

The Verification StageThe Verification Stage

By lemma 2, we have at most candidatesBy lemma 2, we have at most candidates Using suffix tree and Lowest Common AncUsing suffix tree and Lowest Common Anc

estor to check whether a location exists a estor to check whether a location exists a matching that is less than k mismatches.matching that is less than k mismatches.

takes O(takes O(kk) for each candidate ) for each candidate Total time : O(Total time : O(nn + )=O( ) + )=O( )k

2n

k

logn k m logn k m

Few Frequent SymbolsFew Frequent Symbols

Using the convolutions as described in “SmalUsing the convolutions as described in “Small Alphabets” to deal with the frequent symboll Alphabets” to deal with the frequent symbols s takes O( ) takes O( )

Then replace all frequent symbols in p by “doThen replace all frequent symbols in p by “don’t cares” n’t cares” Case 1 : the remaining symbols and all their occurreCase 1 : the remaining symbols and all their occurre

nces together less than nces together less than 2k2k Case 2 : the remaining symbols and all their occurrCase 2 : the remaining symbols and all their occurr

ences together at least ences together at least 2k2k

logn m k

Case 1Case 1

Using the algorithm “Using the algorithm “Pattern Matching Pattern Matching with Swapswith Swaps” of Amir et. This can be don” of Amir et. This can be done in O( )e in O( )

Total time complexity : O( )Total time complexity : O( )logn k m

logn k m

Case 2Case 2

Choose any 2k symbolsChoose any 2k symbols # of chosen symbols does not exceed# of chosen symbols does not exceed Using the previous method “finding the Using the previous method “finding the

potential positions” potential positions” We have at most O( ) potential positionsWe have at most O( ) potential positions

and verifying each location is O(and verifying each location is O(kk) ) Total time complexity: O( )Total time complexity: O( )

2 k

n

k

logn k m

Speaker : R92921084 Speaker : R92921084 何明彥何明彥

Introduction to Introduction to BreakBreak

AssumptionAssumption PeriodicityPeriodicity BreakBreak Counting ArgumentCounting Argument P has 2k disjoint k-breaksP has 2k disjoint k-breaks P has 2k disjoint l-breaksP has 2k disjoint l-breaks Local matchesLocal matches

OUTLINEOUTLINE

AssumptionAssumption PeriodicityPeriodicity BreakBreak Counting ArgumentCounting Argument P has 2k disjoint k-breaksP has 2k disjoint k-breaks P has 2k disjoint l-breaksP has 2k disjoint l-breaks Local matchesLocal matches

OUTLINEOUTLINE

(Partition)(Partition)

Assumption(1/2)Assumption(1/2)

Text T: |T|=nText T: |T|=n Pattern P: |P|=mPattern P: |P|=m

=>=>n=2mn=2m

T:T:

P:P:

Assumption(2/2)Assumption(2/2)

Therefore, spilt text into substring of Therefore, spilt text into substring of length 2m.length 2m.

Every pattern occurrence appears Every pattern occurrence appears in in some substringsome substring..

for 2m length substrings of for 2m length substrings of the text yields an algorithm of the text yields an algorithm of

n

m

( ( , ))O f m k

( . ( , ))n

O f m km

AssumptionAssumption PeriodicityPeriodicity BreakBreak Counting ArgumentCounting Argument P has 2k disjoint k-breaksP has 2k disjoint k-breaks P has 2k disjoint l-breaksP has 2k disjoint l-breaks Local matchesLocal matches

OUTLINEOUTLINE

Periodicity(1/2)Periodicity(1/2)

Def:Def: A string S[1..n] is periodic if suchA string S[1..n] is periodic if such that S[j]=S[i+j-1].that S[j]=S[i+j-1]. S is periodic if : j 2 , is a prefix S is periodic if : j 2 , is a prefix

of ; otherwise is aperiodic.of ; otherwise is aperiodic. ex: ex: ABCABCAB ABCABCAB

ABCDABCABCDABC

2

ni

1j n i jS u

u

periodic

aperiodic

If P is periodic with a short period, it If P is periodic with a short period, it is quite simple to come up with a is quite simple to come up with a quick algorithm for quick algorithm for string matching string matching with k mismatch.with k mismatch.

T:T:

P:P:

Periodicity(2/2)Periodicity(2/2)

AssumptionAssumption PeriodicityPeriodicity BreakBreak Counting ArgumentCounting Argument P has 2k disjoint k-breaksP has 2k disjoint k-breaks P has 2k disjoint l-breaksP has 2k disjoint l-breaks Local matchesLocal matches

OUTLINEOUTLINE

Def:Def: A A breakbreak of a string S is an aperiodic s of a string S is an aperiodic s

ubstring of S.ubstring of S. An An l-breakl-break is a break of length is a break of length ll..

l-break

Break(1/4)Break(1/4)

period l-break period l-break

A large number of A large number of breaksbreaks are useful are useful for fast algorithm for string matching for fast algorithm for string matching with k mismatches.with k mismatches.

aperiod

Lemma 3:Lemma 3:

Let P be a pattern with 2k disjoint Let P be a pattern with 2k disjoint

l-breakl-break and let T be a text. In each and let T be a text. In each match of P in T match of P in T at least k of theat least k of the l- l-break break match exactly match exactly..

Break(2/4)Break(2/4)

Pf/Pf/

There are at most k mismatches in a There are at most k mismatches in a match and P has 2k disjoint match and P has 2k disjoint l-breaksl-breaks..

Since at most k do not match Since at most k do not match exactly, exactly, at least kat least k must match must match exactly. exactly.

Break(3/4)Break(3/4)

Lemma 4:Lemma 4:P is an m length pattern with < 2kP is an m length pattern with < 2k l-breaksl-breaks..

the length of T is 2m.the length of T is 2m.Then all matches of P in T are in a substrinThen all matches of P in T are in a substrin

ggof T which has at most O(k)of T which has at most O(k) l-breaks l-breaks..

Break(4/4)Break(4/4)

proved in section 6 from Cole , Hariharan "Approximate string matching:

a simple faster algorithm "

AssumptionAssumption PeriodicityPeriodicity BreakBreak Counting ArgumentCounting Argument P has 2k disjoint k-breaksP has 2k disjoint k-breaks P has 2k disjoint l-breaksP has 2k disjoint l-breaks Local matchesLocal matches

OUTLINEOUTLINE

Theorem 1:Theorem 1:

P is a pattern with 2k disjointP is a pattern with 2k disjoint k-k-breaksbreaks..

In everyIn every k k contiguous locations in contiguous locations in T ,at most 4 matches of the pattern.T ,at most 4 matches of the pattern.

Counting Arguments(1/3)Counting Arguments(1/3)

ABCDABCDABC

ABCDABC

kbreak

ABCDABCABCDABC

k

ABCDABC

ABCDABC

ABCDABC

2

l

Counting Arguments(2/3)Counting Arguments(2/3)

pf/pf/

TT

PP

ForFor kk contiguous locations in T, the contiguous locations in T, the overall numbers of exact matches of overall numbers of exact matches of thethe k-breaks k-breaks is is at mostat most 4k4k. .

This means that at most This means that at most 4 locations4 locations have khave k k-breaksk-breaks with an exact match, with an exact match, in their respective locations.in their respective locations.

Counting Arguments(3/3)Counting Arguments(3/3)

( 2*2 )k

AssumptionAssumption PeriodicityPeriodicity BreakBreak Counting ArgumentCounting Argument P has 2k disjoint k-breaksP has 2k disjoint k-breaks P has 2k disjoint l-breaksP has 2k disjoint l-breaks Local matchesLocal matches

OUTLINEOUTLINE

Corollary 1:Corollary 1: If P has 2k disjointIf P has 2k disjoint k-breaksk-breaks then there then there

are at most matches of P in T.are at most matches of P in T. These matches can be found in These matches can be found in O(n+m)O(n+m) time.time.

P has 2k disjointP has 2k disjoint k-k-breaks breaks (1/4)(1/4)

4n

k

pf/pf/From From Theorem 1Theorem 1 there are at most matc there are at most matc

hes of P in T. Therefore, if we knew theshes of P in T. Therefore, if we knew these locations in advance, e locations in advance, verificationverification woulwould take d take O(k)O(k) per location. per location.

4n

k next we describe a method of finding the next we describe a method of finding the candidate location in time candidate location in time O(n)O(n)

( )n

Ok

P has 2k disjointP has 2k disjoint k-k-breaks breaks (2/4)(2/4)

1.1. Find all exact matches of all breaks in Find all exact matches of all breaks in the text.the text.

2.2. For every such match, mark all text For every such match, mark all text locations for pattern occurrence locations for pattern occurrence appropriate for this break. appropriate for this break.

3.3. Discard every text location that is Discard every text location that is marked less than k marks.marked less than k marks.

1)1) There are O(n) exact matches of breaks and There are O(n) exact matches of breaks and they can be found in linear time.they can be found in linear time.

2)2) There is a total of O(n) marks.There is a total of O(n) marks.

P has 2k disjointP has 2k disjoint k-k-breaks breaks (3/4)(3/4)

2)2) There are l distinct breaks, appearing aThere are l distinct breaks, appearing a11…a…all time respectively. The total # of appearance time respectively. The total # of appearance of each distinct k-break does not exceed of each distinct k-break does not exceed

The total # of marks is The total # of marks is

1)1) Each distinct Each distinct k-break k-break can appear at most can appear at most times in the text and since there aretimes in the text and since there are 2k2k k-breaks k-breaks .#of all.#of all k-breaks k-breaks in the text does n in the text does n

ot exceed ot exceed 4n4n. The total length of all . The total length of all k-breakk-breaks s ≤m. ≤m.

All exact matches of all All exact matches of all k-breaks k-breaks in the text cin the text can be found inan be found in O(n+m)O(n+m)

2n

k

2n

k

1

24

l

i

i

na n

k

P has 2k disjointP has 2k disjoint k-k-breaks breaks (4/4)(4/4)

AssumptionAssumption PeriodicityPeriodicity BreakBreak Counting ArgumentCounting Argument P has 2k disjoint k-breaksP has 2k disjoint k-breaks P has 2k disjoint l-breaksP has 2k disjoint l-breaks Local matchesLocal matches

OUTLINEOUTLINE

P has 2k disjointP has 2k disjoint l-l-breaksbreaks(1/7)(1/7)

The pattern does not always contain 2k k-breaks. Nevertheless, they may be an l such that there are 2k l-breaks.

By Corollary 1, finding them may take costly time. ( )[..... 2 *(2 )]

nk nO k

l l

To circumvent this problem, rather than searching for all matches, se need a way to seek for local match.

P has 2k disjointP has 2k disjoint l-l-breaksbreaks(2/7)(2/7)

Lemma 5:Lemma 5: let P be a pattern with 2k disjointlet P be a pattern with 2k disjoint l-breaks l-breaks a a

nd nd let T be a text of size n.let T be a text of size n. We can preprocess T in We can preprocess T in O(n)O(n) time such that, time such that,

given given ll contiguous text locations, we can iden contiguous text locations, we can identify the, at most 4, locations where P matchetify the, at most 4, locations where P matches in time O(klogk)s in time O(klogk)

pf/pf/ S={BS={B11,…,B,…,B2k2k} :set of 2k } :set of 2k l-breaksl-breaks of P of P S’={BS’={B11’,…,B’,…,Bff’}: ,be the maximal subse’}: ,be the maximal subse

t of distinct l-breaks of S.t of distinct l-breaks of S. S’ can be found in time by consS’ can be found in time by cons

tructing a tructing a trietrie of the strings in S. of the strings in S.

P has 2k disjointP has 2k disjoint l-l-breaksbreaks(3/7)(3/7)

2f k

2

1

( ) ( )k

i

O Bi O m

Since each break in S’ is distinct, the Since each break in S’ is distinct, the overall number of exact matches of overall number of exact matches of l-l-breaksbreaks of S’ in T is bounded by n. of S’ in T is bounded by n. These exact matches can be found inThese exact matches can be found in

P has 2k disjointP has 2k disjoint l-l-breaksbreaks(4/7)(4/7)

1

( ' ) ( )f

i

i

O n B O n m

A[1..n]:length =n, corresponding to the n locA[1..n]:length =n, corresponding to the n location of the text.ation of the text.

A[i] is the index of the certain A[i] is the index of the certain l-breakl-break of S’ of S’ matches at location i of T.matches at location i of T.

Partition:Partition:

k

pieces of size k

n

k

pieces of size k

n

k

3 3 51 ... 1 ...

2 2 2 2

k k k kA A A A

1 ... 1 ... 2A A k A k A k

P has 2k disjointP has 2k disjoint l-l-breaksbreaks(5/7)(5/7)

leaves corresponding to the locations containing j in this piece of size k

for each piece of for each piece of size k and each bsize k and each break B’reak B’jj in S’ c in S’ create a balanced reate a balanced binary search trebinary search tree.e.

#of tree is #of tree is 2 2

. .2 4n n

f k nk k

P has 2k disjointP has 2k disjoint l-l-breaksbreaks(6/7)(6/7)

The size of each tree is O(1)+O(The size of each tree is O(1)+O(# of leaves# of leaves)) The leaves of all trees correspond to all The leaves of all trees correspond to all

exact matches of the exact matches of the l-breaksl-breaks of S’ in T. of S’ in T.

P has 2k disjointP has 2k disjoint l-l-breaksbreaks(7/7)(7/7)

Since there are at most n such exact matches,Since there are at most n such exact matches,the overall size of the trees is the overall size of the trees is O(n).O(n). The trees can be constructed in The trees can be constructed in O(n) O(n) time.time.

AssumptionAssumption PeriodicityPeriodicity BreakBreak Counting ArgumentCounting Argument P has 2k disjoint k-breaksP has 2k disjoint k-breaks P has 2k disjoint l-breaksP has 2k disjoint l-breaks Local matchesLocal matches

OUTLINEOUTLINE

Lemma 3Lemma 3 shows that a matches of the p shows that a matches of the pattern dictates that k out of the patternattern dictates that k out of the pattern’s ’s l-breaksl-breaks B B11,…,B,…,Bkk match exactly in T a match exactly in T at appropriate shift.t appropriate shift.

Since 2 exact matches of BSince 2 exact matches of B ii must appear must appear in l contiguous location, we can find thein l contiguous location, we can find them by using exactly one of them by using exactly one of the BST BST..

Local matches O(klogk) (1/3)Local matches O(klogk) (1/3)

The BST is a balanced Binary Search Tree; The BST is a balanced Binary Search Tree; therefore finding the 2 exact matches take therefore finding the 2 exact matches take O(logk)O(logk) time. time.

Finding exact matches of all BFinding exact matches of all B i i and markinand marking potential matches take g potential matches take O(klogk)O(klogk) with at with at most 4k marksmost 4k marks

Local matches O(klogk) (2/3)Local matches O(klogk) (2/3)

Since only 4 ,out of Since only 4 ,out of ll, locations for poten, locations for potential matches can have k marks, the pattetial matches can have k marks, the pattern can match at most 4 locations.rn can match at most 4 locations.

These 4 potential locations can be verifiThese 4 potential locations can be verified for a match in ed for a match in O(k)O(k) time. time.

Local matches O(klogk) (3/3)Local matches O(klogk) (3/3)

Speaker : R92921083 Speaker : R92921083 余宗恩余宗恩

Improved AlgorithmImproved Algorithm

Discuss the Frequency of SymbolsDiscuss the Frequency of Symbols

Divided String matching problem intoDivided String matching problem into

Marking & Verification StageMarking & Verification Stage

Faster Algorithm for Small KFaster Algorithm for Small K Concept about "Break" HandlingConcept about "Break" Handling The AlgorithmThe Algorithm

O( )logn k k

O( )

3lognk km

Schema of the paperSchema of the paper

Improve to

The AlgorithmThe Algorithm

Goal:Goal: An efficient algorithm for general caseAn efficient algorithm for general case Time Complexity: O( )Time Complexity: O( ) Additional Techniques Required:Additional Techniques Required:

l-boundryl-boundry Dominated patternDominated pattern overlappingoverlapping

3 lognk k

m

Chance to ImproveChance to Improve

Can we find an optimal length of break?Can we find an optimal length of break? l-boundryl-boundry

Can we make use of the repetition?Can we make use of the repetition? Dominating patternDominating pattern

l-boundryl-boundry

Appropriate length of l, such thatAppropriate length of l, such that # of l-1 break >=2k# of l-1 break >=2k # of l break <=2k# of l break <=2k

Ex. aadccbcbcacdEx. aadccbcbcacd # of 2-break=3# of 2-break=3

# of 3-break=2# of 3-break=2

aaaadcdccbcbcbcbcacdcacd

aaaadccdccbcbcbcbcacdacd

Property of l-boundryProperty of l-boundry

By theorem 1: P has 2k l-breaksBy theorem 1: P has 2k l-breaks=>at most n/k non-discard location=>at most n/k non-discard location

l-1 breaks >=2kl-1 breaks >=2k => at most O(n/k) non-discard location=> at most O(n/k) non-discard location

Property of l-boundryProperty of l-boundry

By Lemma 5: By Lemma 5: 若若 P P 有 有 <= 2k <= 2k 個 個 l-breakl-break

->-> 所有所有 match match 會在 會在 T T 的一段 的一段 substsubstringring 上,且該 上,且該 substring substring 有 有 O(k) O(k) 個 個 l-brl-breakeak

P P 有 有 <= 2k <= 2k 個 個 l-breakl-break => P => P 與 與 T T 上的 上的 l-break l-break 都是 都是 O(k) O(k) 個個

How to find l-boundry?How to find l-boundry?

1<=l<=k1<=l<=k use binary searchuse binary search

O(logk)O(logk) Complexity:Complexity:

O(mlogk)O(mlogk)

Improve by PeroidicityImprove by Peroidicity

Idea: If there are many occurences of repetition, Idea: If there are many occurences of repetition, we can make use of the former result of matchinwe can make use of the former result of matching.g.

T:T:abcabcababcabcabdada P:abP:ab

# of mismatches in location 1, 2, 3 are equal to 4, # of mismatches in location 1, 2, 3 are equal to 4, 5, 6 respectively.5, 6 respectively.

abcabcababcabcabdadaabab

DefinitionDefinition

w: a string with length at most l/2w: a string with length at most l/2 w*: infinite string www...w*: infinite string www... ww2l2l*: 2 l length prefix of w**: 2 l length prefix of w* a string s of length l has period wa string s of length l has period w

=> s is a substring of w=> s is a substring of w2l2l**

Definition (con'd)Definition (con'd)

l-segment: l-segment:

Divided Pattern and text into strings of Divided Pattern and text into strings of length llength l

Bad l-segment: Bad l-segment: a period stretch which doesn't have period a period stretch which doesn't have period

ww has period w but intersect with breakshas period w but intersect with breaks

Definition (con'd)Definition (con'd)

Pattern P is a Pattern P is a dominating patterndominating pattern

=> P has at most 4k l-segments => P has at most 4k l-segments which doesn't have period wwhich doesn't have period w

Text Location (i) is overlapping:Text Location (i) is overlapping:

P

i

AlgorithmAlgorithm

Case 1: P is a dominating patternCase 1: P is a dominating pattern

Case 2: P is not a dominating patternCase 2: P is not a dominating pattern

P is a dominating patternP is a dominating pattern

Case 1: Text location (i) is Case 1: Text location (i) is overlappingoverlapping

P

i

Text location (i) is not Text location (i) is not overlappingoverlapping

P

ii w

Text location (i) is not Text location (i) is not overlappingoverlapping

A bad l-segment in P won't match a A bad l-segment in P won't match a bad l-segment in Tbad l-segment in T

A bad l-segment in T won't match a A bad l-segment in T won't match a bad l-segment in Pbad l-segment in P

# of mismatch in location i # of mismatch in location i

= # of mismatch in location = # of mismatch in location

(assume location is not (assume location is not overlapping)overlapping)

i w

i w

AlgorithmAlgorithm

A B A B A B A B C D A AA B A B

A B A B C D A B

2 0 0 2 0 2 0 2 # # # #

A B A B

2

AlgorithmAlgorithm1. find all matches of P in T at overlapping locations1. find all matches of P in T at overlapping locations

2. for each bad l-segment B,2. for each bad l-segment B,

do P.M with mismatches, with pattern B and Text wdo P.M with mismatches, with pattern B and Text w2l2l**

3. do P.M with mismatches, with pattern w and Text w3. do P.M with mismatches, with pattern w and Text w2l2l**

4. compute the # of mismatches of P at the first w locations of T4. compute the # of mismatches of P at the first w locations of T

using step 2 and 3using step 2 and 3

5. i <- 5. i <-

6. while end of text not reached6. while end of text not reached

6a. if i is not an overlapping location6a. if i is not an overlapping location

6aa. # of mismatch at location i <- # of mismatch at location 6aa. # of mismatch at location i <- # of mismatch at location

6ab. i <- i+16ab. i <- i+1

6b. else, if j is the next non-overlapping location6b. else, if j is the next non-overlapping location

6ba. for each bad l-segment participating in an overlap in the 6ba. for each bad l-segment participating in an overlap in the overlapping locations overlapping locations

i to j, update the # of mismatches it accrues in the next i to j, update the # of mismatches it accrues in the next locationslocations

6bb. i <- j6bb. i <- j

i w

1w

w

O( )3 logk kO( )2l k

O( )2kO( )lk

O( n )

Complexity analysisComplexity analysis

Lemma 6: P has at most 8k bad l-Lemma 6: P has at most 8k bad l-segmentsegment 1. l-segments which doesn't have period 1. l-segments which doesn't have period

ww <= 4k<= 4k 2. l-segments which has period w 2. l-segments which has period w

but intersect with breaksbut intersect with breaks

<=4k<=4k Total Total <= 8k<= 8k

Complexity AnalysisComplexity Analysis

Total O(n+mlogk+ )Total O(n+mlogk+ ) O(n)O(n): Segmentation: Segmentation O(mlogk)O(mlogk): finding l-boundry: finding l-boundry O( )O( ): find # of mismatches: find # of mismatches3lognk k

m

3lognk km

P is a non-dominating P is a non-dominating patternpattern

Algorithm 1:Algorithm 1: adjust the algorithm used aboveadjust the algorithm used above O( )O( )

Algorithm 2:Algorithm 2: Find candidates more effectivelyFind candidates more effectively O( )O( )

4lognk km

3lognk km

Algorithm 1Algorithm 1

find a substring S of P, S has at most find a substring S of P, S has at most 2k bad l-segments2k bad l-segments

do P.M with k mismatches with do P.M with k mismatches with pattern S and text Tpattern S and text T

verify locations where has less than k verify locations where has less than k mismatchesmismatches

Complexity: O( )Complexity: O( )4lognk km

Algorithm 2Algorithm 2

Idea: if location i is a matchIdea: if location i is a match S S 上有最多 上有最多 k k 個 個 l-segment match T l-segment match T 上沒有 上沒有

period w period w 的 的 l-segmentl-segment T T 上最多 上最多 k k 個 個 l-segment match S l-segment match S 上沒有 上沒有 pp

eriod w eriod w 的 的 l-segmentl-segment 可用 可用 Convolution Convolution 找到符合上述兩條件的 找到符合上述兩條件的 locloc

ationation

O(nlogm)=O(nlogk)O(nlogm)=O(nlogk)0 10 0 0

Complexity analysisComplexity analysis

Total: O( )Total: O( )

Convolution: O( )Convolution: O( )

Verification: O( )Verification: O( )

3loglog nk kn k m

logn k

3lognk km