Computing Reversed Lempel-Ziv Factorization Online

HABATAKITAI Laboratory: Everything is String.


Shiho Sugimoto, Tomohiro I, Shunsuke Inenaga,Hideo Bannai, Masayuki Takeda

Kyushu University, Japan

Outline

• Reversed LZ factorization without self-references (RLZ)
• Online RLZ algorithm by Kolpakov and Kucherov
• New online RLZ algorithm using O(n log σ) bits of space
• Reversed LZ factorization with self-references (RLZS)
• New online RLZS algorithm using O(n log n) bits of space
• New online RLZS algorithm using O(n log σ) bits of space

n: the length of the input string
σ: the alphabet size

Background

• LZ factorization was proposed in 1977 [Ziv & Lempel, 1977].
  – Applications: data compression, etc.
• Reversed LZ factorization (RLZ for short) was proposed in 2009 [Kolpakov & Kucherov, 2009].
  – Applications: finding gapped palindromes, etc.

LZ factorization without self-references [Ziv & Lempel, 1977]

The LZ factorization without self-references of a string w of length n is a factorization s1, s2, ..., sm such that
• w = s1 s2 … sm
• si is the longest non-empty prefix of w[|s1…si−1|+1..n] that is also a substring of w[1..|s1…si−1|], if such a prefix exists
• si = w[|s1…si−1|+1] otherwise

Ex) w = a b b a a a a b b b a c
    s1 = a, s2 = b, s3 = b, s4 = a, s5 = a, s6 = aa, s7 = bb, s8 = ba, s9 = c
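The definition above can be turned into a direct, unoptimized check. The following is a minimal sketch, assuming a plain Python string as input; the function name `lz_no_self_ref` is hypothetical and the substring test is deliberately naive (it is not an efficient algorithm from the talk).

```python
def lz_no_self_ref(w: str) -> list[str]:
    """Naive LZ factorization without self-references, straight from the definition."""
    factors = []
    k = 0  # number of characters already factorized, i.e. |s1...s(i-1)|
    while k < len(w):
        length = 0
        # Grow the factor while the longer prefix still occurs inside w[:k].
        while k + length < len(w) and w[k:k + length + 1] in w[:k]:
            length += 1
        length = max(length, 1)  # a fresh character becomes a factor of length 1
        factors.append(w[k:k + length])
        k += length
    return factors


print(lz_no_self_ref("abbaaaabbbac"))
# ['a', 'b', 'b', 'a', 'a', 'aa', 'bb', 'ba', 'c']
```

Growing the factor one character at a time is safe here because every prefix of a string that occurs in w[:k] also occurs in w[:k].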

Reversed LZ factorization without self-references (RLZ) [Kolpakov & Kucherov, 2009]

The RLZ without self-references of a string w of length n is a factorization f1, f2, ..., fm such that
• w = f1 f2 … fm
• fi is the longest non-empty prefix of w[|f1…fi−1|+1..n] that is also a substring of the reversed prefix w[1..|f1…fi−1|]^R, if such a prefix exists (x^R denotes the reversal of x)
• fi = w[|f1…fi−1|+1] otherwise

Ex) w = a b b a a a a b b b a c
    f1 = a, f2 = b, f3 = ba, f4 = a, f5 = aabb, f6 = ba, f7 = c
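The only change from the LZ definition is that the reference text is the reversed prefix. A minimal sketch under the same assumptions as a naive quadratic check (the name `rlz_no_self_ref` is hypothetical):

```python
def rlz_no_self_ref(w: str) -> list[str]:
    """Naive reversed LZ factorization without self-references."""
    factors = []
    k = 0  # number of characters already factorized
    while k < len(w):
        reversed_prefix = w[:k][::-1]  # w[1..k]^R in the slide notation
        length = 0
        # Grow the factor while the longer prefix still occurs in the reversed prefix.
        while k + length < len(w) and w[k:k + length + 1] in reversed_prefix:
            length += 1
        length = max(length, 1)  # a fresh character becomes a factor of length 1
        factors.append(w[k:k + length])
        k += length
    return factors


print(rlz_no_self_ref("abbaaaabbbac"))
# ['a', 'b', 'ba', 'a', 'aabb', 'ba', 'c']
```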

KK algorithm [Kolpakov & Kucherov, 2009]

• Computes RLZ in an online manner.
• Works in O(n log n) bits of space and O(n log σ) time (on the word RAM model).
  – Constructs the suffix tree of the reversed prefixes online.
  – Computes the RLZ factors from the suffix tree.
  – Blumer's version of Weiner's algorithm achieves the above complexity [Blumer et al., 1985] [Weiner, 1973].
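To illustrate only the online control flow (read one character, either extend the current factor or emit it), here is a sketch in which the suffix tree is swapped for a naive substring search. It is therefore not the KK algorithm and has none of its complexity guarantees; the generator name `rlz_online` is hypothetical.

```python
from typing import Iterable, Iterator


def rlz_online(chars: Iterable[str]) -> Iterator[str]:
    """Emit RLZ factors (without self-references) while characters arrive."""
    w = []      # characters read so far
    start = 0   # start of the factor currently being extended
    for c in chars:
        w.append(c)
        text = "".join(w)
        # Naive stand-in for a suffix tree query on the reversed prefix.
        if text[start:] in text[:start][::-1]:
            continue  # the current factor absorbs c
        if len(text) - start > 1:
            # The factor without c was maximal: emit it and restart at c.
            yield text[start:-1]
            start = len(text) - 1
            if c in text[:start][::-1]:
                continue
        # c does not occur earlier: it is a factor of length 1.
        yield c
        start = len(text)
    if start < len(w):
        yield "".join(w[start:])  # last factor, closed by the end of input


print(list(rlz_online("abbaaaabbbac")))
# ['a', 'b', 'ba', 'a', 'aabb', 'ba', 'c']
```

A factor is emitted as soon as the next character proves it maximal, so the output lags the input by at most one factor.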

KK algorithm [Kolpakov & Kucherov, 2009]

Ex) w = a b b a a a a b b b a c

[Figure: the suffix tree grows online as the factors f1, ..., f5 are computed, through Stree(ε), Stree(a^R), Stree((ab)^R), Stree((abba)^R), and Stree((aabba)^R).]

• This suffix tree requires O(n log n) bits of space.
• We propose a new online RLZ algorithm which uses only O(n log σ) bits of space (σ ≤ n is the alphabet size).

For O(n log σ) bits of space

• We utilize the idea of Starikovskaya's algorithm.
  – It computes the LZ factorization online in O(n log σ) bits of space and O(n log² n) time [Starikovskaya, 2012].
• We divide the input string into blocks of length r = O(log_σ n).
  – Each block is replaced by a meta-character.

Ex) r = 3
    w      = abb aaa abb bac ………
    blocks =  B   A   B   C  ………
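The block encoding in the example can be sketched as follows. This is an illustration only: the function name `to_meta_characters` is hypothetical, and ranking the blocks by sorting is an offline shortcut chosen to reproduce the labels A, B, C on the slide, not the way an online algorithm would assign them.

```python
def to_meta_characters(w: str, r: int) -> tuple[list[int], list[str]]:
    """Cut w into blocks of length r and replace each block by an integer code."""
    usable = len(w) - len(w) % r  # an incomplete trailing block is left out
    blocks = [w[i:i + r] for i in range(0, usable, r)]
    alphabet = sorted(set(blocks))              # distinct blocks = meta-alphabet
    rank = {block: code for code, block in enumerate(alphabet)}
    return [rank[block] for block in blocks], alphabet


codes, alphabet = to_meta_characters("abbaaaabbbac", 3)
print(alphabet)                                           # ['aaa', 'abb', 'bac']
print(" ".join(chr(ord("A") + code) for code in codes))   # B A B C
```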

Our online RLZ algorithm

• For fi of length shorter than r, we use a suffix trie of the reversed subwords of length 2r.
  – We can find fi in o(n) bits of space and O(|fi| log σ) time.
• For fi of length at least r, we use a suffix tree of the reversed blocks (meta-characters).
  – We can find fi in O(n log σ) bits of space and O(|fi| log² n) time.

Theorem
We can compute RLZ without self-references online in O(n log σ) bits of space and O(n log² n) time.


LZ factorization with self-references [Ziv & Lempel, 1977]

The LZ factorization with self-references of a string w of length n is a factorization t1, t2, ..., tm such that
• w = t1 t2 … tm
• ti is the longest non-empty prefix of w[|t1…ti−1|+1..n] that is also a substring of w[1..|t1…ti|−1], if such a prefix exists (self-reference: the earlier occurrence may overlap ti itself)
• ti = w[|t1…ti−1|+1] otherwise

Ex) w = a b b a a a a b b b a c
    t1 = a, t2 = b, t3 = b, t4 = a, t5 = aaa, t6 = bb, t7 = ba, t8 = c
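With self-references, the earlier occurrence only has to start before the factor; it may run into the factor itself. A minimal naive sketch (the name `lz_self_ref` is hypothetical), using the `end` argument of `str.find` to restrict the search to w[1..|t1…ti|−1]:

```python
def lz_self_ref(w: str) -> list[str]:
    """Naive LZ factorization with self-references."""
    factors = []
    k = 0  # number of characters already factorized
    while k < len(w):
        length = 0
        # A candidate of length (length + 1) must occur inside w[:k + length],
        # i.e. it must start before position k but may overlap the factor.
        while (k + length < len(w)
               and w.find(w[k:k + length + 1], 0, k + length) != -1):
            length += 1
        length = max(length, 1)  # a fresh character becomes a factor of length 1
        factors.append(w[k:k + length])
        k += length
    return factors


print(lz_self_ref("abbaaaabbbac"))
# ['a', 'b', 'b', 'a', 'aaa', 'bb', 'ba', 'c']
```

The factor t5 = aaa shows the overlap: its earlier occurrence starts one position before the factor and shares two characters with it.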

Reversed LZ factorization with self-references

RLZ with self-references (RLZS) of a string w of length n is a factorization g1, g2, ..., gm such that
• w = g1 g2 … gm
• gi is the longest non-empty prefix of w[|g1…gi−1|+1..n] that is also a substring of w[1..|g1…gi|−1]^R, if such a prefix exists (self-reference: the reversed prefix may overlap gi itself)
• gi = w[|g1…gi−1|+1] otherwise

Ex) w = a b b a a a a b b b a c
    g1 = a, g2 = bba, g3 = aaabb, g4 = ba, g5 = c
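A naive check of the RLZS definition is sketched below (the name `rlzs` is hypothetical). Unlike the cases without self-references, a short candidate can fail while a longer one succeeds: for g2 in the example, "b" does not occur in w[1..1]^R = "a", but "bba" occurs in w[1..3]^R = "bba". The sketch therefore tests every length from the longest down instead of growing the factor greedily.

```python
def rlzs(w: str) -> list[str]:
    """Naive reversed LZ factorization with self-references."""
    factors = []
    k = 0  # number of characters already factorized
    n = len(w)
    while k < n:
        best = 1  # default: a single fresh character
        for length in range(n - k, 0, -1):
            # The reference is the reversal of w[1..k + length - 1].
            if w[k:k + length] in w[:k + length - 1][::-1]:
                best = length
                break
        factors.append(w[k:k + best])
        k += best
    return factors


print(rlzs("abbaaaabbbac"))
# ['a', 'bba', 'aaabb', 'ba', 'c']
```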

Online computation of RLZS

Ex) w = a b b a a a a b b b a c

w[1..1]  = a
w[1..2]  = a b
w[1..3]  = a b b
w[1..4]  = a b b a
w[1..5]  = a b b a a
w[1..6]  = a b b a a a
w[1..7]  = a b b a a a a
w[1..8]  = a b b a a a a b
w[1..9]  = a b b a a a a b b
w[1..10] = a b b a a a a b b b
w[1..11] = a b b a a a a b b b a
w[1..12] = a b b a a a a b b b a c

Reversed LZ factorization with self-references

• Every self-referencing factor is a suffix of a palindrome.

Ex) w = a b b a a a a b b b a c

[Figure: the factors g1, ..., g5 of w, with a palindrome highlighted.]

Online RLZS in O(n log n) bits of space

We can compute each RLZS factor gi by
• using the KK algorithm, and
  – in a total of O(n log n) bits of space and O(n log σ) time;
• computing the longest palindrome which ends at each position, online
  – in a total of O(n log n) bits of space and O(n) time, by modifying Manacher's algorithm [Manacher, 1975].

Theorem
We can compute RLZS online in O(n log n) bits of space and O(n log σ) time.
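The second ingredient, the longest palindrome ending at each position, can be sketched online as follows. This is not Manacher's algorithm: it only uses the fact that the start of the longest suffix palindrome can move left by at most one position per appended character, and it re-checks candidates naively, so the worst case is quadratic. The function name is hypothetical.

```python
def longest_suffix_palindromes(w: str) -> list[int]:
    """For each prefix w[:end], the length of its longest suffix palindrome."""
    lengths = []
    start = 0  # start of the longest suffix palindrome of the previous prefix
    for end in range(1, len(w) + 1):
        # Stripping both ends of the new palindrome leaves a suffix palindrome
        # of the previous prefix, so the start moves left by at most one.
        start = max(start - 1, 0)
        while w[start:end] != w[start:end][::-1]:
            start += 1  # terminates: a single character is a palindrome
        lengths.append(end - start)
    return lengths


print(longest_suffix_palindromes("abbaaaabbbac"))
# [1, 1, 2, 4, 2, 3, 4, 6, 8, 3, 5, 1]
```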


Suffix palindromes

• All suffix palindromes of a string of length n can be represented by O(log n) arithmetic progressions [Apostolico, 1995].

Ex) w = a b a b a c a b a b a d a b a b a c a b a b a
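For the example string, the suffix palindromes and one possible grouping of their lengths can be computed by brute force. This is a sketch with hypothetical function names; the greedy grouping by equal consecutive differences is a simplification for illustration and is not claimed to be the construction behind the O(log n) bound.

```python
def suffix_palindrome_lengths(w: str) -> list[int]:
    """Lengths of all suffixes of w that are palindromes, in increasing order."""
    return [l for l in range(1, len(w) + 1) if w[-l:] == w[-l:][::-1]]


def group_into_progressions(lengths: list[int]) -> list[tuple[int, int, int]]:
    """Greedily split an increasing list into runs (first, difference, count)."""
    runs = []
    i = 0
    while i < len(lengths):
        if i + 1 == len(lengths):
            runs.append((lengths[i], 0, 1))  # a lone element
            break
        diff = lengths[i + 1] - lengths[i]
        j = i + 1
        # Extend the run while the gap between neighbours stays the same.
        while j + 1 < len(lengths) and lengths[j + 1] - lengths[j] == diff:
            j += 1
        runs.append((lengths[i], diff, j - i + 1))
        i = j + 1
    return runs


w = "ababacababadababacababa"
print(suffix_palindrome_lengths(w))                           # [1, 3, 5, 11, 23]
print(group_into_progressions(suffix_palindrome_lengths(w)))  # [(1, 2, 3), (11, 12, 2)]
```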

Online computation of suffix palindromes

• What happens to the suffix palindromes when a new character is appended?

Ex) w  = a b a b a b a b
    wa = a b a b a b a b a
    wc = a b a b a b a b c

[Figure: suffix palindromes of w preceded by a character x when a new character is appended; two cases are illustrated, labeled "if x = a" and "if x = b".]
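The update can be sketched with explicit length lists: a suffix palindrome of w survives the appended character c, growing by two, exactly when the character just before it equals c. The function below is a hypothetical illustration that stores every length, so it does not use the compact arithmetic-progression representation.

```python
def extend_suffix_palindromes(w: str, lengths: list[int], c: str) -> list[int]:
    """Suffix palindrome lengths of w + c, given those of w (increasing order)."""
    new_lengths = [1]              # the single character c
    if w and w[-1] == c:
        new_lengths.append(2)      # the palindrome "cc"
    for length in lengths:
        before = len(w) - length - 1   # index of the character x preceding it
        if before >= 0 and w[before] == c:
            new_lengths.append(length + 2)  # x P c with x == c is a palindrome
    return new_lengths


w = "abababab"
print(extend_suffix_palindromes(w, [1, 3, 5, 7], "a"))  # [1, 3, 5, 7, 9]
print(extend_suffix_palindromes(w, [1, 3, 5, 7], "c"))  # [1]
```

On the example from the slide, appending a extends every suffix palindrome of w, while appending c leaves only the new character itself.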

Computing RLZS in O(n log σ) bits of space

We can compute each RLZS factor gi by
• using our RLZ algorithm, and
  – in a total of O(n log σ) bits of space and O(n log² n) time;
• computing the longest palindrome which ends at each position, online
  – in a total of O(log² n) bits of space and O(n log n) time.

Theorem
We can compute RLZS online in O(n log σ) bits of space and O(n log² n) time.

The problems of RLZS

• RLZS was too difficult for us to factorize.
  – There is a mistake in the proceedings of the PSC (p. 114).
  – Proof: [Figure: the example string w = a b b a a a a b b b a c, shown twice.]
• We have no idea yet how to use RLZS.

Conclusion

RLZ online algorithms

                     space
self-references      O(n log n) bits                                O(n log σ) bits
without              O(n log σ) time [Kolpakov & Kucherov, 2009]    O(n log² n) time (this work)
with                 O(n log σ) time (this work)                    O(n log² n) time (this work)

n: the length of the input string
σ: the alphabet size
