Speeding up pattern matching by text compression


Department of Informatics, Kyushu University, Japan

Department of AI, Kyushu Institute of Technology, Japan

Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa

Contents

Pattern matching on compressed text.

A unifying framework for compressed pattern matching (Collage System).

Byte pair encoding (BPE).

Pattern matching algorithm on BPE compressed text.

Experimental result.

Conclusion.

Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. By the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism.

Pattern Matching Problem

[Figure: matching a pattern against a text]

Knuth-Morris-Pratt (1974)

Boyer-Moore (1977)

Aho-Corasick (1975)

Shift-Or (1992)

Pattern Matching on Compressed Text

[Figure: the compressed text on secondary disk storage is transferred to memory, expanded into the original text, and then searched.]

It requires extra time and space.

Pattern Matching on Compressed Text

[Figure: the compressed text on secondary disk storage is transferred to memory and searched directly.]

GOAL 1: To search compressed texts faster than a regular decompression followed by an ordinary search.

GOAL 2: To search compressed texts faster than an ordinary search in the original texts.

Speeding up pattern matching by text compression

Previous Results

year  researchers                                   compression
1988  Eilam-Tzoreff and Vishkin                     run-length
1992  Amir, Landau, and Vishkin                     two-dimensional run-length
1992  Amir and Benson                               two-dimensional run-length
1994  Amir, Benson, and Farach                      two-dimensional run-length
1994  Manber                                        original compression scheme
1995  Farach and Thorup                             LZ77
1996  Amir, Benson, and Farach                      LZW
1996  Gasieniec, et al.                             LZ77
1997  Karpinski, Rytter, and Shinohara              straight-line programs
1997  Miyazaki, Shinohara, and Takeda               straight-line programs
1997  Takeda                                        finite state encoding
1998  Shibata                                       byte pair encoding
1998  Fukamachi, Shinohara, and Takeda              Huffman encoding
1998  Kida, et al.                                  LZW
1998  de Moura, Navarro, Ziviani, and Baeza-Yates   word based encoding
1999  Kida, Takeda, Shinohara, and Arikawa
1999  Shibata, Takeda, Shinohara, and Arikawa       antidictionary based
1999  Navarro and Raffinot                          LZ family
1999  Kida, et al.                                  dictionary based methods (collage system): unifying framework
2000  Shibata, et al.                               byte pair encoding: today's talk

A Unifying Framework for Compressed Pattern Matching

Previous:

Compression A → PM Algorithm A

Compression B → PM Algorithm B

Compression C → PM Algorithm C

Kida et al. [1999]:

Compression A, Compression B, Compression C → Collage system → Pattern matching algorithm on the unifying framework

Collage System

Definition and Several Examples

Dictionary Based Compression

[Figure: the original text is factorized into a series of phrases; the compressed text consists of the dictionary structure and the encoding of the phrases.]

• How to choose the phrases.

• How to design the data structure of the dictionary.

• How to encode phrases.

Collage System

A collage system is a pair $\langle D, S \rangle$.

D : A sequence of assignments (Dictionary structure)

$X_1 := expr_1;\ X_2 := expr_2;\ \ldots;\ X_n := expr_n$

S : A sequence of variables defined in D (Compressed text)

$S = X_{i_1}, X_{i_2}, \ldots, X_{i_l}$ (each $X_{i_j}$ is defined in D)

$\|D\| = n$ : number of assignments in D

$|S| = l$ : number of variables in S

where each $expr_k$ is one of:

• $a$ for $a \in \Sigma \cup \{\varepsilon\}$ (primitive assignment)

• $X_i \cdot X_j$ for $i, j < k$ (concatenation)

• $(X_i)^j$ for $i < k$ and integer $j$ ($j$ times repetition)

• $^{[j]}X_i$ for $i < k$ and integer $j$ (prefix truncation)

• $X_i^{[j]}$ for $i < k$ and integer $j$ (suffix truncation)

Example of Collage System

D :

X1 = a ;

X2 = b ;

X3 = X1・X2 ;

X4 = X2・X1 ;

X5 = ( X3 )^3 ; (3 times repetition)

X6 = [3]X5 ; (prefix truncation)

X7 = X6・X4 ;

S : X3 , X6 , X4 , X7

Original text: abbabbababba

[Figure: derivation tree of X7]

height(X7) = 4

height(D) = 4
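The example above can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the tuple encoding of the assignments and the names `expand` and `decode` are assumptions made for illustration. Prefix truncation `[j]X` is assumed to drop the first j characters, which is the reading that reproduces the text shown on the slide.

```python
# Hypothetical encoding of a collage system <D, S> (for illustration only).
# Each assignment is a tuple:
#   ("prim", a)       primitive assignment, a is a character or ""
#   ("cat", i, j)     concatenation Xi . Xj
#   ("rep", i, j)     (Xi)^j, j times repetition
#   ("ptrunc", i, j)  [j]Xi, assumed to drop the first j characters
#   ("strunc", i, j)  Xi[j], assumed to drop the last j characters

def expand(D):
    """Return a dict mapping each variable index k to the string X_k."""
    val = {}
    for k in sorted(D):  # every expression refers only to smaller indices
        op = D[k]
        if op[0] == "prim":
            val[k] = op[1]
        elif op[0] == "cat":
            val[k] = val[op[1]] + val[op[2]]
        elif op[0] == "rep":
            val[k] = val[op[1]] * op[2]
        elif op[0] == "ptrunc":
            val[k] = val[op[1]][op[2]:]
        elif op[0] == "strunc":
            s = val[op[1]]
            val[k] = s[:len(s) - op[2]]
        else:
            raise ValueError(f"unknown expression {op!r}")
    return val

def decode(D, S):
    """Concatenate the expansions of the variables listed in S."""
    val = expand(D)
    return "".join(val[i] for i in S)

# The collage system from the slide.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),
    4: ("cat", 2, 1),
    5: ("rep", 3, 3),
    6: ("ptrunc", 5, 3),
    7: ("cat", 6, 4),
}
S = [3, 6, 4, 7]
text = decode(D, S)
print(text)
```

Expanding every variable defeats the purpose of compressed matching; the sketch is only meant to make the semantics of the five kinds of expression concrete.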

Pattern Matching Algorithm on a Collage System

Compressed pattern matching on a collage system

m : pattern length

r : number of pattern occurrences

$\|D\|$ : number of assignments in D

$|S|$ : number of variables in S

Theorem [Kida et al. 1999]. The problem of compressed pattern matching can be solved in $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m^2 + r)$ time using $O(\|D\| + m^2)$ space. If D contains no truncation, it can be solved in $O(\|D\| + |S| + m^2 + r)$ time.

Basic Idea

Pattern π = a b a b b

[Figure: KMP automaton for π with its goto function and failure function; the initial state is 0.]

original text: abababba

state sequence: 1 2 3 4 3 4 5 1

S : $X_{i_1}\ X_{i_2}\ X_{i_3}\ X_{i_4}$
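The state sequence on this slide can be reproduced with a small Python sketch. It is an illustration under assumptions, not the authors' code: the slide draws the automaton with goto and failure functions, while the sketch below builds the equivalent full transition table, and the name `kmp_automaton` is hypothetical.

```python
def kmp_automaton(pattern, alphabet):
    """Build the transition table delta[state][char] of the KMP automaton."""
    m = len(pattern)
    delta = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    x = 0  # state reached by the longest proper border of the current prefix
    for q in range(1, m + 1):
        for c in alphabet:
            delta[q][c] = delta[x][c]       # on a mismatch, act like the border state
        if q < m:
            delta[q][pattern[q]] = q + 1    # goto (match) transition
            x = delta[x][pattern[q]]
    return delta

def state_sequence(pattern, text, alphabet):
    """Return the list of states visited while reading text from state 0."""
    delta = kmp_automaton(pattern, alphabet)
    q, states = 0, []
    for c in text:
        q = delta[q][c]
        states.append(q)
    return states

states = state_sequence("ababb", "abababba", "ab")
print(states)
```

State 5 marks an occurrence of the pattern ending at that text position. The compressed algorithm aims to obtain the same final state and the same occurrences while reading one variable of S at a time instead of one character at a time.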

Jump and Output

The function $Jump(j, u) = \delta_{KMP}(j, u)$

• The domain is $Q \times D$.

• It simulates the sequence of state transitions for u.

• Reply in $O(1)$ time.

The set $Output(j, u) = \{\, 1 \le i \le |u| \mid P \text{ is a suffix of } P[1:j] \cdot u[1:i] \,\}$

• This set contains the pattern occurrences.

• Reply in $O(l)$ time, where $l$ is the size of the set Output.

Realization of Jump and Output

For $Jump(q, X_k)$, if $X_k$ is $X_i \cdot X_j$: if the factor concatenation problem for a string of length m can be solved in $O(1)$ time, then $Jump(q, X_k)$ can be computed in $O(1)$ time.

For $Output(q, X_k)$, if $X_k$ is $X_i \cdot X_j$: the set can be enumerated in $O(l)$ time from the Output of $X_i$ and $X_j$, where $l$ is the size of the set Output.

Factor Concatenation Problem

Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.

Question: Is the string xy a factor of P? If ‘yes’, then return its node number.

Example: P = COPACABANA. Concatenating x = OPA and y = CABAN gives OPACABAN = P[2:9], so the answer is ‘yes’.

Solution to the problem

• Using a suffix trie, it can be solved in $O(m)$ time after preprocessing of $O(m^2)$ time and space.

• Using a two-dimensional lookup table, it can be solved in $O(1)$ time, but we need $O(m^4)$ time and space preprocessing.

It can be solved in $O(1)$ time after $O(m^2)$ space and time preprocessing.
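To make the problem statement concrete, here is a naive Python sketch of the lookup-table idea. It is an assumption-laden illustration: factors are represented by plain strings rather than suffix trie nodes, the answer is a 1-based (start, end) position rather than a node number, and the name `build_concat_table` is hypothetical. It does not implement the $O(m^2)$ preprocessing mentioned above.

```python
def build_concat_table(P):
    """Map every pair (x, y) of non-empty factors whose concatenation xy is
    also a factor of P to the 1-based (start, end) of its first occurrence."""
    table = {}
    m = len(P)
    for i in range(m):                 # start of the occurrence of xy
        for j in range(i + 2, m + 1):  # end (exclusive), so |xy| >= 2
            for k in range(i + 1, j):  # split point between x and y
                table.setdefault((P[i:k], P[k:j]), (i + 1, j))
    return table

table = build_concat_table("COPACABANA")
answer = table.get(("OPA", "CABAN"))   # a dictionary lookup per query
print(answer)
```

A pair that is missing from the table, such as ("ANA", "C"), corresponds to the answer ‘no’.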

Outline of Our Algorithm

Input. Pattern P and collage system $\langle D, S \rangle$ ( $S = X_{i_1}, X_{i_2}, \ldots, X_{i_n}$ )

Output. All occurrences of the pattern.

/* preprocessing of D and P */
preprocess(D); preprocess(P);

l := 0; q := 0;
for j := 1 to n do begin
    for each d ∈ Output(q, X_{i_j}) do
        report ‘pattern occurs at position l+d’;
    q := Jump(q, X_{i_j});    /* state transition */
    l := l + |X_{i_j}|;       /* calculation of the offset */
end
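The outline above can be sketched in Python for collage systems without repetition or truncation. This is a simplified illustration, not the algorithm of the theorem: Jump and Output are tabulated naively by running the KMP automaton over each expanded phrase from every state, which costs far more than the stated preprocessing bound. The function names and the encoding of D (a character for a primitive assignment, a pair of indices for a concatenation) are assumptions.

```python
def kmp_automaton(pattern, alphabet):
    """Transition table delta[state][char] of the KMP automaton."""
    m = len(pattern)
    delta = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    x = 0
    for q in range(1, m + 1):
        for c in alphabet:
            delta[q][c] = delta[x][c]
        if q < m:
            delta[q][pattern[q]] = q + 1
            x = delta[x][pattern[q]]
    return delta

def search_collage(pattern, D, S, alphabet):
    """Return the 1-based end positions l+d of all pattern occurrences."""
    m = len(pattern)
    delta = kmp_automaton(pattern, alphabet)          # preprocess(P)
    phrase = {}                                       # preprocess(D), naive
    for k in sorted(D):
        e = D[k]
        phrase[k] = e if isinstance(e, str) else phrase[e[0]] + phrase[e[1]]
    jump, output = {}, {}
    for k, u in phrase.items():
        for q in range(m + 1):
            state, hits = q, []
            for i, c in enumerate(u, start=1):
                state = delta[state][c]
                if state == m:
                    hits.append(i)                    # occurrence ends at u[i]
            jump[q, k], output[q, k] = state, hits
    occurrences = []
    l, q = 0, 0
    for k in S:                                       # one step per variable
        occurrences.extend(l + d for d in output[q, k])
        q = jump[q, k]                                # state transition
        l += len(phrase[k])                           # calculation of the offset
    return occurrences

# A system of the BPE kind: primitives plus concatenations only.
D = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F",
     7: (1, 2), 8: (4, 5), 9: (7, 3)}
S = [7, 9, 8, 2, 8, 6, 7, 8, 9]   # encodes ABABCDEBDEFABDEABC
found = search_collage("BDE", D, S, "ABCDEF")
print(found)
```

The main loop touches each variable of S once, which is where the speed-up over scanning the original text comes from.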

Compressed pattern matching on a collage system

No truncation (LZ78, LZW, BPE, Run-length, etc.): $O(\|D\| + |S| + m^2 + r)$ time

Truncation (LZ77, LZSS, etc.): $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m^2 + r)$ time; not suitable for speeding up pattern matching

Byte Pair Encoding

original encoding algorithm and modified algorithm

Byte Pair Encoding

Text: T = ABABCDEBDEFABDEABC

Used characters: A B C D E F

AB→G : GGCDEBDEFGDEGC

DE→H : GGCHBHFGHGC

GC→I : GIHBHFGHI

Pair Table

Code  Pair
G     AB
H     DE
I     GC
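The replacement steps of this example can be reproduced with a short Python sketch of the basic BPE loop. It is a minimal illustration under assumptions: the function name `bpe_compress` is hypothetical, symbols are single characters, the candidate code symbols are passed in explicitly, and pairs are counted with a simple sliding window (which would overcount runs such as AAA; the slide example has none).

```python
from collections import Counter

def bpe_compress(text, alphabet):
    """Repeatedly replace the most frequent adjacent pair by an unused symbol.

    alphabet lists every symbol that may appear, in the order in which
    unused symbols are handed out as new codes.
    """
    unused = [c for c in alphabet if c not in text]
    table = {}                                   # code -> pair it stands for
    while unused:
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:                             # nothing left worth replacing
            break
        code = unused.pop(0)
        table[code] = pair
        text = text.replace(pair, code)          # left-to-right, non-overlapping
    return text, table

compressed, table = bpe_compress("ABABCDEBDEFABDEABC", "ABCDEFGHI")
print(compressed, table)
```

Each iteration rescans the whole text, which is the cost that the speed-up described later in the talk addresses.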

Byte Pair Encoding “collage system”

Text: T = ABABCDEBDEFABDEABC

AB→G : GGCDEBDEFGDEGC

DE→H : GGCHBHFGHGC

GC→I : GIHBHFGHI

D :

X1 = A ; X2 = B ; X3 = C ; X4 = D ; X5 = E ; X6 = F ;

X7 = X1・X2 ;

X8 = X4・X5 ;

X9 = X7・X3 ;

S : X7 , X9 , X8 , X2 , X8 , X6 , X7 , X8 , X9
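The correspondence between a BPE pair table and a collage system can be written down directly. The Python sketch below is an illustration with assumed names (`bpe_to_collage`) and an assumed encoding: a primitive assignment is stored as its character and a concatenation as a pair of variable indices. It relies on the pair table listing codes in the order in which they were created.

```python
def bpe_to_collage(compressed, table, alphabet):
    """Turn a BPE result into a collage system <D, S>.

    alphabet: the symbols of the original text, in order (X1, X2, ...).
    table:    code -> two-character pair, in creation order.
    """
    index = {}   # symbol -> variable number
    D = {}
    for a in alphabet:
        index[a] = len(index) + 1
        D[index[a]] = a                                   # primitive assignment
    for code, pair in table.items():
        index[code] = len(index) + 1
        D[index[code]] = (index[pair[0]], index[pair[1]])  # concatenation
    S = [index[c] for c in compressed]
    return D, S

D, S = bpe_to_collage("GIHBHFGHI",
                      {"G": "AB", "H": "DE", "I": "GC"},
                      "ABCDEF")
print(D)
print(S)
```

Because only primitive assignments and concatenations appear, a BPE dictionary falls into the truncation-free case of the theorem.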

Speeding up of compression

Time complexity of BPE: $O(uN)$

u : the number of character codes, N : text length

Using a doubly-linked list: $O(u + N)$ time

Speed-up of compression

We apply the BPE algorithm to the first block of the original text.

[Figure: the dictionary built from the first block (X1 = A, X2 = C, X3 = X2・X1, ..., X255 = X247・X8, X256 = X125・X48) feeds the pattern matching machine for multiple replacement [Arikawa et al. 1984], which outputs the BPE compressed text.]

Comparison of Compression Ratio and Time

Compression ratio (%)

                       BPE (original)  BPE (modified)  Compress  Gzip
Brown corpus (6.8Mb)   51.0            59.0            43.7      39.0
Medline (60.3Mb)       56.2            59.0            42.3      33.3
Genbank (17.1Mb)       30.8            32.5            26.8      23.1

Compression time (sec)

                       BPE (original)  BPE (modified)  Compress  Gzip
Brown corpus           196.9           8.0             12.7      37.7
Medline                1699.9          60.7            73.3      242.2
Genbank                440.6           16.5            19.3      100.9

The compression ratios of BPE are worse than those of “Compress” and “Gzip”.

Compression is drastically accelerated by our modification.

Compressed pattern matching on BPE compressed text

Problem of compressed pattern matching on BPE compressed text can be solved in

$\|D\| \le 256$

-The dictionary D is encoded separately from the sequence S.

-The size of D is small enough.

-The variables of S are encoded using a fixed length code.

Experimental result

[Figure: two plots over pattern length (5 to 30) comparing KMP, Agrep, and our algorithm. Panels: Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]

Ultra ...

Medline data: a clinically-oriented subset of Medline

Genbank data: a data set from GenBank

Concluding Remarks

Conclusion and Future Works

Conclusion

We introduced compressed pattern matching from practical viewpoints.

We observed that the search time of our algorithm is reduced at the same rate as the compression ratio, compared with the uncompressed case.

We also observed that it is occasionally faster than Agrep.

Future Works

• Can we reduce the complexity of the preprocessing from $O(m^2)$ to $O(m)$?

• To develop a sublinear algorithm on BPE compressed texts.

• To develop an approximate pattern matching algorithm on a collage system.

• To develop a new compression which is suitable for compressed pattern matching.

More recent work

A Boyer-Moore type algorithm for compressed pattern matching [CPM2000]

We proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts.

Does text compression speed up such a sublinear time algorithm?

More recent work

[Figure: two plots over pattern length (5 to 30) comparing KMP, Agrep, and the most recent work. Panels: Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]
