Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Page 1: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Hokkaido University (北海道大学)

Lecture on Information Knowledge Network, 2011/11/29

"Information retrieval and pattern matching"

Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University

Takuya KIDA

Page 2: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

The 8th: Misc. topics of pattern matching

Method for multi-byte code texts

Toward an intelligent pattern matching:
1. Pattern matching for XML data
2. Pattern matching on texts with arc annotation
3. Pattern matching with taxonomy data

Appendix: Randomized algorithm

Page 3: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Method for multi-byte code texts (Japanese texts)

Synchronization problem of codewords:
– False detections occur when we do pattern matching on a Japanese text byte by byte (in units of ASCII characters).
– It is necessary to determine the boundaries of characters, just as for Huffman codes.

Text T = TFT液晶の時代 (Japanese EUC encoded text)
As a sequence of bytes: 54 46 54 B1D5 BEBD A4CE BBFE C2E5
Pattern P = 修了

AC machine for the pattern P = "BD A4 CE BB" (修了):
[Figure: states 0 to 4 connected by the transitions BD, A4, CE, BB; state 0 loops on ∑ − {BD}; state 4 outputs 修了]

The bytes BD A4 CE BB occur across the boundaries of 晶 (BEBD), の (A4CE), and 時 (BBFE), so this machine falsely reports 修了 in T.
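The false detection can be reproduced with a plain byte search. The following is a minimal sketch using the byte values from the slide; the function name `naive_byte_search` is an assumption made for illustration, and Python's built-in `euc_jp` codec is used only to show the decoded text.

```python
def naive_byte_search(text: bytes, pattern: bytes) -> list[int]:
    """Return every byte offset where pattern occurs, ignoring character boundaries."""
    hits = []
    pos = text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


# Byte sequence of the text T as printed on the slide (EUC encoded).
TEXT = bytes.fromhex("544654B1D5BEBDA4CEBBFEC2E5")
# Byte sequence of the pattern P = 修了 as printed on the slide.
PATTERN = bytes.fromhex("BDA4CEBB")

hits = naive_byte_search(TEXT, PATTERN)
decoded = TEXT.decode("euc_jp")
# The byte search reports a hit although the decoded text does not contain 修了.
print(hits, "修了" in decoded)
```

The reported offset points at the second byte of 晶, which is not a character boundary.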

Page 4: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Review: Solution by automaton with synchronization

[Figure: Huffman tree with leaves A, B, C, D, E and edges labeled 0 and 1]

Pattern P = DEC
Huffman encoded pattern E(P) = 011001

Text T = ABECA ・・・
Huffman encoded text E(T) = 0000000110010000 ・・・

[Figure: an ordinary KMP automaton for E(P), and a KMP automaton with synchronization in which the Huffman tree is attached so that matching restarts only at codeword boundaries]

M. Miyaaki, S. Fukamachi, M. Takeda: Speeding up the pattern matching machine for compressed texts (in Japanese), Trans. IPSJ, Vol. 39, No. 9, pp.2638-2648, 1998.

Page 5: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

PM on multi-byte code texts by an automaton with synchronization

M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp.170-186, 2002.

Text T = TFT液晶の時代 (Japanese EUC encoded text)
As a sequence of bytes: 54 46 54 B1D5 BEBD A4CE BBFE C2E5
Pattern P = 修了

An AC machine with synchronization, which correctly detects the (EUC encoded) pattern P = "修了":
[Figure: states 0 to 4 connected by the transitions BD, A4, CE, BB, with state 4 outputting 修了, plus a part for synchronization made of states z and g. Edge labels in the synchronization part: [00-8D, 90-9F]; [8E, A0-FF] ∖ [BD]; [A0-FF]; [8F][A0-FF]]

Code automaton accepting any EUC code:
[Figure: states 0, z, and g. Edge labels: [00-8D, 90-9F] {half-width char.}; [8E, A0-FF] followed by [A0-FF] {full-width char.}; [8F] followed by [A0-FF] [A0-FF]]
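The code automaton can be sketched as a three-state transition function. This is a simplified illustration, not the AC machine of Takeda et al.: the exact wiring of states 0, z, and g is an assumption reconstructed from the edge labels above, the input is assumed to be valid EUC, and the pattern is located with an ordinary byte search whose hits are then filtered by character boundaries.

```python
def code_automaton_step(state: str, byte: int) -> str:
    """One transition of the (assumed) EUC code automaton with states '0', 'z', 'g'."""
    if state == "0":
        if byte == 0x8F:                    # lead byte of a three-byte code
            return "g"
        if byte == 0x8E or byte >= 0xA0:    # lead byte of a two-byte code
            return "z"
        return "0"                          # single-byte code in 00-8D or 90-9F
    if state == "g":                        # second byte of a three-byte code
        return "z"
    return "0"                              # state z: last byte of a code


def char_boundaries(data: bytes) -> list[int]:
    """Offsets at which the automaton is in state 0, i.e. where a character starts."""
    state, starts = "0", []
    for i, byte in enumerate(data):
        if state == "0":
            starts.append(i)
        state = code_automaton_step(state, byte)
    return starts


def synchronized_search(text: bytes, pattern: bytes) -> list[int]:
    """Keep only those byte-level hits that begin on a character boundary."""
    starts = set(char_boundaries(text))
    hits, pos = [], text.find(pattern)
    while pos != -1:
        if pos in starts:
            hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


TEXT = bytes.fromhex("544654B1D5BEBDA4CEBBFEC2E5")   # T from the slide
PATTERN = bytes.fromhex("BDA4CEBB")                  # 修了
print(synchronized_search(TEXT, PATTERN))            # no hit on a boundary
```

The machine on the slide does both jobs in a single pass; the two-pass version here only makes the role of the synchronization part visible.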

Page 6: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Idea of bit-parallel technique

Text T: abababba
Pattern P: ababb

Mask table M (bit positions 1 to 5 of the pattern, written left to right):
M(a) = 10100
M(b) = 01011

Ri = (Ri-1<<1 | 1) & M(T[i])

This can be calculated in O(1) time.

State Ri after reading T[i]:
R0 = 00000
R1 = 10000 (a)
R2 = 01000 (b)
R3 = 10100 (a)
R4 = 01010 (b)
R5 = 10100 (a)
R6 = 01010 (b)
R7 = 00001 (b), an occurrence of P ends here
R8 = 10000 (a)

One step in detail, reading T[3] = a with R2 = 01000: (R2<<1 | 1) = 10100, and 10100 & M(a) = 10100 & 10100 = 10100.

※ Only the correctly transferred bits are kept by taking the AND with the mask bits M.
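The recurrence can be turned into a few lines of code. The sketch below is one straightforward Shift-And implementation; the function name `shift_and` is an assumption, and bit j−1 of the integer state stands for pattern position j (so the leftmost bit on the slide is the least significant bit here).

```python
def shift_and(text: str, pattern: str) -> list[int]:
    """Return the 1-based text positions at which an occurrence of pattern ends."""
    masks: dict[str, int] = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)    # M(ch) has bit j set iff pattern[j] == ch
    accept = 1 << (len(pattern) - 1)               # bit of the last pattern position
    state, ends = 0, []
    for i, ch in enumerate(text, start=1):
        # R_i = (R_{i-1} << 1 | 1) & M(T[i])
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(i)
    return ends


print(shift_and("abababba", "ababb"))
```

For the example on the slide the only occurrence ends at position 7, matching R7 = 00001.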

Page 7: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Bit-parallel method for multi-byte code texts

Basic idea:
– We construct a pattern matching machine (code automaton) that can determine the boundaries of codewords and recognize each multi-byte character in the input pattern.
– The code automaton runs while reading the text byte by byte, and it outputs the mask bit sequence corresponding to each character in the input pattern.
– We simulate an arbitrary bit-parallel algorithm by using the output M(T[i]) of the code automaton instead of reading T[i].

A code automaton that can determine the boundaries of EUC codes and recognize "修" and "了":
[Figure: from state 0, BD A4 leads through state 1 to state 2 and outputs M[修]=01; CE BB leads through state 3 to state 4 and outputs M[了]=10. States z and g handle all other codes. Edge labels: [00-8D, 90-9F]; [8E, A0-FF] ∖ [BD, CE]; [A0-FF]; [8F][A0-FF]]

The output of the code automaton is fed to an arbitrary bit-parallel algorithm, for example
Ri = (Ri-1<<1 | 1) & M(T[i])

Heikki Hyyrö, Jun Takaba, Ayumi Shinohara, and Masayuki Takeda: On Bit-Parallel Processing of Multi-byte Strings, Proc. of Asia Information Retrieval Symposium, pp.190-196, 2004.
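The combination can be sketched as follows. This is an illustration under simplifying assumptions, not the construction of Hyyrö et al.: instead of a compiled automaton, a small generator `euc_chars` (a hypothetical helper) splits valid EUC input into characters using the byte classes from the figure, and a dictionary lookup plays the role of the mask output.

```python
def euc_chars(data: bytes):
    """Yield (byte offset, character bytes) for each EUC character in data."""
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x8F:                      # three-byte code
            n = 3
        elif b == 0x8E or b >= 0xA0:       # two-byte code
            n = 2
        else:                              # single-byte code
            n = 1
        yield i, data[i:i + n]
        i += n


def shift_and_euc(text: bytes, pattern: bytes) -> list[int]:
    """Shift-And over characters; returns byte offsets where an occurrence starts."""
    pat_chars = [ch for _, ch in euc_chars(pattern)]
    if not pat_chars:
        return []
    masks: dict[bytes, int] = {}
    for j, ch in enumerate(pat_chars):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (len(pat_chars) - 1)
    state, offsets, hits = 0, [], []
    for off, ch in euc_chars(text):
        offsets.append(off)
        state = ((state << 1) | 1) & masks.get(ch, 0)   # one update per character
        if state & accept:
            hits.append(offsets[-len(pat_chars)])
    return hits


TEXT = bytes.fromhex("544654B1D5BEBDA4CEBBFEC2E5")   # T from the slide
PATTERN = bytes.fromhex("BDA4CEBB")                  # 修了
print(shift_and_euc(TEXT, PATTERN))
```

The state register is updated once per character rather than once per byte, so the pattern length in bits equals the number of characters, here 2.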

Page 8: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Toward an intelligent pattern matching

Until now …
– Text = just a sequence of characters (we have ignored the background knowledge about the text and the meaning of sentences)
– Fast! Fast! Fast!

From now on …
– Text = a sequence of sentences that have meanings and/or structures
– We need an intelligent pattern matching (of course, at high speed!)

Pattern matching in consideration of the structure of the text
– Pattern matching for XML texts
– Pattern matching for texts with arc-annotation
– etc.

Pattern matching in consideration of the meaning of the text (cooperating with ontology data)
– Pattern matching in consideration of taxonomic information
– Thesaurus, inductive rules, etc.

Page 9: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Pattern matching for XML texts: previous approaches

[Figure: XML documents are read by an XML parser and loaded into memory, where the application program accesses them through the DOM API. Alternatively, the documents are stored in an RDB and queried with SQL.]

Example of the RDB representation of person / name / (first, last), one row per path:
person: ""
person/name: ""
person/name/first: Makiko
person/name/last: Tanaka

Page 10: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Pattern matching for XML texts: our approach

[Figure: the XML documents are scanned as they are by a pattern matching algorithm, and the result is passed to the application program in memory]

<person> <name> <first> Makiko </first> <last> Tanaka </last> </name></person>

M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp.170-186, 2002.

Page 11: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Advantage of pattern matching approach

– It can process a huge XML document, or a large number of documents, in one batch.
– It can handle many queries at once.
– Fast processing
– Small memory space
– Various applications

[Figure: XML documents and their tree structure]

Page 12: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Problem in a simple pattern matching algorithm

It may match part of a tag name.

Pattern Π = {other, <mother>}

<body> <h1>That TVCM</h1> <p> <mother> “mother” </mother> If we remove m, it becomes <other> “other” </other></p></body>

Wrong detection: is the occurrence inside or outside of a tag?
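The wrong detections are easy to count. The sketch below is only an illustration: the helper names are assumptions, the quotation marks of the sample are written as plain ASCII quotes, and "inside a tag" is decided by comparing the positions of the nearest preceding '<' and '>'.

```python
DOC = ('<body> <h1>That TVCM</h1> <p> <mother> "mother" </mother> '
       'If we remove m, it becomes <other> "other" </other></p></body>')


def naive_occurrences(text: str, keyword: str) -> list[int]:
    """All start offsets of keyword, with no knowledge of XML tags."""
    hits, pos = [], text.find(keyword)
    while pos != -1:
        hits.append(pos)
        pos = text.find(keyword, pos + 1)
    return hits


def inside_tag(text: str, pos: int) -> bool:
    """True if the nearest '<' before pos has not been closed by a '>' yet."""
    return text.rfind("<", 0, pos) > text.rfind(">", 0, pos)


hits = naive_occurrences(DOC, "other")
wrong = [p for p in hits if inside_tag(DOC, p)]
print(len(hits), len(wrong))
```

Four of the six reported occurrences of "other" lie inside tag names, which is exactly the kind of detection the slide calls wrong.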

Page 13: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

A solution

[Figure: an ordinary AC machine for the patterns "other" and "<mother>" (states 0 to 14), shown next to an AC machine in consideration of XML tags (states 0 to 15). The second machine has additional transitions labeled '<', '>', and "other than '<'", so that it can tell whether it is reading inside or outside of a tag.]

Page 14: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Handling of attributes

[Figure: the AC machine in consideration of XML tags, extended with a state 16 and transitions labeled '>' and "other than '>'" that skip over the attribute part of a tag]

<mother>, <mother nature=“tender”>, <mother nature=“hard”>, ・・・ are all treated as the same tag <mother>.
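The behavior of the tag-aware machine can be imitated with a simple scanner that alternates between "inside a tag" and "outside a tag". This is a minimal sketch and not the AC machine from the slides: the function name `scan_xml` is an assumption, keywords are searched only in text between tags, and a tag pattern is compared with the tag name alone so that attributes are ignored.

```python
def scan_xml(text: str, keywords: set[str], tags: set[str]) -> list[tuple[int, str]]:
    """Report (offset, pattern) for keywords in text content and for tag patterns."""
    hits = []
    i = 0
    while i < len(text):
        if text[i] == "<":                       # inside a tag: read up to '>'
            end = text.find(">", i)
            if end == -1:
                break
            words = text[i + 1:end].split()
            name = words[0] if words else ""     # tag name without attributes
            if "<" + name + ">" in tags:
                hits.append((i, "<" + name + ">"))
            i = end + 1
        else:                                    # outside a tag: read up to next '<'
            nxt = text.find("<", i)
            if nxt == -1:
                nxt = len(text)
            for kw in keywords:
                pos = text.find(kw, i, nxt)
                while pos != -1:
                    hits.append((pos, kw))
                    pos = text.find(kw, pos + 1, nxt)
            i = nxt
    return sorted(hits)


DOC = ('<body> <h1>That TVCM</h1> <p> <mother> "mother" </mother> '
       'If we remove m, it becomes <other> "other" </other></p></body>')
print([label for _, label in scan_xml(DOC, {"other"}, {"<mother>"})])
```

With the sample document, the scanner reports the tag <mother> once and the keyword "other" only for the two occurrences in the text content.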

Page 15: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Pattern matching in consideration of XML path

I want to look for the persons whose family name is "Tanaka" (in XPath terms, the element //person/name/last is equal to "Tanaka").

Tag patterns = {<person>, </person>, <name>, </name>, <last>, </last>, …}
Keyword patterns = {Tanaka}

[Figure: the AC machine in consideration of XML tags, combined with a path automaton with states 0 to 3. The path automaton moves 0 → 1 on <person>, 1 → 2 on <name>, and 2 → 3 on <last>; state 0 loops on tags other than <person>. A stack holds pairs of a tag and a path state, for example (<xml>,0), (<person>,0), (<name>,1), (<last>,2).]
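The stack idea can be sketched in a few lines. This is a simplified illustration, not the automaton of the slide: it keeps a stack of open tag names instead of (tag, state) pairs, tokenizes with a regular expression, and assumes simple well-formed XML; the function name `find_by_path` and its parameters are assumptions.

```python
import re

# Either a tag (optional '/', a name, optional attributes) or a run of text.
TOKEN = re.compile(r"<(/?)([^\s<>/]+)[^<>]*>|([^<]+)")


def find_by_path(xml: str, path: list[str], keyword: str) -> list[int]:
    """Offsets of keyword inside elements whose open-tag stack ends with path."""
    stack: list[str] = []
    hits = []
    for m in TOKEN.finditer(xml):
        closing, name, text = m.group(1), m.group(2), m.group(3)
        if text is not None:
            # Text content: report the keyword only when the path matches.
            if stack[-len(path):] == path and keyword in text:
                hits.append(m.start() + text.index(keyword))
        elif name[0] in "?!" or m.group(0).endswith("/>"):
            continue                      # declarations, comments, empty elements
        elif closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
    return hits


XML = "<person> <name> <first> Makiko </first> <last> Tanaka </last> </name></person>"
print(find_by_path(XML, ["person", "name", "last"], "Tanaka"))
```

A "Tanaka" that appears under <first> instead of <last> is not reported, because the stack then ends with person, name, first.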

Page 16: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Processible subset of XPath

Limitation of pattern matching approach
– We cannot specify the predecessor nodes
– Complex filter specifications remarkably decrease the processing speed

LocationPath ::= '/' RelativeLocationPath
RelativeLocationPath ::= Step | RelativeLocationPath '/' Step
Step ::= AxisSpecifier NodeTest
AxisSpecifier ::= AxisName '::'
AxisName ::= 'attribute' | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'self' | 'namespace'
NodeTest ::= QName | NodeType '(' ')'
NodeType ::= 'node' | 'text' | 'comment' | 'processing-instruction'

Example: the abbreviated path
//cars/car/@*
is written in this subset as
/descendant::cars/child::car/attribute::node()

Page 17: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Speed comparison with Sgrep

Comparison with Sgrep (J. Jaakkola and P. Kilpeläinen)

Text: 110MB (English text)
CPU: Celeron 366MHz
Memory: 128MB
OS: Kondara/MNU Linux 2.1 RC2

CPU time (sec.):
Pattern                                               Sgrep   Takeda et al. [2002]
//text/"summers"                                      38.44   12.40
//test//"summers"                                     37.02   12.30
/site/regions/africa/item/location/"United_States"    51.85   12.23

Page 18: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Pattern matching for texts with arc-annotation

Definition: The arc annotation A that accompanies a sequence S is a set of pairs of integers from {1, 2, …, |S|}.

Each element (iL, iR) ∈ A is called an arc.
– S[iL] and S[iR] are called the left endpoint and the right endpoint, respectively.
– For an arbitrary arc, we assume that iL < iR.
– Moreover, no two arcs share the same integer; that is, no two arcs share an endpoint.

An example of a text with arc-annotation (the arcs are drawn in the figure):

 A  G  T  C  A  C  G  C  C  C  G  T
 1  2  3  4  5  6  7  8  9 10 11 12
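The conditions of the definition can be checked mechanically. The sketch below represents an arc as a pair (iL, iR) of 1-based positions; the function name and the example arcs are hypothetical, since the arcs of the slide's figure are not recoverable from the text.

```python
def is_valid_arc_annotation(n: int, arcs: list[tuple[int, int]]) -> bool:
    """Check the definition for a sequence of length n: 1 <= iL < iR <= n, no shared endpoints."""
    used: set[int] = set()
    for left, right in arcs:
        if not (1 <= left < right <= n):
            return False
        if left in used or right in used:   # two arcs share an endpoint
            return False
        used.update((left, right))
    return True


S = "AGTCACGCCCGT"
# Hypothetical arcs, chosen only to exercise the check.
print(is_valid_arc_annotation(len(S), [(2, 11), (3, 8)]))
print(is_valid_arc_annotation(len(S), [(2, 11), (11, 12)]))
```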

Page 19: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Example of text with arc annotation

An example of the two-dimensional structure of a tRNA (tRNAPhe)

[Figure: two-dimensional structure of tRNAPhe]

・・・ACACCUAGCΨTGUGU ・・・

The string having nested arcs

Page 20: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Arc-preserving subsequence (APS) problem

The APS problem is to answer whether the following conditions are satisfied, when a text S1 = S1[1 : n] and a pattern S2 = S2[1 : m] are given with arc annotations A1 and A2, respectively.

– S2 is a subsequence of S1.
– The embedding preserves arcs: two matched positions of the pattern are joined by an arc if and only if the corresponding positions of the text are joined by an arc.

Text:    S1 = A G T C A C G C C C G T
Pattern: S2 = A T G C T

[Figure: two embeddings of S2 into S1. In the first, both the bases and the arcs match (○ base match). In the second, the bases match but the arcs do not (× arc match).]
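For very small inputs the two conditions can be tested by brute force. The sketch below simply tries every choice of m text positions, so it takes exponential time and is not the Gramm-Guo-Niedermeier algorithm or its improvement; the function name and the arcs in the usage example are hypothetical.

```python
from itertools import combinations


def is_arc_preserving_subsequence(s1, arcs1, s2, arcs2) -> bool:
    """Brute-force APS test; arcs are (left, right) pairs of 1-based positions."""
    a1, a2 = set(arcs1), set(arcs2)
    m = len(s2)
    for pos in combinations(range(1, len(s1) + 1), m):
        # Condition 1: the chosen text positions spell out the pattern.
        if any(s1[p - 1] != s2[j] for j, p in enumerate(pos)):
            continue
        # Condition 2: an arc in the pattern iff an arc between the matched text positions.
        if all(((j + 1, k + 1) in a2) == ((pos[j], pos[k]) in a1)
               for j in range(m) for k in range(j + 1, m)):
            return True
    return False


S1, S2 = "AGTCACGCCCGT", "ATGCT"
print(is_arc_preserving_subsequence(S1, [(1, 12)], S2, [(1, 5)]))   # bases and arcs match
print(is_arc_preserving_subsequence(S1, [(1, 12)], S2, []))         # bases match, arcs do not
```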

Page 21: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

APS(TYPE1, TYPE2)

The difficulty of the APS problem depends on the structure of the arc annotations.

APS(TYPE1, TYPE2)
– TYPE1: arc structure of the text
– TYPE2: arc structure of the pattern

Example: APS(nested, chain)
– The arc structure of the text is "nested"
– The arc structure of the pattern is "chain"

Arc structures, from the loosest limitation (high difficulty) to the strictest limitation (low difficulty):
Crossing
Nested
Chain
Plain
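The four structure types can be told apart by looking at every pair of arcs. The sketch below uses the usual definitions, which the slide does not spell out and which are therefore stated here as an assumption: "plain" has no arcs, "chain" has neither nested nor crossing arcs, "nested" has no crossing arcs, and "crossing" allows everything as long as endpoints are not shared.

```python
def arc_structure(arcs: list[tuple[int, int]]) -> str:
    """Classify a valid arc annotation as plain, chain, nested, or crossing."""
    if not arcs:
        return "plain"
    ordered = sorted(arcs)
    has_nested = has_crossing = False
    for a in range(len(ordered)):
        l1, r1 = ordered[a]
        for b in range(a + 1, len(ordered)):
            l2, r2 = ordered[b]            # l1 < l2 because the arcs are sorted
            if l2 < r1 < r2:               # the second arc starts inside and ends outside
                has_crossing = True
            elif r2 < r1:                  # the second arc lies entirely inside the first
                has_nested = True
    if has_crossing:
        return "crossing"
    if has_nested:
        return "nested"
    return "chain"


print(arc_structure([(1, 4), (2, 3)]), arc_structure([(1, 3), (2, 4)]))
```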

Page 22: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Result of Kida[2005]

The previous work on the APS problem:
– J. Gramm, J. Guo, and R. Niedermeier. "Pattern matching for arc-annotated sequences." In Proc. 22nd FSTTCS, volume 2556 of LNCS, pages 182–193. Springer, 2002.
– APS(nested, nested) is solved in O(nm) time.

The result of Kida[2005]:
– Proposed an improved algorithm based on the Gramm-Guo-Niedermeier (GGN) algorithm. However, the worst case complexity is the same as that of GGN.
– Corrected an error in the GGN algorithm. The original GGN algorithm includes an error.
– Implemented both algorithms and ran experiments. The proposed algorithm runs 2 to 5 times faster than GGN.

Kida: Faster Pattern Matching Algorithm for Arc-Annotated Sequences, Proc. of Federation over the Web, LNAI (to appear)

Page 23: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Change to the text length n

[Figure: experimental results when the text length n is varied; |A1| = 20% of n, m = 20, |A2| = 4]

Page 24: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Change to the pattern length m

[Figure: experimental results when the pattern length m is varied; |A2| = 20% of m, n = 1000, |A1| = 100]

Page 25: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Take a breath

[Photo: King Penguin flying in water (2005.8.12 in Asahiyama Zoo)]

Summary to here
– Method for multi-byte code texts (Japanese texts)
  Embedding the code automaton into the AC machine for synchronization
  Combining the code automaton that outputs mask bit sequences with bit-parallel methods
– Pattern matching in consideration of the structure of the text
  Pattern matching for XML texts
  Pattern matching for arc-annotated texts

~ Trivia ~
How to compute min(x,y) without conditional branching when two integers x and y are represented as m-bit sequences:

S ← ((x | 10^m) − y) & 10^m
S ← S − (S ≫ m)
min(x,y) ← (~S & x) | (S & y)

Here 10^m denotes the bit sequence consisting of a 1 followed by m zeros. (However, we need m+1 bits for each.)
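The three steps translate directly into integer operations. In the sketch below, `branchless_min` is a name chosen for illustration; x and y are assumed to be non-negative and smaller than 2^m, and Python's unbounded integers take the place of the (m+1)-bit words.

```python
def branchless_min(x: int, y: int, m: int) -> int:
    """min(x, y) for 0 <= x, y < 2**m using only bit operations and subtraction."""
    high = 1 << m                   # the bit sequence 1 followed by m zeros
    s = ((x | high) - y) & high     # bit m survives iff x >= y
    s = s - (s >> m)                # m ones if x >= y, otherwise 0
    return (~s & x) | (s & y)       # select y if x >= y, otherwise x


print(branchless_min(5, 9, 4), branchless_min(9, 5, 4), branchless_min(7, 7, 4))
```

After the second step S is an all-ones mask exactly when x ≥ y, so the last line copies y in that case and x otherwise.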

Page 26: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Example of pattern matching in consideration of taxonomic information (PMTX)

[Figure: excerpt of the Gene Ontology. Terms shown under "cell": insoluble fraction, membrane fraction, vesicular fraction, microsome, cell surface, cell envelope, cell wall. Terms shown under "molecular function": catalytic activity, lyase activity, hyaluronate.]

Pattern P: (cell) (receptor) (for) (catalytic activity)

Text T:
Pub: 1: Cell. 1990 Jun 29;61(7):1303-13.
Title: CD44 is the principal cell surface receptor for hyaluronate.
Authors: Aruffo A, Stamenkovic I, Melnick M, Underhill CB, Seed B.

Page 27: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Result of Kida&Arimura[2004]

In general:
– O(m+mh/w) time for preprocessing
– O(m|∑|/w) space
– O(mn/w) time for scanning the text

It works well when m < w:
– O(m+h) time for preprocessing
– O(|∑|) space
– O(n) time for scanning the text

Notation:
– m: the length of pattern P∈∑*
– n: the length of text T∈∑*
– h: the size of taxonomic information H
– |∑|: the size of the set ∑ of concepts
– w: the length of a machine word (say, 32 or 64)

T. Kida and H. Arimura: Pattern Matching with Taxonomic Information, Proc. of Asia Information Retrieval Symposium (AIRS2004), pp. 265-268, Oct. 2004.

Page 28: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Taxonomic information and sorted alphabet

Sorted alphabet (∑, ≤)
– ∑: a finite alphabet (a set of concepts)
– ≤: a partial order relation

We assume that a pattern and a text are given as sequences of concepts: P∈∑* and T∈∑*.

An example of a DAG H representing (∑, ≤):
[Figure: DAG over the concepts A to G, drawn with G and E at the top, C, D, F in the middle, and A, B at the bottom]
※ This is also called a Hasse diagram.

Pattern P := A B E F
Text T := A B C B D F C B

Concept E corresponds to the character class [A,B,C,D,E].

Page 29: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Examples of sorted alphabet

(1) flat alphabet
A B C D … Z

(2) class of characters
?
[0-9] [a-z]
0 1 2 … 9 a … z

(3) letter-sets alphabet
(abc)
(ab) (ac) (bc)
(a) (b) (c)
φ

Page 30: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

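We can utilize the Shift-And method!

Text T: ababbbba
Pattern P: ab[ab]bb

Mask table M (bit positions 1 to 5 of the pattern, written left to right):
M(a) = 10100
M(b) = 01111
The difference is just here: position 3 is set in both masks because [ab] accepts both a and b.

Ri = (Ri-1<<1 | 1) & M(T[i])
This is the same as before.

State Ri after reading T[i]:
R0 = 00000
R1 = 10000 (a)
R2 = 01000 (b)
R3 = 10100 (a)
R4 = 01010 (b)
R5 = 00101 (b), an occurrence of P ends here
R6 = 00010 (b)
R7 = 00001 (b), an occurrence of P ends here
R8 = 10000 (a)

One step in detail, reading T[5] = b with R4 = 01010: (R4<<1 | 1) = 10101, and 10101 & M(b) = 10101 & 01111 = 00101.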
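Only the construction of the mask table changes when the pattern contains character classes. In the sketch below the pattern is given as a list of strings, each listing the characters accepted at that position; this representation and the function name `shift_and_classes` are assumptions made for illustration.

```python
def shift_and_classes(text: str, pattern: list[str]) -> list[int]:
    """Shift-And where pattern[j] is the set of characters accepted at position j+1."""
    masks: dict[str, int] = {}
    for j, accepted in enumerate(pattern):
        for ch in accepted:                       # a class sets bit j in several masks
            masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (len(pattern) - 1)
    state, ends = 0, []
    for i, ch in enumerate(text, start=1):
        state = ((state << 1) | 1) & masks.get(ch, 0)   # unchanged recurrence
        if state & accept:
            ends.append(i)
    return ends


# ab[ab]bb written position by position.
print(shift_and_classes("ababbbba", ["a", "b", "ab", "b", "b"]))
```

For the text ababbbba the occurrences end at positions 5 (ababb) and 7 (abbbb).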

Page 31: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

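Toward taxonomic information

Taxonomic information H:
[Figure: DAG over the concepts A to G, drawn with G and E at the top, C, D, F in the middle, and A, B at the bottom]

Pattern P := A B E F
Text T := A B C B D F C B

Mask table M’ as shown on the slide (rows: concept read from the text; columns labeled A to F):

    ABCDEF
A   100110
B   010111
C   001010
D   000110
E   000010
F   000001
G   000000

Does computing M’ take O(mh) time?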

Page 32: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Computation of M’(a)

Lemma 1
Let (∑, ≤) be a sorted alphabet. Given a pattern P∈∑*, for any a∈∑, it holds that

M’(a) = ∪_{x∈Upb(a)} M(x).

Lemma 2
Let (∑, ≤) be a sorted alphabet. Given a pattern P∈∑*, for any a∈∑, it holds that

M’(a) = M(a) ∪ ∪_{x∈Par(a)} M’(x).

Page 33: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Pseudo code for computing M’(a)

Preprocess_M’ (P = p1…pm)    /* Assume H is a global variable */
  initialize M(a) as follows: M(a) = {1 ≦ i ≦ m | P[i] = a}
  for each a ∈ ∑ do
    CalculateM’(a)
  end of for

Function CalculateM’(a)
  if M’(a) has been computed then return M’(a)
  else do
    M’(a) = M(a)
    for each x ∈ Par(a) do
      M’(a) = M’(a) ∪ CalculateM’(x)
    end of for
    return M’(a)

Cost: O(m) for initializing M; O(h) union operations, each taking O(m/w) time. Total: O(m+mh/w).
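The pseudo code maps onto a memoized recursion over the parents of each concept, followed by the usual Shift-And scan. The sketch below is an illustration under stated assumptions: the function names are hypothetical, bit sets are stored in Python integers, and the parent relation `PARENTS` is a guess at the DAG H read off the mask table M’ (the edges into G in particular cannot be confirmed from the slides).

```python
def preprocess_masks(pattern: str, parents: dict[str, list[str]]) -> dict[str, int]:
    """Compute M'(a) = M(a) | union of M'(x) over x in Par(a), with memoization."""
    base: dict[str, int] = {}
    for j, concept in enumerate(pattern):
        base[concept] = base.get(concept, 0) | (1 << j)     # M(a)
    memo: dict[str, int] = {}

    def calculate(a: str) -> int:
        if a not in memo:                                   # computed only once
            mask = base.get(a, 0)
            for x in parents.get(a, ()):
                mask |= calculate(x)
            memo[a] = mask
        return memo[a]

    for a in set(parents) | set(base):
        calculate(a)
    return memo


def pmtx_scan(text: str, pattern: str, parents: dict[str, list[str]]) -> list[int]:
    """1-based end positions of occurrences, where a text concept matches any ancestor."""
    masks = preprocess_masks(pattern, parents)
    accept = 1 << (len(pattern) - 1)
    state, ends = 0, []
    for i, concept in enumerate(text, start=1):
        state = ((state << 1) | 1) & masks.get(concept, 0)
        if state & accept:
            ends.append(i)
    return ends


# Assumed parent lists (child -> parents) for the DAG H.
PARENTS = {"A": ["D"], "B": ["D", "F"], "C": ["E"], "D": ["E"],
           "E": ["G"], "F": ["G"], "G": []}
print(pmtx_scan("ABCBDFCB", "ABEF", PARENTS))
```

Under these assumed edges, A B C B at the start of the text matches the pattern A B E F, because C lies below E and B lies below F.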

Page 34: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Overview of retrieval system with PMTX algorithm

[Figure: the text from the Text DB passes through a translator and enters the pattern matching machine, which is built from the pattern and the taxonomic information DB and outputs the occurrences]

We have to parse the text into a sequence of concepts:
– Replace automaton (Arikawa and Shiraishi[1984]) as the translator: O(h+n)
– Or a morphological parser for natural language texts, like ChaSen

Page 35: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

The 7th summary

Method for multi-byte code texts (Japanese texts)
– Embedding the code automaton into the AC machine for synchronization
– Combining the code automaton that outputs mask bit sequences with bit-parallel methods

Toward an intelligent pattern matching:
– Pattern matching in consideration of the structure of the text
  Pattern matching for XML texts
  Pattern matching for arc-annotated texts
– Pattern matching in consideration of the meanings of the text (in cooperation with ontology data)
  Pattern matching in consideration of taxonomic information

Prof. Arimura will take charge of this class from the next lecture
– Efficient data structures for information retrieval
– Data mining from the web, etc.

Page 36: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Karp-Rabin algorithm

It is a randomized algorithm using a hashing technique
– Matching a string by regarding it as an integer!

The worst case takes O(mn) time, but it takes O(n+m) time on average.

The extra space we need is only O(1).

KARP R.M., RABIN M.O., Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2):249-260, 1987.

∑ = { 0,1,2,…,9 }

Text: 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
Pattern: 3 1 4 1 5, and 31415 mod 13 = 7

Values mod 13 of the length-5 windows of the text, from left to right:
8 9 3 11 0 1 7 8 4 5 10 11 7 9 11

The window 31415 has the value 7 and is a real occurrence (Correct!). The window 67399 also has the value 7 but is not an occurrence (Wrong!).

Computing the next value from the previous one, for the windows 3 1 4 1 5 (value 7) and 1 4 1 5 2 (value 8): remove the highest-order digit of the previous step and append the lowest-order digit that is newly input.

14152 ≡ (31415 – 3×10000)×10 + 2 (mod 13)
      ≡ (7 – 3×3)×10 + 2 (mod 13)
      ≡ 8 (mod 13)

Page 37: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Pseudo code

Karp-Rabin (P, T, d, q)
  m ← length[P]
  n ← length[T]
  h ← d^(m–1) mod q
  p ← 0
  t_0 ← 0
  for i ← 1 to m do
    p ← (d・p + P[i]) mod q
    t_0 ← (d・t_0 + T[i]) mod q
  for s ← 0 to n – m do
    if p = t_s then
      if P[1…m] = T[s+1…s+m] then    /* check if the candidate is really an occurrence */
        report an occurrence at s
    if s < n – m then
      t_(s+1) ← (d・(t_s – T[s+1]・h) + T[s+m+1]) mod q

The update of t_(s+1) must be executed for every shift s, also when p = t_s; otherwise the hash value is not advanced after a candidate.
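The pseudo code can be written almost line by line in Python. The sketch below works on strings of decimal digits, as in the example with d = 10 and q = 13; the function name `karp_rabin` and the second return value (the number of hash hits that failed verification) are additions made for illustration.

```python
def karp_rabin(pattern: str, text: str, d: int = 10, q: int = 13):
    """Return (shifts s with an occurrence, number of spurious hash hits)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0
    h = pow(d, m - 1, q)                      # d^(m-1) mod q
    p = t = 0
    for i in range(m):                        # hash of the pattern and of the first window
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    occurrences, spurious = [], 0
    for s in range(n - m + 1):
        if p == t:
            if text[s:s + m] == pattern:      # check if the candidate is an occurrence
                occurrences.append(s)
            else:
                spurious += 1
        if s < n - m:                         # rolling update, done for every shift
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return occurrences, spurious


print(karp_rabin("31415", "2359023141526739921"))
```

For the example text the pattern occurs at shift s = 6, and the window 67399 produces one spurious hit.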

Page 38: Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Randomized approximate pattern matching using FFT

Fast Fourier Transform (FFT) can be computed at high speed on hardware.

They do (approximate) pattern matching by converting strings into sequences of numbers and then computing the score vector at high speed by FFT.

K. Baba, A. Shinohara, M. Takeda, S. Inenaga, and S. Arikawa. A Note on Randomized Algorithm for String Matching with Mismatches. Nordic Journal of Computing, 10(1):2-12, 2003.

Score vector:

c_i = \sum_{j=1}^{m} \delta(t_{i+j-1}, p_j), where \delta(a, b) = 1 if a = b and \delta(a, b) = 0 if a ≠ b

Example (K. Baba, Kyushu Univ.):

i:     1 2 3 4 5 6 7 8 9 10
T[i] = a c b a b b a c c b
P    = a b b a c
c_i  = 3 1 1 5 2 0
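The score vector can be computed directly from its definition by splitting δ into one 0/1 indicator sequence per symbol and summing the correlations. The sketch below does this with plain loops; it is not the randomized algorithm of Baba et al., and the inner correlation is the part that an FFT would speed up. The function name `score_vector` is an assumption.

```python
def score_vector(text: str, pattern: str) -> list[int]:
    """c_i = number of positions j with text[i+j-1] == pattern[j], for each alignment i."""
    n, m = len(text), len(pattern)
    scores = [0] * (n - m + 1)
    for symbol in set(pattern):
        # Indicator sequences: 1 where the symbol occurs, 0 elsewhere.
        t_ind = [1 if c == symbol else 0 for c in text]
        p_ind = [1 if c == symbol else 0 for c in pattern]
        for i in range(n - m + 1):
            # Correlation of the two indicator sequences at alignment i.
            scores[i] += sum(t_ind[i + j] * p_ind[j] for j in range(m))
    return scores


print(score_vector("acbabbaccb", "abbac"))
```

An alignment with score m is an exact occurrence (here the fourth one, with score 5), and m minus the score is the number of mismatches.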