Lecture on Information Knowledge Network "Information retrieval and pattern matching"

北海道大学 Hokkaido University


Lecture on Information Knowledge Network, 2011/11/29


Laboratory of Information Knowledge Network,Division of Computer Science,

Graduate School of Information Science and Technology,Hokkaido University

Takuya KIDA

The 8th: Misc. topics of pattern matching

Method for multi-bytecode texts
Toward an intelligent pattern matching:
1. Pattern matching for XML data
2. Pattern matching on texts with arc annotation
3. Pattern matching with taxonomy data
Appendix: Randomized algorithm



Method for multi-bytecode texts (Japanese texts)

Synchronization problem of codewords:
– False detection will occur when we do pattern matching on a Japanese text in units of bytes (as for ASCII).
– It is necessary to determine the boundaries of characters, as with Huffman codes.

Text T = T F T 液晶の時代
A sequence of bytes → 54 46 54 B1D5 BEBD A4CE BBFE C2E5 (Japanese EUC encoded text)
Pattern P = 修了

AC machine for the pattern P = "BD A4 CE BB" (修了):

[Figure: states 0 → 1 → 2 → 3 → 4 on bytes BD, A4, CE, BB; state 4 outputs 修了; state 0 loops on ∑ − {BD}]

This machine reports a false occurrence in T: the bytes BD A4 CE BB appear across the boundaries of 晶 (BEBD), の (A4CE) and 時 (BBFE).



Review: Solution by automaton with synchronization

Pattern P = DEC
Huffman encoded pattern E(P) = 011001

Text T = ABECA ・・・
Huffman encoded text E(T) = 0000000110010000 ・・・

[Figure: the Huffman tree with leaves A, B, C, D, E; the ordinary KMP automaton for E(P); and the KMP automaton with synchronization]

M. Miyazaki, S. Fukamachi, M. Takeda: Speeding up the pattern matching machine for compressed texts (in Japanese), Trans. IPSJ, Vol. 39, No. 9, pp.2638-2648, 1998.



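PM on multi-bytecode texts by an automaton with synchronization

M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp.170-186, 2002.

Text T = T F T 液晶の時代
A sequence of bytes → 54 46 54 B1D5 BEBD A4CE BBFE C2E5 (Japanese EUC encoded text)
Pattern P = 修了

An AC machine with synchronization, which correctly detects the (EUC encoded) pattern P = "修了":
– Pattern part: states 0 → 1 → 2 → 3 → 4 on bytes BD, A4, CE, BB; state 4 outputs 修了.
– Part for synchronization: from state 0, a byte in [00-8D, 90-9F] returns to 0; a byte in [8E, A0-FF] ∖ [BD] leads to state z, and from z a byte in [A0-FF] leads back to 0; the byte [8F] leads to state g, and from g a byte in [A0-FF] leads to z.

Code automaton accepting any EUC code:
– [00-8D, 90-9F]: 0 → 0
– [8E, A0-FF] followed by [A0-FF]: 0 → z → 0 (A0-FF starts a {full-width char.}, 8E starts a {half-width char.})
– [8F] followed by [A0-FF] [A0-FF]: 0 → g → z → 0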
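The effect of synchronization can be checked on the slide's own byte sequence. The sketch below is only an illustration: it does not build an AC machine, it simply compares the pattern at every byte offset (naive) or only at offsets where the code automaton is back in state 0 (synchronized). The function names are hypothetical, and the input is assumed to be well-formed EUC.

```python
def euc_char_starts(data: bytes):
    """Yield the start offset of every EUC codeword (assumes well-formed input)."""
    i = 0
    while i < len(data):
        yield i
        b = data[i]
        if b == 0x8F:                      # 0 -> g -> z -> 0: three bytes
            i += 3
        elif b == 0x8E or b >= 0xA0:       # 0 -> z -> 0: two bytes
            i += 2
        else:                              # [00-8D, 90-9F]: one byte
            i += 1


def naive_find_all(text: bytes, pattern: bytes):
    """Compare at every byte offset, ignoring character boundaries."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def synchronized_find_all(text: bytes, pattern: bytes):
    """Compare only at offsets that start an EUC codeword."""
    m = len(pattern)
    return [i for i in euc_char_starts(text) if text[i:i + m] == pattern]


text = bytes.fromhex("544654" "B1D5BEBDA4CEBBFEC2E5")   # T F T 液晶の時代
pattern = bytes.fromhex("BDA4CEBB")                     # 修了

print(naive_find_all(text, pattern))          # [6]  (false detection)
print(synchronized_find_all(text, pattern))   # []
```

Appending the pattern bytes to the text gives one genuine occurrence at offset 13, which both functions report; only the naive one additionally reports offset 6.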



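Idea of bit-parallel technique

Text T: abababba
Pattern P: ababb

Mask table M (positions 1 2 3 4 5 of P, written left to right):
  M(a) = 10100
  M(b) = 01011

Ri = (Ri-1<<1 | 1) & M(T[i])

This can be calculated in O(1) time.

※ Keeping only the correctly transferred bits by taking the AND operation with the mask bits M.

State vectors for the text above (in this drawing the shift moves every bit one position to the right):
  R0 = 00000
  R1 = 10000 (a)
  R2 = 01000 (b)
  R3 = 10100 (a)
  R4 = 01010 (b)
  R5 = 10100 (a)
  R6 = 01010 (b)
  R7 = 00001 (b): an occurrence of P ends here
  R8 = 10000 (a)

Example of one step (i = 3, T[3] = a): R2 = 01000, (R2<<1 | 1) = 10100, M(a) = 10100, so R3 = 10100 & 10100 = 10100.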
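The recurrence can be tried out directly. The following is a minimal sketch of the Shift-And method using Python integers as bit vectors; bit j−1 of the integer stands for pattern position j, so the integers are the mirror image of the left-to-right drawings on the slide. The function name `shift_and` is chosen here for illustration.

```python
def shift_and(pattern: str, text: str):
    """Return the 1-based text positions at which an occurrence of pattern ends."""
    if not pattern:
        return []
    # Mask table M: bit j is set in masks[c] when pattern[j] == c.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (len(pattern) - 1)

    r = 0
    ends = []
    for i, ch in enumerate(text, start=1):
        # R_i = (R_{i-1} << 1 | 1) & M(T[i])
        r = ((r << 1) | 1) & masks.get(ch, 0)
        if r & accept:
            ends.append(i)
    return ends


print(shift_and("ababb", "abababba"))   # [7]
```

As long as the pattern length does not exceed the machine word length, each step is a constant number of word operations; Python integers hide that limit, so the sketch works for any length but loses the O(1) guarantee for long patterns.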



Bit-parallel method for multi-bytecode texts

Basic idea:
– We construct a pattern matching machine (code automaton) that can determine the boundaries of codewords and recognize each multi-byte character in the input pattern.
– The code automaton runs while reading the text byte by byte, and it outputs the mask bit sequence corresponding to each character of the input pattern.
– We simulate an arbitrary bit-parallel algorithm by using the output M(T[i]) of the code automaton instead of reading T[i].

A code automaton that can determine the boundaries of EUC codes and recognize "修" and "了":
– 0 → 1 on BD, 1 → 2 on A4; state 2 outputs M[修] = 01
– 0 → 3 on CE, 3 → 4 on BB; state 4 outputs M[了] = 10
– Synchronization part: [00-8D, 90-9F] stays in 0; [8E, A0-FF] ∖ [BD, CE] leads to z, and [A0-FF] leads from z back to 0; [8F] leads to g, and [A0-FF] leads from g to z.

The output is fed to an arbitrary bit-parallel algorithm, for example:

Ri = (Ri-1<<1 | 1) & M(T[i])

Heikki Hyyrö, Jun Takaba, Ayumi Shinohara, and Masayuki Takeda: On Bit-Parallel Processing of Multi-byte Strings, Proc. of Asia Information Retrieval Symposium, pp.190-196, 2004.
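The combination can be sketched as follows. This is an assumption-laden illustration, not the construction from the cited paper: instead of a table-driven code automaton, a small generator cuts the byte stream into EUC characters, and a dictionary plays the role of the automaton's mask output. The names `euc_chars` and `shift_and_multibyte` are hypothetical, and well-formed EUC input is assumed.

```python
def euc_chars(data: bytes):
    """Cut a well-formed EUC byte string into its characters (as bytes objects)."""
    i = 0
    while i < len(data):
        b = data[i]
        width = 3 if b == 0x8F else 2 if (b == 0x8E or b >= 0xA0) else 1
        yield data[i:i + width]
        i += width


def shift_and_multibyte(pattern: bytes, text: bytes):
    """Shift-And over characters; returns 1-based character positions of match ends."""
    pat_chars = list(euc_chars(pattern))
    if not pat_chars:
        return []
    # One mask per pattern character, e.g. M[修] = 01 and M[了] = 10 for 修了.
    masks = {}
    for j, ch in enumerate(pat_chars):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (len(pat_chars) - 1)

    r, ends = 0, []
    for k, ch in enumerate(euc_chars(text), start=1):
        # The mask of the recognized character replaces M(T[i]).
        r = ((r << 1) | 1) & masks.get(ch, 0)
        if r & accept:
            ends.append(k)
    return ends


text = bytes.fromhex("544654" "B1D5BEBDA4CEBBFEC2E5")   # T F T 液晶の時代
pattern = bytes.fromhex("BDA4CEBB")                     # 修了
print(shift_and_multibyte(pattern, text))               # []
print(shift_and_multibyte(pattern, text + pattern))     # [10]
```

Because the state vector has one bit per character rather than per byte, the pattern 修了 needs only two bits.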



Toward an intelligent pattern matching

Until now …
– Text = just a sequence of characters (we have ignored the background knowledge about the text and the meaning of sentences)
– Fast! Fast! Fast!

From now on …
– Text = a sequence of sentences that have meanings and/or structures
– We need an intelligent pattern matching (of course, at high speed!)

Pattern matching in consideration of the structure of the text
– Pattern matching for XML texts
– Pattern matching for texts with arc-annotation
– etc…

Pattern matching in consideration of the meaning of the text (cooperating with ontology data)
– Pattern matching in consideration of the taxonomic information
– Thesaurus, inductive rules, etc…



Pattern matching for XML texts: previous ones

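[Figure: two previous approaches. (1) XML documents are read by an XML parser and held in memory, and the application program accesses them through the DOM API. (2) XML documents are decomposed into path/value pairs (person → "", person/name → "", person/name/first → Makiko, person/name/last → Tanaka), stored in an RDB, and queried with SQL.]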



Pattern matching for XML texts: our approach

[Figure: the application program runs a pattern matching algorithm directly over the XML documents, treated as text.]

<person> <name> <first> Makiko </first> <last> Tanaka </last> </name></person>

M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts,and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp.170-186, 2002.



Advantage of pattern matching approach

– It can batch the processing of a huge XML document or a large number of documents.
– It can treat many queries at once.
– Fast processing
– Small memory space
– Various applications

[Figure: XML documents and their tree structure]



Problem in a simple pattern matching algorithm

It may match part of a tag name.

Pattern Π = {other, <mother>}

<body> <h1>That TVCM</h1> <p> <mother> “mother” </mother> If we remove m, it becomes <other> “other” </other></p></body>

Wrong detection: the keyword "other" is also found inside tag names such as <mother> and <other>. The matcher has to know whether it is inside or outside of a tag.



A solution

An ordinary AC machine

[Figure: AC machine for Π = {other, <mother>}. States 0 to 5 spell "other", states 6 to 13 spell "<mother>", and state 0 loops on ∑.]

An AC machine in consideration of XML tags

[Figure: the same machine extended with states 14 and 15 and with transitions labelled '<', '>' and "other than '<'", so that keyword matching and tag matching are kept apart.]

Handling of attributes

<mother>, <mother nature=“tender”>, <mother nature=“hard”>, ・・・ are all treated as the same tag <mother>.

[Figure: the machine with an additional state 16 and a transition on '>' for skipping attributes.]
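The behaviour the tag-aware machine is meant to have can be illustrated without building the automaton. The sketch below is a simplification under stated assumptions: it splits the document into tag segments and text segments and searches each kind separately, which is not the single-pass AC machine of the slide. The function names are hypothetical, and comments, CDATA sections and '>' inside attribute values are not handled.

```python
def segments(doc: str):
    """Yield (kind, start, chunk): kind is 'tag' for <...>, 'text' otherwise."""
    pos = 0
    while pos < len(doc):
        if doc[pos] == "<":
            end = doc.find(">", pos)
            end = len(doc) - 1 if end == -1 else end
            yield "tag", pos, doc[pos:end + 1]
            pos = end + 1
        else:
            end = doc.find("<", pos)
            end = len(doc) if end == -1 else end
            yield "text", pos, doc[pos:end]
            pos = end


def find_keyword(doc: str, keyword: str):
    """Offsets of keyword occurrences that lie outside of tags."""
    hits = []
    for kind, start, chunk in segments(doc):
        if kind != "text":
            continue
        i = chunk.find(keyword)
        while i != -1:
            hits.append(start + i)
            i = chunk.find(keyword, i + 1)
    return hits


def find_tag(doc: str, name: str):
    """Offsets of start tags with the given name; attributes are ignored."""
    hits = []
    for kind, start, chunk in segments(doc):
        if kind == "tag":
            parts = chunk[1:-1].split()
            if parts and parts[0] == name:
                hits.append(start)
    return hits


doc = '<body><p><mother nature="tender">"mother"</mother> becomes <other>"other"</other></p></body>'
print(doc.count("other"))                 # 6: a plain substring search also hits tag names
print(len(find_keyword(doc, "other")))    # 2: only the occurrences in the text content
print(find_tag(doc, "mother"))            # [9]: <mother nature="tender"> counts as <mother>
```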



Pattern matching in consideration of XML path

I want to look for the persons whose family name is “Tanaka” (in XPath terms, the text of the element //person/name/last is equal to “Tanaka”).

[Figure: the tag-aware AC machine, built for the tags {<person>, </person>, <name>, </name>, <last>, </last>, …} and the keywords {Tanaka}, working together with a path automaton whose states 0 → 1 → 2 → 3 are connected by <person>, <name>, <last>; state 0 loops on tags other than <person>.]

Stack of (tag, state) pairs during the scan, from bottom to top:
(<xml>,0)
(<person>,0)
(<name>,1)
(<last>,2)
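The stack idea can be sketched in a few lines. This is an illustration only: it keeps a stack of open element names instead of (tag, state) pairs, uses a regular expression as tokenizer rather than an AC machine, and assumes well-formed XML without comments, processing instructions or CDATA. The names `TOKEN` and `match_path` are hypothetical.

```python
import re

# A start/end tag (group 1: '/', group 2: name) or a run of text (group 3).
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^>]*>|([^<]+)")


def match_path(xml: str, path, keyword: str):
    """Offsets of keyword inside text whose enclosing elements end with path."""
    stack, hits = [], []
    for m in TOKEN.finditer(xml):
        closing, name, text = m.group(1), m.group(2), m.group(3)
        if text is not None:
            # Report only when the innermost open elements equal the path.
            if stack[-len(path):] == path and keyword in text:
                hits.append(m.start(3) + text.index(keyword))
        elif closing:
            if stack and stack[-1] == name:
                stack.pop()
        elif not m.group(0).endswith("/>"):     # skip self-closing tags
            stack.append(name)
    return hits


xml = ("<people>"
       "<person><name><first>Makiko</first><last>Tanaka</last></name></person>"
       "<person><name><first>Tanaka</first><last>Sato</last></name></person>"
       "</people>")
hits = match_path(xml, ["person", "name", "last"], "Tanaka")
print(len(hits))    # 1: the first name "Tanaka" of the second person is not reported
```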



Processible subset of XPath

Limitation of pattern matching approach
– We cannot specify the predecessor nodes
– The complex filter specifications remarkably decrease the processing speed

LocationPath ::= '/' RelativeLocationPath
RelativeLocationPath ::= Step
                       | RelativeLocationPath '/' Step
Step ::= AxisSpecifier NodeTest
AxisSpecifier ::= AxisName '::'
AxisName ::= 'attribute'
           | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'self' | 'namespace'
NodeTest ::= QName | NodeType '(' ')'
NodeType ::= 'node' | 'text' | 'comment' | 'processing-instruction'

Example: the abbreviated path
//cars/car/@*
is written in this subset as
/descendant::cars/child::car/attribute::node()



Speed comparison with Sgrep

Comparison with Sgrep (J. Jaakkola and P. Kilpeläinen)

Text: 110MB (English text)
CPU: Celeron 366MHz
Memory: 128MB
OS: Kondara MNU/Linux 2.1 RC2

CPU time (sec.)

Pattern                                               Sgrep   Takeda et al. [2002]
//text/"summers"                                      38.44   12.40
//test//"summers"                                     37.02   12.30
/site/regions/africa/item/location/"United_States"    51.85   12.23



Pattern matching for texts with arc-annotation

Definition: The arc annotation A that accompanies a sequence S is a set of pairs of integers taken from {1, 2, …, |S|}.

Each element (iL, iR) ∈ A is called an arc.
– S[iL] and S[iR] are called the left endpoint and the right endpoint, respectively.
– For an arbitrary arc, we assume that iL < iR.
– Moreover, no two arcs share the same integer; that is, no two arcs share an endpoint.

An example of the text with arc-annotation:

A G T C A C G C C C G T
1 2 3 4 5 6 7 8 9 10 11 12



Example of text with arc annotation

An example of the tRNA(tRNAPhe) two-dimensional structure

・・・ACACCUAGCΨTGUGU ・・・

The string having nested arcs



Arc-preserving subsequence(APS) problem

The APS problem is to answer whether the following conditions are satisfied when a text S1 = S1[1 : n] and a pattern S2 = S2[1 : m] are given with arc annotations A1 and A2, respectively.

– S2 is a subsequence of S1.
– The embedding preserves arcs: two positions of the pattern are joined by an arc exactly when the text positions they are mapped to are joined by an arc.

Text S1 := A G T C A C G C C C G T
Pattern S2 := A T G C T

[Figure: an embedding of S2 into S1 in which every base matches (○ base match) but the arcs are not preserved (× arc match).]
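The two conditions can be made concrete with a brute-force check. The sketch below is not the GGN algorithm or any efficient method; it simply tries every way of embedding the pattern into the text and tests arc preservation, which is exponential in general. The arc sets used in the example are hypothetical, since the arcs of the slide's figure are not available; positions are 1-based as in the definition.

```python
from itertools import combinations


def is_arc_preserving_subsequence(s1, a1, s2, a2):
    """Brute-force APS test: s2 with arcs a2 inside s1 with arcs a1."""
    arcs1, arcs2 = set(a1), set(a2)
    n, m = len(s1), len(s2)
    # pos[k-1] is the text position that pattern position k is mapped to.
    for pos in combinations(range(1, n + 1), m):
        if any(s1[p - 1] != c for p, c in zip(pos, s2)):
            continue                       # bases do not match
        if all(((pos[i - 1], pos[j - 1]) in arcs1) == ((i, j) in arcs2)
               for i, j in combinations(range(1, m + 1), 2)):
            return True                    # bases and arcs both match
    return False


s1, s2 = "AGTCACGCCCGT", "ATGCT"
# Hypothetical annotations: one arc joining the first and last text positions.
print(is_arc_preserving_subsequence(s1, [(1, 12)], s2, [(1, 5)]))   # True
print(is_arc_preserving_subsequence(s1, [(1, 12)], s2, []))         # False
```

In the second call every base-matching embedding uses text positions 1 and 12, which are joined by an arc in the text but not in the pattern, so the arc condition fails although S2 is a subsequence of S1.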



APS(TYPE1, TYPE2)

The difficulty of the APS problem depends on the structure of the arc annotations.

APS(TYPE1, TYPE2)
– TYPE1: arc structure of the text
– TYPE2: arc structure of the pattern

Example: APS(nested, chain)
– Arc structure of the text is “nested”
– Arc structure of the pattern is “chain”

[Figure: the four arc structures, ordered by limitation and difficulty]
Crossing   limitation: loose    difficulty: high
Nested
Chain
Plain      limitation: strict   difficulty: low



Result of Kida[2005]

The previous work on the APS problem:
– J. Gramm, J. Guo, and R. Niedermeier. “Pattern matching for arc-annotated sequences.” In Proc. 22nd FSTTCS, volume 2556 of LNCS, pages 182–193. Springer, 2002.

The result of Kida[2005]:
– Proposed an improved algorithm based on the Gramm-Guo-Niedermeier (GGN) algorithm. However, the worst-case complexity is the same as that of GGN.
– Corrected an error in the GGN algorithm. The original GGN algorithm includes an error.
– Implemented the algorithms and ran experiments. The proposed algorithm runs 2 to 5 times faster than GGN.

APS(nested, nested) is solved in O(nm)

Kida: Faster Pattern Matching Algorithm for Arc-Annotated Sequences, Proc. of Federation over the Web, LNAI (to appear)



Change to the text length n

|A1|=20% of n, m=20, |A2|=4



Change to the pattern length m

|A2|=20% of m, n=1000, |A1|=100



Take a breath

[Photo: King Penguin flying in water (2005.8.12 in Asahiyama Zoo)]

Summary to here
– Method for multi-bytecode texts (Japanese texts)
   Embedding the code automaton into the AC machine for synchronization
   Combining the code automaton that outputs mask bit sequences with bit-parallel methods
– Pattern matching in consideration of the structure of the text
   Pattern matching for XML texts
   Pattern matching for arc-annotated texts

～ Trivia ～
How to compute min(x,y) without conditional branching when two integers x and y are represented as m-bit sequences:

S ← ((x | 10^m) − y) & 10^m,
S ← S − (S ≫ m),
min(x,y) ← (~S & x) | (S & y)

Here 10^m is the bit string consisting of a 1 followed by m zeros. (However, we need m+1 bits for each.)
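The three assignments can be checked mechanically. Below is a small sketch in Python, where `branchless_min` is a name chosen for illustration and x, y are assumed to satisfy 0 ≤ x, y < 2^m. The first step sets S to 10^m exactly when x ≥ y (no borrow reaches the extra top bit), the second turns that into m one-bits, and the third uses S as a selector.

```python
def branchless_min(x: int, y: int, m: int) -> int:
    """min(x, y) for 0 <= x, y < 2**m without a conditional branch."""
    top = 1 << m                   # the bit string 1 followed by m zeros
    s = ((x | top) - y) & top      # top bit survives exactly when x >= y
    s = s - (s >> m)               # 2**m - 1 (m ones) if x >= y, else 0
    return (~s & x) | (s & y)      # selects y if x >= y, else x


print(branchless_min(5, 9, 4))    # 5
print(branchless_min(9, 5, 4))    # 5
```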



Example of pattern matching in consideration of taxonomic information (PMTX)

[Figure: fragments of the Gene Ontology. Terms shown: cell, insoluble fraction, membrane fraction, vesicular fraction, microsome, cell surface, cell envelope, cell wall; molecular function, catalytic activity, lyase activity, hyaluronate.]

Text T:
Pub: 1: Cell. 1990 Jun 29;61(7):1303-13.
Title: CD44 is the principal cell surface receptor for hyaluronate.
Authors: Aruffo A, Stamenkovic I, Melnick M, Underhill CB, Seed B.

Pattern P: (cell) (receptor) (for) (catalytic activity)



Result of Kida&Arimura[2004]

– O(m+mh/w) time for preprocessing
– O(m|∑|/w) space
– O(mn/w) time for scanning the text

It works well when m < w: the pattern then fits in one machine word, and the bounds become
– O(m+h) time for preprocessing
– O(|∑|) space
– O(n) time for scanning the text

where
– m: the length of pattern P ∈ ∑*
– n: the length of text T ∈ ∑*
– h: the size of taxonomic information H
– |∑|: the size of the set ∑ of concepts
– w: the length of a machine word (say, 32 or 64)

T. Kida and H. Arimura: Pattern Matching with Taxonomic Information, Proc. of Asia Information Retrieval Symposium(AIRS2004), pp. 265-268, Oct. 2004.



Taxonomic information and sorted alphabet

Sorted alphabet (∑, ⪯)
– ∑: a finite alphabet (a set of concepts)
– ⪯: a partial order relation

We assume that a pattern and a text are given as sequences of concepts: P ∈ ∑* and T ∈ ∑*.

An example of DAG H representing (∑, ⪯):

[Figure: DAG over the concepts A, B, C, D, E, F, G, drawn with G at the top, E below it, then C, D, F, and A, B at the bottom]

※ This is also called a Hasse diagram.

Pattern P := A B E F
Text T := A B C B D F C B

Concept E corresponds to the character class [A,B,C,D,E].



Examples of sorted alphabet

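(1) flat alphabet: A, B, C, D, …, Z, with no order between different letters

(2) class of characters: the characters 0, 1, 2, …, 9 lie below [0-9], the characters a, …, z lie below [a-z], and both classes lie below ?

(3) letter-sets alphabet: the subsets of {a, b, c} ordered by inclusion, with (abc) at the top, then (ab), (ac), (bc), then (a), (b), (c), and φ at the bottom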



We can utilize the Shift-And method!

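Text T: ababbbba
Pattern P: ab[ab]bb

Mask table M (positions 1 2 3 4 5 of P, written left to right):
  M(a) = 10100
  M(b) = 01111
The difference is just here! The class [ab] at position 3 sets bit 3 in both M(a) and M(b).

Ri = (Ri-1<<1 | 1) & M(T[i])
This is the same.

State vectors for the text above:
  R0 = 00000
  R1 = 10000 (a)
  R2 = 01000 (b)
  R3 = 10100 (a)
  R4 = 01010 (b)
  R5 = 00101 (b): an occurrence ends here
  R6 = 00010 (b)
  R7 = 00001 (b): an occurrence ends here
  R8 = 10000 (a)

Example of one step (i = 5, T[5] = b): R4 = 01010, (R4<<1 | 1) = 10101, M(b) = 01111, so R5 = 10101 & 01111 = 00101.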
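Only the construction of the mask table changes when a pattern position is a class of characters. A minimal sketch, assuming the pattern is given as a list with one string of allowed characters per position (a representation chosen here for illustration, as is the name `shift_and_classes`):

```python
def shift_and_classes(pattern, text: str):
    """Shift-And where pattern[j] is the string of characters allowed at position j.

    Returns the 1-based text positions at which an occurrence ends.
    """
    if not pattern:
        return []
    masks = {}
    for j, allowed in enumerate(pattern):
        for ch in allowed:                 # a class sets bit j for each member
            masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (len(pattern) - 1)

    r, ends = 0, []
    for i, ch in enumerate(text, start=1):
        r = ((r << 1) | 1) & masks.get(ch, 0)   # the scan itself is unchanged
        if r & accept:
            ends.append(i)
    return ends


# ab[ab]bb against ababbbba
print(shift_and_classes(["a", "b", "ab", "b", "b"], "ababbbba"))   # [5, 7]
```

The two reported positions correspond to the windows ababb (positions 1 to 5) and abbbb (positions 3 to 7).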



Toward taxonomic information

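Taxonomic information H:

[Figure: the DAG H over the concepts A, B, C, D, E, F, G, drawn with G at the top, E below it, then C, D, F, and A, B at the bottom]

Mask table M’ (one row per concept of the text, one column per pattern position; the columns are labelled A B C D E F):

     A B C D E F
  A  1 0 0 1 1 0
  B  0 1 0 1 1 1
  C  0 0 1 0 1 0
  D  0 0 0 1 1 0
  E  0 0 0 0 1 0
  F  0 0 0 0 0 1
  G  0 0 0 0 0 0

Cost of computing M’: O(mh) ?

Pattern P := A B E F
Text T := A B C B D F C B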



Computation of M’(a)

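Lemma 1
Let (∑, ⪯) be a sorted alphabet. Given a pattern P ∈ ∑*, for any a ∈ ∑ it holds that

M’(a) = ∪_{x ∈ Upb(a)} M(x).

Lemma 2
Let (∑, ⪯) be a sorted alphabet. Given a pattern P ∈ ∑*, for any a ∈ ∑ it holds that

M’(a) = M(a) ∪ ∪_{x ∈ Par(a)} M’(x).

Here Upb(a) is the set of upper bounds of a (a itself and everything above it in H), and Par(a) is the set of parents of a in H.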



Pseudo code for computing M’(a)

Preprocess_M’ (P = p1…pm)   /* Assume H is a global variable */
  initialize M(a) as follows:
    M(a) = {1 ≦ i ≦ m | P[i] = a};
  for each a ∈ ∑ do
    CalculateM’(a);
  end of for

Function CalculateM’(a)
  if M’(a) has been computed then return M’(a)
  else do
    M’(a) = M(a);
    for each x ∈ Par(a) do
      M’(a) = M’(a) ∪ (CalculateM’(x));
    end of for
    return M’(a);

Cost: O(m) for initializing M; O(h) unions in total, each taking O(m/w) time. Total: O(m+mh/w).
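The pseudo code translates almost line by line into a memoized recursion over the parents, after which the Shift-And scan runs unchanged with M’ in place of M. The sketch below makes two assumptions that are not stated on the slides: masks are Python integers (bit j−1 for pattern position j), and the parent lists of H are read off the mask table M’ (A below D, B below D and F, C and D below E, E and F below G). The function names are hypothetical.

```python
def preprocess_masks(pattern, parents):
    """Compute M'(a) = M(a) | union of M'(x) over the parents x of a (Lemma 2)."""
    plain = {}                                   # the ordinary mask table M
    for j, p in enumerate(pattern):
        plain[p] = plain.get(p, 0) | (1 << j)

    memo = {}

    def calc(a):
        if a not in memo:                        # each concept is computed once
            mask = plain.get(a, 0)
            for x in parents.get(a, ()):
                mask |= calc(x)
            memo[a] = mask
        return memo[a]

    for a in parents:
        calc(a)
    return memo


def pmtx_search(pattern, text, parents):
    """Shift-And scan with M'; returns 1-based end positions of occurrences."""
    if not pattern:
        return []
    masks = preprocess_masks(pattern, parents)
    accept = 1 << (len(pattern) - 1)
    r, ends = 0, []
    for i, t in enumerate(text, start=1):
        r = ((r << 1) | 1) & masks.get(t, 0)
        if r & accept:
            ends.append(i)
    return ends


# Assumed parent lists (not given explicitly on the slides).
PARENTS = {"A": ["D"], "B": ["D", "F"], "C": ["E"], "D": ["E"],
           "E": ["G"], "F": ["G"], "G": []}

print(pmtx_search("ABEF", "ABCBDFCB", PARENTS))   # [4]
```

Under these assumptions the window A B C B at positions 1 to 4 matches A B E F, because C lies below E and B lies below F.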



Overview of retrieval system with PMTX algorithm

[Figure: the Text DB is passed through a Translator; the Pattern and the Taxonomic information DB are given to the pattern matching machine, which reads the translated text and outputs the occurrences.]

We have to parse the text into a sequence of concepts:
– Translator: Replace Automaton (Arikawa and Shiraishi[1984]), O(h+n)
– Or using a morphological parser for natural language texts, like ChaSen



The 7th summary

Method for multi-bytecode texts (Japanese texts)
– Embedding the code automaton into the AC machine for synchronization
– Combining the code automaton that outputs mask bit sequences with bit-parallel methods

Toward an intelligent pattern matching:
– Pattern matching in consideration of the structure of the text
   Pattern matching for XML texts
   Pattern matching for arc-annotated texts
– Pattern matching in consideration of the meanings of the text (in cooperation with ontology data)
   Pattern matching in consideration of taxonomic information

Prof. Arimura will take charge of this class from the next lecture
– Efficient data structure for information retrieval
– Data mining from the web, etc.



Karp-Rabin algorithm

It is a randomized algorithm using a hashing technique
– Matching a string by regarding it as an integer!

The worst case takes O(mn) time, but it takes O(n+m) time on average

Extra space we need is only O(1)

KARP R.M., RABIN M.O., Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2):249-260, 1987.

Example with ∑ = { 0,1,2,…,9 }:

Text:    2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
Pattern: 3 1 4 1 5, and 31415 mod 13 = 7

Values of the windows of length 5, mod 13:
8 9 3 11 0 1 7 8 4 5 10 11 7 9 11

The window 3 1 4 1 5 has value 7 and is a correct match (Correct!). The window 6 7 3 9 9 also has value 7 but is not an occurrence (Wrong!).

Moving the window from 3 1 4 1 5 (value 7) to 1 4 1 5 2 (value 8) removes the highest figure of the previous step (3) and appends the lowest figure that is newly input (2):

14152 ≡ (31415 – 3×10000)×10 + 2 (mod 13)
      ≡ (7 – 3×3)×10 + 2 (mod 13)
      ≡ 8 (mod 13)



Pseudo code

Karp-Rabin (P, T, d, q)
  m ← length[P].
  n ← length[T].
  h ← d^(m–1) mod q.
  p ← 0.
  t0 ← 0.
  for i ← 1 to m do
    p ← (d・p + P[i]) mod q;
    t0 ← (d・t0 + T[i]) mod q.
  for s ← 0 to n – m do
    if p = ts then
      if P[1…m] = T[s+1…s+m] then      /* check if the candidate is the occurrence */
        report an occurrence at s;
    if s < n – m then
      ts+1 ← (d・(ts – T[s+1]・h) + T[s+m+1]) mod q.
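The pseudo code can be run on the digit example. The following is a sketch in Python with 0-based shifts; in addition to the verified occurrences it returns the shifts where the hash values agree but the strings differ, which is not part of the pseudo code and is added here only to make the "Wrong!" case visible. Sequences are lists of integers in the range 0 to d−1.

```python
def karp_rabin(pattern, text, d, q):
    """Return (occurrences, spurious hits) as lists of 0-based shifts s."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], []
    h = pow(d, m - 1, q)                  # weight of the highest figure, mod q
    p = t = 0
    for i in range(m):
        p = (d * p + pattern[i]) % q
        t = (d * t + text[i]) % q

    matches, spurious = [], []
    for s in range(n - m + 1):
        if p == t:
            # Check if the candidate is the occurrence.
            if text[s:s + m] == pattern:
                matches.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Drop the highest figure, shift, and append the new lowest figure.
            t = (d * (t - text[s] * h) + text[s + m]) % q
    return matches, spurious


text = [int(c) for c in "2359023141526739921"]
pattern = [int(c) for c in "31415"]
print(karp_rabin(pattern, text, d=10, q=13))   # ([6], [12])
```

With q = 13 the window 67399 at shift 12 collides with the pattern; a larger modulus makes such collisions rarer at the cost of larger intermediate values.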



Randomized approximate pattern matching using FFT

Fast Fourier Transform (FFT) can be computed at high speed on hardware

They do (approximate) pattern matching by converting strings into sequences of numbers and then computing the score vector at high speed by FFT

K. Baba, A. Shinohara, M. Takeda, S. Inenaga, and S. Arikawa. A Note on Randomized Algorithm for String Matching with Mismatches. Nordic Journal of Computing, 10(1):2-12, 2003.

Score vector:

c_i = \sum_{j=1}^{m} \delta(t_{i+j-1}, p_j), \qquad \delta(a, b) = \begin{cases} 1 & (a = b) \\ 0 & (a \neq b) \end{cases}

Example (K. Baba, Kyushu Univ.):

i:    1 2 3 4 5 6 7 8 9 10
T[i]: a c b a b b a c c b
P:    a b b a c

Score vector c_i for i = 1, …, 6: 3 1 1 5 2 0
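The score vector of the example can be reproduced with a direct computation. The sketch below is neither randomized nor FFT-based: for every symbol it correlates the 0/1 indicator sequences of the text and the pattern with a plain double loop. These per-symbol correlations are the sums that an FFT would evaluate faster; how the cited paper reduces their number by randomization is not shown here. The function name is hypothetical.

```python
def score_vector(text: str, pattern: str):
    """c_i = number of positions j with text[i+j-1] == pattern[j], for every alignment i."""
    n, m = len(text), len(pattern)
    scores = [0] * (n - m + 1)
    for sym in set(pattern):
        # Indicator sequences: 1 where the symbol occurs, 0 elsewhere.
        t_ind = [1 if c == sym else 0 for c in text]
        p_ind = [1 if c == sym else 0 for c in pattern]
        # Cross-correlation of the two indicator sequences (the part an FFT replaces).
        for i in range(n - m + 1):
            scores[i] += sum(t_ind[i + j] * p_ind[j] for j in range(m))
    return scores


print(score_vector("acbabbaccb", "abbac"))   # [3, 1, 1, 5, 2, 0]
```

A score equal to m marks an exact occurrence (alignment 4 here), and m minus the score is the number of mismatches at that alignment.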
