Lecture on Information Knowledge Network "Information retrieval and pattern matching". Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University. Takuya KIDA.
北海道大学 Hokkaido University
Lecture on Information Knowledge Network, 2011/11/29
Lecture on Information Knowledge Network
"Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science,
Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA
The 8th: Misc. topics of pattern matching
Method for multi-bytecode texts
Toward an intelligent pattern matching:
1. Pattern matching for XML data
2. Pattern matching on texts with arc annotation
3. Pattern matching with taxonomy data
Appendix: Randomized algorithm
Method for multi-bytecode texts (Japanese texts)
Synchronization problem of codewords:
– False detections occur when we do pattern matching on a Japanese text in units of ASCII characters (i.e. byte by byte).
– It is necessary to determine the boundaries of characters, just as for Huffman codes.
Text T = T F T 液 晶 の 時 代 (Japanese EUC encoded text)
A sequence of bytes → 54 46 54 B1D5 BEBD A4CE BBFE C2E5
Pattern P = 修 了
AC machine for the pattern P = “BD A4 CE BB” (修了):
[Figure: states 0 →BD→ 1 →A4→ 2 →CE→ 3 →BB→ 4, where state 4 outputs 修了 and state 0 has a self-loop on ∑ - {BD}. Run byte by byte, it falsely reports 修了 inside the bytes BEBD A4CE BBFE of 晶の時.]
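The false detection on this slide can be reproduced with a few lines of Python. This is only a sketch: the helper name `naive_byte_search` is made up for illustration, and it relies on Python's built-in `euc_jp` codec rather than on an AC machine.

```python
def naive_byte_search(text, pattern, encoding="euc_jp"):
    """Return the byte offset of the first byte-level hit, or -1 if there is none.

    The search ignores character boundaries, which is exactly the
    synchronization problem described on the slide.
    """
    return text.encode(encoding).find(pattern.encode(encoding))


text = "TFT液晶の時代"
pattern = "修了"

offset = naive_byte_search(text, pattern)
print("byte-level hit at offset:", offset)          # a hit is reported
print("character-level hit:", pattern in text)      # but the characters never occur
```

The byte-level hit starts in the middle of 晶 (second byte BD) and ends in the middle of 時 (first byte BB), so it does not correspond to any occurrence of the pattern characters.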
Review: Solution by automaton with synchronization
[Figure: Huffman tree over the symbols A, B, C, D, E, with edges labeled 0 and 1]
Pattern P = DEC, Huffman encoded pattern E(P) = 011001
Text T = ABECA ・・・, Huffman encoded text E(T) = 0000000110010000 ・・・
[Figure: an ordinary KMP automaton for E(P), with a ∑ self-loop on the initial state, and a KMP automaton with synchronization, in which the Huffman tree is attached to the automaton]
M. Miyaaki, S. Fukamachi, M. Takeda: Speeding up the pattern matching machine for compressed texts (in Japanese), Trans. IPSJ, Vol. 39, No. 9, pp.2638-2648, 1998.
PM on multi-bytecode texts by an automaton with synchronization
M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp.170-186, 2002.
Text T = T F T 液 晶 の 時 代 (Japanese EUC encoded text)
A sequence of bytes → 54 46 54 B1D5 BEBD A4CE BBFE C2E5
Pattern P = 修 了
An AC machine with synchronization, which correctly detects the (EUC encoded) pattern P = “修了”:
[Figure: the goto path 0 →BD→ 1 →A4→ 2 →CE→ 3 →BB→ 4 (output 修了) is combined with a part for synchronization, whose transitions are labeled [00-8D, 90-9F], [8E, A0-FF] ∖ [BD], [8F][A0-FF], and [A0-FF]]
Code automaton accepting any EUC code:
[Figure: from state 0, a byte in [00-8D, 90-9F] returns to state 0; a byte in [8E, A0-FF] is followed by one byte in [A0-FF]; the byte [8F] is followed by two bytes in [A0-FF]. The accepted codes are labeled {full-width char.} and {half-width char.}]
Idea of bit-parallel technique
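Text T: abababba
Pattern P: ababb
Mask table M (the j-th bit from the left corresponds to pattern position j = 1, …, 5):
  M(a) = 10100
  M(b) = 01011
Update rule: Ri = (Ri-1<<1 | 1) & M(T[i])
This can be calculated in O(1) time.
State Ri after reading each text character:
  i     0      1      2      3      4      5      6      7      8
  T[i]         a      b      a      b      a      b      b      a
  Ri    00000  10000  01000  10100  01010  10100  01010  00001  10000
The last bit of R7 is 1, so an occurrence of P ends at text position 7.
Example step (i = 3): R2 = 01000, (R2<<1 | 1) = 10100, M(a) = 10100, hence R3 = 10100 & 10100 = 10100.
※ Only the correctly shifted bits are kept by taking AND with the mask bits M.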
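The update rule Ri = (Ri-1<<1 | 1) & M(T[i]) can be tried out directly. The following is a minimal sketch, assuming a non-empty pattern that fits comfortably in a Python integer; the function name `shift_and` and the choice of storing pattern position j in bit j-1 (least significant bit first) are illustrative assumptions.

```python
def shift_and(text, pattern):
    """Return the 1-based text positions at which an occurrence of `pattern` ends."""
    m = len(pattern)
    # Mask table M: bit j-1 of mask[c] is set iff pattern position j holds c.
    mask = {}
    for j, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << j)
    accept = 1 << (m - 1)          # bit of the last pattern position
    r = 0                          # R_0 = 00000
    ends = []
    for i, ch in enumerate(text, start=1):
        r = ((r << 1) | 1) & mask.get(ch, 0)   # R_i = (R_{i-1} << 1 | 1) & M(T[i])
        if r & accept:
            ends.append(i)
    return ends


print(shift_and("abababba", "ababb"))
```

For the slide's example the only occurrence of ababb ends at text position 7.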
Bit-parallel method for multi-bytecode texts
Basic idea:
– We construct a pattern matching machine (code automaton) that can determine the boundaries of codewords and recognize each multi-byte character in the input pattern.
– The code automaton runs while reading the text byte by byte, and it outputs the mask bit sequence corresponding to each character in the input pattern.
– We simulate an arbitrary bit-parallel algorithm by using the output M(T[i]) of the code automaton instead of reading T[i].
A code automaton that can determine the boundaries of EUC codes and recognize “修” and “了”:
[Figure: from state 0, BD then A4 (states 1, 2) outputs M[修]=01, and CE then BB (states 3, 4) outputs M[了]=10; the remaining transitions are labeled [00-8D,90-9F], [8E, A0-FF]/[BD,CE], [8F][A0-FF], and [A0-FF]. The output masks are fed to an arbitrary bit-parallel algorithm, e.g. Ri = (Ri-1<<1 | 1) & M(T[i]).]
Heikki Hyyrö, Jun Takaba, Ayumi Shinohara, and Masayuki Takeda: On Bit-Parallel Processing of Multi-byte Strings,Proc. of Asia Information Retrieval Symposium, pp.190-196, 2004.
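The combination of a code automaton with Shift-And can be sketched as follows. This is a simplified illustration, not the construction of the cited paper: instead of an explicit automaton it uses a hypothetical generator `euc_chars` that splits well-formed EUC input into codewords using the byte classes shown on the slides, and a dictionary that maps each codeword to its mask.

```python
def euc_chars(data):
    """Yield each EUC codeword of `data` as a bytes object (assumes well-formed input)."""
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x8F:                      # [8F][A0-FF][A0-FF]
            length = 3
        elif b == 0x8E or b >= 0xA0:       # [8E, A0-FF][A0-FF]
            length = 2
        else:                              # [00-8D, 90-9F]
            length = 1
        yield data[i:i + length]
        i += length


def shift_and_euc(text, pattern):
    """Return byte offsets where `pattern` occurs in `text` on character boundaries."""
    pat = list(euc_chars(pattern))
    mask = {}
    for j, ch in enumerate(pat):
        mask[ch] = mask.get(ch, 0) | (1 << j)
    accept = 1 << (len(pat) - 1)
    r, pos, hits = 0, 0, []
    for ch in euc_chars(text):
        # The codeword's mask plays the role of M(T[i]); unknown codewords give 0.
        r = ((r << 1) | 1) & mask.get(ch, 0)
        pos += len(ch)
        if r & accept:
            hits.append(pos - len(pattern))
    return hits


pattern = "修了".encode("euc_jp")
print(shift_and_euc("TFT液晶の時代".encode("euc_jp"), pattern))    # no false detection
print(shift_and_euc("課程を修了した".encode("euc_jp"), pattern))  # a true occurrence
```

Because the bit-parallel state is updated once per codeword rather than once per byte, a match can never start or end in the middle of a character.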
Toward an intelligent pattern matching
Until now …
– Text = just a sequence of characters (we have ignored the background knowledge about the text and the meaning of sentences)
– Fast! Fast! Fast!
From now on …
– Text = a sequence of sentences that have meanings and/or structures
– We need an intelligent pattern matching (of course, at high speed!)
Pattern matching in consideration of the structure of the text
– Pattern matching for XML texts
– Pattern matching for texts with arc-annotation
– etc…
Pattern matching in consideration of the meaning of the text (cooperating with ontology data)
– Pattern matching in consideration of the taxonomic information
– Thesaurus, inductive rules, etc…
Pattern matching for XML texts: previous ones
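[Figure: XML documents are read by an XML parser into memory and accessed by the application program through the DOM API, or they are stored in an RDB and queried with SQL. Example: a person element with name/first = Makiko and name/last = Tanaka is stored as the rows (person, “”), (person/name, “”), (person/name/first, Makiko), (person/name/last, Tanaka).]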
Pattern matching for XML texts: our approach
[Figure: a pattern matching algorithm in the application program reads the XML documents directly as text, e.g. <person> <name> <first> Makiko </first> <last> Tanaka </last> </name></person>]
M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts,and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp.170-186, 2002.
Advantage of pattern matching approach
It can batch-process a huge XML document or a large number of documents.
It can treat many queries at once.
– Fast processing
– In a little memory space
– Various applications
[Figure: XML documents with tree structure]
Problem in a simple pattern matching algorithm
It may match part of a tag name.
Pattern Π = {other, <mother>}
Text: <body> <h1>That TVCM</h1> <p> <mother> “mother” </mother> If we remove m, it becomes <other> “other” </other></p></body>
Wrong detection: is the occurrence inside or outside of a tag?
A solution
[Figure: an ordinary AC machine for the patterns other and <mother> (states 0 to 14, with a ∑ self-loop on the initial state), and an AC machine in consideration of XML tags (states 0 to 15), whose additional transitions on '<', '>', and on characters other than '<' keep track of whether the current position is inside a tag]
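The effect of tracking tag boundaries can be illustrated without building the AC machine itself. The sketch below swaps in a plain scanner: the function name `tag_aware_search` and the split of the pattern set into `keywords` (matched only outside tags) and `tags` (matched only as whole tags) are assumptions made for this example.

```python
def tag_aware_search(text, keywords, tags):
    """Return (position, pattern) pairs.

    Keywords are reported only outside of tags; tag patterns are reported
    only when they equal a complete tag such as "<mother>".
    """
    hits = []
    i = 0
    while i < len(text):
        if text[i] == "<":
            close = text.find(">", i)
            if close == -1:
                break                      # unterminated tag: stop scanning
            tag = text[i:close + 1]
            if tag in tags:
                hits.append((i, tag))
            i = close + 1                  # skip the tag, so keywords inside it are ignored
        else:
            for kw in keywords:
                if text.startswith(kw, i):
                    hits.append((i, kw))
            i += 1
    return hits


text = '<body><p><mother>"mother"</mother> becomes <other>"other"</other></p></body>'
print(text.count("other"))                                   # naive search: every substring
print([p for _, p in tag_aware_search(text, {"other"}, {"<mother>"})])
```

A naive substring search reports "other" inside every tag name as well, whereas the scanner reports the tag <mother> once and the keyword only in the quoted text.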
Handling of attributes
<mother>, <mother nature=“tender”>, <mother nature=“hard”>, ・・・ are all treated as the same tag <mother>.
[Figure: the AC machine in consideration of XML tags (states 0 to 15, with transitions on '<', '>', and on characters other than '<' or '>') is extended with an additional state 16 for the attribute part of a tag, which is left on '>']
Pattern matching in consideration of XML path
I want to look for the persons whose family name is “Tanaka” (in XPath terms, the element //person/name/last is equal to “Tanaka”).
[Figure: the AC machine in consideration of XML tags is combined with a path automaton with states 0 to 3 and transitions 0 →<person>→ 1 →<name>→ 2 →<last>→ 3, where state 0 loops on tags other than <person>. The tag set is {<person>, </person>, <name>, </name>, <last>, </last>, …} and the keyword set is {Tanaka}. A stack holds pairs of a tag and a state, e.g. (<xml>,0), (<person>,0), (<name>,1), (<last>,2).]
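The role of the stack can be sketched with a small scanner. This is an illustration under simplifying assumptions, not the automaton of the slide: tags are tokenized with a regular expression, the stack stores only tag names, comments and self-closing tags are not handled, and the names `TOKEN` and `match_path` are hypothetical.

```python
import re

# One token is either a tag (optional "/", tag name, ignored attributes) or a text run.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^>]*>|([^<]+)")


def match_path(xml, path, keyword):
    """Return offsets of text nodes equal to `keyword` whose enclosing tags end with `path`."""
    stack = []
    hits = []
    for m in TOKEN.finditer(xml):
        closing, name, text = m.group(1), m.group(2), m.group(3)
        if text is not None:
            # The innermost open tags must spell out the requested path.
            if stack[-len(path):] == path and text.strip() == keyword:
                hits.append(m.start(3))
        elif closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
    return hits


xml = ("<xml><person><name><first>Makiko</first><last>Tanaka</last></name></person>"
       "<pet><name><last>Tanaka</last></name></pet></xml>")
print(match_path(xml, ["person", "name", "last"], "Tanaka"))
```

Only the first "Tanaka" is reported, because the second one sits under pet/name/last rather than person/name/last.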
Processible subset of XPath
Limitations of the pattern matching approach:
– We cannot specify predecessor nodes
– Complex filter specifications remarkably decrease the processing speed
LocationPath ::= '/' RelativeLocationPath
RelativeLocationPath ::= Step
    | RelativeLocationPath '/' Step
Step ::= AxisSpecifier NodeTest
AxisSpecifier ::= AxisName '::'
AxisName ::= 'attribute'
    | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'self' | 'namespace'
NodeTest ::= QName | NodeType '(' ')'
NodeType ::= 'node' | 'text' | 'comment' | 'processing-instruction'
//cars/car/@*
/descendant::cars/child::car/attribute::node()
Speed comparison with Sgrep
Comparison with Sgrep (J. Jaakkola and P. Kilpeläinen)
Text: 110MB (English text)
CPU: Celeron 366MHz
Memory: 128MB
OS: Kondara/MNU Linux 2.1 RC2
CPU time (sec.):
  Pattern                                               Sgrep    Takeda et al. [2002]
  //text/"summers"                                      38.44    12.40
  //test//"summers"                                     37.02    12.30
  /site/regions/africa/item/location/"United_States"    51.85    12.23
Pattern matching for texts with arc-annotation
Definition: The arc annotation A that accompanies a sequence S is a set of pairs of integers from {1, 2, …, |S|}.
Each element (iL, iR) ∈ A is called an arc.
– S[iL] and S[iR] are called the left endpoint and the right endpoint, respectively.
– For an arbitrary arc, we assume that iL < iR.
– Moreover, no two arcs share the same integer; that is, no two arcs share an endpoint.
An example of a text with arc annotation:
A G T C A C G C C C G T
1 2 3 4 5 6 7 8 9 10 11 12
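The conditions of the definition translate into a short validity check. The sketch below is illustrative only: the function name `is_valid_arc_annotation` is made up, and the arcs used in the example are hypothetical, since the arcs drawn on the slide are not recoverable from the text.

```python
def is_valid_arc_annotation(length, arcs):
    """Check that `arcs` is an arc annotation for a sequence of the given length.

    Each arc (i_left, i_right) must satisfy 1 <= i_left < i_right <= length,
    and no position may be an endpoint of two arcs.
    """
    seen = set()
    for left, right in arcs:
        if not (1 <= left < right <= length):
            return False
        if left in seen or right in seen:
            return False
        seen.update((left, right))
    return True


S = "AGTCACGCCCGT"
print(is_valid_arc_annotation(len(S), {(2, 11), (4, 9)}))     # hypothetical nested arcs
print(is_valid_arc_annotation(len(S), {(2, 11), (11, 12)}))   # position 11 is shared
```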
Example of text with arc annotation
An example of the tRNA(tRNAPhe) two-dimensional structure
・・・ACACCUAGCΨTGUGU ・・・
The string having nested arcs
Arc-preserving subsequence(APS) problem
The APS problem is to answer whether the following conditions are satisfied when a text S1 = S1[1 : n] and a pattern S2 = S2[1 : m] are given with arc annotations A1 and A2, respectively.
– S2 is a subsequence of S1.
– The arcs are preserved: two matched positions of the text are joined by an arc if and only if the corresponding positions of the pattern are joined by an arc.
Text: S1 = A G T C A C G C C C G T
Pattern: S2 = A T G C T
[Figure: two ways of embedding the pattern into the text are compared; an embedding can satisfy the base match (○) and still fail the arc match (×)]
APS(TYPE1, TYPE2)
The difficulty of the APS problem depends on the arc annotation structure.
APS(TYPE1, TYPE2)
– TYPE1: arc structure of the text
– TYPE2: arc structure of the pattern
Example: APS(nested, chain)
– Arc structure of the text is “nested”
– Arc structure of the pattern is “chain”
[Figure: the arc structures ordered from loose limitation and high difficulty to strict limitation and low difficulty: Crossing, Nested, Chain, Plain]
Result of Kida[2005]
Previous work on the APS problem:
– J. Gramm, J. Guo, and R. Niedermeier. “Pattern matching for arc-annotated sequences.” In Proc. 22nd FSTTCS, volume 2556 of LNCS, pages 182–193. Springer, 2002.
The result of Kida[2005]:
– Proposed an improved algorithm based on the Gramm-Guo-Niedermeier (GGN) algorithm
  – However, the worst-case complexity is the same as that of GGN
– Corrected an error in the GGN algorithm
  – The original GGN algorithm includes an error
– Implemented the algorithms and ran experiments
  – The proposed algorithm runs 2 to 5 times faster than GGN
APS(nested, nested) is solved in O(nm)
Kida: Faster Pattern Matching Algorithm for Arc-Annotated Sequences, Proc. of Federation over the Web,LNAI (to appear)
Change to the text length n
|A1|=20% of n, m=20, |A2|=4
Change to the pattern length m
|A2|=20% of m, n=1000, |A1|=100
Take a breath
King Penguin flying in water ( 2005.8.12 in Asahiyama Zoo )
Summary to here
– Method for multi-bytecode texts (Japanese texts)
  – Embedding the code automaton into AC machine for synchronization
  – Combining the code automaton that outputs mask bit sequences with bit-parallel methods
– Pattern matching in consideration of the structure of the text
  – Pattern matching for XML texts
  – Pattern matching for arc-annotated texts
~ Trivia ~
How to compute min(x,y) without conditional branching when two integers x and y are represented as m-bit sequences:
S ← ((x | 10^m) − y) & 10^m
S ← S − (S ≫ m)
min(x,y) ← (~S & x) | (S & y)
(Here 10^m denotes the bit string consisting of a 1 followed by m zeros. However, we need m+1 bits for each.)
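The three assignments can be checked mechanically. The sketch below transcribes them into Python; the function name `branchless_min` is an assumption, and x and y are assumed to be non-negative integers smaller than 2^m.

```python
def branchless_min(x, y, m):
    """Compute min(x, y) for 0 <= x, y < 2**m without a conditional branch."""
    top = 1 << m                      # the bit string 1 followed by m zeros
    s = ((x | top) - y) & top         # top bit survives iff x >= y
    s = s - (s >> m)                  # m ones if x >= y, otherwise 0
    return (~s & x) | (s & y)         # select y if x >= y, otherwise x


print(branchless_min(5, 3, 4), branchless_min(3, 5, 4), branchless_min(7, 7, 4))
```

The subtraction (x | 10^m) − y equals 2^m + x − y, so bit m remains set exactly when x ≥ y; that single bit is then spread into a mask of m ones that selects y.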
Example of pattern matching in consideration of taxonomic information (PMTX)
[Figure: a fragment of the Gene Ontology with concepts such as cell, insoluble fraction, membrane fraction, vesicular fraction, microsome, cell surface, cell envelope, cell wall, molecular function, catalytic activity, lyase activity, and hyaluronate]
Pattern P: (cell) (receptor) (for) (catalytic activity)
Text T:
Pub:1: Cell. 1990 Jun 29;61(7):1303-13.
Title: CD44 is the principal cell surface receptor for hyaluronate.
Authors: Aruffo A, Stamenkovic I, Melnick M, Underhill CB, Seed B.
Result of Kida&Arimura[2004]
– In general: O(m+mh/w) time for preprocessing, O(m|∑|/w) space, O(mn/w) time for scanning the text
– It works well when m < w: O(m+h) time for preprocessing, O(|∑|) space, O(n) time for scanning the text
Notation:
– m: the length of pattern P∈∑*
– n: the length of text T∈∑*
– h: the size of taxonomic information H
– |∑|: the size of the set ∑ of concepts
– w: the length of a machine word (say, 32 or 64)
T. Kida and H. Arimura: Pattern Matching with Taxonomic Information, Proc. of Asia Information Retrieval Symposium(AIRS2004), pp. 265-268, Oct. 2004.
Taxonomic information and sorted alphabet
Sorted alphabet (∑, ≼)
– ∑: a finite alphabet (a set of concepts)
– ≼: a partial order relation on ∑
We assume that a pattern and a text are given as sequences of concepts: P ∈ ∑* and T ∈ ∑*.
An example of a DAG H representing (∑, ≼) (※ This is also called a Hasse diagram):
[Figure: G at the top, E below it, then C, D, F, and A, B at the bottom]
Pattern P := A B E F
Text T := A B C B D F C B
Concept E corresponds to the character class [A,B,C,D,E].
Examples of sorted alphabet
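(1) flat alphabet: A B C D … Z
(2) class of characters: ? at the top, the classes [0-9] and [a-z] below it, and the characters 0 1 2 … 9 and a … z at the bottom
(3) letter-sets alphabet: (abc) at the top, then (ab), (ac), (bc), then (a), (b), (c), and φ at the bottom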
We can utilize the Shift-And method!
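Text T: ababbbba
Pattern P: ab[ab]bb
Mask table M (the j-th bit from the left corresponds to pattern position j = 1, …, 5):
  M(a) = 10100
  M(b) = 01111
The difference is just here: position 3 is set in both M(a) and M(b), because the class [ab] accepts both characters.
Update rule (this is the same): Ri = (Ri-1<<1 | 1) & M(T[i])
State Ri after reading each text character:
  i     0      1      2      3      4      5      6      7      8
  T[i]         a      b      a      b      b      b      b      a
  Ri    00000  10000  01000  10100  01010  00101  00010  00001  10000
The last bit is 1 in R5 and R7, so occurrences of P end at text positions 5 (ababb) and 7 (abbbb).
Example step (i = 5): R4 = 01010, (R4<<1 | 1) = 10101, M(b) = 01111, hence R5 = 10101 & 01111 = 00101.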
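Only the construction of the mask table changes when classes of characters are allowed. The following sketch makes this concrete; representing the pattern as a list of Python sets and the name `shift_and_classes` are assumptions for illustration.

```python
def shift_and_classes(text, pattern_classes):
    """Shift-And where each pattern position is a set of accepted characters.

    Returns the 1-based text positions at which an occurrence ends.
    """
    mask = {}
    for j, cls in enumerate(pattern_classes):
        for ch in cls:                       # a class sets the same bit in several masks
            mask[ch] = mask.get(ch, 0) | (1 << j)
    accept = 1 << (len(pattern_classes) - 1)
    r, ends = 0, []
    for i, ch in enumerate(text, start=1):
        r = ((r << 1) | 1) & mask.get(ch, 0)   # the scanning loop is unchanged
        if r & accept:
            ends.append(i)
    return ends


# Pattern ab[ab]bb from the slide
pattern = [{"a"}, {"b"}, {"a", "b"}, {"b"}, {"b"}]
print(shift_and_classes("ababbbba", pattern))
```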
Toward taxonomic information
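Taxonomic information H:
[Figure: the DAG with G at the top, E below it, then C, D, F, and A, B at the bottom]
Mask table M’ (rows: concepts read from the text; columns: the concepts A B C D E F as pattern characters):
       A B C D E F
  A    1 0 0 1 1 0
  B    0 1 0 1 1 1
  C    0 0 1 0 1 0
  D    0 0 0 1 1 0
  E    0 0 0 0 1 0
  F    0 0 0 0 0 1
  G    0 0 0 0 0 0
Can it be constructed in O(mh)?
Pattern P := A B E F
Text T := A B C B D F C B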
Computation of M’(a)
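Lemma 1: Let (∑, ≼) be a sorted alphabet. Given a pattern P∈∑*, for any a∈∑, it holds that
M’(a) = ∪_{x∈Upb(a)} M(x).
Lemma 2: Let (∑, ≼) be a sorted alphabet. Given a pattern P∈∑*, for any a∈∑, it holds that
M’(a) = M(a) ∪ ∪_{x∈Par(a)} M’(x).
Here Upb(a) denotes the set of upper bounds of a (including a itself) and Par(a) the set of parents of a in H.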
Pseudo code for computing M’(a)
Preprocess_M’ (P=p1…pm)   /* Assume H is a global variable */
  initialize M(a) as follows:
    M(a) = {1 ≦ i ≦ m | P[i]=a};
  for each a ∈ ∑ do
    CalculateM’(a);
  end of for

Function CalculateM’(a)
  if M’(a) has been computed then return M’(a)
  else do
    M’(a) = M(a);
    for each x ∈ Par(a) do
      M’(a) = M’(a) ∪ (CalculateM’(x));
    end of for
    return M’(a);

Complexity: O(m) for initializing M; O(h) unions in total, each taking O(m/w) time; total O(m+mh/w).
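The pseudo code can be turned into a short runnable sketch. Everything below is illustrative: masks are stored as Python integers (bit j-1 for pattern position j), the function names are made up, and the parent lists in `parents` are a hypothetical reading of the example DAG (the edges are inferred, not stated in the text).

```python
def preprocess_masks(pattern, parents):
    """Compute M'(a) for every concept via M'(a) = M(a) | OR of M'(x) over x in Par(a)."""
    base = {}                                   # M(a): positions i with P[i] = a
    for j, concept in enumerate(pattern):
        base[concept] = base.get(concept, 0) | (1 << j)
    memo = {}

    def calc(a):
        if a not in memo:                       # "if M'(a) has been computed then return"
            m = base.get(a, 0)
            for x in parents.get(a, ()):
                m |= calc(x)
            memo[a] = m
        return memo[a]

    for a in parents:
        calc(a)
    return memo


def pmtx_scan(text, pattern, parents):
    """Shift-And over concepts; returns 1-based end positions of occurrences."""
    masks = preprocess_masks(pattern, parents)
    accept = 1 << (len(pattern) - 1)
    r, ends = 0, []
    for i, concept in enumerate(text, start=1):
        r = ((r << 1) | 1) & masks.get(concept, 0)
        if r & accept:
            ends.append(i)
    return ends


# Hypothetical edges for the example DAG: child -> list of parents
parents = {"A": ["D"], "B": ["D", "F"], "C": ["E"], "D": ["E"],
           "E": ["G"], "F": ["G"], "G": []}
print(pmtx_scan("ABCBDFCB", "ABEF", parents))
```

With these assumed edges, the text prefix A B C B matches the pattern A B E F, since C lies below E and B lies below F.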
Overview of retrieval system with PMTX algorithm
[Figure: the text from the text DB passes through a translator into the pattern matching machine, which is built from the pattern and the taxonomic information DB and outputs the occurrences]
We have to parse the text into a sequence of concepts:
– Replace Automaton (Arikawa and Shiraishi[1984]): an O(h+n) translator
– Or using a morphological parser for natural language texts, like ChaSen
The 8th summary
Method for multi-bytecode texts (Japanese texts)
– Embedding the code automaton into AC machine for synchronization
– Combining the code automaton that outputs mask bit sequences with bit-parallel methods
Toward an intelligent pattern matching:
– Pattern matching in consideration of the structure of the text
  – Pattern matching for XML texts
  – Pattern matching for arc-annotated texts
– Pattern matching in consideration of the meanings of the text (in cooperation with ontology data)
  – Pattern matching in consideration of taxonomic information
Prof. Arimura will take charge of this class from the next lecture
– Efficient data structure for information retrieval
– Data mining from the web, etc.
Karp-Rabin algorithm
It is a randomized algorithm using a hashing technique
– Matching a string by regarding it as an integer!
The worst case takes O(mn) time, but it takes O(n+m) time on average
The extra space we need is only O(1)
KARP R.M., RABIN M.O., Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2):249-260, 1987.
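∑ = { 0,1,2,…,9 }
Text:    2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1
Pattern: 3 1 4 1 5, and 31415 mod 13 = 7
Values mod 13 of the length-5 windows of the text, from left to right:
8 9 3 11 0 1 7 8 4 5 10 11 7 9 11
– The window 3 1 4 1 5 has the value 7: Correct!
– The window 6 7 3 9 9 also has the value 7: Wrong!
Sliding the window from 3 1 4 1 5 (value 7) to 1 4 1 5 2 (value 8) removes the highest figure in the previous step (3) and appends the lowest figure that is newly input (2):
14152 ≡ (31415 – 3×10000)×10 + 2 (mod 13)
      ≡ (7 – 3×3)×10 + 2 (mod 13)
      ≡ 8 (mod 13)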
Pseudo code
Karp-Rabin (P, T, d, q)
  m ← length[P].
  n ← length[T].
  h ← d^(m–1) mod q.
  p ← 0.
  t_0 ← 0.
  for i ← 1 to m do
    p ← (d・p + P[i]) mod q;
    t_0 ← (d・t_0 + T[i]) mod q.
  for s ← 0 to n – m do
    if p = t_s then
      if P[1…m] = T[s+1…s+m] then      /* Check if the candidate is the occurrence */
        report an occurrence at s;
    if s < n – m then                   /* the hash value is updated at every shift */
      t_{s+1} ← (d・(t_s – T[s+1]・h) + T[s+m+1]) mod q.
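The pseudo code can be exercised on the numeric example of the previous slide. The sketch below is a direct transcription under two assumptions: the strings consist of decimal digits (so d = 10 and each character is converted with `int`), and spurious hits are collected in a second list purely for illustration.

```python
def karp_rabin(pattern, text, d=10, q=13):
    """Return (occurrences, spurious_hits) as lists of 0-based shifts.

    `pattern` and `text` are strings of decimal digits.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], []
    h = pow(d, m - 1, q)                       # d^(m-1) mod q
    p = t = 0
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    matches, spurious = [], []
    for s in range(n - m + 1):
        if p == t:
            if text[s:s + m] == pattern:       # check if the candidate is an occurrence
                matches.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            # Drop the highest figure, shift, and append the newly read figure.
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return matches, spurious


print(karp_rabin("31415", "2359023141526739921"))
```

On the slide's text the true occurrence is found at shift 6, and the window 67399 at shift 12 is a spurious hit that the verification step rejects.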
Randomized approximate pattern matching using FFT
The Fast Fourier Transform (FFT) can be computed at high speed on hardware
They do (approximate) pattern matching by converting strings into sequences of numbers and then computing the score vector at high speed by FFT
K. Baba, A. Shinohara, M. Takeda, S. Inenaga, and S. Arikawa. A Note on Randomized Algorithm for String Matchingwith Mismatches. Nordic Journal of Computing, 10(1):2-12, 2003.
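Score vector: c_i = Σ_{j=1}^{m} δ(t_{i+j−1}, p_j), where δ(a, b) = 1 if a = b and δ(a, b) = 0 if a ≠ b
K. Baba ( Kyushu Univ. )
Example:
  T[i] = a c b a b b a c c b   (i: 1 2 3 4 5 6 7 8 9 10)
  P    = a b b a c
  [Figure: P is aligned with T at each of the positions i = 1, …, 6]
  Score vector c_i = 3 1 1 5 2 0   (i = 1, …, 6)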
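The score vector of the example can be computed with FFT-based convolution. Note that the sketch below is the plain deterministic variant (one convolution per pattern character), not the randomized algorithm of the cited paper; the small recursive `fft` and the name `score_vector` are written only for this illustration.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. No scaling is applied."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def score_vector(text, pattern):
    """c_i = number of positions j with text[i+j] == pattern[j], for every alignment i."""
    n, m = len(text), len(pattern)
    size = 1
    while size < n + m:
        size *= 2
    scores = [0] * (n - m + 1)
    for ch in set(pattern):
        # Indicator sequences of `ch`; the pattern is reversed so that the
        # convolution computes a correlation.
        t = [1.0 if c == ch else 0.0 for c in text] + [0.0] * (size - n)
        p = [1.0 if c == ch else 0.0 for c in reversed(pattern)] + [0.0] * (size - m)
        prod = fft([x * y for x, y in zip(fft(t), fft(p))], invert=True)
        for i in range(n - m + 1):
            scores[i] += round(prod[i + m - 1].real / size)
    return scores


print(score_vector("acbabbaccb", "abbac"))
```

An alignment whose score equals m is an exact occurrence, and m minus the score is the number of mismatches at that alignment.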