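Data Structures for Full-Text Search and the Efficiency of Their Construction

Kunihiko Sadakane (定兼邦彦)
Department of Information Science, Graduate School of Science, University of Tokyo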

http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/fulltext.ppt













Contents

• Comparison of data structures for full-text search
  – Search time
  – Disk space
  – Update time
• Search accuracy

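Background

• Spread of digitized documents
  – WWW, e-mail
  – Newspapers, dictionaries, books
  – Genome databases
• We want to search large volumes of text quickly
  – Without missing any matches
  – Returning only what is needed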

Full-text search algorithms

• sequential search
• signature file [Mooers 49]
  – Records which keywords each document contains
• inverted file [Bleier 67]
  – Records where each keyword occurs
• digital tree (trie)
  – Supports arbitrary keywords

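Data structures for inverted files

• sorted array
  – List of the positions where each keyword occurs
• prefix B-tree
  – Easy to update
• trie
  – Represents prefixes compactly

For each keyword, the documents and positions where it occurs are stored.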
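A minimal sketch of the idea, assuming whitespace-separated words and in-memory Python dictionaries (the slides do not prescribe an implementation, and the name `build_inverted_file` is hypothetical). It stores, for each keyword, the documents and word positions where it occurs, with the keywords kept in sorted order as in the sorted-array variant.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each keyword to a list of (document number, word position)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    # Keeping the keywords sorted corresponds to the sorted-array variant.
    return dict(sorted(index.items()))

docs = ["to be or not to be", "not to worry"]
index = build_inverted_file(docs)
print(index["to"])  # [(0, 0), (0, 4), (1, 1)]
```

Only whole keywords can be looked up this way; a substring such as "orr" is not a key of the index.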

Word indexes vs. Full-text indexes

Word indexes
• Fixed keywords only
  – Small size
  – Fast to build
• Data structures
  – sorted array
  – prefix B-tree
  – trie

Full-text indexes
• Arbitrary keywords
  – Large size
  – Slow to build
• Data structures
  – suffix array
  – String B-tree
  – suffix tree

Data structures for full-text indexes

• suffix tree [Weiner 73]
• suffix array [Manber, Myers 93]
• String B-tree [Ferragina, Grossi 95]

Suffix tree

・A compacted trie representing all suffixes of a string
・Can be built in linear time in memory
× Large size
× Unbalanced

[Figure: suffix tree of abab$, with edge labels ab, b, ab$, and $]

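Suffix array

・An array of pointers to all suffixes of a string, sorted in lexicographic order
・Space-efficient (5N)
× Slow to update

[Figure: the suffixes of abab$ and their start positions: abab$ (1), bab$ (2), ab$ (3), b$ (4)]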
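As an illustration, here is a small in-memory sketch of a suffix array and a keyword lookup by binary search. It is an assumption-laden toy: positions are 0-based (the figure uses 1-based positions), the suffix consisting of "$" alone is included, "$" is assumed to sort before every letter, and the naive `sorted` call stands in for the construction algorithms discussed later. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for suffixes that start with pattern."""
    p = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])         # occurrences in text order

text = "abab$"
sa = build_suffix_array(text)
print(sa)                                # [4, 2, 0, 3, 1]
print(find_occurrences(text, sa, "ab"))  # [0, 2]
```

Because any substring is a prefix of some suffix, the same lookup works for arbitrary keywords, which is what distinguishes a full-text index from a word index.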

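String B-tree

・Pointers to the suffixes organized as a B-tree
・Few disk accesses during search (blind trie)
・Good worst-case performance
・Easy insertion and deletion
・Size: 13N
× Slow to build from scratch

[Figure: String B-tree for abab$]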

I/O complexity

• I/O complexity of search
• I/O complexity of update
• I/O complexity of construction

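Search I/O complexity

・Suffix tree: a bound in $p$, $\mathit{occ}$, and $\log B$
  – Independent of N
○ String B-tree: $O\left(\frac{p + \mathit{occ}}{B} + \log_B N\right)$
・Suffix array: $O\left(\frac{p}{B}\log_2 N + \frac{\mathit{occ}}{B}\right)$

p: keyword length, occ: number of answers, N: string length, B: disk page size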

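Update I/O complexity

• Suffix tree: $O(p \log B)$
  – Independent of N
• String B-tree: $O(p \log_B (N + p))$
• Suffix array: a bound in $p$, $N$, and $\log_2 N$
  – When a large amount of text is added, there is no difference from the String B-tree

p: keyword length, N: string length, B: disk page size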

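Construction I/O complexity

◎ suffix tree (optimal): $O\left(\frac{N}{B}\log_{M/B}\frac{N}{M}\right)$
○ suffix array: $O\left(\frac{N}{B}\left(\log_{M/B}\frac{N}{M}\right)\log_2 N\right)$
・String B-tree: $O(N \log_B N)$

N: string length, M: memory size, B: disk page size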
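To get a feel for how far apart these three bounds are, the following sketch evaluates them for one made-up configuration. The values of N, M, and B are hypothetical, all three are assumed to be measured in characters, and the constant factors hidden by the O-notation are ignored, so only the ratios are meaningful.

```python
import math

def construction_io(N, M, B):
    """Evaluate the three construction bounds, ignoring constant factors."""
    merge_passes = math.log(N / M, M / B)        # log_{M/B}(N/M)
    return {
        "suffix tree": (N / B) * merge_passes,
        "suffix array": (N / B) * merge_passes * math.log2(N),
        "String B-tree": N * math.log(N, B),     # N log_B N
    }

# Hypothetical sizes: 1 G characters of text, 64 M of memory, 8 K disk pages.
bounds = construction_io(N=2**30, M=2**26, B=2**13)
for name, ios in bounds.items():
    print(f"{name:14s} {ios:.3g}")
```

For these values the suffix array bound exceeds the suffix tree bound by the factor $\log_2 N = 30$, while the String B-tree bound is larger by several orders of magnitude because it lacks the $1/B$ factor.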

Construction algorithms

• Suffix tree construction
  – In memory
  – On disk
• Suffix array construction
  – In memory
  – On disk

Suffix tree construction

• In memory (linear time)
  * Weiner 73
  * McCreight 76
  * Ukkonen 95
  * Farach 97
• divide and conquer, batch processing
• On disk
  * Farach, Ferragina, Muthukrishnan 98

Suffix tree construction on disk

• The algorithm is expressed in terms of sorting and scanning
• Same I/O complexity as sorting numbers (optimal): $O\left(\frac{N}{B}\log_{M/B}\frac{N}{M}\right)$ I/Os

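Sorting I/O complexity

• The following problems have the same I/O complexity as sorting
  – K lca queries on the nodes of a tree (K range minima)
  – The Euler tour ET(T) of a tree T and the depths of its nodes
  – K characters at arbitrary positions in a string
  – For each node of a tree, its marked descendants
  – Merging uncompacted tries
  – All suffix links of a suffix tree
  – Suffix tree construction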

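Block vs. Random I/O

• 2-way merge: $O\left(\frac{N}{B}\log_2\frac{N}{M}\right)$ block I/Os
• M/B-way merge: $O\left(\frac{N}{B}\log_{M/B}\frac{N}{M}\right)$ random I/Os

Lemma: a sorting algorithm that performs $o\left(\frac{N}{B}\log_{M/B}\frac{N}{M}\right)$ random I/Os requires $\frac{N}{B}\log_2\frac{N}{M}$ I/Os.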

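Problems with suffix trees on disk

The edges of the tree are represented as pointers into the string

Traversing the tree therefore causes random accesses

The classical algorithms are not suitable (divide and conquer is used instead)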

Algorithm outline

• Build the odd tree
• Build the even tree
• Merge them

[Figure: an example even tree and odd tree]

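Building the odd tree

• Treat each pair of consecutive characters as a single character, giving a string of length N/2
• Recursively build the suffix tree of the new string
• Convert the characters back

[Figure: abab$ is rewritten as AA$ (A = ab); the suffix tree of AA$ is built and converted back into the odd tree of abab$]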
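The first step can be sketched as follows. This is only an illustration of the renaming, not the I/O-efficient procedure: `sorted` stands in for a radix sort, the function name `rename_pairs` is hypothetical, and padding an odd-length text with an extra "$" is an assumption made so that the last character has a partner.

```python
def rename_pairs(text, pad="$"):
    """Replace each pair of consecutive characters by its rank among the distinct pairs."""
    if len(text) % 2:
        text += pad                      # give the last character a partner
    pairs = [text[i:i + 2] for i in range(0, len(text), 2)]
    # Ranks follow the lexicographic order of the pairs, so comparing two
    # suffixes of the new string agrees with comparing the odd suffixes.
    rank = {pair: r for r, pair in enumerate(sorted(set(pairs)))}
    return [rank[pair] for pair in pairs], rank

names, rank = rename_pairs("abab$")
print(names)  # [1, 1, 0]
print(rank)   # {'$$': 0, 'ab': 1}
```

With A standing for rank 1 and $ for rank 0, the result [1, 1, 0] is the string AA$, which has about half the length of the original.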

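Building the even tree

• Radix sort the even-numbered suffixes in lexicographic order
  – Sort key: (first character, lexicographic rank of the following odd-numbered suffix)
• Compute the lcp between adjacent suffixes
• Build the compacted trie

Example for abab$, where the odd-numbered suffixes $, ab$, abab$ have ranks 1, 2, 3:
  2: (b, ab$) = (b, 2)
  4: (b, $) = (b, 1)

[Figure: the odd tree and the resulting even tree for abab$]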
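The sort key can be sketched like this, using the 1-based positions of the slide. The odd-suffix ranks are passed in as a dictionary, which is an assumption of this sketch (in the algorithm they come from the odd tree), the function name is hypothetical, and `sorted` again stands in for a radix sort.

```python
def sort_even_suffixes(text, odd_rank):
    """Order even-positioned suffixes (1-based) by (first char, rank of next odd suffix)."""
    keys = {}
    for pos in range(2, len(text) + 1, 2):      # 1-based even positions
        first_char = text[pos - 1]
        # Rank 0 is used when no odd suffix follows (text of even length).
        keys[pos] = (first_char, odd_rank.get(pos + 1, 0))
    return sorted(keys, key=keys.get), keys

text = "abab$"
odd_rank = {5: 1, 3: 2, 1: 3}    # ranks of $, ab$, abab$
order, keys = sort_even_suffixes(text, odd_rank)
print(keys)   # {2: ('b', 2), 4: ('b', 1)}
print(order)  # [4, 2]
```

Suffix 4 (b$) precedes suffix 2 (bab$) because both start with b and the odd suffix $ ranks before ab$.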

Merging the odd and even trees

• Find the anchor pairs
• Split into side tree pairs
• Find the pull nodes
• Find the merge nodes
• Merge Te and To

In-memory construction of suffix arrays

• quick sort
  × Very slow because it compares whole strings
• ternary partitioning [Bentley, Sedgewick 97]
  ○ Few wasted string comparisons
  × Can become extremely slow
• doubling algorithm
  – Manber, Myers 93
  – Sadakane, Imai 98
• Fastest in most cases
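A sketch of ternary partitioning applied to suffixes, in the spirit of [Bentley, Sedgewick 97]. It is simplified for readability: it builds new lists instead of partitioning in place, picks the middle element as the pivot, and uses plain recursion, all of which are choices of this sketch rather than of the cited paper.

```python
def ternary_suffix_sort(text):
    """Sort suffix start positions by ternary partitioning on one character at a time."""
    def char_at(i, depth):
        # Character of suffix i at offset depth; -1 marks the end of the text.
        return ord(text[i + depth]) if i + depth < len(text) else -1

    def sort(suffixes, depth):
        if len(suffixes) <= 1:
            return suffixes
        pivot = char_at(suffixes[len(suffixes) // 2], depth)
        less = [i for i in suffixes if char_at(i, depth) < pivot]
        equal = [i for i in suffixes if char_at(i, depth) == pivot]
        greater = [i for i in suffixes if char_at(i, depth) > pivot]
        # Only the "equal" part moves on to the next character, so characters
        # already known to match are never compared again.
        if pivot != -1:
            equal = sort(equal, depth + 1)
        return sort(less, depth) + equal + sort(greater, depth)

    return sort(list(range(len(text))), 0)

text = "tobeornottobe$"
print(ternary_suffix_sort(text))
```

On highly repetitive text the recursion on the "equal" part goes very deep, which is one way to see why this method can become extremely slow.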

Doubling algorithm

• Karp, Miller, Rosenberg 72
• String sorting on disk [Arge et al. 97]
• Substrings of length 1, 2, 4, … are converted into numbers
  – Strings can be distinguished with log n comparisons

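Suffix sorting by doubling (1/5)

• Partition the suffixes into groups by their first character
  – Number the groups
• Split each group by the second character of its suffixes
  – Update the numbers (different numbers ⇔ the first 2 characters differ)
• Split each group by the third and fourth characters of its suffixes
  – Sort by group number
• Repeat until the order of all suffixes is determined
  – Skip groups whose order is already determined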
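The steps above can be sketched as follows, using the arrays I (suffix positions in sorted order) and V (group number of each suffix) and the string tobeornottobe$ from the worked example. This is a simplified sketch: it re-sorts the whole array in every round with Python's built-in sort instead of skipping groups that are already sorted, and the helper name `number_groups` is hypothetical.

```python
def number_groups(I, keys):
    """Give each suffix the index in I of the first member of its group."""
    V = [0] * len(I)
    group = 0
    for idx, pos in enumerate(I):
        if idx > 0 and keys[pos] != keys[I[idx - 1]]:
            group = idx                  # a new group starts here
        V[pos] = group
    return V

def suffix_array_by_doubling(text):
    n = len(text)
    keys = [ord(c) for c in text]
    I = sorted(range(n), key=keys.__getitem__)   # group by first character
    V = number_groups(I, keys)
    h = 1
    while len(set(V)) < n:                       # some group has several suffixes
        # The pair (V[i], V[i+h]) identifies the first 2h characters of suffix i.
        keys = [(V[i], V[i + h] if i + h < n else -1) for i in range(n)]
        I.sort(key=keys.__getitem__)
        V = number_groups(I, keys)
        h *= 2
    return I

print(suffix_array_by_doubling("tobeornottobe$"))
# [13, 11, 2, 12, 3, 6, 10, 1, 4, 7, 5, 9, 0, 8]
```

Each round doubles the number of characters taken into account, so at most about log2 n rounds are needed.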

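Suffix sorting by doubling (2/5)

Text: tobeornottobe$

i           0  1  2  3  4  5  6  7  8  9 10 11 12 13
I[i]       13  2 11  3 12  6  1  4  7 10  5  0  8  9
V[I[i]]     0  1  1  3  3  5  6  6  6  6 10 11 11 11
V[I[i]+1]   -  3  3  6  0  -  1 10 11  1  -  6 11  6

Sort by the first 2 characters (V[I[i]+1] is listed only for groups that still contain more than one suffix)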

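Suffix sorting by doubling (3/5)

i           0  1  2  3  4  5  6  7  8  9 10 11 12 13
I[i]       13  2 11 12  3  6  1 10  4  7  5  0  9  8
V[I[i]]     0  1  1  3  4  5  6  6  8  9 10 11 11 13
V[I[i]+2]   -  8  0  -  -  -  4  3  -  -  -  1  1  -

Sort by the first 4 characters (V[I[i]+2] is listed only for groups that still contain more than one suffix)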

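Suffix sorting by doubling (4/5)

i           0  1  2  3  4  5  6  7  8  9 10 11 12 13
I[i]       13 11  2 12  3  6 10  1  4  7  5  0  9  8
V[I[i]]     0  1  2  3  4  5  6  7  8  9 10 11 11 13
V[I[i]+4]   -  -  -  -  -  -  -  -  -  -  -  8  0  -

Sort by the first 8 characters (V[I[i]+4] is listed only for the one group that still contains more than one suffix)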

Suffix sorting by doubling (5/5)

i           0  1  2  3  4  5  6  7  8  9 10 11 12 13
I[i]       13 11  2 12  3  6 10  1  4  7  5  9  0  8
V[I[i]]     0  1  2  3  4  5  6  7  8  9 10 11 12 13

Sorting complete

Suffix array construction on disk

• Gonnet, Baeza-Yates, Snider 92
  – Accesses the disk sequentially only
  – I/Os proportional to $N^2/(MB)$
• Crauser, Ferragina 98
  – Doubling algorithm + discarding
  – $O\left(\frac{N}{B}\left(\log_{M/B}\frac{N}{M}\right)\log_2 N\right)$ I/Os

Doubling algorithm + discarding

• Run the doubling algorithm on disk
• $\log_2 N$ iterations
• Use M/B-way merge sort

Difference from the in-memory version
• Parts that are already sorted are skipped

Word indexes vs. Full-text indexes

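Coverage

Word indexes
• Only the beginnings of words are indexed
  – The amount of data is about 1/7 (for both Japanese and English)
• Matches may be missed
  – Morphological analysis is required
  – Cannot be used for DNA sequences

Full-text indexes
• All substrings are indexed
• Good at finding long strings
• Search results contain noise
  – A search for 京都 (Kyoto) also matches 東京都 (Tokyo-to)
  – A search for ルパン (Lupin) also matches ダブルパンチ (double punch)
  – A search for はらだ (Harada) also matches はらだたしい (infuriating)
  – Avoidable with AND search?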

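Advantages and disadvantages of full-text indexes

・Search results are pointers into the string
  △ Pointers must be converted to document numbers
  ○ Can be used as an ultra-fast grep
  × Large size
・A word index can be built from a full-text index
  – Scan the text
  – Mark the index entries that are not needed
  – Scan the index and delete the marked entries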
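The three steps can be sketched as below, taking a suffix array as the full-text index. The rule used to decide which entries are needed (an alphanumeric character that does not follow another alphanumeric character starts a word) is an assumption that suits space-separated English; Japanese text would need morphological analysis instead. The function name is hypothetical.

```python
def word_index_from_suffix_array(text, sa):
    """Keep only the suffix array entries that point at the start of a word."""
    # Steps 1 and 2: scan the text and mark the positions that are not needed.
    unneeded = [True] * len(text)
    for i, ch in enumerate(text):
        if ch.isalnum() and (i == 0 or not text[i - 1].isalnum()):
            unneeded[i] = False
    # Step 3: scan the index and drop the marked entries, preserving the order.
    return [pos for pos in sa if not unneeded[pos]]

text = "to be or not to be"
sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive full-text index
print(word_index_from_suffix_array(text, sa))  # [16, 3, 9, 6, 13, 0]
```

The surviving entries are still in lexicographic order, so the result can be searched exactly like the original index.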

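Open issues

• Eliminating noise from search results
  – AND search?
• Using a thesaurus
  – OR search
• Searching structured documents
  – For example, searching headings only
• Speed of data collection
  – Send the original documents in compressed form
  – Send only the word index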

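Unifying compression and search

• Block sorting compression [Burrows, Wheeler 94]
  – Rearranges the string according to its suffix array and then compresses it

When text is transferred, it is enough to compress it with block sorting [Sadakane, Imai 98a]

Decompression recovers both the string and its suffix array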
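The following sketch shows only the rearrangement step and its inverse, without the compression stage that follows it in a real block sorting compressor. It assumes the text ends with a unique terminator "$" that sorts before every other character, and the function names are hypothetical.

```python
def bwt_from_suffix_array(text):
    """Block sorting transform: the character preceding each suffix, in suffix array order."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # For the suffix starting at 0, text[-1] wraps around to the terminator.
    return "".join(text[i - 1] for i in sa), sa

def inverse_bwt(last):
    """Recover both the text and its suffix array from the transformed string."""
    n = len(last)
    # A stable sort of the transformed string pairs the k-th row with the row
    # of the suffix that starts one position later in the text.
    order = sorted(range(n), key=lambda i: last[i])
    sa = [0] * n
    chars = []
    row = order[0]                    # row of the suffix starting at position 0
    for pos in range(n):
        sa[row] = pos
        chars.append(last[order[row]])    # first character of that suffix
        row = order[row]
    return "".join(chars), sa

text = "tobeornottobe$"
last, sa = bwt_from_suffix_array(text)
restored_text, restored_sa = inverse_bwt(last)
print(restored_text == text, restored_sa == sa)  # True True
```

Since the suffix array falls out of decoding, a receiver of the compressed text obtains a full-text index without sorting the suffixes again.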

Acknowledgments

We thank Masanori Harada (原田昌紀) and Takayuki Nakamura (中村隆幸) of NTT for their valuable comments.

References (1/3)

[1] L. Arge, P. Ferragina, R. Grossi, and J.S. Vitter. On sorting strings in external memory. In ACM Symposium on Theory of Computing, pp. 540--548, 1997. http://www.cs.duke.edu/~jsv

[2] J.L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360--369, 1997. http://www.cs.princeton.edu/~rs/strings/

[3] M. Burrows and D.J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical Report 124, Digital SRC Research Report, 1994. http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html

[4] A. Crauser and P. Ferragina. External memory construction of full-text indexes. In DIMACS Workshop on External Memory Algorithms and/or Visualization, 1998. http://www.di.unipi.it/~ferragin

[5] M. Farach. Optimal Suffix Tree Construction with Large Alphabets. In 38th Symp. on Foundations of Computer Science, pp. 137--143, 1997. http://www.cs.rutgers.edu/~farach

References (2/3)

[6] P. Ferragina and R. Grossi. The String B-Tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 1998. http://www.di.unipi.it/~ferragin

[7] G.H. Gonnet, R. Baeza-Yates, and T. Snider. New Indices for Text: PAT trees and PAT arrays. In W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Algorithms and Data Structures, chapter 5, pp. 66--82. Prentice-Hall, 1992. http://www.dcc.uchile.cl/~rbaeza

[8] R.M. Karp, R.E. Miller, and A.L. Rosenberg. Rapid identification of repeated patterns in strings, arrays and trees. In 4th ACM Symposium on Theory of Computing, pp. 125--136, 1972.

[9] U. Manber and G. Myers. Suffix arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, Vol. 22, No. 5, pp. 935--948, October 1993.

References (3/3)

[10] E.M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, Vol. 23, No. 2, pp. 262--272, 1976.

[11] K. Sadakane and H. Imai. A Cooperative Distributed Text Database Management Method Unifying Search and Compression Based on the Burrows-Wheeler Transformation. In Proceedings of NewDB’98, 1998. http://naomi.is.s.u-tokyo.ac.jp/~sada

[12] K. Sadakane and H. Imai. Constructing Suffix Arrays of Large Texts. In Proceedings of DEWS'98, 1998. http://naomi.is.s.u-tokyo.ac.jp/~sada

[13] E. Ukkonen. On-line construction of suffix trees. Algorithmica, Vol. 14, No. 3, pp. 249--260, September 1995.

[14] P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pp. 1--11, 1973.
