39
Introduction to Information Retrieval Lecture 1 : Full-text Indexing 楊楊楊楊楊 楊楊楊楊楊楊楊 [email protected] 楊楊楊楊楊楊楊 Introduction to Information Retrieval 楊楊楊楊 楊 Ch 1 & 2 1

Lecture 1 : Full-text Indexing

  • Upload
    ranger

  • View
    79

  • Download
    0

Embed Size (px)

DESCRIPTION

Lecture 1 : Full-text Indexing. 楊立偉教授 台灣科大資管系 [email protected] 本投影片修改自 Introduction to Information Retrieval 一書之 投影片 Ch 1 & 2. Definition of information retrieval. Information retrieval (IR) is finding material (usually documents) of - PowerPoint PPT Presentation

Citation preview

Page 1: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Lecture 1 : Full-text Indexing楊立偉教授

台灣科大資管系[email protected]

本投影片修改自 Introduction to Information Retrieval 一書之投影片 Ch 1 & 2

1

Page 2: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

2

Definition of information retrieval

Information retrieval (IR) is finding material (usually documents) ofan unstructured nature (usually text) that satisfies an informationneed from within large collections (usually stored on computers).

2

Page 3: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

3

Boolean retrieval

The Boolean model is arguably the simplest model to base an information retrieval system on.

Queries are Boolean expressions, e.g., CAESAR AND BRUTUS

The search engine returns all documents that satisfy theBoolean expression.

3

Page 4: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Inverted Index

4

Page 5: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

5

Unstructured data in 1650

Which plays of Shakespeare contain the words BRUTUS AND CAESAR, but not CALPURNIA ?

One could grep all of Shakespeare’s plays for BRUTUS and CAESAR, then strip out lines containing CALPURNIA

Why is grep not the solution? Slow (for large collections) grep is line-oriented, IR is document-oriented “NOT CALPURNIA” is non-trivial Other operations (e.g., find the word ROMANS near

COUNTRYMAN ) not feasible5

Page 6: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

6

Term-document incidence matrix

Entry is 1 if term occurs. Example: CALPURNIA occurs in Julius Caesar. Entry is 0 if term doesn’t occur. Example: CALPURNIAdoesn’t occur in The tempest.

6

Anthony and Cleopatra

Julius Caesar

The Tempest

Hamlet Othello Macbeth . . .

ANTHONYBRUTUS CAESARCALPURNIACLEOPATRAMERCYWORSER. . .

1110111

1111000

0000011

0110011

0010011

1010010

Page 7: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

7

Incidence vectors

So we have a 0/1 vector for each term. To answer the query BRUTUS AND CAESAR AND NOT CALPURNIA:

Take the vectors for BRUTUS, CAESAR AND NOT CALPURNIA Complement the vector of CALPURNIA Do a (bitwise) and on the three vectors 110100 AND 110111 AND 101111 = 100100

7

Page 8: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

8

0/1 vector for BRUTUS

8

Anthony and Cleopatra

Julius Caesar

The Tempest

Hamlet Othello Macbeth . . .

ANTHONYBRUTUS CAESARCALPURNIACLEOPATRAMERCYWORSER. . .

1110111

1111000

0000011

0110011

0010011

1010010

result: 1 0 0 1 0 0

答案是 Antony and Cleopatra 與 Hamlet

Page 9: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

9

Bigger collections

Consider N = 106 documents, each with about 1000 tokens ⇒ total of 109 tokens (10 億 )

average 6 bytes per token, including spaces / punctuation ⇒ size of document collection is about 6 ・ 109 = 6 GB

Assume there are M = 500,000 distinct terms in the collection

9

Page 10: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

10

Can’t build the incidence matrix

M = 500,000 × 106 = half a trillion 0s and 1s. (5000 億 ) But the matrix has no more than 109 1s.

Matrix is extremely sparse. (only 10/5000 has values) What is a better representations?

We only record the 1s.

10

Page 11: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

11

Inverted Index

For each term t, we store a list of all documents that contain t.

11

dictionary (sorted) postings

Page 12: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

• Linked lists generally preferred to arrays– 優點 Dynamic space allocation– 優點 Insertion of terms into documents easy– 缺點 Space overhead of pointers

12

Inverted Index

Page 13: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

TokenizerToken stream. Friends Romans Countrymen

Linguistic modules

Modified tokens.正規化…等處理friend roman countryman

Indexer

Inverted index.

friendromancountryman

2 42

13 161

之後會討論

Documents tobe indexed. Friends, Romans, countrymen.

Inverted Index construction

Page 14: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

14

(1) Tokenizing and preprocessing

14

Page 15: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

15

(2) Generate posting

15

Page 16: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

16

(3) Sort postings

16

Sorting by docID is important

Page 17: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

17

(4) Create postings lists

17

Group by term and docID

Get the document frequency

Page 18: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

18

(5) Split the result into dictionary and postings

18

dictionary postings

Page 19: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Processing Boolean queries

19

Page 20: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

20

Simple conjunctive query (two terms)

Consider the query: BRUTUS AND CALPURNIA

To find all matching documents using inverted index:❶ Locate BRUTUS in the dictionary❷ Retrieve its postings list from the postings file❸ Locate CALPURNIA in the dictionary❹ Retrieve its postings list from the postings file❺ Intersect the two postings lists❻ Return intersection to user

20

Page 21: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

21

Intersecting two posting lists

This is linear in the length of the postings lists.→ If the list lengths are x and y, the merge takes O(x+y)

Note: This only works if postings lists are sorted.

21

Page 22: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

22

Intersecting two posting lists

22

Page 23: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

23

Query processing: Exercise

Compute hit list for ((paris AND NOT france) OR lear)

23

Page 24: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

24

Boolean queries The Boolean retrieval model can answer any query that is a

Boolean expression. Boolean queries are queries that use AND, OR and NOT to join

query terms. 可包含括號 Views each document as a set of terms. 依集合理論 Is precise: Document matches condition or not.

Primary commercial retrieval tool for 3 decades Professional searchers (e.g., lawyers) like Boolean queries.

You know exactly what you are getting.

24

Page 25: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

25

Westlaw: Example queries

Largest commercial (paying subscribers) legal search service since 1975; Tens of terabytes of data; 700,000 users

Information need: legal theories involved in preventing the disclosure of trade secrets by employees Query: “trade secret” /s disclos! /s prevent /s employe!

Information need: Requirements for disabled people to be able to access a workplace Query: disab! /p access! /s work-place (employment /3 place)

25

Page 26: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

26

Westlaw: Comments Proximity operators: /3 = within 3 words, /s = within a

sentence, /p = within a paragraph Why professional searchers often like Boolean search:

precision, transparency, control Long, precise queries: incrementally developed

(not like web search)

26

Page 27: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Query optimization

27

Page 28: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

28

ProblemGiven an Boolean formula?(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

• How to get the answer ?– process from the left, or from the right– "NOT"

• How to do it faster?

Page 29: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

29

Query optimization

Consider a query that is an and of n terms, n > 2 For each of the terms, get its postings list,

then and them together Example query: BRUTUS AND CALPURNIA AND CAESAR

What is the best order for processing this query?

29

Page 30: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

30

Query optimization

Example query: BRUTUS AND CALPURNIA AND CAESAR

Simple and effective optimization: Process in order of increasing frequency

Start with the shortest postings list, then keep cutting further In this example, first CAESAR, then CALPURNIA, then BRUTUS

30

Page 31: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

31

Optimized intersection algorithm

31

每次做最短的,其餘以遞迴處理

Page 32: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

32

More general optimization

Example query: (tangerine OR trees) AND (marmalade OR skies)

Get frequencies for all terms Estimate the size of the answer

or : get the sum of its frequenciesand : get the min. value of its frequencies

Process in increasing order of or sizes

32

Page 33: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

33

Exercise

• Recommend a query processing order for

(tangerine OR trees) AND(marmalade OR skies) AND(kaleidoscope OR eyes)

Term Freq eyes 213312 kaleidoscope 87009 marmalade 107913 skies 271658 tangerine 46653 trees 316812

Page 34: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Excerise

34

Page 35: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Inverted indexing (1)

• 傳回查詢結果為 Doc1

Doc1:台灣大哥台Doc2:台灣大學

1 2 3 4 5 pos

1: 1

1: 2

1: 3

1: 4

1: 5 2: 1

2: 2

2: 3

2: 4

3

2

2

1

1

Dictionaryin Hashtableor Tree/Trie

被索引文件逐一讀入

Inverted List (posting)

查詢 台大 and 大哥 (容錯距離=1)

1: 1

1: 2

1: 3

1: 4

1: 5 2: 1

2: 2

2: 3

2: 4

3

2

2

1

1

1. 判斷應由(大)(哥)先合併,再做(台)(大)合併2. 各自合併之結果,再合併

Page 36: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Inverted indexing (2)• 當文件讀入完畢,或是記憶體不夠時,必需寫入磁碟(序列化)

Doc1:台灣大哥台Doc2:台灣大學

1 2 3 4 5 pos被索引文件逐一讀入

1: 1

1: 2

1: 3

1: 4

1: 5 2:1

2: 2

2: 3

2: 4

3

2

2

1

1

Dictionaryin Hashtableor Tree/Trie

Inverted List (posting)

台 1: 1 1: 5 2: 13 灣 1: 2 2: 22 大 1: 3 2: 32 哥 1: 41 學 2: 41

2 + 7 x 4 = 30 bytes 22 bytes 22 bytes 14 bytes 14 bytes

Page 37: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Inverted indexing (3) Design issue• 為了很快跳至某個 term list ,於檔案起始處增加 Dictionary index

• 實作設計– 自行寫入 Dictionary 與 term list ,並於新增修改時做維護 ( 需做 defragment)

– 採用 Persistent Dictionary 套件

台 1: 1 1: 5 2: 13 灣 1: 2 2: 22 大 1: 3 2: 32 哥 1: 41 學 2: 41

2 + 7 x 4 = 30 bytes 22 bytes 22 bytes 14 bytes 14 bytes

台 0 灣 30 大 52 哥 74 學 88

Page 38: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Inverted Indexing (4) Compression• 能否節省索引空間?

– 略去不重要的 term (ex. stop words)– Compression

• 用時間換取空間,若 CPU 資源過多,且較磁碟存取快,則壓縮有利提升整體系統速度

• 目前磁碟傳輸都是透過 system bus 完成,因此 CPU 是可以在磁碟傳輸時另外做些事的

– Compression 1: dictionary 中較常用的 term 用較少的 bit 來表示• main idea: frequency v.s. variable-length

– Compression 2: term list 內的位置改存 offset

Page 39: Lecture 1 : Full-text Indexing

Introduction to Information Retrieval

Conclusion• Why is Search engine much faster than Relational

database ?– Relational database : row-oriented and B-tree index– Search engine : column-oriented (i.e. term) and Inverted

index

• Inverted index is much faster than B-tree index– any limitation ?

39