
Chapter 1 Introduction

Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering, National Taiwan University



Motivation

• Information retrieval
  – Representation, storage, organization, and access
  – To retrieve information which might be useful or relevant to the user

• Information need (vs. query)
  – Find all the pages containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament. To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.


Information versus Data Retrieval

• Data retrieval
  – Determine which documents of a collection contain the keywords in the user query
  – Retrieve all objects which satisfy clearly defined conditions in a regular expression or relational algebra expression
  – Data has a well-defined structure and semantics
  – Solution to the user of a database system

• Information retrieval


Database Management

• A specified set of attributes is used to characterize each item.

    EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)

• Exact match between the attributes used in query formulations and those attached to the record.

    SELECT BDATE, ADDR
    FROM EMPLOYEE
    WHERE NAME = 'John Smith'
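The exact-match behaviour of this query can be tried out with a minimal sketch using Python's standard `sqlite3` module. The table follows the EMPLOYEE schema above; the two rows and the `lookup` helper are invented for illustration and are not part of the slides.

```python
import sqlite3

# In-memory database with the EMPLOYEE schema from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE "
    "(NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)
# Hypothetical rows, invented for illustration only.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000-00-0001", "1965-01-09", "1 Example Rd", "M", 30000, 5),
        ("Jon Smith", "000-00-0002", "1970-03-02", "2 Sample St", "M", 28000, 4),
    ],
)

def lookup(name):
    """Return (BDATE, ADDR) for every record whose NAME equals `name` exactly."""
    cur = conn.execute("SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", (name,))
    return cur.fetchall()
```

With SQLite's default collation, `lookup("John Smith")` returns the single matching record, while `lookup("john smith")` and `lookup("Smith")` return nothing: a record either satisfies the condition or it does not, and there is no notion of a partial or ranked match.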


Basic Concepts

• Content identifiers (keywords, index terms, descriptors) characterize the stored texts.
• Degrees of coincidence between the sets of identifiers attached to queries and documents

[Figure: query formulation (the user task) and content analysis (the logical view of the documents)]
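One way to make "degree of coincidence" concrete is a set-overlap measure. The sketch below uses the Jaccard coefficient, which is an assumption made here for illustration; the slides do not name a particular measure, and the function name `coincidence` is hypothetical.

```python
def coincidence(query_terms, doc_terms):
    """Jaccard coefficient between two sets of identifiers, from 0.0 to 1.0."""
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0  # two empty sets: define the overlap as zero
    return len(q & d) / len(q | d)
```

For example, a query with identifiers {tennis, ncaa} and a document with {tennis, ncaa, ranking, coach} share two of four distinct identifiers, giving 0.5. Unlike the exact match of a database query, the result is graded rather than yes/no.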


The User Task

• Convey the semantics of information need

• Retrieval and browsing

[Figure: the user interacts with the database through retrieval and browsing]


Logical View of Documents

• Full text representation

• A set of index terms
  – Elimination of stop-words
  – The use of stemming
  – The identification of noun groups
  – …
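The first two text operations can be sketched as follows. The stop-word list and the suffix-stripping rule are toy assumptions chosen for illustration; a real system would use a fuller stop list and a proper stemmer, and noun-group identification is not attempted here.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "which"}
# Naive stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Reduce a full text to a sorted list of distinct index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({naive_stem(t) for t in tokens if t not in STOP_WORDS})
```

Calling `index_terms("Indexing the documents in the collections")` yields `['collection', 'document', 'index']`: the logical view shrinks from the full text to a small set of terms.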


From full text to a set of index terms

[Figure: a document passes through structure recognition, handling of accents and spacing, stopword elimination, noun-group identification, stemming, and automatic or manual indexing, moving from text + structure and full text toward a set of index terms]


Indexing

• Indexing: assign identifiers to text items.
• Assign: manual vs. automatic indexing
• Identifiers:
  – objective vs. nonobjective text identifiers: cataloging rules define, e.g., author names, publisher names, dates of publication, …
  – controlled vs. uncontrolled vocabularies: instruction manuals, terminological schedules, …
  – single-term vs. term phrase


The retrieval process

[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Flows between them: user need, user feedback, text, logical view, query, retrieved documents, ranked documents.]


Information Retrieval

• Generic information retrieval system
  select and return to the user desired documents from a large set of documents in accordance with criteria specified by the user

• Functions
  – document search: the selection of documents from an existing collection of documents
  – document routing: the dissemination of incoming documents to appropriate users on the basis of user interest profiles


Detection Need

• Definition
  a set of criteria specified by the user which describes the kind of information desired.
  – queries in document search task
  – profiles in routing task

• Forms
  – keywords
  – keywords with Boolean operators
  – free text
  – example documents
  – ...
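As an illustration of the "keywords with Boolean operators" form, the sketch below evaluates a Boolean Detection Need against a document's set of terms. The nested-tuple query syntax and the function name `matches` are assumptions made for this example only.

```python
def matches(query, terms):
    """Evaluate a Boolean query against a document's set of terms.

    A query is either a keyword (str) or a tuple whose first element is
    "AND", "OR", or "NOT", followed by sub-queries.
    """
    if isinstance(query, str):
        return query in terms
    op, *args = query
    if op == "AND":
        return all(matches(a, terms) for a in args)
    if op == "OR":
        return any(matches(a, terms) for a in args)
    if op == "NOT":
        return not matches(args[0], terms)
    raise ValueError(f"unknown operator: {op}")
```

For instance, `("AND", "tennis", ("OR", "ncaa", "college"), ("NOT", "high-school"))` matches a document with terms {tennis, ncaa, coach} but not one with {tennis, coach}.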


Example

<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description:
Document must identify a company who has the capability to produce document management system by obtaining a turnkey-system or by obtaining and integrating the basic components.
<narr> Narrative:
To be relevant, the document must identify a turnkey document management system or components which could be integrated to form a document management system and the name of either the company developing the system or the company using the system. These components are: a computer, image scanner or optical character recognition system, and an information retrieval or text management system.
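Because each field of the topic starts with a tag and has no closing tag, the fields can be separated by splitting on the tags. The sketch below is one possible approach, not the Tipster tooling; the function name `parse_topic` is hypothetical and the sample text is abbreviated from the topic above.

```python
import re

# Abbreviated copy of the first fields of the topic shown above.
TOPIC = """<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
"""

def parse_topic(text):
    """Map each tag name to the whitespace-normalized text that follows it."""
    fields = {}
    # With a capture group, re.split keeps the tag names in the result:
    # [text before first tag, tag1, body1, tag2, body2, ...]
    parts = re.split(r"<(\w+)>", text)
    for tag, body in zip(parts[1::2], parts[2::2]):
        fields[tag] = " ".join(body.split())
    return fields
```

Here `parse_topic(TOPIC)["num"]` is `"Number: 033"`. If a tag occurs more than once, this sketch keeps only the last occurrence.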


Example (Continued)

<con> Concepts:
1. document management, document processing, office automation, electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management-The creation, storage and retrieval of documents containing, text, images, and graphics.
Image Scanner-A device that converts a printed image into a video image, without recognizing the actual content of the text or pictures.
Optical Disk-A disk that is written and read by light, and are sometimes associated with the storage of digital images because of their high storage capacity.


search vs. routing

• The search process matches a single Detection Need against the stored corpus to return a subset of documents.

• Routing matches a single document against a group of Profiles to determine which users are interested in the document.

• Profiles are long-term expressions of user needs.
• Search queries are ad hoc in nature.
• A generic detection architecture can be used for both search and routing.


Search

• retrieval of desired documents from an existing corpus

• Retrospective search is frequently interactive.

• Methods

– index the corpus by keyword, stem, and/or phrase

– apply statistical and/or learning techniques to better understand the content of the corpus

– analyze free text Detection Needs to compare with the indexed corpus or a single document

– ...


Document Detection: Search


Document Detection: Search (Continued)

• Document Corpus
  – the content of the corpus may have a significant effect on performance in some applications

• Preprocessing of Document Corpus
  – stemming
  – a list of stop words
  – phrases, multi-term items
  – ...


Document Detection: Search (Continued)

• Building Index from Stems
  – key place for optimizing run-time performance
  – cost to build the index for a large corpus

• Document Index
  – a list of terms, stems, phrases, etc.
  – frequency of terms in the document and corpus
  – frequency of the co-occurrence of terms within the corpus
  – index may be as large as the original document corpus
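The statistics listed for the document index can be sketched with a few dictionaries. The layout below (postings, corpus frequencies, and document-level co-occurrence counts) is an assumption chosen for clarity, not a description of any particular detection system, and `build_index` is a hypothetical name.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_index(docs):
    """docs maps a document id to its list of already preprocessed terms."""
    postings = defaultdict(dict)  # term -> {doc_id: frequency in that document}
    corpus_freq = Counter()       # term -> frequency over the whole corpus
    cooccurrence = Counter()      # (term_a, term_b) -> number of documents containing both
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        corpus_freq.update(counts)
        for term, tf in counts.items():
            postings[term][doc_id] = tf
        # Each unordered pair of distinct terms in the document, in sorted order.
        for pair in combinations(sorted(counts), 2):
            cooccurrence[pair] += 1
    return postings, corpus_freq, cooccurrence
```

The co-occurrence table grows with the square of the number of distinct terms per document, which is one reason an index can approach the size of the corpus itself.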


Document Detection: Search (Continued)

• Detection Need
  – the user's criteria for a relevant document

• Convert Detection Need to System Specific Query
  – first transformed into a detection query, and then a retrieval query
  – detection query: specific to the retrieval engine, but independent of the corpus
  – retrieval query: specific to the retrieval engine, and to the corpus
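The two-stage conversion can be illustrated with a toy pair of functions. What each stage does here (stop-word removal, then restriction to the corpus vocabulary) is an assumption made for the example; the slides only state that the first stage is corpus-independent and the second is corpus-specific.

```python
def to_detection_query(need, stop_words):
    """Engine-specific but corpus-independent: tokenize and drop stop words."""
    return [w for w in need.lower().split() if w not in stop_words]

def to_retrieval_query(detection_query, vocabulary):
    """Engine- and corpus-specific: keep only terms present in this corpus index."""
    return [t for t in detection_query if t in vocabulary]
```

With stop words {the} and a corpus vocabulary {tennis, coach}, the need "the tennis team coach" becomes the detection query [tennis, team, coach] and then the retrieval query [tennis, coach]. The same detection query could be reused against a different corpus.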


Document Detection: Search (Continued)

• Compare Query with Index

• Resultant Rank Ordered List of Documents
  – Return the top 'N' documents
  – Rank the list of relevant documents from the most relevant to the query to the least relevant
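Comparing a query with the index and returning the top N documents can be sketched as below. The scoring rule (summed term frequency of the query terms) is a deliberately simple assumption, and `top_n` is a hypothetical name; real systems use more refined weighting.

```python
import heapq

def top_n(query_terms, index, n):
    """Rank documents by summed frequency of the query terms and return the best n.

    index maps a term to {doc_id: term frequency}.
    """
    scores = {}
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + tf
    # Highest score first; ties are broken by document id for a stable order.
    return heapq.nsmallest(n, scores.items(), key=lambda item: (-item[1], item[0]))
```

Given the index `{"tennis": {"d1": 2, "d2": 1}, "coach": {"d2": 3, "d3": 1}}`, the query [tennis, coach] with N = 2 returns `[("d2", 4), ("d1", 2)]`.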


Routing


Routing (Continued)

• Profile of Multiple Detection Needs
  – A Profile is a group of individual Detection Needs that describes a user's areas of interest.
  – All Profiles will be compared to each incoming document (via the Profile index).
  – If a document matches a Profile, the user is notified about the existence of a relevant document.


Routing (Continued)

• Convert Detection Need to System Specific Query

• Building Index from Queries
  – similar to building the corpus index for searching
  – the quantity of source data (Profiles) is usually much less than a document corpus
  – Profiles may have more specific, structured data in the form of SGML tagged fields


Routing (Continued)

• Routing Profile Index
  – The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system.

• Document to be routed
  – A stream of incoming documents is handled one at a time to determine where each should be directed.
  – Routing implementation may handle multiple document streams and multiple Profiles.


Routing (Continued)

• Preprocessing of Document
  – A document is preprocessed in the same manner that a query would be set up in a search
  – The document and query roles are reversed compared with the search process

• Compare Document with Index
  – Identify which Profiles are relevant to the document
  – Given a document, which of the indexed profiles match it?


Routing (Continued)

• Resultant List of Profiles
  – The list of Profiles identifies which users should receive the document
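The routing steps can be sketched with the roles of document and query reversed: the Profiles are indexed, and each incoming document is looked up against that index. The matching rule (a fixed fraction of a Profile's terms must appear in the document) and the names `build_profile_index` and `route` are assumptions made for illustration.

```python
from collections import defaultdict

def build_profile_index(profiles):
    """profiles maps a user to a set of terms; returns term -> set of users."""
    index = defaultdict(set)
    for user, terms in profiles.items():
        for term in terms:
            index[term].add(user)
    return index

def route(doc_terms, profiles, index, threshold=0.5):
    """Return the users whose Profile has at least `threshold` of its terms in the document."""
    hits = defaultdict(int)
    for term in set(doc_terms):
        for user in index.get(term, ()):
            hits[user] += 1
    return sorted(u for u, h in hits.items() if h / len(profiles[u]) >= threshold)
```

With Profiles {alice: {tennis, ncaa}, bob: {chess, opening, endgame}}, a document containing [ncaa, tennis, ranking] is routed to alice only. A document stream is handled by calling `route` once per incoming document.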


Summary

• Generate a representation of the meaning or content of each object based on its description.

• Generate a representation of the meaning of the information need.

• Compare these two representations to select those objects that are most likely to match the information need.
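These three steps can be put together in a few lines. The sketch below assumes a bag-of-words representation and cosine similarity as the comparison; both are illustrative choices, since the summary does not prescribe a representation or a measure, and the function names are hypothetical.

```python
import math
from collections import Counter

def represent(text):
    """Bag-of-words representation: term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count representations."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(need, objects):
    """Return the name of the object whose representation is closest to the need."""
    q = represent(need)
    return max(objects, key=lambda name: cosine(q, represent(objects[name])))
```

For the objects {d1: "college tennis team ranking", d2: "database systems course"}, the need "tennis ranking" selects d1, because d2 shares no terms with it.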


Basic Architecture of an Information Retrieval System

[Figure: Documents are turned into a Document Representation and Queries into a Query Representation; the two representations feed a Comparison step]


Research Issues

• Given a set of descriptions for objects in the collection and a description of an information need, we must consider the following issues.

• Issue 1
  – What makes a good document representation?
  – What are retrievable units and how are they organized?
  – How can a representation be generated from a description of the document?


Research Issues (Continued)

• Issue 2
  How can we represent the information need and how can we acquire this representation either from a description of the information need or through interaction with the user?

• Issue 3
  How can we compare representations to judge the likelihood that a document matches an information need?


Research Issues (Continued)

• Issue 4
  How can we evaluate the effectiveness of the retrieval process?


Information Extraction

• Generic Information Extraction System
  An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.


Information Extraction (Continued)

• What are the transducers or modules?

• What are their input and output?

• What structure is added?

• What information is lost?

• What is the form of the rules?

• How are the rules applied?

• How are the rules acquired?


Example: Parser

• transducer: parser
• input: the sequence of words or lexical items
• output: a parse tree
• information added: predicate-argument and modification relations
• information lost: none
• rule form: unification grammars
• application method: chart parser
• acquisition method: manual


Modules

• Text Zoner
  turn a text into a set of text segments

• Preprocessor
  turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes

• Filter
  turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones

• Preparser
  take a sequence of lexical items and try to identify various reliably determinable, small-scale structures


Modules (Continued)

• Parser
  input a sequence of lexical items and perhaps small-scale structures (phrases) and output a set of parse tree fragments, possibly complete

• Fragment Combiner
  turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence

• Semantic Interpreter
  generate a semantic structure or logical form from a parse tree or from parse tree fragments


Modules (Continued)

• Lexical Disambiguation
  turn a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates

• Coreference Resolution, or Discourse Processing
  turn a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text

• Template Generator
  derive the templates from the semantic structures
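A cascade of such modules can be sketched as function composition, where the output of one transducer is the input of the next. Only toy versions of the first three modules (Text Zoner, Preprocessor, Filter) are shown; their rules (split on blank lines, split on periods, keep sentences containing a keyword) are assumptions invented for this example, and lexical attributes are omitted.

```python
from functools import reduce

def text_zoner(text):
    """Text -> list of text segments (here: paragraphs separated by blank lines)."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]

def preprocessor(segments):
    """Segments -> list of sentences, each a list of lowercase words."""
    sentences = []
    for seg in segments:
        for sent in seg.split("."):
            words = sent.lower().split()
            if words:
                sentences.append(words)
    return sentences

def make_filter(keywords):
    """Build a Filter stage that keeps only sentences containing a keyword."""
    def sentence_filter(sentences):
        return [s for s in sentences if keywords & set(s)]
    return sentence_filter

def cascade(*stages):
    """Compose stages left to right into a single transducer."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)
```

For example, `cascade(text_zoner, preprocessor, make_filter({"acquired"}))` applied to a two-paragraph text keeps only the sentences that mention "acquired". Each stage adds structure (segments, sentences, words) and the filter discards information judged irrelevant, which mirrors the definition of a generic extraction system.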


Topics

• Introduction to Information Retrieval and Extraction

• Modeling

• Retrieval Evaluation

• Query Languages

• Query Operations

• Text and Multimedia Languages and Properties

• Text Operations

• Indexing and Searching


Topics (Continued)

• User Interfaces and Visualization

• Multimedia IR: Models and Languages

• Multimedia IR: Indexing and Searching

• Searching the Web

• Digital Libraries

• Information Extraction (Jerry R. Hobbs)


[Figure: organization of the topics. Text IR: Retrieval Models and Evaluation, Improvements on Retrieval, Efficient Processing. Human-Computer Interaction for IR: Interfaces & Visualization. Multimedia IR: Multimedia Modeling & Searching. Applications for IR: Bibliographic Systems, The Web, Digital Libraries.]


Information Sources

• Books
  – Baeza-Yates, R. and Ribeiro-Neto, B. (1999) Modern Information Retrieval. Addison-Wesley.
  – Salton, G. (1989) Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Reading, MA: Addison-Wesley.
  – Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992) Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.
  – Cheong, F. (1996) Internet Agents: Spiders, Wanderers, Brokers, and Bots. Indianapolis, IN: New Riders.
  – Sparck Jones, K. and Willett, P. (1997) Readings in Information Retrieval. CA: Morgan Kaufmann Publishers.


Information Sources

• Conference Proceedings
  – ACM SIGIR Annual International Conference on Research and Development in Information Retrieval (1978-)
  – ACM International Conference on Digital Libraries
  – ACM Conference on Information and Knowledge Management
  – Text Retrieval Conference


Information Sources (Continued)

• Journals
  – ACM Transactions on Information Systems
  – Information Processing and Management (formerly Information Storage and Retrieval)
  – Journal of the American Society for Information Science (formerly American Documentation)
  – Journal of Documentation
  – Information Systems
  – Information Retrieval
  – Knowledge and Information Systems