© 2008 ChengXiang Zhai, Dragon Star Lecture at Beijing University, June 21-30, 2008

Dragon Star Program Course: Information Retrieval
Next-Generation Search Engines

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign

http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

Outline

• Overview of web search

• Next generation search engines

Characteristics of Web Information

• “Infinite” size (surface vs. deep Web)
  – Surface = static HTML pages
  – Deep = dynamically generated HTML pages (DB)

• Semi-structured
  – Structured = HTML tags, hyperlinks, etc.
  – Unstructured = text

• Different formats (PDF, Word, PS, …)

• Multimedia (text, audio, images, …)

• High variance in quality (a lot of junk)

• “Universal” coverage (can be about any content)

General Challenges in Web Information Management

• Handling the size of the Web
  – How to ensure completeness of coverage?
  – Efficiency issues

• Dealing with or tolerating errors and low-quality information

• Addressing the dynamics of the Web
  – Some pages may disappear permanently
  – New pages are constantly created

“Free text” vs. “Structured text”

• So far, we’ve assumed “free text”
  – Document = word sequence
  – Query = word sequence
  – Collection = a set of documents
  – Minimal structure …

• But we may have structures on text (e.g., title, hyperlinks)
  – Can we exploit the structures in retrieval?

Examples of Document Structures

• Intra-doc structures (= relations of components)
  – Natural components: title, author, abstract, sections, references, …
  – Annotations: named entities, subtopics, markups, …

• Inter-doc structures (= relations between documents)
  – Topic hierarchy
  – Hyperlinks/citations (hypertext)

Structured Text Collection

[Figure: a general topic branching into Subtopic 1, …, Subtopic k]

General question: How do we search such a collection?

Exploiting Intra-document Structures [Ogilvie & Callan 2003]

[Figure: a document D split into parts D1, D2, D3, …, Dk, e.g., Title, Abstract, Body-Part1, Body-Part2]

Intuitively, we want to combine all the parts, but give more weight to some parts.

Think about the query-likelihood model:

$$p(Q \mid D, R) = \prod_{i=1}^{n} p(w_i \mid D, R) = \prod_{i=1}^{n} \sum_{j=1}^{k} s(D_j \mid D, R)\, p(w_i \mid D_j, R)$$

• s(D_j | D, R) is the “part selection” probability; it serves as the weight for D_j and can be trained using EM

• Generative view: select D_j, then generate a query word using D_j

Anchor text can be treated as a “part” of a document
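The mixture-of-parts query likelihood on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relevance variable R is dropped, the part weights s(D_j|D) are fixed by hand rather than trained with EM, and each part model uses simple add-mu smoothing. The function names `part_language_model` and `log_query_likelihood` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def part_language_model(text, vocab_size, mu=1.0):
    """Unigram model p(w|D_j) for one part, with add-mu smoothing (an assumption)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return lambda w: (counts[w] + mu) / (total + mu * vocab_size)

def log_query_likelihood(query, parts, weights, vocab_size):
    """log p(Q|D) = sum_i log( sum_j s(D_j|D) * p(w_i|D_j) )."""
    models = [part_language_model(text, vocab_size) for text in parts]
    score = 0.0
    for w in query.lower().split():
        score += math.log(sum(s * m(w) for s, m in zip(weights, models)))
    return score

# Parts are (title, abstract, body); the weights are hand-set and favor the title.
weights = [0.6, 0.2, 0.2]
doc_a = ["web search engines", "ranking pages", "links and anchors"]
doc_b = ["cooking recipes", "ranking pages", "web search engines"]
score_a = log_query_likelihood("web search", doc_a, weights, vocab_size=20)
score_b = log_query_likelihood("web search", doc_b, weights, vocab_size=20)
```

With these weights, the document that matches the query in its title (`doc_a`) scores higher than the one that matches only in its body (`doc_b`). Anchor text could be passed in as one more entry in `parts`.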

Exploiting Inter-document Structures

• Document collection has links (e.g., Web, citations of literature)

• Query: text query

• Results: ranked list of documents

• Challenge: how to exploit links to improve ranking?

Exploiting Inter-Document Links

What does a link tell us?

• Description (“anchor text”): “extra text”/summary for a doc

• Hub and authority: links indicate the utility of a doc

PageRank: Capturing Page “Popularity” [Page & Brin 98]

• Intuitions
  – Links are like citations in literature
  – A page that is cited often can be expected to be more useful in general

• PageRank is essentially “citation counting”, but improves over simple counting
  – Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
  – Smoothing of citations (every page is assumed to have a non-zero citation count)

• PageRank can also be interpreted as random surfing (thus capturing popularity)

The PageRank Algorithm (Page et al. 98)

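[Figure: example link graph over four pages d1, d2, d3, d4]

“Transition matrix” for the example graph:

$$M = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}$$

Random surfing model: at any page,
  – with prob. α, randomly jump to a page;
  – with prob. (1−α), randomly pick a link to follow.

$$p(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki}\, p(d_k) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, p(d_k) = \sum_{k=1}^{N} \left[ \frac{1}{N}\alpha + (1-\alpha)\, m_{ki} \right] p(d_k)$$

$$p = (\alpha I + (1-\alpha) M)^{T} p$$

• N = # pages; I_{ij} = 1/N

• p is the stationary (“stable”) distribution, so we ignore time

• Initial value p(d) = 1/N; iterate until convergence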
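The iteration on this slide can be sketched as a plain power iteration over the four-page example. This is a minimal illustration, not the production algorithm: the function name `pagerank`, the convergence tolerance, and the iteration cap are assumptions, and the jump probability is set to 0.15.

```python
def pagerank(M, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for p(d_i) = sum_k [alpha/N + (1 - alpha) * m_ki] * p(d_k)."""
    n = len(M)
    p = [1.0 / n] * n  # initial value p(d) = 1/N
    for _ in range(max_iter):
        new_p = [
            sum((alpha / n + (1 - alpha) * M[k][i]) * p[k] for k in range(n))
            for i in range(n)
        ]
        # Stop once no score changes by more than the tolerance.
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Transition matrix of the example graph: row k holds the out-link probabilities of d_k.
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
scores = pagerank(M)
```

On this graph d1 receives the highest score, followed by d2, while d3 and d4 tie because each is reached only through one of d1's two out-links.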

PageRank in Practice

$$p(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki}\, p(d_k) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, p(d_k), \qquad p = (\alpha I + (1-\alpha) M)^{T} p$$

• Interpretation of the damping factor α (0.15):
  – Probability of a random jump
  – Smoothing the transition matrix (avoids zeros)

• Normalization doesn’t affect ranking, leading to some variants. Let p'(d_i) = c p(d_i), where c is a constant:

$$c\,p(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki}\, c\,p(d_k) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, c\,p(d_k)$$

$$p'(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki}\, p'(d_k) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, p'(d_k)$$

  With c = N:

$$p'(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki}\, p'(d_k) + \alpha$$

• The zero-outlink problem: the p(d_i)’s don’t sum to 1
  – One possible solution = page-specific damping factor (α = 1.0 for a page with no outlink)
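The zero-outlink problem and the page-specific damping fix mentioned on this slide can be illustrated with a small sketch. The three-page graph, the function name `pagerank_dangling`, and the fixed iteration count are assumptions made for the example; the jump probability is written `alpha` here.

```python
def pagerank_dangling(M, alpha=0.15, iters=200, fix_dangling=True):
    """PageRank with an optional page-specific jump probability for dead-end pages."""
    n = len(M)
    # A page with no outlinks always jumps (alpha_k = 1.0) when the fix is enabled.
    alphas = [1.0 if fix_dangling and sum(row) == 0 else alpha for row in M]
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [
            sum((alphas[k] / n + (1 - alphas[k]) * M[k][i]) * p[k] for k in range(n))
            for i in range(n)
        ]
    return p

# d1 -> d2 -> d3, and d3 has no outlinks (its row is all zeros).
M = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
fixed = pagerank_dangling(M, fix_dangling=True)
leaky = pagerank_dangling(M, fix_dangling=False)
```

With the fix, the scores remain a probability distribution. Without it, the mass that reaches d3 is mostly lost at every step, so the scores in `leaky` no longer sum to 1.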

HITS: Capturing Authorities & Hubs [Kleinberg 98]

• Intuitions
  – Pages that are widely cited are good authorities
  – Pages that cite many other pages are good hubs

• The key idea of HITS
  – Good authorities are cited by good hubs
  – Good hubs point to good authorities
  – Iterative reinforcement…

The HITS Algorithm [Kleinberg 98]

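[Figure: example link graph over four pages d1, d2, d3, d4]

“Adjacency matrix” for the example graph:

$$A = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$

$$h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j), \qquad a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j)$$

$$h = A a; \quad a = A^{T} h$$

$$h = A A^{T} h; \quad a = A^{T} A a$$

• Initial values: a(d_i) = h(d_i) = 1

• Iterate

• Normalize: $\sum_i a(d_i)^2 = \sum_i h(d_i)^2 = 1$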
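The hub/authority updates can be sketched directly from the adjacency matrix of the example graph. This is a minimal illustration: the function name `hits` and the fixed number of iterations are assumptions, and no convergence test is performed.

```python
import math

def hits(A, iters=50):
    """Iterate a = A^T h and h = A a, normalizing both to unit length each round."""
    n = len(A)
    a = [1.0] * n  # initial authority scores
    h = [1.0] * n  # initial hub scores
    for _ in range(iters):
        # a(d_i): sum of hub scores of the pages that link to d_i.
        a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]
        # h(d_i): sum of authority scores of the pages that d_i links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        a_norm = math.sqrt(sum(x * x for x in a))
        h_norm = math.sqrt(sum(x * x for x in h))
        a = [x / a_norm for x in a]
        h = [x / h_norm for x in h]
    return a, h

# A[i][j] = 1 if page d_(i+1) links to page d_(j+1).
A = [
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
authority, hub = hits(A)
```

On this graph d1 and d2 end up as the top authorities with equal scores, and d4, which links to both of them, ends up as the strongest hub.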

Basic Search Engine Technologies

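[Figure: search engine architecture. A crawler fetches pages from the Web into cached pages; an indexer builds the (inverted) index; a retriever receives the query and host info from the user's browser and returns results. Concerns annotated on the diagram: efficiency, coverage, freshness, precision, error/spam handling.]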

Component I: Crawler/Spider/Robot

• Building a “toy crawler” is easy
  – Start with a set of “seed pages” in a priority queue
  – Fetch pages from the web
  – Parse fetched pages for hyperlinks; add them to the queue
  – Follow the hyperlinks in the queue

• A real crawler is much more complicated…
  – Robustness (server failure, trap, etc.)
  – Crawling courtesy (server load balance, robot exclusion, etc.)
  – Handling file types (images, PDF files, etc.)
  – URL extensions (cgi script, internal references, etc.)
  – Recognize redundant pages (identical and duplicates)
  – Discover “hidden” URLs (e.g., truncated)

• Crawling strategy is a main research topic (i.e., which page to visit next?)
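The “toy crawler” steps above can be sketched without any network access. Everything here is hypothetical: `FAKE_WEB` is a made-up in-memory stand-in for the Web, `crawl` takes the fetch function as a parameter, and link extraction uses a simple regular expression rather than a real HTML parser. The priority is the discovery depth, which yields a breadth-first visiting order.

```python
import heapq
import re

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a> <a href="http://d.example/">d</a>',
    "http://c.example/": "no links here",
    "http://d.example/": '<a href="http://a.example/">a</a>',
}

HREF = re.compile(r'href="([^"]+)"')

def crawl(seeds, fetch, max_pages=100):
    """Visit pages in priority order (lowest depth first), starting from the seeds."""
    queue = [(0, url) for url in seeds]
    heapq.heapify(queue)
    seen = set(seeds)   # avoids queueing the same URL twice
    visited = []
    while queue and len(visited) < max_pages:
        depth, url = heapq.heappop(queue)
        html = fetch(url)
        if html is None:  # minimal robustness: skip pages that fail to load
            continue
        visited.append(url)
        for link in HREF.findall(html):
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (depth + 1, link))
    return visited

order = crawl(["http://a.example/"], FAKE_WEB.get)
```

None of the hard parts listed on the slide (courtesy, traps, file types, duplicate detection) are handled here. A different priority function in place of `depth` would give a different crawling strategy.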

Major Crawling Strategies

• Breadth-first (most common(?); balances server load)

• Parallel crawling

• Focused crawling
  – Targeting a subset of pages (e.g., all pages about “automobiles”)
  – Typically given a query

• Incremental/repeated crawling
  – Can learn from past experience
  – Probabilistic models are possible

The major challenge remains maintaining “freshness” and good coverage with minimum resource overhead

Component II: Indexer

• Standard IR techniques are the basis
  – Basic indexing decisions (stop words, stemming, numbers, special symbols)
  – Indexing efficiency (space and time)
  – Updating

• Additional challenges
  – Recognize spam/junk
  – Exploit multiple features (PageRank, font information, structures, etc.)
  – How to support “fast summary generation”?

• Google’s contributions:
  – Google File System: distributed file system
  – Bigtable: column-based database
  – MapReduce: software framework for parallel computation
  – Hadoop: open source implementation of MapReduce (mainly by Yahoo!)
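The basic indexing decisions on this slide can be made concrete with a small in-memory inverted index. This is a sketch under simplifying assumptions: the stop-word list is a tiny illustrative set, tokens containing digits or symbols are simply dropped, no stemming is applied, and the names `tokenize` and `build_inverted_index` are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative, not a standard list

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words and non-alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha() and t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

docs = {
    1: "the web search engine",
    2: "search the deep web and the surface web",
    3: "inverted index of a search engine",
}
index = build_inverted_index(docs)
```

Here `index["web"]` holds one posting for document 1 and one for document 2 with frequency 2. A web-scale indexer would build and merge such postings in parallel, which is the kind of job MapReduce targets.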

Google’s Basic Solutions

[Figure: Google's architecture diagram, annotated with: URL queue/list; cached source pages (compressed); inverted index; hypertext structure; use of many features, e.g., font, layout, …]

Component III: Retriever

• Standard IR models are applicable but insufficient
  – Different information needs (home page finding vs. topic-driven)
  – Documents have additional information (hyperlinks, markups, URL)
  – Information is often redundant and the quality varies a lot
  – Server-side feedback is often not feasible

• Major extensions
  – Exploiting links (anchor text, link-based scoring)
  – Exploiting layout/markups (font, title field, etc.)
  – Spelling correction
  – Spam filtering
  – Redundancy elimination

• In general, rely on machine learning to combine all kinds of features

Effective Web Retrieval Heuristics

• High accuracy in home page finding can be achieved by
  – Matching the query with the title
  – Matching the query with the anchor text
  – Plus URL-based or link-based scoring (e.g., PageRank)

• Imposing a conjunctive (“and”) interpretation of the query is often appropriate
  – Queries are generally very short (all words are necessary)
  – The size of the Web makes it likely that at least one page would match all the query words

• Combine multiple features using machine learning
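The conjunctive (“and”) interpretation of a query amounts to intersecting the posting sets of all query terms. The sketch below assumes a toy index that maps each term to a set of document ids; the index contents and the function name `conjunctive_match` are made up for illustration.

```python
def conjunctive_match(query, index):
    """Return the ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the rarest term so the running intersection stays small.
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

# Hypothetical term -> document-id postings.
index = {
    "haas": {1, 4},
    "business": {1, 2, 4},
    "school": {1, 2, 3},
}
matches = conjunctive_match("Haas Business School", index)
```

Only document 1 contains all three terms, so it is the only match. A query with a term that appears in no document returns an empty set, which is why this interpretation relies on the Web being large and queries being short.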

Home/Entry Page Finding Evaluation Results (TREC 2001)

MRR     %top10   %fail
0.774   88.3     4.8
0.772   87.6     4.8
0.382   62.1     11.7
0.340   60.7     15.9

Annotations on the results table:

• Unigram query likelihood + link/URL prior, i.e., p(Q|D) p(D) [Kraaij et al. SIGIR 2002]

• Exploiting anchor text, structure or links

Query example: Haas Business School

Named Page Finding Evaluation Results (TREC 2002)

[Figure: results chart with annotated runs: Dirichlet prior + title, anchor text (Lemur) [Ogilvie & Callan SIGIR 2003]; Okapi/BM25 + anchor text; best content-only]

Query example: America’s century farms

Learning Retrieval Functions

• Basic idea:
  – Given a query-doc pair (Q,D), define various kinds of features Fi(Q,D)
  – Examples of features: the number of overlapping terms, p(Q|D), PageRank of D, p(Q|Di), where Di may be anchor text or big font text
  – Hypothesize p(R=1|Q,D) = s(F1(Q,D),…,Fn(Q,D), λ), where λ is a set of parameters
  – Learn λ by fitting function s with training data (i.e., (d,q)’s where d is known to be relevant or non-relevant to q)

• Methods:
  – Early work: logistic regression [Cooper 92, Gey 94]
  – Recent work: Ranking SVM [Joachims 02], RankNet [Burges et al. 05]
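The logistic regression approach listed under early work can be sketched as follows. This is only an illustration: the two features (a term-overlap fraction and a scaled PageRank value), the six training pairs, the learning rate, and the function names are all made up, and training uses plain stochastic gradient ascent rather than any method from the cited papers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(examples, lr=0.5, epochs=2000):
    """Fit weights so that sigmoid(w . F + b) approximates p(R=1|Q,D)."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for feats, label in examples:
            pred = sigmoid(b + sum(wi * fi for wi, fi in zip(w, feats)))
            err = label - pred  # gradient of the log-likelihood w.r.t. the logit
            w = [wi + lr * err * fi for wi, fi in zip(w, feats)]
            b += lr * err
    return w, b

def score(feats, w, b):
    """Estimated probability of relevance for one query-doc feature vector."""
    return sigmoid(b + sum(wi * fi for wi, fi in zip(w, feats)))

# Made-up judgments: features are (term overlap fraction, scaled PageRank); label 1 = relevant.
train = [
    ((0.9, 0.8), 1), ((0.8, 0.3), 1), ((0.7, 0.9), 1),
    ((0.2, 0.7), 0), ((0.1, 0.2), 0), ((0.3, 0.1), 0),
]
w, b = train_logistic(train)
relevant_score = score((0.9, 0.8), w, b)
nonrelevant_score = score((0.1, 0.2), w, b)
```

Documents for a query would then be ranked by `score`. The learned weights play the role of the parameters in the function s on this slide.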

Learning to Rank

• Advantages
  – May combine multiple features (helps improve accuracy and combat web spam)
  – May re-use all the past relevance judgments (self-improving)

• Problems
  – Not much guidance on feature generation (relies on traditional retrieval models)

• All current Web search engines use some kind of learning algorithms to combine many features

Next-Generation Search Engines

Limitations of the Current Search Engines

• Limited query language
  – Syntactic querying (sense disambiguation?)
  – Can’t express multiple search criteria (readability?)

• Limited understanding of document contents
  – Bag of words & keyword matching (sense disambiguation?)

• Heuristic query-document matching: mostly TF-IDF weighting
  – No guarantee for optimality
  – Machine learning can combine many features, but content matching remains the most important component in scoring

Limitations of the Current Search Engines (cont.)

• Lack of user/context modeling
  – Using the same query, different users would get the same results (poor user modeling)
  – The same user may use the same query to find different information at different times

• Inadequate support for multiple-mode information access
  – Passive search support: a user must take the initiative (no recommendation)
  – Static navigation support: no dynamically generated links
  – Consider more integration of search, recommendation, and navigation

• Lack of interaction

• Lack of task support

Towards Next-Generation Search Engines

• Better support for query formulation
  – Allow querying from any task context
  – Query by examples
  – Automatic query generation (recommendation)

• Better search accuracy
  – More accurate information need understanding (more personalization and context modeling)
  – More accurate document content understanding (more powerful content analysis, sense disambiguation, sentiment analysis, …)

• More complex retrieval criteria
  – Consider multiple utility aspects of information items (e.g., readability, quality, communication cost)
  – Consider collective value of information items (context-sensitive ranking)

Towards Next-Generation Search Engines (cont.)

• Better result presentation
  – Better organization of search results to facilitate navigation
  – Better summarization

• More effective and robust retrieval models
  – Automatic parameter tuning

• More interactive search

• More task support

Looking Ahead…

• More user modeling
  – Personalized search
  – Community search engine (collaborative information access)

• More content analysis and domain modeling
  – Vertical search engines
  – More in-depth (domain-specific) natural language understanding
  – Text mining

• More accurate retrieval models (life-time learning)

• Going beyond search
  – Towards full-fledged information access: integration of search, recommendation, and navigation
  – Towards task support: putting information access in the context of a task

Summary

• Web provides many challenges and opportunities for text information management

• Search engine technology: crawling + retrieval models + machine learning + software engineering

• The current generation of search engines is limited in user modeling, content understanding, retrieval models, …

• The next generation of search engines will likely move toward personalization, domain-specific vertical search engines, collaborative search, task support, …

What You Should Know

• Special characteristics of Web information (as compared with an ordinary text collection)

• Two kinds of structures of a text collection (intra-doc and inter-doc)

• Basic ideas of PageRank and HITS

• How a web search engine works

• Limitations of current search engines