Upload
phunghanh
View
218
Download
5
Embed Size (px)
Citation preview
Query Understanding in Web Search- by Large Scale Log Data Mining and
Statistical Learning
Hang Li
Microsoft Research Asia
COLING 2010 NLPIX Workshop
August 28, 2010
Joint Work with Colleagues, Interns, Collaborators
1
Web Search is Part of Our Life
2
Search System = `Black Boxes’
3
Natural Language ProcessingInformation Retrieval
Data Mining
Large Scale Distributed Computing
Advanced Web Search Technologies Are Used…
Statistical Learning
4
Web Search Relies on NLP and IR
• Query Understanding– Classification, structure prediction, topic modeling, similarity
learning
• Document Understanding– Classification, structure prediction, topic modeling, learning on
graph
• Query Document Matching– Language model, similarity learning
• Ranking– Learning to rank
• User Understanding– Classification, topic modeling
5
Query Understanding
• Input: query
• Output: query representation
– Refined query (e.g., spelling error correction)
– Similar queries
– Categories
– Topics
– Key phrases
– Named entities
6
This Talk = Query Understanding
7
Talk Outline: Three Projects
• LOGAL: Search and Browse Log Mining Platform
• Semantic Matching: Improving Tail Query Relevance
• Context aware Search: Better Search Using Context Information
2010/8/30 8
PROJECT: LOGAL (LOG OBJECT GALLERY)
9
Joint work with Daxin Jiang, Xiaohui Sun
LOGAL
Search and Browse Log Mining Platform
10
Data Structure of Search/Browse Logs
• Various types of data
• Complex relationship among data objects
– Hierarchical relationship
– Sequential relationship
Users
Queries
Srch/Ads clicks
Follow-up clicks
Sessions
•
•
•
•
•
Search result pages
•
•
•
•
•
Rich Log Mining Applications
Log Mining Applications
Search Applications
Ads Applications
Query Understanding
Document Understanding
User Understanding
Query-Doc Matching
Query Suggestion
Query Expansion
Query Substitution
Query Classification
Document Annotation
Document Classification
Document Summarization
Personalized Search
Search UI Design
User Satisfaction Prediction
Document & Ad (re-)Ranking
Search Results Clustering
Search Results Diversification
Web Site Recommendation
Keyword Generation
Behavior Targeting
Ad Click-Through Prediction
Contextual Advertising
12
The Problem
Search Log
Apps
Log Data
Query Understanding
A huge gap between the data and the applications
Toolbar Data Ads. Log Web Site Log
Document Understanding
User Understanding
Query-Doc Matching
13
The Problem
Search Log
Apps
Log Data
Query Understanding
Toolbar Data Ads. Log Web Site Log
Document Understanding
User Understanding
Query-Doc Matching
Each researcher or developer
1. Has to access the raw log data directly
2. Has to build the application from scratch
Very difficult to build large-scale log mining applications
14
Log Data Mining Platform
Log Objects Gallery (LOGAL)
ToolbarData
Web SiteLog
SearchLog
Ads.Log
Raw Logs
Query Understanding
Document Understanding
Query DocMatching
User Understanding
App Level
Middle Level
Raw Data Level
Data Platform
15
Query Histogram
Example applications:• Query auto completion• Query suggestion• Query analysis: temporal changes of query frequency
Query Countfacebook 3,157 Kgoogle 1,796 Kyoutube 1,162 K
myspace 702 Kfacebook com 665 Kyahoo 658 Kyahoo mail 486 Kyahoo com 486 Kebay 486 Kfacebook login 445 K
Click-through Bipartite
• Example applications
– Document (re-)ranking
– Search results clustering
– Web page summarization
– Query suggestion
click-through bipartite
Click PatternQuery
Doc 1Doc 2…
………………Doc N
Doc 1 Doc 2
…
………………
Doc N
Doc 1 Doc 2
… …
……………Doc N
Pattern 1 (count)
Pattern n (count)
Pattern 2 (count)
…
• Example applications– Estimate relevance of document to query
– Predict users’ satisfaction
– Query classification (informational vs navigational)
Session Pattern
• Example applications– Doc (re-)ranking
– Query suggestion
– Site recommendation
– User satisfaction prediction
Srch click: search clickAds click: advertisement click
User activities in a session
Query Click:Srch clickAds click…
Browse
PROJECT: SEMANTIC MATCHING
20
Joint work with Gu Xu, Jun Xu, Jingfang Xu
Semantic Matching
Improving Tail Query Relevance
21
Different Queries Can Represent Same Intent “Distance between Sun and Earth” - Luke DeLorme
• distance from earth to the sun• distance from sun to earth• distance from sun to the earth• distance from the earth to the sun• distance from the sun to earth• distance from the sun to the earth• distance of earth from sun• distance of earth from the sun• distance of earth to sun• distance of earth to the sun• distance of sun from earth• distance of sun from the earth• distance of sun to earth• distance of the earth from the sun• distance of the earth to the sun• distance of the sun from earth• distance of the sun from the earth• distance of the sun to earth• distance of the sun to the earth• distance sun• distance sun and earth• distance sun earth• distance sun from earth• distance sun to earth• distance to sun from earth• distance to the sun from earth• earth and sun distance
• "how far" earth sun
• "how far" sun
• "how far" sun earth
• average distance earth sun
• average distance from earth to sun
• average distance from the earth to the sun
• distance between earth & sun
• distance between earth and sun
• distance between earth and the sun
• distance between earth sun
• distance between sun and earth
• distance between the earth and sun
• distance between the earth and the sun
• distance between the sun and earth
• distance between the sun and the earth
• distance earth and sun
• distance earth from sun
• distance earth is from the sun
• distance earth sun
• distance earth to sun
• distance earth to the sun
• distance from earth to sun
• distance from earth to the sun
• distance from sun to earth
• distance from sun to the earth
• distance from the earth to the sun
• distance from the sun to earth
• how far away is the sun from earth• how far away is the sun from the
earth• how far earth from sun• how far earth is from the sun• how far earth sun• how far from earth is the sun• how far from earth to sun• how far from the earth to the sun• how far from the sun is earth• how far from the sun is the earth• how far is earth away from the sun• how far is earth from sun• how far is earth from the sun• how far is earth to the sun• how far is it from earth to the sun• how far is it from the earth to the sun• how far is sun from earth• how far is the earth away from the
sun• how far is the earth from sun• how far is the earth from the sun• how far is the earth to the sun• how far is the sun• how far is the sun away from earth• how far is the sun away from the
earth
Microsoft Confidential
Different Levels of Semantic Matching
Structure
Term
Word Sense
Topic
Leve
l of
Sem
anti
cs
Match exactly same terms
NY New York
disk disc
Match terms with same meanings
NY New York
motherboard mainboard
utube youtube
Match topics of query and documents
Microsoft Office … working for Microsoft … my office is in …
Topic: PC Software Topic: Personal Homepage
Match intent with answers (structures of query and document)Microsoft Office home find homepage of Microsoft Office
21 movie find movie named 21
buy laptop less than 1000 find online dealers to buylaptop with less than 1000 dollars
23
Semantic Matching Is Useful for
• General Search Relevance
• Vertical Search
• Entity Search
• Task Completion
24
SYSTEM VIEW OF SEMANTIC MATCHING
25
Overall System
Query Representation
Query Index Document Index
Search Log Data Web Data
Offline Query Processing Offline Document Processing
Ranked Documents
Document Representations
Query
Query Knowledge
Online Query Processing Semantic Matching
Microsoft
Online
Offline
26
Online Query Processing
Named Entity Recognition in Query
Query Topic Identification
Similar Query Finding
Query RefinementSense
Topic
Structure
michael jordan berkele
michael jordan berkeley
michael jordan berkeley
michael I. jordan berkeley
michael jordan berkeley: academic
michael I. jordan berkeley: academic
[michael jordan: PersonName] [berkeley: Location]: academic
[michael I. jordan: PersonName] [berkeley: Location]: academic
27
Offline Document Processing
Named Entity Recognition in Doc.
Document Topic Identification
Key Concept Identification
TokenizationSense
Topic
Structure
Michael Jordan is Professor in the Department of Electrical Engineering
[Michael Jordan] is [Professor] in the [Department] of [Electrical Engineering]
[Michael Jordan/M. Jordan] is [Professor]in the [Department/Dept.] of [Electrical Engineering/EE]
[Michael Jordan/M. Jordan] is [Professor]in the [Department/Dept.] of [Electrical Engineering/EE]: academic
[Michael Jordan/M. Jordan: PersonName]is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]: academic
28
Online Semantic Matching
QueryRepresentation
DocumentRepresentation
[michael jordan: PersonName] [berkeley: Location]: academic
[michael I. jordan: PersonName] [berkeley: Location]: academic
[Michael Jordan/M. Jordan: PersonName]is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]: academic
SemanticMatching
Matching can be conducted at different levels
Ranking Results
29
QUERY REFINEMENT USING CRF MODEL
30
Correcting Errors in Query
search “windows onecare”
window onecar
Query Refiner
SearchSystem
windows onecare
31
Structured Prediction Problem
windows onecare
window onecarObserved “noisy” word sequence
“Ideal” word sequence
original query word sequence
“ideal” query word sequence
32
Conditional Random Fields for Query Refinement
Introducing Refinement Operations
oi-1 oi oi+1
yi-1 yi yi+1
xi-1 xi xi+1
OperationsSpelling: insertion, deletion, substitution, transposition, …
Word Stemming: +s/-s, +es/-es, +ed/-ed, +ing/-ing, … 33
Query Refinement Using Conditional Random Fields
34
NAMED ENTITY MINING FROM QUERY LOG USING TOPIC MODEL
35
Named Entity Recognition in Query
harry potter harry potter film harry potter author
harry potter – Movie (0.5)harry potter – Book (0.4)harry potter – Game (0.1)
harry potter filmharry potter – Movie (0.95)
harry potter authorharry potter – Book (0.95)
36
Our Approach
• Using Query Log Data (or Click-through Data)
• Using Topic Model
• Weakly Supervised Latent Dirichlet Allocation
• vs Pasca’s work (named entity mining from log data, deterministic approach)
37
Seed and Query Log
final fantasyMovie Game
gone with the windMovie Book
harry potterMovie Book Game
final fantasy 300final fantasy movie 120final fantasy wallpaper 50gone with the wind movie 120gone with the wind review 10gone with the wind photos 10harry potter 1000harry potter book 650 gone with the wind book 80gone with the wind summary 20 harry potter cheats 300harry potter pics 200harry potter summary 100final fantasy xbox 10final fantasy soundtrack 10gone with the wind 250harry potter movie 800……
38
Pseudo Documents of Named Entities
\# 1000\# movie 800\# book 650 \# cheats 300\# pics 200\# summary 100
\# 250\# movie 120\# book 80\# summary 20\# review 10\# photos 10
\# 300\# movie 120\# wallpaper 50\# xbox 10\# soundtrack 10
final fantasy Movie, Game
gone with the wind
harry potter Movie, Book, Game
Movie, Book
39
Latent Dirichlet Allocation Model
z: Movie, Book, Gamew: \#, \# movie, \# book, ….
: distribution of classes for named entity: distribution of contexts for class
40
Weakly Supervised Latent Dirichlet Allocation
),()|(log yCDp
0
0.2
0.4
0.6
0.8
harry potter
final fantacy
gone with wind
Movie
Book
Game
Music
constraints41
PROJECT: CONTEXT AWARE SEARCH
42
Joint work with Daxin Jiang, Jian Pei, and others
Context aware Search:
Better Search Using Context Information
43
• Job related
Search in Office
44
Search at Home
• Household related
• Hobby and leisure
45
• Location
• Time
• Activity
Search in Mobile Context
46
• Community
Search in Social Context
47
Conventional Web Search
48
• Suppose that user raises query “jaguar”
• If we know the user raises query “”BMW” before “”jaguar”
• Then we know that the user is likely to look for the car
Search Intent and Context
49
• User usually conducts multiple related searches in a session
Context of Search
User query query
click click click Current search
Context
50
• Example of search sessions
Context Information is Useful
SID Search sessions
S1 Ford Toyota GMC Allstate
www.autohome.com
S2 Ford cars Toyota cars GMC cars Allstate
www.autohome.com
S3 Ford cars Toyota cars Allstate
www.allstate.com
S4 GMC GMC dealers
www.gmc.com
51
• 50% of users clicked car review site www.autohome.com after searching several car names.
Context Information is Useful
52
• How to model context?
• How to learn context model from data?
• How to apply context model in search?
Challenges in Context aware Search
53
• Three Models
– Sequential Model (Cao et al. KDD 2008)
– Hidden Markov Model (Cao et al. WWW 2009)
– Conditional Random Fields Model (Cao et al. SIGIR 2009)
• Large Scale Data Mining to Construct Models
• Using Learning Models to Make Prediction (Context aware Search)
Our Approach
54
Modeling Context by Sequential Model
q1
u1
c1
qi
ui
ci
qt
ut
ct… …
55
Modeling Context by CRF
q1
u1
c1
qi
ui
ci
qt
ut
ct… …
56
Modeling Context by HMM
q1
u1
c1
qi
ui
ci
qt
ut
ct… …
57
TRAINING HIDDEN MARKOV MODEL
58
• Challenge 1:– EM algorithm needs determined number of hidden states.
– However, in our problem, hidden states correspond to search intents, for which the number is unknown.
• Our Solution:– Conduct clustering on click-bipartite graph and view
clusters as hidden states.
Training Very Large HMM
59
• Challenge 2: – Search log data contains hundreds of millions of sessions.
– It is impractical to train HMM from such huge training data on single machine.
• Our Solution:– Deploy learning task on distributed system under map-
reduce model
Training Very Large HMM
60
• Challenge 3: – Each machine needs to hold the values of all parameters.
– Since search log data contains millions of unique queries and URLs, the space of parameters is extremely large.
• Our Solution:– Employ special initialization strategy based on the clusters
mined from click-through bipartite
Training Very Large HMM
61
SUMMARY
62
• Web search relies on NLP and IR
• Query understanding = identify user search intent
• Query understanding needs– Large scale mining platform
– Advanced NLP and IR technologies
– Advanced statistical learning technologies
• Our Projects– LOGAL: search and browse log mining platform
– Semantic Matching: improving tail query relevance
– Context aware Search: better search using context information
• NLP Challenges and Technologies in Information Explosion Era
Summary
63
THANKS!
64