2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, 2008 1
Dragon Star Program Course: Information Retrieval
Natural Language Processing for IR
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
Elements of Text Info Management Technologies
[Diagram: Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, and Visualization operate on Text and build on Natural Language Content Analysis. Group labels: Retrieval Applications, Mining Applications, Information Access, Knowledge Acquisition, Information Organization.]
Overview of NLP for IR
• Most work has focused on improving the indexing unit
– Syntactic phrases
– Word sense disambiguation
– Anaphora resolution
• Dependency parsing
• Sentiment retrieval
Improving the Indexing Unit
The following slides are taken/adapted from Jimmy Lin’s presentation: http://umiacs.umd.edu/~resnik/ling647_sp2006/slides/jimmy_ir_lecture.ppt
The Central Problem in IR
Information Seeker: Concepts → Query Terms
Authors: Concepts → Document Terms
Do these represent the same concepts?
Why is IR hard?
• IR is hard because natural language is so rich (among other reasons)
• What are the issues?
– Tokenization
– Morphological Variation
– Synonymy
– Polysemy
– Paraphrase
– Ambiguity
– Anaphora
Possible Solutions
• Vary the unit of indexing
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts
• Manipulate queries and results
– Term expansion
– Post-processing of results
Tokenization
• What’s a word?
– First try: words are separated by spaces
The cat on the mat. → the, cat, on, the, mat
– What about clitics?
I’m not saying that I don’t want John’s input on this.
• What about languages without spaces?
天主教教宗若望保祿二世因感冒再度住進醫院。
→ 天主教 教宗 若望保祿二世 因 感冒 再度 住進 醫院。
(English: “The Catholic Pope John Paul II was hospitalized again because of a cold.”)
Word-Level Issues
• Morphological variation = different forms of the same concept
– Inflectional morphology: same part of speech
break, broke, broken; sing, sang, sung; etc.
– Derivational morphology: different parts of speech
destroy, destruction; invent, invention, reinvention; etc.
• Synonymy = different words, same meaning
{dog, canine, doggy, puppy, etc.} → concept of dog
• Polysemy = same word, different meanings (e.g., root)
Paraphrase
Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.
When did Wilt Chamberlain score 100 points?
(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.
(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn’t expect it to last long, saying, “He’ll get 100 points someday.” McGuire’s prediction came true just a few months later in a game against the New York Knicks on March 2.
• Language provides different ways of saying the same thing
Ambiguity
• What exactly do you mean?
I saw the man on the hill with the telescope. (Who has the telescope?)
Time flies like an arrow. (Say what?)
Visiting relatives can be annoying. (Who’s visiting?)
• Why don’t we have problems (most of the time)?
Ambiguity in Action
• Different documents with the same keywords may have different meanings…
What do frogs eat?
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
keywords: frogs, eat
What is the largest volcano in the Solar System?
(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
keywords: largest, volcano, solar, system
Anaphora
Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.
When did Wilt Chamberlain score 100 points?
(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.
(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn’t expect it to last long, saying, “He’ll get 100 points someday.” McGuire’s prediction came true just a few months later in a game against the New York Knicks on March 2.
• Language provides different ways of referring to the same entity
More Anaphora
• Terminology
– Anaphor = an expression that refers to another
– Anaphora = the phenomenon
• Other types of referring expressions:
Fujitsu and NEC said they were still investigating, and that knowledge of more such bids could emerge... Other major Japanese computer companies contacted yesterday said they have never made such bids.
The hotel recently went through a $200 million restoration… original artworks include an impressive collection of Greek statues in the lobby.
• Anaphora resolution can be hard!
The city council denied the demonstrators a permit because…
they feared violence.
they advocated violence.
What can we do?
• Here are some of the problems:
– Tokenization
– Morphological variation, synonymy, polysemy
– Paraphrase, ambiguity
– Anaphora
• General approaches:
– Vary the unit of indexing
– Manipulate queries and results
What do we index?
• In information retrieval, we are after the concepts represented in the documents
• … but we can only index strings
• So what’s the best unit of indexing?
The Tokenization Problem
• In many languages, words are not separated by spaces…
• Tokenization = separating a string into “words”
• Simple greedy approach:
– Start with a list of every possible term (e.g., from a dictionary)
– Look for the longest word in the unsegmented string
– Take longest matching term as the next word and repeat
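The greedy procedure above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function name `greedy_segment`, the toy dictionary, and the single-character fallback for unmatched text are all assumptions made for the example. English text with the spaces removed stands in for an unsegmented language.

```python
def greedy_segment(text, dictionary, max_len=None):
    """Segment text by repeatedly taking the longest dictionary match."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                # Assumption: an unmatched single character becomes its own token.
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy dictionary.
toy_dictionary = {"the", "cat", "on", "mat", "them", "at"}
print(greedy_segment("thecatonthemat", toy_dictionary))
```

On this input the greedy choice takes "them" instead of "the" followed by "mat", which shows why longest-match segmentation can go wrong.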
Probabilistic Segmentation
• For an input word: c1 c2 c3 … cn
• Try all possible partitions:
c1 | c2 c3 c4 … cn
c1 c2 | c3 c4 … cn
c1 c2 c3 | c4 … cn
…
• Choose the highest probability partition
– E.g., compute P(c1 c2 c3) using a language model
• Challenges: search, probability estimation
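One common way to handle the search challenge is dynamic programming over split points. The sketch below assumes a unigram language model given as a plain dictionary of word probabilities; the name `best_segmentation`, the probabilities, and the `max_len` cutoff are hypothetical choices for illustration only.

```python
import math


def best_segmentation(text, word_prob, max_len=4):
    """Return the partition of text with the highest unigram-model probability."""
    n = len(text)
    # best[i] holds (log-probability, backpointer) for the best split of text[:i].
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_prob[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # No partition uses only known words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical unigram probabilities (they need not sum to 1 in this toy).
toy_probs = {"the": 0.3, "cat": 0.1, "on": 0.2, "mat": 0.1, "them": 0.05, "at": 0.05}
print(best_segmentation("thecatonthemat", toy_probs))
```

With these toy probabilities the partition ending in "the", "mat" scores higher than the one ending in "them", "at", so the model recovers the intended reading.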
Indexing N-Grams
• Consider a Chinese document: c1 c2 c3 … cn
• Don’t segment (you could be wrong!)
• Instead, treat every character bigram as a term
c1 c2 c3 c4 c5 … cn → c1c2, c2c3, c3c4, c4c5, …, cn-1cn
• Break up queries the same way
• Works at least as well as trying to segment correctly!
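A minimal sketch of character-bigram indexing, assuming the document and the query are plain strings. The helper names `char_bigrams` and `bigram_overlap` are made up for this example, and counting shared bigrams is only a stand-in for a real ranking function.

```python
def char_bigrams(text):
    """Return every overlapping character bigram, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def bigram_overlap(query, document):
    """Count the query bigrams that also occur in the document."""
    doc_terms = set(char_bigrams(document))
    return sum(1 for term in char_bigrams(query) if term in doc_terms)


print(char_bigrams("感冒再度"))
print(bigram_overlap("感冒", "因感冒再度住進醫院"))
```

The query is broken up the same way as the document, so no segmenter is needed on either side.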
Morphological Variation
• Handling morphology: related concepts have different forms
– Inflectional morphology: same part of speech
dogs = dog + PLURAL
broke = break + PAST
– Derivational morphology: different parts of speech
destruction = destroy + ion
researcher = research + er
• Different morphological processes:
– Prefixing
– Suffixing
– Infixing
– Reduplication
Stemming
• Dealing with morphological variation: index stems instead of words
– Stem: a word equivalence class that preserves the central concept
• How much to stem?
– organization → organize → organ?
– resubmission → resubmit/submission → submit?
– reconstructionism?
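The "how much to stem?" question can be made concrete with a toy suffix stripper. This is not the Porter stemmer or any published algorithm; the suffix list, its ordering, and the `min_stem` threshold are assumptions chosen only to reproduce the over-stemming example above.

```python
# Hypothetical suffix list, checked in order (longest suffixes first).
SUFFIXES = ["ization", "ation", "ize", "ion", "ing", "ed", "s"]


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


for w in ["organization", "organize", "organ", "dogs", "sing"]:
    print(w, "->", strip_suffix(w))
```

Here "organization", "organize", and "organ" all collapse to the same stem, which is exactly the conflation the slide warns about.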
Does Stemming Work?
• Generally, yes! (in English)
– Helps more for longer queries
– Lots of work done in this area
Donna Harman. (1991) How Effective is Suffixing? Journal of the American Society for Information Science, 42(1):7-15.
Robert Krovetz. (1993) Viewing Morphology as an Inference Process. Proceedings of SIGIR 1993.
David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed Evaluation. Journal of the American Society for Information Science, 47(1):70-84.
And others…
Stemming in Other Languages
• Arabic makes frequent use of infixes
maktab (office), kitaab (book), kutub (books), kataba (he wrote), naktubu (we write), etc. → the root ktb
• What’s the most effective stemming strategy in Arabic? Open research question…
Words = wrong indexing unit!
• Synonymy = different words, same meaning
{dog, canine, doggy, puppy, etc.} → concept of dog
• Polysemy = same word, different meanings
Bank: financial institution or side of a river?
Crane: bird or construction equipment?
• It’d be nice if we could index concepts!
– Word sense: a coherent cluster in semantic space
– Indexing word senses achieves the effect of conceptual indexing
Indexing Word Senses
• How does indexing word senses solve the synonym/polysemy problem?
{dog, canine, doggy, puppy, etc.} → concept 112986
I deposited my check in the bank. bank → concept 76529
I saw the sailboat from the bank. bank → concept 53107
• Okay, so where do we get the word senses?
– WordNet
– Automatically find “clusters” of words that describe the same concepts
– Other methods also have been tried…
Word Sense Disambiguation
• Given a word in context, automatically determine the sense (concept)
– This is the Word Sense Disambiguation (WSD) problem
• Context is the key:
– For each ambiguous word, note the surrounding words
– Learn a classifier from a collection of examples
– Use the classifier to determine the senses of words in the documents
bank + {river, sailboat, water, etc.} → side of a river
bank + {check, money, account, etc.} → financial institution
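The context-based recipe can be illustrated with a tiny Naive Bayes classifier over surrounding words. The choice of Naive Bayes with add-one smoothing is an assumption for this sketch (the slides do not name a classifier), and the training examples and sense labels are invented.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (sense, context_words) pairs for one ambiguous word."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def classify(context, model):
    """Pick the sense with the highest smoothed Naive Bayes score."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            if w in vocab:  # Words never seen in training are ignored.
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical labeled contexts for the word "bank".
model = train([
    ("side of a river", ["river", "sailboat", "water"]),
    ("financial institution", ["check", "money", "account"]),
])
print(classify(["saw", "the", "sailboat"], model))
print(classify(["deposited", "my", "check"], model))
```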
Does it work?
• Nope!
Ellen M. Voorhees. (1993) Using WordNet to Disambiguate Word Senses for Text Retrieval. Proceedings of SIGIR 1993.
Mark Sanderson. (1994) Word-Sense Disambiguation and Information Retrieval. Proceedings of SIGIR 1994.
And others…
• Examples of limited success…
Hinrich Schütze and Jan O. Pedersen. (1995) Information Retrieval Based on Word Senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval.
Rada Mihalcea and Dan Moldovan. (2000) Semantic Indexing Using WordNet Senses. Proceedings of ACL 2000 Workshop on Recent Advances in NLP and IR.
Why Disambiguation Hurts
• Bag-of-words techniques already disambiguate
– Context for each term is established in the query
• WSD is hard!
– Many words are highly polysemous, e.g., interest
– Granularity of senses is often domain/application specific
• WSD tries to improve precision
– But incorrect sense assignments would hurt recall
– Slight gains in precision do not offset large drops in recall
An Alternate Approach
• Indexing word senses “freezes” concepts at index time
• What if we expanded query terms at query time instead?
• Two approaches
– Manual thesaurus, e.g., WordNet, UMLS, etc.
– Automatically-derived thesaurus, e.g., co-occurrence statistics
dog AND cat → ( dog OR canine ) AND ( cat OR feline )
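Query-time expansion of this kind can be sketched as a small rewrite step. The function name `expand_query` and the toy thesaurus are assumptions; a real system would draw the alternatives from a resource such as WordNet or from co-occurrence statistics.

```python
def expand_query(terms, thesaurus):
    """Turn a conjunctive query into an AND of OR-groups of synonyms."""
    clauses = []
    for term in terms:
        alternatives = [term] + [s for s in thesaurus.get(term, []) if s != term]
        if len(alternatives) > 1:
            clauses.append("( " + " OR ".join(alternatives) + " )")
        else:
            clauses.append(term)  # No synonyms known: keep the bare term.
    return " AND ".join(clauses)


# Hypothetical hand-built thesaurus.
toy_thesaurus = {"dog": ["canine"], "cat": ["feline"]}
print(expand_query(["dog", "cat"], toy_thesaurus))
```

Because the index itself is untouched, the thesaurus can be changed or switched off per query, which is the contrast with freezing concepts at index time.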
Does it work?
• Yes… if done “carefully”
• User should be involved in the process
– Otherwise, poor choice of terms can hurt performance
Handling Anaphora
• Anaphora resolution: finding what the anaphor refers to (i.e., the antecedent)
• Most common example: pronominal anaphora resolution
– Simplest method works pretty well: find previous noun phrase matching in gender and number
John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.
He = John Wilkes Booth
Expanding Anaphors
• When indexing, replace anaphors with their antecedents
• Does it work?
– Somewhat
– … but can be computationally expensive
– … helps more if you want to retrieve sub-document segments
Beyond Word-Level Indexing
• Words are the wrong unit to index…
• Many multi-word combinations identify entities
– Persons: George W. Bush, Dr. Jones
– Organizations: Red Cross, United Way
– Corporations: Hewlett Packard, Kraft Foods
– Locations: Easter Island, New York City
• Entities often have finer-grained structures
Professor Stephen W. Hawking: title, first name, middle initial, last name
Cambridge, Massachusetts: city, state
Indexing Named Entities
• Why would we want to index named entities?
• Index named entities as special tokens
In reality, at the time of Edison’s [PERSON] 1879 [DATE] patent, the light bulb had been in existence for some five decades ….
• And treat special tokens like query terms
Who patented the light bulb? → patent light bulb PERSON
When was the light bulb patented? → patent light bulb DATE
• Works pretty well for question answering, but not evaluated for regular topic retrieval (?)
John Prager, Eric Brown, and Anni Coden. (2000) Question-Answering by Predictive Annotation. Proceedings of SIGIR 2000.
Indexing Phrases
• Motivation: “bank terminology” vs “terminology bank”
• Two types of phrases
– Those that make sense, e.g., “school bus”, “hot dog”
– Those that don’t, e.g., bigrams in Chinese
• Treat multi-word tokens as index terms
• Three sources of evidence:
– Dictionary lookup
– Linguistic analysis
– Statistical analysis (e.g., co-occurrence)
Known Phrases
• Compile a term list that includes phrases
– Technical terminology can be very helpful
• Index any phrase that occurs in the list
• Most effective in a limited domain
– Otherwise hard to capture most useful phrases
Syntactic Phrases
• Parsing = automatically assign structure to a sentence
[Parse tree: Sentence → Noun Phrase (The/Det quick/Adj brown/Adj fox/Noun) jumped/Verb Prepositional Phrase (over/Prep Noun Phrase (the/Det lazy/Adj black/Adj dog/Noun))]
• “Walk” the tree and extract phrases
– Index all noun phrases
– Index subjects and verbs
– Index verbs and objects
– etc.
Syntactic Variations
• What does linguistic analysis buy?
– Coordinations: lung and breast cancer → lung cancer, breast cancer
– Substitutions: inflammatory sinonasal disease → inflammatory disease, sinonasal disease
– Permutations: addition of calcium → calcium addition
Statistical Analysis
• Automatically discover phrases based on co-occurrence probabilities
• If terms are not independent, they may form a phrase
• Use this method to automatically learn a phrase dictionary
P(“kick the bucket”) = P(“kick”) P(“the”) P(“bucket”) ?
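The independence test above is often approximated with pointwise mutual information (PMI) over adjacent word pairs. PMI is one possible statistic, swapped in here for illustration; the function name `bigram_pmi` and the toy sentence are assumptions, and the probabilities are plain maximum-likelihood estimates.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent pair by log( P(a, b) / (P(a) * P(b)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


toy_tokens = "hot dog stand sells hot dog and cold tea".split()
pmi = bigram_pmi(toy_tokens)
print(pmi[("hot", "dog")])
```

A score well above zero means the pair occurs together more often than independence would predict, making it a candidate entry for a learned phrase dictionary.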
Does Phrasal Indexing Work?
• Yes…
• But the gains are so small they’re not worth the cost
• Primary drawback: too slow!
Sample Phrase-Indexing Results [Zhai 97]
Using a single type of phrase to supplement single words helps
But using all phrases to supplement single words doesn’t help as much
What about ambiguity?
• Different documents with the same keywords may have different meanings…
What do frogs eat?
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.
keywords: frogs, eat
What is the largest volcano in the Solar System?
(1) Mars boasts many extreme geographic features; for example, Olympus Mons, is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
keywords: largest, volcano, solar, system
Indexing Relations
• Instead of terms, index syntactic relations between entities in the text
Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
< frogs subject-of eat >
< insects object-of eat >
< animals object-of eat >
< adult modifies frogs >
< small modifies animals >
Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
< alligators subject-of eat >
< kinds object-of animals >
< small modifies animals >
From the relations, it is clear who’s eating whom!
Are syntactic relations enough?
• Consider this example:
John broke the window.
The window broke.
< John subject-of break >
< window subject-of break >
“John” and “window” are both subjects…
But John is the person doing the breaking (or “agent”), and the window is the thing being broken (or “theme”)
• Syntax sometimes isn’t enough… we need semantics (or meaning)!
• Semantics, for example, allows us to relate the following two fragments:
The barbarians destroyed the city…
The destruction of the city by the barbarians…
event: destroy
agent: barbarians
theme: city
Semantic Roles
• Semantic roles are invariant with respect to syntactic expression
Mary loaded the truck with hay.
Hay was loaded onto the truck by Mary.
event: load
agent: Mary
material: hay
destination: truck
• The idea:
– Identify semantic roles
– Index “frame structures” with filled slots
– Retrieve answers based on semantic-level matching
Does it work?
• No, not really…
• Why not?
– Syntactic and semantic analysis is difficult: errors offset whatever gain is gotten
– As with WSD, these techniques are precision-enhancers… recall usually takes a dive
– It’s slow!
Alternative Approach
• Sophisticated linguistic analysis is slow!
– Unnecessary processing can be avoided by query time analysis
• Two-stage retrieval
– Use standard document retrieval techniques to fetch a candidate set of documents
– Use passage retrieval techniques to choose a few promising passages (e.g., paragraphs)
– Apply sophisticated linguistic techniques to pinpoint the answer
• Passage retrieval
– Find “good” passages within documents
– Key Idea: locate areas where lots of query terms appear close together
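The key idea for passage retrieval can be sketched as a sliding window that counts distinct query terms. The function name `best_passage`, the fixed token window, and the tie-breaking rule (earliest window wins) are assumptions for this illustration.

```python
def best_passage(doc_tokens, query_terms, window=5):
    """Return (start, hits, passage) for the window with most distinct query terms."""
    query = set(query_terms)
    best_start, best_hits = 0, -1
    for start in range(max(len(doc_tokens) - window, 0) + 1):
        hits = len(query & set(doc_tokens[start:start + window]))
        if hits > best_hits:  # Strict comparison keeps the earliest best window.
            best_start, best_hits = start, hits
    return best_start, best_hits, doc_tokens[best_start:best_start + window]


toy_doc = "adult frogs eat mainly insects while alligators eat fish snakes and frogs".split()
print(best_passage(toy_doc, ["frogs", "eat"], window=3))
```

Only the few passages selected this way would then be handed to the slower linguistic analysis.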
Summary
• IR is hard because language is rich and complex (among other reasons)
• Two general approaches to the problem
– Attempt to find the best unit of indexing
– Try to fix things at query time
• It is hard to predict a priori what techniques work
– Questions must be answered experimentally
• Words are really the wrong thing to index
– But there isn’t really a better alternative…
What You Should Know
• A lot of effort has been made to improve indexing units
• Most results, however, are not promising
Dependence Language Model for Information Retrieval
Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, Guihong Cao,
Dependence Language Model for Information Retrieval, SIGIR 2004
Reference
• Structure and performance of a dependency language model. Ciprian Chelba, David Engle, et al. Eurospeech 1997.
• Parsing English with a Link Grammar. Daniel D. K. Sleator and Davy Temperley. Technical Report CMU-CS-91-196, 1991.
Why do we use the independence assumption?
• The independence assumption is one of the most widely adopted assumptions in probabilistic retrieval theory.
• Why?
– It makes retrieval models simpler.
– It makes the retrieval operation tractable.
• The shortcoming of the independence assumption
– The independence assumption does not hold in textual data.
Recent ideas on the dependence assumption
• Bigram
– Some language modeling approaches try to incorporate word dependence by using bigrams.
– Shortcomings:
• Some word dependencies exist not only between adjacent words but also between more distant ones.
• Some adjacent words are not actually related.
– The bigram language model showed only marginally better effectiveness than the unigram model.
• Bi-term
– The bi-term language model is similar to the bigram model except that the constraint on term order is relaxed.
– “information retrieval” and “retrieval of information” will be assigned the same probability of generating the query.
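The difference between an ordered bigram and an order-free bi-term can be seen in a small sketch. Averaging the two bigram directions is one simple way to relax term order; the function names, the maximum-likelihood estimates without smoothing, and the toy document are assumptions for illustration.

```python
from collections import Counter


def bigram_prob(prev, word, tokens):
    """Maximum-likelihood P(word | prev) from adjacent pairs in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


def biterm_prob(prev, word, tokens):
    """Order-free variant: average of the two bigram directions."""
    return 0.5 * (bigram_prob(prev, word, tokens) + bigram_prob(word, prev, tokens))


toy_doc = "information retrieval and retrieval of information".split()
print(bigram_prob("information", "retrieval", toy_doc))
print(bigram_prob("retrieval", "information", toy_doc))
print(biterm_prob("information", "retrieval", toy_doc))
print(biterm_prob("retrieval", "information", toy_doc))
```

The bigram estimate depends on which word comes first, while the bi-term estimate is the same for both orders.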
Structure and performance of a dependency language model
Introduction
• This paper presents a maximum entropy language model that incorporates both syntax and semantics via a dependency grammar.
• Dependency grammar: expresses the relations between words as a directed graph, which can incorporate the predictive power of words that lie outside of bigram or trigram range.
Introduction
• Why do we use N-grams?
– Assume $S = w_0, w_1, w_2, \ldots, w_n$, so that
$$P(S) = P(w_0)\,P(w_1 \mid w_0) \cdots P(w_n \mid w_0 \ldots w_{n-1})$$
– If we want to record $P(w_n \mid w_0 \ldots w_{n-1})$, we need to store $V^{i-1}(V-1)$ independent parameters (for the $i$-th word, with vocabulary size $V$)
• The drawback of N-grams
– An N-gram model blindly discards relevant words that lie N or more positions in the past.
Structure of the model
• Develop an expression for the joint probability $P(S, K)$, where $K$ is the set of linkages in the sentence.
• Then we get
$$P(S) = \sum_K P(S, K)$$
• Assume that the sum is dominated by a single term; then
$$P(S) \approx P(S, K^*), \quad \text{where } K^* = \arg\max_K P(S, K)$$
A dependency language model of IR
• Given a query $Q = (q_1 \ldots q_m)$, we want to rank documents by $P(Q \mid D)$
– Previous work:
• Assume independence between query terms:
$$P(Q \mid D) = \prod_{i=1 \ldots m} P(q_i \mid D)$$
– New work:
• Assume that term dependencies in a query form a linkage $L$:
$$P(Q \mid D) = \sum_L P(Q, L \mid D) = \sum_L P(L \mid D)\, P(Q \mid L, D)$$
A dependency language model of IR
$$P(Q \mid D) = \sum_L P(Q, L \mid D) = \sum_L P(L \mid D)\, P(Q \mid L, D)$$
• Assume that the sum $\sum_L P(Q, L \mid D)$ over all the possible $L$ is dominated by a single term $L^*$:
$$P(Q \mid D) = P(L \mid D)\, P(Q \mid L, D) \quad \text{such that } L = \arg\max_L P(L \mid D)$$
• Assume that each term is dependent on exactly one related query term generated previously.
[Figure: linked query terms $q_h$, $q_i$, $q_j$]
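A dependency language model of IR
$$P(Q \mid D) = P(L \mid D)\, P(Q \mid L, D) \quad \text{such that } L = \arg\max_L P(L \mid D)$$
[Figure: linked query terms $q_h$, $q_i$, $q_j$]
$$\begin{aligned}
P(Q \mid L, D) &= P(q_h \mid D) \prod_{(i,j) \in L} P(q_j \mid q_i, L, D) \\
&= P(q_h \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid L, D)} \\
&= P(q_h \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)\, P(q_j \mid L, D)}{P(q_i \mid L, D)\, P(q_j \mid L, D)}
\end{aligned}$$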
A dependency language model of IR
$$P(Q \mid L, D) = P(q_h \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)\, P(q_j \mid L, D)}{P(q_i \mid L, D)\, P(q_j \mid L, D)}$$
• Assume
– The generation of a single term is independent of $L$:
$$P(q_j \mid L, D) = P(q_j \mid D)$$
$$\begin{aligned}
P(Q \mid L, D) &= P(q_h \mid D) \prod_{j \neq h} P(q_j \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid D)\, P(q_j \mid D)} \\
&= \prod_{i=1 \ldots m} P(q_i \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid D)\, P(q_j \mid D)}
\end{aligned}$$
• By this assumption, we would have arrived at the same result by starting from any term. $L$ can be represented as an undirected graph.
A dependency language model of IR
$$P(Q \mid D) = P(L \mid D)\, P(Q \mid L, D) \quad \text{such that } L = \arg\max_L P(L \mid D)$$
$$P(Q \mid L, D) = \prod_{i=1 \ldots m} P(q_i \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid D)\, P(q_j \mid D)}$$
Taking the log:
$$\log P(Q \mid D) = \log P(L \mid D) + \sum_{i=1 \ldots m} \log P(q_i \mid D) + \sum_{(i,j) \in L} MI(q_i, q_j \mid L, D)$$
$$MI(q_i, q_j \mid L, D) = \log \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid D)\, P(q_j \mid D)}$$
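The log-form ranking score is a sum of three parts: the linkage term, the per-term unigram terms, and one mutual-information term per link. The sketch below only combines already-estimated quantities; the function names and the input values are hypothetical.

```python
import math


def dependence_log_score(log_p_linkage, term_probs, link_mi_values):
    """log P(Q|D) = log P(L|D) + sum_i log P(q_i|D) + sum over links of MI."""
    term_part = sum(math.log(p) for p in term_probs)
    return log_p_linkage + term_part + sum(link_mi_values)


def unigram_log_score(term_probs):
    """Baseline: log P(Q|D) under full term independence."""
    return sum(math.log(p) for p in term_probs)


# Hypothetical smoothed term probabilities and link MI values.
print(unigram_log_score([0.1, 0.2]))
print(dependence_log_score(-0.3, [0.1, 0.2], [0.5]))
```

With no links and a linkage log-probability of zero, the score reduces to the unigram model, which is how the independence model appears as a special case.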
Parameter Estimation
$$\log P(Q \mid D) = \log P(L \mid D) + \sum_{i=1 \ldots m} \log P(q_i \mid D) + \sum_{(i,j) \in L} MI(q_i, q_j \mid L, D)$$
• Estimating $P(L \mid D)$
– Assume that the linkages are independent:
$$P(L \mid D) = \prod_{l \in L} P(l \mid D)$$
– Then count the relative frequency of a link $l$ between $q_i$ and $q_j$, given that they appear in the same sentence:
$$F(R \mid q_i, q_j) = \frac{C(q_i, q_j, R)}{C(q_i, q_j)}$$
where $C(q_i, q_j, R)$ is the number of times $q_i$ and $q_j$ have a link in a sentence in the training data, and $F(R \mid q_i, q_j)$ is a score: the link frequency of query terms $q_i$ and $q_j$.
[Inset equation: $P(l \mid Q)$ in terms of the scores $F(R \mid q_i, q_j)$ over pairs $(i, j)$]
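Parameter Estimation
$$P(l \mid Q) \propto F(R \mid q_i, q_j)$$
$$L = \arg\max_L P(L \mid Q) = \arg\max_L \prod_{(i,j) \in L} F(R \mid q_i, q_j)$$
$$P(L \mid D) = P(L \mid Q, D) = \prod_{l \in L} P(l \mid D) \propto \prod_{(i,j) \in L} F(R \mid q_i, q_j)$$
[Inset equation, labeled “assumption”: $P(l \mid Q)$ in terms of the scores $F(R \mid q_i, q_j)$ over pairs $(i, j)$]
$$F(R \mid q_i, q_j) = (1 - \lambda)\, F_D(R \mid q_i, q_j) + \lambda\, F_C(R \mid q_i, q_j)$$
Assumption: $P(L \mid Q) = P(L \mid D)$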
Parameter Estimation
• Estimating $P(q_i \mid D)$
– The document language model is smoothed with a Dirichlet prior
$$P'(q_i \mid D) = (1 - \lambda)\, P(q_i \mid D) + \lambda\, P(q_i \mid C)$$
with the probabilities estimated from the counts $C_D(q_i)$ and $C_C(q_i)$ of $q_i$ in the document and in the collection.
[Slide labels on the equation: Dirichlet distribution; constant discount]
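For reference, the standard Dirichlet-prior estimate is $(c(w, D) + \mu\, P(w \mid C)) / (|D| + \mu)$, which is the same as interpolating the document and collection models with weight $\lambda = \mu / (|D| + \mu)$. The sketch below implements that textbook form; the function name, the value of `mu`, and the toy counts are assumptions, not settings from the paper.

```python
def dirichlet_prob(term, doc_counts, coll_counts, mu=2000.0):
    """Dirichlet-smoothed P(term | D) from raw term counts."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    p_coll = coll_counts.get(term, 0) / coll_len  # Collection language model.
    return (doc_counts.get(term, 0) + mu * p_coll) / (doc_len + mu)


# Hypothetical counts; mu is kept tiny so the arithmetic is easy to follow.
toy_doc_counts = {"a": 3, "b": 1}
toy_coll_counts = {"a": 10, "b": 30, "c": 60}
for t in ["a", "b", "c"]:
    print(t, dirichlet_prob(t, toy_doc_counts, toy_coll_counts, mu=4.0))
```

A term absent from the document ("c" here) still receives non-zero probability from the collection model, and the smoothed probabilities still sum to one.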
Parameter Estimation
• Estimating $MI(q_i, q_j \mid L, D)$
$$\begin{aligned}
MI(q_i, q_j \mid L, D) &= \log \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid L, D)\, P(q_j \mid L, D)} \\
&= \log \frac{C_D(q_i, q_j, R) / N}{\bigl(C_D(q_i, *, R) / N\bigr)\bigl(C_D(*, q_j, R) / N\bigr)} \\
&= \log \frac{C_D(q_i, q_j, R)\, N}{C_D(q_i, *, R)\, C_D(*, q_j, R)}
\end{aligned}$$
where $N = C_D(*, *, R)$
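The count-based form of the mutual information is a one-line computation once the link counts are available. In this sketch the argument names are hypothetical stand-ins for $C_D(q_i, q_j, R)$, $C_D(q_i, *, R)$, $C_D(*, q_j, R)$, and $N$; zero counts are rejected rather than smoothed, which is a simplification.

```python
import math


def link_mutual_information(c_ij, c_i_any, c_any_j, n):
    """Compute log( c_ij * n / (c_i_any * c_any_j) ) from link counts."""
    if min(c_ij, c_i_any, c_any_j, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log(c_ij * n / (c_i_any * c_any_j))


# Hypothetical counts: the pair is linked 10 times out of 1000 links overall.
print(link_mutual_information(10, 20, 50, 1000))
```

When the pair is linked exactly as often as chance predicts (for example `c_ij = 1` with the other counts above), the value is zero and the link adds nothing to the score.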
Experimental Setting
• Words were stemmed and stop words were removed.
• Queries are TREC topics 202 to 250 on TREC disks 2 and 3.
The flow of the experiment
[Flowchart]
• Query: find the linkage of the query, i.e., find the max $L$ by $\max_l P(l \mid Q)$
• Training data (for weight computation): count the frequencies $F_D(R \mid q_i, q_j)$ and $F_C(R \mid q_i, q_j)$, get $F(R \mid q_i, q_j)$, then get $P(L \mid D)$
• Document: count the frequencies $C_C(q_i)$ and $C_D(q_i)$, then get $P(q_i \mid D)$
• Document: count the frequencies $C_D(q_i, *, R)$, $C_D(*, q_j, R)$, $C_D(q_i, q_j, R)$, and $C_D(*, *, R)$, then get $MI(q_i, q_j \mid L, D)$
• Combine the results to rank the documents
Result-BM & UG
• BM: binary independence retrieval model
• UG: unigram language model approach
• UG achieves performance similar to, or worse than, that of BM.
Result- DM
• DM: dependency model
• The improvement of DM over UG is statistically significant.
Result- BG
• BG: bigram language model
• BG is slightly worse than DM in five out of six TREC collections but substantially outperforms UG in all collections.
Result- BT1 & BT2
• BT: bi-term language model
$$P_{BT1}(q_i \mid q_{i-1}, D) = \frac{1}{2}\bigl(P_{BG}(q_i \mid q_{i-1}, D) + P_{BG}(q_{i-1} \mid q_i, D)\bigr)$$
$$P_{BT2}(q_i \mid q_{i-1}, D) = \frac{C_D(q_{i-1}, q_i) + C_D(q_i, q_{i-1})}{2 \min\{C_D(q_{i-1}),\, C_D(q_i)\}}$$
Conclusion
• This paper introduces the linkage of a query as a hidden variable.
• Each term is generated in turn, depending on other related terms according to the linkage.
– This approach covers several language modeling approaches as special cases.
• In the paper’s experiments, the dependence model substantially outperforms the unigram, bigram, and classical probabilistic retrieval models.
CIKM 2007
Opinion Retrieval from Blogs
Wei Zhang1 Clement Yu1 Weiyi Meng2
[email protected] [email protected] [email protected]
1 Department of Computer Science, University of Illinois at Chicago
2 Department of Computer Science, Binghamton University
Outline
• Overview of the opinion retrieval
• Topic retrieval
• Opinion identification
• Ranking documents by opinion similarity
• Experimental results
Overview of the Opinion Retrieval
• Opinion retrieval
– Given a query, find documents that have subjective opinions about the query
A query: “book”
Relevant: “This is a very good book.”
Irrelevant: “This book has 123 pages.”
Overview of Opinion Retrieval
• Introduced at the TREC 2006 Blog Track
– 14 groups, 57 submitted runs in TREC 2006
– 20 groups, 104 runs in TREC 2007 (ongoing)
• Key problems
– Opinion features
– Query-related opinions
– Ranking the retrieved documents
Our Algorithm
• Query + document set → retrieved documents → opinionative documents → query-related opinionative documents
Topic Retrieval
• Retrieve query-relevant documents
• No opinion involved
• Features
– Phrase recognition
– Query expansion
– Two document-query similarities
Topic Retrieval – Phrase Recognition
• Phrases capture the semantic relationship among words
• Used for computing phrase similarity
• 4 types
– Proper noun: “University of Lisbon”
– Dictionary phrase: “computer science”
– Simple phrase: “white car”
– Complex phrase: “small white car”
Topic Retrieval – Query Expansion
• Find the synonyms
– “wto” → “world trade organization”
– Synonyms get the same importance as the original term
• Add additional terms
– “wto” → negotiate, agreements, tariffs, …
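The two expansion steps can be sketched as a weighted term map. The lookup tables below are hypothetical stand-ins, since the slides do not say where synonyms or additional terms come from, and the lower weight for additional terms is an assumption of this sketch rather than something the slides state.

```python
# Hypothetical lookup tables, for illustration only.
SYNONYMS = {"wto": ["world trade organization"]}
RELATED_TERMS = {"wto": ["negotiate", "agreements", "tariffs"]}


def expand_query(query, related_weight=0.5):
    """Return {term: weight} for the original terms plus their expansions."""
    expanded = {}
    for term in query.lower().split():
        expanded[term] = 1.0
        for synonym in SYNONYMS.get(term, []):
            # Synonyms keep the same importance as the original term.
            expanded[synonym] = 1.0
        for related in RELATED_TERMS.get(term, []):
            # Down-weighting additional terms is an assumption of this sketch.
            expanded.setdefault(related, related_weight)
    return expanded


print(expand_query("WTO"))
```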
Topic Retrieval - Similarity
• Sim(Query, Doc) = <Sim_P, Sim_T>
• Phrase similarity
– Based on whether or not the document contains a query phrase
– Sim_P = sum ( idf(P_i) ) over the query phrases P_i found in the document
• Term similarity
– Sim_T = sum of the Okapi scores of all the query terms
• Document ranking
– D1 is ranked higher than D2 if (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
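The ranking rule above is a lexicographic comparison, which Python tuples provide directly. The sketch below assumes substring matching for phrases and idf = log(N / df); both are simplifications chosen for illustration, and the Sim_T values are stand-in numbers rather than real Okapi scores.

```python
import math


def phrase_idf(phrase, docs):
    """idf of a phrase as log(N / document frequency); the formula is an assumption."""
    df = sum(1 for text in docs.values() if phrase in text)
    return math.log(len(docs) / df) if df else 0.0


def sim_p(query_phrases, doc_text, docs):
    """Sim_P: sum of idf over the query phrases that the document contains."""
    return sum(phrase_idf(p, docs) for p in query_phrases if p in doc_text)


def rank(query_phrases, docs, sim_t):
    """Sort by Sim_P first and break ties with Sim_T (tuple comparison)."""
    def key(doc_id):
        return (sim_p(query_phrases, docs[doc_id], docs), sim_t[doc_id])
    return sorted(docs, key=key, reverse=True)


docs = {
    "d1": "the world trade organization met today",
    "d2": "trade talks and the organization of world markets",
    "d3": "the world trade organization agreed on tariffs",
}
# Stand-in term scores; in the slides Sim_T is a sum of Okapi scores.
sim_t = {"d1": 2.0, "d2": 5.0, "d3": 3.5}
print(rank(["world trade organization"], docs, sim_t))
```

Here d2 has the highest term score but lacks the phrase, so it ranks last; d3 beats d1 only through the Sim_T tie-break.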
Opinion Identification
• Training: subjective training data + objective training data → feature selection → SVM classifier
• Application: retrieved documents (from topic retrieval) → SVM classifier → opinionative documents (to opinion ranking)
Opinion Identification – Training Data
• Subjective training data
– Review web sites
– Documents having opinionative phrases
• Objective training data
– Dictionary entries
– Documents not having opinionative phrases
Opinion Identification – Feature Selection
• Select the words that express opinions
• Pearson’s chi-square test
– Tests the independence between the subjectivity label and a word via a contingency table
– The table entries are counts of sentences
• Features are unigrams and bigrams
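As a worked example of the test, the statistic for one candidate word can be computed from a 2×2 table of sentence counts (subjective or objective, containing the word or not). The counts below are invented for illustration, and the function name is hypothetical.

```python
def chi_square(subj_with, subj_without, obj_with, obj_without):
    """Pearson chi-square statistic for a 2x2 table of sentence counts."""
    table = [[subj_with, subj_without], [obj_with, obj_without]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the word and the label were independent.
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                statistic += (table[i][j] - expected) ** 2 / expected
    return statistic


# Invented counts: the word appears in 30 of 100 subjective sentences
# and in 5 of 100 objective sentences.
print(round(chi_square(30, 70, 5, 95), 3))
```

A larger statistic means the word is less likely to be independent of the subjectivity label; with one degree of freedom, values above about 3.84 are significant at the 0.05 level. Which threshold the authors used is not stated on the slide.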
Opinion Identification – Classifier
• A support vector machine (SVM) classifier
• Training: subjective sentences + objective sentences → feature vector representation (over the selected features) → training → SVM classifier
Opinion Identification – Classifier
• Apply the SVM classifier
– Split the document into sentences: Sentence 1, Sentence 2, …, Sentence n
– The SVM classifier labels each sentence, e.g. Label 1: objective, Label 2: subjective, …, Label n: objective
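The sentence-by-sentence labelling step can be sketched as follows. The classifier here is a keyword-based stand-in, not an SVM; a trained model would be passed in through the `classify` argument. The sentence splitter and the word list are assumptions made for illustration.

```python
import re


def split_sentences(document):
    """Naive splitter on '.', '!' or '?' followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def toy_classifier(sentence):
    """Stand-in for the trained SVM: flags a few hypothetical opinion words."""
    opinion_words = {"good", "bad", "love", "hate", "terrible", "great"}
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return "subjective" if words & opinion_words else "objective"


def label_sentences(document, classify=toy_classifier):
    """Return (sentence, label) pairs, one per sentence of the document."""
    return [(s, classify(s)) for s in split_sentences(document)]


doc = "This book has 123 pages. This is a very good book. It was printed in 2006."
for sentence, label in label_sentences(doc):
    print(label, "|", sentence)
```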
Opinion Similarity - Query-Related Opinions
• Find the query-related opinions
[Figure: within a document, a text window links an occurrence of the query to a nearby opinionative sentence]
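One way to read the text-window idea is that an opinionative sentence counts as query-related when it lies close to a sentence that mentions the query. The sketch below measures the window in sentences and matches query terms by substring; the window size, the unit, and the function name are all assumptions, since the slide gives none of them.

```python
def query_related(sentences, labels, query_terms, window=2):
    """Indices of subjective sentences within `window` sentences of a query hit."""
    terms = [t.lower() for t in query_terms]
    hits = [
        i for i, sentence in enumerate(sentences)
        if any(term in sentence.lower() for term in terms)
    ]
    related = []
    for i, label in enumerate(labels):
        if label == "subjective" and any(abs(i - h) <= window for h in hits):
            related.append(i)
    return related


sentences = [
    "I bought a book yesterday.",
    "It is really great.",
    "The weather was fine.",
    "We had lunch.",
    "The soup was terrible.",
]
labels = ["objective", "subjective", "objective", "objective", "subjective"]
print(query_related(sentences, labels, ["book"], window=1))
```

Only the second sentence is kept: the opinion about the soup is subjective but lies outside the window around the query.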
Opinion Similarity – Similarity 1
• Assumption 1
– Higher topic relevance → higher rank
• OSim_ir = Sim(Query, Doc)
Opinion Similarity – Similarity 2
• Assumption 2
– More query-related opinions → higher rank
• OSim_stcc: total number of query-related opinionative sentences
• OSim_stcs: total score of the query-related opinionative sentences
Opinion Similarity – Similarity 3
• A linear combination of Similarity 1 and Similarity 2
– a * OSim_ir + (1-a) * OSim_stcc
– b * OSim_ir + (1-b) * OSim_stcs
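A small sketch of ranking with the combined similarity. It assumes both component scores have already been scaled to comparable ranges, which the slide does not discuss, and the weight value and document scores are made up for illustration.

```python
def combined_score(osim_ir, osim_opinion, weight):
    """weight * OSim_ir + (1 - weight) * opinion similarity (stcc or stcs)."""
    return weight * osim_ir + (1 - weight) * osim_opinion


def rank_documents(docs, weight=0.5):
    """docs maps doc id -> (OSim_ir, opinion similarity); best document first."""
    scored = {
        doc_id: combined_score(ir, opinion, weight)
        for doc_id, (ir, opinion) in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Invented scores: d1 is highly relevant but carries no opinions.
docs = {"d1": (0.9, 0.0), "d2": (0.6, 0.8), "d3": (0.4, 0.4)}
print(rank_documents(docs, weight=0.5))
print(rank_documents(docs, weight=1.0))
```

With weight 1.0 the ranking falls back to topic relevance alone (Similarity 1); lowering the weight lets opinion-rich documents such as d2 move up.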
Opinion Similarity – Experimental Results
• TREC 2006 Blog Track data
– 50 queries, 3.2 million blog documents
• UIC at TREC 2006 Blog Track
– Title-only queries: ranked first
• 28% - 32% higher than the best TREC 2006 scores
• Lessons learned
– More training data
– Combined similarity function
Conclusions
• Designed and implemented an opinion retrieval system that combines IR with text classification
• The best known retrieval effectiveness on TREC 2006 blog data
• Extend to polarity classification: positive/negative/mixed
• Plan to improve feature selection
What You Should Know
• Language models offer more opportunities for studying weighting of phrase-indexing (including weights on).
• Opinion finding is a hot area now. A major challenge is how to combine a sentiment score with a topic score.
Future Research Directions in NLP for IR
• It’s time to revisit many old issues:
– We have better NLP techniques now
– We have better probabilistic retrieval models now
– Possible topics: phrase indexing, word sense disambiguation, TF-effect of NLP,…
• Language models seem to offer better opportunities for adopting NLP results (e.g., dependency LM)
• Sentiment retrieval
• NLP for helping difficult topics
Roadmap
• This lecture: NLP for IR (understanding documents)
• Next lecture: Topic models for text mining