1
Fusion Approach to Finding Opinions in Blogosphere
Kiduk Yang, Ning Yu, Alejandro Valerio, Hui Zhang, and Weimao Ke
Web Information Discovery Integrated Tool Laboratory (WIDIT)
Indiana University
Assistant Professor of Information Science
ICWSM 2007
Note: the paper reuses the content of WIDIT in the TREC-2006 Blog track; the figures and tables are identical, with references added.
WIDIT also ranked first in spam handling in the Blog track.
2/21
WIDIT’s Fusion Approach
Adapt a topical retrieval system to the opinion retrieval task
  Apply the existing system to retrieve blogs about a target (i.e., on-topic retrieval): basic IR
  Optimize on-topic retrieval to address the challenges of short queries: reranking
  Identify opinion blogs by leveraging evidence of subjectiveness/opinion (i.e., opinion identification): the added step
Research Question
  What is the evidence of opinion, and how can it be leveraged to retrieve opinionated blogs?
3/21
Sources of Evidence
Opinion Lexicon
  a set of terms often used in expressing opinions (e.g., “Skype sucks”, “Skype rocks”, “Skype is cool”)
Opinion Collocations: contextual evidence
  collocations used to mark adjacent statements as opinions (e.g., “I believe God exists”, “God is dead to me”)
Opinion Morphology
  when expressing strong opinions or perspectives, people often use morphed word forms for emphasis (“Skype is soooo buggy”, “Skype is bugfested”)
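The three sources of evidence can be illustrated with a minimal sketch. The tiny lexicon, the collocation list, and the function name `opinion_evidence` are hypothetical placeholders, not the WIDIT resources, and the substring-based collocation count is deliberately naive.

```python
import re

# Hypothetical toy resources; the real lexicons are much larger.
OPINION_LEXICON = {"sucks", "rocks", "cool"}
COLLOCATIONS = ("i believe", "to me")
# A letter repeated three or more times in a row, as in "soooo".
ELONGATED = re.compile(r"([a-z])\1{2,}")

def opinion_evidence(text):
    """Count lexicon, collocation, and morphology evidence in a text."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)
    return {
        "lexicon": sum(1 for t in tokens if t in OPINION_LEXICON),
        "collocation": sum(lowered.count(c) for c in COLLOCATIONS),
        "morphology": sum(1 for t in tokens if ELONGATED.search(t)),
    }
```

Each count could then serve as one opinion score to be combined with the topical score.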
4/21
5/21
Related Works
  It is still arguable whether hyperlinks are good indicators of subjective affiliations
  Much research focuses on product and customer reviews
  Research on subjectivity by Wiebe and colleagues
  WIDIT’s IR system
6/21
(1/4) Initial Retrieval – term indexing
  Hyphenated words were split into parts
  Markup tags and stopwords were removed; stopwords are
    words in a standard stopword list
    non-alphabetical words
    words consisting of more than 25 or fewer than 3 characters
    words that contain 3 or more repeated characters
  A modified version of the simple plural remover was applied
  Acronyms and abbreviations were kept
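The filtering rules above can be sketched roughly as follows. The stopword list is a hypothetical stand-in for the standard list, the name `index_terms` is illustrative, and the plural remover and special handling of acronyms are omitted.

```python
import re

# Hypothetical tiny stopword list standing in for a standard one.
STOPWORDS = {"the", "and", "for", "with", "that", "this"}
TAG_RE = re.compile(r"<[^>]+>")        # markup tags
REPEAT_RE = re.compile(r"(.)\1\1")     # 3 or more repeated characters

def index_terms(text):
    """Return the index terms of a document under the slide's rules."""
    text = TAG_RE.sub(" ", text)       # remove markup tags
    text = text.replace("-", " ")      # split hyphenated words into parts
    terms = []
    for token in text.split():
        word = token.strip(".,;:!?\"'()[]").lower()
        if not word.isalpha():                   # non-alphabetical words
            continue
        if len(word) < 3 or len(word) > 25:      # too short or too long
            continue
        if word in STOPWORDS or REPEAT_RE.search(word):
            continue
        terms.append(word)
    return terms
```

For example, `index_terms("<p>Skype is soooo buggy</p>")` keeps `skype` and `buggy` and drops the rest.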
7/21
(2/4) Initial Retrieval – incremental indexing
  Goal: scale up to large collections
  Index the document collection in fixed-size subcollections that are searched in parallel
  Collection term statistics are derived after the creation of the subcollections
  Subcollection retrieval results can therefore simply be merged without any need for retrieval score normalization
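A minimal sketch of why collection-wide statistics make merging trivial: every subcollection is scored with the same global IDF, so the scores are directly comparable. The tf-idf scoring and all function names here are illustrative assumptions, not the WIDIT implementation, and the subcollections are searched sequentially rather than in parallel.

```python
import heapq
import math
from collections import Counter

def build_subcollections(docs, size):
    """Split a list of (doc_id, terms) pairs into fixed-size chunks."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def global_idf(subcollections):
    """Derive collection-wide IDF after the subcollections exist."""
    df, n_docs = Counter(), 0
    for sub in subcollections:
        for _, terms in sub:
            df.update(set(terms))
            n_docs += 1
    return {t: math.log(1 + n_docs / d) for t, d in df.items()}

def search_sub(sub, query, idf):
    """Score one subcollection; returns (score, doc_id), best first."""
    scored = []
    for doc_id, terms in sub:
        tf = Counter(terms)
        score = sum(tf[q] * idf.get(q, 0.0) for q in query)
        if score > 0:
            scored.append((score, doc_id))
    return sorted(scored, reverse=True)

def search(subcollections, query, idf):
    """Merge the sorted per-subcollection runs without normalization."""
    runs = [search_sub(sub, query, idf) for sub in subcollections]
    return list(heapq.merge(*runs, reverse=True))
```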
8/21
(3/4) Initial Retrieval – query indexing
  Identify nouns and noun phrases
  Expand acronyms and abbreviations
  Extract the non-relevant portion of topic descriptions
  Use these to formulate various expanded versions of the query (query expansion submodules)
9/21
(4/4) Initial Retrieval - Retrieval
  Vector Space Model: the SMART length-normalized term weights (score for term k and document i)
  Probabilistic model: the Okapi BM25 formula
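The slide's formulas did not survive extraction, so the following is only a sketch using common textbook forms: cosine-normalized log term frequency for the vector space weights, and a standard BM25 variant with conventional defaults `k1=1.2` and `b=0.75`. The exact variants and parameters used by WIDIT may differ.

```python
import math
from collections import Counter

def length_normalized_weights(terms):
    """Log-tf weights divided by the document vector length."""
    tf = Counter(terms)
    raw = {t: 1.0 + math.log(f) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized doc against a query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for q in set(query):
        df = sum(1 for d in docs if q in d)
        if df == 0 or tf[q] == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        denom = tf[q] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[q] * (k1 + 1) / denom
    return score
```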
10/21
On-Topic Retrieval Optimization
  Rerank based on a set of topic-related reranking factors
    Exact Match: exact query string occurrence
    Proximity Match: padded query string occurrence
    Noun Phrase Match
    Non-Rel Match
  Steps
    Compute topic reranking scores for each of the top N results
    Categorize the top N results into reranking groups designed to preserve the initial ranking while allowing appropriate rank-boosting for a given combination of reranking factors
    Boost the rank of documents using reranking scores within groups
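The reranking steps can be sketched with two of the factors. The grouping rule (exact match first, then proximity match, then the rest) and the use of the initial score as the within-group reranking score are assumptions made for illustration; the slide does not give the actual group definitions.

```python
import re

def exact_match(query, text):
    """Exact query string occurrence."""
    return query.lower() in text.lower()

def proximity_match(query, text, pad=2):
    """Query words in order, with at most `pad` words between them."""
    words = [re.escape(w) for w in query.lower().split()]
    gap = r"(?:\W+\w+){0,%d}?\W+" % pad
    pattern = r"\b" + gap.join(words) + r"\b"
    return re.search(pattern, text.lower()) is not None

def rerank(results, query, top_n=100):
    """results: list of (doc_id, score, text), best first."""
    head, tail = results[:top_n], results[top_n:]

    def key(item):
        _, score, text = item
        if exact_match(query, text):
            group = 0
        elif proximity_match(query, text):
            group = 1
        else:
            group = 2
        # Within a group, the initial score decides the order.
        return (group, -score)

    return sorted(head, key=key) + tail
```

Documents below the top N keep their original positions, so only the head of the ranking is reordered.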
11/21
Opinion Identification
Opinion Term Module
  frequency of terms that occur frequently only in opinion blogs
Rare Term Module (e.g., “sooo good”)
  extract low-frequency terms from positive training data
  remove dictionary terms
  examine the remainder to construct an RT lexicon and regular expressions
  identify creative term patterns used in opinion blogs
IU Module (‘I believe’, ‘my assessment’, ‘good for you’)
  counts the frequency of “padded” IU collocations within sentence boundaries
Adjective-Verb Module
  judges the density of Potential Subjective Elements (PSE) (see next slide)
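As an illustration of the IU Module, the sketch below counts padded collocations without crossing sentence boundaries. The three collocations come from the slide; the padding width, the sentence splitter, and the function names are assumptions.

```python
import re

# Collocations from the slide; a real list would be much longer.
IU_COLLOCATIONS = [("i", "believe"), ("my", "assessment"),
                   ("good", "for", "you")]

def count_padded(tokens, colloc, pad):
    """Count occurrences of colloc allowing up to `pad` words in each gap."""
    hits = 0
    for start, tok in enumerate(tokens):
        if tok != colloc[0]:
            continue
        pos, matched = start, True
        for word in colloc[1:]:
            window = tokens[pos + 1:pos + 2 + pad]
            if word in window:
                pos = pos + 1 + window.index(word)
            else:
                matched = False
                break
        if matched:
            hits += 1
    return hits

def count_iu(text, pad=1):
    """Sum padded IU collocation counts, one sentence at a time."""
    total = 0
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for colloc in IU_COLLOCATIONS:
            total += count_padded(tokens, colloc, pad)
    return total
```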
12/21
Adjective-Verb Module
Selection of Potential Subjective Elements: first build the PSE set
  Expansion of an initial seed set (WordNet, FrameNet, etc.), e.g., Good, Bad, Oppose, Agree
  Refine the candidates and eliminate ambiguous elements
Classifying Blogs using AVM: decide by PSE density
  > 0.5: 100% opinionated
  < 0.2: 100% not opinionated
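A minimal sketch of the density-based decision follows. The thresholds 0.5 and 0.2 come from the slide, but the definition of density used here (the fraction of sentences containing at least one PSE) is an assumption, and the four-word PSE set is just the seed list.

```python
import re

# Seed terms from the slide; the expanded PSE set would be larger.
PSE = {"good", "bad", "oppose", "agree"}

def pse_density(text):
    """Fraction of sentences with at least one PSE (assumed definition)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    with_pse = sum(
        1 for s in sentences
        if PSE & set(re.findall(r"[a-z]+", s.lower()))
    )
    return with_pse / len(sentences)

def classify(text, high=0.5, low=0.2):
    """Label a blog post by PSE density using the slide's thresholds."""
    density = pse_density(text)
    if density > high:
        return "opinion"
    if density < low:
        return "non-opinion"
    return "undecided"
```

Posts that fall between the two thresholds are left undecided here, since the slide only states the two confident cases.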
13/21
Fusion
Combine the multiple sets of search results after retrieval time
  on the assumption that documents with higher overlap are more likely to be relevant
  scores are weighted with the relative contributions of the fusion components (determined by training)
Fusion formulas
  Weighted Sum (WS)
  Overlap WS (OWS)
  Weighted OWS
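The slide's fusion formulas were lost in extraction. The sketch below shows one common formulation of the first two: a weighted sum of min-max normalized scores, and the same sum multiplied by the number of runs that retrieved the document. The normalization choice and the function names are assumptions.

```python
def min_max(run):
    """Normalize a {doc_id: score} run to the range [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {d: 1.0 for d in run}
    return {d: (s - lo) / (hi - lo) for d, s in run.items()}

def weighted_sum(runs, weights):
    """Sum normalized scores, each weighted by its run's contribution."""
    fused = {}
    for run, w in zip(runs, weights):
        for d, s in min_max(run).items():
            fused[d] = fused.get(d, 0.0) + w * s
    return fused

def overlap_weighted_sum(runs, weights):
    """Weighted sum boosted by how many runs retrieved each document."""
    fused = weighted_sum(runs, weights)
    overlap = {d: sum(1 for r in runs if d in r) for d in fused}
    return {d: s * overlap[d] for d, s in fused.items()}
```

In practice the weights would be tuned on training topics, as the slide indicates.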
14/21
Dynamic Tuning
  A bio-feedback technique that helps a human judge where the local optimum lies
15/21
Experiment
2006 TREC blog test collection
  50 topics (title, description, narrative)
  12/2005~2006: 100,649 feeds (38 GB), 2.8M permalinks (75 GB), 325,000 homepages (20 GB)
The system returns 1,000 results per topic
  + Topic Reranking
  + Opinion Reranking
16/21
Results
Mean average precision (MAP)
  the precision at each rank where a relevant item is retrieved, averaged over topics
Mean R-precision (MRP)
  the precision at the rank equal to the total number of relevant items, averaged over topics
Precision at rank N (P@N)
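The three measures can be computed per topic as in the sketch below and then averaged over topics. The function names are illustrative; average precision follows the usual convention of dividing by the total number of relevant items.

```python
def precision_at(ranked, relevant, n):
    """P@N: fraction of the top n results that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    """Mean of the precision values at each relevant item's rank."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    r = len(relevant)
    return precision_at(ranked, relevant, r) if r else 0.0

def mean_over_topics(metric, runs):
    """runs: list of (ranked, relevant) pairs, one per topic."""
    return sum(metric(rk, rel) for rk, rel in runs) / len(runs)
```

MAP is then `mean_over_topics(average_precision, runs)` and MRP is `mean_over_topics(r_precision, runs)`.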
17/21
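Query Length Effect
  Conventionally, the longer the query the better (all topic fields > title only)
  Exceptions occur when the longer query adds noise
  Reranking can improve on this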
18/21
Topic Reranking Effect
19/21
20/21
Tuning after reranking yields no significant gain
Chart series: Short, Topic; Short, Opinion; Long, Topic; Long, Opinion; Fusion, Topic; Fusion, Opinion
21/21
Fusion Effect - Conclusion
Fusion gives an improvement of about 20%