1

Fusion Approach to Finding Opinions in Blogosphere

Kiduk Yang, Ning Yu, Alejandro Valerio, Hui Zhang, and Weimao Ke

Web Information Discovery Integrated Tool Laboratory (WIDIT)

Indiana University

Assistant Professor of Information Science

ICWSM 2007

This paper is based on WIDIT's TREC-2006 Blog track work; the figures and tables are the same, with references added. WIDIT also ranked first in spam handling in the Blog track.

2/21

WIDIT’s Fusion Approach

Adapt a topical retrieval system for the opinion retrieval task:

Apply the existing system to retrieve blogs about a target (i.e., on-topic retrieval). This is the basic IR step.

Optimize on-topic retrieval to address the challenges of short queries. This is done by reranking.

Identify opinion blogs by leveraging evidence of subjectiveness/opinion (i.e., opinion identification). This is the added ingredient.

Research Question

What the evidence of opinion is, and how it can be leveraged to retrieve opinionated blogs.

3/21

Sources of Evidence

Opinion Lexicon: a set of terms often used in expressing opinions (e.g., “Skype sucks”, “Skype rocks”, “Skype is cool”).

Opinion Collocations: contextual evidence; collocations used to mark adjacent statements as opinions (e.g., “I believe God exists”, “God is dead to me”).

Opinion Morphology: when expressing strong opinions or perspectives, people often use morphed word forms for emphasis (e.g., “Skype is soooo buggy”, “Skype is bugfested”).
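
As a rough illustration of the morphology evidence, the sketch below flags words with a stretched letter run such as "soooo". The pattern, the three-letter threshold, and the function names are assumptions made for this example, not the rules WIDIT used; coined forms such as "bugfested" would need a different approach (for instance a dictionary lookup).

```python
import re

# Hypothetical pattern: one letter repeated three or more times in a row.
# The threshold of three is an assumption made for this sketch.
ELONGATED = re.compile(r"([a-z])\1{2,}", re.IGNORECASE)

def find_elongated_words(text):
    """Return the words in `text` that contain a stretched letter run."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if ELONGATED.search(w)]

def collapse_elongation(word):
    """Collapse each stretched run to a single letter."""
    return ELONGATED.sub(r"\1", word)

print(find_elongated_words("Skype is soooo buggy"))
print(collapse_elongation("soooo"))
```

Note that "buggy" is not flagged, because its doubled letter is ordinary spelling rather than emphasis.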

4/21

5/21

Related Works

It is still arguable whether hyperlinks are good indicators of subjective affiliations.

Much research has focused on product and customer reviews.

Research on subjectivity by Wiebe and colleagues.

WIDIT as the IR system.

6/21

(1/4) Initial Retrieval – term indexing

Hyphenated words were split into parts.

Markup tags and stopwords were removed. Stopwords comprised: words in a standard stopword list; non-alphabetical words; words consisting of more than 25 or fewer than 3 characters; words that contain 3 or more repeated characters.

A modified version of the simple plural remover was applied.

Acronyms and abbreviations were kept.
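
The term-indexing rules on this slide can be sketched roughly as follows. This is a minimal illustration under several assumptions: the stopword list is a tiny stand-in, the plural remover follows the commonly cited S-stemmer rules without WIDIT's unspecified modifications, "3 or more repeated characters" is read as three consecutive identical characters, and the special handling of acronyms and abbreviations is not modeled.

```python
import re

# Tiny stand-in list; a real system would load a standard stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}
TAG = re.compile(r"<[^>]+>")       # markup tags
REPEAT3 = re.compile(r"(.)\1\1")   # three identical characters in a row

def strip_plural(word):
    """Simple plural remover (S-stemmer-style rules, unmodified)."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

def index_terms(text):
    """Turn raw text into index terms using the filters listed on the slide."""
    text = TAG.sub(" ", text)        # remove markup tags
    text = text.replace("-", " ")    # split hyphenated words into parts
    terms = []
    for token in text.split():
        token = token.strip(".,;:!?\"'()").lower()
        if not token.isalpha():                 # non-alphabetical words
            continue
        if token in STOPWORDS:                  # standard stopwords
            continue
        if len(token) < 3 or len(token) > 25:   # too short or too long
            continue
        if REPEAT3.search(token):               # repeated characters
            continue
        terms.append(strip_plural(token))
    return terms

print(index_terms("<p>The state-of-the-art queries</p>"))
```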

7/21

(2/4) Initial Retrieval – incremental indexing

Goal: scale up to large collections.

The document collection is indexed in fixed-size subcollections, which are searched in parallel.

Collection term statistics are derived after the creation of the subcollections, so subcollection retrieval results can simply be merged without any need for retrieval score normalization.
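
A minimal sketch of why the merge needs no score normalization: if every subcollection is scored with the same collection-wide statistics, the scores are directly comparable. The tf-idf weighting, data layout, and function names below are assumptions chosen for brevity, not WIDIT's implementation.

```python
import math
from collections import Counter

def build_subcollections(docs, size):
    """Split (doc_id, terms) pairs into fixed-size subcollections."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def global_stats(subcollections):
    """Derive collection-wide document frequencies after the split."""
    df = Counter()
    n_docs = 0
    for sub in subcollections:
        for _, terms in sub:
            df.update(set(terms))
            n_docs += 1
    return df, n_docs

def search_subcollection(sub, query, df, n_docs):
    """Score one subcollection with tf * idf, using the global statistics."""
    results = []
    for doc_id, terms in sub:
        tf = Counter(terms)
        score = sum(tf[t] * math.log(n_docs / df[t]) for t in query if df[t])
        if score > 0:
            results.append((score, doc_id))
    return results

def search(subcollections, query):
    """Search every subcollection and merge the raw scores directly."""
    df, n_docs = global_stats(subcollections)
    merged = []
    for sub in subcollections:  # each call could run in parallel
        merged.extend(search_subcollection(sub, query, df, n_docs))
    return [doc_id for _, doc_id in sorted(merged, key=lambda r: (-r[0], r[1]))]

docs = [
    ("d1", ["skype", "rocks"]),
    ("d2", ["skype", "skype", "sucks"]),
    ("d3", ["god", "exists"]),
    ("d4", ["weather"]),
]
print(search(build_subcollections(docs, 2), ["skype"]))
```

In this sketch, splitting the four documents into two subcollections gives the same ranking as indexing them as one collection.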

8/21

(3/4) Initial Retrieval – query indexing

Identify nouns and noun phrases.

Expand acronyms and abbreviations.

Extract the non-relevant portion of topic descriptions.

These are used to formulate various expanded versions of the query.

Query expansion submodules

9/21

(4/4) Initial Retrieval - Retrieval

Vector Space Model: the SMART length-normalized term weights, giving the score for term k and document i.

Probabilistic model: the Okapi BM25 formula.
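
The slide's own formulas are not reproduced in this text. As a stand-in for the probabilistic model, here is a sketch of BM25 in its common textbook form; the parameter values `k1=1.2` and `b=0.75` are conventional defaults and are assumptions here, and WIDIT's exact variant may differ.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Classic idf; it turns negative for terms found in more than
            # half of the documents.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Document-length normalization component.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = [
    ["skype", "is", "cool"],
    ["weather", "is", "nice"],
    ["god", "exists", "today"],
    ["skype", "skype", "sucks"],
    ["blogs", "are", "fun"],
]
print(bm25_scores(["skype"], docs))
```

The document that mentions "skype" twice scores higher than the one that mentions it once, but less than twice as high, which is the term-frequency saturation BM25 is known for.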

10/21

On-Topic Retrieval Optimization

Rerank based on a set of topic-related reranking factors:

Exact Match: exact query string occurrence

Proximity Match: padded query string occurrence

Noun Phrase Match

Non-Rel Match

Steps:

Compute topic reranking scores for each of the top N results.

Categorize the top N results into reranking groups, designed to preserve the initial ranking while allowing appropriate rank-boosting for a given combination of reranking factors.

Boost the rank of documents using reranking scores within groups.
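
The three reranking steps might be sketched as below. The grouping rule (exact matches first, proximity-only matches next, everything else last) and the `Result` fields are hypothetical; the slide does not say how WIDIT defined its reranking groups or combined its factors.

```python
from dataclasses import dataclass

@dataclass
class Result:
    doc_id: str
    initial_rank: int      # 1 = best in the initial retrieval
    exact_match: bool      # exact query string occurs in the document
    proximity_match: bool  # padded query string occurs in the document
    rerank_score: float    # combined topic reranking score (assumed given)

def rerank_group(result):
    """Hypothetical grouping: 0 = exact match, 1 = proximity only, 2 = neither."""
    if result.exact_match:
        return 0
    if result.proximity_match:
        return 1
    return 2

def rerank(results, top_n):
    """Rerank only the top N results; the tail keeps its initial order."""
    ordered = sorted(results, key=lambda r: r.initial_rank)
    head, tail = ordered[:top_n], ordered[top_n:]
    # Order by group, then by reranking score within the group;
    # ties fall back to the initial ranking.
    head.sort(key=lambda r: (rerank_group(r), -r.rerank_score, r.initial_rank))
    return head + tail

results = [
    Result("a", 1, False, False, 0.1),
    Result("b", 2, True, False, 0.5),
    Result("c", 3, True, False, 0.9),
    Result("d", 4, False, True, 0.3),
    Result("e", 5, True, True, 1.0),
]
print([r.doc_id for r in rerank(results, top_n=4)])
```

Document "e" stays in fifth place despite its high score because it falls outside the top N window.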

11/21

Opinion Identification

Opinion Term Module: uses the frequency of terms that occur frequently only in opinion blogs.

Rare Term Module (e.g., “sooo good”): identifies creative term patterns used in opinion blogs. Low-frequency terms were extracted from positive training data, dictionary terms were removed, and the remainder was examined to construct an RT lexicon and regular expressions.

IU Module (e.g., ‘I believe’, ‘my assessment’, ‘good for you’): counts the frequency of “padded” IU collocations within sentence boundaries.

Adjective-Verb Module: judges the density of Potential Subjective Elements (PSE) (see next slide).
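
One way to read the "padded" collocation count in the IU Module is sketched below: the two parts of a collocation may be separated by a few intervening words, but must fall in the same sentence. The pair list, the `max_pad` parameter, and the sentence splitter are all assumptions made for this illustration.

```python
import re

# Hypothetical IU collocations as (anchor, partner) pairs.
IU_PAIRS = [("i", "believe"), ("my", "assessment"), ("for", "you")]

def count_padded_iu(text, max_pad=2):
    """Count IU collocations with up to `max_pad` words between the two parts."""
    count = 0
    # Naive sentence split, so that matches never cross a sentence boundary.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        for i, word in enumerate(words):
            for anchor, partner in IU_PAIRS:
                window = words[i + 1:i + 2 + max_pad]
                if word == anchor and partner in window:
                    count += 1
    return count

print(count_padded_iu("I truly believe Skype rocks. Good for you!"))
```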

12/21

Adjective-Verb Module

Selection of Potential Subjective Elements (first build the PSE set):

Expand an initial seed set (e.g., Good, Bad, Oppose, Agree) using WordNet, FrameNet, and similar resources.

Refine the candidates and eliminate ambiguous elements.

Classifying Blogs using AVM: decide by PSE density. Above 0.5, the blog is judged 100% opinionated; below 0.2, 100% not opinionated.
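
A minimal sketch of the density-based decision, using the 0.5 and 0.2 thresholds from the slide. How PSE density is actually computed is not stated, so the definition here (PSE tokens divided by all tokens) is an assumption, as are the tiny PSE set and the "undecided" label for the middle range.

```python
# Hypothetical PSE set; the real one comes from expanding and refining seeds.
PSE = {"good", "bad", "oppose", "agree"}

def pse_density(tokens, pse=PSE):
    """Fraction of tokens that are PSEs (an assumed definition of density)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in pse)
    return hits / len(tokens)

def classify_blog(tokens, high=0.5, low=0.2):
    """Apply the two thresholds; the middle range is left undecided."""
    density = pse_density(tokens)
    if density > high:
        return "opinion"
    if density < low:
        return "non-opinion"
    return "undecided"

print(classify_blog(["good", "bad", "agree", "skype"]))
```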

13/21

Fusion

Combine the multiple sets of search results after retrieval time, on the assumption that documents with higher overlap are more likely to be relevant.

Scores are weighted with the relative contributions of the fusion components (determined by training).

Weighted Sum

Overlap WS

Weighted OWS
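
The fusion formulas themselves are not reproduced in this text, so the following is only a plausible sketch of the first two: a weighted sum of normalized scores, and the same sum multiplied by the number of result sets a document appears in. The min-max normalization, the weights, and the function names are assumptions; Weighted OWS is left out because its exact weighting is not given here.

```python
def min_max(scores):
    """Min-max normalize one result set's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_sum(result_sets, weights):
    """Sum each document's normalized scores, weighted per result set."""
    fused = {}
    for scores, w in zip(result_sets, weights):
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return fused

def overlap_weighted_sum(result_sets, weights):
    """Weighted sum multiplied by the number of result sets containing the doc."""
    fused = weighted_sum(result_sets, weights)
    overlap = {doc: sum(doc in rs for rs in result_sets) for doc in fused}
    return {doc: fused[doc] * overlap[doc] for doc in fused}

def ranked(fused):
    """Document ids by descending fused score, ties broken by id."""
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

run_a = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
run_b = {"d2": 3.0, "d3": 2.0, "d4": 1.0}
weights = [0.4, 0.6]  # illustrative; the slide says weights come from training
print(ranked(weighted_sum([run_a, run_b], weights)))
print(ranked(overlap_weighted_sum([run_a, run_b], weights)))
```

In this toy example the overlap factor moves "d3", which appears in both runs, ahead of "d1", which appears in only one.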

14/21

Dynamic Tuning

A bio-feedback-style technique that helps a human judge where the local optimum lies.

15/21

Experiment

2006 TREC blog test collection

50 topics (title, description, narrative)

12/2005~2006: 100,649 feeds (38 GB), 2.8 million permalinks (75 GB), 325,000 homepages (20 GB)

The system returns 1,000 results per topic.

+Topic Reranking +Opinion Reranking

16/21

Results

Mean average precision (MAP): the precision at each rank where a relevant item is retrieved, averaged over the relevant items and then over topics.

Mean R-precision (MRP): the precision at rank R, where R is the total number of relevant items, averaged over topics.

Precision at rank N (P@N)
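
The three measures can be written down compactly. The sketch below uses the standard definitions, with average precision divided by the total number of relevant items for the topic; the function names and the toy rankings are made up for illustration.

```python
def precision_at(ranking, relevant, n):
    """P@N: fraction of the top N results that are relevant."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    r = len(relevant)
    return precision_at(ranking, relevant, r) if r else 0.0

def mean_over_topics(metric, runs):
    """Average a per-topic metric over (ranking, relevant) pairs."""
    return sum(metric(ranking, relevant) for ranking, relevant in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(mean_over_topics(average_precision, runs))  # MAP
print(mean_over_topics(r_precision, runs))        # MRP
```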

17/21

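Query Length Effect

Traditionally, longer queries perform better (all topic fields > title only).

The exceptions are cases where the longer query introduces noise.

Reranking can improve this.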

18/21

Topic Reranking Effect

19/21

20/21

Tuning after reranking yields no significant further improvement.

Chart series: Short, Topic; Short, Opinion; Long, Topic; Long, Opinion; Fusion, Topic; Fusion, Opinion

21/21

Fusion Effect - Conclusion

Fusion yields an improvement of about 20%.