Web Search Clustering and Labeling with Hidden Topics. Presenter: Chien-Hsing Chen. Authors: Cam-Tu Nguyen, Xuan-Hieu Phan, Susumu Horiguchi, Thu-Trang Nguyen, Quang-Thuy Ha. 2009.TALIP.40. Outline: Motivation, Objective, Method, Experiments, Conclusion.
Intelligent Database Systems Lab
National Yunlin University of Science and Technology
Web Search Clustering and Labeling with Hidden Topics
Presenter: Chien-Hsing Chen
Authors: Cam-Tu Nguyen, Xuan-Hieu Phan, Susumu Horiguchi, Thu-Trang Nguyen, Quang-Thuy Ha
2009.TALIP.40.
N.Y.U.S.T.
I. M.
Outline
Motivation
Objective
Method
Experiments
Conclusion
Comment
Motivation
d1: ezPeer+ music downloads, music previews, lyrics, MP3, music site - 蔡依林 (Jolin Tsai) - albums by year ezPeer+ - 蔡依林 - J1 Live Concert full concert video record, J-game, 看我72變, 城堡, J9 Party party selection, Jolin J-Top champion selection, 舞孃, 蔡依林 唯舞獨尊 concert preview edition & remix album & 花 ... web.ezpeer.com/singer/s120.html - Cached - Similar
d2: ezPeer+ music downloads, music prev 花蝴蝶 sounds great… web.ezpeer.com/singer/s120.html - Cached - Similar
Snippets are usually noisier, less topic-focused, and much shorter than full documents (what does the truncated "花" in d1 refer to?).
As a result, similarity evaluation between snippets may not be successful.
d3: {He is an author}
d4: {The writer is standing behind you}
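The d3/d4 example can be checked with a plain bag-of-words cosine similarity. This is a minimal sketch and not the paper's evaluator: the whitespace tokenizer and the helper names `bag` and `cosine` are assumptions made for illustration.

```python
import math
from collections import Counter

def bag(text: str) -> Counter:
    """Lower-case, whitespace-tokenized bag of words (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

d3 = bag("He is an author")
d4 = bag("The writer is standing behind you")

shared = set(d3) & set(d4)   # the only overlapping token is the function word "is"
sim = cosine(d3, d4)         # low, although "author" and "writer" are near-synonyms
print(shared, round(sim, 3))
```

The two snippets overlap only on "is", so a purely lexical measure treats them as nearly unrelated.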
Objective
Similarity between snippets is evaluated with reference to a set of hidden topics.
di: {He is an author}
dj: {The writer is standing behind you}
(A document may be related to multiple topics.)
Framework
[Figure: framework diagram, including a label candidate generation step]
di > topic 10; dj > topic 10
dj, di
music, movie, radio player
music, movie
LDA
Notation: k = topic, m = document, n = word; z_{m,n}, w_{m,n}; K = 60 topics, with k = 10 (show business) as the example
Vocabulary: music, movie, author, kill, writer, book, …
Topics: politics, entertainment, show business, edu., cul., hel.
Indicator rows over the vocabulary, in slide order (labels z1, w1; z2, w2; z3, w3):
  1 1 0 0 1 1 …
  1 1 0 0 1 1 …
  1 1 0 0 1 1 …
  1 0 0 0 0 0 …
  1 1 0 0 0 0 …
  1 0 0 0 0 0 …
The word "music" in topic 10 can explain the occurrence of the words in documents m = 1, 2, 3.
In the training step, a keyword is related to a topic when it often occurs in the documents of that topic.
(Figure annotations: refer to topic k; refer to vocabulary)
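The training step can be sketched with a toy collapsed Gibbs sampler for LDA. This is an illustrative sketch only, not the implementation used in the paper: the hyperparameters `alpha` and `beta`, the iteration count, and the three tiny documents are all assumptions. The variable `z[m][n]` plays the role of z_{m,n} (the topic assigned to word n of document m).

```python
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler; returns the document-topic distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    n_mk = [[0] * K for _ in docs]       # words of document m assigned to topic k
    n_kt = [[0] * V for _ in range(K)]   # times term t is assigned to topic k
    n_k = [0] * K                        # total words assigned to topic k
    z = []                               # z[m][n]: topic of word n in document m
    for m, d in enumerate(docs):         # random initial assignment
        zs = []
        for w in d:
            k = rng.randrange(K)
            zs.append(k)
            n_mk[m][k] += 1
            n_kt[k][wid[w]] += 1
            n_k[k] += 1
        z.append(zs)
    for _ in range(iters):
        for m, d in enumerate(docs):
            for n, w in enumerate(d):
                t, k = wid[w], z[m][n]
                # remove the current assignment before resampling it
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                weights = [
                    (n_mk[m][j] + alpha) * (n_kt[j][t] + beta) / (n_k[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][n] = k
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    theta = [
        [(n_mk[m][k] + alpha) / (len(d) + K * alpha) for k in range(K)]
        for m, d in enumerate(docs)
    ]
    return theta

# Hypothetical toy corpus loosely based on the slide's vocabulary.
docs = [
    ["music", "movie", "writer", "book"],
    ["music", "movie", "move", "writer", "book"],
    ["politics", "vote", "election"],
]
theta = lda_gibbs(docs, K=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each row of `theta` is one document's distribution over the K topics; a word that keeps co-occurring with the same documents tends to end up under the same topic.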
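LDA
Notation: k = topic, m = document, n = word; z_{m,n}, w_{m,n}; k = topic 10, K = 60
Vocabulary: music, movie, author, move, writer, book, …
Indicator rows over the vocabulary (labels z1, w1):
  1 1 0 0 1 1 …
  1 1 0 1 1 1 …
Topic values for document d_m:
  topic k:  1    2    3    4    …  9    10   11   12   …  60
  d_m:      0.2  0.1  0.4  0.3  …  0.2  0.9  0.1  0.2  …  0.1
p(.|.) = ?
p(.|.) = 1/60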
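One plausible reading of the 1/60 value, offered as an assumption because the slide does not define p(.|.): before any word of d_m has been assigned to a topic, every topic is equally likely. With the standard smoothed estimate of the document-topic distribution, where n_m^{(k)} is the number of words of d_m assigned to topic k and α is a symmetric Dirichlet hyperparameter (both symbols introduced here, not taken from the slide):

```latex
\vartheta_{m,k} = \frac{n_m^{(k)} + \alpha}{\sum_{j=1}^{K} n_m^{(j)} + K\alpha}
\qquad\Longrightarrow\qquad
\vartheta_{m,k}\Big|_{\,n_m^{(j)} = 0 \;\forall j} = \frac{\alpha}{K\alpha} = \frac{1}{K} = \frac{1}{60}
\quad (K = 60)
```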
Framework
Similarity between di and dj
t: the t-th term in the vocabulary V
k: the k-th topic
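The formula on this slide did not survive extraction, so the following is only a sketch of one way to combine the two parts the annotations mention (terms t in V and topics k). It assumes each snippet has a topic vector and a term vector, compares each part by cosine, and mixes them with a hypothetical weight `lam`; none of these names or the mixing form are confirmed by the slide.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def snippet_similarity(topics_i, terms_i, topics_j, terms_j, lam=0.5):
    """Assumed mix: lam * topic-part cosine + (1 - lam) * term-part cosine."""
    return lam * cosine(topics_i, topics_j) + (1 - lam) * cosine(terms_i, terms_j)

# Hypothetical values: the snippets share no term but lean on the same topic.
terms_i, terms_j = [1, 0], [0, 1]              # e.g. "author" vs "writer"
topics_i, topics_j = [0.9, 0.1], [0.8, 0.2]

word_only = cosine(terms_i, terms_j)
combined = snippet_similarity(topics_i, terms_i, topics_j, terms_j)
print(word_only, round(combined, 3))
```

With the term part alone the two snippets look unrelated; the topic part is what lifts the combined score.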
Framework
similarity matrix between snippets
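A similarity matrix of this kind can be built by comparing every pair of snippets once. The sketch below assumes snippets are already represented as topic vectors and uses cosine as the pairwise measure; the vectors are hypothetical and the clustering algorithm that would consume the matrix is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors, sim=cosine):
    """Symmetric n x n matrix; the diagonal is set to 1.0 (self-similarity) by convention."""
    n = len(vectors)
    S = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = sim(vectors[i], vectors[j])
    return S

# Hypothetical topic vectors for three snippets.
snippets = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
S = similarity_matrix(snippets)
for row in S:
    print([round(x, 2) for x in row])
```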
Label Candidate Generation
D / Topic k:
           k=1   k=2   …   k=10   …   k=60
  music    14    18    …   38     …   9
Label candidate generation: music, radio player, mp3, CD, …
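One way to read the table is that a word becomes a label candidate for the topic under which it is counted most often. The sketch below ranks words by their count under a given topic. Only the "music" counts come from the slide (under the assumed column alignment); every other count, and the helper name `label_candidates`, is hypothetical.

```python
def label_candidates(counts, topic, top_n=3):
    """Return the top_n words with the highest count under the given topic."""
    scored = [(by_topic.get(topic, 0), word) for word, by_topic in counts.items()]
    scored = [(c, w) for c, w in scored if c > 0]
    scored.sort(key=lambda cw: (-cw[0], cw[1]))   # highest count first, ties by word
    return [w for _, w in scored[:top_n]]

counts = {
    "music": {1: 14, 2: 18, 10: 38, 60: 9},   # from the slide
    "radio player": {10: 21},                 # hypothetical
    "mp3": {2: 3, 10: 17},                    # hypothetical
    "CD": {10: 12},                           # hypothetical
    "election": {1: 25},                      # hypothetical
}

candidates = label_candidates(counts, topic=10, top_n=4)
dominant_topic = max(counts["music"], key=counts["music"].get)
print(candidates, dominant_topic)
```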
Label assignment for clustering snippets
D / Topic k
Label candidate generation: music, radio player, mp3, CD, …
Snippets: dj, di
Label assignment: music, CD
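The step from candidates to assigned labels can be sketched as a simple filter: keep the candidates that actually appear in the cluster's snippets, ranked by how many snippets contain them. The scoring rule, the substring matching, and the two example snippets are assumptions for illustration, not the slide's method.

```python
def assign_labels(cluster_snippets, candidates, top_n=2):
    """Rank candidate labels by the number of cluster snippets that contain them."""
    scored = []
    for label in candidates:
        hits = sum(1 for s in cluster_snippets if label.lower() in s.lower())
        if hits:
            scored.append((hits, label))
    # sorted() is stable, so ties keep the original candidate order
    scored = sorted(scored, key=lambda hl: -hl[0])
    return [label for _, label in scored[:top_n]]

# Hypothetical cluster {di, dj} and the candidate list from the slide.
cluster = [
    "Download music and buy the CD online",
    "New CD released: music previews and lyrics",
]
candidates = ["music", "radio player", "mp3", "CD"]

labels = assign_labels(cluster, candidates)
print(labels)
```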
Framework
[Figure: framework diagram, including a label candidate generation step]
di > topic 10; dj > topic 4, topic 10
dj, di
music, movie, radio player
music, movie
Experiment
Wikipedia dataset
Vnexpress dataset
Experimental dataset
The Web dataset consists of 2,357 snippets in 9 categories.
20 queries were issued to Google, yielding about 150 distinct snippets.
Experiments
Evaluation measure: F-measure
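The slide does not say which F-measure variant is used, so the sketch below shows only the general definition from precision and recall, with F1 (beta = 1) as the default. The cluster and gold-category sets are hypothetical.

```python
def precision_recall(cluster: set, gold: set):
    """Precision and recall of one cluster against one gold category."""
    tp = len(cluster & gold)
    precision = tp / len(cluster) if cluster else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); 0.0 when P = R = 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical snippet ids: 4 of the 5 clustered snippets belong to the category,
# which has 6 members in total.
cluster = {1, 2, 3, 4, 9}
gold = {1, 2, 3, 4, 5, 6}

p, r = precision_recall(cluster, gold)
f1 = f_measure(p, r)
print(round(p, 3), round(r, 3), round(f1, 3))
```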
Conclusion
Clustering snippets with hidden topics
Labeling clusters using hidden topic analysis
My Comment
Advantage: clusters are labeled with the help of hidden topics; the size of snippets is small
Two datasets: 2,357 and 150 snippets (in our work: more than 2 million snippets)
Disadvantage: depends less on snippets
Application: snippets are useful for making sense