Web Search Clustering and Labeling with Hidden Topics


Web Search Clustering and Labeling with Hidden Topics. Presenter: Chien-Hsing Chen. Authors: Cam-Tu Nguyen, Xuan-Hieu Phan, Susumu Horiguchi, Thu-Trang Nguyen, Quang-Thuy Ha. 2009.TALIP.40. Outline: Motivation, Objective, Method, Experiments, Conclusion.


Intelligent Database Systems Lab

National Yunlin University of Science and Technology (國立雲林科技大學)

Web Search Clustering and Labeling with Hidden Topics

Presenter: Chien-Hsing Chen
Authors: Cam-Tu Nguyen, Xuan-Hieu Phan, Susumu Horiguchi, Thu-Trang Nguyen, Quang-Thuy Ha


2009.TALIP.40.


N.Y.U.S.T.

I. M.


Outline

Motivation
Objective
Method
Experiments
Conclusion
Comment


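Motivation

d1: ezPeer+ music downloads, music previews, lyrics, MP3, music site - 蔡依林 (Jolin Tsai) - albums by year. ezPeer+ - 蔡依林 - J1 Live Concert (full concert video record), J-game, 看我72變, 城堡, J9 Party (party selection), Jolin J-Top (champion selection), 舞孃, 蔡依林唯舞獨尊 concert (preview edition) & remix album & 花 ... web.ezpeer.com/singer/s120.html - Cached - Similar

d2: ezPeer+ music downloads, music pre 花蝴蝶 sounds good… web.ezpeer.com/singer/s120.html - Cached - Similar

Snippets are usually noisier, less topic-focused, and much shorter than ordinary documents. For example, d1 is cut off at "花 ...", so it is unclear what "花" refers to, while d2 mentions "花蝴蝶".

As a result, similarity evaluation between snippets may not be successful.

d3: {He is an author}
d4: {The writer is standing behind you}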
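The d3/d4 pair can be checked with a plain bag-of-words comparison. The following is a minimal sketch, assuming a tiny hand-picked stop-word list (`STOPWORDS` is hypothetical and not from the slides): once function words are dropped, the two sentences share no terms, so their cosine similarity is zero even though "author" and "writer" are closely related.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this example only.
STOPWORDS = {"he", "is", "an", "the", "you"}

def bag_of_words(text):
    """Lower-case, split on whitespace, and count the non-stop-word terms."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)  # a missing key in a Counter counts as 0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d3 = bag_of_words("He is an author")
d4 = bag_of_words("The writer is standing behind you")
print(cosine(d3, d4))  # 0.0: no shared content words
```

Term overlap alone therefore cannot relate d3 to d4, which is the gap the hidden topics are meant to close.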


Objective

Evaluate the similarity between snippets with reference to a set of hidden topics.

di: {He is an author}
dj: {The writer is standing behind you}

(A document may be related to multiple topics.)


Framework

(label candidate generation)

[Figure: framework overview. di → topic 10; dj → topic 10. di and dj are shown together. Word lists shown: "music, movie, radio player" and "music, movie".]


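LDA

[Figure: LDA example. Vocabulary: music, movie, author, kill, writer, book, … Documents m = 1, 2, 3 are shown as binary term-occurrence rows (for example, 1 1 0 0 1 1 …) together with their topic assignments z1, z2, z3 and words w1, w2, w3.]

Notation: k indexes topics, m indexes documents, and n indexes words. z_{m,n} is the topic assigned to the n-th word of document m, and w_{m,n} is that word.

K = 60 topics in total; the example uses k = 10 (show business).

Example topics: politics, entertainment, show business, edu., cul., hel.

The word "music" in topic 10 can explain the occurrence of the words in documents m = 1, 2, 3.

In the training step, a keyword becomes related to a topic when it often occurs in the documents of that topic.

Diagram annotations: "refer to topic k" and "refer to vocabulary".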
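The training intuition above (a word becomes tied to a topic when it keeps occurring in documents of that topic) can be sketched with a small collapsed Gibbs sampler for LDA. This is an illustrative sketch only, not the presenters' or authors' implementation: the toy corpus, the hyperparameters `alpha` and `beta`, and the function name `gibbs_lda` are all assumptions made for the example.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: lists of word ids. Returns (theta, phi)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # words of doc m in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word t assigned to topic k
    topic_total = [0] * num_topics                              # all words in topic k
    z = []                                                      # z[m][n]: topic of n-th word of doc m
    for m, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)                       # random initial assignment
            z[m].append(k)
            doc_topic[m][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]                                     # remove the current assignment
                doc_topic[m][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized p(z = j | everything else) for each topic j.
                weights = [
                    (doc_topic[m][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[m][n] = k                                     # record the new assignment
                doc_topic[m][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    theta = [[(doc_topic[m][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for m, doc in enumerate(docs)]
    phi = [[(topic_word[k][t] + beta) / (topic_total[k] + vocab_size * beta)
            for t in range(vocab_size)] for k in range(num_topics)]
    return theta, phi

# Toy corpus over the slide's vocabulary (the documents themselves are made up).
vocab = ["music", "movie", "author", "kill", "writer", "book"]
docs = [[0, 1, 0, 1], [0, 1, 1], [2, 4, 5, 4], [2, 5, 5]]
theta, phi = gibbs_lda(docs, num_topics=2, vocab_size=len(vocab))
```

Here `theta[m]` is the estimated topic distribution of document m and `phi[k]` is the estimated word distribution of topic k; both are read off the assignment counts after sampling.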


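LDA

Vocabulary: music, movie, author, move, writer, book, …

Term-occurrence rows:
1 1 0 0 1 1 …
1 1 0 1 1 1 …

Notation: k = topic, m = document, n = word, with topic assignment z_{m,n} and word w_{m,n} (shown here as z1, w1).

k = topic 10, K = 60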


Topic k:  1    2    3    4    …  9    10   11   12   …  60
dm:       0.2  0.1  0.4  0.3  …  0.2  0.9  0.1  0.2  …  0.1

p(.|.) = ?


p(.|.)=1/60
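One way to read the dm row is as per-topic weights for a document, from which the topics that stand out (such as topic 10 with 0.9) are kept. The sketch below assumes a simple cutoff rule; the function `dominant_topics`, the cutoff value, and the `peaked` vector are hypothetical illustrations, and the slide's elided values are not reconstructed. With a uniform weight of 1/K for K = 60, no topic stands out.

```python
def dominant_topics(topic_weights, cutoff):
    """Return the 1-based indices of topics whose weight is at least `cutoff`."""
    return [k for k, w in enumerate(topic_weights, start=1) if w >= cutoff]

K = 60

# Uniform weights: every topic gets 1/K, so nothing passes the cutoff.
uniform = [1.0 / K] * K
print(dominant_topics(uniform, cutoff=0.5))  # []

# Hypothetical peaked weights: topic 10 carries 0.9, the rest share 0.1.
peaked = [0.1 / (K - 1)] * K
peaked[9] = 0.9  # index 9 is topic 10
print(dominant_topics(peaked, cutoff=0.5))  # [10]
```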

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

10

Framework


Similarity between di and dj

Notation: t indexes the t-th term in the vocabulary V; k indexes the k-th topic.
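The similarity formula itself did not survive in the slide text. As a hedged sketch, assume each snippet has a topic part (one weight per topic k) and a word part (one weight per vocabulary term t), and that the two cosine similarities are mixed with a weight `lam`. The mixing rule, the name `combined_similarity`, and the topic weights below are assumptions for illustration, not a formula confirmed by the slides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_similarity(topics_i, words_i, topics_j, words_j, lam=0.5):
    """Assumed rule: lam * sim(topic parts) + (1 - lam) * sim(word parts)."""
    return lam * cosine(topics_i, topics_j) + (1.0 - lam) * cosine(words_i, words_j)

# Word parts over the vocabulary [music, movie, author, kill, writer, book].
words_i = [0, 0, 1, 0, 0, 0]  # di mentions "author"
words_j = [0, 0, 0, 0, 1, 0]  # dj mentions "writer"

# Hypothetical topic parts over three topics; both snippets lean on topic 2.
topics_i = [0.1, 0.8, 0.1]
topics_j = [0.2, 0.7, 0.1]

word_only = cosine(words_i, words_j)
combined = combined_similarity(topics_i, words_i, topics_j, words_j)
```

Under these assumptions the word-only similarity is zero, while the combined score is positive because the two snippets share a dominant topic.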


Framework

similarity matrix between snippets
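Given such a similarity matrix, snippets can be grouped by repeatedly merging the most similar clusters. The slide does not name the clustering algorithm, so the following is only a sketch assuming average-link agglomerative clustering with a stopping threshold; the function `agglomerative`, the threshold, and the matrix values are hypothetical.

```python
def agglomerative(sim, threshold):
    """Average-link agglomerative clustering over a symmetric similarity matrix.

    Merging stops when the best pair of clusters is less similar than `threshold`.
    Returns a list of clusters, each a list of snippet indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        # Average pairwise similarity between the members of two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = link(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Made-up similarities: snippets 0 and 1 are close, as are snippets 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(agglomerative(sim, threshold=0.5))  # [[0, 1], [2, 3]]
```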


Label Candidate Generation

Word counts in D by topic k:

         k=1   k=2   …   k=10   …   k=60
music    14    18    …   38     …   9

Label candidates: music, radio player, mp3, CD
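Candidate labels for a topic can be illustrated by ranking words by how often they are counted under that topic. In the sketch below, only the "music" row follows the slide, read as counts under k = 1, 2, 10, and 60 (an assumption about the flattened table); all other counts, the word "author", and the function `label_candidates` are hypothetical, and the authors' actual candidate scoring may differ.

```python
def label_candidates(topic_word_counts, topic, top_n):
    """Rank words by their count under `topic` and return the top_n words."""
    ranked = sorted(
        topic_word_counts,
        key=lambda word: topic_word_counts[word].get(topic, 0),
        reverse=True,
    )
    return ranked[:top_n]

counts = {
    "music": {1: 14, 2: 18, 10: 38, 60: 9},  # row as read from the slide
    "radio player": {10: 25},                # hypothetical count
    "mp3": {10: 20},                         # hypothetical count
    "CD": {10: 12},                          # hypothetical count
    "author": {10: 1, 2: 30},                # hypothetical word and counts
}

print(label_candidates(counts, topic=10, top_n=4))
# ['music', 'radio player', 'mp3', 'CD']
```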


Label assignment for clustering snippets

Label candidate generation (D, topic k): music, radio player, mp3, CD

Label assignment (di, dj): music, CD


Framework

(label candidate generation)

[Figure: framework overview. di → topic 10; dj → topic 4, topic 10. di and dj are shown together. Word lists shown: "music, movie, radio player" and "music, movie".]


Experiment

Wikipedia dataset, Vnexpress dataset


Experimental dataset

The Web dataset consists of 2,357 snippets in 9 categories.

20 queries were issued to Google, yielding about 150 distinct snippets.


Experiments

F-measure
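For reference, the F-measure combines precision and recall into one score. The slides do not say which variant was used to score clusters, so this is a sketch of the standard definition only; the function name `f_measure` and the `beta` weight are illustrative, with `beta = 1` giving the usual harmonic mean.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-measure; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when both are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.5))  # 0.5
```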


Conclusion

Clustering snippets with hidden topics
Labeling clusters using hidden topic analysis


My Comment

Advantage: labeling clusters with the help of hidden topics; the size of snippets is small.

Two datasets: 2,357 and 150 snippets (in our work: more than 2 million snippets).

Disadvantage: depends less on the snippets themselves.

Application: snippets are useful for making sense of search results.