Web Search Clustering and Labeling with Hidden Topics. Presenter: Chien-Hsing Chen. Authors: Cam-Tu Nguyen, Xuan-Hieu Phan, Susumu Horiguchi, Thu-Trang Nguyen, Quang-Thuy Ha. 2009.TALIP.40. Intelligent Database Systems Lab, National Yunlin University of Science and Technology.

Page 1: Web  Search Clustering and Labeling  with Hidden Topics

Intelligent Database Systems Lab
National Yunlin University of Science and Technology

Web Search Clustering and Labeling with Hidden Topics

Presenter: Chien-Hsing Chen
Authors: Cam-Tu Nguyen, Xuan-Hieu Phan, Susumu Horiguchi, Thu-Trang Nguyen, Quang-Thuy Ha

2009.TALIP.40

Page 2: Web  Search Clustering and Labeling  with Hidden Topics

Outline

  • Motivation
  • Objective
  • Method
  • Experiments
  • Conclusion
  • Comment

Page 3: Web  Search Clustering and Labeling  with Hidden Topics

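Motivation

d1: ezPeer+ music downloads, music previews, lyrics, MP3, music site - Jolin Tsai (蔡依林) - albums over the years
ezPeer+ - Jolin Tsai - J1 Live Concert (complete concert video), J-game, 看我72變 (Magic), 城堡 (Castle), J9 Party (party selection), Jolin J-Top (champion selection), 舞孃 (Dancing Diva), Jolin Tsai 唯舞獨尊 (Dancing Forever) concert preview edition & remix album & 花 ...
web.ezpeer.com/singer/s120.html - Cached - Similar

d2: ezPeer+ music downloads, music ... 花蝴蝶 (Butterfly) sounds great…
web.ezpeer.com/singer/s120.html - Cached - Similar

Snippets are usually noisier, less topic-focused, and much shorter than full documents (what does the truncated "花" in d1 refer to?).

As a result, similarity evaluation between snippets may fail.

d3: {He is an author}
d4: {The writer is standing behind you}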
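The d3/d4 pair can be checked with a plain bag-of-words comparison. The following is a minimal sketch, not taken from the slides: the stop-word list and the helper names `bag_of_words` and `cosine` are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical stop-word list, just large enough for the two example snippets.
STOP_WORDS = {"he", "is", "an", "the", "you", "behind"}

def bag_of_words(text):
    """Lower-case, split on whitespace, drop stop words, count terms."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d3 = bag_of_words("He is an author")
d4 = bag_of_words("The writer is standing behind you")
print(cosine(d3, d4))  # 0.0: "author" and "writer" never match as strings
```

With term matching alone, the two snippets look unrelated even though "author" and "writer" are close in meaning.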

Page 4: Web  Search Clustering and Labeling  with Hidden Topics

Objective

The similarity measure between snippets refers to a set of hidden topics.

di: {He is an author}
dj: {The writer is standing behind you}

(a document may be related to multiple topics)

Page 5: Web  Search Clustering and Labeling  with Hidden Topics

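Framework

[Figure: framework diagram, with the (label candidate generation) step marked. Annotations: di > topic10, dj > topic10; snippets dj, di; words "music, movie, radio player", then "music, movie".]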

Page 6: Web  Search Clustering and Labeling  with Hidden Topics

LDA

Notation: k = topic, m = document, n = word; z_{m,n} is the topic assigned to the n-th word of document m, and w_{m,n} is that word. K = 60 topics; the example topic is k = 10 (show business).

[Figure: LDA illustration. Vocabulary: music, movie, author, kill, writer, book, … Binary word rows (1 1 0 0 1 1 …, 1 0 0 0 0 0 …, 1 1 0 0 0 0 …) are shown next to (z1, w1), (z2, w2), (z3, w3). Annotations: "refer to topic k", "refer to vocabulary". Topic names shown: politics, entertainment, show business, edu., cul., hel.]

The word "music" in topic 10 can explain the occurrence of the words in documents m = 1, 2, 3.

In the training step: a keyword is related to a topic when it often occurs in the documents of that topic.
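The roles of z_{m,n} and w_{m,n} are easiest to see in the standard LDA generative process: each document draws a topic mixture, each word position draws a topic from that mixture, and the word itself is drawn from that topic's word distribution. The sketch below is illustrative only. The function names, the hyperparameters `alpha` and `beta`, and the toy sizes are assumptions, not values from the slides.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index according to a discrete probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(vocab, num_topics, num_docs, doc_len,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate (topic, word) pairs for each document, following LDA."""
    rng = random.Random(seed)
    # phi[k]: word distribution of topic k ("refer to vocabulary")
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta: topic mixture of the current document m
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)   # z_{m,n}: topic of the n-th word
            w = sample_index(phi[z], rng)  # w_{m,n}: word drawn from topic z
            doc.append((z, vocab[w]))
        corpus.append(doc)
    return corpus

vocab = ["music", "movie", "author", "kill", "writer", "book"]
for doc in generate_corpus(vocab, num_topics=3, num_docs=2, doc_len=5):
    print(doc)
```

Training runs this process in reverse: given only the words, it estimates which topic assignments and distributions best explain them.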

Page 7: Web  Search Clustering and Labeling  with Hidden Topics

LDA

[Figure: LDA illustration with k = topic, m = document, n = word, z_{m,n}, w_{m,n}; k = topic 10, K = 60; z1, w1. Vocabulary: music, movie, author, move, writer, book, … Rows shown: 1 1 0 0 1 1 … and 1 1 0 1 1 1 …]

Page 8: Web  Search Clustering and Labeling  with Hidden Topics

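LDA

[Figure: LDA illustration with k = topic, m = document, n = word, z_{m,n}, w_{m,n}; k = topic 10, K = 60; z1, w1. Vocabulary: music, movie, author, move, writer, book, … Rows shown: 1 1 0 0 1 1 … and 1 1 0 1 1 1 …]

topic k:  1    2    3    4    …  9    10   11   12   …  60
dm:       0.2  0.1  0.4  0.3  …  0.2  0.9  0.1  0.2  …  0.1

p(.|.) = ?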

Page 9: Web  Search Clustering and Labeling  with Hidden Topics

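LDA

[Figure: LDA illustration with k = topic, m = document, n = word, z_{m,n}, w_{m,n}; k = topic 10, K = 60; z1, w1. Vocabulary: music, movie, author, move, writer, book, … Rows shown: 1 1 0 0 1 1 … and 1 1 0 1 1 1 …]

topic k:  1    2    3    4    …  9    10   11   12   …  60
dm:       0.2  0.1  0.4  0.3  …  0.2  0.9  0.1  0.2  …  0.1

p(.|.) = ?

p(.|.) = 1/60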
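One way to read the last two lines is that, before any evidence is used, every one of the K = 60 topics is equally likely for dm, and the per-topic weights are then refined from the words. That reading is an assumption. The sketch below illustrates it with a simple EM-style folding-in that keeps the topic-word probabilities fixed; it is not claimed to be the estimation procedure used in the paper, and `infer_topic_mixture` and the toy `phi` table are hypothetical.

```python
def infer_topic_mixture(words, phi, iterations=50):
    """Estimate topic weights theta for one snippet, holding phi fixed.

    phi[k] maps a word to p(word | topic k). theta starts uniform (1/K).
    """
    num_topics = len(phi)
    theta = [1.0 / num_topics] * num_topics
    for _ in range(iterations):
        counts = [0.0] * num_topics
        for w in words:
            # Responsibility of each topic for this word occurrence.
            joint = [theta[k] * phi[k].get(w, 0.0) for k in range(num_topics)]
            total = sum(joint)
            if total == 0.0:
                continue  # word unknown to every topic: ignore it
            for k in range(num_topics):
                counts[k] += joint[k] / total
        norm = sum(counts)
        if norm == 0.0:
            break  # no known words: keep the current estimate
        theta = [c / norm for c in counts]
    return theta

# Toy topic-word table with two topics (values are made up).
phi = [
    {"music": 0.5, "mp3": 0.4, "author": 0.05, "book": 0.05},
    {"music": 0.05, "mp3": 0.05, "author": 0.5, "book": 0.4},
]
print(infer_topic_mixture(["music", "mp3", "music"], phi))
```

A snippet with no known words keeps the uniform starting weights, which matches the 1/K reading of the slide.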

Page 10: Web  Search Clustering and Labeling  with Hidden Topics

Framework

Page 11: Web  Search Clustering and Labeling  with Hidden Topics

Similarity between di and dj

[Formula annotations: "the t-th term in the vocabulary V"; "the k-th topic".]
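The formula itself did not survive in this transcript. Its two annotations suggest that each snippet is described both by term weights over the vocabulary V and by weights over the K topics. As a hedged sketch of how such a similarity could be computed, the code below mixes a cosine over the topic part with a cosine over the term part. The mixing weight `lam`, the function names, and the toy vectors are assumptions, not the slide's formula.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def snippet_similarity(topics_i, terms_i, topics_j, terms_j, lam=0.5):
    """lam weights the topic part; (1 - lam) weights the term part."""
    return (lam * cosine(topics_i, topics_j)
            + (1.0 - lam) * cosine(terms_i, terms_j))

# Toy vocabulary: [author, writer, standing]; two hypothetical topics.
terms_i, topics_i = [1, 0, 0], [0.1, 0.9]   # di: {He is an author}
terms_j, topics_j = [0, 1, 1], [0.2, 0.8]   # dj: {The writer is standing behind you}

print(snippet_similarity(topics_i, terms_i, topics_j, terms_j, lam=0.0))
print(snippet_similarity(topics_i, terms_i, topics_j, terms_j, lam=0.5))
```

With `lam=0.0` only shared terms count, so di and dj score zero. Any positive `lam` lets the shared topic raise the score.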

Page 12: Web  Search Clustering and Labeling  with Hidden Topics

Framework

similarity matrix between snippets
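Once pairwise similarities are available, a similarity matrix can drive any similarity-based clustering method. The sketch below uses average-link agglomerative clustering with a stopping threshold, as one plausible choice. The slide does not say which algorithm or threshold is used, so both are assumptions, and the 4 x 4 matrix is made-up data.

```python
def average_link(sim, a, b):
    """Mean pairwise similarity between clusters a and b (lists of indices)."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

def agglomerative_clusters(sim, threshold):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_link(sim, clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if pair is None or best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Symmetric toy similarity matrix for four snippets.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(agglomerative_clusters(sim, threshold=0.5))
```

The threshold controls how many clusters remain: a higher value stops merging earlier and leaves more, smaller clusters.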

Page 13: Web  Search Clustering and Labeling  with Hidden Topics

Label Candidate Generation

D / Topic k

         k=1   k=2   …   k=10   …   k=60
music    14    18    …   38     …   9

Label candidates: music, radio player, mp3, CD

Page 14: Web  Search Clustering and Labeling  with Hidden Topics

Label assignment for clustered snippets

[Figure: D / Topic k table, then Label Candidate Generation (candidates: music, radio player, mp3, CD), then Label assignment for the snippets dj, di (labels: music, CD).]
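The two steps can be illustrated with a small sketch: candidates are the highest-count words of a topic, and the labels kept for a cluster are the candidates that its snippets actually mention. This is one simple scoring rule chosen for illustration, not the paper's ranking method. Only the "music" counts (14, 18, 38, 9) come from the slide; all other counts, the snippets, and the function names are hypothetical.

```python
def label_candidates(topic_word_counts, topic, top_n=4):
    """Words with the highest counts in one topic column of the table."""
    column = {w: counts.get(topic, 0) for w, counts in topic_word_counts.items()}
    return sorted(column, key=lambda w: (-column[w], w))[:top_n]

def assign_labels(cluster_snippets, candidates, top_n=2):
    """Rank candidates by how many snippets of the cluster mention them."""
    def coverage(label):
        return sum(1 for s in cluster_snippets if label.lower() in s.lower())
    ranked = sorted(candidates, key=lambda c: (-coverage(c), c))
    return [c for c in ranked if coverage(c) > 0][:top_n]

# Word counts per topic; only the "music" row is taken from the slide.
topic_word_counts = {
    "music": {1: 14, 2: 18, 10: 38, 60: 9},
    "CD": {1: 0, 2: 3, 10: 30, 60: 1},
    "mp3": {1: 1, 2: 2, 10: 25, 60: 0},
    "radio player": {1: 2, 2: 1, 10: 20, 60: 0},
    "politics": {1: 40, 2: 5, 10: 1, 60: 2},
}
cluster = ["Jolin music CD release", "new music CD", "music download"]

candidates = label_candidates(topic_word_counts, topic=10)
print(candidates)
print(assign_labels(cluster, candidates))
```

Ties are broken alphabetically so that the output is deterministic.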

Page 15: Web  Search Clustering and Labeling  with Hidden Topics

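Framework

[Figure: framework diagram, with the (label candidate generation) step marked. Annotations: di > topic10; dj > topic4, topic10; snippets dj, di; words "music, movie, radio player", then "music, movie".]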

Page 16: Web  Search Clustering and Labeling  with Hidden Topics

Experiment

Wikipedia dataset
Vnexpress dataset

Page 17: Web  Search Clustering and Labeling  with Hidden Topics

Experimental dataset

The Web dataset consists of 2,357 snippets in 9 categories.

20 queries were submitted to Google, yielding about 150 distinct snippets.

Page 18: Web  Search Clustering and Labeling  with Hidden Topics

Experiments

F-measure
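For reference, the usual F-measure combines precision P and recall R as F = (1 + β²)PR / (β²P + R), which reduces to the harmonic mean 2PR / (P + R) for β = 1. The slide does not show which variant or counting scheme the experiments use, so the sketch below assumes the standard definition with raw counts as inputs. The function name and the example counts are illustrative.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-measure from raw counts; beta = 1 gives the harmonic mean of P and R."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = beta * beta * precision + recall
    if denom == 0.0:
        return 0.0
    return (1.0 + beta * beta) * precision * recall / denom

# Example: 8 snippets placed correctly, 2 wrongly included, 2 missed.
print(f_measure(8, 2, 2))
```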

Pages 19-24: Web Search Clustering and Labeling with Hidden Topics

Experiments

Page 25: Web  Search Clustering and Labeling  with Hidden Topics

Conclusion

  • Clustering snippets with hidden topics
  • Labeling clusters using hidden topic analysis

Page 26: Web  Search Clustering and Labeling  with Hidden Topics

My Comment

Advantage
  • Labeling clusters with the help of hidden topics
  • The size of snippets is small

Two datasets: 2,357 and 150 snippets (in our work: more than 2 million snippets)

Disadvantage
  • Depends less on snippets

Application
  • Snippets are useful for making sense of search results