
Modeling Topical Trends over Continuous Time with Priors


Page 1: Modeling Topical Trends over Continuous Time with Priors

Modeling Topical Trends over Continuous Time with Priors

Tomonari MASADA (正田备也), Nagasaki University (长崎大学)

[email protected]

Tomonari MASADA ISNN 2010 1

Page 2: Modeling Topical Trends over Continuous Time with Priors

Background

We can estimate document similarity not only by text body but also by metadata, e.g.

◦ Timestamps (our target!)
◦ Hyperlinks
◦ Authors, etc.

Page 3: Modeling Topical Trends over Continuous Time with Priors

Dynamism in Topical Trends

What does “EXPO” mean in 1970? ・・・ EXPO held in Osaka, Japan

What does “EXPO” mean in 2010? ・・・ EXPO held in Shanghai, China

Page 4: Modeling Topical Trends over Continuous Time with Priors

Latent Dirichlet Allocation [Blei et al. 03]

Collapsed Gibbs sampling [Griffiths et al. PNAS04]

◦ Re-assign a token of word w in doc j to topic k with probability proportional to:

$$\frac{(n_{jk} + \alpha)\,(n_{kw} + \beta)}{n_k + W\beta}$$
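As an illustration of the re-assignment step above, here is a minimal Python sketch of one collapsed Gibbs sweep for LDA. It assumes symmetric scalar hyperparameters `alpha` and `beta`, documents given as lists of integer word IDs, and count arrays `n_jk`, `n_kw`, `n_k` kept in sync with the assignments `z`. The function names `gibbs_sweep` and `init_state` are hypothetical and do not come from the slides.

```python
import numpy as np


def init_state(docs, K, W, rng):
    """Randomly assign a topic to every token and build the count arrays."""
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    n_jk = np.zeros((len(docs), K))  # tokens in doc j assigned to topic k
    n_kw = np.zeros((K, W))          # tokens of word w assigned to topic k
    n_k = np.zeros(K)                # all tokens assigned to topic k
    for j, doc in enumerate(docs):
        for w, k in zip(doc, z[j]):
            n_jk[j, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    return z, n_jk, n_kw, n_k


def gibbs_sweep(docs, z, n_jk, n_kw, n_k, alpha, beta, rng):
    """Re-assign every token once, updating z and the counts in place."""
    W = n_kw.shape[1]
    for j, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[j][i]
            # Take the token out of the counts before resampling it.
            n_jk[j, k_old] -= 1
            n_kw[k_old, w] -= 1
            n_k[k_old] -= 1
            # Unnormalized re-assignment probability for each topic k.
            p = (n_jk[j] + alpha) * (n_kw[:, w] + beta) / (n_k + W * beta)
            k_new = rng.choice(len(p), p=p / p.sum())
            z[j][i] = k_new
            n_jk[j, k_new] += 1
            n_kw[k_new, w] += 1
            n_k[k_new] += 1
```

Removing the token from the counts before computing `p` is what makes the sampler "collapsed": the probability depends only on the other tokens' current assignments.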

Page 5: Modeling Topical Trends over Continuous Time with Priors

Shanghai is the largest city in China, located in her eastern coast at the outlet of the Yangtze River. Originally a fishing and textiles town, Shanghai grew to importance in the 19th century. In 2005 Shanghai became the world's busiest cargo port. The city is an emerging tourist destination renowned for its historical landmarks such as the Bund and Xintiandi, its modern and …

[Figure: each word token of the passage above is labeled with a topic ID (7, 22, 41, 50, or 63).]

Page 6: Modeling Topical Trends over Continuous Time with Priors

Topic Re-assignment Probabilities

$$\frac{(n_{jk} + \alpha)\,(n_{kw} + \beta)}{n_k + W\beta}$$

◦ $n_{kw}$: how many tokens of word w are assigned to topic k
◦ $n_{jk}$: how many tokens in doc j are assigned to topic k

Introduce time-dependency!

Page 7: Modeling Topical Trends over Continuous Time with Priors

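Topics over Time [Wang et al. KDD06]

$$\frac{(n_{jk} + \alpha)\,f_k(y_j)\,(n_{kw} + \beta)}{n_k + W\beta}$$

[One further expression involving $f_k$ and $y_j$ is illegible in the extracted text.]

$$f_k(y) = \frac{\Gamma(\psi_{k1} + \psi_{k2})}{\Gamma(\psi_{k1})\,\Gamma(\psi_{k2})}\, y^{\psi_{k1} - 1}\,(1 - y)^{\psi_{k2} - 1}$$

Here $y_j$ is the timestamp of doc j, and $f_k$ is the Beta density of topic k with parameters $\psi_{k1}$, $\psi_{k2}$.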
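To make the time-dependent factor concrete, the sketch below multiplies the LDA term by a per-topic Beta density evaluated at the document timestamp. It assumes timestamps normalized to the open interval (0, 1). The parameter name `psi` and both function names are illustrative choices, not taken from the slides.

```python
import math

import numpy as np


def beta_pdf(y, a, b):
    """Beta density at y in (0, 1) with parameters a, b > 0."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y))


def tot_topic_probs(n_jk_row, n_kw_col, n_k, y_j, psi, alpha, beta, W):
    """Normalized topic probabilities for one token of word w in doc j.

    n_jk_row: counts n_jk for doc j, shape (K,)
    n_kw_col: counts n_kw for word w, shape (K,)
    psi:      list of (psi_k1, psi_k2) pairs, one per topic (assumed name)
    """
    lda_part = (n_jk_row + alpha) * (n_kw_col + beta) / (n_k + W * beta)
    time_part = np.array([beta_pdf(y_j, a, b) for a, b in psi])
    p = lda_part * time_part
    return p / p.sum()
```

With identical counts for two topics, a late timestamp such as `y_j = 0.9` strongly favors a topic whose Beta density is concentrated near 1, which shows how heavily the timestamp can dominate the text evidence.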

Page 8: Modeling Topical Trends over Continuous Time with Priors


LDA: no time-dependency

TOT: too heavy time-dependency

Page 9: Modeling Topical Trends over Continuous Time with Priors

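LYNDA [Masada et al. CIKM09]

Latent dYNamical Dirichlet Allocation

$$\frac{(n_{jk} + \alpha)\,f_k(y_j)\,(n_{kw} + \beta)}{n_k + W\beta}$$

[One further expression involving $f_k$ and $y_j$ is illegible in the extracted text.]

$$f_k(y) \propto \exp\!\left(-\frac{(y - m_k)^2}{2 s_k^2}\right)$$

Here $y_j$ is the timestamp of doc j, and $f_k$ is a Gaussian-shaped function of time for topic k with center $m_k$ and width $s_k$.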

Page 10: Modeling Topical Trends over Continuous Time with Priors

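Bayesian TOT [this paper]

Apply priors to the Beta distributions
◦ But… no conjugate priors for Beta!

$$\frac{(n_{jk} + \alpha)\,f_k(y_j)\,(n_{kw} + \beta)}{n_k + W\beta}$$

$$f_k(y) = \frac{\Gamma(\psi_{k1} + \psi_{k2})}{\Gamma(\psi_{k1})\,\Gamma(\psi_{k2})}\, y^{\psi_{k1} - 1}\,(1 - y)^{\psi_{k2} - 1}$$

Here $y_j$ is the timestamp of doc j, and $\psi_{k1}$, $\psi_{k2}$ are the Beta parameters of topic k.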

Page 11: Modeling Topical Trends over Continuous Time with Priors

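Use Gamma as Conjugate Prior

Choose one among the two by a coin flip

$$f_k^{(1)}(y) = \frac{\Gamma(\psi_{k1} + 1)}{\Gamma(\psi_{k1})\,\Gamma(1)}\, y^{\psi_{k1} - 1}\,(1 - y)^{1 - 1} = \psi_{k1}\, y^{\psi_{k1} - 1}$$

$$f_k^{(2)}(y) = \frac{\Gamma(1 + \psi_{k2})}{\Gamma(1)\,\Gamma(\psi_{k2})}\, y^{1 - 1}\,(1 - y)^{\psi_{k2} - 1} = \psi_{k2}\,(1 - y)^{\psi_{k2} - 1}$$

$$\mathrm{Gamma}(\psi; a, b) = \frac{b^{a}}{\Gamma(a)}\, \psi^{a - 1}\, e^{-b\psi}$$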
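Why a Gamma prior becomes conjugate can be checked with a short calculation, assuming the coin flip fixes one of the two Beta parameters to 1. With f(y) = psi * y^(psi - 1), the likelihood of n timestamps is psi^n * exp(psi * sum(log y_i)) up to a factor that does not depend on psi, so a Gamma(a, b) prior on psi yields a Gamma(a + n, b - sum(log y_i)) posterior. The sketch below is a generic conjugate update under that assumption, not necessarily the exact computation used in the paper, and the function names are hypothetical.

```python
import math


def _log_terms(ys, which):
    """log(y) for the first form, log(1 - y) for the second form."""
    return [math.log(y) if which == 1 else math.log(1.0 - y) for y in ys]


def gamma_posterior(ys, a, b, which):
    """Posterior Gamma(shape, rate) of psi given timestamps ys in (0, 1).

    which=1: f(y) = psi * y**(psi - 1)
    which=2: f(y) = psi * (1 - y)**(psi - 1)
    """
    logs = _log_terms(ys, which)
    # The logs are negative, so the rate only grows.
    return a + len(ys), b - sum(logs)


def log_marginal(ys, a, b, which):
    """Log marginal likelihood of ys with psi integrated out."""
    a_post, b_post = gamma_posterior(ys, a, b, which)
    base = sum(_log_terms(ys, which))
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a_post) - a_post * math.log(b_post)
            - base)
```

Because the marginal likelihood has a closed form, the Beta parameter never has to be estimated explicitly, which is the practical benefit of having a conjugate prior.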

Page 12: Modeling Topical Trends over Continuous Time with Priors

[Figure: graphical model with nodes x, y, z, θ, α, φ, s, τ, η, γ, β, a, b.]

Page 13: Modeling Topical Trends over Continuous Time with Priors

Evaluation

Link detection task of TDT (a world-wide competition)
http://projects.ldc.upenn.edu/TDT/

◦ For every pair of documents, judge whether the two tell the same story or not.
◦ Check correctness with the ground-truth set.

Page 14: Modeling Topical Trends over Continuous Time with Priors

[Figure: the entire document set, a ground-truth document set, and “Yes!” / “No!” judgments.]

Page 15: Modeling Topical Trends over Continuous Time with Priors

TDT4 Dataset

Dataset spec
◦ 96,259 newswire articles
◦ 123 timestamps (Oct. 1, 2000 ~ Jan. 31, 2001)
◦ 17,638,946 word tokens, 196,131 word types

80 ground-truth document sets
◦ 40 sets for TDT 2002 + 40 sets for TDT 2003
(Each corresponds to a different news story.)

Evaluation measure
◦ Normalized detection cost
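The slides do not spell out the normalized detection cost. A minimal sketch follows, assuming the usual TDT definition: a weighted sum of the miss rate and the false-alarm rate, divided by the cost of the better of the two trivial systems (always "Yes" or always "No"). The default constants C_miss = 1, C_fa = 0.1, and P_target = 0.02 are commonly quoted TDT settings and are an assumption here, not values taken from the slides.

```python
def normalized_detection_cost(pred, truth, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost for pairwise "same story?" decisions.

    pred, truth: equal-length sequences of booleans, one per document pair.
    The default cost constants are assumed, not taken from the slides.
    """
    pairs = list(zip(pred, truth))
    n_target = sum(1 for _, t in pairs if t)
    n_nontarget = len(pairs) - n_target
    misses = sum(1 for p, t in pairs if t and not p)
    false_alarms = sum(1 for p, t in pairs if p and not t)
    p_miss = misses / n_target if n_target else 0.0
    p_fa = false_alarms / n_nontarget if n_nontarget else 0.0
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Divide by the cost of the cheaper trivial system.
    return cost / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

Under this definition a perfect system scores 0, and a system that always answers "No" scores 1, so lower values are better.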

Page 16: Modeling Topical Trends over Continuous Time with Priors

Weighting Words with Predictive Probabilities

[Equation: $e_j(w)$, a word weight built from $n_{jw}$, $n_{jw}/n_j$, $p_j(w)$, and a logarithm involving $J$ and $J_w$, with rescaling parameters ρ and σ. The exact arrangement is illegible in the extracted text.]

$e_j(w)$ : weight of word w in doc j
$J_w$ : # of docs including word w
$J$ : total # of docs
$n_{jw}/n_j$ : ML prob. of word w in doc j
$p_j(w)$ : predictive prob. of word w in doc j given by a topic model
ρ, σ : probability rescaling parameters

Page 17: Modeling Topical Trends over Continuous Time with Priors

Interpreting Weighting Scheme

ρ = σ = 0 : naïve TFIDF (inefficient, not used)
ρ = 0, σ > 0 : baseline TFIDF
ρ > 0, σ > 0 : LDA, TOT, BTOT

Page 18: Modeling Topical Trends over Continuous Time with Priors

Summary of Evaluation

“LDA > TFIDF” for 26 sets     “LDA < TFIDF” for 8 sets
“TOT > LDA” for 27 sets       “TOT < LDA” for 2 sets
“BTOT > LDA” for 15 sets      “BTOT < LDA” for 0 sets
“BTOT > TOT” for 7 sets       “BTOT < TOT” for 1 set

Page 19: Modeling Topical Trends over Continuous Time with Priors

Conclusion

Our aim is to seek a better balance between
◦ text body similarity and
◦ metadata similarity.

But… many investigations are still needed.

Page 20: Modeling Topical Trends over Continuous Time with Priors

http://www.cis.nagasaki-u.ac.jp/~masada/researches.html

[Figure: LDA vs. BTOT on DBLP, 1990 ~ 2009.]

Page 21: Modeling Topical Trends over Continuous Time with Priors

[Figure: LDA vs. BTOT on Xinhua net, May 5, 2009 ~ Dec. 17, 2009.]

Page 22: Modeling Topical Trends over Continuous Time with Priors

[Figure: LDA vs. BTOT on Yomiuri Newspaper, 2002 ~ 2005.]