Modeling Topical Trends over Continuous Time with Priors
Tomonari MASADA 正田备也
Nagasaki University 长崎大学
Tomonari MASADA ISNN 2010 1
Background
We can estimate document similarity not only from the text body but also from metadata, e.g.
◦ Timestamps (our target!)
◦ Hyperlinks
◦ Authors, etc.
Dynamism in Topical Trends
What does “EXPO” mean in 1970? … The EXPO held in Osaka, Japan
What does “EXPO” mean in 2010? … The EXPO held in Shanghai, China
Latent Dirichlet Allocation [Blei et al. 03]
Collapsed Gibbs sampling [Griffiths et al. PNAS04]
◦ Re-assign a token of word w in doc j to topic k with probability:
$$\propto\; (n_{jk} + \alpha)\,\frac{n_{kw} + \beta}{n_k + W\beta}$$
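The re-assignment step can be sketched in a few lines. This is a minimal illustration, assuming the standard collapsed Gibbs form (n_jk + α)(n_kw + β)/(n_k + Wβ) with symmetric hyperparameters; the function name, the hyperparameter values, and the toy counts are hypothetical, and the counts are assumed to already exclude the token being re-assigned.

```python
import random

def reassign_probs(n_jk, n_kw, n_k, W, alpha=0.1, beta=0.01):
    """Normalized topic re-assignment probabilities for one token.

    n_jk[k]: tokens in doc j assigned to topic k
    n_kw[k]: tokens of word w assigned to topic k
    n_k[k] : all tokens assigned to topic k
    W      : vocabulary size
    alpha, beta: assumed symmetric Dirichlet hyperparameters
    """
    weights = [
        (n_jk[k] + alpha) * (n_kw[k] + beta) / (n_k[k] + W * beta)
        for k in range(len(n_k))
    ]
    total = sum(weights)
    return [x / total for x in weights]

# Toy counts for three topics (illustrative only).
probs = reassign_probs(n_jk=[3, 0, 1], n_kw=[5, 0, 2], n_k=[40, 30, 50], W=1000)
# Draw the new topic for the token according to these probabilities.
new_topic = random.choices(range(len(probs)), weights=probs)[0]
```

A topic that is already frequent both in the document and for the word receives most of the probability mass, which is what makes the sampler cluster co-occurring words.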
Shanghai is the largest city in China, located in her eastern coast at the outlet of the Yangtze River. Originally a fishing and textiles town, Shanghai grew to importance in the 19th century. In 2005 Shanghai became the world's busiest cargo port. The city is an emerging tourist destination renowned for its historical landmarks such as the Bund and Xintiandi, its modern and
[Figure: each word token of the passage above is labeled with its assigned topic ID (7, 22, 41, 50, or 63)]
Topic Re-assignment Probabilities
$$\propto\; (n_{jk} + \alpha)\,\frac{n_{kw} + \beta}{n_k + W\beta}$$
◦ n_{kw}: how many tokens of word w are assigned to topic k
◦ n_{jk}: how many tokens in doc j are assigned to topic k
Introduce time-dependency!
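The two counts can be gathered from the current topic assignments in one pass. The sketch below is illustrative: the function name, the toy documents, and the pairing of words with topic IDs are made up, and only the topic IDs themselves (7, 41, 63) echo the example passage.

```python
from collections import Counter

def count_assignments(docs, z):
    """Build the count tables used by the re-assignment probabilities.

    docs[j][i] is the i-th word of doc j; z[j][i] is its assigned topic.
    Returns Counters keyed by (j, k), (k, w), and k.
    """
    n_jk, n_kw, n_k = Counter(), Counter(), Counter()
    for j, (words, topics) in enumerate(zip(docs, z)):
        for w, k in zip(words, topics):
            n_jk[j, k] += 1   # tokens in doc j assigned to topic k
            n_kw[k, w] += 1   # tokens of word w assigned to topic k
            n_k[k] += 1       # all tokens assigned to topic k
    return n_jk, n_kw, n_k

docs = [["shanghai", "city", "port"], ["shanghai", "expo"]]
z = [[41, 7, 41], [41, 63]]
n_jk, n_kw, n_k = count_assignments(docs, z)
```

Using `Counter` means unseen (doc, topic) or (topic, word) pairs read as zero, which matches how the counts enter the formula.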
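Topics over Time [Wang et al. KDD06]
$$\propto\; (n_{jk} + \alpha)\,\frac{n_{kw} + \beta}{n_k + W\beta}\, f_k(y_j)$$
$$f_k(y) = \frac{\Gamma(\psi_{k1} + \psi_{k2})}{\Gamma(\psi_{k1})\,\Gamma(\psi_{k2})}\,(1 - y)^{\psi_{k1} - 1}\, y^{\psi_{k2} - 1}$$
Here y_j is the timestamp of doc j, and \psi_{k1}, \psi_{k2} are the parameters of the Beta distribution of topic k.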
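To see how a per-topic Beta density reshapes the re-assignment weights, here is a small sketch. It assumes timestamps are scaled into the open interval (0, 1) and uses the convention (1 - y)^(psi1 - 1) * y^(psi2 - 1); the names `beta_pdf`, `tot_weights`, and `psi`, as well as the parameter values, are illustrative and not taken from the slides.

```python
import math

def beta_pdf(y, psi1, psi2):
    """Beta density with (1 - y)**(psi1 - 1) * y**(psi2 - 1), for 0 < y < 1."""
    log_norm = math.lgamma(psi1 + psi2) - math.lgamma(psi1) - math.lgamma(psi2)
    return math.exp(
        log_norm + (psi1 - 1) * math.log(1 - y) + (psi2 - 1) * math.log(y)
    )

def tot_weights(lda_weights, y_j, psi):
    """Multiply each topic's LDA weight by f_k(y_j)."""
    return [wt * beta_pdf(y_j, p1, p2) for wt, (p1, p2) in zip(lda_weights, psi)]

# Topic 0 peaks early in the time span, topic 1 peaks late (assumed values).
psi = [(5.0, 2.0), (2.0, 5.0)]
# A document near the end of the time span, with equal LDA weights.
weights = tot_weights([1.0, 1.0], y_j=0.9, psi=psi)
```

With equal LDA weights, the late document is pushed almost entirely toward the late-peaking topic, which illustrates why the time-dependency can dominate the text evidence.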
LDA: no time-dependency
TOT: too heavy time-dependency
LYNDA [Masada et al. CIKM09]
Latent dYNamical Dirichlet Allocation
$$\propto\; (n_{jk} + \alpha)\,\frac{n_{kw} + \beta}{n_k + W\beta}\, f_k(y_j)$$
$$f_k(y) = \exp\left(-\frac{(y - m_k)^2}{2 s_k^2}\right)$$
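For comparison with the Beta density of TOT, a Gaussian-shaped trend function can be sketched as follows. This assumes the unnormalized form exp(-(y - m_k)^2 / (2 s_k^2)); the function name and the values of m_k and s_k are illustrative.

```python
import math

def gaussian_trend(y, m_k, s_k):
    """Unnormalized Gaussian bump centered at m_k with width s_k."""
    return math.exp(-((y - m_k) ** 2) / (2 * s_k ** 2))

# The trend is 1 at the center and decays symmetrically on both sides.
values = [gaussian_trend(y, m_k=0.5, s_k=0.1) for y in (0.3, 0.5, 0.7)]
```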
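Bayesian TOT [this paper]
Apply priors to the Beta distributions
◦ But… there are no conjugate priors for Beta!
$$\propto\; (n_{jk} + \alpha)\,\frac{n_{kw} + \beta}{n_k + W\beta}\, f_k(y_j)$$
$$f_k(y) = \frac{\Gamma(\psi_{k1} + \psi_{k2})}{\Gamma(\psi_{k1})\,\Gamma(\psi_{k2})}\,(1 - y)^{\psi_{k1} - 1}\, y^{\psi_{k2} - 1}$$
Here \psi_{k1}, \psi_{k2} are the parameters of the Beta distribution of topic k.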
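Use Gamma as Conjugate Prior
Fix one of the two Beta parameters to 1 and choose which one by a coin flip:
$$f_k^{(1)}(y) = \frac{\Gamma(\psi_{k1} + 1)}{\Gamma(\psi_{k1})\,\Gamma(1)}\,(1 - y)^{\psi_{k1} - 1}\, y^{1 - 1} = \psi_{k1}\,(1 - y)^{\psi_{k1} - 1}$$
$$f_k^{(2)}(y) = \frac{\Gamma(1 + \psi_{k2})}{\Gamma(1)\,\Gamma(\psi_{k2})}\,(1 - y)^{1 - 1}\, y^{\psi_{k2} - 1} = \psi_{k2}\, y^{\psi_{k2} - 1}$$
$$\mathrm{Gamma}(\psi; a, b) = \frac{b^a}{\Gamma(a)}\,\psi^{a - 1}\, e^{-b\psi}$$
[Figure: graphical model with nodes x, y, z, θ, α, φ, s, τ, η, γ, β, b, a]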
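Why Gamma becomes conjugate once one Beta parameter is fixed to 1 can be checked with a short sketch. With density psi * y^(psi - 1) (or psi * (1 - y)^(psi - 1)), the likelihood of n timestamps is psi^n * exp((psi - 1) * sum of logs), so a Gamma(a, b) prior with rate b yields a Gamma(a + n, b - sum of logs) posterior. The function name, the symbol psi, and the sample timestamps are assumptions for illustration; this shows only the conjugate update, not the full inference procedure of the paper.

```python
import math

def gamma_posterior(ys, a, b, case):
    """Posterior (shape, rate) of the free Beta parameter under a Gamma(a, b) prior.

    case 1: density psi * (1 - y)**(psi - 1)
    case 2: density psi * y**(psi - 1)
    ys are timestamps in the open interval (0, 1).
    """
    logs = [math.log(1 - y) if case == 1 else math.log(y) for y in ys]
    # The logs are negative, so the rate parameter grows with the data.
    return a + len(ys), b - sum(logs)

# Three late timestamps observed for one topic (illustrative values).
shape, rate = gamma_posterior([0.8, 0.9, 0.95], a=1.0, b=1.0, case=2)
posterior_mean = shape / rate
```

Because the posterior stays in the Gamma family, the Beta parameter can be integrated out in closed form, which is what allows the timestamps to be treated in a Bayesian fashion.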
Evaluation
Link detection task of TDT (a world-wide competition)
http://projects.ldc.upenn.edu/TDT/
◦ For every pair of documents, judge whether the two tell the same story or not.
◦ Check correctness against the ground-truth set.
[Figure: the entire document set and a ground-truth document set, with “Yes!” / “No!” judgments]
TDT4 Dataset
Dataset spec
◦ 96,259 newswire articles
◦ 123 timestamps (Oct. 1, 2000 ~ Jan. 31, 2001)
◦ 17,638,946 word tokens, 196,131 word types
80 ground-truth document sets
◦ 40 sets for TDT 2002 + 40 sets for TDT 2003
(Each corresponds to a different news story.)
Evaluation measure
◦ Normalized detection cost
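The normalized detection cost can be sketched as below. This assumes the usual TDT definition: a weighted sum of miss and false-alarm rates, divided by the cost of the better trivial system (always "Yes" or always "No"). The default values C_miss = 1, C_fa = 0.1, and P_target = 0.02 are the commonly cited TDT settings and should be treated as assumptions here, since the slides do not state them.

```python
def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Detection cost normalized by the best trivial system (lower is better)."""
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    baseline = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / baseline

# A system that always answers "No!" misses every same-story pair.
always_no = normalized_detection_cost(p_miss=1.0, p_fa=0.0)
# A system with 10% misses and 1% false alarms.
example = normalized_detection_cost(p_miss=0.1, p_fa=0.01)
```

A value below 1 means the system does better than the trivial always-"No" answer under these cost settings.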
Weighting Words with Predictive Probabilities
$$e_j(w) = n_{jw}\left\{\rho\, p_j(w) + \frac{n_{jw}}{n_j}\right\}^{\sigma} \log\left(\frac{J}{J_w}\right)$$
e_j(w) : weight of word w in doc j
J_w : # of docs including word w
J : total # of docs
n_{jw} / n_j : ML prob. of word w in doc j
p_j(w) : predictive prob. of word w in doc j given by a topic model
ρ, σ : probability rescaling parameters
Interpreting Weighting Scheme
$$e_j(w) = n_{jw}\left\{\rho\, p_j(w) + \frac{n_{jw}}{n_j}\right\}^{\sigma} \log\left(\frac{J}{J_w}\right)$$
◦ ρ = σ = 0 : naïve TFIDF (inefficient, not used)
◦ ρ = 0, σ > 0 : baseline TFIDF
◦ ρ > 0, σ > 0 : LDA, TOT, BTOT
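The three settings can be compared numerically. The sketch below assumes the weight has the form e_j(w) = n_jw * (ρ p_j(w) + n_jw / n_j)^σ * log(J / J_w), which is a reading of the slide's formula rather than a confirmed definition; the function name and all numeric values are illustrative.

```python
import math

def word_weight(n_jw, n_j, p_jw, J, J_w, rho, sigma):
    """Assumed form: n_jw * (rho * p_jw + n_jw / n_j)**sigma * log(J / J_w)."""
    return n_jw * (rho * p_jw + n_jw / n_j) ** sigma * math.log(J / J_w)

# One word occurring 3 times in a 100-token doc, found in 10 of 1000 docs.
args = dict(n_jw=3, n_j=100, p_jw=0.05, J=1000, J_w=10)
naive = word_weight(rho=0.0, sigma=0.0, **args)     # plain term frequency times IDF
baseline = word_weight(rho=0.0, sigma=0.5, **args)  # ML probability rescaled by sigma
topical = word_weight(rho=1.0, sigma=0.5, **args)   # adds the topic model's probability
```

Under this reading, ρ controls how much the topic model's predictive probability contributes, and σ controls how strongly the combined probability rescales the weight.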
Summary of Evaluation
“LDA > TFIDF” for 26 sets, “LDA < TFIDF” for 8 sets
“TOT > LDA” for 27 sets, “TOT < LDA” for 2 sets
“BTOT > LDA” for 15 sets, “BTOT < LDA” for 0 sets
“BTOT > TOT” for 7 sets, “BTOT < TOT” for 1 set
Conclusion
Our aim is to seek a better balance between
◦ text body similarity and
◦ metadata similarity.
But… many investigations are still needed.
http://www.cis.nagasaki-u.ac.jp/~masada/researches.html
[Figure: DBLP, 1990 ~ 2009; panels: LDA, BTOT]
[Figure: Xinhua net, May 5, 2009 ~ Dec. 17, 2009; panels: LDA, BTOT]
[Figure: Yomiuri Newspaper, 2002 ~ 2005; panels: LDA, BTOT]