Dragon Star Program Course (龙星计划课程): Information Retrieval (信息检索)
Topic Models for Text Mining

Page 1: Topic Models for Text Mining

Dragon Star Program Course: Information Retrieval
Topic Models for Text Mining

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

Page 2: Topic Models for Text Mining

Text Management Applications

[Diagram: three text management applications and what they do: Access (select information), Mining (create knowledge), Organization (add structure/annotations).]

Page 3: Topic Models for Text Mining

What Is Text Mining?

"The objective of Text Mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc." (Grobelnik et al., 2001)

“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)

Page 4: Topic Models for Text Mining

Two Different Views of Text Mining

• Data Mining View: Explore patterns in textual data (shallow mining)
  – Find latent topics
  – Find topical trends
  – Find outliers and other hidden patterns

• Natural Language Processing View: Make inferences based on partial understanding of natural language text (deep mining)
  – Information extraction
  – Question answering

Page 5: Topic Models for Text Mining

Applications of Text Mining

• Direct applications: Go beyond search to find knowledge
  – Question-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer the questions?
  – Data-driven (WWW, literature, email, customer reviews, etc.): We have a lot of data; what can we do with it?

• Indirect applications
  – Assist information access (e.g., discover latent topics to better summarize search results)
  – Assist information organization (e.g., discover hidden structures)

Page 6: Topic Models for Text Mining

Text Mining Methods

• Data Mining Style: view text as high-dimensional data
  – Frequent pattern finding
  – Association analysis
  – Outlier detection
• Information Retrieval Style: fine-granularity topical analysis
  – Topic extraction
  – Exploit term weighting and text similarity measures
• Natural Language Processing Style: information extraction
  – Entity extraction
  – Relation extraction
  – Sentiment analysis
  – Question answering
• Machine Learning Style: unsupervised or semi-supervised learning (the topic of this lecture)
  – Mixture models
  – Dimension reduction

Page 7: Topic Models for Text Mining

Outline

• The basic topic models:
  – Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
  – Latent Dirichlet Allocation (LDA) [Blei et al. 02]
• Extensions
  – Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06]
  – Other extensions

Page 8: Topic Models for Text Mining

Basic Topic Model: PLSA

Page 9: Topic Models for Text Mining

PLSA: Motivation

What did people say in their blog articles about "Hurricane Katrina"?
Query = "Hurricane Katrina". Results (topics discovered):

• Government Response: bush 0.071, president 0.061, federal 0.051, government 0.047, fema 0.047, administrate 0.023, response 0.020, brown 0.019, blame 0.017, governor 0.014
• New Orleans: city 0.063, orleans 0.054, new 0.034, louisiana 0.023, flood 0.022, evacuate 0.021, storm 0.017, resident 0.016, center 0.016, rescue 0.012
• Oil Price: price 0.077, oil 0.064, gas 0.045, increase 0.020, product 0.020, fuel 0.018, company 0.018, energy 0.017, market 0.016, gasoline 0.012
• Praying and Blessing: god 0.141, pray 0.047, prayer 0.041, love 0.030, life 0.025, bless 0.025, lord 0.017, jesus 0.016, will 0.013, faith 0.012
• Aid and Donation: donate 0.120, relief 0.076, red 0.070, cross 0.065, help 0.050, victim 0.036, organize 0.022, effort 0.020, fund 0.019, volunteer 0.019
• Personal: i 0.405, my 0.116, me 0.060, am 0.029, think 0.015, feel 0.012, know 0.011, something 0.007, guess 0.007, myself 0.006

Page 10: Topic Models for Text Mining

Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]

• Mix k multinomial distributions to generate a document

• Each document has a potentially different set of mixing weights which captures the topic coverage

• When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)

• We may add a background distribution to “attract” background words

Page 11: Topic Models for Text Mining

PLSA as a Mixture Model

[Figure: "Generating" word w in doc d in the collection. k topic distributions θ_1, …, θ_k (e.g., {warning 0.3, system 0.2, …}, {aid 0.1, donation 0.05, support 0.02, …}, {statistics 0.2, loss 0.1, dead 0.05, …}) and a background distribution θ_B = {is 0.05, the 0.04, a 0.03, …}; a word in d is drawn from θ_B with probability λ_B and otherwise from topic θ_j with probability (1 − λ_B) π_{d,j}.]

Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood:

$$p_d(w) = \lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\, p(w\mid\theta_j)$$

$$\log p(d) = \sum_{w\in V} c(w,d)\,\log\Big[\lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\, p(w\mid\theta_j)\Big]$$

Page 12: Topic Models for Text Mining

Special Case: Model-based Feedback

• Simple case: there is only one topic

$$p_d(w) = \lambda\, p(w\mid\theta_B) + (1-\lambda)\, p(w\mid\theta_F)$$

$$\log p(d) = \sum_{w\in V} c(w,d)\,\log\big[\lambda\, p(w\mid\theta_B) + (1-\lambda)\, p(w\mid\theta_F)\big]$$

Here λ = P(source) is the probability of picking the background source: background words come from p(w|θ_B) with probability λ, topic words from p(w|θ_F) with probability 1 − λ.

Maximum Likelihood: $\theta_F = \arg\max_{\theta}\,\log p(d)$
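For this single-topic case the EM updates have a simple closed form; the following is a sketch consistent with the mixture above (λ held fixed, superscripts marking the iteration):

$$p^{(n)}(z_w = F) = \frac{(1-\lambda)\, p^{(n)}(w\mid\theta_F)}{\lambda\, p(w\mid\theta_B) + (1-\lambda)\, p^{(n)}(w\mid\theta_F)}$$

$$p^{(n+1)}(w\mid\theta_F) = \frac{c(w,d)\, p^{(n)}(z_w = F)}{\sum_{w'\in V} c(w',d)\, p^{(n)}(z_{w'} = F)}$$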

What if there are k topics?

Page 13: Topic Models for Text Mining

How to Estimate θ_j: the EM Algorithm

[Figure: a known background model p(w|θ_B) = {the 0.2, a 0.1, we 0.01, to 0.02, …} and two unknown topic models, p(w|θ_1) for "text mining" (text =?, mining =?, association =?, word =?, …) and p(w|θ_2) for "information retrieval" (information =?, retrieval =?, query =?, document =?, …), are estimated from the observed doc(s) by an ML estimator.]

Suppose we knew the identity of each word ...

Page 14: Topic Models for Text Mining

How the Algorithm Works

[Figure: two documents d1, d2 over the words aid, price, oil. Unknowns: π_{d1,1} = P(θ_1|d1), π_{d1,2} = P(θ_2|d1), π_{d2,1} = P(θ_1|d2), π_{d2,2} = P(θ_2|d2), and the word distributions P(w|θ_1), P(w|θ_2), all starting from initial values.]

Initialize π_{d,j} and P(w|θ_j) with random values.

Iteration 1, E-step: split the word counts among the topics (by computing the z's).
Iteration 1, M-step: re-estimate π_{d,j} and P(w|θ_j) by summing and normalizing the split word counts.
Iteration 2, E-step: split the word counts among the topics (by computing the z's).
Iteration 2, M-step: re-estimate π_{d,j} and P(w|θ_j) by summing and normalizing the split word counts.
Iterations 3, 4, 5, … until convergence.

[In the figure, each count c(w,d) (e.g., 7, 5, 6 for d1 and 8, 7, 5 for d2) is split into a background part c(w,d)·p(z_{d,w}=B) and per-topic parts c(w,d)·(1 − p(z_{d,w}=B))·p(z_{d,w}=j).]

Page 15: Topic Models for Text Mining

Parameter Estimation

E-step (an application of Bayes' rule): word w in doc d is generated either from cluster j or from the background:

$$p(z_{d,w}=j) = \frac{\pi_{d,j}^{(n)}\, p^{(n)}(w\mid\theta_j)}{\sum_{j'=1}^{k}\pi_{d,j'}^{(n)}\, p^{(n)}(w\mid\theta_{j'})}$$

$$p(z_{d,w}=B) = \frac{\lambda_B\, p(w\mid\theta_B)}{\lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}^{(n)}\, p^{(n)}(w\mid\theta_j)}$$

M-step: re-estimate the mixing weights and cluster language models from the fractional counts (counts contributing to using cluster j in generating d, and to generating w from cluster j):

$$\pi_{d,j}^{(n+1)} = \frac{\sum_{w\in V} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j)}{\sum_{j'=1}^{k}\sum_{w\in V} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j')}$$

$$p^{(n+1)}(w\mid\theta_j) = \frac{\sum_{i=1}^{m}\sum_{d\in C_i} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j)}{\sum_{w'\in V}\sum_{i=1}^{m}\sum_{d\in C_i} c(w',d)\,\big(1-p(z_{d,w'}=B)\big)\,p(z_{d,w'}=j)}$$

The sums over documents run over all collections C_1, …, C_m (m = 1 for a single collection).
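To make the E-step and M-step concrete, here is a minimal NumPy sketch of this EM loop for a single collection (an illustration, not the lecture's code; the function name, dense-matrix layout, and the small epsilon are my own choices):

```python
import numpy as np

def plsa_em(counts, k, lambda_b, p_bg, n_iter=100, seed=0):
    """PLSA with a fixed background model, fit by EM.
    counts:   (n_docs, n_words) term-count matrix c(w, d)
    k:        number of topics
    lambda_b: background noise level (manually set)
    p_bg:     (n_words,) background distribution p(w | theta_B)
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    pi = rng.dirichlet(np.ones(k), size=n_docs)      # mixing weights pi_{d,j}
    theta = rng.dirichlet(np.ones(n_words), size=k)  # topic word dists p(w|theta_j)
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: posterior of the hidden background/topic indicators
        mix = pi @ theta                             # sum_j pi_{d,j} p(w|theta_j)
        p_z_bg = lambda_b * p_bg / (lambda_b * p_bg + (1 - lambda_b) * mix + eps)
        p_z = pi[:, None, :] * theta.T[None, :, :]   # (n_docs, n_words, k)
        p_z /= p_z.sum(axis=2, keepdims=True) + eps  # p(z_{d,w} = j)
        # fractional counts with the background mass removed
        frac = counts * (1.0 - p_z_bg)
        # M-step: re-estimate mixing weights and topic word distributions
        pi = (frac[:, :, None] * p_z).sum(axis=1)
        pi /= pi.sum(axis=1, keepdims=True)
        theta = (frac[:, :, None] * p_z).sum(axis=0).T
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta
```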

Page 16: Topic Models for Text Mining

PLSA with Prior Knowledge

• There are different ways of choosing aspects (topics)
  – Google = Google News + Google Map + Google Scholar, …
  – Google = Google US + Google France + Google China, …
• Users have some domain knowledge in mind, e.g.,
  – We expect to see "retrieval models" as a topic in IR.
  – We want to show the aspects of "history" and "statistics" for YouTube.
• A flexible way to incorporate such knowledge is as priors of the PLSA model.
• In Bayesian terms, the prior encodes your "belief" about the topic distributions.

Page 17: Topic Models for Text Mining

Adding Prior

The mixture model is exactly the one on Page 11: topics θ_1, …, θ_k and a background θ_B "generate" word w in doc d, with mixing weights π_{d,1}, …, π_{d,k} and noise level λ_B (manually set). The only change is the estimation criterion: instead of Maximum Likelihood, the θ's and π's are estimated as the most likely parameters given both the prior and the data (MAP):

$$\Lambda^{*} = \arg\max_{\Lambda}\, p(\Lambda)\, p(\text{Data}\mid\Lambda)$$

Page 18: Topic Models for Text Mining

Adding Prior as Pseudo Counts

[Figure: same setup as on Page 13: a known background model p(w|θ_B) = {the 0.2, a 0.1, we 0.01, to 0.02, …} and unknown topic models p(w|θ_1) for "text mining" (text =?, mining =?, association =?, word =?, …) and p(w|θ_2) for "information retrieval" (information =?, retrieval =?, query =?, document =?, …) are estimated from the observed doc(s). The prior is added as a pseudo doc of size μ containing words such as "text" and "mining", and the ML estimator is replaced by a MAP estimator.]

Page 19: Topic Models for Text Mining

Maximum A Posteriori (MAP) Estimation

In the M-step, word w gets pseudo counts μ·p(w|θ'_j) from the prior θ'_j, and the normalizer gains the sum of all pseudo counts, μ:

$$p^{(n+1)}(w\mid\theta_j) = \frac{\sum_{d\in C} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j) + \mu\, p(w\mid\theta'_j)}{\sum_{w'\in V}\sum_{d\in C} c(w',d)\,\big(1-p(z_{d,w'}=B)\big)\,p(z_{d,w'}=j) + \mu}$$

What if μ = 0? What if μ = +∞?
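In code, MAP estimation only changes the topic-word M-step of the PLSA sketch on Page 15; `prior_theta` (holding p(w|θ'_j) for each topic) and `mu` are hypothetical names of my own:

```python
# MAP M-step: add mu pseudo counts from the prior to the ML fractional counts.
# prior_theta: (k, n_words), rows are p(w | theta'_j); mu: total pseudo-count mass.
theta = (frac[:, :, None] * p_z).sum(axis=0).T + mu * prior_theta
theta /= theta.sum(axis=1, keepdims=True)
# mu = 0 recovers the ML estimate; mu -> infinity pins theta to the prior.
```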

Page 20: Topic Models for Text Mining

Basic Topic Model: LDA

The following slides about LDA are taken from Michael C. Mozer's course lecture: http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/

Page 21: Topic Models for Text Mining

LDA: Motivation

Shortcomings of pLSI:
– "Documents have no generative probabilistic semantics" (i.e., a document is just a symbol)
– The model has many parameters (linear in the number of documents) and needs heuristic methods to prevent overfitting
– It cannot generalize to new documents

Page 22: Topic Models for Text Mining

Unigram Model

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$

Page 23: Topic Models for Text Mining

Mixture of Unigrams

$$p(\mathbf{w}) = \sum_{z} p(z)\prod_{n=1}^{N} p(w_n\mid z)$$

Page 24: Topic Models for Text Mining

Topic Model / Probabilistic LSI

$$p(d, w_n) = p(d)\sum_{z} p(w_n\mid z)\, p(z\mid d)$$

•d is a localist representation of (trained) documents

•LDA provides a distributed representation

Page 25: Topic Models for Text Mining

LDA

• Vocabulary of |V| words
• A document is a collection of N words from the vocabulary: w = (w_1, ..., w_N)
• Latent topics: a random variable z with values 1, ..., k
• Like the topic model (pLSI), a document is generated by sampling a topic from a mixture and then sampling a word from that topic.
• But the topic model assumes a fixed mixture of topics (multinomial distribution) for each document, while LDA assumes a random mixture of topics (drawn from a Dirichlet distribution) for each document; see the usage sketch below.
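In practice one rarely implements LDA from scratch; a hedged usage sketch with the gensim library (toy data, all names my own, not part of the lecture):

```python
from gensim import corpora, models

texts = [["human", "machine", "interface"],          # toy tokenized documents
         ["graph", "minors", "survey"]]
dictionary = corpora.Dictionary(texts)               # vocabulary of |V| words
corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words counts
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())                            # top words p(w | z) per topic
```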

Page 26: Topic Models for Text Mining

Generative Model

• "Plates" indicate looping structure: the outer plate is replicated for each document, the inner plate for each word; the same conditional distributions apply to each replicate.
• Document probability:

$$p(\mathbf{w}\mid\alpha,\beta) = \int p(\theta\mid\alpha)\Big(\prod_{n=1}^{N}\sum_{z_n} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)\Big)\,d\theta$$

Page 27: Topic Models for Text Mining

Fancier Version

$$p(\theta\mid\alpha) = \frac{\Gamma\!\big(\sum_{i=1}^{k}\alpha_i\big)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\;\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1}$$
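A quick way to build intuition for this density is to sample document-level topic mixtures θ ~ Dir(α) with NumPy (an illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])      # Dirichlet hyperparameters, k = 3 topics
theta = rng.dirichlet(alpha, size=5)   # one topic mixture per "document"
print(theta)                           # each row sums to 1: a p(z) per document
```

With α < 1 the samples concentrate near the corners of the simplex, i.e., documents tend to be dominated by a few topics.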

Page 28: Topic Models for Text Mining

Inference

$$p(\theta,\mathbf{z}\mid\mathbf{w},\alpha,\beta) = \frac{p(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta)}{p(\mathbf{w}\mid\alpha,\beta)}$$

$$p(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta) = p(\theta\mid\alpha)\prod_{n=1}^{N} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)$$

$$p(\mathbf{w}\mid\alpha,\beta) = \int p(\theta\mid\alpha)\Big(\prod_{n=1}^{N}\sum_{z_n} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)\Big)\,d\theta$$

Page 29: Topic Models for Text Mining

Inference

• In general, this formula is intractable:

$$p(\mathbf{w}\mid\alpha,\beta) = \int p(\theta\mid\alpha)\Big(\prod_{n=1}^{N}\sum_{z_n} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)\Big)\,d\theta$$

• Expanded version (with $w_n^j = 1$ if $w_n$ is the j-th vocabulary word):

$$p(\mathbf{w}\mid\alpha,\beta) = \frac{\Gamma\!\big(\sum_{i}\alpha_i\big)}{\prod_{i}\Gamma(\alpha_i)}\int\Big(\prod_{i=1}^{k}\theta_i^{\alpha_i-1}\Big)\Big(\prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V}(\theta_i\beta_{ij})^{w_n^j}\Big)\,d\theta$$

Page 30: Topic Models for Text Mining

Variational Approximation

• Compute the log likelihood and apply Jensen's inequality: log E[x] ≥ E[log x].
• Find a variational distribution q such that the resulting bound is computable.
  – q is parameterized by γ and φ_n
  – Maximize the bound with respect to γ and φ_n to obtain the best approximation to p(w | α, β)
  – This leads to a variational EM algorithm
• Sampling algorithms (e.g., Gibbs sampling) are also common; a sketch follows below.
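A compact sketch of collapsed Gibbs sampling for LDA (my own illustration under symmetric priors, not the variational algorithm above):

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_words, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs: list of lists of word ids. Returns topic-word counts."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, n_words))     # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    z = []                                  # topic assignment of every token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove this token's assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional p(z_i = t | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                 # record the new assignment
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return nkw
```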

Page 31: Topic Models for Text Mining

Data Sets

• C. Elegans Community abstracts: 5,225 abstracts, 28,414 unique terms
• TREC AP corpus (subset): 16,333 newswire articles, 23,075 unique terms
• Held-out data: 10%
• Removed terms: 50 stop words and words appearing once (see the preprocessing sketch below)
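This kind of preprocessing is easy to approximate with standard tooling; a rough sketch (scikit-learn's built-in English stop list and `min_df=2` only approximate the paper's 50-word stoplist and singleton removal; the toy `abstracts` are mine):

```python
from sklearn.feature_extraction.text import CountVectorizer

abstracts = ["the gene expression of c elegans",    # toy stand-ins for abstracts
             "gene regulation and expression data"]
vectorizer = CountVectorizer(stop_words="english",  # drop common stop words
                             min_df=2)              # drop terms in < 2 documents
X = vectorizer.fit_transform(abstracts)             # document-term count matrix
print(vectorizer.get_feature_names_out(), X.toarray())
```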

Page 32: Topic Models for Text Mining

C. Elegans

Note: a "fold-in" hack lets pLSI handle novel documents; it involves refitting the p(z|d_new) parameters, so it is sort of a cheat.

Page 33: Topic Models for Text Mining

AP

Page 34: Topic Models for Text Mining

Summary: PLSA vs. LDA

• LDA adds a Dirichlet distribution on top of PLSA to regularize the model
• Estimation of LDA is more complicated than PLSA
• LDA is a generative model, while PLSA isn't
• PLSA is more likely to over-fit the data than LDA
• Which one to use?
  – If you need generalization capacity, LDA
  – If you want to mine topics from a collection, PLSA may be better (we want overfitting!)

Page 35: Topic Models for Text Mining

Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA)

Page 36: Topic Models for Text Mining

A General Introduction to EM

Data: X (observed) + H (hidden); parameter: θ

"Incomplete" likelihood: L(θ) = log p(X|θ)
"Complete" likelihood: L_c(θ) = log p(X, H|θ)

EM tries to iteratively maximize the incomplete likelihood. Starting with an initial guess θ^(0):

1. E-step: compute the expectation of the complete likelihood

$$Q(\theta;\theta^{(n-1)}) = E\big[L_c(\theta)\mid X,\theta^{(n-1)}\big] = \sum_{h} p(H=h\mid X,\theta^{(n-1)})\,\log p(X, h\mid\theta)$$

2. M-step: compute θ^(n) by maximizing the Q-function

$$\theta^{(n)} = \arg\max_{\theta}\, Q(\theta;\theta^{(n-1)}) = \arg\max_{\theta}\sum_{h} p(H=h\mid X,\theta^{(n-1)})\,\log p(X, h\mid\theta)$$

Page 37: Topic Models for Text Mining

Convergence Guarantee

Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ), i.e., choose θ^(n) so that L(θ^(n)) − L(θ^(n−1)) ≥ 0.

Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ), we have L(θ) = L_c(θ) − log p(H|X, θ), so

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log\frac{p(H\mid X,\theta^{(n-1)})}{p(H\mid X,\theta^{(n)})}$$

Taking the expectation w.r.t. p(H|X, θ^(n−1)) (the left-hand side doesn't contain H):

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = Q(\theta^{(n)};\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\big(p(H\mid X,\theta^{(n-1)})\,\|\,p(H\mid X,\theta^{(n)})\big)$$

The KL-divergence is always non-negative, and EM chooses θ^(n) to maximize Q. Therefore L(θ^(n)) ≥ L(θ^(n−1))!

Page 38: Topic Models for Text Mining

Another Way of Looking at EM

[Figure: the likelihood p(X|θ) and its lower bound (the Q function) around the current guess θ^(n−1); maximizing the bound yields the next guess θ^(n).]

The decomposition behind the picture:

$$L(\theta) = L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\big(p(H\mid X,\theta^{(n-1)})\,\|\,p(H\mid X,\theta)\big)$$

$$L(\theta) \ge L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)})$$

E-step = computing the lower bound; M-step = maximizing the lower bound.

Page 39: Topic Models for Text Mining

Why Contextual PLSA?

Page 40: Topic Models for Text Mining

Motivating Example: Comparing Product Reviews

| Common Themes | "IBM" specific | "APPLE" specific | "DELL" specific |
| Battery Life | Long, 4-3 hrs | Medium, 3-2 hrs | Short, 2-1 hrs |
| Hard disk | Large, 80-100 GB | Small, 5-10 GB | Medium, 20-50 GB |
| Speed | Slow, 100-200 MHz | Very fast, 3-4 GHz | Moderate, 1-2 GHz |

(IBM laptop reviews vs. APPLE laptop reviews vs. DELL laptop reviews)

Unsupervised discovery of common topics and their variations

Page 41: Topic Models for Text Mining

Motivating Example: Comparing News about Similar Topics

| Common Themes | "Vietnam" specific | "Afghan" specific | "Iraq" specific |
| United nations | … | … | … |
| Death of people | … | … | … |
| … | … | … | … |

(Vietnam War vs. Afghan War vs. Iraq War)

Unsupervised discovery of common topics and their variations

Page 42: Topic Models for Text Mining

Motivating Example: Discovering Topical Trends in Literature

Unsupervised discovery of topics and their temporal variations

[Figure: theme strength over time (1980, 1990, 1998, 2003) for themes such as TF-IDF Retrieval, IR Applications, Language Model, and Text Categorization.]

Page 43: Topic Models for Text Mining

Motivating Example: Analyzing Spatial Topic Patterns

• How do blog writers in different states respond to topics such as "oil price increase during Hurricane Katrina"?

• Unsupervised discovery of topics and their variations in different locations

Page 44: Topic Models for Text Mining

Motivating Example: Sentiment Summary

Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics

Page 45: Topic Models for Text Mining

Research Questions

• Can we model all these problems generally?
• Can we solve these problems with a unified approach?
• How can we bring humans into the loop?

Page 46: Topic Models for Text Mining

Contextual Text Mining

• Given collections of text with contextual information (meta-data)
• Discover themes/subtopics/topics (interesting word clusters)
• Compute variations of themes over contexts
• Applications:
  – Summarizing search results
  – Federation of text information
  – Opinion analysis
  – Social network analysis
  – Business intelligence
  – …

Page 47: Topic Models for Text Mining

Context Features of Text (Meta-data)

[Figure: a weblog article annotated with its meta-data: author, author's occupation, location, time, communities, source.]

Page 48: Topic Models for Text Mining

Context = Partitioning of Text

[Figure: a corpus partitioned by context: by year (1998, 1999, …, 2005, 2006), e.g., papers written in 1998; by venue (WWW, SIGIR, ACL, KDD, SIGMOD); by author location, e.g., papers written by authors in the US; by content, e.g., papers about the Web.]

Page 49: Topic Models for Text Mining

Themes/Topics

• Uses of themes:
  – Summarize topics/subtopics
  – Navigate in a document space
  – Retrieve documents
  – Segment documents
  – …

[Figure: theme word distributions, e.g., Theme 1 = {government 0.3, response 0.2, …}, Theme 2 = {donate 0.1, relief 0.05, help 0.02, …}, Theme k = {city 0.2, new 0.1, orleans 0.05, …}, and Background = {is 0.05, the 0.04, a 0.03, …}, used to segment a Katrina excerpt: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …"]

Page 50: Topic Models for Text Mining

View of Themes: Context-Specific Versions of Views

[Figure: two themes, each with context-specific versions. Theme 1 (Retrieval Model): retrieve, model, relevance, document, query. In the context "before 1998" (traditional models): vector space, TF-IDF, Okapi, LSI, Rocchio, vector, weighting, feedback, term, retrieval. In the context "after 1998" (language models): language model, smoothing, query, generation, mixture, estimate, EM, pseudo, feedback. Theme 2 (Feedback): feedback, judge, expansion, pseudo, query.]

Page 51: Topic Models for Text Mining

Coverage of Themes: Distribution over Themes

• Theme coverage can depend on context

[Figure: a Katrina blog excerpt ("Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …") and two theme-coverage distributions over Background, Oil Price, Government Response, and Aid and Donation: one for context = Texas, one for context = Louisiana.]

Page 52: Topic Models for Text Mining

General Tasks of Contextual Text Mining

• Theme Extraction: extract the globally salient themes
  – Common information shared over all contexts
• View Comparison: compare a theme from different views
  – Analyze the content variation of themes over contexts
• Coverage Comparison: compare the theme coverage of different contexts
  – Reveal how closely a theme is associated with a context
• Others:
  – Causal analysis
  – Correlation analysis

Page 53: Topic Models for Text Mining

A General Solution: CPLSA

• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of the PLSA model ([Hofmann 99]) that adds:
  – Context variables
  – Modeling of views of topics
  – Modeling of coverage variations of topics
• Process of contextual text mining:
  – Instantiate CPLSA (context, views, coverage)
  – Fit the model to the text data (EM algorithm)
  – Compute probabilistic topic patterns

Page 54: Topic Models for Text Mining

"Generation" Process of CPLSA

[Figure: generating a document with context (Time = July 2005, Location = Texas, Author = xxx, Occupation = Sociologist, Age Group = 45+, …). For each word: choose a view (View 1 = Texas, View 2 = July 2005, View 3 = sociologist); choose a coverage (the Texas coverage, the July 2005 coverage, or a document-specific coverage); choose a theme according to that coverage (government = {government 0.3, response 0.2, …}, donation = {donate 0.1, relief 0.05, help 0.02, …}, New Orleans = {city 0.2, new 0.1, orleans 0.05, …}); and draw a word from the chosen theme (e.g., government, response, donate, aid, help, new, Orleans). Example text: "Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …"]

Page 55: Topic Models for Text Mining

Probabilistic Model

• To generate a document D with context feature set C:
  – Choose a view v_i according to the view distribution p(v_i | D, C)
  – Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C)
  – Choose a theme θ_{l,i} according to the coverage κ_j
  – Generate a word using θ_{l,i}
• The likelihood of the document collection is:

$$\log p(\mathcal{C}) = \sum_{(D,C)\in\mathcal{C}}\sum_{w\in V} c(w,D)\,\log\Big(\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{l=1}^{k} p(v_i\mid D,C)\, p(\kappa_j\mid D,C)\, p(l\mid\kappa_j)\, p(w\mid\theta_{l,i})\Big)$$

Page 56: Topic Models for Text Mining

Parameter Estimation: EM Algorithm

• Interesting patterns:
  – Theme content variation for each view: p(w | θ_{l,i})
  – Theme strength variation for each context: p(l | κ_j)
• A prior from a user can be incorporated using MAP estimation

E-step: for each word w in document D, compute the posterior of the hidden variable z_{D,w} that selects a (view, coverage, theme) triple or the background:

$$p\big(z_{D,w}=(i,j,l)\big) = \frac{p^{(t)}(v_i\mid D,C)\, p^{(t)}(\kappa_j\mid D,C)\, p^{(t)}(l\mid\kappa_j)\, p^{(t)}(w\mid\theta_{l,i})}{\sum_{i'=1}^{n}\sum_{j'=1}^{m}\sum_{l'=1}^{k} p^{(t)}(v_{i'}\mid D,C)\, p^{(t)}(\kappa_{j'}\mid D,C)\, p^{(t)}(l'\mid\kappa_{j'})\, p^{(t)}(w\mid\theta_{l',i'})}$$

$$p\big(z_{D,w}=B\big) = \frac{\lambda_B\, p(w\mid\theta_B)}{\lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{i,j,l} p^{(t)}(v_i\mid D,C)\, p^{(t)}(\kappa_j\mid D,C)\, p^{(t)}(l\mid\kappa_j)\, p^{(t)}(w\mid\theta_{l,i})}$$

M-step: re-estimate each distribution by summing and normalizing the fractional counts $c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$:

$$p^{(t+1)}(v_i\mid D,C) \propto \sum_{w\in V}\sum_{j=1}^{m}\sum_{l=1}^{k} c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$$

$$p^{(t+1)}(\kappa_j\mid D,C) \propto \sum_{w\in V}\sum_{i=1}^{n}\sum_{l=1}^{k} c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$$

$$p^{(t+1)}(l\mid\kappa_j) \propto \sum_{(D,C)}\sum_{w\in V}\sum_{i=1}^{n} c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$$

$$p^{(t+1)}(w\mid\theta_{l,i}) \propto \sum_{(D,C)}\sum_{j=1}^{m} c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$$

Page 57: Topic Models for Text Mining

Regularization of the Model

• Why?
  – Generality brings high complexity (inefficiency, multiple local maxima)
  – Real applications have domain constraints/knowledge
• Two useful simplifications:
  – Fixed-Coverage: only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis)
  – Fixed-View: only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)
• In general:
  – Impose priors on model parameters
  – Support the whole spectrum from unsupervised to supervised learning

Page 58: Topic Models for Text Mining

Interpretation of Topics

Statistical (multinomial) topic model to be labeled, e.g.: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

[Pipeline: an NLP chunker and n-gram statistics over the collection (context) produce a candidate label pool (e.g., database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …); candidates are scored for relevance to the topic and re-ranked by coverage and discrimination, yielding a ranked list of labels (e.g., clustering algorithm; distance measure; …).]

Page 59: Topic Models for Text Mining

Relevance: the Zero-Order Score

• Intuition: prefer phrases covering high-probability words of the topic.

[Figure: a latent topic with p(w|θ) concentrated on clustering, dimensional, algorithm, birch, shape, …; "clustering algorithm" (l_1) is a good label, "body shape" (l_2) a bad one.]

$$\text{Score}(l,\theta) = \log\frac{p(l\mid\theta)}{p(l)}$$

Page 60: Topic Models for Text Mining

Relevance: the First-Order Score

• Intuition: prefer phrases with similar context (distribution)

[Figure: the topic's distribution P(w|θ) (clustering, dimension, partition, algorithm, hash, …) is compared with the context distribution P(w|l_1) of the good label "clustering algorithm" and P(w|l_2) of the bad label "hash join", both estimated from C = SIGMOD proceedings; D(θ||l_1) < D(θ||l_2).]

$$\text{Score}(l,\theta) = \sum_{w} p(w\mid\theta)\,\mathrm{PMI}(w, l\mid C)$$

Page 61: Topic Models for Text Mining

Sample Results

• Comparative text mining

• Spatiotemporal pattern mining

• Sentiment summary

• Event impact analysis

• Temporal author-topic analysis

Page 62: Topic Models for Text Mining

Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)

| | Cluster 1 | Cluster 2 | Cluster 3 |
| Common Theme | united 0.042, nations 0.04, … | killed 0.035, month 0.032, deaths 0.023, … | … |
| Iraq Theme | n 0.03, weapons 0.024, inspections 0.023, … | troops 0.016, hoon 0.015, sanches 0.012, … | … |
| Afghan Theme | northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, … | taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, … | … |

The common theme indicates that "United Nations" is involved in both wars; the collection-specific themes indicate the different roles of "United Nations" in the two wars.

Page 63: Topic Models for Text Mining

Comparing Laptop Reviews

Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]).

These word distributions can be used to segment text and add hyperlinks between documents

Page 64: Topic Models for Text Mining

Spatiotemporal Patterns in Blog Articles

• Query= “Hurricane Katrina”

• Topics in the results:

• Spatiotemporal patterns

Page 65: Topic Models for Text Mining

Theme Life Cycles for Hurricane Katrina

New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …
Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …

[Figure: life cycles (strength over time) of the two themes.]

Page 66: Topic Models for Text Mining

Theme Snapshots for Hurricane Katrina

Week4: The theme is again strong along the east coast and the Gulf of Mexico

Week3: The theme distributes more uniformly over the states

Week2: The discussion moves towards the north and west

Week5: The theme fades out in most states

Week1: The theme is the strongest along the Gulf of Mexico

Page 67: Topic Models for Text Mining

Theme Life Cycles: KDD

[Figure: global theme life cycles of KDD abstracts: normalized theme strength over time, 1999-2004, for the themes Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business.]

Biology Data: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
Business: marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
Association Rule: rules 0.0142, association 0.0064, support 0.0053, …

Page 68: Topic Models for Text Mining

Theme Evolution Graph: KDD

[Figure: themes evolving over 1999-2004, with edges linking related themes across years. Example themes: decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …; SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …; classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …; web 0.009, classification 0.007, features 0.006, topic 0.005, …; information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …; mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …; topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, ….]

Page 69: Topic Models for Text Mining

Blog Sentiment Summary (query=“Da Vinci Code”)

Facet 1: Movie
  – Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ..."
  – Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role." / "Anybody is interested in it?"
  – Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."

Facet 2: Book
  – Neutral: "I remembered when i first read the book, I finished the book in two days." / "I'm reading 'Da Vinci Code' now."
  – Positive: "Awesome book." / "So still a good book to past time."
  – Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."

Page 70: Topic Models for Text Mining

Results: Sentiment Dynamics

Facet: the book "The Da Vinci Code" (bursts during the movie, Pos > Neg)

Facet: religious beliefs (bursts during the movie, Neg > Pos)

Page 71: Topic Models for Text Mining

Event Impact Analysis: IR Research

Theme: retrieval models, traced through SIGIR papers: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

[Figure: impact of two events on the theme. Around 1992, the start of the TREC conferences; earlier themes include vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, … and probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …; later ones include xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …. In 1998, the publication of the paper "A language modeling approach to information retrieval"; afterwards: model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, ….]

Page 72: Topic Models for Text Mining

Temporal-Author-Topic Analysis

pattern 0.1107frequent 0.0406frequent-pattern 0.039 sequential 0.0360pattern-growth 0.0203constraint 0.0184push 0.0138…

project 0.0444itemset 0.0433intertransaction 0.0397 support 0.0264associate 0.0258frequent 0.0181closet 0.0176prefixspan 0.0170…

research 0.0551next 0.0308transaction 0.0308 panel 0.0275technical 0.0275article 0.0258revolution 0.0154innovate 0.0154…

close 0.0805pattern 0.0720sequential 0.0462 min_support 0.0353threshold 0.0207top-k 0.0176fp-tree 0.0102…

index 0.0440graph 0.0343web 0.0307gspan 0.0273substructure 0.0201gindex 0.0164bide 0.0115xml 0.0109…

2000time

Author

Author B

Author AGlobal theme: frequent patterns

Jiawei Han

Rakesh Agrawal

Page 73: Topic Models for Text Mining

Modeling Topical Communities (Mei et al. 08)


Community 1: Information Retrieval

Community 2: Data Mining

Community 3: Machine Learning

Page 74: Topic Models for Text Mining

Other Extensions (LDA Extensions)

• Many extensions of LDA, mostly done by David Blei, Andrew McCallum, and their co-authors
• Some examples:
  – Hierarchical topic models [Blei et al. 03]
  – Modeling annotated data [Blei & Jordan 03]
  – Dynamic topic models [Blei & Lafferty 06]
  – Pachinko allocation [Li & McCallum 06]
• Also some context-specific extensions of PLSA, e.g., the author-topic model [Steyvers et al. 04]

Page 75: Topic Models for Text Mining

Future Research Directions

• Topic models for text mining
  – Evaluation of topic models
  – Improve the efficiency of estimation and inference
  – Incorporate linguistic knowledge
  – Applications in new domains and for new tasks
• Text mining in general
  – Combination of NLP-style and DM-style mining algorithms
  – Integrated mining of text (unstructured) and structured data (e.g., Text OLAP)
  – Interactive mining:
    • Incorporate user constraints and support iterative mining
    • Design and implement mining languages

Page 76: Topic Models for Text Mining

What You Should Know

• How PLSA works
• How the EM algorithm works in general
• Contextual PLSA can be used to perform many quite different interesting text mining tasks

Page 77: Topic Models for Text Mining

Roadmap

• This lecture: Topic models for text mining

• Next lecture: Next generation search engines