Dragon Star Program Course (龙星计划课程): Information Retrieval (信息检索)
Topic Models for Text Mining

Page 1: Topic Models for Text Mining

Dragon Star Program Course: Information Retrieval
Topic Models for Text Mining

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

Page 2: Topic Models for Text Mining

Text Management Applications

[Diagram: three text management applications and what they do: Access (select information), Mining (create knowledge), Organization (add structure/annotations).]

Page 3: Topic Models for Text Mining

What Is Text Mining?

"The objective of Text Mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc." (Grobelnik et al., 2001)

“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)

Page 4: Topic Models for Text Mining

Two Different Views of Text Mining

• Data Mining View: Explore patterns in textual data (shallow mining)
  – Find latent topics
  – Find topical trends
  – Find outliers and other hidden patterns

• Natural Language Processing View: Make inferences based on partial understanding of natural language text (deep mining)
  – Information extraction
  – Question answering

Page 5: Topic Models for Text Mining

Applications of Text Mining

• Direct applications: Go beyond search to find knowledge
  – Question-driven (Bioinformatics, Business Intelligence, etc.): We have specific questions; how can we exploit data mining to answer the questions?
  – Data-driven (WWW, literature, email, customer reviews, etc.): We have a lot of data; what can we do with it?

• Indirect applications
  – Assist information access (e.g., discover latent topics to better summarize search results)
  – Assist information organization (e.g., discover hidden structures)

Page 6: Topic Models for Text Mining

Text Mining Methods

• Data Mining Style: view text as high-dimensional data
  – Frequent pattern finding
  – Association analysis
  – Outlier detection
• Information Retrieval Style: fine-granularity topical analysis
  – Topic extraction
  – Exploit term weighting and text similarity measures
• Natural Language Processing Style: information extraction
  – Entity extraction
  – Relation extraction
  – Sentiment analysis
  – Question answering
• Machine Learning Style: unsupervised or semi-supervised learning (the topic of this lecture)
  – Mixture models
  – Dimension reduction

Page 7: Topic Models for Text Mining

Outline

• The basic topic models:
  – Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
  – Latent Dirichlet Allocation (LDA) [Blei et al. 02]
• Extensions
  – Contextual Probabilistic Latent Semantic Analysis (CPLSA) [Mei & Zhai 06]
  – Other extensions

Page 8: Topic Models for Text Mining

Basic Topic Model: PLSA

Page 9: Topic Models for Text Mining

PLSA: Motivation

What did people say in their blog articles about "Hurricane Katrina"?
Query = "Hurricane Katrina". Results (topics discovered):

• Government Response: bush 0.071, president 0.061, federal 0.051, government 0.047, fema 0.047, administrate 0.023, response 0.020, brown 0.019, blame 0.017, governor 0.014
• New Orleans: city 0.063, orleans 0.054, new 0.034, louisiana 0.023, flood 0.022, evacuate 0.021, storm 0.017, resident 0.016, center 0.016, rescue 0.012
• Oil Price: price 0.077, oil 0.064, gas 0.045, increase 0.020, product 0.020, fuel 0.018, company 0.018, energy 0.017, market 0.016, gasoline 0.012
• Praying and Blessing: god 0.141, pray 0.047, prayer 0.041, love 0.030, life 0.025, bless 0.025, lord 0.017, jesus 0.016, will 0.013, faith 0.012
• Aid and Donation: donate 0.120, relief 0.076, red 0.070, cross 0.065, help 0.050, victim 0.036, organize 0.022, effort 0.020, fund 0.019, volunteer 0.019
• Personal: i 0.405, my 0.116, me 0.060, am 0.029, think 0.015, feel 0.012, know 0.011, something 0.007, guess 0.007, myself 0.006

Page 10: Topic Models for Text Mining

Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI) [Hofmann 99]

• Mix k multinomial distributions to generate a document

• Each document has a potentially different set of mixing weights which captures the topic coverage

• When generating words in a document, each word may be generated using a DIFFERENT multinomial distribution (this is in contrast with the document clustering model where, once a multinomial distribution is chosen, all the words in a document would be generated using the same model)

• We may add a background distribution to “attract” background words

Page 11: Topic Models for Text Mining

PLSA as a Mixture Model

[Figure: "Generating" word w in doc d in the collection. k topic distributions θ_1, …, θ_k (e.g., {warning 0.3, system 0.2, …}, {aid 0.1, donation 0.05, support 0.02, …}, {statistics 0.2, loss 0.1, dead 0.05, …}) and a background distribution θ_B = {is 0.05, the 0.04, a 0.03, …}; a word in d is drawn from θ_B with probability λ_B and otherwise from topic θ_j with probability (1 − λ_B) π_{d,j}.]

Parameters: λ_B = noise level (manually set); the θ's and π's are estimated with Maximum Likelihood:

$$p_d(w) = \lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\, p(w\mid\theta_j)$$

$$\log p(d) = \sum_{w\in V} c(w,d)\,\log\Big[\lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\, p(w\mid\theta_j)\Big]$$

Page 12: Topic Models for Text Mining

Special Case: Model-based Feedback

• Simple case: there is only one topic

$$p_d(w) = \lambda\, p(w\mid\theta_B) + (1-\lambda)\, p(w\mid\theta_F)$$

$$\log p(d) = \sum_{w\in V} c(w,d)\,\log\big[\lambda\, p(w\mid\theta_B) + (1-\lambda)\, p(w\mid\theta_F)\big]$$

Here λ = P(source) is the probability of picking the background source: background words come from p(w|θ_B) with probability λ, topic words from p(w|θ_F) with probability 1 − λ.

Maximum Likelihood: $\theta_F = \arg\max_{\theta}\,\log p(d)$
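For this single-topic case the EM updates have a simple closed form; the following is a sketch consistent with the mixture above (λ held fixed, superscripts marking the iteration):

$$p^{(n)}(z_w = F) = \frac{(1-\lambda)\, p^{(n)}(w\mid\theta_F)}{\lambda\, p(w\mid\theta_B) + (1-\lambda)\, p^{(n)}(w\mid\theta_F)}$$

$$p^{(n+1)}(w\mid\theta_F) = \frac{c(w,d)\, p^{(n)}(z_w = F)}{\sum_{w'\in V} c(w',d)\, p^{(n)}(z_{w'} = F)}$$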

What if there are k topics?

Page 13: Topic Models for Text Mining

How to Estimate θ_j: the EM Algorithm

[Figure: a known background model p(w|θ_B) = {the 0.2, a 0.1, we 0.01, to 0.02, …} and two unknown topic models, p(w|θ_1) for "text mining" (text =?, mining =?, association =?, word =?, …) and p(w|θ_2) for "information retrieval" (information =?, retrieval =?, query =?, document =?, …), are estimated from the observed doc(s) by an ML estimator.]

Suppose we knew the identity of each word ...

Page 14: Topic Models for Text Mining

How the Algorithm Works

[Figure: two documents d1, d2 over the words aid, price, oil. Unknowns: π_{d1,1} = P(θ_1|d1), π_{d1,2} = P(θ_2|d1), π_{d2,1} = P(θ_1|d2), π_{d2,2} = P(θ_2|d2), and the word distributions P(w|θ_1), P(w|θ_2), all starting from initial values.]

Initialize π_{d,j} and P(w|θ_j) with random values.

Iteration 1, E-step: split the word counts among the topics (by computing the z's).
Iteration 1, M-step: re-estimate π_{d,j} and P(w|θ_j) by summing and normalizing the split word counts.
Iteration 2, E-step: split the word counts among the topics (by computing the z's).
Iteration 2, M-step: re-estimate π_{d,j} and P(w|θ_j) by summing and normalizing the split word counts.
Iterations 3, 4, 5, … until convergence.

[In the figure, each count c(w,d) (e.g., 7, 5, 6 for d1 and 8, 7, 5 for d2) is split into a background part c(w,d)·p(z_{d,w}=B) and per-topic parts c(w,d)·(1 − p(z_{d,w}=B))·p(z_{d,w}=j).]

Page 15: Topic Models for Text Mining

Parameter Estimation

E-step (an application of Bayes' rule): word w in doc d is generated either from cluster j or from the background:

$$p(z_{d,w}=j) = \frac{\pi_{d,j}^{(n)}\, p^{(n)}(w\mid\theta_j)}{\sum_{j'=1}^{k}\pi_{d,j'}^{(n)}\, p^{(n)}(w\mid\theta_{j'})}$$

$$p(z_{d,w}=B) = \frac{\lambda_B\, p(w\mid\theta_B)}{\lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}^{(n)}\, p^{(n)}(w\mid\theta_j)}$$

M-step: re-estimate the mixing weights and cluster language models from the fractional counts (counts contributing to using cluster j in generating d, and to generating w from cluster j):

$$\pi_{d,j}^{(n+1)} = \frac{\sum_{w\in V} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j)}{\sum_{j'=1}^{k}\sum_{w\in V} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j')}$$

$$p^{(n+1)}(w\mid\theta_j) = \frac{\sum_{i=1}^{m}\sum_{d\in C_i} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j)}{\sum_{w'\in V}\sum_{i=1}^{m}\sum_{d\in C_i} c(w',d)\,\big(1-p(z_{d,w'}=B)\big)\,p(z_{d,w'}=j)}$$

The sums over documents run over all collections C_1, …, C_m (m = 1 for a single collection).
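To make the E-step and M-step concrete, here is a minimal NumPy sketch of this EM loop for a single collection (an illustration, not the lecture's code; the function name, dense-matrix layout, and the small epsilon are my own choices):

```python
import numpy as np

def plsa_em(counts, k, lambda_b, p_bg, n_iter=100, seed=0):
    """PLSA with a fixed background model, fit by EM.
    counts:   (n_docs, n_words) term-count matrix c(w, d)
    k:        number of topics
    lambda_b: background noise level (manually set)
    p_bg:     (n_words,) background distribution p(w | theta_B)
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    pi = rng.dirichlet(np.ones(k), size=n_docs)      # mixing weights pi_{d,j}
    theta = rng.dirichlet(np.ones(n_words), size=k)  # topic word dists p(w|theta_j)
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: posterior of the hidden background/topic indicators
        mix = pi @ theta                             # sum_j pi_{d,j} p(w|theta_j)
        p_z_bg = lambda_b * p_bg / (lambda_b * p_bg + (1 - lambda_b) * mix + eps)
        p_z = pi[:, None, :] * theta.T[None, :, :]   # (n_docs, n_words, k)
        p_z /= p_z.sum(axis=2, keepdims=True) + eps  # p(z_{d,w} = j)
        # fractional counts with the background mass removed
        frac = counts * (1.0 - p_z_bg)
        # M-step: re-estimate mixing weights and topic word distributions
        pi = (frac[:, :, None] * p_z).sum(axis=1)
        pi /= pi.sum(axis=1, keepdims=True)
        theta = (frac[:, :, None] * p_z).sum(axis=0).T
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta
```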

Page 16: Topic Models for Text Mining

PLSA with Prior Knowledge

• There are different ways of choosing aspects (topics)
  – Google = Google News + Google Map + Google Scholar, …
  – Google = Google US + Google France + Google China, …
• Users have some domain knowledge in mind, e.g.,
  – We expect to see "retrieval models" as a topic in IR.
  – We want to show the aspects of "history" and "statistics" for YouTube.
• A flexible way to incorporate such knowledge is as priors of the PLSA model.
• In Bayesian terms, the prior encodes your "belief" about the topic distributions.

Page 17: Topic Models for Text Mining

Adding Prior

The mixture model is exactly the one on Page 11: topics θ_1, …, θ_k and a background θ_B "generate" word w in doc d, with mixing weights π_{d,1}, …, π_{d,k} and noise level λ_B (manually set). The only change is the estimation criterion: instead of Maximum Likelihood, the θ's and π's are estimated as the most likely parameters given both the prior and the data (MAP):

$$\Lambda^{*} = \arg\max_{\Lambda}\, p(\Lambda)\, p(\text{Data}\mid\Lambda)$$

Page 18: Topic Models for Text Mining

Adding Prior as Pseudo Counts

[Figure: same setup as on Page 13: a known background model p(w|θ_B) = {the 0.2, a 0.1, we 0.01, to 0.02, …} and unknown topic models p(w|θ_1) for "text mining" (text =?, mining =?, association =?, word =?, …) and p(w|θ_2) for "information retrieval" (information =?, retrieval =?, query =?, document =?, …) are estimated from the observed doc(s). The prior is added as a pseudo doc of size μ containing words such as "text" and "mining", and the ML estimator is replaced by a MAP estimator.]

Page 19: Topic Models for Text Mining

Maximum A Posteriori (MAP) Estimation

In the M-step, word w gets pseudo counts μ·p(w|θ'_j) from the prior θ'_j, and the normalizer gains the sum of all pseudo counts, μ:

$$p^{(n+1)}(w\mid\theta_j) = \frac{\sum_{d\in C} c(w,d)\,\big(1-p(z_{d,w}=B)\big)\,p(z_{d,w}=j) + \mu\, p(w\mid\theta'_j)}{\sum_{w'\in V}\sum_{d\in C} c(w',d)\,\big(1-p(z_{d,w'}=B)\big)\,p(z_{d,w'}=j) + \mu}$$

What if μ = 0? What if μ = +∞?
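In code, MAP estimation only changes the topic-word M-step of the PLSA sketch on Page 15; `prior_theta` (holding p(w|θ'_j) for each topic) and `mu` are hypothetical names of my own:

```python
# MAP M-step: add mu pseudo counts from the prior to the ML fractional counts.
# prior_theta: (k, n_words), rows are p(w | theta'_j); mu: total pseudo-count mass.
theta = (frac[:, :, None] * p_z).sum(axis=0).T + mu * prior_theta
theta /= theta.sum(axis=1, keepdims=True)
# mu = 0 recovers the ML estimate; mu -> infinity pins theta to the prior.
```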

Page 20: Topic Models for Text Mining

Basic Topic Model: LDA

The following slides about LDA are taken from Michael C. Mozer's course lecture: http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/

Page 21: Topic Models for Text Mining

LDA: Motivation

Shortcomings of pLSI:
– "Documents have no generative probabilistic semantics" (i.e., a document is just a symbol)
– The model has many parameters (linear in the number of documents) and needs heuristic methods to prevent overfitting
– It cannot generalize to new documents

Page 22: Topic Models for Text Mining

Unigram Model

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$

Page 23: Topic Models for Text Mining

Mixture of Unigrams

$$p(\mathbf{w}) = \sum_{z} p(z)\prod_{n=1}^{N} p(w_n\mid z)$$

Page 24: Topic Models for Text Mining

Topic Model / Probabilistic LSI

$$p(d, w_n) = p(d)\sum_{z} p(w_n\mid z)\, p(z\mid d)$$

•d is a localist representation of (trained) documents

•LDA provides a distributed representation

Page 25: Topic Models for Text Mining

LDA

• Vocabulary of |V| words
• A document is a collection of N words from the vocabulary: w = (w_1, ..., w_N)
• Latent topics: a random variable z with values 1, ..., k
• Like the topic model (pLSI), a document is generated by sampling a topic from a mixture and then sampling a word from that topic.
• But the topic model assumes a fixed mixture of topics (multinomial distribution) for each document, while LDA assumes a random mixture of topics (drawn from a Dirichlet distribution) for each document; see the usage sketch below.
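In practice one rarely implements LDA from scratch; a hedged usage sketch with the gensim library (toy data, all names my own, not part of the lecture):

```python
from gensim import corpora, models

texts = [["human", "machine", "interface"],          # toy tokenized documents
         ["graph", "minors", "survey"]]
dictionary = corpora.Dictionary(texts)               # vocabulary of |V| words
corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words counts
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())                            # top words p(w | z) per topic
```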

Page 26: Topic Models for Text Mining

Generative Model

• "Plates" indicate looping structure: the outer plate is replicated for each document, the inner plate for each word; the same conditional distributions apply to each replicate.
• Document probability:

$$p(\mathbf{w}\mid\alpha,\beta) = \int p(\theta\mid\alpha)\Big(\prod_{n=1}^{N}\sum_{z_n} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)\Big)\,d\theta$$

Page 27: Topic Models for Text Mining

Fancier Version

$$p(\theta\mid\alpha) = \frac{\Gamma\!\big(\sum_{i=1}^{k}\alpha_i\big)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\;\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1}$$
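A quick way to build intuition for this density is to sample document-level topic mixtures θ ~ Dir(α) with NumPy (an illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])      # Dirichlet hyperparameters, k = 3 topics
theta = rng.dirichlet(alpha, size=5)   # one topic mixture per "document"
print(theta)                           # each row sums to 1: a p(z) per document
```

With α < 1 the samples concentrate near the corners of the simplex, i.e., documents tend to be dominated by a few topics.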

Page 28: Topic Models for Text Mining

Inference

$$p(\theta,\mathbf{z}\mid\mathbf{w},\alpha,\beta) = \frac{p(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta)}{p(\mathbf{w}\mid\alpha,\beta)}$$

$$p(\theta,\mathbf{z},\mathbf{w}\mid\alpha,\beta) = p(\theta\mid\alpha)\prod_{n=1}^{N} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)$$

$$p(\mathbf{w}\mid\alpha,\beta) = \int p(\theta\mid\alpha)\Big(\prod_{n=1}^{N}\sum_{z_n} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)\Big)\,d\theta$$

Page 29: Topic Models for Text Mining

Inference

• In general, this formula is intractable:

$$p(\mathbf{w}\mid\alpha,\beta) = \int p(\theta\mid\alpha)\Big(\prod_{n=1}^{N}\sum_{z_n} p(z_n\mid\theta)\, p(w_n\mid z_n,\beta)\Big)\,d\theta$$

• Expanded version (with $w_n^j = 1$ if $w_n$ is the j-th vocabulary word):

$$p(\mathbf{w}\mid\alpha,\beta) = \frac{\Gamma\!\big(\sum_{i}\alpha_i\big)}{\prod_{i}\Gamma(\alpha_i)}\int\Big(\prod_{i=1}^{k}\theta_i^{\alpha_i-1}\Big)\Big(\prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V}(\theta_i\beta_{ij})^{w_n^j}\Big)\,d\theta$$

Page 30: Topic Models for Text Mining

Variational Approximation

• Compute the log likelihood and apply Jensen's inequality: log E[x] ≥ E[log x].
• Find a variational distribution q such that the resulting bound is computable.
  – q is parameterized by γ and φ_n
  – Maximize the bound with respect to γ and φ_n to obtain the best approximation to p(w | α, β)
  – This leads to a variational EM algorithm
• Sampling algorithms (e.g., Gibbs sampling) are also common; a sketch follows below.
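A compact sketch of collapsed Gibbs sampling for LDA (my own illustration under symmetric priors, not the variational algorithm above):

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_words, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """docs: list of lists of word ids. Returns topic-word counts."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, n_words))     # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    z = []                                  # topic assignment of every token
    for d, doc in enumerate(docs):
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove this token's assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional p(z_i = t | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_words * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                 # record the new assignment
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return nkw
```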

Page 31: Topic Models for Text Mining

Data Sets

• C. Elegans Community abstracts: 5,225 abstracts, 28,414 unique terms
• TREC AP corpus (subset): 16,333 newswire articles, 23,075 unique terms
• Held-out data: 10%
• Removed terms: 50 stop words and words appearing once (see the preprocessing sketch below)
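This kind of preprocessing is easy to approximate with standard tooling; a rough sketch (scikit-learn's built-in English stop list and `min_df=2` only approximate the paper's 50-word stoplist and singleton removal; the toy `abstracts` are mine):

```python
from sklearn.feature_extraction.text import CountVectorizer

abstracts = ["the gene expression of c elegans",    # toy stand-ins for abstracts
             "gene regulation and expression data"]
vectorizer = CountVectorizer(stop_words="english",  # drop common stop words
                             min_df=2)              # drop terms in < 2 documents
X = vectorizer.fit_transform(abstracts)             # document-term count matrix
print(vectorizer.get_feature_names_out(), X.toarray())
```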

Page 32: Topic Models for Text Mining

C. Elegans

Note: a "fold-in" hack lets pLSI handle novel documents; it involves refitting the p(z|d_new) parameters, so it is sort of a cheat.

Page 33: Topic Models for Text Mining

AP

Page 34: Topic Models for Text Mining

Summary: PLSA vs. LDA

• LDA adds a Dirichlet distribution on top of PLSA to regularize the model
• Estimation of LDA is more complicated than PLSA
• LDA is a generative model, while PLSA isn't
• PLSA is more likely to over-fit the data than LDA
• Which one to use?
  – If you need generalization capacity, LDA
  – If you want to mine topics from a collection, PLSA may be better (we want overfitting!)

Page 35: Topic Models for Text Mining

Extension of PLSA: Contextual Probabilistic Latent Semantic Analysis (CPLSA)

Page 36: Topic Models for Text Mining

A General Introduction to EM

Data: X (observed) + H (hidden); parameter: θ

"Incomplete" likelihood: L(θ) = log p(X|θ)
"Complete" likelihood: L_c(θ) = log p(X, H|θ)

EM tries to iteratively maximize the incomplete likelihood. Starting with an initial guess θ^(0):

1. E-step: compute the expectation of the complete likelihood

$$Q(\theta;\theta^{(n-1)}) = E\big[L_c(\theta)\mid X,\theta^{(n-1)}\big] = \sum_{h} p(H=h\mid X,\theta^{(n-1)})\,\log p(X, h\mid\theta)$$

2. M-step: compute θ^(n) by maximizing the Q-function

$$\theta^{(n)} = \arg\max_{\theta}\, Q(\theta;\theta^{(n-1)}) = \arg\max_{\theta}\sum_{h} p(H=h\mid X,\theta^{(n-1)})\,\log p(X, h\mid\theta)$$

Page 37: Topic Models for Text Mining

Convergence Guarantee

Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ), i.e., choose θ^(n) so that L(θ^(n)) − L(θ^(n−1)) ≥ 0.

Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ), we have L(θ) = L_c(θ) − log p(H|X, θ), so

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log\frac{p(H\mid X,\theta^{(n-1)})}{p(H\mid X,\theta^{(n)})}$$

Taking the expectation w.r.t. p(H|X, θ^(n−1)) (the left-hand side doesn't contain H):

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = Q(\theta^{(n)};\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\big(p(H\mid X,\theta^{(n-1)})\,\|\,p(H\mid X,\theta^{(n)})\big)$$

The KL-divergence is always non-negative, and EM chooses θ^(n) to maximize Q. Therefore L(θ^(n)) ≥ L(θ^(n−1))!

Page 38: Topic Models for Text Mining

Another Way of Looking at EM

[Figure: the likelihood p(X|θ) and its lower bound (the Q function) around the current guess θ^(n−1); maximizing the bound yields the next guess θ^(n).]

The decomposition behind the picture:

$$L(\theta) = L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)}) + D\big(p(H\mid X,\theta^{(n-1)})\,\|\,p(H\mid X,\theta)\big)$$

$$L(\theta) \ge L(\theta^{(n-1)}) + Q(\theta;\theta^{(n-1)}) - Q(\theta^{(n-1)};\theta^{(n-1)})$$

E-step = computing the lower bound; M-step = maximizing the lower bound.

Page 39: Topic Models for Text Mining

Why Contextual PLSA?

Page 40: Topic Models for Text Mining

Motivating Example: Comparing Product Reviews

| Common Themes | "IBM" specific | "APPLE" specific | "DELL" specific |
| Battery Life | Long, 4-3 hrs | Medium, 3-2 hrs | Short, 2-1 hrs |
| Hard disk | Large, 80-100 GB | Small, 5-10 GB | Medium, 20-50 GB |
| Speed | Slow, 100-200 MHz | Very fast, 3-4 GHz | Moderate, 1-2 GHz |

(IBM laptop reviews vs. APPLE laptop reviews vs. DELL laptop reviews)

Unsupervised discovery of common topics and their variations

Page 41: Topic Models for Text Mining

Motivating Example: Comparing News about Similar Topics

| Common Themes | "Vietnam" specific | "Afghan" specific | "Iraq" specific |
| United nations | … | … | … |
| Death of people | … | … | … |
| … | … | … | … |

(Vietnam War vs. Afghan War vs. Iraq War)

Unsupervised discovery of common topics and their variations

Page 42: Topic Models for Text Mining

Motivating Example: Discovering Topical Trends in Literature

Unsupervised discovery of topics and their temporal variations

[Figure: theme strength over time (1980, 1990, 1998, 2003) for themes such as TF-IDF Retrieval, IR Applications, Language Model, and Text Categorization.]

Page 43: Topic Models for Text Mining

Motivating Example: Analyzing Spatial Topic Patterns

• How do blog writers in different states respond to topics such as "oil price increase during Hurricane Katrina"?

• Unsupervised discovery of topics and their variations in different locations

Page 44: Topic Models for Text Mining

Motivating Example: Sentiment Summary

Unsupervised/Semi-supervised discovery of topics and different sentiments of the topics

Page 45: Topic Models for Text Mining

Research Questions

• Can we model all these problems generally?
• Can we solve these problems with a unified approach?
• How can we bring humans into the loop?

Page 46: Topic Models for Text Mining

Contextual Text Mining

• Given collections of text with contextual information (meta-data)
• Discover themes/subtopics/topics (interesting word clusters)
• Compute variations of themes over contexts
• Applications:
  – Summarizing search results
  – Federation of text information
  – Opinion analysis
  – Social network analysis
  – Business intelligence
  – …

Page 47: Topic Models for Text Mining

Context Features of Text (Meta-data)

[Figure: a weblog article annotated with its meta-data: author, author's occupation, location, time, communities, source.]

Page 48: Topic Models for Text Mining

Context = Partitioning of Text

[Figure: a corpus partitioned by context: by year (1998, 1999, …, 2005, 2006), e.g., papers written in 1998; by venue (WWW, SIGIR, ACL, KDD, SIGMOD); by author location, e.g., papers written by authors in the US; by content, e.g., papers about the Web.]

Page 49: Topic Models for Text Mining

Themes/Topics

• Uses of themes:
  – Summarize topics/subtopics
  – Navigate in a document space
  – Retrieve documents
  – Segment documents
  – …

[Figure: theme word distributions, e.g., Theme 1 = {government 0.3, response 0.2, …}, Theme 2 = {donate 0.1, relief 0.05, help 0.02, …}, Theme k = {city 0.2, new 0.1, orleans 0.05, …}, and Background = {is 0.05, the 0.04, a 0.03, …}, used to segment a Katrina excerpt: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …"]

Page 50: Topic Models for Text Mining

View of Themes: Context-Specific Versions of Views

[Figure: two themes, each with context-specific versions. Theme 1 (Retrieval Model): retrieve, model, relevance, document, query. In the context "before 1998" (traditional models): vector space, TF-IDF, Okapi, LSI, Rocchio, vector, weighting, feedback, term, retrieval. In the context "after 1998" (language models): language model, smoothing, query, generation, mixture, estimate, EM, pseudo, feedback. Theme 2 (Feedback): feedback, judge, expansion, pseudo, query.]

Page 51: Topic Models for Text Mining

Coverage of Themes: Distribution over Themes

• Theme coverage can depend on context

[Figure: a Katrina blog excerpt ("Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …") and two theme-coverage distributions over Background, Oil Price, Government Response, and Aid and Donation: one for context = Texas, one for context = Louisiana.]

Page 52: Topic Models for Text Mining

General Tasks of Contextual Text Mining

• Theme Extraction: extract the globally salient themes
  – Common information shared over all contexts
• View Comparison: compare a theme from different views
  – Analyze the content variation of themes over contexts
• Coverage Comparison: compare the theme coverage of different contexts
  – Reveal how closely a theme is associated with a context
• Others:
  – Causal analysis
  – Correlation analysis

Page 53: Topic Models for Text Mining

A General Solution: CPLSA

• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of the PLSA model ([Hofmann 99]) that adds:
  – Context variables
  – Modeling of views of topics
  – Modeling of coverage variations of topics
• Process of contextual text mining:
  – Instantiate CPLSA (context, views, coverage)
  – Fit the model to the text data (EM algorithm)
  – Compute probabilistic topic patterns

Page 54: Topic Models for Text Mining

"Generation" Process of CPLSA

[Figure: generating a document with context (Time = July 2005, Location = Texas, Author = xxx, Occupation = Sociologist, Age Group = 45+, …). For each word: choose a view (View 1 = Texas, View 2 = July 2005, View 3 = sociologist); choose a coverage (the Texas coverage, the July 2005 coverage, or a document-specific coverage); choose a theme according to that coverage (government = {government 0.3, response 0.2, …}, donation = {donate 0.1, relief 0.05, help 0.02, …}, New Orleans = {city 0.2, new 0.1, orleans 0.05, …}); and draw a word from the chosen theme (e.g., government, response, donate, aid, help, new, Orleans). Example text: "Criticism of government response to the hurricane primarily consisted of criticism of its response to … The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production … Over seventy countries pledged monetary donations or other assistance. …"]

Page 55: Topic Models for Text Mining

Probabilistic Model

• To generate a document D with context feature set C:
  – Choose a view v_i according to the view distribution p(v_i | D, C)
  – Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C)
  – Choose a theme θ_{l,i} according to the coverage κ_j
  – Generate a word using θ_{l,i}
• The likelihood of the document collection is:

$$\log p(\mathcal{C}) = \sum_{(D,C)\in\mathcal{C}}\sum_{w\in V} c(w,D)\,\log\Big(\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{l=1}^{k} p(v_i\mid D,C)\, p(\kappa_j\mid D,C)\, p(l\mid\kappa_j)\, p(w\mid\theta_{l,i})\Big)$$

Page 56: Topic Models for Text Mining

Parameter Estimation: EM Algorithm

• Interesting patterns:
  – Theme content variation for each view: p(w | θ_{l,i})
  – Theme strength variation for each context: p(l | κ_j)
• A prior from a user can be incorporated using MAP estimation

E-step: for each word w in document D, compute the posterior of the hidden variable z_{D,w} that selects a (view, coverage, theme) triple or the background:

$$p\big(z_{D,w}=(i,j,l)\big) = \frac{p^{(t)}(v_i\mid D,C)\, p^{(t)}(\kappa_j\mid D,C)\, p^{(t)}(l\mid\kappa_j)\, p^{(t)}(w\mid\theta_{l,i})}{\sum_{i'=1}^{n}\sum_{j'=1}^{m}\sum_{l'=1}^{k} p^{(t)}(v_{i'}\mid D,C)\, p^{(t)}(\kappa_{j'}\mid D,C)\, p^{(t)}(l'\mid\kappa_{j'})\, p^{(t)}(w\mid\theta_{l',i'})}$$

$$p\big(z_{D,w}=B\big) = \frac{\lambda_B\, p(w\mid\theta_B)}{\lambda_B\, p(w\mid\theta_B) + (1-\lambda_B)\sum_{i,j,l} p^{(t)}(v_i\mid D,C)\, p^{(t)}(\kappa_j\mid D,C)\, p^{(t)}(l\mid\kappa_j)\, p^{(t)}(w\mid\theta_{l,i})}$$

M-step: re-estimate each distribution by summing and normalizing the fractional counts $c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$:

$$p^{(t+1)}(v_i\mid D,C) \propto \sum_{w\in V}\sum_{j=1}^{m}\sum_{l=1}^{k} c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$$

$$p^{(t+1)}(\kappa_j\mid D,C) \propto \sum_{w\in V}\sum_{i=1}^{n}\sum_{l=1}^{k} c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$$

$$p^{(t+1)}(l\mid\kappa_j) \propto \sum_{(D,C)}\sum_{w\in V}\sum_{i=1}^{n} c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$$

$$p^{(t+1)}(w\mid\theta_{l,i}) \propto \sum_{(D,C)}\sum_{j=1}^{m} c(w,D)\,\big(1-p(z_{D,w}{=}B)\big)\,p\big(z_{D,w}{=}(i,j,l)\big)$$

Page 57: Topic Models for Text Mining

Regularization of the Model

• Why?
  – Generality brings high complexity (inefficiency, multiple local maxima)
  – Real applications have domain constraints/knowledge
• Two useful simplifications:
  – Fixed-Coverage: only analyze the content variation of themes (e.g., author-topic analysis, cross-collection comparative analysis)
  – Fixed-View: only analyze the coverage variation of themes (e.g., spatiotemporal theme analysis)
• In general:
  – Impose priors on model parameters
  – Support the whole spectrum from unsupervised to supervised learning

Page 58: Topic Models for Text Mining

Interpretation of Topics

Statistical (multinomial) topic model to be labeled, e.g.: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

[Pipeline: an NLP chunker and n-gram statistics over the collection (context) produce a candidate label pool (e.g., database system, clustering algorithm, r tree, functional dependency, iceberg cube, concurrency control, index structure, …); candidates are scored for relevance to the topic and re-ranked by coverage and discrimination, yielding a ranked list of labels (e.g., clustering algorithm; distance measure; …).]

Page 59: Topic Models for Text Mining

Relevance: the Zero-Order Score

• Intuition: prefer phrases covering high-probability words of the topic.

[Figure: a latent topic with p(w|θ) concentrated on clustering, dimensional, algorithm, birch, shape, …; "clustering algorithm" (l_1) is a good label, "body shape" (l_2) a bad one.]

$$\text{Score}(l,\theta) = \log\frac{p(l\mid\theta)}{p(l)}$$

Page 60: Topic Models for Text Mining

Relevance: the First-Order Score

• Intuition: prefer phrases with similar context (distribution)

[Figure: the topic's distribution P(w|θ) (clustering, dimension, partition, algorithm, hash, …) is compared with the context distribution P(w|l_1) of the good label "clustering algorithm" and P(w|l_2) of the bad label "hash join", both estimated from C = SIGMOD proceedings; D(θ||l_1) < D(θ||l_2).]

$$\text{Score}(l,\theta) = \sum_{w} p(w\mid\theta)\,\mathrm{PMI}(w, l\mid C)$$

Page 61: Topic Models for Text Mining

Sample Results

• Comparative text mining

• Spatiotemporal pattern mining

• Sentiment summary

• Event impact analysis

• Temporal author-topic analysis

Page 62: Topic Models for Text Mining

Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles)

| | Cluster 1 | Cluster 2 | Cluster 3 |
| Common Theme | united 0.042, nations 0.04, … | killed 0.035, month 0.032, deaths 0.023, … | … |
| Iraq Theme | n 0.03, weapons 0.024, inspections 0.023, … | troops 0.016, hoon 0.015, sanches 0.012, … | … |
| Afghan Theme | northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, … | taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, … | … |

The common theme indicates that "United Nations" is involved in both wars; the collection-specific themes indicate the different roles of "United Nations" in the two wars.

Page 63: Topic Models for Text Mining

Comparing Laptop Reviews

Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive]).

These word distributions can be used to segment text and add hyperlinks between documents

Page 64: Topic Models for Text Mining

Spatiotemporal Patterns in Blog Articles

• Query= “Hurricane Katrina”

• Topics in the results:

• Spatiotemporal patterns

Page 65: Topic Models for Text Mining

Theme Life Cycles for Hurricane Katrina

New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …
Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …

[Figure: life cycles (strength over time) of the two themes.]

Page 66: Topic Models for Text Mining

Theme Snapshots for Hurricane Katrina

Week4: The theme is again strong along the east coast and the Gulf of Mexico

Week3: The theme distributes more uniformly over the states

Week2: The discussion moves towards the north and west

Week5: The theme fades out in most states

Week1: The theme is the strongest along the Gulf of Mexico

Page 67: Topic Models for Text Mining

Theme Life Cycles: KDD

[Figure: global theme life cycles of KDD abstracts: normalized theme strength over time, 1999-2004, for the themes Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business.]

Biology Data: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
Business: marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
Association Rule: rules 0.0142, association 0.0064, support 0.0053, …

Page 68: Topic Models for Text Mining

Theme Evolution Graph: KDD

[Figure: themes evolving over 1999-2004, with edges linking related themes across years. Example themes: decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …; SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …; classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …; web 0.009, classification 0.007, features 0.006, topic 0.005, …; information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …; mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …; topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, ….]

Page 69: Topic Models for Text Mining

Blog Sentiment Summary (query=“Da Vinci Code”)

Facet 1: Movie
  – Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ..."
  – Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role." / "Anybody is interested in it?"
  – Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."

Facet 2: Book
  – Neutral: "I remembered when i first read the book, I finished the book in two days." / "I'm reading 'Da Vinci Code' now."
  – Positive: "Awesome book." / "So still a good book to past time."
  – Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."

Page 70: Topic Models for Text Mining

Results: Sentiment Dynamics

Facet: the book "The Da Vinci Code" (bursts during the movie, Pos > Neg)

Facet: religious beliefs (bursts during the movie, Neg > Pos)

Page 71: Topic Models for Text Mining

Event Impact Analysis: IR Research

Theme: retrieval models, traced through SIGIR papers: term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

[Figure: impact of two events on the theme. Around 1992, the start of the TREC conferences; earlier themes include vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, … and probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …; later ones include xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …. In 1998, the publication of the paper "A language modeling approach to information retrieval"; afterwards: model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, ….]

Page 72: Topic Models for Text Mining

Temporal-Author-Topic Analysis

pattern 0.1107frequent 0.0406frequent-pattern 0.039 sequential 0.0360pattern-growth 0.0203constraint 0.0184push 0.0138…

project 0.0444itemset 0.0433intertransaction 0.0397 support 0.0264associate 0.0258frequent 0.0181closet 0.0176prefixspan 0.0170…

research 0.0551next 0.0308transaction 0.0308 panel 0.0275technical 0.0275article 0.0258revolution 0.0154innovate 0.0154…

close 0.0805pattern 0.0720sequential 0.0462 min_support 0.0353threshold 0.0207top-k 0.0176fp-tree 0.0102…

index 0.0440graph 0.0343web 0.0307gspan 0.0273substructure 0.0201gindex 0.0164bide 0.0115xml 0.0109…

2000time

Author

Author B

Author AGlobal theme: frequent patterns

Jiawei Han

Rakesh Agrawal

Page 73: Topic Models for Text Mining

Modeling Topical Communities (Mei et al. 08)


Community 1: Information Retrieval

Community 2: Data Mining

Community 3: Machine Learning

Page 74: Topic Models for Text Mining

Other Extensions (LDA Extensions)

• Many extensions of LDA, mostly done by David Blei, Andrew McCallum, and their co-authors
• Some examples:
  – Hierarchical topic models [Blei et al. 03]
  – Modeling annotated data [Blei & Jordan 03]
  – Dynamic topic models [Blei & Lafferty 06]
  – Pachinko allocation [Li & McCallum 06]
• Also some context-specific extensions of PLSA, e.g., the author-topic model [Steyvers et al. 04]

Page 75: Topic Models for Text Mining

Future Research Directions

• Topic models for text mining
  – Evaluation of topic models
  – Improve the efficiency of estimation and inference
  – Incorporate linguistic knowledge
  – Applications in new domains and for new tasks
• Text mining in general
  – Combination of NLP-style and DM-style mining algorithms
  – Integrated mining of text (unstructured) and structured data (e.g., Text OLAP)
  – Interactive mining:
    • Incorporate user constraints and support iterative mining
    • Design and implement mining languages

Page 76: Topic Models for Text Mining

What You Should Know

• How PLSA works
• How the EM algorithm works in general
• Contextual PLSA can be used to perform many quite different interesting text mining tasks

Page 77: Topic Models for Text Mining

Roadmap

• This lecture: Topic models for text mining

• Next lecture: Next generation search engines