
Topic Modeling, Threshold SVD

October 27, 2014


Simple Setting: Signal and Noise

A is an n × n matrix and S ⊆ [n], |S| = k.
The entries A_ij are all independent random variables.
For i, j ∈ S, Pr(A_ij ≥ µ) ≥ 1/2. Signal = µ.
For all other i, j, A_ij is N(0, σ²). Noise = σ.
Given A, µ, σ, find S. [Recall Planted Clique.]

Schematically, A has a k × k block (the rows and columns indexed by S) whose entries exceed µ with probability at least 1/2, while every other entry is pure N(0, σ²) noise. (A sketch of such an instance follows below.)
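A minimal generator for such an instance (Python/NumPy; the concrete values of n, k, µ, σ are illustrative only and are not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the talk).
n, k = 2000, 60          # matrix size and size of the planted set S
mu, sigma = 1.0, 2.0     # signal level and noise standard deviation

S = rng.choice(n, size=k, replace=False)      # the hidden set S, |S| = k

# Noise everywhere: independent N(0, sigma^2) entries.
A = rng.normal(0.0, sigma, size=(n, n))

# Signal on the S x S block: mu plus symmetric noise, so each planted entry
# exceeds mu with probability exactly 1/2 (the weakest case the model allows).
A[np.ix_(S, S)] = mu + rng.normal(0.0, sigma, size=(k, k))
```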


Condition on Signal-to-Noise ratio

SNR = µ/σ.

The standard Planted Clique (PC) problem is like having SNR = O(1).

Known results for PC: if SNR ≥ √n / k, then we can find S.

We have not been able to beat this lower bound requirement on SNR for PC. In fact, Feldman, Grigorescu, Reyzin, Vempala, Xiao have shown: it cannot be beaten by statistical learning algorithms.


Exponential Advantage in SNR by Thresholding

Brave new step: threshold the entries of A at µ to get a 0-1 matrix B.

E(B) is (approximately) a block matrix: on the k × k block indexed by S the entries are at least 1/2, and everywhere else they equal roughly exp(−µ²/2σ²), the chance that an N(0, σ²) entry exceeds µ.

Subtract exp(−µ²/2σ²) from every entry. Then the signal block of E(B) has norm ||·|| ≥ k/4, while the remaining random matrix B − E(B) has norm ||·|| ≤ √n exp(−cµ²/σ²).

So SVD finds S provided exp(c(µ/σ)²) > √n / k.
Cf: ordinary SVD succeeds only if µ/σ > √n / k. (A code sketch of the thresholding step follows below.)
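A hedged sketch of the thresholding step, continuing the NumPy setup above. Reading S off the top singular vector is one natural recovery rule, and whether it succeeds for a given instance depends on n, k, and µ/σ exactly as in the bound above:

```python
import numpy as np

def recover_set_by_threshold_svd(A, mu, sigma, k):
    """Threshold A at mu, subtract the Gaussian tail probability, and read a
    candidate set S off the top singular vector of the result."""
    B = (A >= mu).astype(float)              # the 0-1 matrix B
    p = np.exp(-mu**2 / (2 * sigma**2))      # rough Pr(N(0, sigma^2) >= mu)
    C = B - p                                # subtract the off-block mean
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    scores = np.abs(U[:, 0])                 # coordinates of the top left singular vector
    return np.sort(np.argsort(scores)[-k:])  # indices of the k largest coordinates
```

For instance, `S_hat = recover_set_by_threshold_svd(A, mu, sigma, k)` on the matrix generated earlier.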


Thresholding: Second Plus

Data points {A_1, A_2, ..., A_j, ...} in R^d, with d features.
The data points lie in 2 "SOFT" clusters: data point j belongs with weight w_j to cluster 1 and with weight 1 − w_j to cluster 2. (More generally, k clusters.)
Each cluster has some dominant features, and each data point has a dominant cluster.
A_ij ≥ µ if feature i is a dominant feature of the dominant cluster of data point j.
A_ij ≤ σ otherwise.
If the variance above µ is larger than the gap between µ and σ, a 2-clustering criterion (like 2-means) may split the high-weight cluster instead of separating it from the others. (A toy illustration of this failure mode follows below.)
Two differences from mixture models: soft membership, and high variance in the dominant features.
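A toy illustration of the variance issue (Python/NumPy + scikit-learn's KMeans; the dimensions, the Beta weights, and the Pareto heavy tail above µ are illustrative assumptions, not taken from the talk). It compares 2-means on the raw data against 2-means on the data thresholded at µ; with heavy-tailed values above µ, the raw version may split the large cluster rather than separate the two clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
d, s = 100, 600               # features, data points (illustrative)
mu, sigma = 1.0, 0.1

# Soft weights: point j belongs with weight w_j to cluster 1 and 1 - w_j to cluster 2.
w = rng.beta(5, 2, size=s)
dom = (w < 0.5).astype(int)   # 0 if cluster 1 is dominant for point j, else 1

A = rng.uniform(0, sigma, size=(d, s))                 # non-dominant features stay <= sigma
feats = [np.arange(0, d // 2), np.arange(d // 2, d)]   # dominant features of each cluster
for j in range(s):
    # Dominant features exceed mu, with heavy-tailed variance above mu.
    rows = feats[dom[j]]
    A[rows, j] = mu + rng.pareto(2.0, size=rows.size)

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(A.T)
thr = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict((A >= mu).T.astype(float))

# Agreement with the true dominant cluster, up to swapping the two labels.
agree = lambda lab: max(np.mean(lab == dom), np.mean(lab != dom))
print(f"raw 2-means agreement: {agree(raw):.2f}  thresholded: {agree(thr):.2f}")
```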


Topic Modeling: The Problem

Joint work with T. Bansal and C. Bhattacharyya.
d features: the words in the dictionary. A document is a d-dimensional (column) vector.
k topics. Topic l is a d-vector (the probabilities of the words in that topic).
To generate doc j, generate a random convex combination of the topic vectors.
Then generate the words of doc j in i.i.d. trials, each drawn from the multinomial whose probabilities are that convex combination. [Picture on board: SPORTS, POLITICS, WEATHER.]
The Topic Modeling Problem: given only A, find an approximation to all topic vectors so that the l1 error in each topic vector is at most ε. The l1 error is crucial.
Generally NP-hard. (A generative sketch of the model follows below.)
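A minimal generative sketch of this model (NumPy; the Dirichlet priors used to draw the topic vectors and the per-document topic weights, as well as the sizes d, k, s, m, are illustrative assumptions and not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, s, m = 5000, 3, 1000, 400   # dictionary size, topics, docs, words per doc (illustrative)

# M: d x k matrix whose columns are the topic vectors (word probabilities per topic).
M = rng.dirichlet(0.05 * np.ones(d), size=k).T

# W: k x s matrix of per-document topic weights (a random convex combination per doc).
W = rng.dirichlet(np.ones(k), size=s).T

# Document j: m i.i.d. word draws from the multinomial with probabilities M @ W[:, j].
# Column j of A stores word frequencies, so E[A_j | W] = M @ W[:, j].
A = np.zeros((d, s))
for j in range(s):
    p = M @ W[:, j]
    A[:, j] = rng.multinomial(m, p / p.sum()) / m
```

Column j of A then has expectation M @ W[:, j], a convex combination of the topic vectors, matching the model above.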


Topic Modeling is Soft Clustering

Topic vectors ≡ cluster centers.
Each data point (doc) belongs to a weighted combination of clusters; it is generated from a distribution (which happens to be multinomial) whose expectation is that weighted combination.
Even if we manage to solve the clustering problem somehow, it is not true that the cluster centers are averages of documents. Big distinction from learning mixtures, which is hard clustering.


Geometry

Topic Modeling = Soft Clustering

[Figure: the topic vectors µ_1, µ_2, µ_3 are the corners of a simplex. Each X is a weighted combination of the µ_l; the o's surrounding an X are the words of one document, i.i.d. choices with mean X. µ_l = center of cluster l.]

Given the docs (the means of the o's), find the µ_l.
It helps to find nearly pure docs (an X near a corner).


Prior Results and Assumptions

Under the Pure Topics and Primary Words (1 − ε of the words are primary) assumptions, SVD provably solves Topic Modeling. Papadimitriou, Raghavan, Tamaki, Vempala.
Long-standing question/belief: SVD cannot handle the non-pure-topic case.
LDA: the most popular non-pure model. Blei, Ng, Jordan. Multiple topics per doc are allowed; topic weights (within a doc) are (essentially) uncorrelated. Correlations: Blei, Lafferty.
Anandkumar, Foster, Hsu, Kakade, Liu do Topic Modeling under LDA, to l2 error, using tensor methods. Parameters.
Arora, Ge, Moitra assume Anchor Words + other parameters: each topic has one word that (a) occurs only in that topic and (b) has high frequency. Provable algorithm that does Topic Modeling with l1 error per word. First provable algorithm.
Our aim: intuitive, empirically verified assumptions, and a natural algorithm.


Our Assumptions

Our assumptions are intuitive to Topic Modeling, not numerical parameters like a condition number.
Catchwords: each topic has a set of words such that (a) each occurs more frequently in that topic than in the others and (b) together they have high frequency. (A small helper that computes such sets from a known topic matrix is sketched below.)
Dominant Topics: each document has a dominant topic whose weight (in that doc) is at least some α, whereas non-dominant topics have weight at most some β.
Nearly Pure Documents: each topic has a (small) fraction of documents that are 1 − δ pure for that topic.
No Local Min: for every word, the plot of the number of documents versus the number of occurrences of the word (conditioned on the dominant topic) has no local minimum. [Zipf's law or unimodal.]
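A minimal sketch of what the Catchwords assumption asks for, phrased as a check on a known d × k topic matrix M (NumPy; the frequency ratio rho and the mass threshold min_mass are illustrative knobs, not constants from the talk):

```python
import numpy as np

def catchword_sets(M, rho=2.0, min_mass=0.2):
    """For a d x k topic matrix M, return for each topic l the indices of words
    whose probability in topic l is at least rho times their probability in any
    other topic, together with a flag saying whether those words jointly carry
    probability mass at least min_mass in topic l."""
    d, k = M.shape
    out = []
    for l in range(k):
        others = np.delete(M, l, axis=1).max(axis=1)   # each word's best other topic
        words = np.where(M[:, l] >= rho * others)[0]    # condition (a), with ratio rho
        out.append((words, M[words, l].sum() >= min_mass))  # condition (b)
    return out
```

Run on the topic matrix M from the earlier generative sketch, this reports, per topic, the candidate catchword indices and whether they carry enough total probability mass.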


The Algorithm - Threshold SVD (TSVD)

s = number of docs. For this talk, assume the probability that each topic is dominant is 1/k.
Threshold: compute a threshold ζ_i for each word i using the first "gap": the maximum ζ such that A_ij ≥ ζ for at least s/(2k) values of j, and A_ij = ζ for at most εs values of j.
SVD: run SVD on the thresholded matrix to get starting centers for the k-means algorithm.
k-means: run k-means. Will show: this identifies the dominant topic of each document.
Identify Catchwords: find the set of high-frequency words in each cluster. Will show: this is the set of catchwords for that topic.
Identify Pure Docs: find the documents with the highest total number of occurrences of the catchword set. Will show: these are nearly pure docs, and their average ≈ the topic vector. (A compact sketch of the whole pipeline follows below.)
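A compact, end-to-end sketch of this pipeline (NumPy + scikit-learn's KMeans). The threshold, catchword, and pure-doc selection rules here are simplified stand-ins for the precise rules above (for example, the εs tie condition in the gap rule is dropped), so this illustrates the structure of TSVD rather than the exact algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def tsvd_topic_model(A, k, pure_frac=0.05):
    """A: d x s word-frequency matrix (columns are documents).
    Returns a d x k matrix of estimated topic vectors. Simplified TSVD sketch."""
    d, s = A.shape

    # 1. Threshold each word (row): keep entries at or above roughly the value
    #    exceeded by the top s/(2k) documents.
    zeta = np.quantile(A, 1.0 - 1.0 / (2 * k), axis=1)[:, None]
    B = (A >= np.maximum(zeta, 1e-12)).astype(float)

    # 2. SVD of the thresholded matrix; project documents onto the top-k left
    #    singular vectors to get a k-dimensional embedding for k-means.
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    proj = (U[:, :k].T @ B).T                       # s x k

    # 3. k-means on the projections; clusters ~ dominant topics.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(proj)

    topics = np.zeros((d, k))
    for l in range(k):
        docs_l = np.where(labels == l)[0]

        # 4. Catchwords (simplified): words clearly more frequent in cluster l
        #    than in any other cluster.
        in_mean = A[:, docs_l].mean(axis=1)
        out_mean = np.max([A[:, labels == m].mean(axis=1)
                           for m in range(k) if m != l], axis=0)
        catch = in_mean > 2 * out_mean

        # 5. Pure docs: the documents of cluster l with the largest total
        #    catchword frequency; their average is the topic estimate.
        score = A[catch][:, docs_l].sum(axis=0)
        n_pure = max(1, int(pure_frac * len(docs_l)))
        pure = docs_l[np.argsort(score)[-n_pure:]]
        topics[:, l] = A[:, pure].mean(axis=1)
    return topics
```

Applied to a matrix A generated as in the earlier topic-model sketch, the returned columns should approximate the columns of M up to a permutation of topics, provided the separation assumptions above hold.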


[Figure: the same simplex picture as before, with corners µ_1, µ_2, µ_3 and each document's words (o's) scattered around its mean (X). Threshold + SVD + k-means groups the documents by their dominant topic.]


Properties of Thresholding

Using the no-local-min assumption, show: no threshold splits any dominant topic in the "middle". So the thresholded matrix is a "block" matrix on the catchwords; for non-catchwords, however, it can be high on several topics.
[Picture on board: a block matrix.]
Done? No. We also need inter-cluster separation ≥ intra-cluster spread (the variance inside a cluster).
Catchwords provide sufficient inter-cluster separation.
The inside-cluster variance is bounded with machinery from Random Matrix Theory. Beware: only the columns are independent; the rows are not.
Finally, appeal to a result on k-means (Kumar, Kannan): if the inter-cluster separation is at least the inside-cluster directional standard deviation, then SVD followed by k-means clusters correctly.


Getting Topic Vectors

[Picture of the simplex with the columns of M as extreme points and, for each dominant topic, a cluster of docs.]
Taking the average of all the docs in T_l is no good: documents with dominant topic l still contain other topics, so their average is pulled into the interior of the simplex, away from µ_l. This is why the algorithm averages only the nearly pure docs found via catchwords.


Experimental Results
