Transcript
Page 1:

Aravindan Vijayaraghavan

CMU Northwestern University

Smoothed Analysis of Tensor Decompositions and Learning

based on joint works with

Aditya Bhaskara (Google Research)

Moses Charikar (Princeton)

Ankur Moitra (MIT)

Page 2:

Factor analysis

[Figure: M is a d × d matrix]

• Sum of "few" rank-one matrices (k < d)

Assumption: the matrix has a "simple explanation"

$M = a_1 \otimes b_1 + a_2 \otimes b_2 + \dots + a_k \otimes b_k$

Explain using few unobserved variables

Qn [Spearman]. Can we find the "desired" explanation?

Page 3:

The rotation problem

Any suitable "rotation" of the vectors gives a different decomposition:

$A B^T = A Q\, Q^T B^T = (AQ)(BQ)^T$ for any orthogonal $Q$

Often difficult to find the "desired" decomposition…

$M = a_1 \otimes b_1 + a_2 \otimes b_2 + \dots + a_k \otimes b_k$
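A minimal numpy sketch (added here for illustration; not from the original slides) of the rotation problem: any orthogonal $Q$ turns one factorization of $M$ into another.

```python
import numpy as np

d, k = 10, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((d, k))
B = rng.standard_normal((d, k))
M = A @ B.T                                  # M = sum_i a_i ⊗ b_i

# Any orthogonal Q gives a different factor pair with the same product:
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
A_rot, B_rot = A @ Q, B @ Q                  # "rotated" factors
print(np.allclose(M, A_rot @ B_rot.T))       # True: decomposition not unique
```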

Page 4:

Multi-dimensional arrays

Tensors

[Figure: a d × d × … × d array]

• A $t$-dimensional array is a tensor of order $t$ (a $t$-tensor)

• Represent higher-order correlations, partial derivatives, etc.

• Collection of matrix (or smaller tensor) slices

Page 5:

3-way factor analysis

Tensor can be written as a sum of few rank-one tensors:

$T = \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$

Rank(T) = smallest $k$ such that $T$ can be written as a sum of $k$ rank-1 tensors

3-Tensors:

[Figure: $T$ as a sum of rank-one tensors with factors $a_1,\dots,a_k$, $b_1,\dots,b_k$, $c_1,\dots,c_k$]

• Rank of a 3-tensor can be $\gg d$ (up to $d^2$); rank of a $t$-tensor up to $d^{t-1}$

Thm [Harshman'70, Kruskal'77]. Rank-$k$ decompositions for 3-tensors (and higher orders) are unique under mild conditions.

3-way decompositions overcome the rotation problem!
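A small numpy sketch (my own illustration, with arbitrary random factors) of building a rank-$k$ 3-tensor $T = \sum_i a_i \otimes b_i \otimes c_i$:

```python
import numpy as np

d, k = 8, 5
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))

# T = sum_i a_i ⊗ b_i ⊗ c_i : a d x d x d array of rank at most k
T = np.einsum('ir,jr,lr->ijl', A, B, C)

# Same tensor, accumulated one rank-one term at a time
T_check = sum(np.multiply.outer(np.outer(A[:, i], B[:, i]), C[:, i]) for i in range(k))
print(np.allclose(T, T_check))   # True
```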

Page 6:

Learning Probabilistic Models: Parameter Estimation

Learning goal: Can the parameters of the model be learned from polynomially many samples generated by the model?

• EM algorithm – used in practice, but converges to local optima

HMMs for speech recognition

Mixture of Gaussians for clustering points

Question: Can given data be "explained" by a simple probabilistic model?

• Algorithms have exponential time & sample complexity

Multiview models

Page 7:

Parameters: • Mixing weights $w_i$ • Gaussian $i$: mean $\mu_i$, covariance $\Sigma_i$ (diagonal)

Learning problem: Given many sample points, find the parameters $\{w_i, \mu_i, \Sigma_i\}$

Probabilistic model for clustering in $d$ dimensions

Mixtures of (axis-aligned) Gaussians

• Algorithms use time and samples exponential in $k$ [FOS'06, MV'10]

• Lower bound of $\exp(k)$ samples in the worst case [MV'10]

[Figure: sample points $x \in \mathbb{R}^d$ drawn from Gaussians with means $\mu_i$]

Aim: guarantees in realistic settings

Page 8:

Method of Moments and Tensor decompositions

Step 1. Compute a tensor whose decomposition encodes the model parameters.
Step 2. Find the decomposition (and hence the parameters). (A toy numerical sketch appears at the end of this slide.)

The $d \times d \times d$ tensor with entries $E[x_i x_j x_l]$ satisfies

$T = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i \otimes \mu_i$

• Uniqueness ⟹ recover the parameters  • Algorithm for decomposition ⟹ efficient learning

[Chang] [Allman, Matias, Rhodes] [Anandkumar, Ge, Hsu, Kakade, Telgarsky]
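A toy numerical sketch of step 1 (my illustration, under a deliberately simplified noiseless mixture where each sample equals one of the means; real models need the appropriate moment corrections):

```python
import numpy as np

d, k, n = 6, 3, 200_000
rng = np.random.default_rng(2)
mu = rng.standard_normal((k, d))             # unknown means
w = np.array([0.5, 0.3, 0.2])                # mixing weights

z = rng.choice(k, size=n, p=w)               # latent cluster labels
X = mu[z]                                    # samples (noiseless toy model)

T_hat = np.einsum('ni,nj,nl->ijl', X, X, X) / n        # empirical E[x_i x_j x_l]
T_pop = np.einsum('r,ri,rj,rl->ijl', w, mu, mu, mu)    # sum_i w_i mu_i ⊗ mu_i ⊗ mu_i
print(np.max(np.abs(T_hat - T_pop)))         # small sampling error, shrinks like 1/sqrt(n)
```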

Page 9:

What is known about Tensor Decompositions ?

Thm [Kruskal'77]. Rank-$k$ decompositions for 3-tensors are unique (non-algorithmic) when the Kruskal ranks satisfy $k_A + k_B + k_C \ge 2k + 2$!

Thm [Jennrich via Harshman'70]. Can find the unique rank-$k$ decomposition of a 3-tensor when $A, B$ have rank $k$ and no two columns of $C$ are parallel!

• Uniqueness proof is algorithmic!  • Called the full-rank case; no symmetry or orthogonality needed.  • Rediscovered in [Leurgans et al. 1993], [Chang 1996]

Thm [De Lathauwer, Castaing, Cardoso'07]. Algorithm for 4-tensors of rank $k = O(d^2)$, generically.

Thm [Chiantini, Ottaviani'12]. Uniqueness (non-algorithmic) for 3-tensors of rank up to $\Omega(d^2)$, generically.

Page 10:

Uniqueness and algorithms resilient to noise of 1/poly(d, k)?

Robustness to Errors

Beware: Sampling error

Empirical estimate of the moment tensor

With $\mathrm{poly}(d, k)$ samples, the error is only $1/\mathrm{poly}(d, k)$

Thm [BCV'14]. Robust version of Kruskal's uniqueness theorem (non-algorithmic), tolerating a small amount of error.

Thm. Jennrich's polynomial-time algorithm for tensor decompositions is robust up to $1/\mathrm{poly}(d, k)$ error.

Open Problem: Robust version of the generic results [De Lathauwer et al.]?

Page 11:

Algorithms for Tensor Decompositions

Polynomial-time algorithms when rank $k \le d$ [Jennrich]

NP-hard when rank $k > d$ in the worst case [Håstad, Hillar-Lim]

Overcome worst-case intractability using Smoothed Analysis

Polynomial-time algorithms* for robust tensor decompositions for rank k >> d (rank can be any polynomial in the dimension)

*Algorithms recover the decomposition up to inverse-polynomial error.

This talk

Page 12:

Efficient learning when no. of clusters/topics k ≤ dimension d [Chang 96, Mossel-Roch 06, Anandkumar et al. 09-14]
• Learning phylogenetic trees [Chang, MR]
• Axis-aligned Gaussians [HK]
• Parse trees [ACHKSZ, BHD, B, SC, PSX, LIPPX]
• HMMs [AHK, DKZ, SBSGS]
• Single topic models [AHK], LDA [AFHKL]
• ICA [GVX] …
• Overlapping communities [AGHK] …

Implications for Learning

"Full-rank" or "non-degenerate" setting

Known only in restricted cases: no. of clusters ≤ no. of dims

Page 13:

Overcomplete Learning Setting

Number of clusters/topics/states >> dimension

Computer Vision

Previous algorithms do not work when k > d!

Speech

Need polynomial-time decomposition of tensors of rank k >> d

Page 14:

Smoothed Analysis

[Spielman & Teng 2000]

• Small random perturbation of the input makes instances easy

• Best polytime guarantees in the absence of any worst-case guarantees

Smoothed analysis guarantees:

• Worst instances are isolated

Simplex algorithm solves LPs efficiently (explains practice).

Page 15:

Today's talk: Smoothed Analysis for Learning [BCMV STOC'14]

• First smoothed analysis treatment for unsupervised learning

Thm. Polynomial-time algorithms for learning axis-aligned Gaussians, multiview models, etc., even in "overcomplete settings".

Mixture of Gaussians

Thm. Polynomial time algorithms for tensor decompositions in smoothed analysis setting.

based on

Multiview models

Page 16:

Smoothed Analysis for Learning

Learning setting (e.g., Mixtures of Gaussians)

Worst-case instances: Means in pathological configurations

Means are not in adversarial configurations in the real world!

What if the means are perturbed slightly? $\mu_i \to \tilde{\mu}_i$

Generally, parameters of the model are perturbed slightly.

Page 17:

Smoothed Analysis for Tensor Decompositions

1. Adversary chooses a tensor $T$ (i.e., its factors).

2. $\tilde{T}$ is a random $\rho$-perturbation of $T$: each factor vector has an independent (Gaussian) random vector of length $\rho$ added to it.

3. Input: $\tilde{T}$. Analyse the algorithm on $\tilde{T}$. (A small code sketch of this setup appears at the end of this slide.)

๐‘‡ ๐‘‘ร—๐‘‘ร—โ€ฆร—๐‘‘=โˆ‘๐‘–=1

๐‘˜

๐‘Ž๐‘–(1)โจ‚ ๐‘Ž๐‘–

(2)โจ‚โ€ฆโจ‚๐‘Ž๐‘–( ๐‘ก )

T ๐‘Ž1(1) ๐‘Ž2(1) ๐‘Ž๐‘˜(1)

๐‘Ž1(3 ) ๐‘Ž2(3 ) ๐‘Ž๐‘˜(3 )

๐‘Ž 1(2 ) ๐‘Ž 2(2 ) ๐‘Ž ๐‘˜(2 )

~๐‘‡=โˆ‘๐‘–=1

๐‘˜~๐‘Ž๐‘–

(1 )โจ‚~๐‘Ž๐‘–(2 )โจ‚โ€ฆโจ‚~๐‘Ž๐‘–

(๐‘ก )+noise

Factors of the Decomposition are perturbed
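A small code sketch of this smoothed-analysis input model (my illustration; the exact scaling of the $\rho$-perturbation is an assumption, and only the order-3 case is built for brevity):

```python
import numpy as np

def smoothed_instance(factors, rho, rng):
    """Return rho-perturbed copies of the adversarial factor matrices:
    each column gets an independent Gaussian bump of typical norm rho
    (coordinates ~ N(0, rho^2 / d))."""
    d = factors[0].shape[0]
    return [F + rng.standard_normal(F.shape) * (rho / np.sqrt(d)) for F in factors]

def build_3tensor(factors):
    """T = sum_i a_i^(1) ⊗ a_i^(2) ⊗ a_i^(3)."""
    A1, A2, A3 = factors
    return np.einsum('ir,jr,lr->ijl', A1, A2, A3)

rng = np.random.default_rng(3)
d, k, rho = 10, 15, 0.1                            # overcomplete: k > d
adversarial = [np.ones((d, k)) for _ in range(3)]  # adversary's (degenerate) choice
T_tilde = build_3tensor(smoothed_instance(adversarial, rho, rng))
```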

Page 18:

Algorithmic Guarantees

Thm [BCMV'14]. Polynomial-time algorithm for decomposing a $t$-tensor ($d$-dim in each mode) in the smoothed analysis model when rank $k \le d^{\lfloor (t-1)/2 \rfloor}$, w.h.p.

Running time, sample complexity = $\mathrm{poly}_t(d, k, 1/\rho)$.

Guarantees for order-$t$ tensors in $d$ dims (each); rank of the $t$-tensor = $k$ (number of clusters)

[Table: previous algorithms vs. algorithms in the smoothed case]

Corollary. Polytime algorithms (smoothed analysis) for mixtures of axis-aligned Gaussians, multiview models, etc., even in the overcomplete setting, i.e., no. of clusters $k \le d^C$ for any constant $C$, w.h.p.

Page 19:

Interpreting Smoothed Analysis Guarantees

Time, sample complexity = $\mathrm{poly}_t(d, k, 1/\rho)$.

Works with probability $1 - \exp(-\mathrm{poly}(d))$

• Exponentially small failure probability (for constant order $t$)

Smooth interpolation between worst case and average case:

• $\rho \to 0$: worst case

• $\rho$ large: almost random vectors

• Can handle $\rho$ inverse-polynomial in $d$

Page 20:

Algorithm Details

Page 21:

Algorithm Outline

1. An algorithm for 3-tensors in the "full-rank setting" ($k \le d$):

[Jennrich 70] A simple (robust) algorithm for a 3-tensor T.

2. For higher-order tensors, reduce to this using "tensoring / flattening" (see the sketch at the end of this slide).

• Helps handle the overcomplete setting

Recall: $T = \sum_{i=1}^{k} A_i \otimes B_i \otimes C_i$, where $A_i$ is the $i$-th column of the $d \times k$ matrix $A$ (similarly $B_i$, $C_i$).

Aim: Recover A, B, C

• Any algorithm for full-rank (non-orthogonal) tensors suffices
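A minimal sketch (my illustration) of the flattening in step 2: reshaping an order-6 tensor over $\mathbb{R}^d$ into a 3-tensor over $\mathbb{R}^{d^2}$, so that each rank-one term stays rank-one with factors like $a \otimes b$.

```python
import numpy as np

def flatten_to_3tensor(T6):
    """View an order-6 tensor (d^6 entries) as a 3-tensor with sides d^2,
    grouping modes (1,2), (3,4), (5,6)."""
    d = T6.shape[0]
    return T6.reshape(d * d, d * d, d * d)

# Check on a single rank-one term: a1⊗a2⊗a3⊗a4⊗a5⊗a6 becomes
# (a1⊗a2) ⊗ (a3⊗a4) ⊗ (a5⊗a6), with factors in R^{d^2}.
d = 4
rng = np.random.default_rng(8)
vs = [rng.standard_normal(d) for _ in range(6)]
T6 = np.einsum('a,b,c,e,f,g->abcefg', *vs)
u, v, w = (np.outer(vs[0], vs[1]).ravel(),
           np.outer(vs[2], vs[3]).ravel(),
           np.outer(vs[4], vs[5]).ravel())
print(np.allclose(flatten_to_3tensor(T6), np.einsum('i,j,l->ijl', u, v, w)))  # True
```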

Page 22:

Blast from the Past

Recall: $T \approx_{\epsilon} \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$

Aim: Recover A, B, C

Qn. Is this algorithm robust to errors? Yes! Needs perturbation bounds for eigenvectors.

[Stewart-Sun]

Thm. Efficiently decompose $T$ and recover $A, B, C$ up to error $\mathrm{poly}(\epsilon)$ when 1) $A, B$ have min singular value $\ge 1/\mathrm{poly}(d)$, and 2) $C$ doesn't have parallel columns.

[Jennrich via Harshman 70] Algorithm for 3-tensors:

• A, B are full rank (rank = $k$)  • C has Kruskal rank $\ge 2$

• Reduces to matrix eigen-decompositions

Page 23:

Slices of tensors

Consider a rank-1 tensor $a \otimes b \otimes c$: its $s$-th slice is $c(s) \cdot (a \otimes b)$.

All slices have a common diagonalization!

$T = \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$; its $s$-th slice is $\sum_{i=1}^{k} c_i(s)\,(a_i \otimes b_i)$

Random combination of slices: $\sum_{i=1}^{k} \langle c_i, w \rangle\,(a_i \otimes b_i)$

Page 24:

Two matrices with common diagonalization

Simultaneous diagonalization

Given $M_w = A\,D_w\,B^T$ and $M_{w'} = A\,D_{w'}\,B^T$ (two random slice combinations, with $D_w, D_{w'}$ diagonal):

If 1) $A, B$ are invertible, and

2) $D_w D_{w'}^{-1}$ has unequal non-zero diagonal entries,

we can find $A, B$ by matrix diagonalization!

Page 25:

Algorithm:
1. Take a random combination of the slices along direction $w$, giving $M_w = A\,D_w\,B^T$.
2. Take a random combination along another direction $w'$, giving $M_{w'} = A\,D_{w'}\,B^T$.
3. Find the eigen-decomposition of $M_w M_{w'}^{-1}$ to get $A$. Similarly recover $B$, $C$. (A minimal code sketch appears at the end of this slide.)

Decomposition algorithm [Jennrich]

$T \approx_{\epsilon} \sum_{i=1}^{k} a_i \otimes b_i \otimes c_i$

Thm. Efficiently decompose $T$ and recover $A, B, C$ up to error $\mathrm{poly}(\epsilon)$ (in Frobenius norm) when

1) $A, B$ are full rank, i.e., min singular value $\ge 1/\mathrm{poly}(d)$;  2) $C$ doesn't have parallel columns (in a robust sense).
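A minimal sketch (my illustration) of the simultaneous-diagonalization idea behind Jennrich's algorithm, run on an exact noiseless full-rank instance; the robust version analyzed in the talk needs more care.

```python
import numpy as np

def jennrich_factors(T, k, rng):
    """Recover (up to scaling/permutation) the columns of A from
    T = sum_i a_i ⊗ b_i ⊗ c_i, assuming A, B have full column rank and
    C has no parallel columns."""
    d = T.shape[0]
    w1, w2 = rng.standard_normal(d), rng.standard_normal(d)
    Mw1 = np.einsum('ijl,l->ij', T, w1)          # A diag(C^T w1) B^T
    Mw2 = np.einsum('ijl,l->ij', T, w2)          # A diag(C^T w2) B^T
    # Columns of A are eigenvectors of Mw1 pinv(Mw2), with eigenvalues
    # <c_i, w1> / <c_i, w2> (distinct w.h.p.); keep the k largest.
    vals, vecs = np.linalg.eig(Mw1 @ np.linalg.pinv(Mw2))
    top = np.argsort(-np.abs(vals))[:k]
    return np.real(vecs[:, top])

d, k = 12, 7
rng = np.random.default_rng(6)
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
T = np.einsum('ir,jr,lr->ijl', A, B, C)
A_hat = jennrich_factors(T, k, rng)

# Each recovered column should be parallel to some true column of A.
cos = np.abs((A / np.linalg.norm(A, axis=0)).T @ (A_hat / np.linalg.norm(A_hat, axis=0)))
print(np.round(cos.max(axis=1), 6))              # all ≈ 1
```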

Page 26:

Overcomplete Case

A dive into the techniques

Page 27:

Mapping to Higher Dimensions

How do we handle the case rank $k > d$? (or even vectors with "many" linear dependencies?)

$f$ maps parameter/factor vectors to higher dimensions such that:

1. The tensor corresponding to the mapped factors $f(a_i)$ is computable using the data

2. $f(a_1), \dots, f(a_k)$ are linearly independent (min singular value $\ge 1/\mathrm{poly}$)

[Figure: a map $f: \mathbb{R}^d \to \mathbb{R}^{d^2}$ sending the factors $a_1, a_2, \dots, a_k$ to $f(a_1), f(a_2), \dots, f(a_k)$]

• Reminiscent of kernels in SVMs

Factor matrix A

Page 28:

A mapping to higher dimensions

Qn: Are these vectors linearly independent? Is the "essential dimension" $\approx d^2$?

Outer product / tensor products: map $a \mapsto a \otimes a \in \mathbb{R}^{d^2}$ (and higher-order versions)

Basic intuition:

1. $a \otimes a$ has $d^2$ dimensions.

2. For non-parallel unit vectors, distance increases under the map.

• The tensor with the mapped vectors as factors is computable from the data (a flattening of a higher-order moment tensor).

Page 29:

Bad cases

Beyond Worst-case Analysis

Can we hope for "dimension" to multiply "typically"?

Bad example where $z_i = u_i \otimes v_i$:
• every $d$ vectors of U and V are linearly independent
• but some $2d$ vectors of Z are linearly dependent!

[Figure: $U, V$ are $d \times k$ matrices with columns $u_i, v_i$; $Z$ has columns $z_i = u_i \otimes v_i$]

Lem. Dimension (K-rank) under tensoring is additive.

U, V have rank = $d$. Vectors $z_i = u_i \otimes v_i$.

Strategy does not work in the worst-case

But, bad examples are pathological and hard to construct!

Page 30:

Product vectors & linear structure

• Easy to compute a tensor with the product vectors $\tilde{a}_i \otimes \tilde{a}_i$ as factors / parameters ("flattening" of the $3t$-order moment tensor)

• The new factor matrix is full rank, using smoothed analysis.

Theorem. For any $d \times k$ matrix $A$ (with $k$ up to a constant fraction of $d^2$), the matrix with columns $\tilde{a}_i \otimes \tilde{a}_i$ has minimum singular value $\ge 1/\mathrm{poly}(d, 1/\rho)$, with probability $1 - \exp(-\mathrm{poly}(d))$.

Map: $a_i \mapsto \tilde{a}_i \otimes \tilde{a}_i$, where $\tilde{a}_i$ is a random $\rho$-perturbation of $a_i$.
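A small numerical illustration (mine; the parameter choices are arbitrary) of this theorem: for a worst-case matrix the product vectors are degenerate, but after a $\rho$-perturbation of the columns the flattened factor matrix has a non-negligible minimum singular value.

```python
import numpy as np

def product_columns(A):
    """Columns a_i ⊗ a_i, flattened to R^{d^2} (the map f(a) = a ⊗ a)."""
    d, k = A.shape
    return np.einsum('ir,jr->ijr', A, A).reshape(d * d, k)

sigma_min = lambda M: np.linalg.svd(M, compute_uv=False)[-1]

rng = np.random.default_rng(7)
d, k, rho = 10, 40, 0.1                              # overcomplete: k = 4d
A = np.tile(rng.standard_normal((d, 1)), (1, k))     # adversarial: identical columns

print(sigma_min(product_columns(A)))                 # ≈ 0 (rank one in the worst case)

A_pert = A + rng.standard_normal((d, k)) * (rho / np.sqrt(d))
print(sigma_min(product_columns(A_pert)))            # small but non-negligible, as the theorem predicts w.h.p.
```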

Page 31:

Proof sketch (t=2)

Main issue: the perturbation happens before the product.
• Easy if the columns were perturbed after the tensor product (simple anti-concentration bounds)

Technical component: show that perturbed product vectors behave like random vectors in $\mathbb{R}^{d^2}$.

[Figure: the $d^2 \times k$ matrix $U$ of flattened product vectors]

Prop. For any matrix $A$, the matrix below (of flattened product vectors) has min singular value $\ge 1/\mathrm{poly}(d, 1/\rho)$ with probability $1 - \exp(-\mathrm{poly}(d))$.

• Only $O(d)$ coordinates of randomness per column, which lives in $d^2$ dimensions  • Block dependencies

Page 32:

Projections of product vectors

Question. Given any vectors and Gaussian $\rho$-perturbations $\tilde{a}, \tilde{b}$ of them, does $\tilde{a} \otimes \tilde{b}$ have a non-negligible projection onto any given (high-dimensional) subspace, with high probability?

Easy: a $\rho$-perturbation applied to the whole vector has a non-negligible projection onto S w.h.p.; anti-concentration for polynomials only implies this with probability 1 − 1/poly.

[Figure: $\tilde{a} \otimes b$ written as $d$ blocks $\tilde{a}(1)\,b, \dots, \tilde{a}(d)\,b$]

Much tougher for a product of perturbations! (inherent block structure)

Page 33:

Projections of product vectors

Question. Given any vectors and Gaussian $\rho$-perturbations $\tilde{a}, \tilde{b}$, does $\tilde{a} \otimes \tilde{b}$ have a non-negligible projection onto any given (high-dimensional) subspace S, with high probability?

$\Pi$ is the projection matrix onto S (a $d^2 \times d^2$ matrix); the projection of $\tilde{a} \otimes \tilde{b}$ can be written as a matrix $M$ applied to $\tilde{a}$, where each entry of $M$ is the dot product of a block of $\Pi$ with $\tilde{b}$.

Page 34:

Two steps of the proof:

1. W.h.p. (over the perturbation of $b$), $M$ has many large eigenvalues. (Shown next.)

2. If $M$ has many large eigenvalues, then w.h.p. (over the perturbation of $a$), $\tilde{a} \otimes \tilde{b}$ has a large projection onto S.

(Step 2 follows easily by analyzing the projection of a perturbed vector onto a dim-$k$ space.)

Page 35:

Structure in any subspace S

Suppose the first $\sqrt{d}$ "blocks" chosen in S were orthogonal...

• Entry $(i,j)$ of $M$ (restricted to these columns) is a dot product $\langle v_{ij}, \tilde{b} \rangle$, with $v_{ij} \in \mathbb{R}^d$

• Translated i.i.d. Gaussian matrix!

⟹ has many big eigenvalues

Page 36:

Finding Structure in any subspace S

Main claim: every (high-)dimensional subspace has $\sqrt{d}$ vectors $v_1, v_2, \dots, v_{\sqrt{d}}$ with such a structure…

Property: the picked blocks ($d$-dim vectors) have a "reasonable" component orthogonal to the span of the rest…

Earlier argument goes through even with blocks not fully orthogonal!

Page 37:

Main claim (sketch)…

Idea: obtain "good" columns one by one…

• Show there exists a block with many linearly independent "choices"

• Fix some choices and argue the same property holds, …

• Uses a delicate inductive argument (crucially uses the fact that we have a high-dimensional subspace)

Generalization: a similar result holds for higher-order products; this implies the main result.

Page 38:

Summary

• Polynomial-time algorithms in overcomplete settings:

• Smoothed analysis for learning probabilistic models.

Guarantees for order-$t$ tensors in $d$ dims (each); rank of the $t$-tensor = $k$ (number of clusters)

[Table: previous algorithms vs. algorithms in the smoothed case]

• Flattening gets beyond full-rank conditions: plug into results on spectral learning of probabilistic models

Page 39:

Future Directions

Smoothed analysis for other learning problems?

Better guarantees using higher-order moments
• Better bounds w.r.t. the smallest singular value?

Better Robustness to Errors

• Modelling errors?

• Tensor decomposition algorithms that are more robust to errors?

Promise: [Barak-Kelner-Steurer'14] using the Lasserre hierarchy

Better dependence on rank k vs. dim d (esp. 3-tensors)

• Next talk by Anandkumar: random / incoherent decompositions

Page 40:

Thank You!

Questions?

