Smoothed Analysis of Tensor Decompositions and Learning

Aravindan Vijayaraghavan
CMU / Northwestern University

Based on joint works with:
Aditya Bhaskara (Google Research)
Moses Charikar (Princeton)
Ankur Moitra (MIT)
Factor analysis

[Figure: a d × d matrix M]

Assumption: the matrix has a "simple explanation"
• Explain using few unobserved variables
• Sum of "few" rank-one matrices (k < d):

M = a_1 ⊗ b_1 + a_2 ⊗ b_2 + … + a_k ⊗ b_k

Qn [Spearman]. Can we find the ``desired'' explanation?
The rotation problem

M = a_1 ⊗ b_1 + a_2 ⊗ b_2 + … + a_k ⊗ b_k

Any suitable "rotation" of the vectors gives a different decomposition:
A Bᵀ = (AQ)(BQ)ᵀ for any orthogonal Q (since QQᵀ = I)

Often difficult to find the "desired" decomposition..
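A minimal numpy sketch (ours, for illustration) of the rotation problem: the same matrix M admits many rank-k factorizations, since any orthogonal Q can be absorbed into the factors.

```python
# Rotation problem demo: M = A B^T admits the alternative factorization
# (A Q)(B Q)^T for any orthogonal Q, so the factors are not identifiable.
import numpy as np

d, k = 10, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, k))
B = rng.standard_normal((d, k))
M = A @ B.T                                        # M = sum_i a_i (x) b_i

Q, _ = np.linalg.qr(rng.standard_normal((k, k)))   # random orthogonal Q
print(np.allclose(M, (A @ Q) @ (B @ Q).T))         # True: same M, new factors
```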
Tensors: multi-dimensional arrays

[Figure: a d × d × d array]

• A d × d × … × d array is a tensor of order t (a t-tensor)
• Represent higher-order correlations, partial derivatives, etc.
• Collection of matrix (or smaller-tensor) slices

3-way factor analysis: the tensor can be written as a sum of few rank-one tensors

T = ∑_{i=1}^{k} a_i ⊗ b_i ⊗ c_i

Rank(T) = smallest k s.t. T can be written as a sum of k rank-1 tensors
3-Tensors:

[Figure: T as a sum of rank-one tensors, with factor columns a_1, …, a_k; b_1, …, b_k; c_1, …, c_k]
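A short numpy sketch (ours) of the definition above: assembling a rank-k 3-tensor from factor matrices whose columns are the a_i, b_i, c_i.

```python
# Build T = sum_{i=1}^k a_i (x) b_i (x) c_i from factor matrices A, B, C
# (columns a_i, b_i, c_i); einsum sums the k rank-one terms.
import numpy as np

d, k = 8, 4
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
print(T.shape)   # (8, 8, 8); rank(T) <= k by construction
```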
• Rank of a 3-tensor can be as large as Θ(d²); rank of a t-tensor can be as large as d^{t−1}.
Thm [Harshman'70, Kruskal'77]. Rank-k decompositions for 3-tensors (and higher orders) are unique under mild conditions.

3-way decompositions overcome the rotation problem!
Learning Probabilistic Models: Parameter Estimation

Question: Can given data be "explained" by a simple probabilistic model?
E.g. HMMs for speech recognition, mixtures of Gaussians for clustering points, multiview models.

Learning goal: Can the parameters of the model be learned from polynomially many samples generated by the model?
• EM algorithm: used in practice, but converges to local optima
• Known worst-case algorithms have exponential time & sample complexity
Mixtures of (axis-aligned) Gaussians

Probabilistic model for clustering in d dimensions.

Parameters:
• Mixing weights w_1, …, w_k
• Gaussian i: mean μ_i ∈ ℝ^d, covariance Σ_i diagonal

Learning problem: Given many sample points, find the weights, means and covariances.

• Algorithms use samples and time exponential in k [FOS'06, MV'10]
• Lower bound of exp(Ω(k)) samples in the worst case [MV'10]

Aim: guarantees in realistic settings
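For concreteness, a minimal sampler for this model (our sketch; the parameter names w, mu, var are ours):

```python
# Sample n points from a mixture of k axis-aligned Gaussians in R^d.
import numpy as np

def sample_mixture(n, w, mu, var, rng):
    """w: (k,) mixing weights; mu, var: (d, k) means / diagonal variances."""
    z = rng.choice(len(w), size=n, p=w)          # latent cluster of each point
    eps = rng.standard_normal((n, mu.shape[0]))
    return mu[:, z].T + eps * np.sqrt(var[:, z].T)

rng = np.random.default_rng(2)
d, k = 5, 3
w = np.array([0.5, 0.3, 0.2])
mu = rng.standard_normal((d, k))
var = 0.1 * np.ones((d, k))
X = sample_mixture(10_000, w, mu, var, rng)      # given X, recover w, mu, var
```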
Method of Moments and Tensor Decompositions

Step 1. Compute a tensor whose decomposition encodes the model parameters.
Step 2. Find the decomposition (and hence the parameters).

E.g. the third-moment tensor with entries E[x_i x_j x_k]:

T = ∑_{i=1}^{k} w_i μ_i ⊗ μ_i ⊗ μ_i

• Uniqueness ⇒ recover parameters
• Algorithm for decomposition ⇒ efficient learning
[Chang] [Allman, Matias, Rhodes] [Anandkumar, Ge, Hsu, Kakade, Telgarsky]
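A hedged numpy sketch of step 1: estimate E[x ⊗ x ⊗ x] from samples. (For simplicity the toy model below emits a component mean directly, so the third moment is exactly ∑_i w_i μ_i^{⊗3}; for Gaussian mixtures and other models, lower-order correction terms are also needed.)

```python
# Step 1 of the method of moments: empirical third-moment tensor.
import numpy as np

def empirical_third_moment(X):
    """X: (n, d) samples; returns the d x d x d tensor of E[x_i x_j x_k]."""
    return np.einsum('si,sj,sk->ijk', X, X, X) / X.shape[0]

rng = np.random.default_rng(3)
d, k, n = 6, 3, 200_000
mu = rng.standard_normal((d, k))                 # component means mu_i
w = np.ones(k) / k                               # mixing weights w_i
X = mu[:, rng.choice(k, size=n, p=w)].T          # toy model: x = mu_i w.p. w_i
T_hat = empirical_third_moment(X)
T = np.einsum('ir,jr,kr->ijk', mu * w, mu, mu)   # sum_i w_i mu_i^{(x)3}
print(np.abs(T_hat - T).max())                   # small sampling error
```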
What is known about Tensor Decompositions?

Thm [Kruskal'77]. Rank-k decompositions for 3-tensors are unique (non-algorithmic) when the Kruskal ranks of the factor matrices satisfy k_A + k_B + k_C ≥ 2k + 2!

Thm [Jennrich via Harshman'70]. Find unique rank-k decompositions for 3-tensors when k ≤ d!
• Uniqueness proof is algorithmic!
• Called the full-rank case. No symmetry or orthogonality needed.
• Rediscovered in [Leurgans et al 1993] [Chang 1996]

Thm [De Lathauwer, Castaing, Cardoso'07]. Algorithm for 4-tensors of rank k generically, for k up to c·d².

Thm [Chiantini, Ottaviani'12]. Uniqueness (non-algorithmic) of 3-tensors generically, for rank up to a constant fraction of d².

Are uniqueness and algorithms resilient to noise of 1/poly(d, k)?
Robustness to Errors

Beware: sampling error. The empirical estimate is T̃ = T + E; with poly(d) samples, the error ‖E‖_F is 1/poly(d).

Thm [BCV'14]. Robust version of the Kruskal uniqueness theorem (non-algorithmic), tolerating inverse-polynomial error.

Thm. Jennrich's polynomial time algorithm for tensor decompositions is robust up to inverse-polynomial error.

Open problem: Robust version of the generic results [De Lathauwer et al]?
Algorithms for Tensor Decompositions

• Polynomial time algorithms when rank k ≤ d [Jennrich]
• NP-hard in the worst case when the rank is larger [Håstad, Hillar-Lim]

This talk: Overcome worst-case intractability using Smoothed Analysis.
Polynomial time algorithms* for robust tensor decompositions for rank k >> d (rank any polynomial in the dimension).
*Algorithms recover the rank-one factors up to inverse-polynomial error.
Implications for Learning

``Full rank'' or ``non-degenerate'' setting: efficient learning known only in restricted cases, when no. of clusters/topics k ≤ dimension d
[Chang 96, Mossel-Roch 06, Anandkumar et al. 09-14]
• Learning phylogenetic trees [Chang, MR]
• Axis-aligned Gaussians [HK]
• Parse trees [ACHKSZ, BHD, B, SC, PSX, LIPPX]
• HMMs [AHK, DKZ, SBSGS]
• Single-topic models [AHK], LDA [AFHKL]
• ICA [GVX], overlapping communities [AGHK], …
Overcomplete Learning Setting

Number of clusters/topics/states k >> dimension d
(e.g. in computer vision, speech)

Previous algorithms do not work when k > d!
Need polytime decomposition of tensors of rank k >> d.
Smoothed Analysis [Spielman & Teng 2000]

E.g. the simplex algorithm solves LPs efficiently under smoothed analysis (explains practice).

Smoothed analysis guarantees:
• Small random perturbation of the input makes instances easy
• Worst instances are isolated
• Best polytime guarantees in the absence of any worst-case guarantees
Today's talk: Smoothed Analysis for Learning [BCMV STOC'14]
• First smoothed analysis treatment of unsupervised learning

Thm. Polynomial time algorithms for learning mixtures of (axis-aligned) Gaussians, multiview models etc., even in ``overcomplete settings'',

based on

Thm. Polynomial time algorithms for tensor decompositions in the smoothed analysis setting.
Smoothed Analysis for Learning

Learning setting (e.g. mixtures of Gaussians):
• Worst-case instances: means in pathological configurations
• Means are not in adversarial configurations in the real world!
• What if the means are perturbed slightly: μ_i → μ̃_i?

Generally, the parameters of the model are perturbed slightly.
Smoothed Analysis for Tensor Decompositions

1. Adversary chooses a tensor T ∈ ℝ^{d×d×…×d}, T = ∑_{i=1}^{k} a_i^(1) ⊗ a_i^(2) ⊗ … ⊗ a_i^(t).
2. Each ã_i^(j) is a random ρ-perturbation of a_i^(j), i.e. add an independent (Gaussian) random vector of length ≈ ρ.
3. Input: T̃ = ∑_{i=1}^{k} ã_i^(1) ⊗ ã_i^(2) ⊗ … ⊗ ã_i^(t) + noise. Analyse the algorithm on T̃.

The factors of the decomposition are perturbed.
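A small sketch (ours) of this input model for t = 3: each factor column receives an independent Gaussian perturbation of expected length ≈ ρ.

```python
# Smoothed-analysis input model (order 3): adversarial factors, then a
# rho-perturbation of each column (coordinates N(0, rho^2/d), so the
# perturbation vector has expected length ~ rho).
import numpy as np

d, k, rho = 10, 15, 0.1                  # overcomplete: k > d
rng = np.random.default_rng(4)
perturb = lambda M: M + rng.standard_normal(M.shape) * rho / np.sqrt(d)

A = np.ones((d, k))                      # adversary's choice (degenerate here)
A1, A2, A3 = perturb(A), perturb(A), perturb(A)   # smoothed factors
T_tilde = np.einsum('ir,jr,kr->ijk', A1, A2, A3)  # the input tensor (+ noise)
```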
Algorithmic Guarantees

Thm [BCMV'14]. Polynomial time algorithm for decomposing a t-tensor (d dims in each mode) in the smoothed analysis model, w.h.p., even when the rank k is polynomial in the dimension (k up to ≈ d^{⌊(t−1)/2⌋}).
Running time, sample complexity = poly(d, k, 1/ρ) for constant t.

Guarantees for order-t tensors in d dims (each), rank of the t-tensor = k (number of clusters):
• Previous algorithms: k ≤ d
• Algorithms (smoothed case): k up to ≈ d^{⌊(t−1)/2⌋}

Corollary. Polytime algorithms (smoothed analysis) for mixtures of axis-aligned Gaussians, multiview models etc. even in the overcomplete setting, i.e. no. of clusters k ≤ d^C for any constant C, w.h.p.
Interpreting Smoothed Analysis Guarantees

Time, sample complexity = poly(d, k, 1/ρ).
Works with probability 1 − exp(−poly(d)).
• Exponentially small failure probability (for constant order t)

Smooth interpolation between worst case and average case:
• ρ → 0: worst case
• ρ large: almost random vectors
• Can handle ρ inverse-polynomial in d
Algorithm Details

Algorithm Outline
1. An algorithm for 3-tensors in the ``full rank setting'' (k ≤ d).
2. For higher-order tensors, reduce to this case using ``tensoring / flattening''.
   • Helps handle the overcomplete setting
   • Any algorithm for full-rank (non-orthogonal) tensors suffices

[Jennrich 70] A simple (robust) algorithm for a 3-tensor T when the factor matrices are full rank.

Recall: T = ∑_{i=1}^{k} A_i ⊗ B_i ⊗ C_i, with factor matrices A, B, C (columns A_i, B_i, C_i).
Aim: Recover A, B, C.
Blast from the Past

[Jennrich via Harshman 70] Algorithm for 3-tensors:

Recall T = ∑_{i=1}^{k} A_i ⊗ B_i ⊗ C_i. Aim: recover A, B, C.

• A, B are full rank (rank = k)
• C has rank ≥ 2 (no two columns parallel)
• Reduces to matrix eigen-decompositions

Qn. Is this algorithm robust to errors? Yes! Needs perturbation bounds for eigenvectors [Stewart-Sun].

Thm. Efficiently decompose T + error and recover the factors up to inverse-polynomial error when
1) A, B have min-singular-value ≥ 1/poly(d),
2) C doesn't have parallel columns.
Slices of tensors

Consider a rank-1 tensor u ⊗ v ⊗ w. Its s'th slice is w(s)·(u ⊗ v).

For T = ∑_{i=1}^{k} A_i ⊗ B_i ⊗ C_i, the s'th slice is ∑_{i=1}^{k} C_i(s)·(A_i ⊗ B_i).

All slices have a common diagonalization!

Random combination of slices: ∑_{i=1}^{k} ⟨C_i, w⟩·(A_i ⊗ B_i) = A Diag(⟨C_i, w⟩) Bᵀ.
Simultaneous diagonalization: two matrices with a common diagonalization.

If 1) A, B are invertible, and
   2) the diagonal coefficients ⟨C_i, w⟩ are non-zero and pairwise unequal,
we can find A, B by matrix diagonalization!
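A quick numerical check (ours) of the slice identity above: a random combination of the slices of T equals A Diag(⟨C_i, w⟩) Bᵀ.

```python
# Verify: sum_s w(s) * T[:, :, s] == A Diag(C^T w) B^T.
import numpy as np

d, k = 8, 5
rng = np.random.default_rng(5)
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
w = rng.standard_normal(d)
M_w = np.einsum('ijs,s->ij', T, w)                    # random slice combination
print(np.allclose(M_w, A @ np.diag(C.T @ w) @ B.T))   # True
```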
Decomposition algorithm [Jennrich]

T = ∑_{i=1}^{k} A_i ⊗ B_i ⊗ C_i

Algorithm:
1. Take a random combination of the slices along w: M_w = A Diag(⟨C_i, w⟩) Bᵀ.
2. Take a random combination of the slices along w': M_{w'} = A Diag(⟨C_i, w'⟩) Bᵀ.
3. Find the eigen-decomposition of M_w (M_{w'})⁺ to get A. Similarly recover B, C.

Thm. Efficiently decompose T + error and recover the factors up to inverse-polynomial error (in Frobenius norm) when
1) A, B are full rank, i.e. min-singular-value ≥ 1/poly(d),
2) C doesn't have parallel columns (in a robust sense).
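A hedged implementation sketch of the algorithm above (not the authors' code): it recovers the columns of A up to scaling; B and C follow similarly.

```python
# Jennrich's algorithm: the eigenvectors of M_w pinv(M_w') with nonzero
# eigenvalues are the columns of A (up to scaling), since
# M_w pinv(M_w') = A Diag(<C_i,w>/<C_i,w'>) pinv(A).
import numpy as np

def jennrich_A(T, k, rng):
    """Recover the directions of the k columns of A from a 3-tensor T."""
    w, wp = rng.standard_normal(T.shape[2]), rng.standard_normal(T.shape[2])
    Mw  = np.einsum('ijs,s->ij', T, w)     # = A Diag(<C_i, w >) B^T
    Mwp = np.einsum('ijs,s->ij', T, wp)    # = A Diag(<C_i, w'>) B^T
    evals, vecs = np.linalg.eig(Mw @ np.linalg.pinv(Mwp))
    top = np.argsort(-np.abs(evals))[:k]   # k nonzero eigenvalues, distinct whp
    return vecs[:, top].real               # real for simple real eigenvalues

d, k = 8, 5
rng = np.random.default_rng(6)
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
A_hat = jennrich_A(T, k, rng)

# check: every recovered column is parallel to one true column of A
unit = lambda M: M / np.linalg.norm(M, axis=0)
print(np.round(np.abs(unit(A).T @ unit(A_hat)).max(axis=0), 6))  # all ~ 1.0
```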
Overcomplete Case: Techniques

Mapping to Higher Dimensions

How do we handle the case rank k >> d?
(or even vectors with "many" linear dependencies?)

A mapping to higher dimensions: φ maps the parameter/factor vectors to higher dimensions, e.g. φ : ℝ^d → ℝ^{d²}, sending the factor matrix A = [A_1, …, A_k] to [φ(A_1), …, φ(A_k)], s.t.
1. a tensor with the φ(A_i) as factors is computable using the data,
2. φ(A_1), …, φ(A_k) are linearly independent (min singular value ≥ 1/poly).

• Reminiscent of kernels in SVMs

Qn: are these vectors linearly independent? Is the ``essential dimension'' of the φ(A_i) really ≈ d²?
Outer product / tensor products: the map φ(x) = x ⊗ x (more generally x^{⊗t}).

Basic intuition:
1. x ⊗ x has d² dimensions.
2. For non-parallel unit vectors the distance does not shrink: ‖u ⊗ u − v ⊗ v‖ ≥ min(‖u − v‖, ‖u + v‖).
• The tensor with factors φ(A_i) is a flattening of a higher-order tensor.
Bad cases

Beyond worst-case analysis: can we hope for the "dimension" to multiply "typically"?

[Figure: Z = [u_1 ⊗ v_1, …, u_k ⊗ v_k] ∈ ℝ^{d²×k}, with U, V of rank d and z_i = u_i ⊗ v_i]

Bad example with k = 2d:
• Every d vectors of U and of V are linearly independent
• But the 2d vectors of Z are linearly dependent!

Lem. Dimension (K-rank) under tensoring is additive in the worst case, not multiplicative.

The strategy does not work in the worst case.
But bad examples are pathological and hard to construct!
Product vectors & linear structure

• Easy to compute a tensor with the φ(A_i) as factors/parameters (``flattening'' of the order-3t moment tensor)
• The new factor matrix is full rank, using smoothed analysis:

Theorem. For any matrix A with columns A_i ∈ ℝ^d, let Ã_i be a random ρ-perturbation of A_i, and map φ(Ã_i) = Ã_i ⊗ Ã_i. Then for k up to c·d², the d² × k matrix [φ(Ã_1), …, φ(Ã_k)] has min singular value ≥ 1/poly(d, 1/ρ) with probability 1 − exp(−poly(d)).
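A numerical illustration (ours, not a proof) of this theorem: even when the unperturbed columns are all identical, the perturbed product vectors form a matrix whose smallest singular value is bounded away from 0, for k well beyond d.

```python
# Perturbed product vectors: columns A~_i (x) A~_i in R^{d^2} for k >> d.
import numpy as np

def khatri_rao(U, V):
    """Column-wise tensor product: column i is U_i (x) V_i."""
    return np.einsum('ir,jr->ijr', U, V).reshape(U.shape[0] * V.shape[0], -1)

d, k, rho = 10, 40, 0.1                     # d < k <= d^2 / 2
rng = np.random.default_rng(7)
A  = np.ones((d, k))                        # adversarial: identical columns
At = A + rho * rng.standard_normal((d, k))  # rho-perturbation

print(np.linalg.matrix_rank(khatri_rao(A, A)))        # 1: totally degenerate
print(np.linalg.svd(khatri_rao(At, At), compute_uv=False)[-1])
# sigma_min is poly(rho)-sized rather than 0, as the theorem predicts
```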
Proof sketch (t = 2)

Technical component: show that perturbed product vectors ã ⊗ b̃ behave like random vectors in ℝ^{d²}.

Main issue: the perturbation happens before the product..
• Easy if the columns were perturbed after the tensor product (simple anti-concentration bounds)
• Only O(d) coordinates of randomness in d² dimensions; block dependencies

Prop. For any matrix whose columns are the perturbed product vectors ã_i ⊗ b̃_i, the matrix has min singular value ≥ 1/poly with probability 1 − exp(−poly(d)).

Projections of product vectors

Question. Given any vectors a, b and Gaussian ρ-perturbations ã, b̃: does ã ⊗ b̃ have a non-negligible projection onto any given Ω(d²)-dimensional subspace S, with high probability?

Easy: for a single ρ-perturbed vector, the perturbation alone gives an ≈ ρ projection onto a given large subspace S w.h.p. Anti-concentration for polynomials gives a projection bound here, but only with probability 1 − 1/poly.

[Figure: ã ⊗ b̃ laid out in d blocks of d coordinates; the s'th block is ã(s)·b̃]

Much tougher for a product of perturbations! (inherent block structure)
Projections of product vectors

Question. Given any vectors and Gaussian ρ-perturbations ã, b̃: does ã ⊗ b̃ have a non-negligible projection onto any given d²/2-dimensional subspace S with high probability?

[Figure: Π_S, the projection matrix onto S, is a d² × d² matrix; each coordinate of Π_S(ã ⊗ b̃) is a dot product of a block of Π_S with the blocks ã(s)·b̃]
Two steps of the proof..

1. W.h.p. (over the perturbation of b), the associated matrix M(b̃) has many (≈ √d) large eigenvalues; will show this next.
2. If M(b̃) has many large eigenvalues, then w.h.p. (over the perturbation of a), ã ⊗ b̃ has a large projection onto S.
   (Follows easily by analyzing the projection of a vector onto a dim-k space.)
Structure in any subspace S

Suppose the first few "blocks" in the chosen vectors were orthogonal...

[Figure: vectors v_1, v_2, …, v_{√d} ∈ ℝ^{d²}, viewed as d blocks of d coordinates each, restricted to the picked cols]

• Entry (i, j) of the resulting matrix is a translated i.i.d. Gaussian entry
• A translated i.i.d. Gaussian matrix has many big eigenvalues w.h.p.!

Main claim: every d²/2-dimensional subspace has vectors with such a structure..

Property: the picked blocks (d-dim vectors) have a "reasonable" component orthogonal to the span of the rest..
Finding Structure in any subspace S

The earlier argument goes through even with blocks that are not fully orthogonal!

Idea: obtain "good" columns v_1, v_2, …, v_{√d} one by one..
• Show there exists a block with many linearly independent "choices"
• Fix some choices and argue the same property still holds, …
Main claim (sketch)..
• Crucially uses the fact that we have a large (d²/2-dim) subspace
• Uses a delicate inductive argument

Generalization: a similar result holds for higher-order products; this implies the main theorem.
Summary

• Smoothed analysis for learning probabilistic models.
• Polynomial time algorithms in overcomplete settings:
  guarantees for order-t tensors in d dims (each), rank of the t-tensor = k (number of clusters);
  previous algorithms handle k ≤ d, the smoothed case handles k up to ≈ d^{⌊(t−1)/2⌋}.
• Flattening gets beyond full-rank conditions: plug into results on spectral learning of probabilistic models.
Future Directions

Smoothed analysis for other learning problems?

Better guarantees using higher-order moments
• Better bounds w.r.t. the smallest singular value?

Better robustness to errors
• Modelling errors?
• Tensor decomposition algorithms that are more robust to errors?
• Promising: [Barak-Kelner-Steurer'14] using the Lasserre hierarchy

Better dependence on rank k vs dim d (esp. 3-tensors)
• Next talk by Anandkumar: random/incoherent decompositions
Thank You!
Questions?