Nearly-Tight Sample Complexity Bounds for Learning Mixtures of Gaussians
Hassan Ashtiani^{1,2}, Shai Ben-David^2, Christopher Liaw^3, Nicholas J. A. Harvey^3, Abbas Mehrabian^4, Yaniv Plan^3
^1McMaster University, ^2University of Waterloo, ^3University of British Columbia, ^4McGill University
Our results
Theorem. The sample complexity of learning mixtures of k Gaussians in R^d up to total variation distance ε is (Θ̃(·) suppresses polylog(kd/ε) factors):
• Θ̃(kd²/ε²) for general Gaussians;
• Θ̃(kd/ε²) for axis-aligned Gaussians.
Correspondingly, given n samples from the true distribution, the minimax risk is Õ(√(kd²/n)) and Õ(√(kd/n)), respectively.
PAC Learning of Distributions
• Given i.i.d. samples from an unknown target distribution D, output D̂ such that
  dTV(D, D̂) = sup_E |Pr_D[E] − Pr_D̂[E]| = (1/2)‖f_D − f_D̂‖₁ ≤ ε.
• F: an arbitrary class of distributions (e.g. Gaussians). k-mix(F): the class of k-mixtures of F, i.e.
  k-mix(F) := { ∑_{i∈[k]} w_i D_i : w_i ≥ 0, ∑_i w_i = 1, D_i ∈ F }.
• The sample complexity of F is the minimum number m_F(ε) such that there is an algorithm that, given m_F(ε) samples from D, outputs D̂ with dTV(D, D̂) ≤ ε.
• PAC learning is not equivalent to parameter estimation, where the goal is to recover the parameters of the distribution.
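As a concrete illustration of the learning goal (a minimal numerical sketch, not part of the result), the total variation distance between two one-dimensional Gaussians can be approximated by integrating |f − g| on a grid:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def tv_distance(mu1, sigma1, mu2, sigma2, lo=-20.0, hi=20.0, steps=50_000):
    """Approximate dTV = (1/2) * integral of |f - g| via a midpoint Riemann sum."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += abs(gaussian_pdf(x, mu1, sigma1) - gaussian_pdf(x, mu2, sigma2)) * dx
    return 0.5 * total

print(tv_distance(0.0, 1.0, 0.0, 1.0))   # ~0.0 (identical Gaussians)
print(tv_distance(0.0, 1.0, 10.0, 1.0))  # ~1.0 (essentially disjoint supports)
```

For example, dTV(N(0,1), N(1,1)) = 2Φ(1/2) − 1 ≈ 0.383, which the sum above reproduces.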
Compression Framework
We develop a novel compression framework that uses few samples to build a representative family of distributions.
1. Encoder is given the true distribution D ∈ F and draws m(ε) points from D.
2. Encoder sends t(ε) points and/or “helper” bits to decoder.
3. Decoder outputs D̂ ∈ F such that dTV(D, D̂) ≤ ε w.p. 2/3.
If this is possible, we say F is (m(ε), t(ε))-compressible.
Compression Theorem
Compression Theorem [ABHLMP ’18]. If F is (m(ε), t(ε))-compressible, then the sample complexity of learning F is O(m(ε) + t(ε)/ε²).
Compression of Mixtures
Lemma. If F is (m(ε), t(ε))-compressible, then k-mix(F) is (k·m(ε/k)/ε, k·t(ε/k))-compressible.
Compression Theorem for Mixtures. If F is (m(ε), t(ε))-compressible, then the sample complexity of learning k-mix(F) is O(k·m(ε/k)/ε + k·t(ε/k)/ε²).
Example: Gaussians in R
Claim. Gaussians in R are (1/ε, 2)-compressible.
1. True distribution is N(μ, σ²); encoder draws 1/ε points from N(μ, σ²).
2. With high probability, ∃ X_i ≈ μ + σ and X_j ≈ μ − σ.
3. Encoder sends X_i, X_j; decoder recovers μ, σ approximately.
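The claim can be played out in code (a minimal simulation; as in the framework, the encoder is given the true (μ, σ)):

```python
import random

def encode(mu, sigma, eps, rng):
    """Encoder: knows the true N(mu, sigma^2); draws ~1/eps samples and
    sends the two samples closest to mu + sigma and mu - sigma."""
    n = max(2, int(1 / eps))
    samples = [rng.gauss(mu, sigma) for _ in range(n)]
    xi = min(samples, key=lambda x: abs(x - (mu + sigma)))
    xj = min(samples, key=lambda x: abs(x - (mu - sigma)))
    return xi, xj

def decode(xi, xj):
    """Decoder: recovers (mu, sigma) approximately from the two points."""
    mu_hat = (xi + xj) / 2
    sigma_hat = abs(xi - xj) / 2
    return mu_hat, sigma_hat

rng = random.Random(0)
mu_hat, sigma_hat = decode(*encode(mu=5.0, sigma=2.0, eps=0.01, rng=rng))
print(mu_hat, sigma_hat)  # close to (5.0, 2.0)
```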
Outline of the Algorithm
Assume: (i) F is (m(ε), t(ε))-compressible; (ii) the true distribution D ∈ F.
Input: error parameter ε > 0.
1. Draw m(ε) i.i.d. samples from D.
2. The encoder has at most m(ε)^t(ε) · 2^t(ε) possible messages, so enumerate all M = m(ε)^t(ε) · 2^t(ε) of the decoder’s outputs, D̂₁, …, D̂_M. By assumption, dTV(D̂_i, D) ≤ ε for some i.
3. Use the tournament algorithm [DL ’01] to find the best distribution amongst D̂₁, …, D̂_M; O(log(M)/ε²) samples suffice for this step.
Sample complexity is m(ε) + O(log(M)/ε²) = O(m(ε) + t(ε)/ε²).
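The tournament step can be sketched as follows (a simplified 1-D version of the Scheffé-set idea from [DL ’01]; the grid integration and the candidate list are illustrative assumptions, not the paper's implementation):

```python
import math
import random

def scheffe_winner(pdfs, data, lo=-20.0, hi=20.0, steps=20_000):
    """For each pair (i, j), form the Scheffe set A = {x : f_i(x) > f_j(x)} and
    award the win to whichever candidate's mass on A is closer to the empirical
    mass of A. Return the index of the candidate with the most wins."""
    dx = (hi - lo) / steps
    grid = [lo + (s + 0.5) * dx for s in range(steps)]
    m = len(pdfs)
    wins = [0] * m
    for i in range(m):
        for j in range(i + 1, m):
            in_A = [pdfs[i](x) > pdfs[j](x) for x in grid]
            mass_i = sum(pdfs[i](x) * dx for x, a in zip(grid, in_A) if a)
            mass_j = sum(pdfs[j](x) * dx for x, a in zip(grid, in_A) if a)
            # Empirical mass of A: fraction of data points falling in A.
            emp = sum(1 for y in data if pdfs[i](y) > pdfs[j](y)) / len(data)
            if abs(mass_i - emp) <= abs(mass_j - emp):
                wins[i] += 1
            else:
                wins[j] += 1
    return max(range(m), key=lambda idx: wins[idx])

def normal_pdf(mu, sigma):
    return lambda x: math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(2_000)]  # true D = N(0, 1)
candidates = [normal_pdf(0.0, 1.0), normal_pdf(3.0, 1.0), normal_pdf(0.0, 4.0)]
print(scheffe_winner(candidates, data))  # 0: the true N(0, 1) wins
```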
Proof of Upper Bound
Lemma. Gaussians in R^d are (O(d), O(d²))-compressible.
Sketch of lemma. Suppose the true Gaussian is N(μ, Σ).
• Encoder draws O(d) points from N(μ, Σ).
• The points give the rough shape of the ellipsoid induced by μ, Σ; the encoder sends points & O(d²) bits; the decoder approximates the ellipsoid.
• Decoder outputs N(μ̂, Σ̂).
Proof of upper bound. Combine lemma with compression theorem.
Lower Bound Technique
Theorem (Fano’s Inequality). If D₁, …, D_r are distributions such that dTV(D_i, D_j) ≥ ε and KL(D_i, D_j) ≤ ε² for all i ≠ j, then the sample complexity is Ω(log(r)/ε²).
• Use the probabilistic method to find 2^Ω(d²) Gaussian distributions satisfying the hypothesis of Fano’s Inequality.
• Repeat the following procedure 2^Ω(d²) times:
  1. Start with the identity covariance matrix.
  2. Choose a random subspace S_a of dimension d/10 and perturb the eigenvalues by ε/√d along S_a.
  Let Σ_a be the corresponding covariance matrix and D_a = N(0, Σ_a).
Claim. If a ≠ b, then KL(D_a, D_b) ≤ ε² and dTV(D_a, D_b) ≥ ε with probability 1 − exp(−Ω(d²)).
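For reference, the KL part of the claim can be checked against the standard closed form for centered Gaussians (a well-known identity, stated here as a sanity check rather than as part of the proof):

```latex
\mathrm{KL}\bigl(\mathcal{N}(0,\Sigma_a)\,\|\,\mathcal{N}(0,\Sigma_b)\bigr)
  = \frac{1}{2}\left(\operatorname{tr}\!\left(\Sigma_b^{-1}\Sigma_a\right) - d
  + \ln\frac{\det\Sigma_b}{\det\Sigma_a}\right)
```

Since Σ_a and Σ_b differ from the identity by ε/√d perturbations, each eigenvalue λ_i of Σ_b⁻¹Σ_a is 1 + O(ε/√d); summing the second-order terms λ_i − 1 − ln λ_i ≈ (λ_i − 1)²/2 over at most d directions gives KL = O(ε²), consistent with the claim.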
This construction can be lifted to obtain 2^Ω(kd²) k-mixtures of d-dimensional Gaussians satisfying Fano’s Inequality.
Remark. The lower bound for the axis-aligned case was proved by [SOAJ ’14].
References
[DL ’01] Devroye, L., & Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer Science & Business Media.
[SOAJ ’14] Suresh, A. T., Orlitsky, A., Acharya, J., & Jafarpour, A. (2014). Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems (pp. 1395–1403).