
Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers

Jacob Steinhardt, Moses Charikar, Gregory Valiant

ITCS 2018

January 14, 2018


Motivation: Robust Learning

Question

What concepts can be learned robustly, even if some data is arbitrarily corrupted?


Example: Mean Estimation

Problem

Given data x1, . . . , xn ∈ Rd, of which (1 − ε)n come from p∗ (and the remaining εn are arbitrary outliers), estimate the mean µ of p∗.

Issue: high dimensions.

Mean Estimation: Gaussian Example

Suppose the clean data is Gaussian:

    xi ∼ N(µ, I)   (Gaussian with mean µ and variance 1 in each coordinate)

so that ‖xi − µ‖2 ≈ √(1² + · · · + 1²) = √d.

[Figure: clean points concentrate at distance ≈ √d from µ; εn outliers placed at the same distance look like typical points, yet can shift the empirical mean by ≈ ε√d.]

Cannot filter points independently, even if we know the true density!
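To see the difficulty concretely, here is a small numerical illustration (not from the talk; it assumes numpy): εn outliers whose norms look exactly like those of clean Gaussian points can still shift the empirical mean by ≈ ε√d.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, eps = 1000, 10000, 0.05
    mu = np.zeros(d)

    # Clean points from N(mu, I): each has l2 norm concentrated around sqrt(d).
    clean = rng.normal(size=(int((1 - eps) * n), d))
    # Outliers placed at distance exactly sqrt(d) from mu along one direction,
    # so each outlier's norm is indistinguishable from a typical clean point's.
    v = np.zeros(d); v[0] = 1.0
    outliers = np.tile(np.sqrt(d) * v, (int(eps * n), 1))
    data = np.vstack([clean, outliers])

    print(np.linalg.norm(clean, axis=1).mean())      # ~ sqrt(d) ~ 31.6
    print(np.linalg.norm(outliers, axis=1).mean())   # also ~ sqrt(d) ~ 31.6
    print(np.linalg.norm(data.mean(axis=0) - mu))    # mean is off by ~ eps*sqrt(d) ~ 1.6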


History

Progress in high dimensions only recently:

• Tukey median [1975]: robust but NP-hard

• Donoho estimator [1982]: high error

• [DKKLMS16, LRV16]: first dimension-independent error bounds

• large body of work since then [CSV17, DKKLMS17, L17, DBS17]

• many other problems including PCA [XCM10], regression [NTN11], classification [FHKP09], etc.


This Talk

Question

What general and simple properties enable robust estimation?

New information-theoretic criterion: resilience.


Resilience

Suppose {xi}i∈S is a set of points in Rd.

Definition (Resilience)

A set S is (σ, ε)-resilient in a norm ‖ · ‖ around a point µ if for all subsets T ⊆ S of size at least (1 − ε)|S|,

    ‖ (1/|T|) Σi∈T (xi − µ) ‖ ≤ σ.

Intuition: all large subsets have similar mean.
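As a rough way to get a feel for the definition, here is an illustrative sketch (not from the paper; resilience_lower_bound is a hypothetical helper, and the random-direction search only gives a lower bound on σ in the ℓ2 norm):

    import numpy as np

    def resilience_lower_bound(X, mu, eps, n_dirs=200, seed=0):
        """Lower-bound the l2 resilience parameter sigma of the set X around mu:
        for each sampled direction u, keep the (1-eps)|S| points with the largest
        projection onto u (the subset T that shifts the mean most along u)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        keep = int(np.ceil((1 - eps) * n))
        best = 0.0
        for _ in range(n_dirs):
            u = rng.normal(size=d)
            u /= np.linalg.norm(u)
            T = np.argsort((X - mu) @ u)[-keep:]
            best = max(best, np.linalg.norm(X[T].mean(axis=0) - mu))
        return best

    # i.i.d. N(0, I) points are (O(sqrt(eps)), eps)-resilient in l2 (in fact better).
    X = np.random.default_rng(1).normal(size=(5000, 50))
    print(resilience_lower_bound(X, np.zeros(50), eps=0.1))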


Main Result

Let S ⊆ Rd be a set of (1 − ε)n "good" points.

Let Sout be a set of εn arbitrary outliers.

We observe S̃ = S ∪ Sout.

Theorem

If S is (σ, ε/(1 − ε))-resilient around µ, then it is possible to output µ̂ such that ‖µ̂ − µ‖ ≤ 2σ.

In fact, outputting the center of any (σ, ε/(1 − ε))-resilient subset of S̃ of size (1 − ε)n will work!


Pigeonhole Argument

Claim: If S and S′ are (σ, ε/(1 − ε))-resilient around µ and µ′ respectively, and both have size (1 − ε)n, then ‖µ − µ′‖ ≤ 2σ.

Proof:

[Figure: Venn diagram of S and S′ inside the observed set, with their overlap S ∩ S′ highlighted.]

• Let µS∩S′ be the mean of S ∩ S′.
• By pigeonhole, |S ∩ S′| ≥ (1 − ε/(1 − ε))|S′|.

• Then ‖µ′ − µS∩S′‖ ≤ σ by resilience.

• Similarly, ‖µ − µS∩S′‖ ≤ σ.
• The result follows by the triangle inequality.
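For concreteness, the counting step can be spelled out as follows (a short derivation consistent with the claim, using |S| = |S′| = (1 − ε)n and |S ∪ S′| ≤ n):

\[
|S \cap S'| \;\ge\; |S| + |S'| - n \;=\; (1 - 2\varepsilon)n \;=\; \frac{1 - 2\varepsilon}{1 - \varepsilon}\,|S'| \;=\; \Bigl(1 - \frac{\varepsilon}{1 - \varepsilon}\Bigr)|S'|,
\]

so S ∩ S′ is a large enough subset of both S and S′ for resilience to apply to each, giving ‖µ − µS∩S′‖ ≤ σ and ‖µ′ − µS∩S′‖ ≤ σ, hence ‖µ − µ′‖ ≤ 2σ.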


Implication: Mean Estimation

Lemma

If a dataset has bounded covariance, it is (O(√ε), ε)-resilient in the ℓ2-norm.

Proof sketch: if εn points were each at distance ≳ 1/√ε from the mean, they alone would contribute variance ≈ 1. Therefore, deleting any εn points changes the mean by at most ≈ ε · (1/√ε) = √ε.
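A tiny numerical check of this bound (not from the talk; assumes numpy). The example below nearly saturates it: a set with covariance bounded by ~1 in every direction, where deleting the εn extreme points moves the mean by exactly √ε.

    import numpy as np

    eps, n, d = 0.04, 10000, 20
    k = int(eps * n)
    # (1-eps)n points at the origin plus eps*n points at distance 1/sqrt(eps):
    # the covariance stays bounded by ~1 in every direction.
    spike = np.zeros(d); spike[0] = 1.0 / np.sqrt(eps)
    X = np.vstack([np.zeros((n - k, d)), np.tile(spike, (k, 1))])
    mu = X.mean(axis=0)

    print(np.linalg.norm(np.cov(X.T), 2))                  # top eigenvalue ~ 0.96
    # Delete the eps*n points farthest from mu; the remaining mean shifts by sqrt(eps).
    keep = np.linalg.norm(X - mu, axis=1).argsort()[: n - k]
    print(np.linalg.norm(X[keep].mean(axis=0) - mu), np.sqrt(eps))   # both ~ 0.2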

Corollary

If the clean data has bounded covariance, its mean can be estimated to ℓ2-error O(√ε) in the presence of εn outliers.


Corollary

If the clean data has bounded kth moments, its mean can be estimated to ℓ2-error O(ε^(1−1/k)) in the presence of εn outliers.


Implication: Learning Discrete Distributions

Suppose we observe samples from a distribution π on {1, . . . , m}.

Samples come in r-tuples, which are either all good or all outliers.

Corollary

The distribution π can be estimated (in TV distance) to error O(ε·√(log(1/ε)/r)) in the presence of εn outliers.

• follows from resilience in `1-norm

• see also [Qiao & Valiant, 2018] later in this session!
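To get a feel for the rate (an illustrative plug-in, not from the slides): with ε = 0.1 and batches of size r = 100,

\[
\varepsilon \sqrt{\log(1/\varepsilon)/r} \;=\; 0.1 \sqrt{\ln(10)/100} \;\approx\; 0.015,
\]

well below the Θ(ε) error that is unavoidable when samples arrive individually (r = 1).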


A Majority of Outliers

Can also handle the case where the clean set has size only αn (α < 1/2):

[Figure: the observed set covered by resilient sets; one cover element S′ has large overlap S ∩ S′ with the clean set S.]

• cover the observed set S̃ by resilient sets

• at least one set S′ must have high overlap with S...

• ...and hence ‖µ′ − µ‖ ≤ 2σ as before.

• Recovery in list-decodable model [BBV08].


Implication: Stochastic Block Models

Set of αn good and (1− α)n bad vertices.

• good ↔ good: dense (avg. deg. = a)

• good ↔ bad: sparse (avg. deg. = b)

• bad ↔ bad: arbitrary

Question: when can good set be recovered (in terms of α, a, b)?
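A toy instance generator can make the setup concrete (an illustration only; the Bernoulli normalizations below are an assumption, and the adversarial bad-bad block is just filled in with one arbitrary choice):

    import numpy as np

    def semirandom_block_instance(n, alpha, a, b, seed=0):
        """Toy graph matching the setup above: alpha*n good vertices (the first
        k indices), average degree ~a inside the good set, ~b from good to bad,
        and an arbitrary (here: fully connected) bad-bad block."""
        rng = np.random.default_rng(seed)
        k = int(alpha * n)
        A = np.zeros((n, n), dtype=int)
        A[:k, :k] = rng.random((k, k)) < a / k            # good-good: dense
        A[:k, k:] = rng.random((k, n - k)) < b / (n - k)  # good-bad: sparse
        A[k:, k:] = 1                                     # bad-bad: adversary's choice
        A = np.triu(A, 1)
        return A + A.T                                    # symmetric, no self-loops

    A = semirandom_block_instance(n=2000, alpha=0.1, a=50, b=5)
    print(A[:200, :200].sum(axis=1).mean())   # good vertices: ~a neighbors inside the good set

The recovery question is then whether the first αn indices can be approximately identified from A alone.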


Using resilience in “truncated `1-norm”, can show:

Corollary

The set of good vertices can be approximately recovered whenever (a − b)²/a ≳ log(2/α)/α².

Matches Kesten-Stigum threshold up to log factors!

For planted clique (a = n, b = n/2), recover cliques of size Ω(√(n log n)).

• this is tight [S’17]
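As a sanity check of this consequence (a short derivation consistent with the corollary above): plugging a = n, b = n/2 and clique size k = αn into the recovery condition gives

\[
\frac{(a-b)^2}{a} \;=\; \frac{(n/2)^2}{n} \;=\; \frac{n}{4} \;\gtrsim\; \frac{\log(2/\alpha)}{\alpha^2}
\quad\Longleftrightarrow\quad \alpha^2 n \;\gtrsim\; \log(2/\alpha)
\quad\Longleftrightarrow\quad k = \alpha n \;\gtrsim\; \sqrt{n \log n},
\]

up to constant factors.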


Algorithmic Results

Can (sometimes) turn info-theoretic into algorithmic results.

Most existing algorithmic results rely on bounded covariance.

We show:

• for strongly convex norms, resilient sets can be "pruned" to have bounded covariance

• if the injective norm is approximable, bounded covariance → efficient algorithm with √ε error

• both true for `p-norms! (p ∈ [2,∞])

See [Li, 2017] and [Du, Balakrishnan, & Singh, 2017] for a non-`p-norm.
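To make the "bounded covariance → efficient algorithm with √ε error" step concrete, here is a minimal sketch of the spectral filtering heuristic used throughout this literature (in the spirit of [DKKLMS16]; it is an illustration with assumed constants, not the paper's exact algorithm):

    import numpy as np

    def filtered_mean(X, eps, var_bound=1.0):
        """Sketch of spectral filtering for robust mean estimation.

        While the empirical covariance has a direction of variance far above
        the bound satisfied by the clean data, outliers must be responsible,
        so drop the points that stick out most along that direction."""
        X = np.asarray(X, dtype=float)
        budget = int(2 * eps * len(X))            # allow removing up to ~2*eps*n points
        while budget > 0:
            mu = X.mean(axis=0)
            evals, evecs = np.linalg.eigh(np.cov(X.T))
            if evals[-1] <= 4 * var_bound:        # placeholder threshold, not tuned
                break
            scores = ((X - mu) @ evecs[:, -1]) ** 2
            drop = min(budget, max(1, len(X) // 100))
            X = X[np.argsort(scores)[:-drop]]     # remove the `drop` most extreme points
            budget -= drop
        return X.mean(axis=0)

    # Usage: clean N(mu, I) samples plus eps*n outliers clustered far away.
    rng = np.random.default_rng(0)
    d, n, eps = 100, 5000, 0.1
    mu = np.ones(d)
    X = np.vstack([rng.normal(mu, 1.0, size=(int((1 - eps) * n), d)),
                   np.tile(mu + 10.0, (int(eps * n), 1))])
    print(np.linalg.norm(X.mean(axis=0) - mu))          # naive mean: error ~ eps*10*sqrt(d) = 10
    print(np.linalg.norm(filtered_mean(X, eps) - mu))   # filtered: small error (O(sqrt(eps)) in theory)

The design point is that, roughly, outliers can only move the mean far if they also inflate the variance in some direction, so a bounded top eigenvalue certifies that the current empirical mean is close to the truth.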


Other Results

Finite-sample bounds

Extension to SVD


Summary

Information-theoretic criterion yielding (tight?) robust recovery bounds.

• based on simple pigeonhole arguments

Benefit: reduces the statistical problem to an algorithmic problem.

Open questions:

• resilience for other problems (e.g. regression)

• efficient algorithms under other assumptions

• matching lower bounds?
