Stochastic Neighbor Compression

Matt J. Kusner, Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal

NN Classifier

[figure: toy dataset with class 1, class 2, class 3, and a test input]

1-nearest neighbor rule [Cover & Hart, 1967]

during test-time, for each test input:

complexity: O(dn)
(n training inputs, d features)

e.g. n = 6 × 10^4, d = 784  →  O(dn) ≈ 47 million
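To make the test-time cost concrete, here is a minimal NumPy sketch of the brute-force rule (illustrative only, not the authors' code): each test input is compared against all n training inputs, at O(d) per distance, giving the O(dn) cost above.

```python
# Illustrative brute-force 1-NN (not the authors' code): O(dn) work per test input.
import numpy as np

def one_nn_predict(X_train, y_train, x_test):
    """Label of the nearest training input under squared Euclidean distance."""
    dists = ((X_train - x_test) ** 2).sum(axis=1)  # n distances, O(d) each
    return y_train[np.argmin(dists)]

# Slide's example: n = 60,000 inputs with d = 784 features, so a single
# prediction touches roughly 6e4 * 784 ≈ 47 million feature values.
```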

How can we reduce this complexity?

NN Classifier, 1-nearest neighbor rule
complexity: O(dn)  (n training inputs, d features)

1. reduce n
Training Consistent Sampling, Prototype Generation, Prototype Positioning
[Hart, 1968] [Angiulli, 2005] [Mollineda et al., 2002] [Bandyopadhyay & Maulik, 2002] [Bermejo & Cabestany, 1999] [Toussaint, 2002]

2. reduce d
Dimensionality Reduction
[Hinton & Roweis, 2002] [Tenenbaum et al., 2000] [Weinberger et al., 2004] [van der Maaten & Hinton, 2008] [Weinberger & Saul, 2009]

3. use data structures
Tree Structures, Hashing
[Omohundro, 1989] [Beygelzimer et al., 2006] [Andoni & Indyk, 2006] [Gionis et al., 1999]

This work: Dataset Compression

Main Idea: learn new synthetic inputs!

training data:
[ x_1  x_2  …  x_i  …  x_{n-1}  x_n ]   with labels   y_1  y_2  …  y_i  …  y_{n-1}  y_n

'compressed' data:
[ z_1  …  z_m ]   with labels   y_1  …  y_m

m << n

use the 'compressed' set to make future predictions
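As a sketch of how the compressed set is used at test time (illustrative; Z and y_Z denote the learned prototypes and their labels), prediction is the same 1-NN rule, only over m prototypes instead of n training inputs, i.e. O(dm) per query.

```python
# Illustrative: 1-NN prediction over the m learned prototypes instead of the
# full training set, reducing per-query cost from O(dn) to O(dm), with m << n.
import numpy as np

def one_nn_predict_compressed(Z, y_Z, x_test):
    dists = ((Z - x_test) ** 2).sum(axis=1)  # only m distance computations
    return y_Z[np.argmin(dists)]
```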

Approach

[figure: toy dataset with class 1, class 2, class 3]

steps to a new training set:
1. subsample
2. learn prototypes: move inputs to minimize 1-nn training error
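A minimal sketch of step 1 (subsample): the prototypes start as a small random subset of the training data, which step 2 then moves. The class-balanced sampling scheme here is an assumption of this sketch, not necessarily the authors' exact procedure.

```python
# Illustrative initialization for the prototypes: a class-balanced random
# subsample of the training data (the exact sampling scheme is an assumption).
import numpy as np

def subsample_init(X, y, m, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    per_class = max(1, m // len(classes))
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
        for c in classes
    ])
    return X[idx].copy(), y[idx].copy()  # prototypes Z_init and their labels y_Z
```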

Learning Prototypes

move inputs to minimize the 1-nn training error:

min_{z_1,…,z_m}  Σ_{i=1}^n  1[ nn_{z_1,…,z_m}(x_i) ≠ y_i ]

where nn_{z_1,…,z_m}(x_i) is the label of the nearest neighbor of x_i in the set {z_1, …, z_m}, and y_i is the true label.

[figure: input x_i surrounded by prototypes z_1, z_2, z_3, z_4]

problem: this objective is not continuous and not differentiable.

Stochastic Neighborhood [Hinton & Roweis, 2002]

p_i = (1/Z) Σ_{j : y_j = y_i} exp( −γ² ‖x_i − z_j‖²_2 ),

the probability that x_i is predicted correctly.

minimize the negative log-likelihood:

min_{z_1,…,z_m}  Σ_{i=1}^n  −log(p_i)

use conjugate gradient descent!
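Putting the pieces together, here is a compact, hedged sketch of the optimization: the soft probability p_i and its negative log-likelihood, minimized with SciPy's conjugate-gradient routine. Assumptions of this sketch: γ is fixed, and the gradient is approximated numerically for brevity (the paper uses the analytic gradient); this is not the authors' released code.

```python
# Illustrative SNC-style optimization: minimize the negative log-likelihood of the
# stochastic-neighborhood probabilities p_i with conjugate gradient (scipy 'CG').
# Assumptions: fixed gamma, numerical gradient (the paper derives the analytic one).
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(Z_flat, X, y, y_Z, gamma):
    m, d = len(y_Z), X.shape[1]
    Z = Z_flat.reshape(m, d)
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # n x m squared distances
    S = np.exp(-(gamma ** 2) * sq)                        # stochastic-neighbor weights
    same = (y[:, None] == y_Z[None, :])                   # prototypes sharing x_i's label
    p = (S * same).sum(axis=1) / (S.sum(axis=1) + 1e-12)  # p_i
    return -np.log(p + 1e-12).sum()

def snc_fit(X, y, Z_init, y_Z, gamma=1.0):
    res = minimize(neg_log_likelihood, Z_init.ravel(),
                   args=(X, y, y_Z, gamma), method='CG')
    return res.x.reshape(Z_init.shape)
```

Usage, continuing the earlier sketches: Z_init, y_Z = subsample_init(X, y, m); Z = snc_fit(X, y, Z_init, y_Z).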

Results [a motivating visualization]

Yale-Faces: 38 people, ~64 images/person, lighting changes

[figures: subsampled real faces vs. learned faces]

Results [datasets]

[table: datasets with training inputs, number of classes, original & reduced features [Weinberger & Saul, 2009]]

Results [error]

[figure: test error vs. ratio of training inputs in the compressed set; curves for 1-NN on the full training set, initialization, training-set-consistent sampling, and our method]

Results [test-time speed-up]

complementary approaches: ball-trees, locality-sensitive hashing (LSH)

[figure: test-time speed-up vs. compression rate; 1-NN full dataset vs. 1-NN after SNC, ball-trees full dataset vs. ball-trees after SNC, LSH full dataset vs. LSH after SNC]
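The "complementary approaches" point can be sketched as follows (illustrative): build a standard nearest-neighbor data structure, here scikit-learn's BallTree (LSH would be analogous), over the m prototypes instead of the full training set, so its speed-up stacks on top of the compression.

```python
# Illustrative: combine SNC with a ball-tree by indexing the m prototypes
# instead of the n training inputs (LSH could be used the same way).
import numpy as np
from sklearn.neighbors import BallTree

def build_compressed_index(Z):
    return BallTree(Z)                 # index over prototypes only

def predict_with_index(tree, y_Z, X_test):
    _, ind = tree.query(X_test, k=1)   # nearest prototype for each test input
    return y_Z[ind[:, 0]]
```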

Summary

Stochastic Neighbor Compression
▫ learns a compressed training set for NN classifier
▫ Stochastic Neighborhood [Hinton & Roweis, 2002]
▫ compression by 96% without error increase in 5/7 cases
▫ test-time speed-ups on top of NN data structures

[figure: original data, subsample, after optimization]

Thank you! Questions?

Results [robustness to label noise]

[figure: test error vs. noise added to training labels; curves for our method (also using larger k) and training-set-consistent sampling]
