Beyond Locality Sensitive Hashing
Alex Andoni (Microsoft Research)
Joint with: Piotr Indyk (MIT), Huy L. Nguyen (Princeton), Ilya Razenshteyn (MIT)
Nearest Neighbor Search (NNS)
• Preprocess: a set D of n points
• Query: given a new point q, report a point in D closest to q
Motivation
• Generic setup:
  • Points model objects (e.g., images)
  • Distance models (dis)similarity measure
• Application areas:
  • machine learning: k-NN rule
  • image/video/music recognition, deduplication, bioinformatics, etc.
• Distance can be: Hamming, Euclidean, …
• Primitive for other problems: finding similar pairs, clustering, …
[Figure: two objects encoded as bit strings, e.g. 000000 011100 010100 000100 010100 011111 vs. 000000 001100 000100 000100 110100 111111]
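As a quick illustration (a sketch added here, not part of the original slides), the Hamming distance between two such bit strings is simply the number of coordinates where they differ:

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Two of the 6-bit strings from the example above differ in one coordinate:
hamming("011100", "001100")
```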
Approximate NNS
• c-approximate r-near neighbor: given a query q, if some data point p lies within distance r of q, report any data point within distance cr of q
[Figure: query q, a near point p at distance ≤ r, and the approximation radius cr]
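To make the problem statement concrete, here is a minimal baseline (an illustrative sketch, not from the slides): a linear scan answers a c-approximate r-near neighbor query in O(n) time, since any point within distance cr of q is a valid answer whenever some point lies within distance r. The data structures below aim to beat this per-query cost.

```python
import math

def approx_near_neighbor(data, q, r, c):
    # Linear scan: return the first point within distance c*r of q.
    # This is a correct answer to the c-approximate r-near neighbor
    # problem whenever some data point lies within distance r of q.
    for p in data:
        if math.dist(p, q) <= c * r:
            return p
    return None  # correct only when no data point is within distance r of q

approx_near_neighbor([(0.0, 0.0), (5.0, 5.0)], q=(0.5, 0.0), r=1.0, c=2.0)
# returns (0.0, 0.0)
```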
Locality-Sensitive Hashing [Indyk-Motwani'98]
• Random hash function h satisfying:
  • for a close pair (dist(q, p) ≤ r): Pr[h(q) = h(p)] = P1 is "not-so-small"
  • for a far pair (dist(q, p′) > cr): Pr[h(q) = h(p′)] = P2 is small
• Query time is governed by the exponent ρ = log(1/P1) / log(1/P2)
Locality sensitive hash functions [Indyk-Motwani'98]
• Example (Hamming space): sample a random coordinate i and set h(p) = p_i
• Then Pr[h(q) = h(p)] = 1 − Ham(q, p)/d
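The bit-sampling family can be sketched in a few lines (an illustrative sketch; concatenating k sampled bits is the standard way to amplify the gap between P1 and P2):

```python
import random

def bit_sampling_hash(d, k, seed=0):
    # h(p) = (p_i1, ..., p_ik) for k coordinates sampled uniformly from [d].
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in coords)

# Empirically, for k = 1 the collision probability over the random choice
# of h is 1 - Ham(q, p)/d; the strings below differ in 1 of 6 coordinates.
trials = 10000
hits = sum(bit_sampling_hash(6, 1, seed=s)("011100") ==
           bit_sampling_hash(6, 1, seed=s)("001100")
           for s in range(trials))
# hits / trials is close to 1 - 1/6 ≈ 0.83
```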
Algorithms and Lower Bounds (space n^{1+ρ}, query time n^ρ)

Hamming space:
  ρ = 1/c (LSH upper bound) [IM'98]
  cell-probe lower bounds [PTW'08, PTW'10]
  ρ ≥ Ω(1/c) for any LSH [MNP'06], tightened to ρ ≥ 1/c − o(1) [OWZ'11]

Euclidean space:
  ρ ≈ 1/c² (LSH upper bound) [IM'98], [DIIM'04, AI'06]
  cell-probe lower bounds [PTW'08, PTW'10]
  ρ ≥ Ω(1/c²) for any LSH [MNP'06], tightened to ρ ≥ 1/c² − o(1) [OWZ'11]
LSH is tight… leave the rest to cell-probe lower bounds?
Main Result
• NNS data structure with space n^{1+ρ} and query time n^ρ, where:
  • ρ = 7/(8c) + O(1/c^{3/2}) in the Hamming space
  • ρ = 7/(8c²) + O(1/c³) in the Euclidean space
• Beats the tight LSH exponents (1/c and 1/c²) via data-dependent hashing
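Plugging in a concrete approximation factor shows the size of the improvement (a back-of-the-envelope sketch using leading terms only; the O(1/c^{3/2}) and O(1/c³) corrections are ignored):

```python
# Query time is n^rho; compare the tight LSH exponents with the new ones.
def rho_lsh_hamming(c):   return 1 / c             # [IM'98], tight by [OWZ'11]
def rho_new_hamming(c):   return 7 / (8 * c)       # this work, leading term
def rho_lsh_euclidean(c): return 1 / c ** 2        # [AI'06]
def rho_new_euclidean(c): return 7 / (8 * c ** 2)  # this work, leading term

c = 2
print(rho_lsh_hamming(c), rho_new_hamming(c))      # 0.5 vs 0.4375
print(rho_lsh_euclidean(c), rho_new_euclidean(c))  # 0.25 vs 0.21875
```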
A look at LSH lower bounds [O'Donnell-Wu-Zhou'11]
• For Hamming space, any LSH family must satisfy ρ = log(1/P1)/log(1/P2) ≥ 1/c − o(1)
Why not NNS lower bound?
• The lower bound holds only for data-independent hash families: h is chosen before seeing the data set
• It does not rule out better partitions that adapt to the data
Our algorithm: intuition
• Two steps: show that "nice" point configurations admit partitions with a better exponent, then reduce an arbitrary data set to such configurations
Nice Configuration: “sparsity”
Reduction: into spherical LSH
Two-level algorithm
Details
Practice
• Practice uses data-dependent partitions!
• "wherever theoreticians suggest to use random dimensionality reduction, use PCA"
• Lots of variants:
  • Trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees, …
  • no guarantees: e.g., they are deterministic
• Is there a better way to do partitions in practice?
• Why do PCA-trees work?
  • [Abdullah-A-Kannan-Krauthgamer]: if the data has more structure
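A one-level PCA-tree-style split can be sketched as follows (an illustrative toy in 2D, not the construction from any of the cited papers): instead of projecting onto a random direction, project onto the top principal direction of the data and split at the median projection.

```python
import math, random

def top_principal_direction(points):
    # Closed-form leading eigenvector of the 2x2 covariance matrix [[a,b],[b,c]].
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

def pca_split(points):
    # One PCA-tree node: split at the median projection onto the top direction.
    vx, vy = top_principal_direction(points)
    proj = sorted(x * vx + y * vy for x, y in points)
    median = proj[len(proj) // 2]
    left = [p for p in points if p[0] * vx + p[1] * vy < median]
    right = [p for p in points if p[0] * vx + p[1] * vy >= median]
    return left, right

# Data with most variance along the x-axis: the split recovers that axis.
rng = random.Random(0)
pts = [(rng.gauss(0, 10), rng.gauss(0, 1)) for _ in range(500)]
vx, vy = top_principal_direction(pts)
# |vx| is close to 1: the data-dependent direction aligns with the spread.
```

The contrast with a data-independent method is the point of the slide: a random projection direction ignores where the variance actually is, while the PCA direction adapts to it.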
Finale