Beyond Locality Sensitive Hashing
Alex Andoni (Microsoft Research)
Joint with: Piotr Indyk (MIT), Huy L. Nguyen (Princeton), Ilya Razenshteyn (MIT)
Nearest Neighbor Search (NNS)
• Preprocess: a set D of n points
• Query: given a new point q, report a point in D closest to q
Motivation
• Generic setup:
  • Points model objects (e.g., images)
  • Distance models (dis)similarity measure
• Application areas:
  • machine learning: k-NN rule
  • image/video/music recognition, deduplication, bioinformatics, etc.
• Distance can be: Hamming, Euclidean, …
• Primitive for other problems: finding similar pairs, clustering, …
[Figure: two objects encoded as bit strings, e.g. 000000 011100 010100 000100 010100 011111 vs. 000000 001100 000100 000100 110100 111111]
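As a quick illustration (a sketch added here, not part of the original slides), the Hamming distance between two such bit strings is simply the number of coordinates where they differ:

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Two of the 6-bit strings from the example above differ in one coordinate:
hamming("011100", "001100")
```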
Approximate NNS
• c-approximate r-near neighbor: given a query q, if some data point p lies within distance r of q, report any data point within distance cr of q
[Figure: query q, a near point p at distance ≤ r, and the approximation radius cr]
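To make the problem statement concrete, here is a minimal baseline (an illustrative sketch, not from the slides): a linear scan answers a c-approximate r-near neighbor query in O(n) time, since any point within distance cr of q is a valid answer whenever some point lies within distance r. The data structures below aim to beat this per-query cost.

```python
import math

def approx_near_neighbor(data, q, r, c):
    # Linear scan: return the first point within distance c*r of q.
    # This is a correct answer to the c-approximate r-near neighbor
    # problem whenever some data point lies within distance r of q.
    for p in data:
        if math.dist(p, q) <= c * r:
            return p
    return None  # correct only when no data point is within distance r of q

approx_near_neighbor([(0.0, 0.0), (5.0, 5.0)], q=(0.5, 0.0), r=1.0, c=2.0)
# returns (0.0, 0.0)
```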
Locality-Sensitive Hashing [Indyk-Motwani'98]
• Random hash function h satisfying:
  • for a close pair (dist(q, p) ≤ r): Pr[h(q) = h(p)] = P1 is "not-so-small"
  • for a far pair (dist(q, p′) > cr): Pr[h(q) = h(p′)] = P2 is small
• Query time is governed by the exponent ρ = log(1/P1) / log(1/P2)
Locality sensitive hash functions [Indyk-Motwani'98]
• Example (Hamming space): sample a random coordinate i and set h(p) = p_i
• Then Pr[h(q) = h(p)] = 1 − Ham(q, p)/d
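The bit-sampling family can be sketched in a few lines (an illustrative sketch; concatenating k sampled bits is the standard way to amplify the gap between P1 and P2):

```python
import random

def bit_sampling_hash(d, k, seed=0):
    # h(p) = (p_i1, ..., p_ik) for k coordinates sampled uniformly from [d].
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in coords)

# Empirically, for k = 1 the collision probability over the random choice
# of h is 1 - Ham(q, p)/d; the strings below differ in 1 of 6 coordinates.
trials = 10000
hits = sum(bit_sampling_hash(6, 1, seed=s)("011100") ==
           bit_sampling_hash(6, 1, seed=s)("001100")
           for s in range(trials))
# hits / trials is close to 1 - 1/6 ≈ 0.83
```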
Algorithms and Lower Bounds (space n^{1+ρ}, query time n^ρ)

Hamming space:
  ρ = 1/c (LSH upper bound) [IM'98]
  cell-probe lower bounds [PTW'08, PTW'10]
  ρ ≥ Ω(1/c) for any LSH [MNP'06], tightened to ρ ≥ 1/c − o(1) [OWZ'11]

Euclidean space:
  ρ ≈ 1/c² (LSH upper bound) [IM'98], [DIIM'04, AI'06]
  cell-probe lower bounds [PTW'08, PTW'10]
  ρ ≥ Ω(1/c²) for any LSH [MNP'06], tightened to ρ ≥ 1/c² − o(1) [OWZ'11]
LSH is tight… leave the rest to cell-probe lower bounds?
Main Result
• NNS data structure with space n^{1+ρ} and query time n^ρ, where:
  • ρ = 7/(8c) + O(1/c^{3/2}) in the Hamming space
  • ρ = 7/(8c²) + O(1/c³) in the Euclidean space
• Beats the tight LSH exponents (1/c and 1/c²) via data-dependent hashing
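Plugging in a concrete approximation factor shows the size of the improvement (a back-of-the-envelope sketch using leading terms only; the O(1/c^{3/2}) and O(1/c³) corrections are ignored):

```python
# Query time is n^rho; compare the tight LSH exponents with the new ones.
def rho_lsh_hamming(c):   return 1 / c             # [IM'98], tight by [OWZ'11]
def rho_new_hamming(c):   return 7 / (8 * c)       # this work, leading term
def rho_lsh_euclidean(c): return 1 / c ** 2        # [AI'06]
def rho_new_euclidean(c): return 7 / (8 * c ** 2)  # this work, leading term

c = 2
print(rho_lsh_hamming(c), rho_new_hamming(c))      # 0.5 vs 0.4375
print(rho_lsh_euclidean(c), rho_new_euclidean(c))  # 0.25 vs 0.21875
```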
A look at LSH lower bounds [O'Donnell-Wu-Zhou'11]
• For Hamming space, any LSH family must satisfy ρ = log(1/P1)/log(1/P2) ≥ 1/c − o(1)
Why not NNS lower bound?
• The lower bound holds only for data-independent hash families: h is chosen before seeing the data set
• It does not rule out better partitions that adapt to the data
Our algorithm: intuition
• Two steps: show that "nice" point configurations admit partitions with a better exponent, then reduce an arbitrary data set to such configurations
Nice Configuration: “sparsity”
Reduction: into spherical LSH
Two-level algorithm
Details
Practice
• Practice uses data-dependent partitions!
• "wherever theoreticians suggest to use random dimensionality reduction, use PCA"
• Lots of variants:
  • Trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees, …
  • no guarantees: e.g., they are deterministic
• Is there a better way to do partitions in practice?
• Why do PCA-trees work?
  • [Abdullah-A-Kannan-Krauthgamer]: if the data has more structure
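A one-level PCA-tree-style split can be sketched as follows (an illustrative toy in 2D, not the construction from any of the cited papers): instead of projecting onto a random direction, project onto the top principal direction of the data and split at the median projection.

```python
import math, random

def top_principal_direction(points):
    # Closed-form leading eigenvector of the 2x2 covariance matrix [[a,b],[b,c]].
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

def pca_split(points):
    # One PCA-tree node: split at the median projection onto the top direction.
    vx, vy = top_principal_direction(points)
    proj = sorted(x * vx + y * vy for x, y in points)
    median = proj[len(proj) // 2]
    left = [p for p in points if p[0] * vx + p[1] * vy < median]
    right = [p for p in points if p[0] * vx + p[1] * vy >= median]
    return left, right

# Data with most variance along the x-axis: the split recovers that axis.
rng = random.Random(0)
pts = [(rng.gauss(0, 10), rng.gauss(0, 1)) for _ in range(500)]
vx, vy = top_principal_direction(pts)
# |vx| is close to 1: the data-dependent direction aligns with the spread.
```

The contrast with a data-independent method is the point of the slide: a random projection direction ignores where the variance actually is, while the PCA direction adapts to it.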
Finale