Efficient Similarity Search on Big Data with office laptop
Sergii Shelpuk
Head of Data Science, V.I.Tech

The Problem
You have a database of 30M patients with all their medical records. Each patient is described by 250K binary features.
You need a system for finding the N most similar patients to a given one.
Jesus Christ, it's Big Data, get Hadoop!

Can we do better?
Two main ideas:
- we don't need the meaning of each feature, we only care about the similarity of the patients;
- we don't want to compare very different patients, we want to compare only the most similar ones.

Step 1: Reduce dimensionality
Decrease the dimensionality of the data while preserving similarities
Locality-sensitive hashing and minhashing

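The minhashing step named above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function names, the number of hash functions, and the linear hash family `(a * x + b) mod PRIME` are all assumptions. Each patient is represented by the set of indices of its features that equal 1, and the fraction of matching signature positions estimates the Jaccard similarity of two patients.

```python
import random

# Assumption: a Mersenne prime comfortably larger than the 250K feature indices.
PRIME = 2_147_483_647


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(active_features, params):
    """Return one minimum hash value per hash function.

    active_features is the set of feature indices that are 1 for a patient.
    An empty set yields PRIME in every position (an arbitrary sentinel).
    """
    return [min(((a * f + b) % PRIME for f in active_features), default=PRIME)
            for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree (Jaccard estimate)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

With, say, 256 hash functions, each patient shrinks from 250K binary features to 256 integers, while patients with many shared features still tend to produce similar signatures.
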
K-Means clustering
K-Means clustering puts similar patients into the same group

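For readers who have not seen it, K-Means can be sketched in a few lines of plain Python. The slides do not say which distance or which implementation was used, so the squared Euclidean distance, the random initialisation, and the fixed iteration count below are assumptions chosen for brevity.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    # Assumption: initialise centroids with k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    # Final assignment so the labels match the returned centroids.
    return centroids, assign(points, centroids)
```

At the scale in the talk (30M patients, 50000 clusters) a batched or library implementation would be the practical choice; the sketch only shows the grouping logic.
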
Step 2: Group similar
Group similar patients and store groups as separate files
Store centroids of each cluster in a separate file, too

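One possible on-disk layout for this step is sketched below. The file names (`cluster_<label>.json`, `centroids.json`) and the JSON format are hypothetical; the slides only say that each group and the centroids live in separate files.

```python
import json
from pathlib import Path


def write_cluster_files(out_dir, patient_ids, signatures, labels, centroids):
    """Write one file per cluster plus one file holding all centroids.

    labels[i] is the cluster index of patient i, and centroids[c] is the
    centroid of cluster c. Returns the sorted labels that received a file.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Group the reduced patient records by cluster label.
    groups = {}
    for pid, sig, label in zip(patient_ids, signatures, labels):
        groups.setdefault(label, []).append({"id": pid, "signature": sig})

    # Hypothetical naming scheme: cluster_<label>.json
    for label, members in groups.items():
        (out / f"cluster_{label}.json").write_text(json.dumps(members))

    # The centroid file is small enough to be read in full on every query.
    (out / "centroids.json").write_text(json.dumps(centroids))
    return sorted(groups)
```

Keeping each cluster in its own file means a query later touches only the centroid file and a single cluster file rather than the whole dataset.
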
Approach
To find N similar patients:
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show top N similar

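The lookup steps listed above can be sketched as a single function. It assumes the query patient has already been reduced to a signature, that the index directory holds a hypothetical `centroids.json` plus one `cluster_<index>.json` per centroid, and that squared Euclidean distance is used for both comparisons; none of these details are stated in the slides.

```python
import heapq
import json
from pathlib import Path


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def find_similar(query_signature, index_dir, n):
    """Return the n stored patients closest to the query signature."""
    index = Path(index_dir)

    # Load the centroid file and compare the query to every centroid.
    centroids = json.loads((index / "centroids.json").read_text())
    closest = min(range(len(centroids)),
                  key=lambda c: squared_distance(query_signature, centroids[c]))

    # Load only the cluster file that belongs to the closest centroid.
    members = json.loads((index / f"cluster_{closest}.json").read_text())

    # Compare the query with the patients in that cluster, keep the top n.
    return heapq.nsmallest(
        n, members,
        key=lambda m: squared_distance(query_signature, m["signature"]))
```

Because only one cluster is scanned, the result is approximate: a very similar patient that landed in a neighbouring cluster will be missed.
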
Results
50,000 clusters
up to ~1000 patients per cluster
~500 KB-1 MB per cluster file
~18 MB centroid file

To do similarity search you need:
~20 GB HDD
~20 MB RAM
Search works in ~100 milliseconds on a regular office laptop

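As a rough sanity check, the averages implied by the quoted figures can be computed directly. The sketch assumes decimal units (1 MB = 10^6 bytes) and a dense one-bit-per-feature encoding for the raw data; both are assumptions, and the outputs are implied averages rather than numbers reported in the talk.

```python
def implied_sizes():
    """Derive averages from the figures quoted on the slides."""
    patients = 30_000_000        # patients in the database
    features = 250_000           # binary features per patient
    clusters = 50_000            # clusters reported in the results
    disk_bytes = 20e9            # ~20 GB of disk, decimal units assumed
    centroid_file_bytes = 18e6   # ~18 MB centroid file, decimal units assumed

    return {
        # Average cluster size; the slides quote a maximum of ~1000.
        "avg_patients_per_cluster": patients / clusters,
        # Average stored size of one reduced patient record.
        "bytes_per_patient_on_disk": disk_bytes / patients,
        # Average size of one centroid in the centroid file.
        "bytes_per_centroid": centroid_file_bytes / clusters,
        # Raw data at one bit per feature, in GB, for comparison.
        "raw_dense_gb": patients * features / 8 / 1e9,
    }


if __name__ == "__main__":
    for name, value in implied_sizes().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the index is roughly 47 times smaller than the raw dense data (about 937.5 GB versus about 20 GB), and a query reads only the centroid file plus one cluster file.
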
Thank you?