Sergii Shelpuk: "Efficient search for similar objects in big data using an office laptop"

Efficient Similarity Search on Big Data with office laptop
Sergii Shelpuk
Head of Data Science, V.I.Tech

The Problem
You have a database of 30M patients with all their medical records. Each patient is described by 250K binary features.

You need a system for finding the N most similar patients to a given one.
Jesus Christ, it's Big Data, get Hadoop!

Can we do better?
Two main ideas:
we don't need the meaning of each feature, we only care about the similarity of the patients;
we don't want to compare very different patients, we want to compare only the most similar ones.

Step 1: Reduce dimensionality
Decrease the dimensionality of the data while preserving similarities
Locality-sensitive hashing and minhashing
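The minhashing step can be sketched as follows. This is a minimal illustration, not the implementation from the talk: the hash family `(a*x + b) mod PRIME`, the signature length, and the names `make_hash_params`, `minhash_signature` and `estimated_jaccard` are all assumptions. Each patient is assumed to be given as the set of indices of the features that are set to 1.

```python
import random

# Assumption: a prime larger than any feature index (250K features in the talk).
PRIME = 2_147_483_647  # 2**31 - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(active_features, params):
    """Return one minimum hash value per hash function.

    active_features: non-empty iterable of indices of features equal to 1.
    """
    features = list(active_features)
    if not features:
        raise ValueError("patient has no active features")
    return [min((a * x + b) % PRIME for x in features) for a, b in params]


def estimated_jaccard(sig1, sig2):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


if __name__ == "__main__":
    params = make_hash_params(128)
    patient_a = {3, 17, 42, 1000, 249_999}
    patient_b = {3, 17, 42, 1000, 5}
    sig_a = minhash_signature(patient_a, params)
    sig_b = minhash_signature(patient_b, params)
    # True Jaccard similarity is 4/6; the printed estimate should be near it.
    print(estimated_jaccard(sig_a, sig_b))
```

With 128 hash functions each patient shrinks from a 250K-bit vector to 128 integers, while the share of matching signature positions still approximates how much two patients' feature sets overlap.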

K-Means clustering
K-Means clustering puts similar patients into the same group

Step 2: Group similar
Group similar patients and store each group as a separate file
Store the centroids of all clusters together in one separate file, too
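A rough sketch of this grouping step is shown below. The slides do not say which vector representation, distance, or file format was used, so everything here is an assumption: patients are taken to be short numeric vectors, K-Means is plain Lloyd's algorithm with squared Euclidean distance, and the JSON layout (`centroids.json`, `cluster_<i>.json`) is hypothetical.

```python
import json
import os
import random


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, k, iterations=10, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        assignment = [nearest_centroid(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    # Final assignment against the final centroids.
    assignment = [nearest_centroid(v, centroids) for v in vectors]
    return centroids, assignment


def write_clusters(out_dir, patient_ids, vectors, centroids, assignment):
    """Write one file with all centroids and one file per cluster."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "centroids.json"), "w") as f:
        json.dump(centroids, f)
    for c in range(len(centroids)):
        members = [{"id": pid, "vector": list(v)}
                   for pid, v, a in zip(patient_ids, vectors, assignment)
                   if a == c]
        with open(os.path.join(out_dir, f"cluster_{c}.json"), "w") as f:
            json.dump(members, f)
```

This pure-Python loop only illustrates the data flow; clustering 30M patients into 50,000 groups would need an optimized, likely mini-batch or out-of-core, K-Means implementation.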

Approach
To find N similar patients:
Load a patient
Reduce dimensionality with minhashing
Load centroid file
Compare patient to every centroid
Load cluster file of the closest centroid
Compare patient with patients in the cluster
Show top N similar
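The lookup described above might look like the sketch below. It assumes the query patient has already been reduced to a short numeric vector, that distances are squared Euclidean, and that the data sits in a hypothetical layout of `centroids.json` (a list of vectors) plus `cluster_<i>.json` files (lists of `{"id", "vector"}` records). None of these details come from the talk.

```python
import heapq
import json
import os


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def find_similar(query_vector, data_dir, n):
    """Return the ids of the n stored patients closest to query_vector."""
    # Load the centroid file (the only file that is always read).
    with open(os.path.join(data_dir, "centroids.json")) as f:
        centroids = json.load(f)

    # Compare the patient to every centroid and pick the closest one.
    closest = min(range(len(centroids)),
                  key=lambda i: squared_distance(query_vector, centroids[i]))

    # Load only the cluster file that belongs to the closest centroid.
    with open(os.path.join(data_dir, f"cluster_{closest}.json")) as f:
        members = json.load(f)

    # Compare the patient with the members of that cluster, keep the top n.
    best = heapq.nsmallest(
        n, members,
        key=lambda m: squared_distance(query_vector, m["vector"]))
    return [m["id"] for m in best]
```

Only two files are read per query, which is what keeps memory use small. The trade-off is that the search is approximate: a truly close patient that landed in a neighbouring cluster will not be found.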

Results
50,000 clusters, up to ~1,000 patients per cluster
~500 KB to 1 MB per cluster file
~18 MB centroid file

To do similarity search you need:
~20 GB HDD
~20 MB RAM
Search works in ~100 milliseconds on a regular office laptop
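The figures from the slides can be sanity-checked with a little arithmetic. The sketch below assumes that the sizes on the slides are decimal gigabytes and megabytes, and that the raw data would otherwise be stored as a dense bit matrix; neither is stated in the talk.

```python
PATIENTS = 30_000_000
FEATURES = 250_000
CLUSTERS = 50_000

# Dense bit matrix: one bit per (patient, feature) pair.
raw_bytes = PATIENTS * FEATURES // 8

# Sizes reported on the slides, read as decimal units (an assumption).
disk_bytes = 20 * 10**9
centroid_file_bytes = 18 * 10**6

bytes_per_patient = disk_bytes / PATIENTS
avg_patients_per_cluster = PATIENTS / CLUSTERS
bytes_per_centroid = centroid_file_bytes / CLUSTERS

print(f"raw data:             {raw_bytes / 10**9:.1f} GB")
print(f"stored per patient:   {bytes_per_patient:.0f} bytes")
print(f"patients per cluster: {avg_patients_per_cluster:.0f} on average")
print(f"stored per centroid:  {bytes_per_centroid:.0f} bytes")
```

Under these assumptions the dense matrix would take about 937.5 GB, against ~20 GB after reduction, which is roughly 667 bytes per patient. The average cluster holds 600 patients, and each centroid takes about 360 bytes.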

Thank you?