Fast and Accurate K-means for Large Datasets
Michael Shindler, Alex Wong, Adam Meyerson
Presenter: Yoh Okuno #nipsreading
About Presenter
• Name: Yoh Okuno
• R&D Engineer at Yahoo! Japan
• Interests: NLP (Natural Language Processing), Machine Learning, and Data Mining
• Skills: C/C++, Java, Python, and Hadoop
• Website: http://yoh.okuno.name/
Overview
1. Recent Advances in K-means Clustering
– Batch versus Streaming Settings
– Related Work and Our Contributions
2. Algorithm for Large-Scale K-means Clustering
– Streaming + Mini-Batch + Smart Initialization
3. Incorporating Approximate Nearest Neighbor Search
– Based on Random Projection (Hashing)
4. Evaluation and Discussion
1. Recent Advances in K-means Clustering
Review of Standard K-means Clustering
• Minimize cost function below iteratively:
1. Update z with fixed μ (assign cluster number)
2. Update μ with fixed z (calculate average)
minimize: \sum_{i=1}^{N} \| x_i - \mu_{z_i} \|^2
Where: x_i is the i-th data point, z_i is its cluster assignment, and μ_j is the centroid of the j-th cluster.
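The two alternating updates above can be sketched in a few lines of NumPy (a minimal illustration of the standard batch algorithm, not the paper's implementation; `lloyd_kmeans` and its parameters are names chosen here):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Minimize sum_i ||x_i - mu_{z_i}||^2 by alternating the two updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k random data points.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # 1. Update z with mu fixed: assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # 2. Update mu with z fixed: recompute each centroid as the cluster mean.
        for j in range(k):
            if (z == j).any():
                mu[j] = X[z == j].mean(axis=0)
    return z, mu
```

Each iteration costs O(Nk) distance computations, which is what motivates the streaming and approximate-search ideas on the following slides.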
Related Work and Our Contributions
• The standard batch algorithm [Lloyd 1982]
• Streaming approaches [Aggarwal 2007]
• Mini-batch approaches [Sculley 2010]
• Our work builds on a recent streaming approach [Braverman+ 2011]
• We additionally incorporate approximate nearest neighbor search
2. Algorithm for Large-Scale K-means Clustering
Pipeline: Initialize → Streaming → Mini-Batch
Initialize clusters
• Create clusters until the buffer is full
– Run a nearest neighbor search for each new data point
– Open a new cluster at the point with probability proportional to its distance
Streaming K-means Clustering
• Keep creating new clusters randomly in the same way as in the initialization
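The "open a new cluster with probability proportional to its distance" step can be sketched as an online facility-location pass (a simplified sketch: the full algorithm of [Braverman+ 2011] also re-consolidates facilities when the budget is exceeded, which is only hinted at here; all names are illustrative):

```python
import random

def stream_kmeans(points, max_facilities, f=1.0):
    """One streaming pass: each point either joins its nearest facility
    (adding weight) or opens a new facility with probability proportional
    to its squared distance, scaled by the facility cost f."""
    facilities = []  # list of [center, weight]
    for x in points:
        if not facilities:
            facilities.append([x, 1])
            continue
        # Nearest-neighbor search among current facilities (the bottleneck).
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c, _ in facilities]
        j = min(range(len(dists)), key=dists.__getitem__)
        if random.random() < min(dists[j] / f, 1.0):
            facilities.append([x, 1])   # open a new cluster at x
        else:
            facilities[j][1] += 1       # absorb x into its nearest cluster
        if len(facilities) > max_facilities:
            f *= 2  # raise the cost; the full algorithm would also
                    # re-cluster the existing facilities at this point
    return facilities
```

The output is a small set of weighted centers, which the final ball k-means phase then clusters down to exactly k centroids.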
Ball K-means on Weighted Points
• Run ball k-means on the weighted cluster centers [Braverman+ 2011] [Ostrovsky+ 2006]
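One way to read "ball k-means on weighted points" is the following sketch: farthest-point seeding (standing in for the paper's smart initialization), then weighted Lloyd updates where each centroid is recomputed only from points inside a ball around it, here with radius one third of the gap to the nearest other centroid, in the spirit of [Ostrovsky+ 2006]. Function and parameter names are chosen here, not taken from the paper:

```python
import numpy as np

def ball_kmeans_weighted(C, w, k, n_iter=5):
    """Cluster weighted centers C (shape n x d, weights w) into k centroids."""
    # Farthest-point seeding: start from the heaviest point, then repeatedly
    # take the point farthest from all seeds chosen so far.
    mu = [C[int(np.argmax(w))]]
    for _ in range(k - 1):
        dmin = np.min(np.stack([np.linalg.norm(C - m, axis=1) for m in mu]), axis=0)
        mu.append(C[int(np.argmax(dmin))])
    mu = np.array(mu, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.norm(C[:, None, :] - mu[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # Ball radius per centroid: a third of the gap to the nearest other one.
        gap = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=2)
        np.fill_diagonal(gap, np.inf)
        r = gap.min(axis=1) / 3.0
        for j in range(k):
            # Weighted mean over only the points inside centroid j's ball,
            # which keeps far-away outliers from dragging the centroid.
            ball = (z == j) & (d[:, j] <= r[j])
            if w[ball].sum() > 0:
                mu[j] = np.average(C[ball], axis=0, weights=w[ball])
    return mu
```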
3. Incorporating Approximate Nearest Neighbor Search
Bottleneck: nearest neighbor search among points
Approximate Nearest Neighbor Search
• Use a simple random projection
1. Draw ω ∈ R^d with each entry uniform in [0, 1)
2. Precompute the inner product of ω with each cluster center
3. Given a query x, compute the inner product x・ω
4. Return the cluster whose projected value is closest to x・ω
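The four steps above might look like this (a sketch of a single-direction random projection: the projections are precomputed and sorted so each query is a binary search instead of a scan over all centers; all names are chosen here):

```python
import numpy as np

def build_index(centers, seed=0):
    """Steps 1-2: draw omega and project all cluster centers onto it."""
    rng = np.random.default_rng(seed)
    omega = rng.random(centers.shape[1])  # entries uniform in [0, 1)
    proj = centers @ omega                # inner product with each center
    order = np.argsort(proj)
    return omega, proj[order], order

def approx_nearest(x, omega, sorted_proj, order):
    """Steps 3-4: project the query and return the center whose projection
    is closest; an approximation to the true nearest center in R^d."""
    q = x @ omega
    i = np.searchsorted(sorted_proj, q)
    # The closest projection is at position i-1 or i in the sorted array.
    cands = [j for j in (i - 1, i) if 0 <= j < len(sorted_proj)]
    best = min(cands, key=lambda j: abs(sorted_proj[j] - q))
    return order[best]
```

Each query is O(log k) instead of O(kd), at the cost of occasionally returning a center that is only near-nearest after projection to one dimension.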
4. Evaluation and Discussion
Datasets
• BigCross dataset:
– Size: 11 million points in 55 dimensions
• Census 1990: national survey
– 2 million points in 68 dimensions
• Environment: C++ / Ubuntu / 2.9 GHz CPU / 6 GB RAM
(Results figures: lower cost is better; lower time is better.)
Conclusion
• Proposed a fast and accurate k-means clustering algorithm based on a streaming approach
• Incorporated approximate nearest neighbor search into the proposed algorithm
• Strong results in both theory and practice
References
• [Lloyd 1982] Least Squares Quantization in PCM. IEEE Transactions on Information Theory.
• [Aggarwal 2007] Data Streams: Models and Algorithms. Springer.
• [Braverman+ 2011] Streaming K-means on Well-Clusterable Data. SODA.
• [Ackermann+ 2010] StreamKM++: A Clustering Algorithm for Data Streams. ALENEX.
• [Sculley 2010] Web-Scale K-means Clustering. WWW.
Any Questions?