Fast and Accurate K-means for Large Datasets
Michael Shindler, Alex Wong, Adam Meyerson
Presenter: Yoh Okuno #nipsreading
About Presenter
• Name: Yoh Okuno
• R&D Engineer at Yahoo! Japan
• Interests: NLP (Natural Language Processing), Machine Learning, and Data Mining
• Skills: C/C++, Java, Python, and Hadoop
• Website: http://yoh.okuno.name/
Overview
1. Recent Advances in K-means Clustering
– Batch versus Streaming Settings
– Related Work and Our Contributions
2. Algorithm for Large-Scale K-means Clustering
– Streaming + Mini-Batch + Smart Initialization
3. Incorporating Approximate Nearest Neighbor Search
– Based on Random Projection (Hashing)
4. Evaluation and Discussion
1. Recent Advances in K-means Clustering
Review of Standard K-means Clustering
• Minimize cost function below iteratively:
1. Update z with fixed μ (assign cluster number)
2. Update μ with fixed z (calculate average)
minimize: \sum_{i=1}^{N} \| x_i - \mu_{z_i} \|^2
Where: x_i is the i-th data point, z_i is its cluster assignment, and μ_j is the centroid of the j-th cluster.
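The two alternating updates above can be sketched in a few lines of NumPy (a minimal illustration of the standard batch algorithm, not the paper's implementation; `lloyd_kmeans` and its parameters are names chosen here):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Minimize sum_i ||x_i - mu_{z_i}||^2 by alternating the two updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k random data points.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # 1. Update z with mu fixed: assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # 2. Update mu with z fixed: recompute each centroid as the cluster mean.
        for j in range(k):
            if (z == j).any():
                mu[j] = X[z == j].mean(axis=0)
    return z, mu
```

Each iteration costs O(Nk) distance computations, which is what motivates the streaming and approximate-search ideas on the following slides.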
Related Work and Our Contributions
• The standard batch algorithm [Lloyd 1982]
• Streaming approaches [Aggarwal 2007]
• Mini-batch approaches [Sculley 2010]
• Our work builds on a recent streaming approach [Braverman+ 2011]
• We additionally incorporate approximate nearest neighbor search
2. Algorithm for Large-Scale K-means Clustering
Pipeline: Initialize → Streaming → Mini-Batch
Initialize clusters
• Create clusters until the buffer is full
– Run a nearest neighbor search for each new data point
– Open a new cluster at the point with probability proportional to its distance
Streaming K-means Clustering
• Keep creating new clusters randomly in the same way as in the initialization
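The "open a new cluster with probability proportional to its distance" step can be sketched as an online facility-location pass (a simplified sketch: the full algorithm of [Braverman+ 2011] also re-consolidates facilities when the budget is exceeded, which is only hinted at here; all names are illustrative):

```python
import random

def stream_kmeans(points, max_facilities, f=1.0):
    """One streaming pass: each point either joins its nearest facility
    (adding weight) or opens a new facility with probability proportional
    to its squared distance, scaled by the facility cost f."""
    facilities = []  # list of [center, weight]
    for x in points:
        if not facilities:
            facilities.append([x, 1])
            continue
        # Nearest-neighbor search among current facilities (the bottleneck).
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c, _ in facilities]
        j = min(range(len(dists)), key=dists.__getitem__)
        if random.random() < min(dists[j] / f, 1.0):
            facilities.append([x, 1])   # open a new cluster at x
        else:
            facilities[j][1] += 1       # absorb x into its nearest cluster
        if len(facilities) > max_facilities:
            f *= 2  # raise the cost; the full algorithm would also
                    # re-cluster the existing facilities at this point
    return facilities
```

The output is a small set of weighted centers, which the final ball k-means phase then clusters down to exactly k centroids.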
Ball K-means on Weighted Points
• Run ball k-means on the weighted cluster centers [Braverman+ 2011] [Ostrovsky+ 2006]
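One way to read "ball k-means on weighted points" is the following sketch: farthest-point seeding (standing in for the paper's smart initialization), then weighted Lloyd updates where each centroid is recomputed only from points inside a ball around it, here with radius one third of the gap to the nearest other centroid, in the spirit of [Ostrovsky+ 2006]. Function and parameter names are chosen here, not taken from the paper:

```python
import numpy as np

def ball_kmeans_weighted(C, w, k, n_iter=5):
    """Cluster weighted centers C (shape n x d, weights w) into k centroids."""
    # Farthest-point seeding: start from the heaviest point, then repeatedly
    # take the point farthest from all seeds chosen so far.
    mu = [C[int(np.argmax(w))]]
    for _ in range(k - 1):
        dmin = np.min(np.stack([np.linalg.norm(C - m, axis=1) for m in mu]), axis=0)
        mu.append(C[int(np.argmax(dmin))])
    mu = np.array(mu, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.norm(C[:, None, :] - mu[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # Ball radius per centroid: a third of the gap to the nearest other one.
        gap = np.linalg.norm(mu[:, None, :] - mu[None, :, :], axis=2)
        np.fill_diagonal(gap, np.inf)
        r = gap.min(axis=1) / 3.0
        for j in range(k):
            # Weighted mean over only the points inside centroid j's ball,
            # which keeps far-away outliers from dragging the centroid.
            ball = (z == j) & (d[:, j] <= r[j])
            if w[ball].sum() > 0:
                mu[j] = np.average(C[ball], axis=0, weights=w[ball])
    return mu
```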
3. Incorporating Approximate Nearest Neighbor Search
Bottleneck: nearest neighbor search among points
Approximate Nearest Neighbor Search
• Use a simple random projection
1. Draw ω ∈ R^d with each entry uniform in [0, 1)
2. Precompute the inner product of ω with each cluster center
3. Given a query x, compute the inner product x・ω
4. Return the cluster whose projected value is closest to x・ω
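The four steps above might look like this (a sketch of a single-direction random projection: the projections are precomputed and sorted so each query is a binary search instead of a scan over all centers; all names are chosen here):

```python
import numpy as np

def build_index(centers, seed=0):
    """Steps 1-2: draw omega and project all cluster centers onto it."""
    rng = np.random.default_rng(seed)
    omega = rng.random(centers.shape[1])  # entries uniform in [0, 1)
    proj = centers @ omega                # inner product with each center
    order = np.argsort(proj)
    return omega, proj[order], order

def approx_nearest(x, omega, sorted_proj, order):
    """Steps 3-4: project the query and return the center whose projection
    is closest; an approximation to the true nearest center in R^d."""
    q = x @ omega
    i = np.searchsorted(sorted_proj, q)
    # The closest projection is at position i-1 or i in the sorted array.
    cands = [j for j in (i - 1, i) if 0 <= j < len(sorted_proj)]
    best = min(cands, key=lambda j: abs(sorted_proj[j] - q))
    return order[best]
```

Each query is O(log k) instead of O(kd), at the cost of occasionally returning a center that is only near-nearest after projection to one dimension.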
4. Evaluation and Discussion
Datasets
• BigCross dataset:
– Size: 11 million points in 55 dimensions
• Census 1990: national survey
– 2 million points in 68 dimensions
• Environment: C++ / Ubuntu / 2.9 GHz CPU / 6 GB RAM
(Results figures: lower cost is better; lower time is better.)
Conclusion
• Proposed a fast and accurate k-means clustering algorithm based on a streaming approach
• Incorporated approximate nearest neighbor search into the proposed algorithm
• Strong results in both theory and practice
References
• [Lloyd 1982] Least Squares Quantization in PCM. IEEE Transactions on Information Theory.
• [Aggarwal 2007] Data Streams: Models and Algorithms. Springer.
• [Braverman+ 2011] Streaming K-means on Well-Clusterable Data. SODA.
• [Ackermann+ 2010] StreamKM++: A Clustering Algorithm for Data Streams. ALENEX.
• [Sculley 2010] Web-Scale K-means Clustering. WWW.
Any Questions?