Fast and Accurate Kmeans for Large Datasets Michael Shindler, Alex Wong, Adam Meyerson Presenter:Yoh Okuno #nipsreading


Page 1: Fast and Accurate K-means for Large Datasets #nipsereading

Fast and Accurate K-means for Large Datasets

Michael Shindler, Alex Wong, Adam Meyerson

Presenter: Yoh Okuno #nipsreading

Page 2: Fast and Accurate K-means for Large Datasets #nipsereading

About Presenter

•  Name: Yoh Okuno
•  R&D Engineer at Yahoo! Japan
•  Interest: NLP (Natural Language Processing), Machine Learning, and Data Mining.
•  Skills: C/C++, Java, Python, and Hadoop.
•  Website: http://yoh.okuno.name/

Page 3: Fast and Accurate K-means for Large Datasets #nipsereading

Overview

1. Recent Advances in K-means Clustering
– Batch versus Streaming Settings
– Related Works and Our Contribution

2. Algorithm for Large-Scale K-means Clustering
– Streaming + Mini-Batch + Smart Initialization

3. Incorporating Approximate Nearest Neighbor Search
– Based on Random Projection (Hashing)

4. Evaluation and Discussion

Page 4: Fast and Accurate K-means for Large Datasets #nipsereading

1. Recent Advances in K-means Clustering

Page 5: Fast and Accurate K-means for Large Datasets #nipsereading

Review of the Standard K-means Clustering

•  Minimize cost function below iteratively:

$$\min_{z,\,\mu} \; \sum_{i=1}^{N} \lVert x_i - \mu_{z_i} \rVert^2$$

Where:
x_i: i-th data point
z_i: cluster number
μ_j: centroid of j-th cluster

1. Update z with fixed μ (assign cluster number)
2. Update μ with fixed z (calculate average)
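The two alternating updates can be sketched in a few lines of Python. This is only a minimal illustration of the standard batch procedure, not code from the paper or the slides. The function names (`assign`, `update`, `cost`, `lloyd`), the random choice of initial centroids, and the handling of empty clusters are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """Step 1: update z with fixed mu (index of the nearest centroid)."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(x, centroids[j]))
        for x in points
    ]


def update(points, z, centroids):
    """Step 2: update mu with fixed z (average of each cluster's points)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, zi in zip(points, z) if zi == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(x[d] for x in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def cost(points, z, centroids):
    """The objective: sum over i of ||x_i - mu_{z_i}||^2."""
    return sum(squared_distance(x, centroids[zi]) for x, zi in zip(points, z))


def lloyd(points, k, iterations=20, seed=0):
    """Alternate the two steps until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    z = assign(points, centroids)
    for _ in range(iterations):
        centroids = update(points, z, centroids)
        new_z = assign(points, centroids)
        if new_z == z:
            break
        z = new_z
    return z, centroids
```

Calling `lloyd(points, k=2)` on a list of coordinate tuples returns the assignment `z` and the centroids `μ`. Neither step can increase the cost, which is why the iteration settles.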

Page 6: Fast and Accurate K-means for Large Datasets #nipsereading

Related Works and Our Contributions

•  The standard batch algorithm [Lloyd 1982]
•  Streaming approaches [Aggarwal 2007]
•  Mini-batch approaches [Sculley 2010]
•  Our work is based on a recent streaming approach [Braverman+ 2011]
•  Incorporated approximate nearest neighbor search

Page 7: Fast and Accurate K-means for Large Datasets #nipsereading

2. Algorithm for Large-Scale K-means Clustering

Page 8: Fast and Accurate K-means for Large Datasets #nipsereading
Page 9: Fast and Accurate K-means for Large Datasets #nipsereading

[Figure: the three stages of the algorithm (Initialize, Streaming, Mini Batch)]

Page 10: Fast and Accurate K-means for Large Datasets #nipsereading

Initialize clusters

•  Create clusters until the buffer is full
– Run nearest neighbor search on each new data point
– Add the point as a new cluster at random (with a probability that depends on its distance to the nearest cluster)
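A rough sketch of this initialization follows. The slide does not give the exact probability rule, so the code assumes one common choice in streaming clustering: a point opens a new cluster with probability min(1, d²/f), where d is its distance to the nearest existing cluster and `f` is a hypothetical cost parameter. The names `initialize_clusters`, `buffer_size`, and `f` are illustrative only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(x, centers):
    """Exact nearest neighbor search: (index, squared distance) of the closest center."""
    best = min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
    return best, squared_distance(x, centers[best])


def initialize_clusters(stream, buffer_size, f, rng):
    """Read points until `buffer_size` clusters exist or the stream ends.

    Returns (centers, weights, consumed), where weights[j] counts the
    points absorbed by cluster j and `consumed` is the number of points read.
    """
    centers, weights = [], []
    consumed = 0
    it = iter(stream)
    while len(centers) < buffer_size:
        x = next(it, None)
        if x is None:  # stream exhausted before the buffer filled up
            break
        consumed += 1
        if not centers:
            centers.append(x)  # the first point always becomes a cluster
            weights.append(1)
            continue
        j, d2 = nearest(x, centers)
        # Assumed rule: far-away points are more likely to open a new cluster.
        if rng.random() < min(1.0, d2 / f):
            centers.append(x)
            weights.append(1)
        else:
            weights[j] += 1  # otherwise the point joins its nearest cluster
    return centers, weights, consumed
```

Under this assumed rule, a larger `f` makes new clusters rarer, so more points are absorbed into existing clusters before the buffer fills.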

 

Page 11: Fast and Accurate K-means for Large Datasets #nipsereading

Streaming K-means Clustering

•  Renew clusters randomly in the same way (same as the previous page)

Page 12: Fast and Accurate K-means for Large Datasets #nipsereading

Ball k-means on weighted points

•  Run ball k-means on weighted points [Braverman+ 2011] [Ostrovsky+ 2006]
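As a hedged illustration, one ball k-means step on weighted points might look like the sketch below. The slide does not spell out the step, so the code assumes this form: each center is replaced by the weighted mean of the points inside a ball around it, with radius equal to a fraction of the distance to the nearest other center. The default `radius_fraction` of one third and the function name `ball_kmeans_step` are assumptions for this example.

```python
import math


def ball_kmeans_step(points, weights, centers, radius_fraction=1 / 3):
    """One ball k-means update on weighted points (needs at least two centers).

    Each center moves to the weighted mean of the points lying within
    radius_fraction * (distance to the nearest other center) of it.
    """
    new_centers = []
    for i, c in enumerate(centers):
        nearest_other = min(
            math.dist(c, other) for j, other in enumerate(centers) if j != i
        )
        radius = radius_fraction * nearest_other
        total = 0.0
        acc = [0.0] * len(c)
        for x, w in zip(points, weights):
            if math.dist(x, c) <= radius:
                total += w
                for d, xd in enumerate(x):
                    acc[d] += w * xd
        if total > 0:
            new_centers.append(tuple(a / total for a in acc))
        else:
            # Assumption: a center with an empty ball stays where it is.
            new_centers.append(tuple(c))
    return new_centers
```

Here the weights would be the number of original points each buffered cluster represents, so a heavy cluster pulls the center harder than a light one.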

Page 13: Fast and Accurate K-means for Large Datasets #nipsereading

3. Incorporating Approximate Nearest Neighbor Search

Page 14: Fast and Accurate K-means for Large Datasets #nipsereading

Bottleneck:  nearest  neighbor  search  among    points

Page 15: Fast and Accurate K-means for Large Datasets #nipsereading

Approximate Nearest Neighbor Search

•  Use simple random projection

1. Draw ω ∈ R^d with each component chosen uniformly at random from [0, 1)
2. Calculate the inner product of ω with each cluster center
3. Given a query x, calculate the inner product x・ω
4. Find the cluster nearest to x by comparing the inner products
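The four steps can be sketched as follows. The slide does not say how step 4 picks the cluster, so this example assumes a simple strategy: keep the projected centers sorted, binary-search the query's projection, and compare the true distances of the two centers on either side. The class name `ProjectionIndex` and its methods are hypothetical.

```python
import bisect
import random


def dot(a, b):
    """Inner product of two vectors given as tuples or lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


class ProjectionIndex:
    """Approximate nearest-center lookup through a single random projection."""

    def __init__(self, centers, rng):
        dim = len(centers[0])
        # Step 1: each component of omega is uniform in [0, 1).
        self.omega = [rng.random() for _ in range(dim)]
        # Step 2: project every center onto omega and sort by projection.
        projected = sorted((dot(c, self.omega), j) for j, c in enumerate(centers))
        self.keys = [p for p, _ in projected]
        self.ids = [j for _, j in projected]
        self.centers = centers

    def query(self, x):
        """Return the index of an approximately nearest center to x."""
        q = dot(x, self.omega)  # Step 3: project the query.
        pos = bisect.bisect_left(self.keys, q)
        # Step 4 (assumed strategy): only the projected neighbors on either
        # side of q are candidates; choose between them by true distance.
        candidates = [self.ids[i] for i in (pos - 1, pos) if 0 <= i < len(self.ids)]
        return min(candidates, key=lambda j: squared_distance(x, self.centers[j]))
```

Each lookup costs a binary search plus at most two distance computations instead of a scan over all centers. The result is approximate, since two centers can be close in projection while being far apart in the original space.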

Page 16: Fast and Accurate K-means for Large Datasets #nipsereading

4. Evaluation and Discussion

Page 17: Fast and Accurate K-means for Large Datasets #nipsereading

Datasets

•  BigCross dataset:
– Size: 11 million points in 55 dimensions
•  Census 1990: national survey
– Size: 2 million points in 68 dimensions
•  Environment: C++ / Ubuntu / 2.9 GHz / 6 GB

Page 18: Fast and Accurate K-means for Large Datasets #nipsereading

Note: Lower cost is better

Page 19: Fast and Accurate K-means for Large Datasets #nipsereading

Note: Lower time is better

Page 20: Fast and Accurate K-means for Large Datasets #nipsereading

Conclusion

•  Proposed a fast, accurate k-means clustering method based on a streaming algorithm
•  Incorporated approximate nearest neighbor search into the proposed algorithm
•  Performs well in both practice and theory

Page 21: Fast and Accurate K-means for Large Datasets #nipsereading

References

•  [Lloyd 1982] Least Squares Quantization in PCM. IEEE Transactions on Information Theory.
•  [Aggarwal 2007] Data Streams: Models and Algorithms. Springer.
•  [Braverman+ 2011] Streaming K-means on Well-Clusterable Data. SODA.
•  [Ackermann+ 2010] StreamKM++: A Clustering Algorithm for Data Streams. ALENEX.
•  [Sculley 2010] Web-Scale K-means Clustering. WWW.

Page 22: Fast and Accurate K-means for Large Datasets #nipsereading

Any Questions?