DP-IV presentation - ashutosh

Performance analysis of C-means Clustering on Big Data using Hadoop

Guided By: Prof. A. J. Umbarkar

Presented By: A. S. Sathe

BROAD AREA: DISTRIBUTED COMPUTING, DATA MINING

SUB AREA: CLUSTERING ALGORITHMS, DATA CLUSTERING

Presentation Agenda
• Literature Survey
• Problem Statement
• Objectives achieved
• Results
• Future Scope
• References

Data Growth Rate[7]

Relevance
• Data Clustering: Classification of a data set into similar groups based on some criteria
• Big Data: An amount of data that is difficult to process using traditional database and software techniques
• Hadoop: A distributed computing framework based on the MapReduce architecture
• Document Clustering
  • Text-based data stored in file format or unstructured format
  • Based on text properties such as word frequency, provided keywords, etc.
  • Text properties are used as the similarity criteria
  • Documents are differentiated based on the similarity criteria
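As a rough illustration of using word frequency as a similarity criterion, the sketch below builds word-count vectors for two documents and compares them with cosine similarity. The tokenizer, the choice of cosine similarity, and the sample sentences are assumptions made for this example; the slides do not state which text representation or measure was used.

```python
from collections import Counter
import math
import re


def term_frequencies(text):
    """Lower-case the text, split on non-letters, and count each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse word-count vectors."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[w] * tf_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents, not taken from the dataset used in the slides.
doc_a = term_frequencies("Hadoop runs MapReduce jobs on a cluster")
doc_b = term_frequencies("MapReduce jobs run on a Hadoop cluster")
doc_c = term_frequencies("Fuzzy membership values sum to one")

print(cosine_similarity(doc_a, doc_b))  # high: most words are shared
print(cosine_similarity(doc_a, doc_c))  # zero: no words in common
```

Documents with a higher score share more of their vocabulary, so a clustering algorithm would tend to place them in the same group.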

Relevance
• Need for data clustering
  • Data mining is used for Knowledge Discovery from Data (KDD)
  • It is based on historical data
  • Historical data may be Big Data
  • Processing Big Data is a very tedious task
  • Data clustering is a preprocessing step for Big Data processing
  • The processed data is then used for data mining
  • Clustered data gives better results than randomly placed data

Relevance
• Why text clustering
  • Text is a type of unstructured data
  • Free from any database constraints
  • Files can be very large without any restrictions
  • In real-world scenarios, text clustering is used to:
    • Retrieve, filter, and categorize documents
    • Support information retrieval
  • Clustered data is useful for knowledge data retrieval

Relevance
• Why Hadoop
  • Distributed framework
  • Can use processor capacity on the fly
  • Made for Big Data processing

Problem Statement

• “Performance Analysis of C-means Clustering on Big Data using Hadoop.”

Objectives achieved
• Design of a processing model of the Fuzzy C-Means algorithm for Map-Reduce
• Implementation of the C-means algorithm on Map-Reduce
• Testing and performance analysis of the above algorithm with Big Data on Map-Reduce
• Comparison of C-means with other equivalent works
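One common way to fit a Fuzzy C-Means iteration into the map and reduce phases is sketched below in plain Python, without Hadoop. It is an assumption about how such a processing model could look, not the design used in this work: each mapper emits partial sums per cluster for its input split, and the reducer combines them into updated centroids. The function names, the 1-D data, and the split contents are hypothetical.

```python
from collections import defaultdict


def fcm_memberships(x, centroids, m=2.0):
    """Membership of point x in every cluster (standard FCM formula)."""
    dists = [abs(x - c) for c in centroids]
    zeros = [d == 0 for d in dists]
    if any(zeros):
        # x sits exactly on one or more centroids: share membership among them.
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]


def map_phase(split, centroids, m=2.0):
    """Mapper: for each point emit (cluster index, (u^m * x, u^m))."""
    for x in split:
        for j, u in enumerate(fcm_memberships(x, centroids, m)):
            weight = u ** m
            yield j, (weight * x, weight)


def reduce_phase(pairs, n_clusters):
    """Reducer: sum the partial terms per cluster and form new centroids."""
    numer = defaultdict(float)
    denom = defaultdict(float)
    for j, (weighted_x, weight) in pairs:
        numer[j] += weighted_x
        denom[j] += weight
    return [numer[j] / denom[j] for j in range(n_clusters)]


def fcm_iteration(splits, centroids, m=2.0):
    """Run the mapper over every split, then reduce to updated centroids."""
    pairs = [kv for split in splits for kv in map_phase(split, centroids, m)]
    return reduce_phase(pairs, len(centroids))


# Hypothetical 1-D data divided into two input splits.
splits = [[2.0, 4.0], [10.0, 12.0]]
new_centroids = fcm_iteration(splits, [3.0, 11.0])
print(new_centroids)
```

Because the reducer only adds up partial sums, the updated centroids do not depend on how the points are divided into splits, which is what lets the work be spread across nodes.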

Fuzzy C-means Clustering

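• For example: we have initial centroids 3 and 11 (with m = 2)

• For node 2 (1st element), the distances to the two centroids are $|2 - 3| = 1$ and $|2 - 11| = 9$, and the exponent is $2/(m-1) = 2$

U11 = the membership of the first node in the first cluster

$$U_{11} = \frac{1}{\left(\frac{1}{1}\right)^{2} + \left(\frac{1}{9}\right)^{2}} = \frac{81}{82} \approx 98.78\%$$

U12 = the membership of the first node in the second cluster

$$U_{12} = \frac{1}{\left(\frac{9}{1}\right)^{2} + \left(\frac{9}{9}\right)^{2}} = \frac{1}{82} \approx 1.22\%$$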
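The membership values for the example with centroids 3 and 11, node 2, and m = 2 can be checked with a few lines of Python. This is a minimal sketch of the standard FCM membership formula for 1-D data; the function name `membership` is hypothetical and the code is not taken from the implementation evaluated in this work.

```python
def membership(x, centroids, m=2.0):
    """Return the FCM membership of point x in each cluster.

    u_j = 1 / sum_k (d_j / d_k) ** (2 / (m - 1)), where d_j = |x - c_j|.
    Assumes x does not coincide with any centroid (all distances > 0).
    """
    dists = [abs(x - c) for c in centroids]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]


u11, u12 = membership(2, [3, 11], m=2)
print(f"U11 = {u11:.2%}")  # U11 = 98.78%
print(f"U12 = {u12:.2%}")  # U12 = 1.22%
```

The two memberships sum to 1, as required for every data point in Fuzzy C-Means.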

[Figure: workflow. Dataset Conversion, followed by Hadoop based K-Means on Documents, Fuzzy C-Means on Documents, and Hadoop based Fuzzy C-Means on Documents]

Results

Experimental Setup

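| Algorithm | 3 Centroids, 4 Itr | 3 Centroids, 6 Itr | 4 Centroids, 4 Itr | 4 Centroids, 6 Itr | 5 Centroids, 4 Itr | 5 Centroids, 6 Itr | 6 Centroids, 4 Itr | 6 Centroids, 6 Itr | Split |
|---|---|---|---|---|---|---|---|---|---|
| Classical K-Means | √ | √ | √ | √ | √ | √ | √ | √ | Not applicable |
| Hadoop Based K-Means | √ | √ | √ | √ | √ | √ | √ | √ | 4 MB |
| Hadoop Based K-Means | √ | √ | √ | √ | √ | √ | √ | √ | 8 MB |
| Hadoop Based K-Means | | | | | | | | | 16 MB |
| Hadoop Based K-Means | √ | √ | √ | √ | √ | √ | √ | √ | 32 MB |
| Classical Fuzzy C-Means | √ | √ | √ | √ | √ | √ | √ | √ | Not applicable |
| Hadoop Based Fuzzy C-Means | √ | √ | √ | √ | √ | √ | √ | √ | 4 MB |
| Hadoop Based Fuzzy C-Means | √ | √ | √ | √ | √ | √ | √ | √ | 8 MB |
| Hadoop Based Fuzzy C-Means | | | | | | | | | 16 MB |
| Hadoop Based Fuzzy C-Means | √ | √ | √ | √ | √ | √ | √ | √ | 32 MB |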

Experimental Setup

[Figure: K-Means execution time. Bars for Classical K-Means and 2-, 4-, and 8-node K-Means, with one series each for 3, 4, 5, and 6 centroids. Axis: Time (sec), 0 to 1000.]

[Figure: FCM execution time. Bars for Classical and 2-, 4-, and 8-node FCM, with one series each for 3, 4, 5, and 6 centroids. Axis: Time in sec, 0 to 9000.]

[Figure: 4MB Split KM Performance. Series: 4 ITR and 6 ITR. X-axis: No. of Nodes (2, 4, 8).]

[Figure: 4MB Split FCM Performance. Series: 4 ITR and 6 ITR. X-axis: No. of Nodes.]

[Figure: Speedup Comparison of KM w.r.t. HKM]

[Figure: Speedup Comparison of FCM w.r.t. HFCM]

[Figure: 8MB Split HKM Performance. Series: 4 ITR and 6 ITR. X-axis: No. of Nodes.]

[Figure: 8MB Split HFCM Performance. Series: 4 ITR and 6 ITR. X-axis: No. of Nodes.]

[Figure: Speedup Comparison of KM w.r.t. HKM]

[Figure: HKM and HFCM speedup performances and comparison. Two panels with bars for 2, 4, and 8 nodes, grouped by HKM and HFCM at 4 MB, 8 MB, and 32 MB splits.]
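The speedup plotted in these charts is presumably the ratio of the classical (single-machine) run time to the Hadoop run time; the slides do not state the formula, so that definition is an assumption here. The sketch below computes it, together with a per-node efficiency. The helper names and the timing values are hypothetical and are not measurements from this work.

```python
def speedup(t_classical, t_hadoop):
    """Ratio of classical run time to Hadoop run time (both in seconds)."""
    if t_classical < 0 or t_hadoop <= 0:
        raise ValueError("run times must be positive")
    return t_classical / t_hadoop


def efficiency(speedup_value, n_nodes):
    """Speedup divided by node count; 1.0 would be ideal linear scaling."""
    if n_nodes <= 0:
        raise ValueError("node count must be positive")
    return speedup_value / n_nodes


# Hypothetical timings in seconds, for illustration only.
t_classical = 900.0
hadoop_times = {2: 600.0, 4: 300.0, 8: 200.0}

for nodes, t in hadoop_times.items():
    s = speedup(t_classical, t)
    print(f"{nodes} nodes: speedup {s:.2f}, efficiency {efficiency(s, nodes):.2f}")
```

A speedup below the node count (efficiency under 1.0) is typical, since job start-up, shuffling, and per-iteration I/O add overhead that the classical run does not pay.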

Analysis based on cluster sizes

[Figure: Average KM and HKM time consumption w.r.t. cluster sizes. Bars for KM and 2-, 4-, and 8-node HKM, with one series each for 3, 4, 5, and 6 centroids.]

Analysis based on cluster sizes (cont.)

[Figure: Average FCM and HFCM time consumption w.r.t. cluster sizes. Bars for FCM and the 2-, 4-, and 8-node Hadoop runs, with one series each for 3, 4, 5, and 6 centroids.]

Future Scope

Paper publication
• Submitted to IEEE CONECCT 2015

Tools and Platform Required
1. Text Dataset
4. Hadoop 1.21
5. JDK 1.6
6. O.S. Ubuntu 14.04

References
1. Cui, Xiaoli et al. "Optimized big data K-means clustering using MapReduce." The Journal of Supercomputing, Vol. 70, pp. 1249-1259, 2014.
2. Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. "Data clustering: a review." ACM Computing Surveys (CSUR), Vol. 31, pp. 264-323, 1999. DOI:10.1145/331499.331504
3. Zhao, Weizhong et al. "Parallel k-means clustering based on MapReduce." In Cloud Computing, Springer Berlin Heidelberg, Vol. 5931, pp. 674-679, 2009.
4. Xie, Jiong, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin. "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters." In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, pp. 1-9. IEEE, 2010. DOI:10.1109/IPDPSW.2010.5470880
5. J. Dean and S. Ghemawat. "MapReduce." Communications of the ACM, Vol. 51(1), p. 107, Jan 2008.
6. A. Asuncion and D. J. Newman. UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml/ (accessed: 07-Jan-2015)
7. https://www.linkedin.com/pulse/big-data-whats-deal-debarchan-sarkar [Used on Apr 9, 2015]

QUESTIONS?

Thank You
