
Hanaa phd presentation 14-4-2017



Page 1: Hanaa phd  presentation  14-4-2017

Intelligent Visualization of Multidimensional Data Sets

By Hanaa Ismail Elshazly
PhD Student, Department of Computer Sciences, Faculty of Computers and Information, Cairo University

Supervisors: Prof. Aboul Ella Hassanien & Prof. Abeer Mohamed El Korany

Page 2: Hanaa phd  presentation  14-4-2017

Big image: multidimensional data → reduction → visualization.

Intelligent Visualization of Multidimensional Data Sets

Dimensions: a dimension is a key descriptor, an index, by which facts can be accessed according to the value (or values) of interest.

Information visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information.

Page 3: Hanaa phd  presentation  14-4-2017

Contents
1. Introduction
2. Proposed Framework
3. Experimental Results
4. Conclusion
5. Future Work

Page 4: Hanaa phd  presentation  14-4-2017

Introduction: General

• Massive and complex data are generated every day in many fields due to the advance of hardware and software technology.

• Curse of dimensionality is a major obstacle in machine learning and data mining.

• Clinical data referring to patients’ investigations contain irrelevant attributes that degrade the classification performance.

• Visualization is important when analyzing multidimensional datasets, since it can help humans discover and understand complex relationships in data.

Page 5: Hanaa phd  presentation  14-4-2017

Introduction: Data Problems

• Data quality
• Integrating redundant data from different sources
• Mining information from heterogeneous databases
• Difficulty in training set
• Dynamic databases
• Dimensionality

Page 6: Hanaa phd  presentation  14-4-2017

Introduction: Dimensionality Reduction

• In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration via obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

• Popular search methods that are manageable in low-dimensional spaces can become totally unmanageable in high-dimensional spaces.

• The curse of dimensionality is a major obstacle in machine learning and data mining

• Reducing the dimensionality of the feature space supports successful classification; selecting the optimal feature subset can substantially improve classification performance.

Page 7: Hanaa phd  presentation  14-4-2017

Introduction: Dimensionality Reduction

Feature selection (FS) techniques (filter, wrapper, embedded) turn massive data (microarray gene expression, medical images, huge databases, finance data, sensor arrays, web documents) into reduced data.

Benefits:
• Improves the comprehensibility of the induced concepts
• Decreases dataset complexity
• Improves classification performance
• Saves resources
• Adds visualization ability
• Gives a better understanding of the extracted knowledge
• Reduces the computation requirement
• Reduces the effect of the curse of dimensionality

Page 8: Hanaa phd  presentation  14-4-2017

Introduction: The Curse of Dimensionality

The curse of dimensionality is a damaging factor:
• Computational complexity limits the applicability of ML techniques to real-world problems
• Slow learning process
• Difficulty of inducing concepts
• Decreased predictive performance
• Extra difficulty in finding potentially useful knowledge
• Difficulty in adding visualization ability: limited human capability means human inspection and interpretation of the data is not feasible
• Intractable behavior of search methods
• Conventional database management and data analysis tools are insufficient
• Storage requirements

Page 9: Hanaa phd  presentation  14-4-2017


Proposed General Framework

• Pre-processing phase
• Feature selection phase
• Classification phase
• Rules refinement phase
• Visualization phase

Page 10: Hanaa phd  presentation  14-4-2017

Proposed General Framework

• Pre-processing phase: discretization (equal binning, simplification)
• Feature selection phase: reduction (PCA, rough sets)
• Classification phase: rough set rules generation (rough set discernibility matrix)
• Rules refinement phase: reduction (entropy, GA)
• Visualization phase: nodes, edges, charts, grids

Page 11: Hanaa phd  presentation  14-4-2017

Experimental Data Sets

Data Set | Classes | Instances | Features | Source
Wisconsin Breast Cancer – Diagnosis | 2 | 569 | 32 | UCI Machine Learning Repository
Wisconsin Breast Cancer – Prognosis | 2 | 198 | 32 | UCI Machine Learning Repository
SPECTF Heart | 2 | 267 | 45 | UCI Machine Learning Repository
Lymphography | 4 | 148 | 18 | University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia
Indian Liver Patient Dataset | 2 | 583 | 11 | UCI Machine Learning Repository
Prostate | 2 | 102 | 12600 | UCI Machine Learning Repository

Page 12: Hanaa phd  presentation  14-4-2017

Pre-processing Phase

Aim: reduce the number of values of a given continuous attribute by dividing its range into intervals and replacing low-level concepts by higher-level concepts.

Techniques:
• Equal binning: transforms numerical variables into categorical counterparts.
• Simplification: rescales data into the range [1, 3].

Flow: multidimensional data → discretization → discretized data → simplification → simplified data.
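The simplification step only states that data are rescaled into the range [1, 3]; the following NumPy sketch shows one min-max rescaling that matches that description (the function name and the column-wise treatment are illustrative assumptions, not the thesis implementation):

```python
import numpy as np

def simplify(X, low=1.0, high=3.0):
    """Min-max rescale each column of X into [low, high].

    Illustrative sketch of the 'simplification' step; the slide only
    states that data are rescaled to the range [1, 3].
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    rng = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return low + (X - col_min) / rng * (high - low)

# Example: rescale a small 3-feature sample
sample = np.array([[2.0, 10.0, 0.5],
                   [4.0, 30.0, 0.9],
                   [6.0, 20.0, 0.1]])
print(simplify(sample))
```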

Page 13: Hanaa phd  presentation  14-4-2017

Pre-processing Phase: Equal Binning Algorithm

For each feature V in data D:
    divide the domain of V into k intervals of equal width,
    w = (max(V) - min(V)) / k,
    with interval boundaries min(V)+w, min(V)+2w, ..., min(V)+(k-1)w.
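A minimal Python sketch of the equal-width binning described above; using np.digitize against the k-1 interior boundaries is one convenient way to realise it, and the helper name is an assumption rather than the thesis code:

```python
import numpy as np

def equal_width_bins(values, k):
    """Discretize a 1-D feature into k equal-width intervals.

    Boundaries follow the slide: min+w, min+2w, ..., min+(k-1)w,
    with w = (max - min) / k.  Returns bin indices in 0..k-1.
    """
    values = np.asarray(values, dtype=float)
    w = (values.max() - values.min()) / k
    edges = values.min() + w * np.arange(1, k)   # the k-1 interior boundaries
    return np.digitize(values, edges)            # index of the interval per value

feature = np.array([1.2, 3.4, 5.1, 7.8, 9.9, 2.2])
print(equal_width_bins(feature, k=3))
```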

Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.

How discretization techniques influence the classification of breast cancer data (accuracy %):

Classifier | Boolean Reasoning % | Binning % | Entropy %
Naïve Bayes | 91 | 92.9 | 77.2
Decision Rules | 95.3 | 95.3 | 91.4
KNN | 94 | 94.7 | 76.1

Page 14: Hanaa phd  presentation  14-4-2017

Feature Selection Phase

Flow: simplified data → PCA (reduced data) → rough set positive regions extraction → positive regions → discernibility matrix → final reducts.

Aim: determine a minimal feature subset that best contributes to accuracy and retains high efficiency in representing the original features, while neglecting features with little contribution to the prediction process.

• PCA (Principal Component Analysis): a statistical technique useful in data compression and reduction.
• Rough sets: the main goal of rough set analysis is the induction of (learning) approximations of concepts.

Page 15: Hanaa phd  presentation  14-4-2017

Feature Selection Phase: Principal Component Analysis Algorithm
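A minimal sketch of the standard PCA steps (centre the data, compute the covariance matrix, eigendecompose, project onto the leading eigenvectors); this is the textbook procedure, offered as an assumed illustration rather than the thesis implementation:

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    Xc = X - X.mean(axis=0)                      # 1. centre the data
    cov = np.cov(Xc, rowvar=False)               # 2. covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)       # 3. eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]            # 4. sort by decreasing variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                       # 5. project onto the components

X = np.random.default_rng(0).normal(size=(100, 10))
print(pca(X, n_components=3).shape)              # (100, 3)
```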

Page 16: Hanaa phd  presentation  14-4-2017

Feature Selection Phase: PCA performance as a transformation method in Rotation Forest for chronic eye disease diagnosis

Many transformation methods have been applied in the literature, such as principal component analysis (PCA), nonparametric discriminant analysis (NDA), random projections (RP), and independent component analysis (ICA).
• PCA gave the best results due to the diversity it provides.
• PCA preserves the discriminatory features.
• PCA provided better results than those extracted through nonparametric discriminant analysis (NDA) or random projections.
• PCA was chosen as the transformation method in the following research papers:

Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Ahmad Taher Azar, "Ensemble classifiers for biomedical data: performance evaluation", 8th International Conference on Computer Engineering & Systems (ICCES), ISBN: 978-1-4799-0078-7, pp 184-189, 2013.

Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Mohamed Waly, "Chronic Eye Disease diagnosis using ensemble-based classifier", Second International Conference on Engineering and Technology (ICET), German University in Cairo, Egypt, 2014.

Page 17: Hanaa phd  presentation  14-4-2017

Feature Selection Phase

Hanaa Ismail Elshazly, Ahmad Taher Azar, Abeer Mohamed El Korany, Aboul Ella Hassanien, "Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications", International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.

Rough Sets for Reduct Generation: Discernibility Matrix

Let T = (U, C, D) be a decision table with U = {u_1, u_2, ..., u_n}. By M(T) we mean the n x n matrix (m_ij), called the discernibility matrix of T, defined as

$$
m_{ij} =
\begin{cases}
\{\, c \in C : c(u_i) \neq c(u_j) \,\} & \text{if } d(u_i) \neq d(u_j), \\
\lambda & \text{if } d(u_i) = d(u_j),
\end{cases}
\qquad (1)
$$

For any $u_i \in U$, the discernibility function is

$$
f_T(u_i) = \bigwedge \{\, t(m_{ij}) : j \in \{1, 2, \dots, n\} \,\},
\qquad (2)
$$

where

$$
t(m_{ij}) =
\begin{cases}
\bigvee m_{ij} & \text{if } m_{ij} \neq \lambda \text{ and } m_{ij} \neq \emptyset, \\
\text{false} & \text{if } m_{ij} = \emptyset, \\
\text{true} & \text{if } m_{ij} = \lambda,
\end{cases}
\qquad (3)
$$

and $\bigvee m_{ij}$ denotes the disjunction of all variables $a$ such that $a \in m_{ij}$.
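A minimal Python sketch of building the discernibility matrix exactly as in definition (1); the dictionary-based table layout and the use of None to stand in for λ are illustrative choices, not the thesis code:

```python
def discernibility_matrix(objects, decisions, attributes):
    """Entry (i, j) is the set of condition attributes on which u_i and u_j
    differ when their decisions differ, and None (playing the role of lambda)
    when the decisions agree, mirroring definition (1) above."""
    n = len(objects)
    m = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if decisions[i] != decisions[j]:
                m[i][j] = {a for a in attributes
                           if objects[i][a] != objects[j][a]}
    return m

# Toy decision table: two condition attributes, one decision
objs = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
dec = ["yes", "no", "no"]
M = discernibility_matrix(objs, dec, attributes=["a", "b"])
print(M[0][1], M[0][2])   # {'b'}  {'a', 'b'}
```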

Page 18: Hanaa phd  presentation  14-4-2017

Classification Phase

Flow: final reducts and multidimensional data → rule generation → generated rules → classification with decision rules → testing → tested instances → classified instances.

Aim: the learning algorithm, called the classifier, aims to return a set of decision rules together with a procedure that makes it possible to classify objects not found in the original decision table. Rough set rules are generated using the discernibility matrix.

Page 19: Hanaa phd  presentation  14-4-2017

Rough Set Rules Generation Algorithm

Let T = (U, C, D) be a decision table with U = {u_1, u_2, ..., u_n}, and let M(T) = (m_ij) be the n x n discernibility matrix defined in equations (1)-(3) above. Here m_ij is the set of all condition attributes that classify objects u_i and u_j into different classes, and for each u_i in U the discernibility function f_T(u_i) is the conjunction of the terms t(m_ij) over j = 1, ..., n.
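To make the rule-generation step concrete, here is a small sketch of the usual rough set recipe of reading one "IF ... THEN ..." rule per training object restricted to a reduct; the thesis procedure may differ in details such as rule shortening, so treat this as an assumed illustration:

```python
def rules_from_reduct(objects, decisions, reduct):
    """Generate 'IF ... THEN ...' decision rules restricted to a reduct.

    Each training object yields one rule over the reduct attributes;
    duplicate rules are collapsed.
    """
    rules = set()
    for obj, d in zip(objects, decisions):
        body = tuple((a, obj[a]) for a in reduct)
        rules.add((body, d))
    return [f"IF {' AND '.join(f'{a}={v}' for a, v in body)} THEN class={d}"
            for body, d in sorted(rules, key=str)]

objs = [{"a": 1, "b": 0, "c": 5}, {"a": 0, "b": 1, "c": 5}]
dec = ["yes", "no"]
for rule in rules_from_reduct(objs, dec, reduct=["a", "b"]):
    print(rule)
```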

Page 20: Hanaa phd  presentation  14-4-2017

Comparison of different classifiers against different data sets

Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.

Hanaa Ismail Elshazly et al., “Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications”, International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.

Page 21: Hanaa phd  presentation  14-4-2017

Rules Refinement Phase

Components: generated reducts, reducts evaluation (entropy), informative reduct, all rules generated, rules allocation, rule selection by GA, testing against termination criteria on multidimensional test data, classified instances, refined decision rules.

Aim: reduce the number of rules so that they can be easily visualized and presented to an expert without decreasing accuracy. Reducts are evaluated using entropy, and a GA with support and confidence as its fitness function selects the rules.

Page 22: Hanaa phd  presentation  14-4-2017

Reduct Evaluation

Decision tree algorithms rely on information gain to estimate the expected amount of information needed to classify an instance correctly; the same measure is used here to rank reducts.

Calculate the entropy of the target:

$$\text{Entropy}(T) = -\sum_{i=1}^{c} p_i \log_2 p_i,$$

where c is the number of possible values of the target.

For each reduct R_i in the set of reducts, and for each attribute X in R_i, compute

$$\text{Entropy}(T, X) = \sum_{c \in X} P(c)\, E(c)$$

and the information gain

$$\text{Gain}(T, X) = \text{Entropy}(T) - \text{Entropy}(T, X).$$

Choose the reduct R_i with the largest information gain.
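A minimal Python sketch of the entropy-based reduct evaluation above; averaging the per-attribute gains within a reduct is an assumption about how they are aggregated, since the slide only says to pick the reduct with the largest information gain:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(T) = -sum_i p_i * log2(p_i) over the target values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(values, labels):
    """Entropy(T, X) = sum over values v of X: P(v) * Entropy(T | X=v)."""
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        total += (len(subset) / n) * entropy(subset)
    return total

def reduct_gain(table, labels, reduct):
    """Average information gain of the attributes in a reduct (aggregation assumed)."""
    base = entropy(labels)
    gains = [base - conditional_entropy([row[a] for row in table], labels)
             for a in reduct]
    return sum(gains) / len(gains)

# Pick the reduct with the largest (average) information gain
table = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
labels = ["yes", "yes", "no"]
print(max([["a"], ["b"]], key=lambda r: reduct_gain(table, labels, r)))
```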

Page 23: Hanaa phd  presentation  14-4-2017

Genetic Algorithm Using Support and Confidence as Fitness Function

Body ==> Consequent [Support, Confidence]

• Consequent: represents a discovered property of the examined data.
• Support: represents the percentage of records satisfying the body or the consequent.
• Confidence: represents the percentage of records satisfying both the body and the consequent relative to those satisfying only the body.
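A minimal sketch of scoring a rule by the support and confidence defined above and combining them into a GA fitness; the equal weights, the dictionary encoding of records, and the "class" key are illustrative assumptions:

```python
def support_confidence(records, body, consequent):
    """Support and confidence of a rule 'body ==> consequent'.

    Following the slide: support counts records satisfying the body or the
    consequent; confidence relates records satisfying both to those
    satisfying the body.
    """
    matches_body = [r for r in records if all(r.get(a) == v for a, v in body.items())]
    matches_cons = [r for r in records if r.get("class") == consequent]
    both = [r for r in matches_body if r.get("class") == consequent]
    either = {id(r) for r in matches_body} | {id(r) for r in matches_cons}
    support = len(either) / len(records)
    confidence = len(both) / len(matches_body) if matches_body else 0.0
    return support, confidence

def fitness(records, body, consequent, w_sup=0.5, w_conf=0.5):
    """GA fitness as a weighted sum of support and confidence (weights assumed)."""
    sup, conf = support_confidence(records, body, consequent)
    return w_sup * sup + w_conf * conf

records = [{"a": 1, "class": "yes"}, {"a": 1, "class": "no"}, {"a": 0, "class": "yes"}]
print(fitness(records, body={"a": 1}, consequent="yes"))
```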

Page 24: Hanaa phd  presentation  14-4-2017

Visualization Phase

The expert can manage the induced rules through trust levels that enable fast trust decisions. The rules and reducts are rendered as graph nodes, edges, charts, and grids.

Flow: refined decision rules → measurement calculation for rule support → refined rules with trust levels → rendering of rules and reducts.
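A small networkx/matplotlib sketch of rendering refined rules as graph nodes coloured by trust level, in the spirit of the nodes-and-edges visualization described above; the rule names, trust values, and colour map are illustrative assumptions rather than the thesis visualization:

```python
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical refined rules with trust levels in [0, 1]
rules = {"R1": 0.95, "R2": 0.70, "R3": 0.40}

G = nx.Graph()
G.add_node("Reduct")                                   # central reduct node
for name, trust in rules.items():
    G.add_node(name, trust=trust)
    G.add_edge("Reduct", name)

# Colour rule nodes by trust level (green = high, red = low)
colors = [mcolors.to_rgba("lightgray")] + [plt.cm.RdYlGn(rules[n]) for n in rules]

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, node_color=colors, with_labels=True, node_size=1200)
plt.show()
```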

Page 25: Hanaa phd  presentation  14-4-2017

Visualization of Breast Cancer Reducts: visualization of the features of the breast data set, ordered by their occurrence over all extracted reducts.

Experimental Results

Page 26: Hanaa phd  presentation  14-4-2017

Visualization of Breast Cancer Rules: visualization of global and detailed nodes representing the refined classification rules of the breast data set (86 R, 400 R, 87000 R).

Experimental Results

Page 27: Hanaa phd  presentation  14-4-2017

Visualization of Breast Cancer Rules

Visualization of the refined breast cancer decision rules according to trust levels.

Experimental Results

Page 28: Hanaa phd  presentation  14-4-2017

Visualization of Breast Cancer Rules: navigation through the refined breast cancer decision rules according to trust levels.

Experimental Results

Page 29: Hanaa phd  presentation  14-4-2017

Visualization of Prostate Cancer Reducts: visualization of all reducts of the prostate cancer data set and all features, ordered by their occurrence in all extracted reducts.

Experimental Results

Page 30: Hanaa phd  presentation  14-4-2017

Visualization of Prostate Cancer Rules: navigation through the refined prostate cancer decision rules according to trust levels (26 R, 117 R, 22000 R).

Experimental Results

Page 31: Hanaa phd  presentation  14-4-2017

Visualization of Prostate Cancer Rules: visualization of the refined prostate cancer decision rules according to trust levels.

Experimental Results

Page 32: Hanaa phd  presentation  14-4-2017

Visualization of Prostate Cancer Rules: navigation through the refined prostate cancer decision rules according to trust levels.

Experimental Results

Page 33: Hanaa phd  presentation  14-4-2017

Performance analysis

Classification accuracy of each classifier on the Breast Diag and Prostate data sets:

Classifier | Breast Diag | Prostate
DT | 0.93 | 0.62
KNN | 0.96 | 0.66
NB | 0.92 | 0.64
RF | 0.96 | 0.67
DR | 0.98 | 0.62
ROT | 0.97 | 0.72
RSGA | 1.00 | 1.00

Hanaa Ismail Elshazly et al., "Weighted Reduct Selection Metaheuristic Based Approach for Rules Reduction and Visualization", International Conference on Computing Communication and Automation (ICCCA 2016), IEEE, Buddh Nagar, Uttar Pradesh, India, 2016.

Experimental Results

Page 34: Hanaa phd  presentation  14-4-2017

Conclusions
• We have presented an approach for knowledge-based classification and visualization of decision rules which enhances the classification process and improves the insight into the rule knowledge.
• Physicians can detect a minimal number of rules, with trust levels, to reach an efficient diagnosis of diseases.

Page 35: Hanaa phd  presentation  14-4-2017

Future Work
• The promising results of the proposed approach encourage applying it to other multidimensional data sets.
• Other dynamic visualization techniques can be applied to meet the different requirements of physicians.

Page 36: Hanaa phd  presentation  14-4-2017