
Hanaa phd presentation 14-4-2017



Page 1: Hanaa phd  presentation  14-4-2017

Intelligent Visualization of Multidimensional Data Sets

By Hanaa Ismail Elshazly
PhD Student, Department of Computer Sciences, Faculty of Computers and Information, Cairo University

Supervisors: Prof. Aboul Ella Hassanien & Prof. Abeer Mohamed El Korany

Page 2: Hanaa phd  presentation  14-4-2017

Big image: multidimensional data → reduction → visualization.

Intelligent Visualization of Multidimensional Data Sets

Dimensions: a dimension is a key descriptor, an index, by which facts can be accessed according to the value (or values) of interest.

Information visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information.

Page 3: Hanaa phd  presentation  14-4-2017

Contents
1. Introduction
2. Proposed Framework
3. Experimental Results
4. Conclusion
5. Future Work

Page 4: Hanaa phd  presentation  14-4-2017

Introduction: General

• Massive and complex data are generated every day in many fields due to the advance of hardware and software technology.

• Curse of dimensionality is a major obstacle in machine learning and data mining.

• Clinical data referring to patients’ investigations contain irrelevant attributes that degrade the classification performance.

• Visualization is important when analyzing multidimensional datasets, since it can help humans discover and understand complex relationships in data.

Page 5: Hanaa phd  presentation  14-4-2017

Introduction: Data Problems

• Data quality
• Integrating redundant data from different sources
• Mining information from heterogeneous databases
• Difficulty in training set
• Dynamic databases
• Dimensionality

Page 6: Hanaa phd  presentation  14-4-2017

Introduction: Dimensionality Reduction

• In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration via obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

• Popular search methods that are manageable in low-dimensional spaces can become totally unmanageable in high-dimensional spaces.

• The curse of dimensionality is a major obstacle in machine learning and data mining

• Reducing the dimensionality of the feature space supports successful classification; selecting the optimal feature subset can substantially improve classification performance.

Page 7: Hanaa phd  presentation  14-4-2017

Introduction: Dimensionality Reduction

Feature selection (FS) techniques (filter, wrapper, embedded) turn massive data (microarray gene expression, medical images, huge databases, finance data, sensor arrays, web documents) into reduced data.

Benefits:
• Improves the comprehensibility of the induced concepts
• Decreases dataset complexity
• Improves classification performance
• Saves resources
• Adds visualization ability
• Gives a better understanding of the extracted knowledge
• Reduces the computation requirement
• Reduces the effect of the curse of dimensionality

Page 8: Hanaa phd  presentation  14-4-2017

Introduction: The Curse of Dimensionality

The curse of dimensionality is a damaging factor:
• Computational complexity limits the applicability of ML techniques to real-world problems
• Slow learning process
• Difficulty of inducing concepts
• Decreased predictive performance
• Extra difficulty in finding potentially useful knowledge
• Difficulty in adding visualization ability: limited human capability means human inspection and interpretation of the data is not feasible
• Intractable behavior of search methods
• Conventional database management and data analysis tools are insufficient
• Storage requirements

Page 9: Hanaa phd  presentation  14-4-2017


Proposed General Framework

• Pre-processing phase
• Feature selection phase
• Classification phase
• Rules refinement phase
• Visualization phase

Page 10: Hanaa phd  presentation  14-4-2017

Proposed General Framework

• Pre-processing phase: discretization (equal binning, simplification)
• Feature selection phase: reduction (PCA, rough sets)
• Classification phase: rough set rules generation (rough set discernibility matrix)
• Rules refinement phase: reduction (entropy, GA)
• Visualization phase: nodes, edges, charts, grids

Page 11: Hanaa phd  presentation  14-4-2017

Experimental Data Sets

Data Set | Classes | Instances | Features | Source
Wisconsin Breast Cancer – Diagnosis | 2 | 569 | 32 | UCI Machine Learning Repository
Wisconsin Breast Cancer – Prognosis | 2 | 198 | 32 | UCI Machine Learning Repository
SPECTF Heart | 2 | 267 | 45 | UCI Machine Learning Repository
Lymphography | 4 | 148 | 18 | University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia
Indian Liver Patient Dataset | 2 | 583 | 11 | UCI Machine Learning Repository
Prostate | 2 | 102 | 12600 | UCI Machine Learning Repository

Page 12: Hanaa phd  presentation  14-4-2017

Pre-processing Phase

Aim: reduce the number of values of a given continuous attribute by dividing its range into intervals and replacing low-level concepts by higher-level concepts.

Techniques:
• Equal binning: transforms numerical variables into categorical counterparts.
• Simplification: rescales data into the range [1, 3].

Flow: multidimensional data → discretization → discretized data → simplification → simplified data.
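The simplification step only states that data are rescaled into the range [1, 3]; the following NumPy sketch shows one min-max rescaling that matches that description (the function name and the column-wise treatment are illustrative assumptions, not the thesis implementation):

```python
import numpy as np

def simplify(X, low=1.0, high=3.0):
    """Min-max rescale each column of X into [low, high].

    Illustrative sketch of the 'simplification' step; the slide only
    states that data are rescaled to the range [1, 3].
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    rng = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return low + (X - col_min) / rng * (high - low)

# Example: rescale a small 3-feature sample
sample = np.array([[2.0, 10.0, 0.5],
                   [4.0, 30.0, 0.9],
                   [6.0, 20.0, 0.1]])
print(simplify(sample))
```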

Page 13: Hanaa phd  presentation  14-4-2017

Pre-processing Phase: Equal Binning Algorithm

For each feature V in data D:
    divide the domain of V into k intervals of equal width,
    w = (max(V) - min(V)) / k,
    with interval boundaries min(V)+w, min(V)+2w, ..., min(V)+(k-1)w.
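A minimal Python sketch of the equal-width binning described above; using np.digitize against the k-1 interior boundaries is one convenient way to realise it, and the helper name is an assumption rather than the thesis code:

```python
import numpy as np

def equal_width_bins(values, k):
    """Discretize a 1-D feature into k equal-width intervals.

    Boundaries follow the slide: min+w, min+2w, ..., min+(k-1)w,
    with w = (max - min) / k.  Returns bin indices in 0..k-1.
    """
    values = np.asarray(values, dtype=float)
    w = (values.max() - values.min()) / k
    edges = values.min() + w * np.arange(1, k)   # the k-1 interior boundaries
    return np.digitize(values, edges)            # index of the interval per value

feature = np.array([1.2, 3.4, 5.1, 7.8, 9.9, 2.2])
print(equal_width_bins(feature, k=3))
```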

Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.

How discretization techniques influence the classification of breast cancer data (accuracy %):

Classifier | Boolean Reasoning % | Binning % | Entropy %
Naïve Bayes | 91 | 92.9 | 77.2
Decision Rules | 95.3 | 95.3 | 91.4
KNN | 94 | 94.7 | 76.1

Page 14: Hanaa phd  presentation  14-4-2017

Feature Selection Phase

Flow: simplified data → PCA (reduced data) → rough set positive regions extraction → positive regions → discernibility matrix → final reducts.

Aim: determine a minimal feature subset that best contributes to accuracy and retains high efficiency in representing the original features, while neglecting features with little contribution to the prediction process.

• PCA (Principal Component Analysis): a statistical technique useful in data compression and reduction.
• Rough sets: the main goal of rough set analysis is the induction of (learning) approximations of concepts.

Page 15: Hanaa phd  presentation  14-4-2017

Feature Selection Phase: Principal Component Analysis Algorithm
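A minimal sketch of the standard PCA steps (centre the data, compute the covariance matrix, eigendecompose, project onto the leading eigenvectors); this is the textbook procedure, offered as an assumed illustration rather than the thesis implementation:

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    Xc = X - X.mean(axis=0)                      # 1. centre the data
    cov = np.cov(Xc, rowvar=False)               # 2. covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)       # 3. eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]            # 4. sort by decreasing variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                       # 5. project onto the components

X = np.random.default_rng(0).normal(size=(100, 10))
print(pca(X, n_components=3).shape)              # (100, 3)
```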

Page 16: Hanaa phd  presentation  14-4-2017

Feature Selection Phase: PCA performance as a transformation method in Rotation Forest for chronic eye disease diagnosis

Many transformation methods have been applied in the literature, such as principal component analysis (PCA), nonparametric discriminant analysis (NDA), random projections (RP), and independent component analysis (ICA).
• PCA gave the best results due to the diversity it provides.
• PCA preserves the discriminatory features.
• PCA provided better results than those extracted through nonparametric discriminant analysis (NDA) or random projections.
• PCA was chosen as the transformation method in the following research papers:

Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Ahmad Taher Azar, "Ensemble classifiers for biomedical data: performance evaluation", 8th International Conference on Computer Engineering & Systems (ICCES), ISBN: 978-1-4799-0078-7, pp 184-189, 2013.

Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Mohamed Waly, "Chronic Eye Disease diagnosis using ensemble-based classifier", Second International Conference on Engineering and Technology (ICET), German University in Cairo, Egypt, 2014.

Page 17: Hanaa phd  presentation  14-4-2017

Feature Selection Phase

Hanaa Ismail Elshazly, Ahmad Taher Azar, Abeer Mohamed El Korany, Aboul Ella Hassanien, "Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications", International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.

Rough Sets for Reduct Generation: Discernibility Matrix

Let T = (U, C, D) be a decision table with U = {u_1, u_2, ..., u_n}. By M(T) we mean the n x n matrix (m_ij), called the discernibility matrix of T, defined as

$$
m_{ij} =
\begin{cases}
\{\, c \in C : c(u_i) \neq c(u_j) \,\} & \text{if } d(u_i) \neq d(u_j), \\
\lambda & \text{if } d(u_i) = d(u_j),
\end{cases}
\qquad (1)
$$

For any $u_i \in U$, the discernibility function is

$$
f_T(u_i) = \bigwedge \{\, t(m_{ij}) : j \in \{1, 2, \dots, n\} \,\},
\qquad (2)
$$

where

$$
t(m_{ij}) =
\begin{cases}
\bigvee m_{ij} & \text{if } m_{ij} \neq \lambda \text{ and } m_{ij} \neq \emptyset, \\
\text{false} & \text{if } m_{ij} = \emptyset, \\
\text{true} & \text{if } m_{ij} = \lambda,
\end{cases}
\qquad (3)
$$

and $\bigvee m_{ij}$ denotes the disjunction of all variables $a$ such that $a \in m_{ij}$.
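A minimal Python sketch of building the discernibility matrix exactly as in definition (1); the dictionary-based table layout and the use of None to stand in for λ are illustrative choices, not the thesis code:

```python
def discernibility_matrix(objects, decisions, attributes):
    """Entry (i, j) is the set of condition attributes on which u_i and u_j
    differ when their decisions differ, and None (playing the role of lambda)
    when the decisions agree, mirroring definition (1) above."""
    n = len(objects)
    m = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if decisions[i] != decisions[j]:
                m[i][j] = {a for a in attributes
                           if objects[i][a] != objects[j][a]}
    return m

# Toy decision table: two condition attributes, one decision
objs = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
dec = ["yes", "no", "no"]
M = discernibility_matrix(objs, dec, attributes=["a", "b"])
print(M[0][1], M[0][2])   # {'b'}  {'a', 'b'}
```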

Page 18: Hanaa phd  presentation  14-4-2017

Classification Phase

Flow: final reducts and multidimensional data → rule generation → generated rules → classification with decision rules → testing → tested instances → classified instances.

Aim: the learning algorithm, called the classifier, aims to return a set of decision rules together with a procedure that makes it possible to classify objects not found in the original decision table. Rough set rules are generated using the discernibility matrix.

Page 19: Hanaa phd  presentation  14-4-2017

Rough Set Rules Generation Algorithm

Let T = (U, C, D) be a decision table with U = {u_1, u_2, ..., u_n}, and let M(T) = (m_ij) be the n x n discernibility matrix defined in equations (1)-(3) above. Here m_ij is the set of all condition attributes that classify objects u_i and u_j into different classes, and for each u_i in U the discernibility function f_T(u_i) is the conjunction of the terms t(m_ij) over j = 1, ..., n.
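To make the rule-generation step concrete, here is a small sketch of the usual rough set recipe of reading one "IF ... THEN ..." rule per training object restricted to a reduct; the thesis procedure may differ in details such as rule shortening, so treat this as an assumed illustration:

```python
def rules_from_reduct(objects, decisions, reduct):
    """Generate 'IF ... THEN ...' decision rules restricted to a reduct.

    Each training object yields one rule over the reduct attributes;
    duplicate rules are collapsed.
    """
    rules = set()
    for obj, d in zip(objects, decisions):
        body = tuple((a, obj[a]) for a in reduct)
        rules.add((body, d))
    return [f"IF {' AND '.join(f'{a}={v}' for a, v in body)} THEN class={d}"
            for body, d in sorted(rules, key=str)]

objs = [{"a": 1, "b": 0, "c": 5}, {"a": 0, "b": 1, "c": 5}]
dec = ["yes", "no"]
for rule in rules_from_reduct(objs, dec, reduct=["a", "b"]):
    print(rule)
```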

Page 20: Hanaa phd  presentation  14-4-2017

Comparison of different classifiers against different data sets

Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.

Hanaa Ismail Elshazly et al., “Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications”, International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.

Page 21: Hanaa phd  presentation  14-4-2017

Rules Refinement Phase

Components: generated reducts, reducts evaluation (entropy), informative reduct, all rules generated, rules allocation, rule selection by GA, testing against termination criteria on multidimensional test data, classified instances, refined decision rules.

Aim: reduce the number of rules so that they can be easily visualized and presented to an expert without decreasing accuracy. Reducts are evaluated using entropy, and a GA with support and confidence as its fitness function selects the rules.

Page 22: Hanaa phd  presentation  14-4-2017

Reduct Evaluation

Decision tree algorithms rely on information gain to estimate the expected amount of information needed to classify an instance correctly; the same measure is used here to rank reducts.

Calculate the entropy of the target:

$$\text{Entropy}(T) = -\sum_{i=1}^{c} p_i \log_2 p_i,$$

where c is the number of possible values of the target.

For each reduct R_i in the set of reducts, and for each attribute X in R_i, compute

$$\text{Entropy}(T, X) = \sum_{c \in X} P(c)\, E(c)$$

and the information gain

$$\text{Gain}(T, X) = \text{Entropy}(T) - \text{Entropy}(T, X).$$

Choose the reduct R_i with the largest information gain.
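A minimal Python sketch of the entropy-based reduct evaluation above; averaging the per-attribute gains within a reduct is an assumption about how they are aggregated, since the slide only says to pick the reduct with the largest information gain:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(T) = -sum_i p_i * log2(p_i) over the target values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(values, labels):
    """Entropy(T, X) = sum over values v of X: P(v) * Entropy(T | X=v)."""
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        total += (len(subset) / n) * entropy(subset)
    return total

def reduct_gain(table, labels, reduct):
    """Average information gain of the attributes in a reduct (aggregation assumed)."""
    base = entropy(labels)
    gains = [base - conditional_entropy([row[a] for row in table], labels)
             for a in reduct]
    return sum(gains) / len(gains)

# Pick the reduct with the largest (average) information gain
table = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 1}]
labels = ["yes", "yes", "no"]
print(max([["a"], ["b"]], key=lambda r: reduct_gain(table, labels, r)))
```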

Page 23: Hanaa phd  presentation  14-4-2017

Genetic Algorithm Using Support and Confidence as Fitness Function

Body ==> Consequent [Support, Confidence]

• Consequent: represents a discovered property of the examined data.
• Support: represents the percentage of records satisfying the body or the consequent.
• Confidence: represents the percentage of records satisfying both the body and the consequent relative to those satisfying only the body.
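A minimal sketch of scoring a rule by the support and confidence defined above and combining them into a GA fitness; the equal weights, the dictionary encoding of records, and the "class" key are illustrative assumptions:

```python
def support_confidence(records, body, consequent):
    """Support and confidence of a rule 'body ==> consequent'.

    Following the slide: support counts records satisfying the body or the
    consequent; confidence relates records satisfying both to those
    satisfying the body.
    """
    matches_body = [r for r in records if all(r.get(a) == v for a, v in body.items())]
    matches_cons = [r for r in records if r.get("class") == consequent]
    both = [r for r in matches_body if r.get("class") == consequent]
    either = {id(r) for r in matches_body} | {id(r) for r in matches_cons}
    support = len(either) / len(records)
    confidence = len(both) / len(matches_body) if matches_body else 0.0
    return support, confidence

def fitness(records, body, consequent, w_sup=0.5, w_conf=0.5):
    """GA fitness as a weighted sum of support and confidence (weights assumed)."""
    sup, conf = support_confidence(records, body, consequent)
    return w_sup * sup + w_conf * conf

records = [{"a": 1, "class": "yes"}, {"a": 1, "class": "no"}, {"a": 0, "class": "yes"}]
print(fitness(records, body={"a": 1}, consequent="yes"))
```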

Page 24: Hanaa phd  presentation  14-4-2017

Visualization Phase

The expert can manage the induced rules through trust levels that enable fast trust decisions. The rules and reducts are rendered as graph nodes, edges, charts, and grids.

Flow: refined decision rules → measurement calculation for rule support → refined rules with trust levels → rendering of rules and reducts.
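A small networkx/matplotlib sketch of rendering refined rules as graph nodes coloured by trust level, in the spirit of the nodes-and-edges visualization described above; the rule names, trust values, and colour map are illustrative assumptions rather than the thesis visualization:

```python
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical refined rules with trust levels in [0, 1]
rules = {"R1": 0.95, "R2": 0.70, "R3": 0.40}

G = nx.Graph()
G.add_node("Reduct")                                   # central reduct node
for name, trust in rules.items():
    G.add_node(name, trust=trust)
    G.add_edge("Reduct", name)

# Colour rule nodes by trust level (green = high, red = low)
colors = [mcolors.to_rgba("lightgray")] + [plt.cm.RdYlGn(rules[n]) for n in rules]

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, node_color=colors, with_labels=True, node_size=1200)
plt.show()
```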

Page 25: Hanaa phd  presentation  14-4-2017

Visualization of Breast Cancer Reducts: visualization of the features of the breast data set, ordered by their occurrence over all extracted reducts.

Experimental Results

Page 26: Hanaa phd  presentation  14-4-2017

Visualization of Breast Cancer Rules: visualization of global and detailed nodes representing the refined classification rules of the breast data set (86 R, 400 R, 87000 R).

Experimental Results

Page 27: Hanaa phd  presentation  14-4-2017

Visualization of Breast Cancer Rules

Visualization of the refined breast cancer decision rules according to trust levels.

Experimental Results

Page 28: Hanaa phd  presentation  14-4-2017

Visualization of Breast Cancer Rules: navigation through the refined breast cancer decision rules according to trust levels.

Experimental Results

Page 29: Hanaa phd  presentation  14-4-2017

Visualization of Prostate Cancer Reducts: visualization of all reducts of the prostate cancer data set and all features, ordered by their occurrence in all extracted reducts.

Experimental Results

Page 30: Hanaa phd  presentation  14-4-2017

Visualization of Prostate Cancer Rules: navigation through the refined prostate cancer decision rules according to trust levels (26 R, 117 R, 22000 R).

Experimental Results

Page 31: Hanaa phd  presentation  14-4-2017

Visualization of Prostate Cancer Rules: visualization of the refined prostate cancer decision rules according to trust levels.

Experimental Results

Page 32: Hanaa phd  presentation  14-4-2017

Visualization of Prostate Cancer Rules: navigation through the refined prostate cancer decision rules according to trust levels.

Experimental Results

Page 33: Hanaa phd  presentation  14-4-2017

Performance analysis

Classification accuracy of each classifier on the Breast Diag and Prostate data sets:

Classifier | Breast Diag | Prostate
DT | 0.93 | 0.62
KNN | 0.96 | 0.66
NB | 0.92 | 0.64
RF | 0.96 | 0.67
DR | 0.98 | 0.62
ROT | 0.97 | 0.72
RSGA | 1.00 | 1.00

Hanaa Ismail Elshazly et al., "Weighted Reduct Selection Metaheuristic Based Approach for Rules Reduction and Visualization", International Conference on Computing Communication and Automation (ICCCA 2016), IEEE, Buddh Nagar, Uttar Pradesh, India, 2016.

Experimental Results

Page 34: Hanaa phd  presentation  14-4-2017

Conclusions
• We have presented an approach for knowledge-based classification and visualization of decision rules which enhances the classification process and improves the insight into the rule knowledge.
• Physicians can detect a minimal number of rules, with trust levels, to reach an efficient diagnosis of diseases.

Page 35: Hanaa phd  presentation  14-4-2017

Future Work
• The promising results of the proposed approach encourage applying it to other multidimensional data sets.
• Other dynamic visualization techniques can be applied to meet the different requirements of physicians.

Page 36: Hanaa phd  presentation  14-4-2017