Feature Selection Techniques for Software Fault Prediction (Summary)


Sungdo Gu

2015.03.27

MOTIVATION & PAPERS

What is the minimum number of software metrics (features) that should be considered for building an effective defect prediction model?

• A typical software defect prediction model is trained using software metrics and fault data collected from previously developed software releases or similar projects.

• Software quality is an important concern, and fault prediction helps concentrate testing effort on the modules most likely to be faulty.

• As software grows more complex, feature selection becomes important for removing redundant, irrelevant, and erroneous data from the dataset.

“How Many Software Metrics Should be Selected for Defect Prediction?”

“Measuring Stability of Threshold-based Feature Selection Techniques”

“A Hybrid Feature Selection Model For Software Fault Prediction”

FEATURE SELECTION TECHNIQUE

Feature Selection

: the process of choosing a subset of features.

Two broad categories:

• feature ranking
• feature subset selection

Two selection strategies:

• filter: a feature subset is selected without involving any learning algorithm.
• wrapper: feedback from a learning algorithm is used to determine which features to include in building a classification model.
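The filter/wrapper distinction can be sketched with scikit-learn (a minimal illustration on synthetic data, not the setup used in the papers): the filter scores each feature with a statistic computed independently of any learner, while the wrapper repeatedly consults a learner.

```python
# Filter vs. wrapper feature selection, sketched with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter: score each feature without any learner (here: ANOVA F-test).
filt = SelectKBest(score_func=f_classif, k=3).fit(X, y)
filter_idx = np.flatnonzero(filt.get_support())

# Wrapper: use feedback from a learner (recursive feature elimination
# around logistic regression) to decide which features survive.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
wrapper_idx = np.flatnonzero(wrap.get_support())

print("filter-selected :", filter_idx)
print("wrapper-selected:", wrapper_idx)
```

The two strategies need not agree: the filter judges features one at a time, whereas the wrapper judges them through the learner it wraps.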

SOFTWARE METRICS

A software metric is a quantitative measure of a degree to which a software system or process possesses some property.

CK metrics were designed:

• to measure unique aspects of the object-oriented approach
• to measure complexity of the design

McCabe & Halstead metrics were designed:

• to measure complexity of module-based programs.

SOFTWARE METRICS: Examples

[Table: example McCabe & Halstead metrics alongside CK metrics]

CK Metrics: Examples

WMC (Weighted Methods per Class)

Definition

• WMC is the sum of the complexities of the methods of a class.

• WMC = Number of Methods (NOM) when each method's complexity is taken as unity.

DIT (Depth of Inheritance Tree)

Definition

• The maximum length from the node to the root of the tree

CBO (Coupling Between Objects)

Definition

• CBO for a class is a count of the number of other classes to which it is coupled.
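As a toy illustration (not from the slides), DIT and NOM can be computed for Python classes by introspection; with unit method complexity, NOM coincides with WMC as defined above:

```python
# Computing DIT and NOM (= WMC with unit complexity) for Python classes.
import inspect

class Base:
    def a(self): ...

class Mid(Base):
    def b(self): ...

class Leaf(Mid):
    def c(self): ...
    def d(self): ...

def dit(cls):
    """Depth of Inheritance Tree: longest path from the class to the root."""
    bases = cls.__bases__
    if not bases or bases == (object,):
        return 0
    return 1 + max(dit(b) for b in bases)

def nom(cls):
    """Number of Methods defined directly on the class (WMC with unit weight)."""
    return sum(1 for m in vars(cls).values() if inspect.isfunction(m))

print(dit(Leaf), nom(Leaf))  # Leaf -> Mid -> Base gives DIT 2; methods c, d give NOM 2
```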

THRESHOLD-BASED FEATURE RANKING

Five versions of TBFS feature rankers, based on five different performance metrics, are considered:

• Mutual Information (MI)
• Kolmogorov-Smirnov (KS)
• Deviance (DV)
• Area Under the ROC (Receiver Operating Characteristic) Curve (AUC)
• Area Under the Precision-Recall Curve (PRC)

Threshold-Based Feature Selection technique (TBFS)

: belongs to the filter-based feature ranking category.

TBFS can be extended to additional performance metrics such as F-measure, Odds Ratio, etc.
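A minimal sketch of the AUC variant (an assumption-based reimplementation, not the authors' code): each feature is normalized to [0, 1] and treated directly as a classifier score for the positive class; its AUC against the class label, taken direction-agnostically, becomes the feature's relevance score.

```python
# Threshold-Based Feature Selection (TBFS), AUC variant, as a sketch.
import numpy as np
from sklearn.metrics import roc_auc_score

def tbfs_auc_rank(X, y):
    """Rank features by the AUC obtained when the normalized feature
    itself is used as the prediction score."""
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j].astype(float)
        rng = col.max() - col.min()
        norm = (col - col.min()) / rng if rng else np.zeros_like(col)
        auc = roc_auc_score(y, norm)
        scores.append(max(auc, 1.0 - auc))  # a low AUC also means separation
    return np.argsort(scores)[::-1]         # best feature first

X = np.array([[1, 5], [2, 4], [3, 9], [4, 1]])
y = np.array([0, 0, 1, 1])
ranking = tbfs_auc_rank(X, y)
print(ranking)  # feature 0 separates the classes perfectly, so it ranks first
```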


CLASSIFIER

Three classifiers

• Multilayer Perceptron
• k-Nearest Neighbors
• Logistic Regression

Classifier Performance Metric

→ AUC (Area Under the ROC (Receiver Operating Characteristic) Curve)

: a performance metric that measures the ability of a classifier to differentiate between the two classes.

- The AUC is a single-value measurement whose value ranges from 0 to 1.
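For a quick feel for the metric, a small hand-checkable example (numbers are illustrative, not from the papers):

```python
# AUC as a single-value summary of class separation.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted scores for class 1

# AUC equals the fraction of (negative, positive) pairs where the
# positive example receives the higher score: 3 of 4 pairs here.
print(roc_auc_score(y_true, y_score))  # 0.75
```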

SOFTWARE MEASUREMENT DATA

The software metrics & fault data were collected from a real-world software project

: Eclipse, from the PROMISE data repository.

Transform the original data by

(1) removing all non-numeric attributes

(2) converting the post-release defects attribute to a binary class attribute

: fault-prone (fp) / not-fault-prone (nfp)
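The two transformation steps can be sketched with pandas on a toy PROMISE-style table (column names here are assumptions, not the repository's exact schema):

```python
# Sketch of the preprocessing: drop non-numeric attributes, then binarize
# the post-release defect count into fp / nfp class labels.
import pandas as pd

df = pd.DataFrame({
    "filename": ["A.java", "B.java", "C.java"],  # non-numeric -> removed
    "loc": [120, 45, 300],
    "post": [0, 2, 1],                           # post-release defect count
})

numeric = df.select_dtypes(include="number").copy()          # step (1)
numeric["class"] = (numeric.pop("post") > 0).map(            # step (2)
    {True: "fp", False: "nfp"})
print(numeric)
```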

EMPIRICAL DESIGN

Rank the metrics and choose the top 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15 and 20 metrics according to their respective scores.

The defect prediction models are evaluated in terms of the AUC performance metric.

To understand the impact of:

• different sizes of feature subsets
• the five filter-based rankers
• the three different learners

on the models' predictive power, five-fold cross-validation is used.
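The design above can be sketched as a loop over subset sizes with cross-validated AUC (the ranker and learner here are scikit-learn stand-ins for those in the papers, and the data are synthetic):

```python
# Rank features, keep the top-k, and score a learner with 5-fold CV AUC.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

for k in (1, 3, 5, 10):
    # Selection happens inside the pipeline, so each CV fold re-ranks
    # features on its own training split (no selection leakage).
    model = make_pipeline(SelectKBest(f_classif, k=k),
                          LogisticRegression(max_iter=1000))
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"top-{k:2d} features: AUC = {auc:.3f}")
```

Putting the selector inside the pipeline is the design choice that matters: ranking on the full dataset before splitting would leak test information into the feature scores.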

EMPIRICAL RESULT


STABILITY (ROBUSTNESS)

The STABILITY of a feature selection method is normally defined as the degree of agreement between its outputs when applied to randomly-selected subsets of the same input data.

• To assess the robustness (stability) of feature selection techniques, the consistency index was used.

Let 𝑇𝑖 and 𝑇𝑗 be subsets of features, where |𝑇𝑖| = |𝑇𝑗| = 𝑘. The consistency index is

    𝐼𝐶(𝑇𝑖, 𝑇𝑗) = (𝑑𝑛 − 𝑘²) / (𝑘(𝑛 − 𝑘))

where 𝑛 is the total number of features in the dataset and 𝑑 is the cardinality of the intersection between subsets 𝑇𝑖 and 𝑇𝑗.

=> The greater the consistency index, the more similar the subsets are.
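A direct translation of the consistency index (Kuncheva's index, (dn − k²) / (k(n − k)) for two size-k subsets 𝑇𝑖, 𝑇𝑗 drawn from n features with intersection size d) into code:

```python
# Consistency index for measuring stability of feature selection.
def consistency_index(Ti, Tj, n):
    """Agreement between two equal-size feature subsets, corrected for
    the overlap expected by chance; 1.0 means identical subsets."""
    k = len(Ti)
    assert len(Tj) == k and 0 < k < n
    d = len(set(Ti) & set(Tj))
    return (d * n - k * k) / (k * (n - k))

print(consistency_index({1, 2, 3}, {1, 2, 3}, n=10))  # identical -> 1.0
print(consistency_index({1, 2, 3}, {4, 5, 6}, n=10))  # disjoint -> negative
```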

ADDITIONAL RESULTS

A HYBRID FEATURE SELECTION MODEL

Filter methods:

• Correlation-based Feature Selection (CFS)
• Chi-Squared
• OneR
• Gain Ratio

Wrapper methods (learners):

• Naïve Bayes
• RBF Network (Radial Basis Function Network)
• J48 (Decision Tree)
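A hybrid pipeline in this spirit can be sketched as a cheap filter stage that prunes the feature set, followed by a wrapper stage that searches the survivors with a learner. The concrete choices below (ANOVA filter standing in for the rankers above, Naïve Bayes as the wrapped learner, synthetic data) are illustrative assumptions, not the paper's exact configuration:

```python
# Hybrid feature selection: filter stage first, wrapper stage second.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

# Stage 1 (filter): keep the 10 highest-scoring features, no learner involved.
filt = SelectKBest(f_classif, k=10).fit(X, y)
X_f = filt.transform(X)

# Stage 2 (wrapper): greedy forward selection guided by cross-validated
# Naive Bayes performance on the surviving features.
wrap = SequentialFeatureSelector(GaussianNB(), n_features_to_select=4,
                                 cv=5).fit(X_f, y)
X_fw = wrap.transform(X_f)
print(X_fw.shape)  # (200, 4)
```

The filter stage keeps the expensive wrapper search tractable: the wrapper only ever evaluates subsets of the pre-pruned features.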

A HYBRID FEATURE SELECTION: RESULT


Thank you

Q & A
