Feature Selection Techniques For
Software Fault Prediction
(Summary)
Sungdo Gu
2015.03.27
MOTIVATION & PAPERS
What is the minimum number of software metrics (features) that should be considered for building an effective defect prediction model?
• A typical software defect prediction model is trained using software metrics and fault data that have been collected from previously-developed software releases or similar projects.
• Software quality is an important concern, and software fault prediction helps concentrate testing effort on the modules most likely to be faulty.
• As software grows ever more complex, feature selection is important for removing redundant, irrelevant, and erroneous data from the dataset.
“How Many Software Metrics Should be Selected for Defect Prediction?”
“Measuring Stability of Threshold-based Feature Selection Techniques”
“A Hybrid Feature Selection Model For Software Fault Prediction”
FEATURE SELECTION TECHNIQUE
Feature Selection
: the process of choosing a subset of features.
Feature selection techniques fall into two categories:
• feature ranking
• feature subset selection
Subset selection can be performed in two ways:
• filter: a feature subset is selected without involving any learning algorithm.
• wrapper: feedback from a learning algorithm is used to decide which features to include in building a classification model.
SOFTWARE METRICS
A software metric is a quantitative measure of the degree to which a software system or process possesses some property.
CK metrics were designed:
• to measure unique aspects of the Object-Oriented approach.
• to measure complexity of the design.
McCabe & Halstead metrics were designed:
• to measure complexity of module-based programs.
SOFTWARE METRICS: Examples
(Tables of example McCabe & Halstead metrics and CK metrics appeared here.)
CK Metrics: Examples
WMC (Weighted Methods per Class)
Definition
• WMC is the sum of the complexities of the methods of a class.
• WMC = Number of Methods (NOM) when every method's complexity is considered unity (1).
DIT (Depth of Inheritance Tree)
Definition
• The maximum length of the path from the class's node to the root of the inheritance tree.
CBO (Coupling Between Objects)
Definition
• A count of the number of other classes to which a given class is coupled.
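The three definitions above can be sketched as tiny functions on toy inputs; the class structures and complexity values here are made up for illustration only.

```python
def wmc(method_complexities):
    """WMC: sum of the complexities of a class's methods; equals NOM
    when every method's complexity is taken as unity."""
    return sum(method_complexities)

def dit(cls, parent_of):
    """DIT: maximum path length from the class's node to the root of
    the inheritance tree (root classes have no parent entry)."""
    depth = 0
    while cls in parent_of:
        cls = parent_of[cls]
        depth += 1
    return depth

def cbo(coupled_classes):
    """CBO: count of distinct other classes the class is coupled to."""
    return len(set(coupled_classes))

# Toy example: class C inherits from B, which inherits from root A.
parents = {"C": "B", "B": "A"}
print(wmc([1, 1, 1]))        # 3 == NOM, since all complexities are unity
print(dit("C", parents))     # 2
print(cbo(["X", "Y", "X"]))  # 2 distinct coupled classes
```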
THRESHOLD-BASED FEATURE RANKING
Five versions of TBFS feature rankers based on five different performance metrics are considered.
• Mutual Information (MI)
• Kolmogorov-Smirnov (KS)
• Deviance (DV)
• Area Under the ROC (Receiver Operating Characteristic) Curve (AUC)
• Area Under the Precision-Recall Curve (PRC)
Threshold-Based Feature Selection technique (TBFS)
: belongs to the filter-based feature ranking category.
TBFS can be extended to additional performance metrics such as F-measure, Odds Ratio, etc.
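As a rough illustration of the TBFS idea, the sketch below scores each feature by the AUC obtained when the feature's own values are used directly as classification scores, then ranks features by score. The feature names and values are invented; the actual TBFS normalizes each feature to [0, 1] and sweeps thresholds for each of the five metrics.

```python
import numpy as np

def feature_auc(x, y):
    """AUC when raw feature values serve as classification scores,
    computed via the rank-based (Mann-Whitney U) formulation."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    pos, neg = ranks[y == 1], ranks[y == 0]
    auc = (pos.sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg))
    return max(auc, 1 - auc)   # symmetric: direction of association is irrelevant

y = np.array([0, 0, 0, 1, 1])                 # nfp = 0, fp = 1 (toy labels)
features = {
    "loc":   np.array([10, 20, 15, 80, 90]),  # separates the classes well
    "noise": np.array([5, 50, 7, 6, 49]),     # little signal
}
scores = {name: feature_auc(x, y) for name, x in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # → ['loc', 'noise']
```

The top-k features of `ranking` would then be fed to a learner, as in the empirical design later in the talk.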
CLASSIFIER
Three classifiers
• Multilayer Perceptron
• k-Nearest Neighbors
• Logistic Regression
Classifier Performance Metric
→ AUC (Area Under the ROC (Receiver Operating Characteristic) Curve)
: a performance metric that captures a classifier's ability to differentiate between the two classes.
- The AUC is a single-value measurement whose value ranges from 0 to 1.
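The AUC described above can be computed directly from classifier scores: sweep a threshold over the scores, trace the ROC curve as (FPR, TPR) points, and integrate with the trapezoid rule. A minimal sketch with fabricated scores and labels:

```python
def roc_auc(scores, labels):
    """AUC via an explicit threshold sweep over the score values."""
    P = sum(labels)                 # number of positive (fp) instances
    N = len(labels) - P             # number of negative (nfp) instances
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        pts.append((fp / N, tp / P))
    pts.append((1.0, 1.0))
    # Trapezoid rule over consecutive ROC points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0: perfect ranking
print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 0, 1]))  # 0.5: no better than chance
```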
SOFTWARE MEASUREMENT DATA
The software metrics & fault data are collected from a real-world software project
: the Eclipse dataset from the PROMISE data repository.
Transform the original data by
(1) removing all non-numeric attributes
(2) converting the post-release defects attribute to a binary class attribute
: fault-prone (fp) / not-fault-prone (nfp)
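The two transformation steps can be sketched with pandas (assuming it is available); the column names below are illustrative, not the Eclipse dataset's real attribute names.

```python
import pandas as pd

# Toy stand-in for the raw metrics table.
df = pd.DataFrame({
    "filename": ["A.java", "B.java", "C.java"],   # non-numeric attribute
    "loc": [120, 45, 300],                        # a numeric metric
    "post": [0, 2, 1],                            # post-release defect count
})

# (1) remove all non-numeric attributes
df = df.select_dtypes(include="number")

# (2) convert the post-release defects attribute into a binary class:
#     any defect -> fault-prone (fp), otherwise not-fault-prone (nfp)
df["class"] = df.pop("post").apply(lambda d: "fp" if d > 0 else "nfp")
print(df)
```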
EMPIRICAL DESIGN
Rank the metrics and choose the top 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15 and 20 metrics according to their respective scores.
The defect prediction models are evaluated in terms of the AUC performance metric.
To understand the impact of
• different sizes of the feature subset
• the five filter-based rankers
• the three different learners
on the models' predictive power, five-fold cross-validation is performed.
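The experimental loop above might look like the following sketch. The ranker (absolute correlation with the class) and learner (nearest centroid) are simple stand-ins, not the five TBFS rankers or three learners of the study, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                   # 100 modules, 20 metrics
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

def rank_features(X, y):
    """Stand-in filter ranker: |correlation| with the class, best first."""
    corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(corr)[::-1]

def centroid_scores(Xtr, ytr, Xte):
    """Stand-in learner: distance to nfp centroid minus distance to fp centroid."""
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return np.linalg.norm(Xte - c0, axis=1) - np.linalg.norm(Xte - c1, axis=1)

def auc(scores, labels):
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

for k in (1, 2, 3, 5, 10, 20):                   # feature subset sizes
    top = rank_features(X, y)[:k]
    folds = np.array_split(np.arange(100), 5)    # five-fold cross-validation
    aucs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(100), te)
        s = centroid_scores(X[np.ix_(tr, top)], y[tr], X[np.ix_(te, top)])
        aucs.append(auc(s, y[te]))
    print(k, round(float(np.mean(aucs)), 2))
```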
EMPIRICAL RESULT
STABILITY (ROBUSTNESS)
The STABILITY of a feature selection method is normally defined as the degree of agreement between its outputs when applied to randomly-selected subsets of the same input data.
• To assess the robustness (stability) of feature selection techniques, the consistency index was used.
Let 𝑇𝑖 and 𝑇𝑗 be subsets of features, where |𝑇𝑖| = |𝑇𝑗| = 𝑘. The consistency index is
IC(𝑇𝑖, 𝑇𝑗) = (𝑑𝑛 − 𝑘²) / (𝑘(𝑛 − 𝑘))
where 𝑛 is the total number of features in the dataset and 𝑑 is the cardinality of the intersection between subsets 𝑇𝑖 and 𝑇𝑗.
=> The greater the consistency index, the more similar the subsets are.
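A minimal sketch of the consistency index as defined above; the toy subsets and feature count are invented.

```python
def consistency_index(Ti, Tj, n):
    """Kuncheva-style consistency index for two equal-size feature
    subsets Ti, Tj drawn from n total features."""
    k = len(Ti)
    assert len(Tj) == k, "the index assumes |Ti| == |Tj|"
    d = len(set(Ti) & set(Tj))      # size of the intersection
    return (d * n - k * k) / (k * (n - k))

# Two rankers each keep 3 of 10 features and agree on 2 of them:
print(consistency_index({1, 2, 3}, {2, 3, 7}, n=10))   # (2*10 - 9)/(3*7) ≈ 0.52
print(consistency_index({1, 2, 3}, {1, 2, 3}, n=10))   # 1.0: identical subsets
```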
ADDITIONAL RESULTS
A HYBRID FEATURE SELECTION MODEL
Filter methods:
• Correlation-based Feature Selection
• Chi-Squared
• OneR
• Gain Ratio
Wrapper methods:
• Naïve Bayes
• RBF Network (Radial Basis Function Network)
• J48 (Decision Tree)
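The hybrid filter-then-wrapper flow might be sketched as follows: a cheap filter prunes the feature set first, then a greedy forward wrapper adds features while a learner's cross-validated accuracy keeps improving. The correlation-based filter score and nearest-centroid learner are simplified stand-ins for the listed methods (CFS, Chi-Squared, Naïve Bayes, J48, etc.), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 12))                    # 80 modules, 12 metrics
y = (X[:, 0] - X[:, 3] > 0).astype(int)          # features 0 and 3 are informative

def filter_rank(X, y):
    """Filter step: score each feature independently, best first."""
    return np.argsort([-abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])

def cv_accuracy(X, y, feats):
    """Wrapper step: 4-fold CV accuracy of a nearest-centroid learner."""
    idx = np.arange(len(y))
    accs = []
    for te in np.array_split(idx, 4):
        tr = np.setdiff1d(idx, te)
        c0 = X[np.ix_(tr[y[tr] == 0], feats)].mean(0)
        c1 = X[np.ix_(tr[y[tr] == 1], feats)].mean(0)
        Xt = X[np.ix_(te, feats)]
        pred = (np.linalg.norm(Xt - c1, axis=1)
                < np.linalg.norm(Xt - c0, axis=1)).astype(int)
        accs.append((pred == y[te]).mean())
    return float(np.mean(accs))

candidates = list(filter_rank(X, y)[:6])         # filter keeps 6 of 12 features
selected, best = [], 0.0
while candidates:                                # greedy forward wrapper search
    acc, f = max((cv_accuracy(X, y, selected + [f]), f) for f in candidates)
    if acc <= best:                              # stop when no candidate helps
        break
    selected.append(f)
    candidates.remove(f)
    best = acc
print(selected, round(best, 2))
```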
A HYBRID FEATURE SELECTION: RESULT
Thank you
Q & A