Feature Selection Techniques For
Software Fault Prediction
(Summary)
Sungdo Gu
2015.03.27
MOTIVATION & PAPERS
What is the minimum number of software metrics (features) that should be considered for building an effective defect prediction model?
• A typical software defect prediction model is trained using software metrics and fault data that have been collected from previously-developed software releases or similar projects.
• Software quality is an important concern, and software fault prediction helps concentrate testing effort on the modules most likely to be faulty.
• As software grows ever more complex, feature selection is important for removing redundant, irrelevant, and erroneous data from the dataset.
“How Many Software Metrics Should be Selected for Defect Prediction?”
“Measuring Stability of Threshold-based Feature Selection Techniques”
“A Hybrid Feature Selection Model For Software Fault Prediction”
FEATURE SELECTION TECHNIQUE
Feature Selection
: the process of choosing a subset of features.
Feature selection techniques fall into two categories:
• feature ranking
• feature subset selection
Subset selection can be performed in two ways:
• filter: a feature subset is selected without involving any learning algorithm.
• wrapper: feedback from a learning algorithm is used to decide which features to include in building a classification model.
SOFTWARE METRICS
A software metric is a quantitative measure of the degree to which a software system or process possesses some property.
CK metrics were designed:
• to measure unique aspects of the Object-Oriented approach.
• to measure complexity of the design.
McCabe & Halstead metrics were designed:
• to measure complexity of module-based programs.
SOFTWARE METRICS: Examples
(Tables of example McCabe & Halstead metrics and CK metrics appeared here.)
CK Metrics: Examples
WMC (Weighted Methods per Class)
Definition
• WMC is the sum of the complexities of the methods of a class.
• WMC = Number of Methods (NOM) when every method's complexity is considered unity (1).
DIT (Depth of Inheritance Tree)
Definition
• The maximum length of the path from the class's node to the root of the inheritance tree.
CBO (Coupling Between Objects)
Definition
• A count of the number of other classes to which a given class is coupled.
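The three definitions above can be sketched as tiny functions on toy inputs; the class structures and complexity values here are made up for illustration only.

```python
def wmc(method_complexities):
    """WMC: sum of the complexities of a class's methods; equals NOM
    when every method's complexity is taken as unity."""
    return sum(method_complexities)

def dit(cls, parent_of):
    """DIT: maximum path length from the class's node to the root of
    the inheritance tree (root classes have no parent entry)."""
    depth = 0
    while cls in parent_of:
        cls = parent_of[cls]
        depth += 1
    return depth

def cbo(coupled_classes):
    """CBO: count of distinct other classes the class is coupled to."""
    return len(set(coupled_classes))

# Toy example: class C inherits from B, which inherits from root A.
parents = {"C": "B", "B": "A"}
print(wmc([1, 1, 1]))        # 3 == NOM, since all complexities are unity
print(dit("C", parents))     # 2
print(cbo(["X", "Y", "X"]))  # 2 distinct coupled classes
```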
THRESHOLD-BASED FEATURE RANKING
Five versions of TBFS feature rankers based on five different performance metrics are considered.
• Mutual Information (MI)
• Kolmogorov-Smirnov (KS)
• Deviance (DV)
• Area Under the ROC (Receiver Operating Characteristic) Curve (AUC)
• Area Under the Precision-Recall Curve (PRC)
Threshold-Based Feature Selection technique (TBFS)
: belongs to the filter-based feature ranking category.
TBFS can be extended to additional performance metrics such as F-measure, Odds Ratio, etc.
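As a rough illustration of the TBFS idea, the sketch below scores each feature by the AUC obtained when the feature's own values are used directly as classification scores, then ranks features by score. The feature names and values are invented; the actual TBFS normalizes each feature to [0, 1] and sweeps thresholds for each of the five metrics.

```python
import numpy as np

def feature_auc(x, y):
    """AUC when raw feature values serve as classification scores,
    computed via the rank-based (Mann-Whitney U) formulation."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    pos, neg = ranks[y == 1], ranks[y == 0]
    auc = (pos.sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg))
    return max(auc, 1 - auc)   # symmetric: direction of association is irrelevant

y = np.array([0, 0, 0, 1, 1])                 # nfp = 0, fp = 1 (toy labels)
features = {
    "loc":   np.array([10, 20, 15, 80, 90]),  # separates the classes well
    "noise": np.array([5, 50, 7, 6, 49]),     # little signal
}
scores = {name: feature_auc(x, y) for name, x in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # → ['loc', 'noise']
```

The top-k features of `ranking` would then be fed to a learner, as in the empirical design later in the talk.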
CLASSIFIER
Three classifiers
• Multilayer Perceptron
• k-Nearest Neighbors
• Logistic Regression
Classifier Performance Metric
→ AUC (Area Under the ROC (Receiver Operating Characteristic) Curve)
: a performance metric that captures a classifier's ability to differentiate between the two classes.
- The AUC is a single-value measurement whose value ranges from 0 to 1.
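The AUC described above can be computed directly from classifier scores: sweep a threshold over the scores, trace the ROC curve as (FPR, TPR) points, and integrate with the trapezoid rule. A minimal sketch with fabricated scores and labels:

```python
def roc_auc(scores, labels):
    """AUC via an explicit threshold sweep over the score values."""
    P = sum(labels)                 # number of positive (fp) instances
    N = len(labels) - P             # number of negative (nfp) instances
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        pts.append((fp / N, tp / P))
    pts.append((1.0, 1.0))
    # Trapezoid rule over consecutive ROC points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0: perfect ranking
print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 0, 1]))  # 0.5: no better than chance
```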
SOFTWARE MEASUREMENT DATA
The software metrics & fault data are collected from a real-world software project
: the Eclipse dataset from the PROMISE data repository.
Transform the original data by
(1) removing all non-numeric attributes
(2) converting the post-release defects attribute to a binary class attribute
: fault-prone (fp) / not-fault-prone (nfp)
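The two transformation steps can be sketched with pandas (assuming it is available); the column names below are illustrative, not the Eclipse dataset's real attribute names.

```python
import pandas as pd

# Toy stand-in for the raw metrics table.
df = pd.DataFrame({
    "filename": ["A.java", "B.java", "C.java"],   # non-numeric attribute
    "loc": [120, 45, 300],                        # a numeric metric
    "post": [0, 2, 1],                            # post-release defect count
})

# (1) remove all non-numeric attributes
df = df.select_dtypes(include="number")

# (2) convert the post-release defects attribute into a binary class:
#     any defect -> fault-prone (fp), otherwise not-fault-prone (nfp)
df["class"] = df.pop("post").apply(lambda d: "fp" if d > 0 else "nfp")
print(df)
```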
EMPIRICAL DESIGN
Rank the metrics and choose the top 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15 and 20 metrics according to their respective scores.
The defect prediction models are evaluated in terms of the AUC performance metric.
To understand the impact of
• different sizes of the feature subset
• the five filter-based rankers
• the three different learners
on the models' predictive power, five-fold cross-validation is performed.
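The experimental loop above might look like the following sketch. The ranker (absolute correlation with the class) and learner (nearest centroid) are simple stand-ins, not the five TBFS rankers or three learners of the study, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                   # 100 modules, 20 metrics
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

def rank_features(X, y):
    """Stand-in filter ranker: |correlation| with the class, best first."""
    corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(corr)[::-1]

def centroid_scores(Xtr, ytr, Xte):
    """Stand-in learner: distance to nfp centroid minus distance to fp centroid."""
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return np.linalg.norm(Xte - c0, axis=1) - np.linalg.norm(Xte - c1, axis=1)

def auc(scores, labels):
    pos, neg = scores[labels == 1], scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

for k in (1, 2, 3, 5, 10, 20):                   # feature subset sizes
    top = rank_features(X, y)[:k]
    folds = np.array_split(np.arange(100), 5)    # five-fold cross-validation
    aucs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(100), te)
        s = centroid_scores(X[np.ix_(tr, top)], y[tr], X[np.ix_(te, top)])
        aucs.append(auc(s, y[te]))
    print(k, round(float(np.mean(aucs)), 2))
```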
EMPIRICAL RESULT
STABILITY (ROBUSTNESS)
The STABILITY of a feature selection method is normally defined as the degree of agreement between its outputs when applied to randomly-selected subsets of the same input data.
• To assess the robustness (stability) of feature selection techniques, the consistency index was used.
Let 𝑇𝑖 and 𝑇𝑗 be subsets of features, where |𝑇𝑖| = |𝑇𝑗| = 𝑘. The consistency index is
IC(𝑇𝑖, 𝑇𝑗) = (𝑑𝑛 − 𝑘²) / (𝑘(𝑛 − 𝑘))
where 𝑛 is the total number of features in the dataset and 𝑑 is the cardinality of the intersection between subsets 𝑇𝑖 and 𝑇𝑗.
=> The greater the consistency index, the more similar the subsets are.
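A minimal sketch of the consistency index as defined above; the toy subsets and feature count are invented.

```python
def consistency_index(Ti, Tj, n):
    """Kuncheva-style consistency index for two equal-size feature
    subsets Ti, Tj drawn from n total features."""
    k = len(Ti)
    assert len(Tj) == k, "the index assumes |Ti| == |Tj|"
    d = len(set(Ti) & set(Tj))      # size of the intersection
    return (d * n - k * k) / (k * (n - k))

# Two rankers each keep 3 of 10 features and agree on 2 of them:
print(consistency_index({1, 2, 3}, {2, 3, 7}, n=10))   # (2*10 - 9)/(3*7) ≈ 0.52
print(consistency_index({1, 2, 3}, {1, 2, 3}, n=10))   # 1.0: identical subsets
```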
ADDITIONAL RESULTS
A HYBRID FEATURE SELECTION MODEL
Filter methods:
• Correlation-based Feature Selection
• Chi-Squared
• OneR
• Gain Ratio
Wrapper methods:
• Naïve Bayes
• RBF Network (Radial Basis Function Network)
• J48 (Decision Tree)
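The hybrid filter-then-wrapper flow might be sketched as follows: a cheap filter prunes the feature set first, then a greedy forward wrapper adds features while a learner's cross-validated accuracy keeps improving. The correlation-based filter score and nearest-centroid learner are simplified stand-ins for the listed methods (CFS, Chi-Squared, Naïve Bayes, J48, etc.), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 12))                    # 80 modules, 12 metrics
y = (X[:, 0] - X[:, 3] > 0).astype(int)          # features 0 and 3 are informative

def filter_rank(X, y):
    """Filter step: score each feature independently, best first."""
    return np.argsort([-abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])

def cv_accuracy(X, y, feats):
    """Wrapper step: 4-fold CV accuracy of a nearest-centroid learner."""
    idx = np.arange(len(y))
    accs = []
    for te in np.array_split(idx, 4):
        tr = np.setdiff1d(idx, te)
        c0 = X[np.ix_(tr[y[tr] == 0], feats)].mean(0)
        c1 = X[np.ix_(tr[y[tr] == 1], feats)].mean(0)
        Xt = X[np.ix_(te, feats)]
        pred = (np.linalg.norm(Xt - c1, axis=1)
                < np.linalg.norm(Xt - c0, axis=1)).astype(int)
        accs.append((pred == y[te]).mean())
    return float(np.mean(accs))

candidates = list(filter_rank(X, y)[:6])         # filter keeps 6 of 12 features
selected, best = [], 0.0
while candidates:                                # greedy forward wrapper search
    acc, f = max((cv_accuracy(X, y, selected + [f]), f) for f in candidates)
    if acc <= best:                              # stop when no candidate helps
        break
    selected.append(f)
    candidates.remove(f)
    best = acc
print(selected, round(best, 2))
```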
A HYBRID FEATURE SELECTION: RESULT
Thank you
Q & A