Meta-Learning and Multi-Objective Optimization to
Design Ensemble of Classifiers
Antonino A. Feitosa Neto
Informatics and Applied Mathematics Department
Federal University of Rio Grande do Norte
Natal–RN, Brazil, 59072-970
Email: antonino [email protected]
Anne M. P. Canuto
Informatics and Applied Mathematics Department
Federal University of Rio Grande do Norte
Natal, RN - Brazil, 59072-970
Email: [email protected]
Abstract—Ensembles of classifiers, or simply ensemble systems, have been proven to be efficient for pattern recognition tasks. However, their design can become a difficult task. For instance, the choice of the individual classifiers and the use of feature selection methods are very difficult to define in the design of these systems. In order to smooth out this problem, we apply meta-learning and multi-objective optimization to the choice of important parameters of ensemble systems. More specifically, this work applies meta-learning techniques to define the initial configuration of a multi-objective optimization algorithm, NSGA II. The meta-learner is used to recommend the proportion of each type of base classifier to compose the ensemble systems. NSGA II is then used to generate heterogeneous ensembles, selecting attributes, types and parameters of the base classifiers while optimizing the classification error and the bad diversity. The results are analysed using the error rate and multi-objective metrics in order to verify whether the use of meta-learning generates more accurate ensemble systems.
I. INTRODUCTION
The performance of classification systems can be increased
by the use of ensembles of classifiers [1]. An ensemble of
classifiers, or simply ensemble system, is a set of learning
algorithms (or base classifiers) whose outputs are combined to
provide the final output; the presence of a combination module
is therefore essential. An ensemble system will tend to succeed
if the outputs of the base classifiers are not correlated. In other
words, each base classifier should make mistakes on different
instances. We call this feature diversity [1]. Therefore, we need
to select base classifiers so that diversity and accuracy are as
high as possible.
Another important aspect in the design of ensemble systems
is their size, in other words, the choice of the individual
classifiers of an ensemble. In some real-world applications, the
number of classifiers required to form an ensemble with
reasonable accuracy can vary enormously. This choice is not
an easy task: if we use too few individual classifiers, the
ensemble system will not perform properly; on the other hand,
if we use too many, there will be an unnecessary use of
resources that does not lead to any improvement in the
accuracy of these systems. The problem of member selection
can be considered a search problem that aims to find the size
of the ensemble system as well as its composition (types of
classifiers).
There is a reasonable number of papers in the literature
that address either feature selection or classifier member
selection in ensembles using optimization techniques, such as
[9], [10], [17], [19], [20], [21]. However, the majority of these
papers address only one issue (either ensemble members or
features), and most of them use diversity measures as a guide
to select members or features in ensemble systems. In addition,
to the best of our knowledge, there is no work in the literature
that applies meta-learning along with optimization techniques
for optimizing ensemble systems. In this paper, we apply
meta-learning and multi-objective optimization to the choice
of important parameters (attributes and base classifiers) of
ensemble systems.
This paper is an extension of the work in [3], which analysed
how two diversity measures (good and bad diversity, proposed
in [4]) could be used to generate more accurate ensembles.
The work in [3] shows that optimizing only the error, or the
error together with bad diversity, generates more accurate
ensembles than the other possible combinations of optimization
objectives, that is, combinations of error, bad and good
diversity. However, in [3], we only used multi-objective
optimization algorithms in the design of ensemble systems.
As a consequence, we observed that the initial population of
the NSGA II algorithm had an important role in its final
solution. Therefore, aiming to improve even further the
performance of the obtained ensemble systems, we decided to
include meta-learning in the design of ensemble systems. The
main objective of this work is to verify how the initial
configuration of ensemble systems affects the outcome of the
optimization algorithm. For this purpose, we use meta-learning
techniques to define the initial configuration of ensemble
systems. The hypothesis is that, starting from a better-adapted
population, the optimization will converge more efficiently,
that is, to more accurate and more diversified ensemble
systems.
This paper is organized as follows. Section II describes the
research works related to the subject of this paper, and Section
III presents ensemble systems and diversity. Section IV
presents the methodology used in the experimental work,
while an analysis of the results of the empirical investigation
is shown in Section V. Finally, Section VI presents the final
remarks of this paper.
II. RELATED WORK
Several studies have analysed one important aspect of
designing ensemble systems, either feature or member selection.
For instance, work on the selection of individual classifiers
(size and type) to compose ensemble systems has focused on
finding the most relevant subset of classifiers in order to
improve the combined performance. There are some works
reported in the literature that use selection methods in these
systems, such as [2], [21], [22], [5]. Nowadays, one of the
most attractive selection methods is based on genetic
algorithms, due to the possibility of dealing with a population
of solutions rather than only one solution [19], [18].
On the other hand, several authors have investigated the use
of feature selection methods in ensembles, such as in [23],
[20], [17], [9], [24], [10]. Optimization techniques have been
used to automate the search for the optimum attribute subsets,
and several authors have recently investigated genetic
algorithms (GA) to design ensembles of classifiers [2], [9].
In the context of meta-learning, the work in [6] presents
different ways in which meta-learning can be applied to
classification problems. The results generated by meta-learning
are compared with results where the initial configurations are
generated randomly, each type of classification system having
the same chance of being chosen. The use of meta-learning
for ensemble systems has been little explored in the literature,
with only two papers [8], [7], both of which apply meta-learning
to directly recommend parameters for ensemble systems.
Thus, unlike most of the previous works reviewed, this paper
combines meta-learning and optimization techniques for
optimizing ensemble systems. In addition, it applies the
optimization technique to both ensemble components (individual
classifiers and features), and uses meta-learning to define the
initial configurations (the initial population of the genetic
algorithm) of the optimization technique.
III. ENSEMBLE SYSTEMS
Ensemble systems, also known as multi-classifier systems
or fusion of experts, exploit the idea that different classifiers
can offer complementary information about patterns to be
classified, thereby improving the effectiveness of the overall
recognition process [1]. These systems are composed of a
set of $N$ individual classifiers (ICs), organized in parallel,
that receive the input patterns and send their outputs to a
combination module, which is responsible for providing the
final output of the system. Therefore, unlabelled instances
$\{x_i \in \mathbb{R}^d \mid i = 1, 2, \dots, n\}$ are presented
to all individual classifiers, and a combination method combines
their outputs to produce the overall output of the system,
$O = Comb(y_j)$, with $y_j = (y_{j1}, \dots, y_{jl})$ and
$j = 1, \dots, N$, where $N$ is the number of individual
classifiers and $l$ is the number of labels of the dataset. In
ensemble systems, the main aim is that the individual components
offer complementary information about an instance, and that this
complementary information leads to an improvement in the
effectiveness of the whole recognition process [1].
In the design of ensemble systems, two main issues are
important: the ensemble components and the combination
method that will be used. In relation to the first issue, the
members of an ensemble are chosen and implemented. As
mentioned previously, the correct choice of the set of individual
classifiers is fundamental to the overall performance of an
ensemble. The ideal situation is a set of individual classifiers
with uncorrelated errors; they would be combined in such a
way as to minimize the effect of these failures. That is, the
base classifiers should be diverse among themselves.
Once a set of classifiers has been created and selected, the
next step is to choose an effective way of combining their
outputs. The choice of the best combination method for an
ensemble requires exhaustive training and testing; it is very
important and difficult to achieve. There is a reasonable number
of combination methods reported in the literature [1]. In this
paper, the focus is on the simplest way to fuse the outputs of
the individual classifiers, so we used majority vote.
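To make the combiner concrete, the following is a minimal sketch of majority voting in Python (an illustration of the technique, not the authors' implementation); it assumes each base classifier outputs one label per instance.

```python
import numpy as np

def majority_vote(predictions):
    """Combine base-classifier outputs by majority (plurality) vote.

    predictions: (N, n) array with the label predicted by each of the
    N base classifiers for each of the n instances.
    """
    predictions = np.asarray(predictions)
    combined = np.empty(predictions.shape[1], dtype=predictions.dtype)
    for i in range(predictions.shape[1]):
        labels, counts = np.unique(predictions[:, i], return_counts=True)
        combined[i] = labels[np.argmax(counts)]  # most-voted label wins
    return combined

# Three classifiers voting on four instances.
votes = np.array([[0, 1, 1, 2],
                  [0, 1, 0, 2],
                  [1, 1, 0, 0]])
print(majority_vote(votes))  # -> [0 1 0 2]
```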
A. Diversity in Ensemble Systems
In an ensemble system, diversity can be enhanced by building
the individual classifiers under different views of the problem,
as follows.
• Parameter settings: the use of different initial parameter
settings for the individual classifiers can increase the diversity
among them, increasing the diversity of the whole system;
• Training datasets: the use of feature selection techniques
or learning strategies (for instance, Bagging and Boosting)
makes the individual classifiers have different views of the
same problem, enhancing the diversity of the ensembles;
• Classifier types: the use of different types of classification
algorithms (heterogeneous structures) can increase the
diversity of the ensemble system.
There are different diversity measures available from different
fields of research. However, as there is to date no consensus
on a formal definition of the term, measuring diversity is a
non-trivial task. This issue has been addressed by some authors
[2]. In a recent work [4], the authors adopted the perspective
that a diversity measure should be naturally derived as a direct
consequence of two main factors: the loss function of interest
and the combiner function. Based on this, they proposed a
decomposition of the classification error of ensemble systems
using the majority vote combiner into three terms: individual
accuracy, good diversity, and bad diversity. These diversity
terms are based on the number of votes when an ensemble
makes a decision. Good diversity, Eq. (1), and bad diversity,
Eq. (2), are defined as:
$$D^{+} = \frac{1}{N} \sum_{i=1}^{\#P^{+}} v_i^{-} \qquad (1)$$

$$D^{-} = \frac{1}{N} \sum_{i=1}^{\#P^{-}} v_i^{+} \qquad (2)$$
where $N$ is the number of ensemble members, $v_i^{+}$ is the
number of correct votes for instance $i$, $v_i^{-}$ is the number
of incorrect votes for instance $i$, $P^{+}$ is the set of instances
classified correctly by the ensemble, and $P^{-}$ is the set of
instances classified incorrectly by the ensemble [4].
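As an illustration, Eqs. (1) and (2) can be computed directly from the vote counts. The sketch below is our reading of [4], not code from the paper; it assumes class labels encoded as non-negative integers and a plurality-vote ensemble (ties broken towards the smallest label).

```python
import numpy as np

def good_bad_diversity(predictions, y_true):
    """Good (D+) and bad (D-) diversity, following Eqs. (1) and (2).

    predictions: (N, n) integer array of base-classifier labels.
    y_true: (n,) integer array of true labels.
    """
    predictions = np.asarray(predictions, dtype=int)
    y_true = np.asarray(y_true, dtype=int)
    N = predictions.shape[0]

    # Majority-vote decision per instance (ties -> smallest label).
    ensemble = np.array([np.bincount(col).argmax() for col in predictions.T])

    v_plus = (predictions == y_true).sum(axis=0)  # correct votes per instance
    v_minus = N - v_plus                          # incorrect votes per instance
    in_p_plus = ensemble == y_true                # instances in P+
    d_plus = v_minus[in_p_plus].sum() / N         # disagreement that did no harm
    d_minus = v_plus[~in_p_plus].sum() / N        # correct votes that were wasted
    return d_plus, d_minus
```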
IV. EXPERIMENTAL ANALYSIS
We performed an empirical investigation to assess how
meta-learning can affect the performance of the optimization
technique. In this experiment, meta-learning is responsible for
defining the initial configuration of the heterogeneous ensembles
that constitute the initial solutions of the optimization technique.
We use NSGA II as the optimization technique; in this case, the
meta-learner suggests possible configurations to be used by the
NSGA II algorithm.
A. Meta-learning
Meta-learning is the first phase of our methodology; it
recommends initial configurations of the ensemble systems to
be used by the optimization algorithm. Although the optimization
technique is applied to both feature selection and base
classifiers, for simplicity the meta-learning module only
recommends the base classifiers to compose the ensemble
systems. More specifically, it defines the proportion of each
of the three learning algorithms (k-NN, Decision Tree and
Naive Bayes classifier) to compose the ensemble. This
recommendation is taken into consideration by the optimization
technique as a sort of a priori information in the creation of
the initial population.
The meta-learner is built from 540 datasetoids [12] generated
from some of the datasets presented in Table I, almost all of
which were taken from the UCI repository [13]. The datasets
not taken from UCI are Gaussian, Simulated and Jude; these
are synthetic databases that simulate microarray data and were
created to test machine learning algorithms in gene expression
analysis [14]. It is important to emphasize that we do not use
the same datasets for the meta-learning and for the performance
evaluation (presented in the next section): 27 datasets are used
in the meta-learning process and 17 in the performance
evaluation. Table II presents the datasets used in the performance
evaluation.
Each dataset is evaluated 10 times for each learning algorithm
(k-NN, Decision Tree and Naive Bayes classifier) under 10-fold
cross-validation, making a total of 100 runs. The best learning
algorithm for each run is then selected. After all runs, we count
the number of times each learning algorithm was the winner.
These counts are transformed into the posterior probability
distribution used as the proportion of each learning algorithm
in the ensemble systems.
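For instance, the transformation from win counts to proportions is a simple normalization; the counts below are placeholders, not results from the paper.

```python
from collections import Counter

# Hypothetical win counts over the 100 runs for one datasetoid.
wins = Counter({"knn": 46, "decision_tree": 38, "naive_bayes": 16})

total = sum(wins.values())
proportions = {algo: count / total for algo, count in wins.items()}
print(proportions)  # {'knn': 0.46, 'decision_tree': 0.38, 'naive_bayes': 0.16}
```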
TABLE I
DATASETS USED IN THE META-LEARNING

Annealing             Hepatitis        SoybeanLarge
Arrhythmia            Horse Colic      SpamBase
Audiology             Ionosphere       SPECTHeart
Balance               Iris             SPECTFHeart
BreastTissue          Jude             StatlogAustralian
Car                   KRKPA7           StatlogGerman
ClimateModel          Labor            StatlogHeart
CongressionalVoting   LibrasMovement   Transfusion
Credit Approval       Parkinsons       Vehicle
CylinderBands         PittsburghV1     Vowel
Dermatology           PittsburghV2     Waveform
Ecoli                 PlanningRelax    Wine
Flags                 Promoter         WisconsinDiagnostic
Glass                 Protein          WisconsinOriginal
HeartCleveland        Segment          WisconsinPrognostic
HeartHungarian        Sick             Zoo
HeartLongBeach        Simulated
HeartWitzerland       Sonar
In order to avoid losing diversity in the optimization
algorithm, a randomized selection of the initial population is
performed, in which the average performance of the learning
algorithms is used as the probability of selecting the proportion
recommended by the meta-learner. For instance, if the average
performance of the learning algorithms is 85%, the initial
population will contain 85% of individuals created based on
the recommendation of the meta-learner and 15% of individuals
generated randomly. In addition, as the meta-learner recommends
only the proportion of each learning algorithm, the individuals
are created using this recommendation with a margin of 10%.
For instance, if the meta-learner recommends 50% of k-NN,
we create individuals with 45% to 55% of k-NN.
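This seeding procedure can be sketched as follows; the code is a simplified reading of the description above, with hypothetical function names, and the rounding details are our own choice.

```python
import random

def seed_individual(recommended, margin=0.10, size=30):
    """Draw an ensemble composition around the recommended proportions,
    perturbing each proportion by up to +/- margin and renormalizing."""
    noisy = {a: max(0.0, p + random.uniform(-margin, margin))
             for a, p in recommended.items()}
    total = sum(noisy.values())
    members = []
    for algo, p in noisy.items():
        members += [algo] * round(size * p / total)
    while len(members) < size:                 # fix rounding shortfalls
        members.append(random.choice(list(noisy)))
    random.shuffle(members)
    return members[:size]

def initial_population(recommended, avg_accuracy, pop_size=30, size=30):
    """Mix seeded and random individuals: with avg_accuracy = 0.85,
    roughly 85% follow the meta-learner and 15% are fully random."""
    algos = list(recommended) + ["absent"]
    return [seed_individual(recommended, size=size)
            if random.random() < avg_accuracy
            else [random.choice(algos) for _ in range(size)]
            for _ in range(pop_size)]

# Example: initial_population({"knn": 0.5, "decision_tree": 0.3,
#                              "naive_bayes": 0.2}, avg_accuracy=0.85)
```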
The meta-dataset consists of 17 meta-features. The meta-features
are computed from the dataset according to the definitions
below, where Ca is the set of categorical attributes, Na is the
set of numeric attributes and C is the class attribute. For
example, the first meta-feature, Attributes by Instances, is
calculated as the quotient between the number of attributes
and the number of instances.
1) Attributes by Instances: #attributes/#instances
2) Smaller Class by Bigger Class: #smallerClassSet/#biggerClassSet
3) Categorical by Attributes: #Ca/#attributes
4) Numerical by Attributes: #Na/#attributes
5) Class Entropy: entropy(C)
6) Attributes Entropy: mean({x | y ∈ Ca ∧ x = entropy(y)})
7) Mutual Information: mean({x | y ∈ Ca ∧ x = mutualInformation(y, C)})
8) Conditional Entropy: mean({x | y ∈ Ca ∧ x = conditionalEntropy(y, C)})
9) Joint Entropy: mean({x | y ∈ Ca ∧ x = jointEntropy(y, C)})
10) Signal to Noise: (attributesEntropy − mutualInformation)/mutualInformation
11) High Correlation: (#{x | y ∈ Na ∧ x = correlation(y) > 0.25})/#Na
12) Low Correlation: (#{x | y ∈ Na ∧ x = correlation(y) < −0.25})/#Na
13) Neutral Correlation: (#{x | y ∈ Na ∧ x = |correlation(y)| < 0.25})/#Na
14) Positive Skew: (#{x | y ∈ Na ∧ x = skew(y) ≥ 0})/#Na
15) Negative Skew: (#{x | y ∈ Na ∧ x = skew(y) ≤ 0})/#Na
16) Positive Kurtosis: (#{x | y ∈ Na ∧ x = kurtosis(y) ≥ 3})/#Na
17) Negative Kurtosis: (#{x | y ∈ Na ∧ x = kurtosis(y) ≤ 3})/#Na
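A few of the unambiguous meta-features above could be computed as in the sketch below (our own pandas/SciPy illustration; the paper does not specify, for example, what correlation(y) is measured against, so the correlation-based features are omitted).

```python
import numpy as np
import pandas as pd
from scipy import stats

def some_meta_features(df, label_col):
    """Subset of the 17 meta-features for a dataset held in a DataFrame."""
    y = df[label_col]
    X = df.drop(columns=[label_col])
    numeric = X.select_dtypes(include="number")
    class_p = y.value_counts(normalize=True)

    return {
        "attributes_by_instances": X.shape[1] / len(df),
        "smaller_by_bigger_class": y.value_counts().min() / y.value_counts().max(),
        "numerical_by_attributes": numeric.shape[1] / X.shape[1],
        "class_entropy": float(-(class_p * np.log2(class_p)).sum()),
        "positive_skew": (numeric.apply(stats.skew) >= 0).mean(),
        # Pearson kurtosis (normal distribution = 3), as in feature 16.
        "positive_kurtosis": (numeric.apply(stats.kurtosis, fisher=False) >= 3).mean(),
    }
```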
B. Optimization Algorithm
As already mentioned, we use two minimization objectives:
classification error and bad diversity. Each individual
(chromosome) is a heterogeneous ensemble that describes the
set of attributes, the learning algorithms and their main
parameters. The possible types of learning algorithm are k-NN,
Decision Tree and Naive Bayes classifier (a description of
these methods can be found in [11]), plus absence (indicating
that the base classifier is not part of the ensemble). For k-NN,
the only considered parameter is k, which varies in [1, 10] ⊂ N.
The Decision Tree can modify the minimum number of
instances in a leaf node, which also varies in [1, 10] ⊂ N. The
Naive Bayes classifier can modify the way in which numerical
attributes are treated; in this case, there are three possibilities:
Gaussian distribution, entropy discretization, or a kernel of
Gaussian distributions.
In the NSGA II algorithm, we use the following parameters:
a population of 30 individuals; a uniform mutation operator at
a rate of 10%, applied to select or deselect an attribute, and to
modify the learning type and/or parameter of a base classifier,
including its removal; and a two-point crossover operator at a
rate of 80%. Finally, as a stopping criterion, we adopt a
maximum of 100 epochs or a convergence criterion, that is,
when 1/4 of the population is equal with respect to the
optimization objectives.
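The operators can be sketched over a simple chromosome encoding. This encoding and the helper names are our illustration under stated assumptions; the paper does not publish its implementation.

```python
import random

ALGOS = ["knn", "decision_tree", "naive_bayes", "absent"]  # "absent" removes the member

def random_gene(n_attrs):
    """One classifier slot: learner type, its integer parameter in [1, 10],
    and a binary attribute-selection mask (hypothetical encoding)."""
    return {"type": random.choice(ALGOS),
            "param": random.randint(1, 10),
            "mask": [random.randint(0, 1) for _ in range(n_attrs)]}

def mutate(chrom, rate=0.10):
    """Uniform mutation: with probability `rate` per gene, re-draw the
    learner type or parameter, or flip one attribute-selection bit."""
    for gene in chrom:
        if random.random() < rate:
            field = random.choice(["type", "param", "mask"])
            if field == "type":
                gene["type"] = random.choice(ALGOS)
            elif field == "param":
                gene["param"] = random.randint(1, 10)
            else:
                gene["mask"][random.randrange(len(gene["mask"]))] ^= 1

def two_point_crossover(a, b):
    """Classic two-point crossover over the list of genes."""
    i, j = sorted(random.sample(range(len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]
```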
For comparison purposes, we generate three types of initial
configurations: one generated by meta-learning and two defined
arbitrarily. In all cases, the set of parameters and attributes of
each base classifier is chosen randomly and uniformly.
C. Methods and Materials
As already mentioned, a subset of the datasets listed in
Table I is used for the assessment of the experimental scenarios;
these datasets are presented in Table II. In the comparative
analysis, the experimental scenarios consist of different ways
to recommend the initial solutions of the NSGA II algorithm.
The experimental scenarios are listed below, and a sketch of
how the corresponding initial populations can be generated
follows the list:
1) Meta: the optimization algorithm creates an initial
population where each individual represents an ensemble
with 30 base classifiers, using the recommendation made
by the meta-learner. The meta-learner determines the
proportions generated by k-NN, Decision Tree and Naive
Bayes classifier.
2) Rand: it generates an initial population where each base
classifier is chosen randomly and uniformly among k-NN,
Decision Tree, Naive Bayes and absent (indicating that
this base classifier is not active). In this scenario, we can
have individuals that represent homogeneous and
heterogeneous ensembles of different sizes.
3) Equal: it generates an initial population where each
individual represents an ensemble of 30 base classifiers,
of which 10 are generated by k-NN, 10 by Decision Tree
and 10 by Naive Bayes classifier. In this case, differences
in the initial parameters and attribute distributions
distinguish the individuals.
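For the Meta scenario, the initial population is seeded as sketched in Section IV-A; the two arbitrary baselines could be generated as follows (illustrative code with our own names, not the paper's implementation).

```python
import random

ALGOS = ["knn", "decision_tree", "naive_bayes"]

def equal_individual(size=30):
    """Equal scenario: 10 base classifiers of each type, shuffled."""
    members = [a for a in ALGOS for _ in range(size // len(ALGOS))]
    random.shuffle(members)
    return members

def rand_individual(size=30):
    """Rand scenario: each slot drawn uniformly; 'absent' slots let
    ensemble size and composition vary across individuals."""
    return [random.choice(ALGOS + ["absent"]) for _ in range(size)]
```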
In order to compare the results obtained in the different
experimental scenarios, statistical tests are applied: the Friedman
test, coupled with the Wilcoxon test [15].
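Both tests are available in SciPy; a minimal illustration with placeholder error-rate vectors (one value per dataset and scenario, not the paper's numbers) is:

```python
from scipy import stats

# Placeholder error rates on five datasets for the three scenarios.
meta  = [0.078, 0.203, 0.156, 0.181, 0.090]
rand  = [0.087, 0.246, 0.211, 0.264, 0.094]
equal = [0.079, 0.211, 0.153, 0.201, 0.103]

print(stats.friedmanchisquare(meta, rand, equal))  # overall difference
print(stats.wilcoxon(meta, rand))                  # pairwise follow-up
```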
V. RESULTS
In order to validate the feasibility of the proposed
methodology, an empirical analysis was conducted, and this
section presents the obtained results. This empirical analysis
was done in two parts: in the first, we analyse the error rates
of the obtained ensemble systems; in the second, we analyse
the experimental scenarios with respect to multi-objective
metrics. These two parts are described in the next two
subsections.
A. Error rate of the Ensemble Systems
This first analysis is done in terms of the accuracy (error
rate, or classification error) of the ensemble systems produced
by the non-dominated solutions of the optimization technique.
In this case, the solution with the lowest error rate among all
non-dominated solutions is selected. In addition, we compare
the accuracy of all three scenarios, Meta, Rand and Equal.
Table II presents the percentage classification error for the
three experimental scenarios; for each dataset, the scenario
that achieves the lowest error rate is highlighted in bold.
Analysing Table II, it can be observed that our approach
(Meta) achieved the best performance (the lowest error rate)
in more than 70% (12 out of 17) of the analysed datasets. The
Equal scenario achieved the lowest error rate in the remaining
datasets, which means that the Rand scenario did not achieve
the lowest error rate in any dataset.
In order to evaluate these results from a statistical point of
view, the Friedman test [15] was applied. The results indicate
a significant difference, p-value = 0.0015 (using a significance
level of 0.05). This means that the performances of the three
experimental scenarios are different, from a statistical point of
view.
TABLE II
CLASSIFICATION ERROR OF THE ENSEMBLE SYSTEMS

Datasets                   Equals           Rand             Meta
Balance                    0.0790±0.0001    0.0872±0.0003    0.0784±0.0001
BreastTissue               0.2113±0.0002    0.2462±0.0002    0.2028±0.0002
Ecoli                      0.1533±0.0002    0.2113±0.0008    0.1563±0.0001
GlassIdentification        0.2009±0.0007    0.2640±0.0017    0.1813±0.0003
HeartDiseaseLongBeachVA    0.4255±0.0003    0.4565±0.0007    0.4235±0.0003
HeartDiseaseWitzerland     0.3244±0.0002    0.3382±0.0003    0.3220±0.0001
Hepatitis                  0.1032±0.0016    0.0942±0.0019    0.0903±0.0018
Iris                       0.0293±0.0001    0.0327±0.0001    0.0293±0.0001
Labor                      0.0000±0.0000    0.0000±0.0000    0.0000±0.0000
LungCancer                 0.2813±0.0007    0.2875±0.0006    0.2969±0.0014
PittsburghBridgesV1        0.2029±0.0001    0.2362±0.0002    0.2048±0.0002
PittsburghBridgesV2        0.2324±0.0001    0.2724±0.0004    0.2305±0.0001
PlanningRelax              0.2599±0.0004    0.2703±0.0002    0.2632±0.0003
SPECTHeart                 0.2313±0.0004    0.2350±0.0002    0.2325±0.0001
Transfusion                0.2274±0.0001    0.2303±0.0001    0.2293±0.0001
Wine                       0.0000±0.0000    0.0011±0.0001    0.0011±0.0001
Zoo                        0.0158±0.0001    0.0129±0.0001    0.0158±0.0001
We then applied the Wilcoxon test [15] to evaluate the
statistical difference for each pair of experimental scenarios,
considering all datasets at once. The results are p-value = 1.0,
p-value = 0.4282 and p-value = 0.4181 when comparing Meta
x Equal, Meta x Rand and Rand x Equal, respectively.
Therefore, the Wilcoxon test failed to identify a significant
difference between the experimental scenarios, and we cannot
state which scenario presented the lowest classification error,
from a statistical point of view.
As the Wilcoxon test did not detect statistical differences
when considering all datasets, we applied a statistical test to
each dataset separately; these results are presented in Table III.
In this table, the results are presented as "CaseA x CaseB",
corresponding to the application of the Mann-Whitney test [15]
comparing samples of CaseA against CaseB. In all tests, a
significance level of 0.05 is adopted.
TABLE III
THE RESULTS OF THE WILCOXON TEST: CLASSIFICATION ERROR

Dataset            Meta x Equals    Meta x Rand    Rand x Equals
Balance            0.4497           0.2568         0.5453
BreastTissue       0.2265           0.0002         0.0003
Ecoli              0.1509           0.0002         0.0002
Glass              0.0757           0.0002         0.0010
HeartLongBeach     0.9397           0.0082         0.0073
HeartWitzerland    0.5708           0.0343         0.1306
Hepatitis          0.2730           0.4963         0.7337
Iris               1.0000           0.1509         0.1509
Labor              1.0000           1.0000         1.0000
LungCancer         0.3847           0.6501         0.5967
PittsburghV1       0.6501           0.0005         0.0003
PittsburghV2       0.7624           0.0003         0.0003
PlanningRelax      0.7337           0.4057         0.2123
SPECTHeart         0.7055           0.5453         0.7055
Transfusion        0.4274           0.4963         0.0284
Wine               0.4497           1.0000         0.4497
Zoo                1.0000           0.3643         0.3643
Comparing the p-values in Table III, we observe that no
significant difference between Meta and Equal was detected
for any dataset. However, when comparing Meta x Rand and
Rand x Equal, there are statistical differences in about 42%
of the datasets (7 out of 17) for both comparisons. These
differences indicate that the use of meta-learning can provide
more accurate ensemble systems in around half of the analysed
cases, which is an interesting result for our methodology.
B. Multi-objective Analysis
When using multi-objective optimization techniques, one of
the biggest challenges is to compare the outcomes provided by
these techniques. In general, the outcome of a multi-objective
optimization is a set of non-dominated points, called an
approximation set. In this paper, we apply the dominance
ranking approach, which yields general statements about the
relative performance of multi-objective optimizers fairly
independently of preference information [16]. This approach
is recommended as the first step in any comparison. If
conclusions cannot be drawn based on this approach alone
(p-values higher than the significance level), other measures,
named quality indicators, are applied to point out differences
among the sets generated by the different stochastic optimizers.
Moreover, we apply Pareto-compliant indicators, i.e., those for
which, whenever an approximation set A is preferable to
another set B with respect to weak Pareto dominance, the
indicator value of A is at least as good as that of B. The concept
of weak dominance states that a solution z1 weakly dominates
another solution z2 if z1 is not worse than z2 in any objective.
The Pareto-compliant quality indicators multiplicative binary-ε
and hypervolume are used here when significant differences
are not found with the dominance ranking approach. The
ε-indicator gives the factor by which an approximation set is
worse than another with respect to all objectives, while the
hypervolume measures the portion of the objective space that
is weakly dominated by an approximation set.
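For two minimization objectives these notions are easy to state in code. The sketch below is our simplified illustration, not the assessment tool of [16]; the multiplicative ε-indicator assumes strictly positive objective values.

```python
def weakly_dominates(z1, z2):
    """z1 weakly dominates z2 if it is not worse in any objective
    (minimization assumed)."""
    return all(a <= b for a, b in zip(z1, z2))

def multiplicative_epsilon(A, B):
    """Smallest factor by which set A must be scaled up so that every
    point of B is weakly dominated by some point of A."""
    return max(min(max(ai / bi for ai, bi in zip(a, b)) for a in A)
               for b in B)

def hypervolume_2d(points, ref):
    """Area weakly dominated by a 2-objective approximation set,
    bounded by the reference point ref (worse in both objectives)."""
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(points):   # sweep by the first objective
        if y < prev_y:            # skip dominated points
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

# Example: front = [(1, 3), (2, 2), (3, 1)]
# hypervolume_2d(front, (4, 4)) == 6.0
```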
Table IV presents the results of the Friedman test comparing
all three experimental scenarios with respect to the
multi-objective metrics. As already mentioned, the binary-ε
and hypervolume criteria are analysed if and only if no
statistical difference is detected in the dominance ranking [16].
TABLE IV
MULTI-OBJECTIVE ANALYSIS

Datasets           Dominance Ranking    Hypervolume    binary-ε
Balance            0.3284               0.3284         0.0002
BreastTissue       0.7205               0.0006         0.0004
Ecoli              0.3359               0.0008         0.0002
Glass              0.0370               -              -
HeartLongBeach     0.3679               0.0014         0.0002
HeartWitzerland    0.1357               0.0007         0.0159
Hepatitis          0.1421               0.0138         0.0013
Iris               0.3983               0.3050         0.0001
Labor              1.0000               1.0000         0.0018
LungCancer         0.0004               -              -
PittsburghV1       0.0191               -              -
PittsburghV2       0.0101               -              -
PlanningRelax      0.1546               0.0002         0.0016
SPECTHeart         0.0232               -              -
Transfusion        0.3679               0.0003         0.5119
Zoo                0.7584               0.0145         0.0033
Wine               0.2434               0.2434         0.0040
In the multi-objective context (Table IV), we can observe
that the dominance ranking did not detect statistical significance
in more than 70% of the cases (12 out of 17). Of these cases,
significant differences were also not detected for four datasets,
Balance, Iris, Labor and Wine, when analysing the hypervolume.
The results in Table IV show that the three scenarios had
similar performance, since most of the multi-objective metrics
proved not to be statistically significant. In the cases where
the results are statistically significant (5 for dominance ranking,
12 for hypervolume and 17 for binary-ε), we applied the
Wilcoxon test. As a result, we noticed a mixture of winners
(best values) among the three experimental scenarios. Once
again, this is an indication that the experimental scenarios
provided similar performance, from a statistical point of view.
VI. FINAL REMARKS
This paper applied meta-learning techniques to recommend
the initial configurations for the optimization of ensemble
systems using the NSGA II algorithm. The idea is to use
meta-learning as a first step of the optimization algorithm,
helping in the selection of the initial population. In order to
evaluate the performance of the proposed methodology, an
empirical investigation was conducted. In this investigation,
three experimental scenarios were used, Rand, Meta and Equal,
consisting of different ways to recommend the initial solutions
of the NSGA II algorithm.
When analysing the error rates of the obtained ensemble
systems, the Meta scenario achieved the lowest error rate in
more than 70% of the datasets, and it proved to be statistically
significant in around 42% of the datasets when compared with
the Rand scenario. The multi-objective analysis indicates that
all three experimental scenarios provided similar performance,
mainly for the dominance ranking parameter.
The obtained results might not be as good as we expected,
mainly for the multi-objective parameters. However, these
results do not exclude the possibility of using meta-learning
to generate more accurate ensembles. We believe that the
recommendation of only one parameter did not provide
sufficient information for the optimization technique, which
caused the different scenarios to have similar performance.
Nonetheless, as on-going research, we are analysing the use
of meta-learning to recommend other ensemble parameters.
REFERENCES
[1] Kuncheva, L. I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley (2004)
[2] Kuncheva, L., Whitaker, C. J.: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, vol. 51, pp. 181–207, Springer (2003)
[3] Feitosa Neto, A., Canuto, A., Ludermir, T. B.: Using good and bad diversity measures in the design of ensemble systems: A genetic algorithm approach. In: IEEE Congress on Evolutionary Computation (CEC), 2013.
[4] Brown, G., Kuncheva, L.: "Good" and "bad" diversity in majority vote ensembles. In: El Gayar, N., Kittler, J., Roli, F. (eds.) Multiple Classifier Systems, LNCS, vol. 5997, pp. 124–133, Springer (2010)
[5] Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197 (2002)
[6] Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer (2009)
[7] Parente, R. R., Canuto, A. M. P., Xavier, J. C.: Characterization measures of ensemble systems using a meta-learning approach. In: The 2013 International Joint Conference on Neural Networks (IJCNN), 2013, pp. 1–8.
[8] Bonissone, P. P.: Lazy meta-learning: creating customized model ensembles on demand. In: Proceedings of the 2012 World Congress on Advances in Computational Intelligence, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 1–23.
[9] Lee, M., Boroczky, L., Sungur-Stasik, K., Cann, A., Borczuk, A., Kawut, S., Powell, C.: A two-step approach for feature selection and classifier ensemble construction in computer-aided diagnosis. In: 21st IEEE International Symposium on Computer-Based Medical Systems, pp. 548–553 (2008)
[10] Oliveira, L. S., Morita, M., Sabourin, R.: Feature selection for ensembles applied to handwriting recognition. Int. J. Doc. Anal. Recognit., vol. 8, pp. 262–279 (2006).
[11] Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Elsevier (2005)
[12] Soares, C.: UCI++: Improved support for algorithm selection using datasetoids. In: Advances in Knowledge Discovery and Data Mining, LNCS, vol. 5476, pp. 499–506 (2009)
[13] Asuncion, A., Newman, D. J.: UCI Machine Learning Repository. University of California at Irvine, 2007. Available: http://ics.uci.edu/~mlearn/MLRepository.html
[14] Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning: Functional Genomics Special Issue, pp. 91–118 (2003)
[15] Gibbons, J. D., Chakraborti, S.: Nonparametric Statistical Inference, Fourth Edition, Revised and Expanded. Marcel Dekker, Alabama (2003)
[16] Knowles, J., Thiele, L., Zitzler, E.: A tutorial on the performance assessment of stochastic multiobjective optimizers. Computer Engineering and Networks Laboratory (TIK), ETH Zurich, Tech. Rep. TIK Report 214 (2006)
[17] Derrac, J., García, S., Herrera, F.: A first study on the use of coevolutionary algorithms for instance and feature selection. In: Corchado, E., Wu, X., Oja, E., Herrero, L., Baruque, B. (eds.) Hybrid Artificial Intelligence Systems, LNCS, vol. 5572, pp. 557–564, Springer, Berlin (2009).
[18] Ruta, D., Gabrys, B.: Classifier selection for majority voting. Information Fusion, vol. 6, no. 1, pp. 63–81 (2005). Available: http://www.sciencedirect.com/science/article/B6W76-4CHJ5N7-1/2/8b9b23704745066ff7fd97ba9fba3315
[19] Santos, E., Sabourin, R., Maupin, P.: Single and multi-objective genetic algorithms for the selection of ensemble of classifiers. In: International Joint Conference on Neural Networks, pp. 3070–3077 (2006).
[20] Santana, L., Silva, L., Canuto, A., Pintro, F., Vale, K.: A comparative analysis of genetic algorithm and ant colony optimization to select attributes for a heterogeneous ensemble of classifiers. In: 2010 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8 (2010).
[21] Sylvester, J., Chawla, N.: Evolutionary ensemble creation and thinning. In: IJCNN'06 International Joint Conference on Neural Networks, pp. 5148–5155 (2006).
[22] Wang, W., Partridge, D., Etherington, J.: Hybrid ensembles and coincident-failure diversity. In: IJCNN'01 International Joint Conference on Neural Networks, vol. 4, pp. 2376–2381 (2001).
[23] Bacauskiene, M., Verikas, A., Gelzinis, A., Valincius, D.: A feature selection technique for generation of classification committees and its application to categorization of laryngeal images. Pattern Recognition, vol. 42, pp. 645–654 (2009).
[24] Tsymbal, A., Puuronen, S., Patterson, D. W.: Ensemble feature selection with the simple Bayesian classification. Information Fusion, vol. 4, no. 2, pp. 87–100 (2003).