Meta-Learning and Multi-Objective Optimization to
Design Ensemble of Classifiers
Antonino A. Feitosa Neto
Informatics and Applied Mathematics Department
Federal University of Rio Grande do Norte
Natal–RN, Brazil, 59072-970
Email: antonino [email protected]
Anne M. P. Canuto
Informatics and Applied Mathematics Department
Federal University of Rio Grande do Norte
Natal, RN - Brazil, 59072-970
Email: [email protected]
Abstract—Ensembles of classifiers, or simply ensemble systems, have been proven to be efficient for pattern recognition tasks. However, their design can become a difficult task. For instance, the choice of the individual classifiers and the use of feature selection methods are very difficult to define in the design of these systems. In order to smooth out this problem, we apply meta-learning and multi-objective optimization to the choice of important parameters of ensemble systems. More specifically, this work applies meta-learning techniques to define the initial configuration of a multi-objective optimization algorithm, NSGA II. The meta-learner is used to recommend the proportion of each type of base classifier to compose the ensemble systems. NSGA II is then used to generate heterogeneous ensembles, selecting attributes, types and parameters of the base classifiers while optimizing the classification error and the bad diversity. The results are analysed using the error rate and multi-objective metrics in order to verify whether the use of meta-learning generates more accurate ensemble systems.
I. INTRODUCTION
The performance of classification systems can be increased
by the use of ensembles of classifiers [1]. An ensemble of
classifiers, or simply ensemble system, is a set of learning
algorithms (or base classifiers) whose outputs are combined to
provide the final output; the presence of a combination module
is therefore essential. An ensemble system will tend to succeed
if the outputs of the base classifiers are not correlated. In other
words, each base classifier should make mistakes on different
instances. We call this feature diversity [1]. Therefore, we need
to select base classifiers so that diversity and accuracy are as
high as possible.
Another important aspect in the design of ensemble systems
is their size, in other words, the choice of the individual
classifiers of an ensemble. In some real-world applications, the
number of classifiers required to form an ensemble with
reasonable accuracy can vary enormously. This choice is not
an easy task: if we use too few individual classifiers, the
ensemble system will not perform properly; on the other hand,
if we use too many, there will be an unnecessary use of
resources that does not lead to any improvement in the
accuracy of these systems. The problem of member selection
can be considered a search problem that aims to find the size
of the ensemble system as well as its composition (types of
classifiers).
There is a reasonable number of papers in the literature
that address either feature selection or classifier member
selection in ensembles using optimization techniques, such as
[9], [10], [17], [19], [20], [21]. However, the majority of these
papers address only one issue (either ensemble members or
features), and most of them use diversity measures as a guide
to select members or features in ensemble systems. In addition,
to the best of our knowledge, there is no work in the literature
that applies meta-learning along with optimization techniques
for optimizing ensemble systems. In this paper, we apply
meta-learning and multi-objective optimization to the choice
of important parameters (attributes and base classifiers) of
ensemble systems.
This paper is an extension of the work in [3], which analysed
how two diversity measures (good and bad diversity, proposed
in [4]) could be used to generate more accurate ensembles.
The work in [3] shows that optimizing only the error, or the
error together with bad diversity, generates more accurate
ensembles than the other possible combinations of optimization
objectives, that is, combinations of error, bad and good
diversity. However, in [3], we only used multi-objective
optimization algorithms in the design of ensemble systems.
As a consequence, we observed that the initial population of
the NSGA II algorithm had an important role in its final
solution. Therefore, aiming to improve even further the
performance of the obtained ensemble systems, we decided to
include meta-learning in the design of ensemble systems. The
main objective of this work is to verify how the initial
configuration of ensemble systems affects the outcome of the
optimization algorithm. For this purpose, we use meta-learning
techniques to define the initial configuration of ensemble
systems. The hypothesis is that, starting from a better-adapted
population, the optimization will converge more efficiently,
that is, to more accurate and more diversified ensemble
systems.
This paper is organized as follows. Section II describes the
research works related to the subject of this paper, and Section
III presents ensemble systems and diversity. Section IV
presents the methodology used in the experimental work,
while an analysis of the results of the empirical investigation
is shown in Section V. Finally, Section VI presents the final
remarks of this paper.
II. RELATED WORK
Several studies have analysed one important aspect of
designing ensemble systems, either feature or member selection.
For instance, work on the selection of individual classifiers
(size and type) to compose ensemble systems has focused on
finding the most relevant subset of classifiers in order to
improve the combined performance. There are some works
reported in the literature that use selection methods in these
systems, such as [2], [21], [22], [5]. Nowadays, one of the
most attractive selection methods is based on genetic
algorithms, due to the possibility of dealing with a population
of solutions rather than only one solution [19], [18].
On the other hand, several authors have investigated the use
of feature selection methods in ensembles, such as in [23],
[20], [17], [9], [24], [10]. Optimization techniques have been
used to automate the search for the optimum attribute subsets,
and several authors have recently investigated genetic
algorithms (GA) to design ensembles of classifiers [2], [9].
In the context of meta-learning, the work in [6] presents
different ways in which meta-learning can be applied to
classification problems. The results generated by meta-learning
are compared with results where the initial configurations are
generated randomly, each type of classification system having
the same chance of being chosen. The use of meta-learning
for ensemble systems has been little explored in the literature,
with only two papers [8], [7], both of which apply meta-learning
to directly recommend parameters for ensemble systems.
Thus, unlike most of the previous works reviewed, this paper
combines meta-learning and optimization techniques for
optimizing ensemble systems. In addition, it applies the
optimization technique to both ensemble components (individual
classifiers and features), and uses meta-learning to define the
initial configurations (the initial population of the genetic
algorithm) of the optimization technique.
III. ENSEMBLE SYSTEMS
Ensemble systems, also known as multi-classifier systems
or fusion of experts, exploit the idea that different classifiers
can offer complementary information about patterns to be
classified, thereby improving the effectiveness of the overall
recognition process [1]. These systems are composed of a
set of $N$ individual classifiers (ICs), organized in parallel,
that receive the input patterns and send their outputs to a
combination module, which is responsible for providing the
final output of the system. Therefore, unlabelled instances
$\{x_i \in \mathbb{R}^d \mid i = 1, 2, \dots, n\}$ are presented
to all individual classifiers, and a combination method combines
their outputs to produce the overall output of the system,
$O = Comb(y_j)$, with $y_j = (y_{j1}, \dots, y_{jl})$ and
$j = 1, \dots, N$, where $N$ is the number of individual
classifiers and $l$ is the number of labels of the dataset. In
ensemble systems, the main aim is that the individual components
offer complementary information about an instance, and that this
complementary information leads to an improvement in the
effectiveness of the whole recognition process [1].
In the design of ensemble systems, two main issues are
important: the ensemble components and the combination
method that will be used. In relation to the first issue, the
members of an ensemble are chosen and implemented. As
mentioned previously, the correct choice of the set of individual
classifiers is fundamental to the overall performance of an
ensemble. The ideal situation is a set of individual classifiers
with uncorrelated errors; they would be combined in such a
way as to minimize the effect of these failures. That is, the
base classifiers should be diverse among themselves.
Once a set of classifiers has been created and selected, the
next step is to choose an effective way of combining their
outputs. The choice of the best combination method for an
ensemble requires exhaustive training and testing; it is very
important and difficult to achieve. There is a reasonable number
of combination methods reported in the literature [1]. In this
paper, the focus is on the simplest way to fuse the outputs of
the individual classifiers, so we used majority vote.
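To make the combiner concrete, the following is a minimal sketch of majority voting in Python (an illustration of the technique, not the authors' implementation); it assumes each base classifier outputs one label per instance.

```python
import numpy as np

def majority_vote(predictions):
    """Combine base-classifier outputs by majority (plurality) vote.

    predictions: (N, n) array with the label predicted by each of the
    N base classifiers for each of the n instances.
    """
    predictions = np.asarray(predictions)
    combined = np.empty(predictions.shape[1], dtype=predictions.dtype)
    for i in range(predictions.shape[1]):
        labels, counts = np.unique(predictions[:, i], return_counts=True)
        combined[i] = labels[np.argmax(counts)]  # most-voted label wins
    return combined

# Three classifiers voting on four instances.
votes = np.array([[0, 1, 1, 2],
                  [0, 1, 0, 2],
                  [1, 1, 0, 0]])
print(majority_vote(votes))  # -> [0 1 0 2]
```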
A. Diversity in Ensemble Systems
In an ensemble system, diversity can be enhanced by building
the individual classifiers under different views of the problem,
as follows.
• Parameter settings: the use of different initial parameter
settings for the individual classifiers can increase the diversity
among them, increasing the diversity of the whole system;
• Training datasets: the use of feature selection techniques
or learning strategies (for instance, Bagging and Boosting)
makes the individual classifiers have different views of the
same problem, enhancing the diversity of the ensembles;
• Classifier types: the use of different types of classification
algorithms (heterogeneous structures) can increase the
diversity of the ensemble system.
There are different diversity measures available from different
fields of research. However, as there is to date no consensus
on a formal definition of the term, measuring diversity is a
non-trivial task. This issue has been addressed by some authors
[2]. In a recent work [4], the authors adopted the perspective
that a diversity measure should be naturally derived as a direct
consequence of two main factors: the loss function of interest
and the combiner function. Based on this, they proposed a
decomposition of the classification error of ensemble systems
using the majority vote combiner into three terms: individual
accuracy, good diversity, and bad diversity. These diversity
terms are based on the number of votes when an ensemble
makes a decision. Good diversity, Eq. (1), and bad diversity,
Eq. (2), are defined as:
$$D^{+} = \frac{1}{N} \sum_{i=1}^{\#P^{+}} v_i^{-} \qquad (1)$$

$$D^{-} = \frac{1}{N} \sum_{i=1}^{\#P^{-}} v_i^{+} \qquad (2)$$
where $N$ is the number of ensemble members, $v_i^{+}$ is the
number of correct votes for instance $i$, $v_i^{-}$ is the number
of incorrect votes for instance $i$, $P^{+}$ is the set of instances
classified correctly by the ensemble, and $P^{-}$ is the set of
instances classified incorrectly by the ensemble [4].
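As an illustration, Eqs. (1) and (2) can be computed directly from the vote counts. The sketch below is our reading of [4], not code from the paper; it assumes class labels encoded as non-negative integers and a plurality-vote ensemble (ties broken towards the smallest label).

```python
import numpy as np

def good_bad_diversity(predictions, y_true):
    """Good (D+) and bad (D-) diversity, following Eqs. (1) and (2).

    predictions: (N, n) integer array of base-classifier labels.
    y_true: (n,) integer array of true labels.
    """
    predictions = np.asarray(predictions, dtype=int)
    y_true = np.asarray(y_true, dtype=int)
    N = predictions.shape[0]

    # Majority-vote decision per instance (ties -> smallest label).
    ensemble = np.array([np.bincount(col).argmax() for col in predictions.T])

    v_plus = (predictions == y_true).sum(axis=0)  # correct votes per instance
    v_minus = N - v_plus                          # incorrect votes per instance
    in_p_plus = ensemble == y_true                # instances in P+
    d_plus = v_minus[in_p_plus].sum() / N         # disagreement that did no harm
    d_minus = v_plus[~in_p_plus].sum() / N        # correct votes that were wasted
    return d_plus, d_minus
```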
IV. EXPERIMENTAL ANALYSIS
We performed an empirical investigation to assess how
meta-learning can affect the performance of the optimization
technique. In this experiment, meta-learning is responsible for
defining the initial configuration of the heterogeneous ensembles
that constitute the initial solutions of the optimization technique.
We use NSGA II as the optimization technique; in this case, the
meta-learner suggests possible configurations to be used by the
NSGA II algorithm.
A. Meta-learning
Meta-learning is the first phase of our methodology; it
recommends initial configurations of the ensemble systems to
be used by the optimization algorithm. Although the optimization
technique is applied to both feature selection and base
classifiers, for simplicity the meta-learning module only
recommends the base classifiers to compose the ensemble
systems. More specifically, it defines the proportion of each
of the three learning algorithms (k-NN, Decision Tree and
Naive Bayes classifier) to compose the ensemble. This
recommendation is taken into consideration by the optimization
technique as a sort of a priori information in the creation of
the initial population.
The meta-learner is built from 540 datasetoids [12] generated
from some of the datasets presented in Table I, almost all of
which were taken from the UCI repository [13]. The datasets
not taken from UCI are Gaussian, Simulated and Jude; these
are synthetic databases that simulate microarray data and were
created to test machine learning algorithms in gene expression
analysis [14]. It is important to emphasize that we do not use
the same datasets for the meta-learning and for the performance
evaluation (presented in the next section): 27 datasets are used
in the meta-learning process and 17 in the performance
evaluation. Table II presents the datasets used in the performance
evaluation.
Each dataset is evaluated 10 times for each learning algorithm
(k-NN, Decision Tree and Naive Bayes classifier) under 10-fold
cross-validation, making a total of 100 runs. The best learning
algorithm for each run is then selected. After all runs, we count
the number of times each learning algorithm was the winner.
These counts are transformed into the posterior probability
distribution used as the proportion of each learning algorithm
in the ensemble systems.
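For instance, the transformation from win counts to proportions is a simple normalization; the counts below are placeholders, not results from the paper.

```python
from collections import Counter

# Hypothetical win counts over the 100 runs for one datasetoid.
wins = Counter({"knn": 46, "decision_tree": 38, "naive_bayes": 16})

total = sum(wins.values())
proportions = {algo: count / total for algo, count in wins.items()}
print(proportions)  # {'knn': 0.46, 'decision_tree': 0.38, 'naive_bayes': 0.16}
```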
TABLE I
DATASETS USED IN THE META-LEARNING

Annealing             Hepatitis        SoybeanLarge
Arrhythmia            Horse Colic      SpamBase
Audiology             Ionosphere       SPECTHeart
Balance               Iris             SPECTFHeart
BreastTissue          Jude             StatlogAustralian
Car                   KRKPA7           StatlogGerman
ClimateModel          Labor            StatlogHeart
CongressionalVoting   LibrasMovement   Transfusion
Credit Approval       Parkinsons       Vehicle
CylinderBands         PittsburghV1     Vowel
Dermatology           PittsburghV2     Waveform
Ecoli                 PlanningRelax    Wine
Flags                 Promoter         WisconsinDiagnostic
Glass                 Protein          WisconsinOriginal
HeartCleveland        Segment          WisconsinPrognostic
HeartHungarian        Sick             Zoo
HeartLongBeach        Simulated
HeartWitzerland       Sonar
In order to avoid losing diversity in the optimization
algorithm, a randomized selection of the initial population is
performed, in which the average performance of the learning
algorithms is used as the probability of selecting the proportion
recommended by the meta-learner. For instance, if the average
performance of the learning algorithms is 85%, the initial
population will contain 85% of individuals created based on
the recommendation of the meta-learner and 15% of individuals
generated randomly. In addition, as the meta-learner recommends
only the proportion of each learning algorithm, the individuals
are created using this recommendation with a margin of 10%.
For instance, if the meta-learner recommends 50% of k-NN,
we create individuals with 45% to 55% of k-NN.
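This seeding procedure can be sketched as follows; the code is a simplified reading of the description above, with hypothetical function names, and the rounding details are our own choice.

```python
import random

def seed_individual(recommended, margin=0.10, size=30):
    """Draw an ensemble composition around the recommended proportions,
    perturbing each proportion by up to +/- margin and renormalizing."""
    noisy = {a: max(0.0, p + random.uniform(-margin, margin))
             for a, p in recommended.items()}
    total = sum(noisy.values())
    members = []
    for algo, p in noisy.items():
        members += [algo] * round(size * p / total)
    while len(members) < size:                 # fix rounding shortfalls
        members.append(random.choice(list(noisy)))
    random.shuffle(members)
    return members[:size]

def initial_population(recommended, avg_accuracy, pop_size=30, size=30):
    """Mix seeded and random individuals: with avg_accuracy = 0.85,
    roughly 85% follow the meta-learner and 15% are fully random."""
    algos = list(recommended) + ["absent"]
    return [seed_individual(recommended, size=size)
            if random.random() < avg_accuracy
            else [random.choice(algos) for _ in range(size)]
            for _ in range(pop_size)]

# Example: initial_population({"knn": 0.5, "decision_tree": 0.3,
#                              "naive_bayes": 0.2}, avg_accuracy=0.85)
```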
The meta-dataset consists of 17 meta-features. The meta-features
are computed from the dataset according to the definitions
below, where Ca is the set of categorical attributes, Na is the
set of numeric attributes and C is the class attribute. For
example, the first meta-feature, Attributes by Instances, is
calculated as the quotient between the number of attributes
and the number of instances.
1) Attributes by Instances: #attributes/#instances
2) Smaller Class by Bigger Class: #smallerClassSet/#biggerClassSet
3) Categorical by Attributes: #Ca/#attributes
4) Numerical by Attributes: #Na/#attributes
5) Class Entropy: entropy(C)
6) Attributes Entropy: mean({x | y ∈ Ca ∧ x = entropy(y)})
7) Mutual Information: mean({x | y ∈ Ca ∧ x = mutualInformation(y, C)})
8) Conditional Entropy: mean({x | y ∈ Ca ∧ x = conditionalEntropy(y, C)})
9) Joint Entropy: mean({x | y ∈ Ca ∧ x = jointEntropy(y, C)})
10) Signal to Noise: (attributesEntropy − mutualInformation)/mutualInformation
11) High Correlation: (#{x | y ∈ Na ∧ x = correlation(y) > 0.25})/#Na
12) Low Correlation: (#{x | y ∈ Na ∧ x = correlation(y) < −0.25})/#Na
13) Neutral Correlation: (#{x | y ∈ Na ∧ x = |correlation(y)| < 0.25})/#Na
14) Positive Skew: (#{x | y ∈ Na ∧ x = skew(y) ≥ 0})/#Na
15) Negative Skew: (#{x | y ∈ Na ∧ x = skew(y) ≤ 0})/#Na
16) Positive Kurtosis: (#{x | y ∈ Na ∧ x = kurtosis(y) ≥ 3})/#Na
17) Negative Kurtosis: (#{x | y ∈ Na ∧ x = kurtosis(y) ≤ 3})/#Na
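A few of the unambiguous meta-features above could be computed as in the sketch below (our own pandas/SciPy illustration; the paper does not specify, for example, what correlation(y) is measured against, so the correlation-based features are omitted).

```python
import numpy as np
import pandas as pd
from scipy import stats

def some_meta_features(df, label_col):
    """Subset of the 17 meta-features for a dataset held in a DataFrame."""
    y = df[label_col]
    X = df.drop(columns=[label_col])
    numeric = X.select_dtypes(include="number")
    class_p = y.value_counts(normalize=True)

    return {
        "attributes_by_instances": X.shape[1] / len(df),
        "smaller_by_bigger_class": y.value_counts().min() / y.value_counts().max(),
        "numerical_by_attributes": numeric.shape[1] / X.shape[1],
        "class_entropy": float(-(class_p * np.log2(class_p)).sum()),
        "positive_skew": (numeric.apply(stats.skew) >= 0).mean(),
        # Pearson kurtosis (normal distribution = 3), as in feature 16.
        "positive_kurtosis": (numeric.apply(stats.kurtosis, fisher=False) >= 3).mean(),
    }
```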
B. Optimization Algorithm
As already mentioned, we use two minimization objectives:
classification error and bad diversity. Each individual
(chromosome) is a heterogeneous ensemble that describes the
set of attributes, the learning algorithms and their main
parameters. The possible types of learning algorithm are k-NN,
Decision Tree and Naive Bayes classifier (a description of
these methods can be found in [11]), plus absence (indicating
that the base classifier is not part of the ensemble). For k-NN,
the only considered parameter is k, which varies in [1, 10] ⊂ N.
The Decision Tree can modify the minimum number of
instances in a leaf node, which also varies in [1, 10] ⊂ N. The
Naive Bayes classifier can modify the way in which numerical
attributes are treated; in this case, there are three possibilities:
Gaussian distribution, entropy discretization, or a kernel of
Gaussian distributions.
In the NSGA II algorithm, we use the following parameters:
a population of 30 individuals; a uniform mutation operator at
a rate of 10%, applied to select or deselect an attribute, and to
modify the learning type and/or parameter of a base classifier,
including its removal; and a two-point crossover operator at a
rate of 80%. Finally, as a stopping criterion, we adopt a
maximum of 100 epochs or a convergence criterion, that is,
when 1/4 of the population is equal with respect to the
optimization objectives.
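The operators can be sketched over a simple chromosome encoding. This encoding and the helper names are our illustration under stated assumptions; the paper does not publish its implementation.

```python
import random

ALGOS = ["knn", "decision_tree", "naive_bayes", "absent"]  # "absent" removes the member

def random_gene(n_attrs):
    """One classifier slot: learner type, its integer parameter in [1, 10],
    and a binary attribute-selection mask (hypothetical encoding)."""
    return {"type": random.choice(ALGOS),
            "param": random.randint(1, 10),
            "mask": [random.randint(0, 1) for _ in range(n_attrs)]}

def mutate(chrom, rate=0.10):
    """Uniform mutation: with probability `rate` per gene, re-draw the
    learner type or parameter, or flip one attribute-selection bit."""
    for gene in chrom:
        if random.random() < rate:
            field = random.choice(["type", "param", "mask"])
            if field == "type":
                gene["type"] = random.choice(ALGOS)
            elif field == "param":
                gene["param"] = random.randint(1, 10)
            else:
                gene["mask"][random.randrange(len(gene["mask"]))] ^= 1

def two_point_crossover(a, b):
    """Classic two-point crossover over the list of genes."""
    i, j = sorted(random.sample(range(len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]
```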
For comparison purposes, we generate three types of initial
configurations: one generated by meta-learning and two defined
arbitrarily. In all cases, the set of parameters and attributes of
each base classifier is chosen randomly and uniformly.
C. Methods and Materials
As already mentioned, a subset of the datasets listed in
Table I is used for the assessment of the experimental scenarios;
these datasets are presented in Table II. In the comparative
analysis, the experimental scenarios consist of different ways
to recommend the initial solutions of the NSGA II algorithm.
The experimental scenarios are listed below, and a sketch of
how the corresponding initial populations can be generated
follows the list:
1) Meta: the optimization algorithm creates an initial
population where each individual represents an ensemble
with 30 base classifiers, using the recommendation made
by the meta-learner. The meta-learner determines the
proportions generated by k-NN, Decision Tree and Naive
Bayes classifier.
2) Rand: it generates an initial population where each base
classifier is chosen randomly and uniformly among k-NN,
Decision Tree, Naive Bayes and absent (indicating that
this base classifier is not active). In this scenario, we can
have individuals that represent homogeneous and
heterogeneous ensembles of different sizes.
3) Equal: it generates an initial population where each
individual represents an ensemble of 30 base classifiers,
of which 10 are generated by k-NN, 10 by Decision Tree
and 10 by Naive Bayes classifier. In this case, differences
in the initial parameters and attribute distributions
distinguish the individuals.
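For the Meta scenario, the initial population is seeded as sketched in Section IV-A; the two arbitrary baselines could be generated as follows (illustrative code with our own names, not the paper's implementation).

```python
import random

ALGOS = ["knn", "decision_tree", "naive_bayes"]

def equal_individual(size=30):
    """Equal scenario: 10 base classifiers of each type, shuffled."""
    members = [a for a in ALGOS for _ in range(size // len(ALGOS))]
    random.shuffle(members)
    return members

def rand_individual(size=30):
    """Rand scenario: each slot drawn uniformly; 'absent' slots let
    ensemble size and composition vary across individuals."""
    return [random.choice(ALGOS + ["absent"]) for _ in range(size)]
```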
In order to compare the results obtained in the different
experimental scenarios, statistical tests are applied: the Friedman
test, coupled with the Wilcoxon test [15].
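Both tests are available in SciPy; a minimal illustration with placeholder error-rate vectors (one value per dataset and scenario, not the paper's numbers) is:

```python
from scipy import stats

# Placeholder error rates on five datasets for the three scenarios.
meta  = [0.078, 0.203, 0.156, 0.181, 0.090]
rand  = [0.087, 0.246, 0.211, 0.264, 0.094]
equal = [0.079, 0.211, 0.153, 0.201, 0.103]

print(stats.friedmanchisquare(meta, rand, equal))  # overall difference
print(stats.wilcoxon(meta, rand))                  # pairwise follow-up
```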
V. RESULTS
In order to validate the feasibility of the proposed
methodology, an empirical analysis was conducted, and this
section presents the obtained results. This empirical analysis
was done in two parts: in the first, we analyse the error rates
of the obtained ensemble systems; in the second, we analyse
the experimental scenarios with respect to multi-objective
metrics. These two parts are described in the next two
subsections.
A. Error rate of the Ensemble Systems
This first analysis is done in terms of the accuracy (error
rate, or classification error) of the ensemble systems produced
by the non-dominated solutions of the optimization technique.
In this case, the solution with the lowest error rate among all
non-dominated solutions is selected. In addition, we compare
the accuracy of all three scenarios, Meta, Rand and Equal.
Table II presents the percentage classification error for the
three experimental scenarios; for each dataset, the scenario
that achieves the lowest error rate is highlighted in bold.
Analysing Table II, it can be observed that our approach
(Meta) achieved the best performance (the lowest error rate)
in more than 70% (12 out of 17) of the analysed datasets. The
Equal scenario achieved the lowest error rate in the remaining
datasets, which means that the Rand scenario did not achieve
the lowest error rate in any dataset.
In order to evaluate these results from a statistical point of
view, the Friedman test [15] was applied. The results indicate
a significant difference, p-value = 0.0015 (using a significance
level of 0.05). This means that the performances of the three
experimental scenarios are different, from a statistical point of
view.
TABLE II
CLASSIFICATION ERROR OF THE ENSEMBLE SYSTEMS

Datasets                   Equals           Rand             Meta
Balance                    0.0790±0.0001    0.0872±0.0003    0.0784±0.0001
BreastTissue               0.2113±0.0002    0.2462±0.0002    0.2028±0.0002
Ecoli                      0.1533±0.0002    0.2113±0.0008    0.1563±0.0001
GlassIdentification        0.2009±0.0007    0.2640±0.0017    0.1813±0.0003
HeartDiseaseLongBeachVA    0.4255±0.0003    0.4565±0.0007    0.4235±0.0003
HeartDiseaseWitzerland     0.3244±0.0002    0.3382±0.0003    0.3220±0.0001
Hepatitis                  0.1032±0.0016    0.0942±0.0019    0.0903±0.0018
Iris                       0.0293±0.0001    0.0327±0.0001    0.0293±0.0001
Labor                      0.0000±0.0000    0.0000±0.0000    0.0000±0.0000
LungCancer                 0.2813±0.0007    0.2875±0.0006    0.2969±0.0014
PittsburghBridgesV1        0.2029±0.0001    0.2362±0.0002    0.2048±0.0002
PittsburghBridgesV2        0.2324±0.0001    0.2724±0.0004    0.2305±0.0001
PlanningRelax              0.2599±0.0004    0.2703±0.0002    0.2632±0.0003
SPECTHeart                 0.2313±0.0004    0.2350±0.0002    0.2325±0.0001
Transfusion                0.2274±0.0001    0.2303±0.0001    0.2293±0.0001
Wine                       0.0000±0.0000    0.0011±0.0001    0.0011±0.0001
Zoo                        0.0158±0.0001    0.0129±0.0001    0.0158±0.0001
We then applied the Wilcoxon test [15] to evaluate the
statistical difference for each pair of experimental scenarios,
considering all datasets at once. The results are p-value = 1.0,
p-value = 0.4282 and p-value = 0.4181 when comparing Meta
x Equal, Meta x Rand and Rand x Equal, respectively.
Therefore, the Wilcoxon test failed to identify a significant
difference between the experimental scenarios, and we cannot
state which scenario presented the lowest classification error,
from a statistical point of view.
As the Wilcoxon test did not detect statistical differences
when considering all datasets, we applied a statistical test to
each dataset separately; these results are presented in Table III.
In this table, the results are presented as "CaseA x CaseB",
corresponding to the application of the Mann-Whitney test [15]
comparing samples of CaseA against CaseB. In all tests, a
significance level of 0.05 is adopted.
TABLE III
THE RESULTS OF THE WILCOXON TEST: CLASSIFICATION ERROR

Dataset            Meta x Equals    Meta x Rand    Rand x Equals
Balance            0.4497           0.2568         0.5453
BreastTissue       0.2265           0.0002         0.0003
Ecoli              0.1509           0.0002         0.0002
Glass              0.0757           0.0002         0.0010
HeartLongBeach     0.9397           0.0082         0.0073
HeartWitzerland    0.5708           0.0343         0.1306
Hepatitis          0.2730           0.4963         0.7337
Iris               1.0000           0.1509         0.1509
Labor              1.0000           1.0000         1.0000
LungCancer         0.3847           0.6501         0.5967
PittsburghV1       0.6501           0.0005         0.0003
PittsburghV2       0.7624           0.0003         0.0003
PlanningRelax      0.7337           0.4057         0.2123
SPECTHeart         0.7055           0.5453         0.7055
Transfusion        0.4274           0.4963         0.0284
Wine               0.4497           1.0000         0.4497
Zoo                1.0000           0.3643         0.3643
Comparing the p-values in Table III, we observe that no
significant difference between Meta and Equal was detected
for any dataset. However, when comparing Meta x Rand and
Rand x Equal, there are statistical differences in about 42%
of the datasets (7 out of 17) for both comparisons. These
differences indicate that the use of meta-learning can provide
more accurate ensemble systems in around half of the analysed
cases, which is an interesting result for our methodology.
B. Multi-objective Analysis
When using multi-objective optimization techniques, one of
the biggest challenges is to compare the outcomes provided by
these techniques. In general, the outcome of a multi-objective
optimization is a set of non-dominated points, called an
approximation set. In this paper, we apply the dominance
ranking approach, which yields general statements about the
relative performance of multi-objective optimizers fairly
independently of preference information [16]. This approach
is recommended as the first step in any comparison. If
conclusions cannot be drawn based on this approach alone
(p-values higher than the significance level), other measures,
named quality indicators, are applied to point out differences
among the sets generated by the different stochastic optimizers.
Moreover, we apply Pareto-compliant indicators, i.e., those for
which, whenever an approximation set A is preferable to
another set B with respect to weak Pareto dominance, the
indicator value of A is at least as good as that of B. The concept
of weak dominance states that a solution z1 weakly dominates
another solution z2 if z1 is not worse than z2 in any objective.
The Pareto-compliant quality indicators multiplicative binary-ε
and hypervolume are used here when significant differences
are not found with the dominance ranking approach. The
ε-indicator gives the factor by which an approximation set is
worse than another with respect to all objectives, while the
hypervolume measures the portion of the objective space that
is weakly dominated by an approximation set.
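For two minimization objectives these notions are easy to state in code. The sketch below is our simplified illustration, not the assessment tool of [16]; the multiplicative ε-indicator assumes strictly positive objective values.

```python
def weakly_dominates(z1, z2):
    """z1 weakly dominates z2 if it is not worse in any objective
    (minimization assumed)."""
    return all(a <= b for a, b in zip(z1, z2))

def multiplicative_epsilon(A, B):
    """Smallest factor by which set A must be scaled up so that every
    point of B is weakly dominated by some point of A."""
    return max(min(max(ai / bi for ai, bi in zip(a, b)) for a in A)
               for b in B)

def hypervolume_2d(points, ref):
    """Area weakly dominated by a 2-objective approximation set,
    bounded by the reference point ref (worse in both objectives)."""
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(points):   # sweep by the first objective
        if y < prev_y:            # skip dominated points
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

# Example: front = [(1, 3), (2, 2), (3, 1)]
# hypervolume_2d(front, (4, 4)) == 6.0
```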
Table IV presents the results of the Friedman test comparing
all three experimental scenarios with respect to the
multi-objective metrics. As already mentioned, the binary-ε
and hypervolume criteria are analysed if and only if no
statistical difference is detected in the dominance ranking [16].
TABLE IV
MULTI-OBJECTIVE ANALYSIS

Datasets           Dominance Ranking    Hypervolume    binary-ε
Balance            0.3284               0.3284         0.0002
BreastTissue       0.7205               0.0006         0.0004
Ecoli              0.3359               0.0008         0.0002
Glass              0.0370               -              -
HeartLongBeach     0.3679               0.0014         0.0002
HeartWitzerland    0.1357               0.0007         0.0159
Hepatitis          0.1421               0.0138         0.0013
Iris               0.3983               0.3050         0.0001
Labor              1.0000               1.0000         0.0018
LungCancer         0.0004               -              -
PittsburghV1       0.0191               -              -
PittsburghV2       0.0101               -              -
PlanningRelax      0.1546               0.0002         0.0016
SPECTHeart         0.0232               -              -
Transfusion        0.3679               0.0003         0.5119
Zoo                0.7584               0.0145         0.0033
Wine               0.2434               0.2434         0.0040
In the multi-objective context (Table IV), we can observe
that the dominance ranking did not detect statistical significance
in more than 70% of the cases (12 out of 17). Of these cases,
significant differences were also not detected for four datasets,
Balance, Iris, Labor and Wine, when analysing the hypervolume.
The results in Table IV show that the three scenarios had
similar performance, since most of the multi-objective metrics
proved not to be statistically significant. In the cases where
the results are statistically significant (5 for dominance ranking,
12 for hypervolume and 17 for binary-ε), we applied the
Wilcoxon test. As a result, we noticed a mixture of winners
(best values) among the three experimental scenarios. Once
again, this is an indication that the experimental scenarios
provided similar performance, from a statistical point of view.
VI. FINAL REMARKS
This paper applied meta-learning techniques to recommend
the initial configurations for the optimization of ensemble
systems using the NSGA II algorithm. The idea is to use
meta-learning as a first step of the optimization algorithm,
helping in the selection of the initial population. In order to
evaluate the performance of the proposed methodology, an
empirical investigation was conducted. In this investigation,
three experimental scenarios were used, Rand, Meta and Equal,
consisting of different ways to recommend the initial solutions
of the NSGA II algorithm.
When analysing the error rates of the obtained ensemble
systems, the Meta scenario achieved the lowest error rate in
more than 70% of the datasets, and it proved to be statistically
significant in around 42% of the datasets when compared with
the Rand scenario. The multi-objective analysis indicates that
all three experimental scenarios provided similar performance,
mainly for the dominance ranking parameter.
The obtained results might not be as good as we expected,
mainly for the multi-objective parameters. However, these
results do not exclude the possibility of using meta-learning
to generate more accurate ensembles. We believe that the
recommendation of only one parameter did not provide
sufficient information for the optimization technique, which
caused the different scenarios to have similar performance.
Nonetheless, as on-going research, we are analysing the use
of meta-learning to recommend other ensemble parameters.
REFERENCES
[1] Kuncheva, L. I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley (2004)
[2] Kuncheva, L., Whitaker, C. J.: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, vol. 51, pp. 181–207, Springer (2003)
[3] Feitosa Neto, A., Canuto, A., Ludermir, T. B.: Using good and bad diversity measures in the design of ensemble systems: A genetic algorithm approach. In: IEEE Congress on Evolutionary Computation (CEC), 2013.
[4] Brown, G., Kuncheva, L.: "Good" and "bad" diversity in majority vote ensembles. In: El Gayar, N., Kittler, J., Roli, F. (eds.) Multiple Classifier Systems, LNCS, vol. 5997, pp. 124–133, Springer (2010)
[5] Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197 (2002)
[6] Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer (2009)
[7] Parente, R. R., Canuto, A. M. P., Xavier, J. C.: Characterization measures of ensemble systems using a meta-learning approach. In: The 2013 International Joint Conference on Neural Networks (IJCNN), 2013, pp. 1–8.
[8] Bonissone, P. P.: Lazy meta-learning: creating customized model ensembles on demand. In: Proceedings of the 2012 World Congress on Advances in Computational Intelligence, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 1–23.
[9] Lee, M., Boroczky, L., Sungur-Stasik, K., Cann, A., Borczuk, A., Kawut, S., Powell, C.: A two-step approach for feature selection and classifier ensemble construction in computer-aided diagnosis. In: 21st IEEE International Symposium on Computer-Based Medical Systems, pp. 548–553 (2008)
[10] Oliveira, L. S., Morita, M., Sabourin, R.: Feature selection for ensembles applied to handwriting recognition. Int. J. Doc. Anal. Recognit., vol. 8, pp. 262–279 (2006).
[11] Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Elsevier (2005)
[12] Soares, C.: UCI++: Improved support for algorithm selection using datasetoids. In: Advances in Knowledge Discovery and Data Mining, LNCS, vol. 5476, pp. 499–506 (2009)
[13] Asuncion, A., Newman, D. J.: UCI Machine Learning Repository. University of California at Irvine, 2007. Available: http://ics.uci.edu/~mlearn/MLRepository.html
[14] Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning: Functional Genomics Special Issue, pp. 91–118 (2003)
[15] Gibbons, J. D., Chakraborti, S.: Nonparametric Statistical Inference, Fourth Edition, Revised and Expanded. Marcel Dekker, Alabama (2003)
[16] Knowles, J., Thiele, L., Zitzler, E.: A tutorial on the performance assessment of stochastic multiobjective optimizers. Computer Engineering and Networks Laboratory (TIK), ETH Zurich, Tech. Rep. TIK Report 214 (2006)
[17] Derrac, J., García, S., Herrera, F.: A first study on the use of coevolutionary algorithms for instance and feature selection. In: Corchado, E., Wu, X., Oja, E., Herrero, L., Baruque, B. (eds.) Hybrid Artificial Intelligence Systems, LNCS, vol. 5572, pp. 557–564, Springer, Berlin (2009).
[18] Ruta, D., Gabrys, B.: Classifier selection for majority voting. Information Fusion, vol. 6, no. 1, pp. 63–81 (2005). Available: http://www.sciencedirect.com/science/article/B6W76-4CHJ5N7-1/2/8b9b23704745066ff7fd97ba9fba3315
[19] Santos, E., Sabourin, R., Maupin, P.: Single and multi-objective genetic algorithms for the selection of ensemble of classifiers. In: International Joint Conference on Neural Networks, pp. 3070–3077 (2006).
[20] Santana, L., Silva, L., Canuto, A., Pintro, F., Vale, K.: A comparative analysis of genetic algorithm and ant colony optimization to select attributes for a heterogeneous ensemble of classifiers. In: 2010 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8 (2010).
[21] Sylvester, J., Chawla, N.: Evolutionary ensemble creation and thinning. In: IJCNN'06 International Joint Conference on Neural Networks, pp. 5148–5155 (2006).
[22] Wang, W., Partridge, D., Etherington, J.: Hybrid ensembles and coincident-failure diversity. In: IJCNN'01 International Joint Conference on Neural Networks, vol. 4, pp. 2376–2381 (2001).
[23] Bacauskiene, M., Verikas, A., Gelzinis, A., Valincius, D.: A feature selection technique for generation of classification committees and its application to categorization of laryngeal images. Pattern Recognition, vol. 42, pp. 645–654 (2009).
[24] Tsymbal, A., Puuronen, S., Patterson, D. W.: Ensemble feature selection with the simple Bayesian classification. Information Fusion, vol. 4, no. 2, pp. 87–100 (2003).