
Czech Technical University in Prague

Faculty of Electrical Engineering

DIPLOMA THESIS

Bc. Viktor Kajml

Black box optimization: Restarting versus MetaMax algorithm

Department of Cybernetics

Project supervisor: Ing. Petr Pošík, Ph.D.

Prague, 2014


Abstrakt

Tato diplomová práce se zabývá vyhodnocením nového perspektivního optimalizačního algoritmu, nazvaného MetaMax. Hlavním cílem je zhodnotit vhodnost jeho použití pro řešení problémů optimalizace černé skříňky se spojitými parametry, obzvláště v porovnání s ostatními metodami běžně používanými v této oblasti. Za tímto účelem jsou MetaMax a vybrané tradiční restartovací strategie podrobně otestovány na rozsáhlé sadě srovnávacích funkcí za použití různých algoritmů lokálního prohledávání. Takto naměřené výsledky jsou poté porovnány a vyhodnoceny. Druhotným cílem je navrhnout a implementovat modifikace algoritmu MetaMax v jistých oblastech, kde je prostor pro zlepšení jeho výkonů.

Abstract

This diploma thesis is focused on evaluating a new promising multi-start optimization algorithm called MetaMax. The main goal is to assess its utility in the area of black-box continuous parameter optimization, especially in comparison with other strategies commonly used in this area. To achieve this, MetaMax and a selection of traditional restart strategies are thoroughly tested on a large set of benchmark problems, using multiple different local search algorithms. Their results are then compared and evaluated. An additional goal is to suggest and implement modifications of the MetaMax algorithm in certain areas where there seems to be potential room for improvement.


I would like to thank:

Mr. Petr Pošík for his help on this thesis

The Centre of Machine Perception at the Czech Technical University in Prague for providing me with access to their computer grid

My friends and family for their support


Contents

1 Introduction
2 Problem description and related work
  2.1 Local search algorithms
  2.2 Multi-start strategies
3 MetaMax algorithm and its variants
  3.1 Suggested modifications
4 Experimental setup
  4.1 Used multi-start strategies
  4.2 Used metrics
  4.3 Implementation details
5 Results
  5.1 Compass search
  5.2 Nelder-Mead method
  5.3 BFGS
  5.4 CMA-ES
  5.5 Additional results
6 Conclusion
A Used local search algorithms
  A.1 Compass search
  A.2 Nelder-Mead algorithm
  A.3 BFGS
  A.4 CMA-ES
B CD contents
C Acknowledgements


List of Tables

1 Benchmark function groups
2 Algorithm-specific restart strategies
3 Tested multi-start strategies
4 Compass search - best restart strategies for each dimensionality
5 Compass search - results of restart strategies
6 Compass search - results of MetaMax(k) and corresponding fixed restart strategies
7 Compass search - results of MetaMax strategies
8 Nelder-Mead - best restart strategies for each dimensionality
9 Nelder-Mead - results of restart strategies
10 Nelder-Mead - results of MetaMax(k) and corresponding fixed restart strategies
11 Nelder-Mead - results of MetaMax strategies
12 BFGS - best restart strategies for each dimensionality
13 BFGS - results of restart strategies
14 BFGS - results of MetaMax(k) and corresponding fixed restart strategies
15 BFGS - results of MetaMax strategies
16 CMA-ES - best restart strategies for each dimensionality
17 CMA-ES - results of restart strategies
18 CMA-ES - results of MetaMax(k) and corresponding fixed restart strategies
19 CMA-ES - results of MetaMax strategies
20 CD contents


List of Figures

1 Restart condition based on function value stagnation
2 Example of monotone transformation of f(x)
3 MetaMax selection mechanisms
4 Example ECDF graph
5 Compass search - ECDF comparing MetaMax(k) with an equivalent fixed restart strategy
6 Compass search - ECDF of MetaMax variants using 100 instances
7 Compass search - ECDF of MetaMax variants using 50d instances
8 Nelder-Mead - ECDF comparing MetaMax(k) strategies
9 BFGS - ECDF of the best restart strategies
10 BFGS - ECDF of MetaMax variants using 50d instances
11 CMA-ES - ECDF of function value stagnation based restart strategies
12 CMA-ES - ECDF comparison of MetaMax variants using 50d instances
13 MetaMax timing measurements
14 ECDF comparing MetaMax strategies using different instance selection methods
15 Nelder-Mead algorithm in 2D

List of Algorithms

1 Typical structure of a local search algorithm
2 Variable neighbourhood search
3 MetaMax(k)
4 MetaMax(∞)
5 MetaMax
6 Compass search
7 Nelder-Mead method
8 BFGS algorithm
9 CMA-ES algorithm


1 Introduction

The goal of this thesis is to implement and evaluate the performance of the MetaMax optimization algorithm, particularly in comparison with other commonly used optimization strategies.

MetaMax was proposed by György and Kocsis in [GK11], and the results they present seem very interesting and suggest that MetaMax might be a very competitive algorithm. Our goal is to more closely evaluate its performance on problems from the area of black-box continuous optimization, by performing a series of exhaustive measurements and comparing the results with those of several commonly used restart strategies.

This text is organized as follows: firstly, there is a short overview of the subjects of mathematical, continuous and black-box optimization, local search algorithms and multi-start strategies. This is meant as an introduction for readers who might not be familiar with these topics. Readers who already have knowledge of these fields might wish to skip forward to the following sections, which describe the MetaMax algorithm, the experimental setup, the used optimization strategies and the software implementation. In the last two sections, the measured results are summed up and evaluated.

The mathematical optimization problem is defined as selecting the best element, according to some criterion, from a set of feasible elements. The most common form of the problem is finding a set of parameters x1,opt, . . . , xd,opt, where d is the problem dimension, for which the value of a given objective function f(x1, . . . , xd) is minimal, that is f(x1,opt, . . . , xd,opt) ≤ f(x1, . . . , xd) for all possible values of x1, . . . , xd.

Within the field of mathematical optimization, it is possible to define several subfields based on the properties of the parameters x1, . . . , xd and the amount of information available about the objective function f.

Combinatorial optimization: The set of all possible solutions (possible combinations of the parameter values) is finite, usually some subset of Nd.

Integer programming: All of the parameters are restricted to be integers: x1, . . . , xd ∈ N. Can be considered a subset of combinatorial optimization.

Mixed integer programming: Some parameters are real-valued and some are integers.

Continuous optimization: The set of all possible solutions is infinite. Usually x1, . . . , xd ∈ R.

Black-box optimization: Assumes that only a bare minimum of information about f is given. It can be evaluated at an arbitrary point x, returning the function value f(x), but besides that, no other properties of f are known. In order to solve this kind of problem, we have to resort to searching (the exact techniques are described in more detail later in this text). Furthermore, we are almost never guaranteed to find the exact solution, just one that is sufficiently close to it, and there is almost always a non-zero probability that even an approximate solution might not be found at all.


White-box optimization deals with problems where we have some additional knowledge about f, for example its gradient, which can obviously be very useful when looking for its minimum.

In this text, we will deal almost exclusively with black-box continuous optimization problems.

For a practical example of a black-box optimization problem, imagine the process of trying to design an airfoil which should have certain desired properties. It is possible to describe the airfoil by a vector of variables representing its various parameters - length, thickness, shape, etc. This will be the parameter vector x. Then, we can run an aerodynamic simulation with the airfoil described by x, evaluate how closely it matches the desired properties, and based on that, assign a function value f(x) to the parameter vector. In this way, the simulator becomes the black-box function f, and the problem is transformed into the task of minimizing the objective function f. We can then use black-box optimization methods to find the parameter vector xopt which will give us an airfoil with the desired properties.

This example hopefully sufficiently illustrates the fact that black-box optimization can be a very powerful tool, as it allows us to find reasonably good solutions even for problems which we might not be able to, or would not know how to, solve otherwise.
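
To make the black-box setting concrete, here is a minimal Python sketch of the simplest possible black-box method, pure random search. The quadratic `f` is only an illustrative stand-in for the black box (in the airfoil example, it would be the simulator); the bounds, budget, and all names are my own assumptions, not part of the thesis setup.

```python
import random

def f(x):
    # Illustrative stand-in for a black box: the optimizer only ever sees
    # the returned value, never this formula.
    return sum((xi - 1.0) ** 2 for xi in x)

def random_search(f, d, bounds, budget, seed=0):
    """Minimize f by sampling uniformly random points and keeping the best."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(budget):
        x = [rng.uniform(*bounds) for _ in range(d)]
        fx = f(x)  # the only interaction allowed with a black box
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

x_best, f_best = random_search(f, d=2, bounds=(-5.0, 5.0), budget=2000)
```

Even this naive method illustrates the two caveats from the text: we only ever get a solution close to the optimum, and only with some probability.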

As already mentioned, the usual method for finding optima (the best possible set of parameters xopt) in continuous mathematical optimization is searching. The structure of a typical local search algorithm is as follows:

Algorithm 1: Typical structure of a local search algorithm

1  Select a starting solution x0 somehow (most commonly randomly) from the set of feasible solutions.
2  Set current solution: xc ← x0
3  Get function value f(xc).
4  while Stop condition not met do
5      Generate a set of neighbour solutions Xn similar to xc
6      Evaluate f at each xn ∈ Xn
7      Find the best neighbour solution x∗ = argmin xn∈Xn f(xn)
8      if f(x∗) < f(xc) then
9          Update the current solution xc ← x∗
10     else
11         Modify the way of generating neighbour solutions
12 return xc
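
Rendered as runnable Python, the loop above might look like this. The neighbour-generation rule (eight uniform perturbations of size `step`) and the contraction factor are illustrative choices of mine, not prescribed by the thesis.

```python
import random

def local_search(f, x0, step=1.0, budget=500, min_step=1e-9, seed=0):
    """Sketch of Algorithm 1: greedy local search with step-size contraction."""
    rng = random.Random(seed)
    xc, fc = list(x0), f(x0)
    evals = 1
    while evals < budget and step > min_step:
        # Generate a set of neighbour solutions similar to xc
        neighbours = [[xi + rng.uniform(-step, step) for xi in xc]
                      for _ in range(8)]
        values = [f(xn) for xn in neighbours]
        evals += len(neighbours)
        best_i = min(range(len(values)), key=values.__getitem__)
        if values[best_i] < fc:   # improvement: move the current solution
            xc, fc = neighbours[best_i], values[best_i]
        else:                     # failure: modify neighbour generation
            step *= 0.5
    return xc, fc

sphere = lambda x: sum(xi * xi for xi in x)
x_opt, f_opt = local_search(sphere, [3.0, -2.0])
```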

In the case of continuous optimization, a solution is represented simply by a point in Rd. There are various ways of generating neighbour solutions. In general, two neighbouring solutions should be different from each other, but in some sense also similar. In continuous optimization, this usually means that the solutions are close in terms of Euclidean distance, but not identical.


The algorithm described above has the property that it always creates neighbour solutions close to the current solution and moves the current solution in the direction of decreasing f(x). This makes it a greedy algorithm, which works well in cases where the objective function is unimodal (has only one optimum), but for multimodal functions (functions with multiple local optima), the resulting behaviour will not be ideal. The algorithm will move in the direction of the nearest optimum (the optimum whose basin of attraction contains x0), but when it gets there it will not move any further, as at that point all the neighbour solutions will be worse than the current solution. Such an algorithm can therefore be relied on to find the nearest local optimum, but there is no guarantee that it will also be the global one. The global optimum will be found only when x0 happens to "land" in its basin of attraction.

The method most commonly used to overcome this problem is to run multiple instances of the local search algorithm from different starting positions x0. Then it is probable that at least one of them will start in the basin of attraction of the global optimum and will be able to find it.

There are various different multi-start strategies which implement this basic idea, with MetaMax, the main subject of this thesis, being one of them.

A more thorough description of the local search algorithms' problem of getting stuck in a local optimum is given in section 2. A detailed description of the MetaMax algorithm and its variants is given in section 3. The structure of the performed experiments is described in section 4. Finally, the measured results are presented and evaluated in section 5.

2 Problem description and related work

As mentioned in the previous section, local search algorithms have problems finding the global optimum of functions with multiple optima (also called multimodal functions). In this section we focus on this problem more thoroughly. We describe several common types of local search algorithms in more detail and discuss their susceptibility to getting stuck in a local optimum. Next, we describe several methods to overcome this problem.

2.1 Local search algorithms

Following are descriptions of four commonly used kinds of local search algorithms, which we hope will give the reader a more concrete idea about the functioning of local search algorithms than the very basic example described in algorithm 1.

Line search algorithms try to solve the problem of minimizing a d-dimensional function f by using a series of one-dimensional minimization tasks, called line searches. During each step of the algorithm, an imaginary line is created starting at the current solution xc and going in a suitably chosen direction σ. Then, the line is searched for a point x with the minimal value of f(x), and the current solution is updated: xc ← x. In this way, the algorithm will eventually converge to the nearest local optimum of f.


The question remains - how to choose the search direction σ? The simplest algorithms just use a preselected set of directions (usually vectors of an orthonormal positive d-dimensional basis) and loop through them on successive iterations. This method is quite simple to implement, but it has trouble coping with ill-conditioned functions.

An obvious idea might be to use information about the function's gradient for determining the search direction. However, this turns out not to be much more effective than the simple alternating algorithms. The best results are achieved when information about both the function's gradient and its Hessian is used; then it is possible to get quite robust and well-performing algorithms. Note that for black-box optimization problems, it is necessary to obtain the gradient by estimation, as it is not explicitly available.

Examples of this kind of algorithm are the symmetric rank-one method, the gradient descent algorithm and the Broyden-Fletcher-Goldfarb-Shanno algorithm.

Pattern search algorithms closely fit the description given in algorithm 1. They generate the neighbour solutions xn ∈ Xn at defined positions (a pattern) relative to the current solution xc. If any of the neighbour solutions is found to be better than the current one, it becomes the new current solution, the next set of neighbour solutions is generated around it, and so on.

If none of the neighbour solutions is found to be better (an unsuccessful iteration), then the pattern is contracted, so that in the next step the neighbour solutions are generated closer to xc. In this way the algorithm will converge to the nearest local optimum (for proof, please see [KLT03]). Advanced pattern search algorithms use patterns which change size and shape according to various rules, both on successful and unsuccessful iterations.

Typical algorithms of this type are compass search (or coordinate search), the Nelder-Mead simplex algorithm and the Luus-Jaakola algorithm.
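
As an illustration of the contraction mechanism, here is a minimal compass-search sketch in Python. The pattern is the 2d coordinate directions; the halving factor, tolerances, and the test function are illustrative choices of mine.

```python
def compass_search(f, x0, step=1.0, min_step=1e-6, max_iters=10000):
    """Poll the 2d coordinate directions around xc; contract the pattern
    (halve the step) after an unsuccessful iteration."""
    xc = list(x0)
    fc = f(xc)
    d = len(xc)
    iters = 0
    while step > min_step and iters < max_iters:
        iters += 1
        improved = False
        for i in range(d):
            for sign in (+1.0, -1.0):
                xn = list(xc)
                xn[i] += sign * step
                fn = f(xn)
                if fn < fc:
                    xc, fc, improved = xn, fn, True
        if not improved:
            step *= 0.5   # unsuccessful iteration: contract around xc
    return xc, fc

xc, fc = compass_search(lambda x: (x[0] - 2.0) ** 2 + x[1] ** 2, [0.0, 0.0])
```

On this unimodal quadratic the search walks to (2, 0) and then only contracts, matching the convergence behaviour described above.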

Population based algorithms keep track of a number of solutions, also called individuals, at one time, which together constitute a population. A new generation of solutions is generated each step, based on the properties of a set of selected (usually the best) individuals from the previous generation. Different algorithms vary in the exact implementation of this process.

For example, in the family of genetic algorithms, this process is designed to emulate natural evolution: properties of each individual (in the case of continuous optimization, its position) are encoded into a genome, and new individuals are created by combining parts of the genomes of successful individuals from the previous generation, or by random mutation. Unsuccessful individuals are discarded, in an analogy with the natural principle of survival of the fittest.

Other population based algorithms, such as CMA-ES, take a somewhat more mathematical approach: new generations are populated by sampling a multivariate normal distribution, which is in turn updated every step, based on the properties of a number of the best individuals from the previous generation.


Swarm intelligence algorithms are based on the observation that it is possible to get quite well-performing optimization algorithms by trying to emulate natural behaviours, such as the flocking of birds or fish schools. Each solution represents one member of a swarm and moves around the search space according to a simple set of rules. For example, it might try to keep a certain minimal distance from other flock members, while also heading in the direction with the best values of f(x). The specific rules vary a great deal between different algorithms, but in general even a simple individual behaviour is often enough to result in quite complex collective emergent behaviour. Because swarm intelligence algorithms keep track of multiple individuals/solutions during each step, they can also be considered a subset of population based algorithms.

Some examples of this class of algorithms are the particle swarm optimization algorithm and the fish school search algorithm.

Pattern search and line search algorithms have the property that they always choose neighbour solutions close to the current solution and move in the direction of decreasing f(x). Thus, as was already described in the previous section, they are able to find only the local optimum nearest to their starting position x0.

Population based and swarm intelligence algorithms might be somewhat less susceptible to this behaviour in the case where the initial population is spread over a large area of the search space. Then there is a chance that some individuals might "land" near the global optimum and eventually "pull" the others towards it.

There are several modifications of local search algorithms specifically designed to overcome the problem of getting stuck in a local optimum. We shall now describe two basic ones - simulated annealing and tabu search. The main idea behind them is to limit the local search algorithm's greedy behaviour by sometimes taking steps other than those which lead to the greatest decrease of f(x).

Simulated annealing implements the above mentioned idea in a very straightforward way: during each step, the local search algorithm may select any of the generated neighbour solutions with a non-zero probability, thus possibly not selecting the best one.

The probability P of choosing a particular neighbour solution xn is a function of f(xc), f(xn), and s, where s is the number of steps already taken by the algorithm. Usually, it increases with the value of ∆f = f(xc) − f(xn), so that the best neighbour solutions are still likely to be picked most often. The probability of choosing a neighbour solution other than the best one also usually decreases as s increases, so that the algorithm behaves more randomly in the beginning and then, as time goes on, settles down to a more predictable behaviour and converges to the nearest optimum. This is somewhat similar to the metallurgical process of annealing, from which the algorithm takes its name.

It is possible to apply this method to almost any of the previously mentioned local search algorithms, simply by adding the possibility of choosing neighbour solutions which are not the best. In practice, the exact form of P(f(xc), f(xn), s) has to be fine-tuned for a given problem in order to get good results. Therefore, this algorithm is of limited usefulness in the area of black-box optimization.
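
A minimal simulated-annealing sketch could look as follows. The Gaussian proposal, the geometric cooling schedule, and the exponential acceptance form exp(−∆/T) are common but illustrative choices standing in for the generic P(f(xc), f(xn), s); the 1-D Rastrigin test function is likewise my own example.

```python
import math
import random

def anneal(f, x0, temp0=1.0, cooling=0.995, step=0.5, budget=4000, seed=1):
    """Simulated annealing: a worse neighbour may still be accepted with
    probability exp(-delta/T), and T decays with the step count s."""
    rng = random.Random(seed)
    xc, fc = list(x0), f(x0)
    best_x, best_f = xc, fc
    T = temp0
    for s in range(budget):
        xn = [xi + rng.gauss(0.0, step) for xi in xc]
        fn = f(xn)
        delta = fn - fc
        if delta < 0 or rng.random() < math.exp(-delta / T):
            xc, fc = xn, fn          # may accept an uphill move
        if fc < best_f:
            best_x, best_f = xc, fc
        T *= cooling                 # random early on, settles down later
    return best_x, best_f

# Multimodal 1-D Rastrigin-style test function (global optimum f(0) = 0).
rastrigin1d = lambda x: 10 + x[0] ** 2 - 10 * math.cos(2 * math.pi * x[0])
sa_x, sa_f = anneal(rastrigin1d, [3.0])
```

The early high-temperature phase lets the chain hop between the basins that would trap a purely greedy search.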

Tabu search works by keeping a list of previously visited solutions, which is called the tabu list. It selects potential moves only from the set of neighbour solutions which are not on this list, even if it means choosing a solution which is worse than the current one. The selected solution is then added to the tabu list and the oldest entry in the tabu list is deleted. The list therefore works in a way similar to a cyclic buffer.

This method was originally designed for solving combinatorial optimization problems, and it requires certain modifications in order to be useful in the area of continuous parameter optimization. At the very least, it is necessary to modify the method to discard not only neighbour solutions which are on the tabu list, but also solutions which are close to them. Without this, the algorithm would not work very well, as the probability of generating the exact same solution twice in Rd is quite small.

There is a multitude of advanced variations of this basic method; for example, it is possible to add aspiration rules, which override the tabu status of solutions that would lead to a large decrease in f(x). For a detailed description of tabu search adapted for continuous optimization, please see [CS00].
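
The continuous adaptation described above (a fixed-size queue of recent solutions, plus a distance threshold around each of them) can be sketched as follows; the radius, step size, and test function are illustrative assumptions, not taken from [CS00].

```python
import random
from collections import deque

def tabu_search(f, x0, step=0.5, tabu_len=20, radius=0.3, budget=300, seed=2):
    """Continuous tabu search: skip neighbours closer than `radius` to any
    recently visited solution; the deque acts as the cyclic tabu list."""
    rng = random.Random(seed)
    xc = list(x0)
    best_x, best_f = list(xc), f(xc)
    tabu = deque(maxlen=tabu_len)        # oldest entry drops out automatically
    tabu.append(list(xc))
    for _ in range(budget):
        neighbours = [[xi + rng.uniform(-step, step) for xi in xc]
                      for _ in range(10)]
        allowed = [xn for xn in neighbours
                   if all(sum((a - b) ** 2 for a, b in zip(xn, t)) > radius ** 2
                          for t in tabu)]
        if not allowed:
            continue
        # Move to the best allowed neighbour even if it is worse than xc.
        xc = min(allowed, key=f)
        tabu.append(list(xc))
        if f(xc) < best_f:
            best_x, best_f = list(xc), f(xc)
    return best_x, best_f

tabu_x, tabu_f = tabu_search(lambda x: sum(xi * xi for xi in x), [2.0, 2.0])
```

Because the walk is forced away from recent points, the best-so-far solution must be tracked separately, unlike in a greedy search.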

2.2 Multi-start strategies

Multi-start strategies allow effectively using local search algorithms on functions with multiple local optima without making any modification to the way they work. The basic idea is that if we run a search algorithm multiple times, each time from a different starting position x0, then it is probable that at least one of the starting positions will be in the basin of attraction of the global optimum, and thus the corresponding local search algorithm instance will be able to find it. Of course, the probability of this depends on the number of algorithm instances that are run, relative to the number and properties of the function's optima. It is possible to think about multi-start strategies as meta-heuristics, running above, and controlling, multiple instances of local search algorithm sub-heuristics.

Restart strategies are a subset of multi-start strategies, where multiple instances are run one at a time in succession. The most basic implementation of a restart strategy is to take the total allowed resource budget (usually a set number of objective function evaluations), divide it evenly into multiple slots, and use each of them to run one instance of a local search algorithm. A very important choice is the length of a single slot. The optimal length largely depends on the specific problem and the type of algorithm used. If the length is set too low, then the algorithm might not have enough time to converge to its nearest optimum. If it is too long, then there is a possibility that resources will be wasted on running instances which are stuck in local optima and can no longer improve.

Of course, all of the time slots do not have to be of the same length. A good strategy for black-box optimization is to start with a low slot length and keep increasing it for each subsequent slot. In this way, reasonable performance can be achieved even if we are unable to choose the most suitable slot length for a given problem in advance.
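
A growing-slot restart strategy can be sketched as follows; the stand-in hill climber, the doubling factor, the search box [−5, 5]², and the initial slot length are all illustrative assumptions rather than the thesis's experimental settings.

```python
import random

def hill_climb(f, x0, evals, step=0.5, seed=0):
    """Stand-in local search: greedy random perturbations with a fixed step."""
    rng = random.Random(seed)
    xc, fc = list(x0), f(x0)
    for _ in range(evals - 1):
        xn = [xi + rng.uniform(-step, step) for xi in xc]
        fn = f(xn)
        if fn < fc:
            xc, fc = xn, fn
    return fc

def restart_with_growing_slots(f, total_budget, first_slot=100, growth=2.0,
                               seed=5):
    """Spend the budget in slots of increasing length: one fresh instance per
    slot, each started from a uniformly random point in [-5, 5]^2."""
    rng = random.Random(seed)
    best_f = float("inf")
    slot, spent = first_slot, 0
    while spent < total_budget:
        evals = int(min(slot, total_budget - spent))
        x0 = [rng.uniform(-5.0, 5.0) for _ in range(2)]
        best_f = min(best_f, hill_climb(f, x0, evals))
        spent += evals
        slot *= growth        # each slot is longer than the last
    return best_f

sphere = lambda x: sum(xi * xi for xi in x)
restart_best = restart_with_growing_slots(sphere, total_budget=2000)
```

With a 2000-evaluation budget this yields slots of 100, 200, 400, 800 and a final truncated slot of 500 evaluations, so both short and long instances get a chance.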

A different restart strategy is to keep each instance going for as long as it needs until it converges to an optimum. The most universal way to detect convergence is to look for stagnation of the objective function values over a number of past function evaluations (or past local search algorithm iterations). If the best objective function value found so far does not improve by at least the limit tf over the last hf function evaluations, then the current algorithm instance is terminated and a new one is started. For convenience, in the subsequent text we will call hf the function value history length and tf the function value history tolerance. An example of this restart condition is given in figure 1: the best solution found after v function evaluations is marked as x∗v and its corresponding function value as f(x∗v). In the figure, we see that the restart condition is triggered because at the last function evaluation m, the following is true: f(x∗m−hf) ≤ f(x∗m) + tf

Figure 1: Restart condition based on function value stagnation
Displays the objective function value f(xv) (dashed black line) at evaluation v, and the best objective function value reached after v function evaluations, f(x∗v) (solid black line), over the interval ⟨0, m⟩ of function evaluations. The values f(x∗m), f(x∗m) + tf and m − hf are highlighted.

It is, of course, necessary to choose specific values of hf and tf, but usually it is not overly difficult to find a combination which works well for a large set of problems.
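
The stagnation condition can be written directly from its definition. The helper name `should_restart` and the `best_history` representation (best-so-far value after each evaluation) are mine, not from the thesis.

```python
def should_restart(best_history, hf, tf):
    """Trigger a restart when the best-so-far objective value has not improved
    by more than tf over the last hf evaluations, i.e. when
    f(x*_{m-hf}) <= f(x*_m) + tf."""
    m = len(best_history) - 1
    if m < hf:
        return False              # not enough history collected yet
    return best_history[m - hf] <= best_history[m] + tf

# best_history[v] = best objective value found after v evaluations
history = [100.0, 60.0, 40.0, 39.9, 39.8, 39.8, 39.8]
```

For this history, `should_restart(history, hf=3, tf=0.5)` triggers (the last three evaluations gained only 0.1), while a longer history window of hf=6 does not, since the value 100.0 still lies above 39.8 + tf.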

Various different ways of detecting convergence, and corresponding restart conditions, can be used: for example, reaching zero gradient for line search algorithms, reaching a minimal pattern size for pattern search algorithms, etc.

There are also various ways of choosing the starting position x0 for new local search algorithm instances. The simplest one is to choose x0 by sampling a uniform random distribution over the set of all feasible solutions. This is very easy to implement and often gives good results. However, it is also possible to use information gained by the previous instances when choosing x0 for a new one.

A simple algorithm which utilizes this idea is iterated search: the first instance i1 is started from an arbitrary position and run until it converges (or until it exhausts a certain amount of resources), returning the best solution it has found, x∗i1. Then, the starting position for the next instance is selected from the neighbourhood N(x∗i1). Note that N is a qualitatively different neighbourhood than what the instance i1 might be using to generate neighbour solutions each step. It is usually much larger, the goal being to generate the new starting point for instance i2 by perturbing the best solution of i1 enough to move it to a different basin of attraction. If the new instance finds a solution x∗i2 better than x∗i1, then the next instance is started from the neighbourhood N(x∗i2). If f(x∗i2) ≥ f(x∗i1) and a better solution is not found, then the next instance is started from the neighbourhood N(x∗i1) again. This is repeated until a stop condition is triggered. An obvious assumption that this method makes is that the minima of the objective function are grouped close together. If this is not the case, then it might be better to use uniform random sampling.

The big question is how to choose the size of the neighbourhood N. Too small, and the new instance might fall into the same basin of attraction as the previous one. Too big, and the results will be similar to choosing the starting position uniformly at random. Another method, called variable neighbourhood search, which can in a way be considered an improved version of iterated search, tackles this problem by using multiple neighbourhood structures N1, . . . , Nk of varying sizes, where N1 is the smallest and the following neighbourhoods are successively larger, with Nk being the largest. The restarting procedure is the same as with iterated search, with the following modification: if a local search algorithm instance ik, started from the neighbourhood N1(x∗ik−1), does not improve the current best solution, then the algorithm tries starting the next instance from N2(x∗ik−1), then N3(x∗ik−1), and so on. The structure of a basic variable neighbourhood search, as given in [HM03], page 10, is described in algorithm 2. This algorithm can also be used as a description of iterated search, if the set of neighbourhood structures contains only one element.

Yet another group of methods, which aim to prevent local search algorithms from getting stuck in local optima, is based on the idea that it is not necessary to run multiple local search algorithm instances one after another; they can instead be run at the same time. Then, it is possible to evaluate the expected performance of each instance based on the results it has obtained so far and allocate resources to the best (or most promising) ones. This is somewhat similar to the well-known multi-armed bandit problem.

The basic implementation of this idea is called the explore-and-exploit strategy. It involves initially running all of its k algorithm instances until a certain fraction of the resource budget is expended. This is the exploration phase.

Algorithm 2: Variable neighbourhood search

input: initial position x0, set of neighbourhood structures N1, . . . , Nk of increasing size

1  x∗ ← local_search(x0)
2  k ← 1
3  while Stop condition not met do
4      Generate random point x′ from Nk(x∗)
5      y∗ ← local_search(x′)
6      if f(y∗) < f(x∗) then
7          x∗ ← y∗
8          k ← 1
9      else
10         k ← k + 1
11 return x∗
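
A runnable Python sketch of algorithm 2 follows. The coordinate-descent subroutine stands in for `local_search`, uniform box perturbations stand in for the neighbourhood structures Nk, and the double-well test function is my own; only the control flow (grow the neighbourhood on failure, reset on success) is taken from the algorithm.

```python
import random

def coord_descent(f, x0, step=0.25, min_step=1e-4):
    """Tiny deterministic local search used as the local_search subroutine."""
    xc = list(x0)
    while step > min_step:
        moved = False
        for i in range(len(xc)):
            for sgn in (1.0, -1.0):
                xn = list(xc)
                xn[i] += sgn * step
                if f(xn) < f(xc):
                    xc, moved = xn, True
        if not moved:
            step *= 0.5
    return xc

def vns(f, x0, sizes, budget=30, seed=3):
    """Variable neighbourhood search: on failure try the next (larger)
    neighbourhood, on success reset to the smallest one."""
    rng = random.Random(seed)
    x_best = coord_descent(f, x0)
    k = 0
    for _ in range(budget):
        size = sizes[min(k, len(sizes) - 1)]
        # Random point from the k-th neighbourhood of the incumbent.
        x_start = [xi + rng.uniform(-size, size) for xi in x_best]
        y = coord_descent(f, x_start)
        if f(y) < f(x_best):
            x_best, k = y, 0      # improvement: back to the smallest N
        else:
            k += 1                # failure: escalate to a larger N
    return x_best

# Double well: local minimum near x = 4 (value 1), global minimum at x = 0.
double_well = lambda x: min((x[0] - 4.0) ** 2 + 1.0, x[0] ** 2)
vns_result = vns(double_well, [5.0], sizes=[0.5, 1.5, 4.0])
```

Started at x = 5, plain coordinate descent stalls at the local minimum near 4; the escalating perturbations give VNS a chance to jump into the global basin.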

Then, the best instance is selected and run until the rest of the resource budget is used up - the exploitation phase.

There is, again, an obvious trade-off between the amount of resources allocated to each phase. The exploration phase should be long enough that, when it ends, it is possible to reliably identify the best instance. On the other hand, it is necessary to have enough resources left in the exploitation phase in order for the selected best instance to converge to the optimum. In practice, it is actually not that difficult to find a balance between these two phases that gives good results for a wide range of problems.
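
The explore-and-exploit split can be sketched as follows; the stochastic hill-climber instance, the 50/50 budget split, and the two-basin test function are all illustrative assumptions, not the thesis's experimental setup.

```python
import random

def make_instance(f, x0, rng, step=0.5):
    """One stochastic hill-climber; calling advance() spends one evaluation."""
    state = {"x": list(x0), "fx": f(x0), "step": step}
    def advance():
        xn = [xi + rng.uniform(-state["step"], state["step"])
              for xi in state["x"]]
        fn = f(xn)
        if fn < state["fx"]:
            state["x"], state["fx"] = xn, fn
        else:
            state["step"] *= 0.95   # shrink on failure so the instance settles
    state["advance"] = advance
    return state

def explore_and_exploit(f, starts, budget, explore_frac=0.5, seed=4):
    """Run every instance on an equal share of the exploration budget, then
    spend the whole remaining budget on the single best instance."""
    rng = random.Random(seed)
    instances = [make_instance(f, x0, rng) for x0 in starts]
    per_instance = int(budget * explore_frac) // len(instances)
    for inst in instances:                       # exploration phase
        for _ in range(per_instance):
            inst["advance"]()
    best = min(instances, key=lambda inst: inst["fx"])
    for _ in range(budget - per_instance * len(instances)):
        best["advance"]()                        # exploitation phase
    return best["x"], best["fx"]

# Two basins: a local minimum near x = 3 (value 0.5), the global one at x = 0.
two_basin = lambda x: min((x[0] - 3.0) ** 2 + 0.5, x[0] ** 2)
ee_x, ee_f = explore_and_exploit(two_basin, [[4.0], [-1.0], [8.0]], budget=600)
```

Only the instance started at −1 sits in the global basin; the exploration phase is what lets the strategy identify it before committing the remaining budget.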

Methods like this, which run multiple local search algorithm instances at the same time, belong to the group of portfolio algorithms. We should, however, note that portfolio algorithms are usually used in a somewhat different way than described here. Most commonly, they run instances of multiple different local search algorithms, each of which is well suited to a different kind of problem. This allows the portfolio algorithm to select instances of the algorithm which is able to solve the given problem most efficiently, even without knowing the problem's properties a priori.

The MetaMax algorithm, which is the main subject of this thesis, is also a portfolio algorithm. However, we use it with only one kind of local search algorithm at a time, to allow for a more fair and direct comparison with restart strategies, which typically use only one kind of local search algorithm.

3 MetaMax algorithm and its variants

The MetaMax algorithm is a multi-start portfolio strategy presented by György and Kocsis in [GK11]. There are, in fact, three versions of the algorithm, which differ in certain details. They are called MetaMax(k), MetaMax(∞) and MetaMax, and they will be described in detail in this section.

Please note that while in this text we usually presume all optimization problems to be minimization problems, the text in [GK11] assumes a maximization task. Therefore, while describing the workings of the MetaMax algorithm in this section, we will keep to the convention of [GK11], but in the rest of the text we will refer to minimization tasks as usual. Our implementation of MetaMax was modified to work with minimization tasks.

György and Kocsis demonstrate ([GK11], page 413, equation 2) that the convergence of an instance of a local search algorithm, after s steps, can be optimistically estimated with large probability as:

lim_{s→∞} f(x*_s) ≤ f(x*_s) + g_σ(s)   (1)

Where f(x*_s) is the best function value obtained by the local search algorithm instance up until step s, and g_σ(s) is a non-increasing, non-negative function with lim_{s→∞} g_σ(s) = 0. Note that the notation used here is a little different than in [GK11], but the meaning is the same.

In practice, the exact form of g_σ(s) is not known, so the right side of equation 1 has to be approximated as:

f(x*_s) + c·h(s)   (2)

Where c is an unknown constant and h(s) is a positive, monotone, decreasing function with the following properties:

h(0) = 1,   lim_{s→∞} h(s) = 0   (3)

One possible simple form of this function is h(s) = e^{−s}. In the subsequent text, we shall call this function "the estimate function". György and Kocsis do not use this name in their work; in fact, they do not use any name for this function at all and refer to it simply as the function h. However, we think that this is not very convenient, hence we picked a suitable name.

Based on equations 1 and 2, it is possible to create a strategy that allocates resources only to those instances which are estimated to converge the most quickly, that is, which maximize the value of expression 2 for a certain range of the constant c. The problem of finding these instances can be solved effectively by transforming it into the problem of finding the upper right convex hull of a set of points in the following way:

We assume that there are k instances in total and that each instance Ai keeps track of the number of steps it has taken s_i, the position x*_{i,s_i} of the best solution it has found so far, and its corresponding function value f(x*_{i,s_i}). If we represent the set of the local search algorithm instances Ai, i = 1, . . . , k by a set of points:

P : (h(s_i), f(x*_{i,s_i})),   i = 1, . . . , k   (4)

Then the instances which maximize the value of expression 2 for a certain range of c correspond to those points which lie on the upper right convex hull of the set P. Because the term upper right convex hull is not quite standard, we should clarify that we understand it to mean the intersection of the upper convex hull and the right convex hull.
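The selection rule above can be illustrated with a small sketch. Instead of computing the exact upper right convex hull, this approximation simply scans a finite grid of c values and records which instance wins for each; the function name and grid are our own choices, not the authors' code.

```python
def select_instances(f, h, c_grid=None):
    """Approximation of the convex-hull selection rule (maximization
    convention of this section): instance i is selected if some c > 0
    makes f[i] + c*h[i] the largest optimistic estimate. A finite grid
    of c values stands in for the exact hull computation."""
    if c_grid is None:
        c_grid = [10.0 ** (e / 10.0) for e in range(-60, 61)]
    selected = set()
    for c in c_grid:
        estimates = [fi + c * hi for fi, hi in zip(f, h)]
        selected.add(estimates.index(max(estimates)))
    return selected

# Four instances as points (h(v_i), f_i); instance 2 (h=0.5, f=0.5) is
# dominated by instance 1 (h=0.5, f=1.0) and is never selected.
h = [1.0, 0.5, 0.5, 0.2]
f = [0.0, 1.0, 0.5, 1.2]
print(sorted(select_instances(f, h)))  # [0, 1, 3]
```

Small c favours the instance with the best value so far; large c favours the instance with the largest estimate h, i.e. the one that has consumed the fewest resources.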


Note that, presumably for simplicity, the authors of [GK11] assumed only local search algorithms which use the same number of function evaluations every step. For algorithms where this is not true, it makes more sense to instead set s_i equal to the number of function evaluations used by instance i so far. We believe that this is a better way to measure the use of resources by individual instances, which is also confirmed in [PG13].

György and Kocsis suggest using a form of estimate function which changes based on the amount of resources used by all the local search algorithm instances, in order to encourage more exploratory behaviour as the MetaMax algorithm progresses. Therefore, in our implementation, we use the following estimate function, which is recommended in [GK11]:

h(v_i, v_t) = e^{−v_i/v_t}   (5)

Where v_i is the number of function evaluations used by instance i and v_t is the total number of function evaluations used by all of the instances combined.
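A short sketch makes the intended behaviour of equation 5 concrete: instances that have consumed a small share of the total budget keep a high estimate, while heavy consumers are discounted. The example values are ours, purely illustrative.

```python
import math

def h(v_i, v_t):
    # Adaptive estimate function of equation 5: the estimate decays with
    # the share of the total evaluation budget consumed by instance i.
    return math.exp(-v_i / v_t)

# An instance that used 10 of 1000 total evaluations keeps a high estimate,
# while one that used 900 of 1000 is heavily discounted.
light, heavy = h(10, 1000), h(900, 1000)
```

Because v_t grows every round, the same v_i yields a larger estimate later on, which is exactly what pushes MetaMax toward more exploration as it progresses.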

The simplest of the three MetaMax variants is MetaMax(k). It uses k local search algorithm instances and is described in algorithm 3. For convenience and improved readability, we will use the following simplified notation when describing the MetaMax variants:

• v_i for the number of function evaluations used by local search algorithm instance i so far

• x_i for the position of the best solution found by instance i so far

• f_i for the function value of x_i

In the descriptions, we also assume that the estimate function h is a function of only one variable.

Algorithm 3: MetaMax(k)

input : function to be optimized f, number of algorithm instances k and a monotone non-increasing function h with properties as given in equation 3

1: Step each of the k local search algorithm instances Ai and update their variables vi, xi and fi
2: while stop conditions not met do
3:     For i = 1, . . . , k, select algorithm Ai if there exists c > 0 so that: fi + c·h(vi) > fj + c·h(vj) for all j = 1, . . . , k such that (vi, fi) ≠ (vj, fj). If there are multiple algorithms with identical v and f, then select only one of them at random.
4:     Step each selected Ai and update its variables vi, xi and fi.
5:     Find the best instance: b = argmin_{i=1,...,k}(fi).
6:     Update the best solution: x* ← xb.
7: return x*
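One possible round loop of MetaMax(k) can be sketched as follows. This is a toy illustration under the maximization convention of this section, not the thesis implementation: the "instances" are naive random-walk hill climbers (a stand-in for the local search algorithms tested later), and the selection step scans a grid of c values rather than computing the exact convex hull.

```python
import math
import random

def metamax_k(func, k, dim, rounds, seed=1):
    """Toy MetaMax(k) sketch (maximization). One step of an instance costs
    one function evaluation; the estimate function is equation 5."""
    rng = random.Random(seed)
    X = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(k)]
    F = [func(x) for x in X]     # f_i: best value found by instance i
    V = [1] * k                  # v_i: evaluations used by instance i

    def step(i):                 # one local search step of instance i
        cand = [xj + rng.gauss(0, 0.1) for xj in X[i]]
        fc = func(cand)
        V[i] += 1
        if fc > F[i]:
            X[i], F[i] = cand, fc

    for _ in range(rounds):
        vt = sum(V)
        H = [math.exp(-v / vt) for v in V]   # estimate function, equation 5
        selected = {max(range(k), key=lambda i: F[i] + c * H[i])
                    for c in (10.0 ** e for e in range(-6, 7))}
        for i in selected:
            step(i)
    b = max(range(k), key=lambda i: F[i])    # best instance
    return X[b], F[b]

# Maximize f(x) = -||x||^2 (optimum 0 at the origin).
best_x, best_f = metamax_k(lambda x: -sum(t * t for t in x),
                           k=5, dim=2, rounds=100)
```

The grid-based selection is only an approximation of line 3 of algorithm 3, but it preserves the key behaviour: each round, resources go only to instances that are optimal for some value of c.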

As with a priori scheduled restart strategies, there is the question of choosing the right number of instances (the parameter k) to use. The other two versions of the algorithm, MetaMax and MetaMax(∞), get around this problem by gradually increasing the number of instances, starting with a single one and adding a new one every round. Thus, the number of instances tends to infinity as the algorithm keeps running. This makes it possible to prove that the algorithm is consistent; that is, it will almost surely find the global optimum if kept running for an infinite amount of time.

Please note that in some literature, such as [Neu11], the term asymptotically complete is used instead of consistent, but both terms mean the same thing. Also note that we use the word round to refer to a step of the MetaMax algorithm, in order to avoid confusion with steps of the local search algorithms. MetaMax and MetaMax(∞) are described in algorithms 5 and 4 respectively, also using the simplified notation.

Algorithm 4: MetaMax(∞)

input : function to be optimized f, monotone non-increasing function h with properties as given in equation 3

1: r ← 1
2: while stop conditions not met do
3:     Add a new local search algorithm instance Ar, step it once and initialize its variables vr, xr and fr
4:     For i = 1, . . . , r, select algorithm Ai if there exists c > 0 so that: fi + c·h(vi) > fj + c·h(vj) for all j = 1, . . . , r such that (vi, fi) ≠ (vj, fj). If there are multiple algorithms with identical v and f, then select only one of them at random.
5:     Step each selected Ai and update its variables vi, xi and fi.
6:     Find the best instance: b = argmin_{i=1,...,r}(fi).
7:     Update the best solution: x* ← xb.
8:     r ← r + 1
9: return x*

MetaMax and MetaMax(∞) differ in only one point (lines 6 and 7 in algorithm 5): if, after stepping all selected instances, the best instance is a different one than in the previous round, MetaMax will step it until it overtakes the old best instance in terms of used resources.
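This catch-up rule is small enough to sketch directly. The helper names (`step`, `v`) are hypothetical bookkeeping, not part of the original pseudocode; `step(i)` performs one step of instance i and must increment `v[i]`.

```python
def catch_up(step, v, b_new, b_old):
    """Sketch of the extra rule that distinguishes MetaMax from
    MetaMax(∞) (lines 6 and 7 of algorithm 5): when the best instance
    changes, step the new best until it has used at least as many
    evaluations as the old best."""
    if b_new != b_old:
        while v[b_new] < v[b_old]:
            step(b_new)

# Example: instance 1 becomes the new best with only 2 evaluations used,
# while the old best (instance 0) has used 5; it is stepped until even.
v = [5, 2]
catch_up(lambda i: v.__setitem__(i, v[i] + 1), v, b_new=1, b_old=0)
print(v)  # [5, 5]
```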

In [GK11] it is shown that MetaMax asymptotically approaches the performance of its best local search algorithm instance as the number of rounds increases. Theoretical analysis suggests that the number of instances increases at a rate of Ω(√v_t), where v_t is the total number of used function evaluations. However, practical results give a rate of growth of only Ω(v_t/log v_t). Based on this, it can also be estimated ([GK11], page 439) that to find the global optimum x_opt, MetaMax needs only a logarithmic factor more function evaluations than a local search algorithm instance which would start in the basin of attraction of x_opt.

Note a small difference in the way MetaMax and MetaMax(∞) are described in algorithms 5 and 4 from their descriptions in [GK11]. There, a new algorithm instance Ar is added with fr = 0 and sr = 0 and takes at most one step during the round in which it is added. This is possible because in [GK11] a non-negative objective function f and a maximization task are assumed. Therefore, an algorithm instance

Algorithm 5: MetaMax

input : function to be optimized f, monotone non-increasing function h with properties as given in equation 3

1: r ← 1
2: while stop conditions not met do
3:     Add a new local search algorithm instance Ar, step it once and initialize its variables vr, xr and fr
4:     For i = 1, . . . , r, select algorithm Ai if there exists c > 0 so that: fi + c·h(vi) > fj + c·h(vj) for all j = 1, . . . , r such that (vi, fi) ≠ (vj, fj). If there are multiple algorithms with identical v and f, then select only one of them at random.
5:     Step each selected Ai and update its variables vi, xi and fi.
6:     Find the best instance: br = argmin_{i=1,...,r}(fi).
7:     If br ≠ br−1, step instance A_br until v_br ≥ v_{br−1}
8:     Update the best solution: x* ← x_br.
9:     r ← r + 1
10: return x*

can be added without taking any steps first and assigned a function value fr = 0, which is guaranteed not to be better than any of the function values of the other instances.

We are, however, dealing with a minimization problem with a known target value (see [Han+13b]) but no upper bound on f and, consequently, no worst possible value of f. Therefore, we made a small change and step the new instance Ar immediately after it is added. It can then also be stepped a second time, during step 4 of algorithms 5 and 4. We believe that this has no significant impact on performance.

3.1 Suggested modifications

MetaMax and MetaMax(∞) will add a new instance each round for as long as they are running, with no limit on the maximum number of instances. The authors of [GK11] state that the worst-case computational overhead of MetaMax and MetaMax(∞) is O(r²), where r is the number of rounds. For the purpose of optimizing functions where each function evaluation uses up a large amount of computational time (for which MetaMax was primarily designed), the overhead will be negligible compared to the time spent calculating function values and will not present a significant problem. However, in comparison with restart strategies, which typically have almost no overhead, this is still a disadvantage for MetaMax. Therefore, it would be desirable to come up with some mechanism that would improve its computational complexity.

An obvious solution would be to limit the total number of instances which can be added, or to slow down the rate at which they are added, so that there will never be too many of them. However, this would make MetaMax and MetaMax(∞) behave basically in the same way as MetaMax(k) and lose their main property, which is the


consistency based on always generating new instances.

A better solution would be to add a mechanism which would discard one of the already existing instances every time a new one is added, and therefore keep the total number of instances at any given time constant. The important question is: which one of the existing instances should be discarded?

We propose the following approach: discard the instance which has not been selected for the longest time. If there are multiple instances which qualify, discard the one with the worst function value. The rationale behind this discarding mechanism is that MetaMax most often selects (allocates the most resources to) those instances which have the best optimistic estimate of convergence. Therefore, the instances which are selected the least often will likely not give very good results in the future, and so they make good candidates for deletion. An alternative method would be to discard the absolute worst instance (in terms of the best objective function value found so far), which is even simpler, but we feel that it does not follow as naturally from the principles behind MetaMax. Therefore, for most of our experiments we will use the discarding of the least recently selected instances.
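The proposed discarding rule can be sketched as a small helper. The data layout (`last_selected` holding the round in which each instance was last selected) is hypothetical bookkeeping of our own; the minimization convention of the rest of the text is used, so the worst value is the largest one.

```python
def choose_discard(last_selected, best_value):
    """Sketch of the proposed discarding rule: drop the instance that has
    not been selected for the longest time, breaking ties by the worst
    (largest, for minimization) best-so-far objective value."""
    oldest = min(last_selected)
    candidates = [i for i, r in enumerate(last_selected) if r == oldest]
    return max(candidates, key=lambda i: best_value[i])

# Instances 1 and 2 were both last selected in round 1; instance 2 has
# the worse value and is discarded.
print(choose_discard([3, 1, 1, 4], [0.2, 0.5, 0.9, 0.1]))  # 2
```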

Another area where we think it might be beneficial to modify the workings of MetaMax is the mechanism of selecting instances to be stepped in each round. The original mechanism has two possible disadvantages. Firstly, it is not invariant to monotone transformations of the objective function values. By this we mean a mapping f(x) → f′(x) which is itself only a function of the value of f(x) and not of the parameter vector x. The monotone property means that if f(x1) < f(x2) then f′(x1) < f′(x2) for all possible x1 and x2. Such a monotone transformation will not change the location of the optima of f(x). It will also not change the direction of the gradient of f(x) for any x, but not necessarily its magnitude. An example of such a transformation is given in figure 2.

Logically, it would not make much sense to require an optimization algorithm to be invariant to an objective function value transformation which is not monotone, as such a transformation could change the position of the function's optima.

The second possible disadvantage of the convex hull based instance selection mechanism is that it also behaves differently based on the choice of the estimate function h. This is not as great a disadvantage as the first one, because f(x) is given, while h can be chosen freely. However, it would still be beneficial if we could entirely remove the need to choose h.

To overcome these problems, we propose a new instance selection mechanism. It uses the same representation of local search algorithm instances as a set of points P, given in equation 4, but it selects those instances which correspond to non-dominated points of P in the sense of maximizing fi and maximizing h(vi) (or, analogically, maximizing fi and minimizing vi). This method is clearly invariant both to monotone transformations of objective function values f → f′ and to different choices of h, as determining the non-dominated points depends only on their ordering along the axes fi and h(vi), which will always be preserved due to the fact that both f → f′ and h are monotone. Moreover, the points which lie on the upper right convex hull of P, and thus maximize the optimistic estimate fi + c·h(vi), are always non-dominated, and thus will always be selected.
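A minimal sketch of this non-dominated selection follows. It uses the maximization convention of this section (maximize f_i, minimize v_i); the quadratic scan is illustrative, not an efficient implementation.

```python
def select_nondominated(f, v):
    """Sketch of the proposed selection mechanism: keep every instance
    whose point (v_i, f_i) is non-dominated in the sense of maximizing
    f_i and minimizing v_i. Only orderings matter, so the result is
    invariant to monotone transforms of f and to the choice of h."""
    k = len(f)
    selected = []
    for i in range(k):
        dominated = any(
            f[j] >= f[i] and v[j] <= v[i] and (f[j] > f[i] or v[j] < v[i])
            for j in range(k)
        )
        if not dominated:
            selected.append(i)
    return selected

# Instance 3 is dominated: worse value and more evaluations than instance 2.
print(select_nondominated([1.0, 2.0, 1.5, 0.5], [10, 30, 20, 25]))  # [0, 1, 2]
```

Applying a monotone transform such as cubing the f values leaves the selected set unchanged, which is exactly the claimed invariance.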


Figure 2: Example of monotone transformation of f(x). Displays a 3D mesh plot of a Rastrigin-like function f(x) in the top left, the transformed function f(x)³ in the top right, and their respective contour plots on the bottom. It is clear that the shape of the contours is the same, but their heights are not.

A possible disadvantage of the proposed mechanism is that at each round it selects many more points than the original convex hull mechanism. This might result in selecting instances with a low convergence estimate too often, and not dedicating enough resources to the more promising ones. A visual comparison of the two selection mechanisms, and a demonstration of the influence of the choice of estimate function upon selection, are presented in figure 3.


Figure 3: MetaMax selection mechanisms. Compares the original selection mechanism based on finding the upper convex hull (left sub-figures) with the newly proposed mechanism based on selecting non-dominated points (right sub-figures). Also demonstrates the effects of a monotone transformation of the objective function values on the selection, with f(x) for the upper sub-figures and f(x)³ for those on the bottom. Selected points are marked as red diamonds connected by a red line; unselected points are marked as filled black circles.

4 Experimental setup

All of the experiments were conducted using the COCO (Comparing Continuous Optimizers) framework [Han13a], which is an open-source set of tools for systematic evaluation and comparison of real-parameter optimization strategies. It provides a set of 24 benchmark functions of different types, chosen to thoroughly test the limits and capabilities of optimization algorithms. Also included are tools for running experiments on these functions and for logging, processing and visualising the measured data. The library for running experiments is provided in versions for C, Java, R, Matlab and Python; the post-processing part of the framework is available for Python only. The benchmark functions are divided into 6 groups according to their properties. They are briefly described in table 1; for a detailed description, please see [Han+13a]. There are also multiple instances defined for each function, which are created by applying various transformations to the base formula.

We shall now briefly explain some of the function properties mentioned in table 1. As already mentioned, the terms unimodal and multimodal refer to functions with


Name    Functions   Description

separ   1-5         Separable functions
lcond   6-9         Functions with low or moderate conditionality
hcond   10-14       Unimodal functions with high conditionality
multi   15-19       Multimodal structured functions
mult2   20-24       Multimodal functions with weak global structure

Table 1: Benchmark function groups

a single optimum and multiple local optima, respectively.

Conditionality describes how much the function's gradient changes depending on direction. Simply put, functions with high conditionality (also called ill-conditioned functions) grow, at certain points, rapidly in some directions but slowly in others. This often means that the gradient points away from the local optimum, which presents a difficult problem for some local search algorithms. To give a more visual description, one can imagine that 3D graphs of two-dimensional ill-conditioned functions usually form sharp ridges, while those of well-conditioned functions form gentle round hills.

Separable functions have the following form: f(x1, x2, . . . , xd) = f(x1) + f(x2) + . . . + f(xd), which means that they can be minimized by minimizing d one-dimensional functions, where d is the number of dimensions of the separable function.
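The point of separability can be shown in a few lines: a sum of one-dimensional terms is minimized by minimizing each term independently. This sketch uses a coarse grid search per coordinate and an example function of our own; it is illustrative only, not one of the optimizers used in the experiments.

```python
def minimize_separable(terms, lo=-5.0, hi=5.0, steps=10001):
    """Minimize a separable function by handling each one-dimensional
    component independently. `terms` holds the d component functions;
    each is minimized by a simple grid search over [lo, hi]."""
    result = []
    for g in terms:
        grid = [lo + (hi - lo) * t / (steps - 1) for t in range(steps)]
        result.append(min(grid, key=g))
    return result

# f(x1, x2, x3) = (x1 - 0)^2 + (x2 - 1)^2 + (x3 - 2)^2
terms = [lambda x, i=i: (x - i) ** 2 for i in range(3)]
print(minimize_separable(terms))  # [0.0, 1.0, 2.0]
```

Instead of searching a d-dimensional space, the cost is d independent one-dimensional searches, which is why separable benchmark functions form the easiest group.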

In order to exhaustively evaluate the performance of the selected strategies, we decided to make the following series of measurements for each strategy:

1. Using four different local search algorithms: compass search, the Nelder-Mead method, BFGS and CMA-ES, in order to evaluate the effect of algorithm choice.

2. Using all of the 24 noiseless benchmark functions available in the COCO framework, to measure performance on a wide variety of different problems.

3. Using the following dimensionalities: d = 2, 3, 5, 10, 20, to see how much the performance is affected by the number of dimensions.

4. Using the first fifteen instances of each function. According to [Han+13b], this number is sufficient to provide statistically sound data.

The resource budget for minimizing a single function instance (a single trial) was set to 10^5·d, meaning 100000 times the number of dimensions of the instance.

The reasons for choosing the four local search algorithms are as follows. The compass search algorithm was chosen for its simplicity, in order to allow us to evaluate whether MetaMax can improve the performance of such a basic algorithm. The Nelder-Mead method was chosen as a more sophisticated representative of the group of pattern search algorithms than compass search. BFGS was selected as a typical line search method. Finally, CMA-ES is there to represent population based algorithms. It is also the most advanced of the four, and thus we expect that it will perform the best of the four selected algorithms. For a more detailed description of these algorithms, please see section A.


4.1 Used multi-start strategies

In this section, we describe the selected MetaMax and restart strategies which were evaluated using the methods described above. For convenience, we assigned a shorthand name to each strategy, so that we can write, for example, "csa-h-10d" instead of "objective function stagnation based restart strategy with history length 10d using the compass search algorithm", which is impractically verbose. The shorthand names have the following form: abbreviation of the used local search algorithm, dash, used multi-start strategy, dash, strategy parameters. A list of all used strategies and their shorthand names is given in table 3.

We chose two commonly used restart strategies to compare with MetaMax: a fixed restart strategy with a set number of resources allocated to each local search algorithm run, and a dynamic restart strategy with a restart condition based on objective function value stagnation.

The performance of these two strategies largely depends on the combination of the problem being solved and the strategy parameters. Therefore, we decided to use six fixed restart strategies and six dynamic function stagnation restart strategies with different parameters:

• Fixed restart strategies

Run lengths: nf = 100d, 200d, 500d, 1000d, 2000d, 5000d evaluations
Shorthand names: algorithm-f-nf

• Function value stagnation restart strategies

Function value history lengths: hf = 2d, 5d, 10d, 20d, 50d, 100d evaluations
Function value tolerance: tf = 10^-10
Shorthand names: algorithm-h-hf

Note that the parameters depend on the number of dimensions d of the measured function. This is consistent with the fact that the total resource budget of the strategy also depends on d, and that we can expect that for higher dimensionalities the used local search algorithms will need longer runs to converge.

The rationale behind choosing the used parameter values is the following. Using the function evaluation budget of 10^5·d, run lengths longer than 5000d would give us fewer than 20 restarts per trial. This would result in a very low chance of finding the global optimum on most of the benchmark functions, some of which can have up to 10^d optima. Also, it is probable that most local search algorithms would converge a long time before using up all 5000d function evaluations, and the rest of the allocated resources would then be essentially wasted on running an instance which cannot improve any more. Conversely, run lengths smaller than 100d are probably not long enough to allow most local search algorithm instances to converge, and so there would be little sense in using them.
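The arithmetic behind these bounds is a one-liner; the d factor cancels between budget and run length, so the number of runs per trial does not depend on dimensionality. A small sketch (our own helper, purely illustrative):

```python
# With a budget of 10^5*d evaluations, a run length of 5000d yields 20
# runs per trial while 100d would yield 1000 (later capped at 100 restarts).
def runs_per_trial(run_length_factor, budget_factor=100_000):
    return budget_factor // run_length_factor

print(runs_per_trial(5000), runs_per_trial(100))  # 20 1000
```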

The choice of the upper bound of the function value history length hf as 100d is based on a similar idea: for values greater than 100d, the restart condition would trigger too long after the local search algorithm has already converged, and so we would be needlessly wasting resources on it. The choice of the lower bound of hf depends on the used algorithm. For a restart strategy to function properly, hf has to


be greater than, or at least equal to, the number of function evaluations that the used local search algorithm uses during one step. The above stated value of hf = 2d is the minimal value for which the Nelder-Mead and BFGS algorithms work properly. For the other two algorithms, the minimal value is hf = 5d. We decided to base the function value history length on the number of used function evaluations, rather than on the number of taken steps, because this allows for a more direct comparison of the performance of the same strategy using two different algorithms.

Choosing the value of the function stagnation tolerance tf involved a little more guesswork. There is a target function value defined for all of the benchmark functions, which is equal to the function value at their global optimum f(x_opt) plus a tolerance value ftol = 10^-8. That is, the function instance is considered to be solved if we find some point x with f(x) ≤ f(x_opt) + ftol. We based our choice of the function stagnation tolerance parameter tf = 10^-10 on ftol. Setting the value of tf one hundred times lower than ftol should make it strict enough not to trigger the restart condition prematurely, while the local search algorithm is still converging, yet still large enough to reliably detect convergence.
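The stagnation test itself can be sketched as follows. The bookkeeping (a `history` list of best-so-far values, one entry per evaluation) is a hypothetical layout of our own, chosen to match the decision above to measure history length in function evaluations.

```python
def should_restart(history, hf, tf=1e-10):
    """Sketch of the stagnation-based restart condition (minimization):
    trigger a restart when the best-so-far objective value has improved
    by less than tf over the last hf function evaluations."""
    if len(history) <= hf:
        return False            # not enough history yet
    return history[-hf - 1] - history[-1] < tf

flat = [1.0] * 30                               # no improvement at all
falling = [10.0 - 0.1 * t for t in range(30)]   # still improving
print(should_restart(flat, 20), should_restart(falling, 20))  # True False
```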

The goal of using multiple strategies with different parameter values is to have, for each measured dimensionality, at least one fixed restart strategy and at least one objective function value stagnation based strategy that performs well on the set of all functions.

For easier comparison of the results of the fixed restart strategies, we represent them all together by choosing only the results of the best performing strategy for each dimensionality and collecting them into a "best of" collection of results, which we will refer to by the shorthand name algorithm-f-comb. This represents the results of running a fixed restart strategy which is able to choose the optimal run length (from the set of six used run lengths) based on the dimensionality of the function being solved. The results of the objective function value stagnation strategies are represented in an analogous way, under the name algorithm-h-comb.

Besides the already mentioned restart strategies, we decided to add four more, each based on a restart condition specific to one of the used local search algorithms. Shorthand names for these strategies are algorithm-special. They are described in table 2.

In order to save computing time, and as per the recommendation in [Han+13b], we used an additional termination criterion that halts the execution of a restart strategy after 100 restarts, even if the resource budget has not yet been exhausted and the solution has not been found. This does not impact the accuracy of the measurements, as 100 restarts is enough to provide a statistically significant amount of data, and the metrics which we use (see subsection 4.2) are not biased against results of runs which did not use up the entire resource budget. In fact, the fixed restart strategies f-100d, f-200d and f-500d always reach 100 restarts before they can fully exhaust their resource budgets.

The idea of using the original "pure" versions of the MetaMax and MetaMax(∞) algorithms, which keep adding local search algorithm instances without limit, proved to be impractical due to its excessive computational resource requirements (for the length of experiments that were planned). Therefore, we performed measurements using only the modified versions of MetaMax and MetaMax(∞) with the added


Algorithm        Description

Compass search   Restart when the variable a, which affects how far from the current solution the algorithm generates neighbour solutions, decreases below 10^-10. It naturally decreases as the algorithm converges, so checking its value makes for a good restart condition.

Nelder-Mead      We chose a condition similar to the one mentioned above. A restart is triggered when the distance between the two points of the simplex which are farthest apart from each other decreases below 10^-10. The rationale is similar as above: the simplex keeps growing smaller as the algorithm converges. It might be more mathematically proper to check the area (or volume, or hyper-volume, depending on the dimensionality) of the simplex, but we discarded this idea out of concern that it might be too computationally intensive.

BFGS             The restart condition is triggered if the norm of the gradient is smaller than 10^-10. Since the algorithm already uses information about the gradient, it makes sense to also use it for detecting convergence.

CMA-ES           The recommended settings for CMA-ES given in [Han11] suggest using 9 different restart conditions; here we use these recommended settings. Note that when using CMA-ES with the other restart strategies, we use only a single restart condition and the additional ones are disabled. In a sense, we are not using the algorithm to its full potential, but this allows for a more direct comparison with the other local search algorithms.

Table 2: Algorithm specic restart strategies

mechanism (described in subsection 3.1) for limiting the maximum number of instances. For all MetaMax strategies, we used the recommended form of the estimate function, h = e^{−v_i/v_t}. Measurements were performed using the following MetaMax strategies:

1. MetaMax(k), with k=20, k=50 and k=100. This gives the same total number of local search algorithm instances as when using fixed restart strategies with run lengths equal to 5000d, 2000d and 1000d respectively. This makes it possible to evaluate the degree to which the MetaMax mechanism of selecting the most promising instances improves performance over these corresponding restart strategies. The expectation is that the success rate for MetaMax(k) will not increase, because the number of instances, and thus the ability to explore the search space, stays the same. However, MetaMax(k) should converge faster than the fixed restart strategies, because it should be able to identify the best instances and allocate resources to them appropriately.

2. MetaMax and MetaMax(∞) with the maximum number of instances set to 100. This should allow us to assess the benefits of the mechanism of adding new instances (and deleting old ones), by comparing the results with MetaMax(k), which uses the same number of instances each round, but does not add or delete any. Here, we would expect an increase in success rate on multimodal functions, as the additional instances generated each round should allow the algorithms to explore the search space more thoroughly. However, the limit of 100 instances will possibly still not be enough to get a good success rate on multimodal problems with high dimensionality.

3. MetaMax and MetaMax(∞) with the maximum number of instances set to 50d. This should allow the algorithms to scale better with the number of dimensions and, hopefully, further improve their performance. The number 50d was chosen as a reasonable compromise between computation time and expected performance. We expect to get the best results here.

Shorthand names for the MetaMax variants were chosen as algorithm-k-X for MetaMax(k), algorithm-m-X for MetaMax and algorithm-i-X for MetaMax(∞), where X is the maximum allowed number of instances (or, equivalently, the value of k for MetaMax(k)).

Fixed restart strategies
f-100d    Run length = 100d evaluations
f-200d    Run length = 200d evaluations
f-500d    Run length = 500d evaluations
f-1000d   Run length = 1000d evaluations
f-2000d   Run length = 2000d evaluations
f-5000d   Run length = 5000d evaluations
f-comb    Combined fixed restart strategy

Function value stagnation restart strategies
h-2d      History length = 2d evaluations
h-5d      History length = 5d evaluations
h-10d     History length = 10d evaluations
h-20d     History length = 20d evaluations
h-50d     History length = 50d evaluations
h-100d    History length = 100d evaluations
h-comb    Combined function value stagnation restart strategy

Other restart strategies
special   Special restart strategy specific to each algorithm, see table 2

MetaMax variants
k-20      MetaMax(k) with k=20
k-50      MetaMax(k) with k=50
k-100     MetaMax(k) with k=100
k-50d     MetaMax(k) with k=50d
m-100     MetaMax with maximum number of instances = 100
m-50d     MetaMax with maximum number of instances = 50d
i-100     MetaMax(∞) with maximum number of instances = 100
i-50d     MetaMax(∞) with maximum number of instances = 50d

Table 3: Tested multi-start strategies


There are a number of additional interesting aspects of the MetaMax variants which would be worth testing and evaluating. For example:

• Comparison of MetaMax and MetaMax(∞) with the limit on the maximum number of instances and without it.

• Performance of different methods of discarding old instances.

• Influence of different choices of the estimate function on performance.

• Performance of our proposed alternative method for selecting instances.

However, it was not practically possible (mainly time-wise) to perform full-sized (10^5 d function evaluation budget) experiments which would test all of these features. Therefore, we decided to make a series of smaller measurements, with the maximum number of function evaluations per trial set to 5000d, using only dimensionalities d=5, d=10 and d=20, and using only the BFGS algorithm. This should allow us to test these features at least in a limited way and see if any of them warrant further attention. More specifically, we made the following series of measurements:

1. MetaMax and MetaMax(∞) without limit on maximum number of instances

2. MetaMax and MetaMax(∞) with maximum instance limits 5d, 10d and 20d, discarding the most inactive instances

3. MetaMax and MetaMax(∞) with maximum instance limits 5d, 10d and 20d, discarding the worst instances

4. MetaMax(k) with k=5d, k=10d and k=20d

These measurements were repeated three times: the first time using the recommended form of the estimate function h1(vi, vt) = e^(−vi/vt), the second time with a simplified function h(vi) = e^(vi), and the third time using the proposed alternate instance selection method, based on selecting non-dominated points.

4.2 Used metrics

In this section, we describe the various metrics that were used to compare the results of different strategies. The simplest one is the success rate. For a set of trials U (usually of one strategy running on one or more benchmark functions) and a chosen target value t, it can be defined as:

SR(U, t) = |{u ∈ U : fbest(u) ≤ t}| / |U|    (6)

where |U| is the number of trials and |{u ∈ U : fbest(u) ≤ t}| is the number of trials which have found a solution at least as good as t. In the rest of this text, we use a mean success rate, averaged over a set of target values T:

SRm(U, T) = (1/|T|) Σ_{t∈T} SR(U, t)    (7)
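For concreteness, equations 6 and 7 amount to only a few lines of Python (an illustrative sketch, not the thesis implementation; the function names are ours):

```python
def success_rate(fbest_values, t):
    """Eq. 6: fraction of trials whose best found value reached target t."""
    return sum(1 for f in fbest_values if f <= t) / len(fbest_values)

def mean_success_rate(fbest_values, targets):
    """Eq. 7: success rate averaged over a set of target values T."""
    return sum(success_rate(fbest_values, t) for t in targets) / len(targets)

# Three trials that reached best values 1e-9, 1e-4 and 5.0:
trials = [1e-9, 1e-4, 5.0]
rate = success_rate(trials, 1e-2)   # 2 of the 3 trials reached target 1e-2
mean = mean_success_rate(trials, [1e-8, 1e-2, 1e2])
```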


The main metric used in the COCO framework is the expected running time, or ERT. It estimates the expected number of function evaluations that a selected strategy will take to reach a target function value t for the first time, over a set of trials U. It is defined as:

ERT(U, t) = (1 / |{u ∈ U : fbest(u) ≤ t}|) Σ_{u∈U} evals(u, t)    (8)

where evals(u, t) is the number of function evaluations used by trial u to reach target t, or the total number of evaluations used by u if it never reached t. The expression |{u ∈ U : fbest(u) ≤ t}| is the number of successful trials for target t. If there were no such trials, then ERT(U, t) = ∞. In the rest of this text, we will use the ERT averaged over a set of target values T, in a similar way to what is described in equation 7. We will also usually compute it using a set of trials obtained by running the same strategy on multiple different functions, usually all functions in one of the function groups described in table 1.
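Equation 8 can be sketched the same way; this snippet represents each trial as a (fbest, evals) pair, where evals already follows the convention above (evaluations needed to reach t, or the trial's total budget when t was never reached) — a sketch under those assumptions, not the thesis code:

```python
import math

def ert(trials, t):
    """Eq. 8: expected running time to reach target t over a set of trials.

    trials: iterable of (fbest, evals) pairs. Returns infinity when no
    trial reached the target.
    """
    successful = sum(1 for fbest, _ in trials if fbest <= t)
    if successful == 0:
        return math.inf
    return sum(evals for _, evals in trials) / successful

# Two successful trials (200 and 400 evals) and one failed trial (1000 evals):
# ERT = (200 + 400 + 1000) / 2 = 800 evaluations.
```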

For comparing two or more strategies in terms of success rates and expected running times, we use graphs of the empirical cumulative distribution function of run lengths, or ECDF. Such a graph displays on the y-axis the percentage of trials for which the ERT (averaged over a set of target values T) is lower than the number of evaluations x, where x is the corresponding value on the x-axis. It can also be said that, for each x, it shows the expected average success rate if a function evaluation budget equal to x were used. For easier comparison of ECDF graphs across different dimensionalities, the values on the x-axis are divided by the number of dimensions. The function displayed in the graph can then be defined as:

y(x) = (1 / (d|T||U|)) Σ_{u∈U} |{t ∈ T : ERT(t, u) ≤ x}|    (9)

An example ECDF graph, like the ones that are used throughout the rest of the text, is given in figure 4. It shows the ERTs of two sets of trials measured by running two different strategies on the set of all benchmark functions, for d=10 and averaged over a set of 50 target values. The target values are logarithmically distributed in the interval <10^-8; 10^2>. We use this same set of target values in all our ECDF graphs.
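Such a target set is straightforward to generate; the sketch below assumes plain logarithmic spacing between the interval endpoints (how the thesis generated its targets exactly is not stated):

```python
# 50 target values logarithmically distributed in [1e-8, 1e2]:
targets = [10 ** (-8 + 10 * i / 49) for i in range(50)]
# targets[0] is 1e-8 and targets[-1] is 1e2
```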

The marker × denotes the median number of function evaluations of unsuccessful trials, divided by the number of dimensions. Values to the right of this marker are (mostly) estimated using bootstrapping (for details of the bootstrapping method, please refer to [Han+13b]). The fact that we use 15 trials for each strategy-function pair means that the estimate is reliable only up to about fifteen times the number of evaluations marked by ×. This should be kept in mind when evaluating the results. The thick orange line in the plot represents the best results obtained during the 2009 BBOB workshop for the same set of problems and is provided for reference.

Since we are dealing with a very large amount of measured results, it would be desirable to have a method of comparing them that is even more concise than ECDF graphs. To this end, we use a metric called the aggregate performance index (API), defined


[ECDF plot omitted: x-axis log10(evaluations)/D, y-axis proportion of trials; curves nm-k-100, bfgs-k-100 and best 2009; functions f1-24, 10-D]

Figure 4: Example ECDF graph. Comparison of the results of MetaMax(k), with k=100, using the BFGS and Nelder-Mead local search algorithms, on the set of all benchmark functions. The strategy using BFGS clearly outperforms the other one, both in terms of success rate and speed of convergence.

by Mr. Pošík in a yet unpublished (at the time of writing this text) article [Poš13]. It is based on the idea that the ECDF graph of the results of an ideal strategy, which solves the given problem instantly, would be a straight horizontal line across the top of the plot. Conversely, for the worst possible strategy imaginable, the graph would be a straight line along the bottom. It is apparent that the area above (or below) the graph makes for quite a natural measure of the effectiveness of different strategies. Given a set of ERTs A, their aggregate performance index can be computed as:

API(A) = exp( (1/|A|) Σ_{a∈A} log10 a )    (10)

For the purposes of computing the API, the ERTs of unsuccessful trials, which are by definition ∞, have to be replaced with a value that is higher than the ERT of any successful trial. The choice of this value determines how much the unsuccessful trials are penalized and thus affects the final API score. For our purposes, we chose the value 10^8 d.

Since we are computing the API from the area above the graph, the lower its value, the better the corresponding strategy performs. Using the API essentially allows us to represent the results of a set of trials by a single number and to easily compare the performances of different optimization strategies.
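Equation 10, combined with the penalty replacement for unsuccessful trials described above, can be sketched as follows (illustrative only; the function name and the representation of unsuccessful trials as infinite ERTs are our assumptions):

```python
import math

def aggregate_performance_index(erts, penalty):
    """Eq. 10: exp of the mean log10 of the ERTs; lower is better.

    Unsuccessful trials (ERT = infinity) are replaced by the fixed
    penalty value (e.g. 1e8 * d) before averaging.
    """
    capped = [penalty if math.isinf(e) else e for e in erts]
    return math.exp(sum(math.log10(e) for e in capped) / len(capped))
```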

4.3 Implementation details

The software side of this project was implemented mostly in Python, with parts in C. The original plan was to write the project purely in Python, which was chosen because of its ease of use and the availability of many open-source scientific and mathematical libraries. However, during the project it was found that pure Python code performs too slowly and would not allow us to make all the necessary measurements. Therefore, parts of the program had to be rewritten in C, which improved performance to a reasonable level.

The implementations of the BFGS and Nelder-Mead algorithms are based on code from the open-source SciPy library. They were modified to allow running the algorithms in single steps, which is necessary for them to work with MetaMax. An open-source implementation of CMA-ES was used, available at [Han13b]. The implementation of MetaMax was written based on its description in [GK11]. It was, however, necessary to make several small changes to it, mainly because it is designed with a maximization task in mind, but we needed to use it for minimization problems. For finding upper convex hulls, we used Andrew's algorithm with some additional pre- and post-processing to get the exact behaviour described in [GK11].
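For illustration, the upper-hull part of Andrew's monotone chain algorithm can be written as follows (a minimal sketch without the additional pre- and post-processing mentioned above):

```python
def upper_convex_hull(points):
    """Return the upper convex hull of a set of 2-D points.

    Points are (x, y) tuples; the hull is returned ordered by increasing x.
    """
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Pop the last hull point while it lies on or below the line from
        # hull[-2] to the new point (i.e. it is not part of the upper hull).
        while len(hull) >= 2:
            (ax, ay), (bx, by) = hull[-2], hull[-1]
            cross = (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```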

For a description of the source code, please see the file source/readme.txt on the attached CD.

5 Results

In this section we will evaluate the results of the selected multi-start strategies. We decided to split the results into four subsections based on the used local search algorithm. We present and compare the results mainly using tables, which list APIs and success rates for different groups of functions and different dimensionalities. For convenience, the best results are highlighted in bold text. We also show ECDF graphs to illustrate particularly interesting results. Results of the smaller experiments described at the end of section 4.1, and results of the timing measurements, are summarized in subsection 5.5.

The values of success rates and APIs shown in this section are computed only using data bootstrapped up to 10^5 d function evaluations. In our opinion, these values represent the real performance of the selected strategies better than if we were to use fully bootstrapped data, which are estimated to a large degree and therefore not as statistically reliable. In ECDF graphs, bootstrapped results are shown up to 10^7 d evaluations. All of the APIs and success rates are averaged over a set of multiple targets, as described in subsection 4.2.

The measured data are provided in their entirety on the attached CD (see section B) in the form of tarballed Python pickle files, which can be processed using the BBOB post-processing framework. It was not possible to provide the data in their original form, as text files, because their total size would be in the order of gigabytes, which would clearly not fit on the attached medium.

5.1 Compass search

Table 4 summarizes which of the used fixed restart and function value stagnation restart strategies were best for each dimensionality and were chosen for the best-of result collections cs-f-comb and cs-h-comb. Table 5 then compares these two sets of results together with the results obtained by the compass search specific restart strategy cs-special.

It is apparent that, for the best strategies, the values of run length and function value history length increase with the number of dimensions. This is not unexpected, as compass search uses 2d or 2d−1 function evaluations at each step.

Dimensionality Fixed Stagnation based

d=2 cs-f-100d cs-h-5d

d=3 cs-f-100d cs-h-5d

d=5 cs-f-200d cs-h-10d

d=10 cs-f-500d cs-h-10d

d=20 cs-f-500d cs-h-20d

Table 4: Compass search - best restart strategies for each dimensionality

The comparison of the best restart strategies suggests that all of them have quite similar overall performance, with cs-h-comb being a little better than the others in terms of success rate and cs-f-comb in terms of API. In the subsequent tables, we will provide the results of cs-f-comb for reference, as an example of a well-tuned restart strategy.

None of the strategies performs very well on multimodal and highly conditioned functions. This is to be expected, as the compass search algorithm is known to have trouble with ill-conditioned problems, and multimodal problems are difficult to solve for any algorithm.

log10 API                     Success rate [%]
2D    3D    5D    10D   20D   2D   3D   5D   10D  20D

all

cs-f-comb 3.75 4.46 5.52 6.31 6.90 85 74 53 41 34
cs-h-comb 3.82 4.69 5.62 6.28 6.86 85 74 56 44 39

cs-special 4.16 4.92 5.80 6.39 6.91 78 68 48 43 37

separ cs-f-comb 2.58 3.03 4.33 4.93 5.50 100 100 69 66 62

cs-h-comb 2.69 3.11 4.03 4.86 5.43 100 100 84 66 63

cs-special 2.63 3.09 4.32 4.95 5.43 100 100 69 66 63

lcond cs-f-comb 3.84 4.18 5.34 6.35 7.44 84 84 63 50 26

cs-h-comb 3.49 4.22 5.15 5.97 6.99 100 98 84 72 51

cs-special 3.63 4.18 5.38 6.27 7.26 100 100 72 64 40

hcond cs-f-comb 4.19 5.38 5.99 6.63 7.03 82 52 43 35 33

cs-h-comb 5.40 6.07 6.50 6.91 7.31 47 35 29 28 26
cs-special 5.83 6.20 6.45 6.89 7.30 37 32 31 28 26

multi cs-f-comb 4.29 5.17 6.59 7.34 7.80 80 63 26 14 10

cs-h-comb 4.10 5.21 6.95 7.53 7.98 79 66 18 11 7
cs-special 4.71 6.08 6.99 7.61 7.96 72 44 17 10 8

mult2 cs-f-comb 3.86 4.48 5.31 6.30 6.85 80 74 66 42 36

cs-h-comb 3.36 4.74 5.38 6.08 6.63 100 76 70 52 50

cs-special 3.89 4.89 5.76 6.20 6.65 85 72 55 51 50

Table 5: Compass search - results of restart strategies


A comparison of the results of the three MetaMax(k) strategies with corresponding fixed restart strategies which use the same total number of local search algorithm instances is given in table 6. The results confirm our expectations and show that, overall, MetaMax(k) converges faster than a comparable fixed restart strategy, the only exception being the group separ. This can be explained by the fact that functions from this group are very simple and can generally be solved by a single, or only very few, runs of the local search algorithm. In this case, the MetaMax mechanism of selecting multiple instances each round is more of a hindrance than a benefit.

In terms of success rate, MetaMax(k) is always as good as or even better than the comparable fixed restart strategy, with the improvement being especially obvious on the groups lcond and mult2.

Of the three tested variants of MetaMax(k), cs-k-100 is the best overall. However, it is not better than a well-tuned restart strategy like cs-f-comb.

Figure 5 shows a behaviour which was observed across all function groups and dimensionalities when comparing MetaMax(k) with corresponding fixed restart strategies: at first, MetaMax(k) converges much more slowly than the restart strategy, as it is still in the phase of initialising all of its instances. However, as soon as this is finished, it starts converging quickly and overtakes the restart strategy for a certain interval. After that, its rate of convergence slows down again and it ends up with a success rate (for 10^5 d function evaluations) similar to that of the restart strategy. This effect seems to get less pronounced with an increasing number of dimensions.

[ECDF plot omitted: x-axis log10(evaluations)/D, y-axis proportion of trials; curves cs-k-50, cs-f-2000d and best 2009; functions f1-24, 5-D]

Figure 5: Compass search - ECDF comparing MetaMax(k) with an equivalent fixed restart strategy

Results of comparing cs-k-100, cs-m-100 and cs-i-100 are shown in table 7. It is apparent that, using the same number of instances at a time, MetaMax and MetaMax(∞) clearly outperform MetaMax(k) on all function groups, both in terms of speed of convergence and success rates.

In general, they also provide results at least as good as, or better than, the best


log10 API                     Success rate [%]
2D    3D    5D    10D   20D   2D   3D   5D   10D  20D

all

cs-f-1000d 4.60 5.20 5.85 6.36 7.01 74 61 46 41 30
cs-f-2000d 4.82 5.45 5.94 6.46 6.96 72 55 45 39 35

cs-f-5000d 5.18 5.71 6.08 6.59 7.13 67 48 43 36 28
cs-k-20 4.55 5.30 5.87 6.42 6.99 67 52 45 41 35

cs-k-50 4.16 5.01 5.75 6.37 6.98 74 58 47 42 35

cs-k-100 4.00 4.84 5.62 6.29 6.97 75 62 49 43 35

cs-f-comb 3.75 4.46 5.52 6.31 6.90 85 74 53 41 34

separ

cs-f-1000d 3.28 3.72 4.58 5.01 5.44 100 100 68 65 63

cs-f-2000d 3.54 4.22 4.61 5.08 5.49 100 84 68 65 63

cs-f-5000d 4.08 4.74 4.86 5.18 5.51 100 68 67 65 62
cs-k-20 3.38 4.16 4.62 5.21 5.75 84 68 67 65 62
cs-k-50 2.82 3.88 4.63 5.24 5.81 100 84 68 65 63

cs-k-100 2.68 3.85 4.71 5.29 5.80 100 84 68 66 63

cs-f-comb 2.58 3.03 4.33 4.93 5.50 100 100 69 66 62

lcond

cs-f-1000d 4.56 5.22 5.75 6.33 7.42 80 60 54 50 27
cs-f-2000d 4.75 5.37 5.84 6.45 7.36 78 56 52 44 31
cs-f-5000d 4.83 5.38 5.96 6.53 7.48 82 57 50 46 26
cs-k-20 4.19 5.12 5.70 6.27 7.11 86 58 55 50 41

cs-k-50 3.83 4.98 5.58 6.28 7.22 88 63 57 50 37
cs-k-100 3.85 4.54 5.45 6.28 7.29 88 78 62 52 34
cs-f-comb 3.84 4.18 5.34 6.35 7.44 84 84 63 50 26

hcond

cs-f-1000d 5.32 5.92 6.29 6.66 7.14 55 40 37 35 30
cs-f-2000d 5.56 6.06 6.39 6.89 7.23 52 37 33 27 28
cs-f-5000d 5.97 6.21 6.52 6.93 7.32 40 35 30 27 26
cs-k-20 5.32 5.95 6.35 6.83 7.27 47 36 33 29 28
cs-k-50 5.14 5.77 6.23 6.79 7.21 52 42 36 31 31
cs-k-100 5.03 5.67 6.13 6.70 7.17 56 45 40 34 33

cs-f-comb 4.19 5.38 5.99 6.63 7.03 82 52 43 35 33

multi

cs-f-1000d 5.26 6.18 6.81 7.38 7.82 62 35 22 14 10

cs-f-2000d 5.44 6.34 6.87 7.44 7.85 59 31 21 13 10

cs-f-5000d 5.94 6.51 7.00 7.48 7.91 45 26 17 12 9
cs-k-20 5.27 6.15 6.76 7.34 7.76 50 27 18 12 9
cs-k-50 4.80 5.93 6.62 7.26 7.74 62 31 21 13 10

cs-k-100 4.58 5.81 6.47 7.23 7.70 62 34 25 14 10

cs-f-comb 4.29 5.17 6.59 7.34 7.80 80 63 26 14 10

mult2

cs-f-1000d 4.59 4.99 5.80 6.43 7.40 72 70 52 44 18
cs-f-2000d 4.79 5.23 5.94 6.42 6.94 70 69 52 47 41

cs-f-5000d 5.00 5.66 6.05 6.80 7.48 70 54 52 34 18
cs-k-20 4.51 5.09 5.88 6.40 7.06 70 69 52 50 34
cs-k-50 4.12 4.48 5.66 6.24 6.96 72 70 54 50 34
cs-k-100 3.82 4.25 5.30 5.94 6.94 73 70 54 51 34
cs-f-comb 3.86 4.48 5.31 6.30 6.85 80 74 66 42 36

Table 6: Compass search - results of MetaMax(k) and corresponding xed restartstrategies


restart strategies. There is almost no difference between the performance of cs-m-100 and cs-i-100, which corresponds with the results presented in [GK11]. Differences in performance seem to diminish with increasing dimensionality and, for d=10 and d=20, all of the MetaMax strategies which use 100 instances perform almost the same.

The ECDF graph in figure 6 shows an interesting behaviour, where cs-m-100 and cs-i-100 start converging right away and overtake cs-k-100 while it is still in the process of initializing all of its instances. After that, MetaMax(k) catches up, and for a certain interval all of the strategies perform the same. Then MetaMax(k) stops converging, the other two strategies overtake it again and ultimately achieve better success rates. The sudden stop in MetaMax(k) convergence presumably happens when all of its best instances have already found their local optima, after which there is no possibility of finding better solutions without adding new instances, which MetaMax(k) cannot do.

[ECDF plot omitted: x-axis log10(evaluations)/D, y-axis proportion of trials; curves cs-k-100, cs-i-100, cs-m-100 and best 2009; functions f1-24, 5-D]

Figure 6: Compass search - ECDF of MetaMax variants using 100 instances

In the next set of measurements, using cs-m-50d, cs-i-50d and cs-k-50d, it became apparent that the increased limit on the maximum number of instances does not cause any noticeable increase in performance for MetaMax and MetaMax(∞). The performance of MetaMax(k) was somewhat improved, but overall it is still worse than the other two MetaMax variants and slightly worse than the best restart strategies. These results are also presented in table 7.

The ECDF graph in figure 7 shows the results of cs-k-50d and cs-m-50d compared with the collection of best fixed restart strategy results, cs-f-comb. We have omitted cs-i-50d, as its performance is very similar to that of cs-m-50d.

In conclusion, we can say that, using the compass search algorithm, MetaMax and MetaMax(∞) perform better than even well-tuned restart strategies, and that increasing the maximum number of allowed instances does not have any significant effect on their performance.


log10 API                     Success rate [%]
2D    3D    5D    10D   20D   2D   3D   5D   10D  20D

all

cs-k-100 4.00 4.84 5.62 6.29 6.97 75 62 49 43 35
cs-m-100 3.33 4.19 5.21 6.15 6.88 92 78 61 46 39

cs-i-100 3.34 4.15 5.14 6.16 6.87 91 79 64 46 39

cs-k-50d 3.97 4.73 5.64 6.29 7.01 76 66 50 45 37
cs-m-50d 3.35 4.14 5.22 6.15 6.88 91 79 61 46 38
cs-i-50d 3.35 4.16 5.23 6.15 6.88 91 79 61 46 39

cs-f-comb 3.75 4.46 5.52 6.31 6.90 85 74 53 41 34

separ

cs-k-100 2.68 3.85 4.71 5.29 5.80 100 84 68 66 63
cs-m-100 2.44 3.01 4.19 5.22 5.85 100 100 84 66 63
cs-i-100 2.42 3.04 3.83 5.27 5.78 100 100 100 66 63
cs-k-50d 2.67 3.54 4.79 5.49 6.14 100 100 69 66 64

cs-m-50d 2.45 2.99 4.16 5.23 5.82 100 100 84 66 63
cs-i-50d 2.45 3.03 4.21 5.20 5.87 100 100 84 66 64

cs-f-comb 2.58 3.03 4.33 4.93 5.50 100 100 69 66 62

lcond

cs-k-100 3.85 4.54 5.45 6.28 7.29 88 78 62 52 34
cs-m-100 3.14 3.88 5.09 6.14 7.32 100 92 70 55 33
cs-i-100 3.18 3.85 5.11 6.10 7.32 99 91 70 56 34
cs-k-50d 4.05 4.46 5.48 6.19 7.48 84 80 62 57 28
cs-m-50d 3.21 3.82 5.17 6.09 7.33 100 90 68 56 32
cs-i-50d 3.20 3.90 5.17 6.06 7.29 98 91 68 57 35

cs-f-comb 3.84 4.18 5.34 6.35 7.44 84 84 63 50 26

hcond

cs-k-100 5.03 5.67 6.13 6.70 7.17 56 45 40 34 33
cs-m-100 4.14 5.22 5.84 6.53 7.12 80 57 48 40 35

cs-i-100 4.22 5.16 5.89 6.53 7.18 77 58 47 40 33
cs-k-50d 5.00 5.62 6.14 6.67 7.29 56 46 41 38 32
cs-m-50d 4.16 5.21 5.88 6.53 7.17 79 58 47 40 33
cs-i-50d 4.24 5.16 5.88 6.58 7.15 77 60 47 38 34
cs-f-comb 4.19 5.38 5.99 6.63 7.03 82 52 43 35 33

multi

cs-k-100 4.58 5.81 6.47 7.23 7.70 62 34 25 14 10
cs-m-100 3.61 4.70 6.13 7.09 7.68 90 68 35 19 12

cs-i-100 3.58 4.57 6.16 7.08 7.69 89 74 34 19 12

cs-k-50d 4.32 5.70 6.44 7.17 7.68 68 37 26 16 11
cs-m-50d 3.59 4.62 6.12 7.08 7.68 89 74 36 20 12

cs-i-50d 3.58 4.59 6.14 7.09 7.68 90 74 34 19 12

cs-f-comb 4.29 5.17 6.59 7.34 7.80 80 63 26 14 10

mult2

cs-k-100 3.82 4.25 5.30 5.94 6.94 73 70 54 51 34
cs-m-100 3.30 4.09 4.76 5.78 6.51 90 74 71 53 50

cs-i-100 3.28 4.08 4.73 5.80 6.47 90 74 71 52 50

cs-k-50d 3.84 4.25 5.29 5.94 6.53 72 72 55 52 50

cs-m-50d 3.30 4.02 4.77 5.77 6.49 90 74 71 52 50

cs-i-50d 3.27 4.05 4.75 5.80 6.49 90 75 71 52 50

cs-f-comb 3.86 4.48 5.31 6.30 6.85 80 74 66 42 36

Table 7: Compass search - results of MetaMax strategies


[ECDF plot omitted: x-axis log10(evaluations)/D, y-axis proportion of trials; curves cs-f-comb, cs-k-50d, cs-m-50d and best 2009; functions f1-24, 10-D]

Figure 7: Compass search - ECDF of MetaMax variants using 50d instances

5.2 Nelder-Mead method

The best restart strategies for each dimensionality are listed in table 8 and their results are compared in table 9.

For the fixed restart strategies we see the expected behaviour, where run lengths of the best strategies increase with the number of dimensions. However, there seem to be only two best objective function stagnation based strategies: nm-h-10d and nm-h-100d. Interestingly enough, the switch between them occurs between d=5 and d=10, which is also the point where the overall performance of the Nelder-Mead algorithm decreases dramatically.

Dimensionality Fixed Stagnation based

d=2 nm-f-100d nm-h-10d

d=3 nm-f-100d nm-h-10d

d=5 nm-f-500d nm-h-10d

d=10 nm-f-1000d nm-h-100d

d=20 nm-f-5000d nm-h-100d

Table 8: Nelder-Mead - best restart strategies for each dimensionality

The algorithm performs very well for a low number of dimensions (d=2, d=3 and, to some extent, d=5), with results for these dimensionalities approaching those of the best algorithms from the 2009 BBOB conference. On the other hand, the performance for higher dimensionalities is very poor, especially on the group hcond. The three best-of restart strategies, compared in table 9, are all quite evenly matched, with nm-special being the best overall by a small margin and nm-f-comb being the worst.

The comparison of MetaMax(k) with corresponding fixed restart strategies, given in table 10, shows that MetaMax(k) performs better on multimodal functions and


log10 API                     Success rate [%]
2D    3D    5D    10D   20D   2D   3D   5D   10D  20D

all

nm-f-comb 3.03 3.92 4.97 6.31 7.71 92 81 62 45 18
nm-h-comb 2.91 3.89 4.71 6.19 7.61 92 76 67 45 19
nm-special 2.95 3.88 4.84 6.04 7.49 92 79 63 51 24

separ nm-f-comb 2.55 3.22 4.47 6.02 7.07 100 100 64 44 40

nm-h-comb 2.51 3.88 4.38 5.99 6.80 100 68 66 44 40
nm-special 2.46 3.55 4.50 5.68 6.83 100 84 66 55 41

lcond nm-f-comb 2.54 3.49 4.35 5.95 8.16 100 82 78 52 6

nm-h-comb 2.35 3.33 4.07 5.80 8.19 100 84 81 56 4
nm-special 2.45 3.31 4.14 5.87 8.06 100 86 80 56 10

hcond nm-f-comb 1.95 2.86 3.24 5.65 7.79 100 100 100 68 18

nm-h-comb 2.07 2.50 3.15 5.40 7.71 100 100 100 68 20
nm-special 1.96 2.48 3.12 5.01 7.20 100 100 100 82 41

multi nm-f-comb 4.59 6.07 7.07 7.62 8.04 74 41 16 10 7

nm-h-comb 4.27 5.82 6.89 7.55 7.99 76 46 20 11 7

nm-special 4.52 6.05 6.96 7.57 8.02 78 41 18 10 7

mult2 nm-f-comb 3.42 3.85 5.58 6.24 7.60 85 84 55 51 16

nm-h-comb 3.22 3.83 4.95 6.13 7.47 86 84 72 51 20

nm-special 3.26 3.92 5.32 6.02 7.46 86 84 56 52 20

Table 9: Nelder-Mead - results of restart strategies

worse on the other function groups.

It is also apparent that increasing the number of used instances for MetaMax(k) leads to a higher overall success rate and faster convergence on multimodal problems, but slower convergence on ill-conditioned functions, as is apparent from the ECDF graph in figure 8.

[ECDF plot omitted: x-axis log10(evaluations)/D, y-axis proportion of trials; curves nm-k-100, nm-k-50, nm-k-20 and best 2009; functions f10-14, 10-D]

Figure 8: Nelder-Mead - ECDF comparing MetaMax(k) strategies

In fact, the performance of the tested MetaMax(k) strategies on ill-conditioned functions is worse than that of the corresponding restart strategies. This is the opposite of what was observed when using the compass search algorithm and can be explained by the fact that the Nelder-Mead algorithm, unlike compass search, can handle ill-conditioned problems very quickly and with a high success rate (at least for low dimensionalities). Therefore, there is no need for the MetaMax mechanism of selecting multiple instances each round, as almost any instance is capable of finding the global optimum. Selecting more than one at a time only serves to decrease the rate of convergence.

Overall, the three tested MetaMax(k) strategies perform only slightly better than the corresponding fixed restart strategies and are clearly worse than the best restart strategies, such as nm-special.

Table 11 shows the results of the other tested MetaMax strategies. Unfortunately, measurements for all dimensionalities were not finished in time before the deadline of this thesis; therefore, table 11 contains only partial results for some strategies.

For the dimensionalities where the results of all the strategies are available, it is apparent that nm-m-100 and nm-i-100 outperform nm-k-100, both in terms of success rates and API. There are no significant differences in performance between MetaMax variants using 100 and 50d local search algorithm instances, and no observable differences between the performance of MetaMax and MetaMax(∞).

In comparison with the restart strategy nm-special, MetaMax and MetaMax(∞) have better success rates on the function groups separ, multi and mult2 and, as a result, a better overall success rate.

MetaMax and MetaMax(∞) also converge faster on multi and mult2, but are slower on lcond and hcond. The overall result is that they are better than the best restart strategy, in terms of API, for d=2 and d=3, but worse for d=5. Unfortunately, we cannot make comparisons for higher dimensionalities, where the results for MetaMax and MetaMax(∞) are not available. However, based on the fact that the advantage in performance of MetaMax over the restart strategy is smaller in d=3 than in d=2, and that the restart strategy is better in d=5, we can extrapolate that MetaMax would likely also perform worse for higher dimensionalities. Even if there were an improvement in performance, the fact remains that the Nelder-Mead method performs so poorly in higher dimensionalities that it is unlikely that MetaMax could improve it to a practical level.


log10 API                     Success rate [%]
2D    3D    5D    10D   20D   2D   3D   5D   10D  20D

all

nm-f-1000d 3.64 4.27 5.07 6.31 7.76 81 70 61 45 14
nm-f-2000d 3.82 4.38 5.14 6.33 7.71 75 69 61 45 17
nm-f-5000d 4.03 4.52 5.40 6.52 7.71 71 65 56 42 18
nm-k-20 3.96 4.56 5.26 6.54 7.65 72 65 60 38 17
nm-k-50 3.76 4.52 5.22 6.40 7.68 81 66 61 45 17
nm-k-100 3.58 4.38 5.23 6.45 7.75 85 71 62 44 15
nm-special 2.95 3.88 4.84 6.04 7.49 92 79 63 51 24

separ

nm-f-1000d 3.24 3.95 4.59 6.02 7.27 84 67 64 44 29
nm-f-2000d 3.59 4.02 4.56 6.01 7.04 69 66 63 46 40
nm-f-5000d 3.65 4.07 4.74 6.08 7.07 68 65 62 49 40
nm-k-20 3.85 4.32 4.86 6.05 6.91 69 65 62 46 40
nm-k-50 3.61 4.35 4.93 6.14 7.03 84 66 63 45 40
nm-k-100 3.21 4.38 4.98 6.24 7.30 100 68 64 43 31
nm-special 2.46 3.55 4.50 5.68 6.83 100 84 66 55 41

lcond

nm-f-1000d 3.01 3.49 4.44 5.95 8.24 84 80 77 52 2
nm-f-2000d 3.05 3.50 4.60 5.94 8.20 82 80 77 53 4
nm-f-5000d 3.13 3.60 4.86 6.15 8.16 80 77 76 52 6
nm-k-20 3.46 3.98 4.70 6.25 8.20 80 78 76 52 4
nm-k-50 3.40 4.03 4.77 6.34 8.19 84 80 77 54 4
nm-k-100 3.44 4.00 4.81 6.42 8.21 84 80 78 53 4
nm-special 2.45 3.31 4.14 5.87 8.06 100 86 80 56 10

hcond

nm-f-1000d 1.92 2.47 3.26 5.65 7.99 100 100 100 68 9
nm-f-2000d 1.99 2.46 3.30 5.59 7.87 100 100 100 68 14
nm-f-5000d 2.00 2.44 3.51 5.74 7.79 100 100 100 69 18
nm-k-20 2.59 3.10 3.77 5.77 7.87 100 100 100 68 14
nm-k-50 2.74 3.24 3.93 5.84 7.93 100 100 100 68 12
nm-k-100 2.85 3.33 4.02 5.96 8.00 100 100 100 67 9
nm-special 1.96 2.48 3.12 5.01 7.20 100 100 100 82 41

multi

nm-f-1000d 5.77 6.63 7.14 7.62 7.98 54 24 16 10 8

nm-f-2000d 6.11 6.78 7.21 7.64 8.02 40 20 14 10 7
nm-f-5000d 6.56 6.91 7.32 7.69 8.04 23 18 11 8 7
nm-k-20 6.00 6.59 7.00 7.50 7.92 28 17 13 9 7
nm-k-50 5.36 6.40 6.93 7.46 7.88 52 21 14 10 7
nm-k-100 5.15 6.25 6.88 7.43 7.87 55 25 16 10 8

nm-special 4.52 6.05 6.96 7.57 8.02 78 41 18 10 7

mult2

nm-f-1000d 4.12 4.68 5.81 6.24 7.42 85 83 53 51 22

nm-f-2000d 4.24 4.95 5.93 6.41 7.52 84 83 53 50 18
nm-f-5000d 4.62 5.42 6.46 6.87 7.60 84 68 36 34 16
nm-k-20 3.82 4.70 5.84 7.07 7.47 84 68 52 18 18
nm-k-50 3.61 4.50 5.46 6.19 7.48 84 68 53 50 18
nm-k-100 3.22 3.86 5.36 6.18 7.45 84 83 55 51 20
nm-special 3.26 3.92 5.32 6.02 7.46 86 84 56 52 20

Table 10: Nelder-Mead - results of MetaMax(k) and corresponding xed restartstrategies


log10 API                     Success rate [%]
2D    3D    5D    10D   20D   2D   3D   5D   10D  20D

all

nm-k-100 3.58 4.38 5.23 6.45 7.75 85 71 62 44 15
nm-m-100 2.89 3.76 4.96 - - 96 89 69 - -
nm-i-100 2.94 3.76 4.95 - - 96 89 69 - -
nm-k-50d 3.58 4.42 5.23 6.55 7.83 85 71 62 43 13
nm-m-50d 2.85 3.74 4.93 - - 96 89 69 - -
nm-i-50d 2.88 3.76 - - - 98 89 - - -
nm-special 2.95 3.88 4.84 6.04 7.49 92 79 63 51 24

separ

nm-k-100 3.21 4.38 4.98 6.24 7.30 100 68 64 43 31
nm-m-100 2.62 3.39 4.82 - - 100 100 67 - -
nm-i-100 2.65 3.36 4.81 - - 100 100 66 - -
nm-k-50d 3.18 4.42 5.05 6.33 7.52 100 68 64 44 28
nm-m-50d 2.55 3.38 4.79 - - 100 100 66 - -
nm-i-50d 2.61 3.40 - - - 100 100 - - -
nm-special 2.46 3.55 4.50 5.68 6.83 100 84 66 55 41

lcond

nm-k-100 3.44 4.00 4.81 6.42 8.21 84 80 78 53 4
nm-m-100 2.74 3.73 4.71 - - 100 86 80 - -
nm-i-100 2.76 3.80 4.70 - - 100 85 80 - -
nm-k-50d 3.40 4.07 4.89 6.52 8.26 86 80 78 55 2
nm-m-50d 2.65 3.73 4.67 - - 100 85 80 - -
nm-i-50d 2.76 3.80 - - - 100 86 - - -
nm-special 2.45 3.31 4.14 5.87 8.06 100 86 80 56 10

hcond

nm-k-100 2.85 3.33 4.02 5.96 8.00 100 100 100 67 9
nm-m-100 2.54 3.15 4.00 - - 100 100 100 - -
nm-i-100 2.63 3.21 4.02 - - 100 100 100 - -
nm-k-50d 2.84 3.39 4.11 6.35 8.06 100 100 100 56 7
nm-m-50d 2.52 3.13 4.03 - - 100 100 100 - -
nm-i-50d 2.64 3.22 - - - 100 100 - - -
nm-special 1.96 2.48 3.12 5.01 7.20 100 100 100 82 41

multi

nm-k-100 5.15 6.25 6.88 7.43 7.87 55 25 16 10 8
nm-m-100 3.88 4.96 6.52 - - 82 72 26 - -
nm-i-100 3.90 4.91 6.51 - - 82 73 25 - -
nm-k-50d 5.19 6.26 6.77 7.37 7.82 55 24 18 12 9

nm-m-50d 3.84 4.92 6.50 - - 82 73 26 - -
nm-i-50d 3.67 4.89 - - - 90 73 - - -
nm-special 4.52 6.05 6.96 7.57 8.02 78 41 18 10 7

mult2

nm-k-100 3.22 3.86 5.36 6.18 7.45 84 83 55 51 20

nm-m-100 2.64 3.55 4.70 - - 100 85 73 - -
nm-i-100 2.72 3.51 4.68 - - 100 85 74 - -
nm-k-50d 3.25 3.88 5.26 6.19 7.60 84 83 55 51 16
nm-m-50d 2.66 3.53 4.60 - - 100 85 76 - -
nm-i-50d 2.70 3.50 - - - 100 86 - - -
nm-special 3.26 3.92 5.32 6.02 7.46 86 84 56 52 20

Table 11: Nelder-Mead - results of MetaMax strategies


5.3 BFGS

Results of the restart strategies bfgs-f-comb, bfgs-h-comb and bfgs-special are shown in table 13. The best fixed and objective function stagnation based restart strategies for each dimensionality, which were used to construct bfgs-f-comb and bfgs-h-comb, are listed in table 12.

Dimensionality   Fixed          Stagnation based
d=2              bfgs-f-100d    bfgs-h-2d
d=3              bfgs-f-100d    bfgs-h-2d
d=5              bfgs-f-200d    bfgs-h-2d
d=10             bfgs-f-1000d   bfgs-h-2d
d=20             bfgs-f-1000d   bfgs-h-2d

Table 12: BFGS - best restart strategies for each dimensionality

For the selected best fixed restart strategies, we see the ordinary behaviour where run lengths increase with dimensionality. However, among the stagnation based restart strategies, bfgs-h-2d is apparently the best for all dimensionalities. This is quite unusual but, in hindsight, not entirely unexpected. It has to do with the way our implementation of BFGS works: at the beginning of each step, the algorithm estimates the gradient of the objective function using the finite difference method. This involves evaluating the objective function at a set of neighbouring solutions, which are very close to the current solution. The number of these neighbour solutions is always 2d - one for each vector in a positive orthonormal basis of the search space. A very quick way to detect convergence is to check whether the objective function values at these points are all worse than the function value at the current solution. As it turns out, this is precisely what bfgs-h-2d does, and also the reason why it works so well.
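The gradient estimation and the convergence check just described can be sketched as follows. This is a minimal illustration with names of our own choosing, not the thesis implementation; it uses central differences, which evaluate the objective at exactly 2d neighbouring points:

```python
import numpy as np

def central_difference_gradient(f, x, eps=1e-6):
    """Estimate the gradient of f at x from 2d neighbouring points
    x +/- eps*e_i (one per vector of the positive orthonormal basis
    [I, -I]).  Returns the gradient and the neighbour values."""
    d = len(x)
    grad = np.empty(d)
    neighbour_values = []
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        f_plus, f_minus = f(x + e), f(x - e)
        neighbour_values += [f_plus, f_minus]
        grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad, neighbour_values

def stagnated(f, x, neighbour_values):
    """bfgs-h-2d-style test: signal a restart when none of the 2d
    neighbours improves on the current solution."""
    fx = f(x)
    return all(v >= fx for v in neighbour_values)
```

At a local optimum every neighbour is worse, so the check fires using the 2d evaluations that the finite difference step performs anyway, which is why a history length of 2d detects convergence so cheaply.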

In contrast with the surprisingly good results of bfgs-h-2d, the special restart strategy, which is based on monitoring the norm of the estimated gradient, performs very poorly and is clearly the worst of all the tested restart strategies. The other two strategies, bfgs-h-comb and bfgs-f-comb, have very similar performance, with bfgs-h-comb being slightly better.

Overall, BFGS has excellent results on ill-conditioned problems, even exceeding the performance of the best algorithms from the BBOB 2009 conference for certain dimensionalities on the group hcond, as illustrated in figure 9. However, it performs quite poorly on multimodal functions (multi and mult2).

Table 14 sums up the results comparing MetaMax(k) with the corresponding fixed restart strategies. In terms of success rate, both types of strategies perform the same on all function groups. In terms of rate of convergence, expressed by the API values, the results are similar to those observed when using the Nelder-Mead method: MetaMax(k) strategies perform better on multi and mult2, but worse on lcond, hcond and separ.

The overall performance of MetaMax(k) across all function groups is worse than that of the corresponding fixed restart strategies, and consequently also worse than that of the best restart strategies.


[ECDF plot: x-axis log10(evaluations/D), y-axis proportion of trials; curves: bfgs-special, bfgs-f-comb, bfgs-h-comb, best 2009; functions f10-14, 20-D]

Figure 9: BFGS - ECDF of the best restart strategies

               log10 API                      Success rate [%]
               2D    3D    5D    10D   20D     2D   3D   5D   10D  20D

all
bfgs-f-comb    3.58  4.17  4.71  5.32  5.83    82   71   64   57   52
bfgs-h-comb    3.37  3.91  4.56  5.18  5.66    83   75   65   58   56
bfgs-special   4.70  4.85  5.20  5.76  6.03    55   54   50   48   48

separ
bfgs-f-comb    2.66  3.42  4.01  4.50  4.86    94   77   65   60   60
bfgs-h-comb    2.36  3.11  3.96  4.44  4.81   100   84   66   61   60
bfgs-special   3.45  3.86  4.12  4.51  4.84    69   63   61   60   60

lcond
bfgs-f-comb    3.41  3.78  4.19  4.77  5.51    82   80   78   75   66
bfgs-h-comb    3.14  3.55  4.00  4.59  5.10    83   80   78   76   75
bfgs-special   3.52  3.90  4.06  5.56  5.52    84   80   80   72   73

hcond
bfgs-f-comb    2.54  2.61  2.89  3.27  3.75    98   98   98   96   94
bfgs-h-comb    2.43  2.42  2.81  3.21  3.64   100  100   97   99   98
bfgs-special   3.25  2.68  3.09  3.44  3.87    81   96   93   93   92

multi
bfgs-f-comb    5.10  6.32  6.97  7.60  7.98    62   30   18   10    8
bfgs-h-comb    5.04  5.96  6.86  7.48  7.91    60   43   20   12    8
bfgs-special   7.07  7.20  7.49  7.87  8.16     8    8    5    5    4

mult2
bfgs-f-comb    4.14  4.63  5.38  6.59  7.25    74   70   66   44   30
bfgs-h-comb    3.86  4.43  5.04  6.09  6.71    74   70   67   47   40
bfgs-special   5.95  6.43  6.99  7.38  7.65    40   27   16   14   14

Table 13: BFGS - results of restart strategies


               log10 API                      Success rate [%]
               2D    3D    5D    10D   20D     2D   3D   5D   10D  20D

all
bfgs-f-1000d   3.94  4.32  4.84  5.32  5.83    76   67   60   57   52
bfgs-f-2000d   4.19  4.49  5.01  5.44  5.90    68   64   57   54   51
bfgs-f-5000d   4.33  4.65  5.14  5.58  5.93    66   59   54   49   51
bfgs-k-20      4.33  4.81  5.31  5.76  6.14    68   60   54   52   51
bfgs-k-50      4.33  4.74  5.27  5.79  6.22    69   65   58   55   52
bfgs-k-100     4.18  4.75  5.30  5.81  6.26    78   66   59   56   53
bfgs-h-comb    3.37  3.91  4.56  5.18  5.66    83   75   65   58   56

separ
bfgs-f-1000d   3.46  3.74  4.08  4.50  4.86    69   66   64   60   60
bfgs-f-2000d   3.51  3.76  4.08  4.52  4.93    69   66   63   60   60
bfgs-f-5000d   3.53  3.83  4.12  4.52  4.86    68   64   62   60   60
bfgs-k-20      3.94  4.29  4.68  5.09  5.43    69   65   62   60   60
bfgs-k-50      4.13  4.45  4.83  5.27  5.62    69   66   63   60   60
bfgs-k-100     3.62  4.56  4.97  5.41  5.75   100   67   63   60   60
bfgs-h-comb    2.36  3.11  3.96  4.44  4.81   100   84   66   61   60

lcond
bfgs-f-1000d   3.37  3.68  4.25  4.77  5.51    80   78   76   75   66
bfgs-f-2000d   3.29  3.69  4.40  4.74  5.27    80   78   76   75   74
bfgs-f-5000d   3.49  3.74  4.58  4.93  5.40    79   77   76   75   75
bfgs-k-20      3.72  4.11  4.55  5.09  5.55    79   77   76   75   75
bfgs-k-50      3.79  4.14  4.61  5.19  5.68    80   78   76   75   75
bfgs-k-100     3.90  4.25  4.68  5.28  5.79    80   79   76   75   74
bfgs-h-comb    3.14  3.55  4.00  4.59  5.10    83   80   78   76   75

hcond
bfgs-f-1000d   2.55  2.42  2.82  3.27  3.75   100  100   97   96   94
bfgs-f-2000d   2.57  2.49  2.88  3.32  3.85   100  100   97   95   94
bfgs-f-5000d   2.66  2.37  2.87  3.26  3.77    98  100   95   93   92
bfgs-k-20      3.06  3.17  3.52  3.96  4.40   100  100   96   94   94
bfgs-k-50      3.26  3.36  3.73  4.15  4.61   100  100   97   96   94
bfgs-k-100     3.45  3.57  3.92  4.31  4.78   100  100   98   96   94
bfgs-h-comb    2.43  2.42  2.81  3.21  3.64   100  100   97   99   98

multi
bfgs-f-1000d   5.85  6.74  7.16  7.60  7.98    52   20   15   10    8
bfgs-f-2000d   6.44  6.88  7.20  7.68  8.01    26   18   13    9    7
bfgs-f-5000d   6.62  7.05  7.36  7.79  8.12    23   14   10    6    5
bfgs-k-20      6.22  6.68  7.06  7.56  7.93    23   17   13    8    7
bfgs-k-50      6.06  6.56  6.99  7.52  7.90    25   20   15    9    8
bfgs-k-100     5.71  6.44  6.93  7.48  7.90    39   22   16   10    8
bfgs-h-comb    5.04  5.96  6.86  7.48  7.91    60   43   20   12    8

mult2
bfgs-f-1000d   4.45  5.06  6.01  6.59  7.25    80   76   50   44   30
bfgs-f-2000d   4.96  5.48  6.38  6.80  7.30    70   64   40   36   27
bfgs-f-5000d   5.31  6.09  6.66  7.26  7.40    65   42   29   17   25
bfgs-k-20      4.58  5.65  6.57  6.97  7.29    69   45   27   26   26
bfgs-k-50      4.31  5.08  6.08  6.71  7.17    72   64   42   37   27
bfgs-k-100     4.18  4.83  5.88  6.46  6.99    72   67   45   42   36
bfgs-h-comb    3.86  4.43  5.04  6.09  6.71    74   70   67   47   40

Table 14: BFGS - results of MetaMax(k) and corresponding xed restart strategies


A comparison of MetaMax(k), MetaMax and MetaMax(∞) using a maximum of 100 instances is given in table 15. Again, there are no significant differences between the performance of bfgs-m-100 and bfgs-i-100. Both of them clearly outperform bfgs-k-100 on all function groups, both in terms of success rate and rate of convergence. However, their overall results are still not better than those of the best restart strategies.

Results of the three MetaMax strategies with the maximum number of instances set to 50d are also shown in table 15. Again, the increase in the maximum number of instances did not affect the performance of MetaMax and MetaMax(∞) in any discernible way. For MetaMax(k) it actually led to an increase in API (a decrease in the rate of convergence) on the function groups separ, lcond and hcond.

We must therefore conclude that, for the BFGS algorithm, MetaMax does not provide any significant advantage over a well-selected restart strategy. In particular, the restart strategy bfgs-h-2d proved to outperform all MetaMax variants on almost all function groups and in all dimensionalities. The ECDF graph in figure 10 illustrates these results.

[ECDF plot: x-axis log10(evaluations/D), y-axis proportion of trials; curves: bfgs-k-50d, bfgs-h-comb, bfgs-m-50d, best 2009; functions f1-24, 5-D]

Figure 10: BFGS - ECDF of MetaMax variants using 50d instances


               log10 API                      Success rate [%]
               2D    3D    5D    10D   20D     2D   3D   5D   10D  20D

all
bfgs-k-100     4.18  4.75  5.30  5.81  6.26    78   66   59   56   53
bfgs-m-100     3.39  4.07  4.74  5.46  6.01    88   76   66   58   54
bfgs-i-100     3.25  4.10  4.78  5.47  6.01    90   76   66   58   54
bfgs-k-50d     4.10  4.78  5.32  6.01  6.56    82   67   64   57   54
bfgs-m-50d     3.44  4.08  4.75  5.46  5.99    85   76   66   58   55
bfgs-i-50d     3.47  4.06  4.76  5.47  6.04    86   78   66   58   53
bfgs-h-comb    3.37  3.91  4.56  5.18  5.66    83   75   65   58   56

separ
bfgs-k-100     3.62  4.56  4.97  5.41  5.75   100   67   63   60   60
bfgs-m-100     2.46  3.32  4.12  4.65  5.11   100   84   66   62   60
bfgs-i-100     2.35  3.36  4.14  4.67  5.13   100   84   66   62   60
bfgs-k-50d     3.61  4.65  5.16  5.77  6.30   100   67   64   61   60
bfgs-m-50d     2.38  3.34  4.12  4.66  5.11   100   84   66   62   60
bfgs-i-50d     2.44  3.36  4.12  4.68  5.13   100   84   66   62   60
bfgs-h-comb    2.36  3.11  3.96  4.44  4.81   100   84   66   61   60

lcond
bfgs-k-100     3.90  4.25  4.68  5.28  5.79    80   79   76   75   74
bfgs-m-100     3.35  3.84  4.39  5.12  5.82    84   81   78   76   73
bfgs-i-100     2.62  3.88  4.45  5.15  5.73   100   81   78   76   72
bfgs-k-50d     3.87  4.31  4.84  5.47  6.12    82   80   77   76   73
bfgs-m-50d     3.36  3.82  4.39  5.09  5.76    84   82   78   76   74
bfgs-i-50d     3.38  3.87  4.46  5.14  5.78    84   82   78   76   74
bfgs-h-comb    3.14  3.55  4.00  4.59  5.10    83   80   78   76   75

hcond
bfgs-k-100     3.45  3.57  3.92  4.31  4.78   100  100   98   96   94
bfgs-m-100     2.71  2.91  3.33  3.90  4.39   100  100   98   96   94
bfgs-i-100     2.77  2.98  3.40  3.93  4.43   100  100   99   97   94
bfgs-k-50d     3.45  3.71  4.18  4.81  5.43   100  100   98   96   95
bfgs-m-50d     2.76  2.92  3.34  3.87  4.41   100  100   98   96   94
bfgs-i-50d     2.82  3.00  3.41  3.94  4.43   100  100   98   96   95
bfgs-h-comb    2.43  2.42  2.81  3.21  3.64   100  100   97   99   98

multi
bfgs-k-100     5.71  6.44  6.93  7.48  7.90    39   22   16   10    8
bfgs-m-100     4.86  5.87  6.75  7.35  7.83    62   44   21   14   10
bfgs-i-100     4.81  5.80  6.75  7.34  7.82    62   44   21   14   10
bfgs-k-50d     5.38  6.41  6.88  7.42  7.88    53   23   17   12    8
bfgs-m-50d     4.90  5.83  6.75  7.35  7.82    62   44   21   14   10
bfgs-i-50d     4.87  5.64  6.73  7.34  7.81    62   52   22   14   10
bfgs-h-comb    5.04  5.96  6.86  7.48  7.91    60   43   20   12    8

mult2
bfgs-k-100     4.18  4.83  5.88  6.46  6.99    72   67   45   42   36
bfgs-m-100     3.56  4.39  5.05  6.22  6.87    92   72   67   44   36
bfgs-i-100     3.46  4.42  5.07  6.19  6.87    94   72   67   46   37
bfgs-k-50d     4.15  4.70  5.42  6.45  7.00    73   68   64   42   39
bfgs-m-50d     3.80  4.41  5.07  6.23  6.81    80   72   67   44   39
bfgs-i-50d     3.79  4.41  5.05  6.17  7.01    81   72   68   46   30
bfgs-h-comb    3.86  4.43  5.04  6.09  6.71    74   70   67   47   40

Table 15: BFGS - results of MetaMax strategies


5.4 CMA-ES

The best fixed restart and stagnation based restart strategies for each dimensionality are listed in table 16, and their results are compared together with the special restart strategy in table 17.

For the best fixed restart strategies we see the standard behaviour, where run length increases with the number of dimensions. However, for the objective function value stagnation strategies, cmaes-h-100d is apparently the best for all dimensionalities. Closer inspection of the results reveals that a longer function value history length improves performance on the group hcond with no negative effect on performance on the other function groups. This is the reason why 100d comes out as the best overall objective function stagnation based restart strategy. It is also likely that, had we tested strategies with even longer objective function history lengths, they would perform even better. The behaviour of the tested stagnation based restart strategies on the group hcond is illustrated in figure 11.
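A history-based stagnation test of this kind can be sketched as follows. This is our own illustration with hypothetical names, not the thesis implementation; the cmaes-h-* strategies correspond to history lengths such as 100·d:

```python
from collections import deque

class StagnationRestart:
    """Trigger a restart when the best objective values recorded over
    the last `history_len` iterations span a range below `tol`,
    i.e. when progress has stagnated."""

    def __init__(self, history_len, tol=1e-10):
        self.history = deque(maxlen=history_len)
        self.tol = tol

    def update(self, best_value):
        # Returns True when the window is full and shows no progress.
        self.history.append(best_value)
        full = len(self.history) == self.history.maxlen
        return full and (max(self.history) - min(self.history)) < self.tol
```

A longer window is more conservative: runs are cut later, which, as observed above, pays off on hcond, where the search progresses slowly but steadily.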

[ECDF plot: x-axis log10(evaluations/D), y-axis proportion of trials; curves: cmaes-h-5d, cmaes-h-10d, cmaes-h-20d, cmaes-h-50d, cmaes-h-100d, best 2009; functions f10-14, 20-D]

Figure 11: CMA-ES - ECDF of function value stagnation based restart strategies

Dimensionality   Fixed           Stagnation based
d=2              cmaes-f-500d    cmaes-h-100d
d=3              cmaes-f-500d    cmaes-h-100d
d=5              cmaes-f-500d    cmaes-h-100d
d=10             cmaes-f-1000d   cmaes-h-100d
d=20             cmaes-f-2000d   cmaes-h-100d

Table 16: CMA-ES - best restart strategies for each dimensionality

A comparison of the best-of strategies shows that, as usual, all three of them perform virtually the same. The results are very good overall, especially on ill-conditioned functions (groups hcond and lcond), with an almost 100% success rate


across all dimensionalities and performance approaching that of the best BBOB 2009 workshop algorithms. In comparison with the other tested local search algorithms, the performance of CMA-ES on multimodal functions (groups multi and mult2) is also quite good. On the other hand, its performance on the group separ is surprisingly low, especially in terms of success rates.

               log10 API                      Success rate [%]
               2D    3D    5D    10D   20D     2D   3D   5D   10D  20D

all
cmaes-f-comb   2.87  3.34  4.17  5.01  5.63    97   97   83   68   61
cmaes-h-comb   2.82  3.21  4.03  4.90  5.59    97   97   86   71   63
cmaes-special  2.84  3.26  4.02  4.88  5.56    97   97   86   71   62

separ
cmaes-f-comb   2.54  2.97  3.84  4.66  5.12   100  100   84   66   63
cmaes-h-comb   2.52  2.96  3.85  4.64  5.11   100  100   84   66   64
cmaes-special  2.55  2.97  3.82  4.63  5.10   100  100   84   66   64

lcond
cmaes-f-comb   2.36  2.70  3.14  3.87  4.92   100  100  100  100   80
cmaes-h-comb   2.38  2.65  3.11  3.79  4.89   100  100  100  100   80
cmaes-special  2.35  2.67  3.16  3.82  4.87   100  100  100  100   80

hcond
cmaes-f-comb   2.60  3.07  3.53  3.69  4.13   100  100   97  100  100
cmaes-h-comb   2.66  2.90  3.21  3.72  4.48   100  100  100  100  100
cmaes-special  2.59  2.90  3.17  3.61  4.14   100  100  100  100  100

multi
cmaes-f-comb   3.04  3.55  4.67  6.18  6.90   100  100   81   41   30
cmaes-h-comb   2.95  3.38  4.71  6.19  6.79   100  100   81   40   33
cmaes-special  3.01  3.56  4.66  6.15  6.82   100  100   80   42   32

mult2
cmaes-f-comb   3.71  4.28  5.47  6.42  6.92    87   85   57   38   37
cmaes-h-comb   3.53  4.08  5.09  5.95  6.77    88   86   70   54   36
cmaes-special  3.59  4.07  5.10  5.95  6.73    86   85   70   54   37

Table 17: CMA-ES - results of restart strategies

Comparing MetaMax(k) with the corresponding fixed run length restart strategies gives results similar to BFGS: MetaMax(k) performs better on multimodal functions, but worse on ill-conditioned and separable functions. Overall, considering the results across all function groups, MetaMax(k) performs worse than an equivalent fixed restart strategy and therefore also worse than any of the best-of restart strategies. It is also apparent that, as the number of used instances increases, the differences in performance between each variant of MetaMax(k) and the corresponding restart strategy tend to become less pronounced.

Table 19 shows the results of MetaMax, MetaMax(∞) and MetaMax(k) using the limits of 100 and 50d instances, and compares them with the best overall restart strategy cmaes-special. All of the compared strategies perform very similarly in terms of success rates. There are a few outlying results - for example, on the group separ for d=5, cmaes-m-100, cmaes-i-100 and cmaes-k-100

have worse success rates than their counterpart strategies which use 50d instances. However, no clear trend is obvious.

As with the other tested local search algorithms, there is no difference in performance when using MetaMax or MetaMax(∞), or when using 100 or 50d instances with these two MetaMax variants. Unlike with the other algorithms, there also seems to be no difference in rate of convergence between cmaes-k-100 and cmaes-k-50d.


               log10 API                      Success rate [%]
               2D    3D    5D    10D   20D     2D   3D   5D   10D  20D

all
cmaes-f-1000d  2.95  3.39  4.25  5.01  5.72    97   97   80   68   61
cmaes-f-2000d  3.04  3.52  4.38  5.03  5.63    97   94   77   67   61
cmaes-f-5000d  3.28  3.76  4.53  5.18  5.73    94   89   75   64   60
cmaes-k-20     3.27  3.92  4.73  5.47  5.99    96   89   75   64   60
cmaes-k-50     3.22  3.76  4.65  5.39  5.99    97   94   78   67   62
cmaes-k-100    3.27  3.72  4.61  5.41  6.04    97   97   80   68   61
cmaes-special  2.84  3.26  4.02  4.88  5.56    97   97   86   71   62

separ
cmaes-f-1000d  2.69  3.07  4.19  4.66  5.21   100  100   68   66   64
cmaes-f-2000d  2.75  3.49  4.21  4.67  5.12   100   84   68   65   63
cmaes-f-5000d  3.22  3.55  4.21  4.69  5.15    86   84   68   65   63
cmaes-k-20     3.06  3.90  4.65  5.09  5.53   100   84   67   65   63
cmaes-k-50     3.01  3.91  4.70  5.17  5.58   100   84   68   66   64
cmaes-k-100    3.04  3.56  4.78  5.22  5.63   100  100   68   66   64
cmaes-special  2.55  2.97  3.82  4.63  5.10   100  100   84   66   64

lcond
cmaes-f-1000d  2.30  2.69  3.27  3.87  5.14   100  100  100  100   80
cmaes-f-2000d  2.40  2.71  3.23  3.85  4.92   100  100  100  100   80
cmaes-f-5000d  2.40  2.82  3.60  4.34  4.96   100  100  100   85   80
cmaes-k-20     2.86  3.25  3.69  4.82  5.45   100  100  100   85   80
cmaes-k-50     2.91  3.36  3.81  4.45  5.52   100  100  100  100   80
cmaes-k-100    2.99  3.39  3.85  4.49  5.57   100  100  100  100   80
cmaes-special  2.35  2.67  3.16  3.82  4.87   100  100  100  100   80

hcond
cmaes-f-1000d  2.57  2.93  3.18  3.69  4.41   100  100  100  100   96
cmaes-f-2000d  2.55  2.88  3.18  3.63  4.13   100  100  100  100  100
cmaes-f-5000d  2.63  2.88  3.22  3.70  4.40   100  100  100  100   96
cmaes-k-20     3.29  3.61  3.95  4.38  4.92   100  100  100  100   96
cmaes-k-50     3.39  3.69  4.03  4.50  4.93   100  100  100  100  100
cmaes-k-100    3.41  3.74  4.04  4.53  5.04   100  100  100  100   98
cmaes-special  2.59  2.90  3.17  3.61  4.14   100  100  100  100  100

multi
cmaes-f-1000d  3.25  3.72  5.08  6.18  6.87   100  100   65   41   31
cmaes-f-2000d  3.35  3.91  5.40  6.28  6.90   100  100   60   38   30
cmaes-f-5000d  3.68  4.35  5.62  6.39  6.99   100   93   56   36   29
cmaes-k-20     3.28  3.96  5.38  6.41  6.98    95   94   57   36   29
cmaes-k-50     3.18  3.64  5.19  6.35  6.95   100  100   64   38   31
cmaes-k-100    3.24  3.66  5.17  6.28  6.97   100  100   66   42   32
cmaes-special  3.01  3.56  4.66  6.15  6.82   100  100   80   42   32

mult2
cmaes-f-1000d  3.79  4.38  5.32  6.42  6.87    88   85   68   38   37
cmaes-f-2000d  4.03  4.48  5.43  6.49  6.92    86   84   68   38   37
cmaes-f-5000d  4.29  5.03  5.81  6.63  7.01    85   69   55   37   36
cmaes-k-20     3.77  4.75  5.75  6.51  6.95    86   69   54   37   36
cmaes-k-50     3.55  4.10  5.34  6.31  6.89    86   85   61   38   37
cmaes-k-100    3.61  4.17  5.05  6.36  6.88    87   85   68   38   37
cmaes-special  3.59  4.07  5.10  5.95  6.73    86   85   70   54   37

Table 18: CMA-ES - results of MetaMax(k) and corresponding xed restart strategies


However, they both converge more slowly than MetaMax or MetaMax(∞), which are in turn outperformed by cmaes-special.

The greatest difference in terms of rate of convergence is on the groups separ, lcond and hcond, where cmaes-special converges faster than the MetaMax strategies over the whole tested range of function evaluations (the interval ⟨0, 10^5⟩). On the group multi the restart strategy cmaes-special converges faster initially, but then slows down to the rate of the MetaMax strategies, as illustrated in figure 12. On mult2 all of the tested strategies converge more or less at the same rate.

[ECDF plot: x-axis log10(evaluations/D), y-axis proportion of trials; curves: cmaes-m-50d, cmaes-special, cmaes-k-50d, best 2009; functions f15-19, 5-D]

Figure 12: CMA-ES - ECDF comparison of MetaMax variants using 50d instances

Overall, all of the tested MetaMax strategies are outperformed by the best-of restart strategies.


               log10 API                      Success rate [%]
               2D    3D    5D    10D   20D     2D   3D   5D   10D  20D

all
cmaes-k-100    3.27  3.72  4.61  5.41  6.04    97   97   80   68   61
cmaes-m-100    3.13  3.60  4.60  5.35  6.06    97   97   77   69   60
cmaes-i-100    3.19  3.66  4.56  5.35  6.07    98   97   80   70   59
cmaes-k-50d    3.26  3.72  4.51  5.44  6.19    97   97   87   71   60
cmaes-m-50d    3.12  3.60  4.47  5.34  6.06    97   97   83   69   60
cmaes-i-50d    3.15  3.67  4.50  5.34  6.07    97   97   83   70   59
cmaes-special  2.84  3.26  4.02  4.88  5.56    97   97   86   71   62

separ
cmaes-k-100    3.04  3.56  4.78  5.22  5.63   100  100   68   66   64
cmaes-m-100    2.83  3.34  4.54  5.05  5.52   100  100   68   66   63
cmaes-i-100    2.88  3.32  4.57  5.06  5.53   100  100   68   66   64
cmaes-k-50d    2.97  3.59  4.52  5.42  5.91   100  100   84   66   64
cmaes-m-50d    2.78  3.34  4.23  5.05  5.53   100  100   84   66   63
cmaes-i-50d    2.79  3.40  4.25  5.06  5.52   100  100   84   66   63
cmaes-special  2.55  2.97  3.82  4.63  5.10   100  100   84   66   64

lcond
cmaes-k-100    2.99  3.39  3.85  4.49  5.57   100  100  100  100   80
cmaes-m-100    2.77  3.20  3.76  4.49  5.58   100  100  100  100   80
cmaes-i-100    2.90  3.31  3.82  4.49  5.59   100  100  100  100   80
cmaes-k-50d    2.99  3.43  3.91  4.56  5.65   100  100  100  100   82
cmaes-m-50d    2.78  3.20  3.76  4.46  5.58   100  100  100  100   80
cmaes-i-50d    2.87  3.35  3.81  4.48  5.59   100  100  100  100   81
cmaes-special  2.35  2.67  3.16  3.82  4.87   100  100  100  100   80

hcond
cmaes-k-100    3.41  3.74  4.04  4.53  5.04   100  100  100  100   98
cmaes-m-100    3.22  3.59  3.96  4.61  5.08   100  100  100   94   94
cmaes-i-100    3.32  3.67  3.97  4.51  5.11   100  100  100  100   92
cmaes-k-50d    3.44  3.76  4.09  4.63  5.23   100  100  100  100   93
cmaes-m-50d    3.22  3.57  3.92  4.54  5.07   100  100  100   96   93
cmaes-i-50d    3.33  3.68  4.03  4.52  5.12   100  100  100  100   93
cmaes-special  2.59  2.90  3.17  3.61  4.14   100  100  100  100  100

multi
cmaes-k-100    3.24  3.66  5.17  6.28  6.97   100  100   66   42   32
cmaes-m-100    3.17  3.69  5.20  6.36  7.04   100  100   66   39   29
cmaes-i-100    3.22  3.71  4.82  6.38  7.03   100  100   82   39   29
cmaes-k-50d    3.22  3.70  4.81  6.32  7.05   100  100   82   44   32
cmaes-m-50d    3.16  3.67  4.88  6.36  7.03   100  100   81   39   30
cmaes-i-50d    3.15  3.77  4.87  6.37  7.03   100  100   81   40   30
cmaes-special  3.01  3.56  4.66  6.15  6.82   100  100   80   42   32

mult2
cmaes-k-100    3.61  4.17  5.05  6.36  6.88    87   85   68   38   37
cmaes-m-100    3.58  4.11  5.38  6.07  6.98    88   86   55   52   34
cmaes-i-100    3.57  4.22  5.44  6.13  7.00    89   85   55   52   34
cmaes-k-50d    3.61  4.07  5.11  6.11  6.98    86   85   70   52   35
cmaes-m-50d    3.60  4.14  5.40  6.12  6.98    87   85   54   51   35
cmaes-i-50d    3.54  4.11  5.41  6.11  6.98    88   85   54   52   34
cmaes-special  3.59  4.07  5.10  5.95  6.73    86   85   70   54   37

Table 19: CMA-ES - results of MetaMax strategies


5.5 Additional results

Figure 13 shows the time it takes to process a single MetaMax round depending on the number of local search algorithm instances i, not including the time for processing local search algorithm steps. The MetaMax(∞) strategy was used for this measurement, so i is equal to the number of rounds. The dependency of processing time on the number of instances seems to be roughly linear. Theoretically, it should be O(i · log(i)), which is the time complexity of Andrew's convex hull algorithm. However, with the amount of noise present in the measured data it is difficult to tell exactly.

What is clear is that, as MetaMax adds new instances, it becomes slower and slower. The total processing time seems to grow at a rate of about O(i^2), which is the main motivation for implementing a mechanism for limiting the maximum number of instances.
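For reference, the O(i · log(i)) step is the upper convex hull computation that MetaMax performs each round when deciding which instances to select. A generic sketch of Andrew's monotone chain algorithm (upper hull only, not the thesis code) looks like this:

```python
def upper_convex_hull(points):
    """Return the points on the upper convex hull, left to right.
    Sorting dominates the cost: O(m log m) for m points."""
    pts = sorted(set(points))
    hull = []
    for px, py in pts:
        # Pop the last hull point while it lies on or below the
        # segment from hull[-2] to the new point (non-right turn).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull
```

Roughly speaking, MetaMax selects the instances whose (estimate, best value) points lie on such a hull; since the hull is recomputed every round over all i instances, the total work across a run grows quadratically, matching the O(i^2) trend observed above.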

Also shown in figure 13 is the total number of used function evaluations at each round. In [GK11] the inverse dependency (the number of rounds depending on the total number of function evaluations) is examined, and the result given is Ω(v_t / log v_t), where v_t is the number of used function evaluations. Our measurements confirm this result.

The timing experiments were performed on a Sony Vaio VGN-NR11S-S with 2048 MB RAM and an Intel Core 2 Duo T5250 1.5 GHz processor.

We also obtained several interesting results from the series of smaller measurements, which are described at the end of section 4. Firstly, there seems to be no noticeable difference in performance between using MetaMax with the discarding mechanism and without it. The actual maximum number of allowed instances does not seem to affect performance either, which corresponds with what was observed during the full-scale measurements. We also evaluated an alternate instance discarding mechanism, which simply discards the worst instances, and compared it with the mechanism that we normally use, which discards the most inactive instances. Again, there was no noticeable difference in performance.

The most interesting results were obtained when comparing the effects of different instance selection methods: using the recommended form of the estimate function h1 = e^(-v_i/v_t), its simplified version h2 = e^(v_i), and the alternate instance selection method described in subsection 3.1. An ECDF graph comparing the effect of these three modifications on the results of MetaMax is shown in figure 14.
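The two estimate function variants can be written down directly. This is a small sketch with a hypothetical function name; note that h2 overflows for realistically large v_i, so it is shown for illustration on small counts only:

```python
import math

def estimate_values(evals_per_instance, form="h1"):
    """Apply the estimate function to the evaluation counts v_i of the
    instances; v_t is the total number of evaluations used so far.
    h1 is the recommended form e^(-v_i/v_t); h2 is the simplified
    variant e^(v_i)."""
    v_t = sum(evals_per_instance)
    if form == "h1":
        return [math.exp(-v / v_t) for v in evals_per_instance]
    return [math.exp(v) for v in evals_per_instance]
```

Under h1, an instance that has consumed fewer evaluations gets a larger estimate value, which is what gives cheap instances a chance to be selected.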

Using the simpler estimate function h2 usually results in selecting fewer instances each round than when using h1. Overall, h2 proved to give noticeably worse results than h1, with the only areas where it had a slight advantage being the function groups lcond and hcond. This supports what was observed during the full-scale measurements, namely that the mechanism of selecting multiple instances results in worse performance on functions where the used local search algorithm already performs well (as a reminder, we used BFGS for these measurements). The alternate instance selection method performed similarly to the original one, with no noticeable difference in the measured results. One thing that has to be kept in mind when evaluating the results of these experiments is that they were performed using only one local search algorithm and with a limited budget of 5000 function evaluations. Therefore, we can only conclude that our new proposed instance selection mecha-


[Figure: three stacked panels over rounds 0-200 - total function evaluations, per-round processing time [s], and total processing time [s]]

Figure 13: MetaMax timing measurements. Data based on running MetaMax(∞) for 200 rounds, using the CMA-ES local search algorithm, averaged over 20 runs. Solid lines show average values; dashed lines show maximal and minimal values. Top - total number of objective function evaluations used up to the current round. Middle - time to run each round in seconds, not including the running time of the local search algorithm. Bottom - total running time, also not including the running time of the local search algorithm.


[ECDF plot: x-axis log10(evaluations/D), y-axis proportion of trials; curves: h1, h2, alt; functions f1-24, 10-D]

Figure 14: ECDF comparing MetaMax strategies using different instance selection methods

nism is a viable alternative and would be worth investigating more thoroughly. On the other hand, the simplified estimate function h2 probably does not merit any further attention, on account of its poor performance.

6 Conclusion

We have implemented the MetaMax algorithm as described in [GK11], as well as proposed and implemented a modification designed to limit the maximum number of used local search algorithm instances by discarding one old instance every time a new one is added. Subsequent testing revealed that this has a negligible, if any, effect on the performance of MetaMax, but significantly reduces the computing time requirements.

We have also proposed a new method of selecting local search algorithm instances which is invariant to monotone transformations of the objective function values and to the choice of the estimate function, which is one of the parameters of the MetaMax algorithm. Unfortunately, we were able to test it only in a limited fashion. The results indicate that the new method is viable and performs as well as the original one, without the need to choose the estimate function. However, more thorough testing would be needed in order to ascertain whether it brings any further advantages.

Comparing the three MetaMax variants together confirmed that there is almost no difference in performance between MetaMax and MetaMax(∞), as stated in [GK11], while MetaMax(k) performs worse than the other two variants. When comparing MetaMax directly with various restart strategies, using four different kinds of local search algorithms, we obtained the following results:

For the very simple Compass search algorithm, using MetaMax gives better performance than even well-tuned restart strategies, with an overall increase in success rate of 5-10% and faster convergence. The value of the aggregate performance index, which


expresses the rate of convergence (the lower the better), decreased by a factor of 1.08-2.63, depending on dimensionality.

For the somewhat more advanced Nelder-Mead method the results were inconclusive, with MetaMax achieving the same success rates but being better in terms of convergence only for low dimensionalities (specifically d=2 and d=3).

In the measurements performed using BFGS and CMA-ES, MetaMax performed clearly worse than the tested restart strategies. It achieves the same success rates but initially converges much more slowly, with the resulting aggregate performance indices being greater than those of the restart strategies by a factor of 1.66-3.02 for CMA-ES and 1.41-2.13 for BFGS.

Therefore, we have to conclude that, overall, in the area of black-box continuous parameter optimization, and when used with only one kind of local search algorithm, MetaMax does not offer any significant advantage over more conventional restart strategies.

It could be argued that one advantage of MetaMax is that, unlike most restart strategies, it can be used without first going through the process of parameter tuning. However, during our testing we have found at least one restart strategy for each used local search algorithm which performs well across all dimensionalities and does not require tuning either (specifically cs-special, nm-special, bfgs-h-2d and cmaes-special). Considering this, together with the fact that implementing MetaMax itself also requires some effort, this advantage is not as great as it might initially seem.

Nevertheless, it might still be worthwhile to explore the utility of MetaMax in other areas, such as combinatorial optimization, or in comparison with other portfolio strategies that use multiple types of local search algorithms at once.


References

[CS00] Rachid Chelouah and Patrick Siarry. "Tabu Search applied to global optimization". In: European Journal of Operational Research 123 (2000), pp. 256-270.

[GK11] András György and Levente Kocsis. "Efficient Multi-Start Strategies for Local Search Algorithms". In: Journal of Artificial Intelligence Research 41 (July 2011), pp. 407-444.

[Han+13a] Nikolaus Hansen et al. Real-Parameter Black-Box Optimization Benchmarking 2010: Presentation of the Noiseless Functions. Apr. 13, 2013. URL: http://coco.lri.fr/downloads/download13.09/bbobdocexperiment.pdf.

[Han+13b] Nikolaus Hansen et al. Real-Parameter Black-Box Optimization Benchmarking: Experimental Setup. Apr. 13, 2013. URL: http://coco.lri.fr/downloads/download13.09/bbobdocexperiment.pdf.

[Han11] Nikolaus Hansen. The CMA Evolution Strategy: A Tutorial. June 28, 2011. URL: https://www.lri.fr/~hansen/cmatutorial.pdf.

[Han13a] Nikolaus Hansen. Comparing Continuous Optimizers webpage. Mar. 5, 2013. URL: http://coco.gforge.inria.fr.

[Han13b] Nikolaus Hansen. The CMA Evolution Strategy. Feb. 13, 2013. URL: https://www.lri.fr/~hansen/cmaesintro.html.

[HM03] Pierre Hansen and Nenad Mladenović. "A Tutorial on Variable Neighborhood Search". In: Les Cahiers du GERAD G-2003-46 (July 2003). ISSN: 0711-2440.

[KLT03] Tamara G. Kolda, Robert Michael Lewis, and Virginia Torczon. "Optimization by Direct Search: New Perspectives on Some Classical and Modern Methods". In: SIAM Review 45.3 (2003), pp. 385-482.

[Neu11] Arnold Neumaier. "Complete Search in Continuous Global Optimization and Constraint Satisfaction". In: Acta Numerica 13 (May 2011), pp. 271-369.

[NM65] J. A. Nelder and R. Mead. "A simplex method for function minimization". In: The Computer Journal 7.4 (1965), pp. 308-313.

[NW99] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Ed. by Peter Glynn and Stephen M. Robinson. New York: Springer, 1999. ISBN: 0-387-98793-2.

[PG13] Petr Pošík and András György. E-mail correspondence. 2013.

[Poš13] Petr Pošík. Building an Aggregated Performance Index. 2013.


A Used local search algorithms

A.1 Compass search

Sometimes also called coordinate search, this is perhaps the simplest pattern search algorithm; a version of it was used as far back as the late 1940s by Enrico Fermi and Nicholas Metropolis. Its main advantage is that it is very simple to implement. However, it tends to converge rather slowly compared to more complex algorithms, and it performs very poorly on ill-conditioned functions.

It can be described in the following way:

Algorithm 6: Compass search

input: distance constant a0, contraction constant δ, starting position x0
       and objective function f

1   x ← x0
2   a ← a0
3   while stop conditions not met do
4       create 2d neighbour solutions according to the formula
        yi = x + a·bi for i = 1, ..., 2d, where bi is the i-th vector of a
        positive orthonormal basis of dimension d or, in other words, the
        i-th column of the d × 2d matrix B = [I, −I]
5       if f(yi) ≥ f(x) for all i then
6           unsuccessful iteration, contract the pattern: a ← δa
7       else
8           successful iteration: x ← argmin over i = 1, ..., 2d of f(yi)
9   return x

The usual setting for δ is δ = 0.5. The value of a0 is problem dependent; in our case we use a0 = 1, based on the fact that the minimum of each of our benchmark functions lies in the interval 〈−5, 5〉^n. Note that, on a step following a successful iteration, one of the neighbour solutions yi is identical to the x of the previous step. Therefore, it is possible to save one function evaluation by remembering the value of f(x) between steps.
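The loop above can be sketched in a few lines of Python. This is a minimal NumPy-based illustration only; the function and parameter names are our own choices and this is not the implementation used for the experiments:

```python
import numpy as np

def compass_search(f, x0, a0=1.0, delta=0.5, tol=1e-8, max_evals=10_000):
    """Minimize f by compass search; a minimal sketch of Algorithm 6."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)                 # remembered between steps to save one evaluation
    a, d, evals = a0, x.size, 1
    while a > tol and evals + 2 * d <= max_evals:
        # the 2d neighbours y_i = x + a*b_i, with b_i the columns of B = [I, -I]
        neighbours = [x + a * s * e for e in np.eye(d) for s in (1.0, -1.0)]
        values = [f(y) for y in neighbours]
        evals += 2 * d
        best = int(np.argmin(values))
        if values[best] < fx:            # successful iteration: move to the best neighbour
            x, fx = neighbours[best], values[best]
        else:                            # unsuccessful iteration: contract the pattern
            a *= delta
    return x, fx
```

Because f(x) is cached between iterations, each iteration costs exactly 2d function evaluations.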

A.2 Nelder-Mead algorithm

Also sometimes called the simplex method, it was first proposed by Nelder and Mead in [NM65]. The algorithm keeps track of n + 1 points, which form a simplex in the n-dimensional search space. We assume that at the beginning of each step these points are labelled according to their function values, so that x1 is the point with the best value and xn+1 the one with the worst. Each step, neighbour solutions are generated at some of the following points:

x0 = (1/n) ∑_{i=1}^{n} xi
Is the center of mass of all points except xn+1, the worst one.


xr = x0 + ρ(x0 − xn+1)
Is the reflection of xn+1 through x0.

xe = x0 + χρ(x0 − xn+1)
Is called the expansion point. It is similar to xr, but reflected further away from the simplex.

xco = x0 + ψρ(x0 − xn+1)
Outside contraction point: a reflection of xn+1 through x0, but closer to the simplex than xr.

xci = x0 − ψ(x0 − xn+1)
Inside contraction point. Lies inside the simplex on the line connecting x0 and xn+1.

xsi = x1 + σ(xi − x1) for i = 2, . . . , n + 1
Shrink points. They are positioned on the lines between x1 and all the other points.

An example configuration of these points for a simplex in two dimensions is given in figure 15.

Figure 15: Nelder-Mead algorithm in 2D. Illustration of the positions of the points used by the Nelder-Mead algorithm for d = 2. The dashed blue line represents the outline of the simplex before shrinking, the solid blue line after shrinking.

Used constants and their default values are:


ρ = 1 reflection coefficient

χ = 2 expansion coefficient

ψ = 0.5 contraction coefficient

σ = 0.5 reduction coefficient

With these points and their positions defined, it is now possible to describe the algorithm:

Algorithm 7: Nelder-Mead method

input : constants ρ, χ, ψ, σ, starting simplex x1, . . . , xn+1 and objective function f

while stop conditions not met do
    Sort x1, . . . , xn+1 so that f(x1) ≤ f(x2) ≤ · · · ≤ f(xn+1)
    get f(xr)
    if f(xr) < f(x1) then
        get f(xe)
        if f(xe) < f(xr) then
            xn+1 ← xe
        else
            xn+1 ← xr
    else
        if f(xr) < f(xn) then
            xn+1 ← xr
        else
            if f(xr) < f(xn+1) then
                get f(xco)
                if f(xco) < f(xr) then
                    xn+1 ← xco
                else
                    xi ← xsi for i = 2, . . . , n + 1
            else
                get f(xci)
                if f(xci) < f(xn+1) then
                    xn+1 ← xci
                else
                    xi ← xsi for i = 2, . . . , n + 1
return x1

The resulting behaviour is that the simplex gradually creeps towards the nearest optimum. One disadvantage of this algorithm is that its performance deteriorates quickly as the number of dimensions grows.
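Algorithm 7 can be rendered as a compact Python sketch. This is an illustration under the default constants listed above, not the implementation used for the experiments; for brevity it re-evaluates f freely instead of caching function values:

```python
import numpy as np

def nelder_mead(f, simplex, rho=1.0, chi=2.0, psi=0.5, sigma=0.5, max_iter=500):
    """Minimize f with the Nelder-Mead method, following Algorithm 7."""
    xs = [np.asarray(x, dtype=float) for x in simplex]
    for _ in range(max_iter):
        xs.sort(key=f)                       # f(x1) <= ... <= f(x_{n+1})
        x0 = np.mean(xs[:-1], axis=0)        # centroid of all but the worst point
        worst = xs[-1]
        xr = x0 + rho * (x0 - worst)         # reflection
        if f(xr) < f(xs[0]):
            xe = x0 + chi * rho * (x0 - worst)   # expansion
            xs[-1] = xe if f(xe) < f(xr) else xr
        elif f(xr) < f(xs[-2]):
            xs[-1] = xr
        elif f(xr) < f(worst):
            xco = x0 + psi * rho * (x0 - worst)  # outside contraction
            if f(xco) < f(xr):
                xs[-1] = xco
            else:                                # shrink towards the best point
                xs = [xs[0]] + [xs[0] + sigma * (x - xs[0]) for x in xs[1:]]
        else:
            xci = x0 - psi * (x0 - worst)        # inside contraction
            if f(xci) < f(worst):
                xs[-1] = xci
            else:                                # shrink towards the best point
                xs = [xs[0]] + [xs[0] + sigma * (x - xs[0]) for x in xs[1:]]
    return min(xs, key=f)
```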


A.3 BFGS

Stands for the Broyden–Fletcher–Goldfarb–Shanno algorithm. It is a line search algorithm which approximates the Newton method. It uses an estimate of the Hessian in the form of a matrix H^{-1}, which is updated every step based on values of the gradient. Since we are dealing with black-box optimization problems, the gradient has to be estimated as well, using the finite difference method. The formulas for updating H^{-1} are not overly complex and are included below. However, a detailed description of their derivation and the rationale behind them is beyond the scope of this text; for the interested reader we recommend [NW99], which is, at the time of writing of this thesis, freely available online.

The structure of the algorithm is as follows:

Algorithm 8: BFGS algorithm

input : starting position x0 and objective function f

x ← x0
H^{-1} ← I
while stop conditions not met do
    find the search direction p by solving Hp + ∇f(x) = 0, using H^{-1}
    find xp using line search from x in the direction p
    s ← xp − x
    y ← ∇f(xp) − ∇f(x)
    H^{-1} ← H^{-1} + (sᵀy + yᵀH^{-1}y)(ssᵀ)/(sᵀy)² − (H^{-1}ysᵀ + syᵀH^{-1})/(sᵀy)
    x ← xp
return x

Initially H^{-1} = I. This makes the algorithm act as a pure gradient descent on the first step, but it gradually becomes more and more affected by the estimated value of the Hessian. This is actually a desirable behaviour in line-search algorithms, as pure gradient search converges very quickly but has problems with ill-conditioned functions. In this way it is possible to combine the advantages of both methods.
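A minimal sketch of Algorithm 8 in the black-box setting follows, with the gradient estimated by forward differences. The helper names (num_grad, bfgs) and the simple Armijo backtracking line search are our own choices, not details taken from the thesis implementation:

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Forward-difference gradient estimate (black-box setting)."""
    g = np.empty_like(x)
    fx = f(x)
    for i in range(x.size):
        xh = x.copy()
        xh[i] += h
        g[i] = (f(xh) - fx) / h
    return g

def bfgs(f, x0, max_iter=100, tol=1e-4):
    """Minimal BFGS sketch with Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    H_inv = np.eye(x.size)
    g = num_grad(f, x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H_inv @ g                    # search direction: solve Hp = -grad
        t, fx = 1.0, f(x)                 # backtracking until Armijo condition holds
        while f(x + t * p) > fx + 1e-4 * t * (g @ p):
            t *= 0.5
            if t < 1e-12:
                break
        s = t * p
        x_new = x + s
        g_new = num_grad(f, x_new)
        y = g_new - g
        sy = s @ y
        if sy > 1e-12:                    # curvature condition; skip update otherwise
            H_inv = (H_inv
                     + ((sy + y @ H_inv @ y) * np.outer(s, s)) / sy**2
                     - (H_inv @ np.outer(y, s) + np.outer(s, y) @ H_inv) / sy)
        x, g = x_new, g_new
    return x
```

Note that the achievable precision is limited by the forward-difference step h, which is why the gradient-norm tolerance is kept relatively loose.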

A.4 CMA-ES

Stands for Covariance Matrix Adaptation Evolution Strategy. At each step it creates a new population of λ points by sampling a multivariate normal distribution. The population for generation g + 1 is sampled as follows:

xk,g+1 ∼ mg + σg N(0, Cg) for k = 1, . . . , λ (11)

Where mg is the mean of the distribution, equal to the weighted mean of the µ best points from the current generation. Depending on implementation details, these points might be further weighted by a weight function w. N(0, Cg) is a multivariate normal distribution with zero mean and covariance matrix Cg. The covariance matrix is initially set as C0 = I and then updated every step, based on the following:

1. Value of the covariance matrix in the previous iteration Cg−1


2. Covariance of µ best points in the previous generation

3. Evolution path - weighted sum of the movements of the distribution mean m over the past iterations

The entire mechanism of updating the covariance matrix is quite complicated, so we shall not describe it here in detail and will instead refer the reader to [Han11], which is a good tutorial on CMA-ES.

The parameter σg serves to give more control over the step length between two iterations. Its value changes based on the movement of the mean over past iterations, but in a slightly different way than Cg: it increases if the mean keeps moving in the same direction and decreases if it moves only a little.

Algorithm 9: CMA-ES algorithm

input : starting position m0, initial step size σ0, population size λ and objective function f

C ← I
m ← m0
σ ← σ0
pσ ← 0
pc ← 0
while stop conditions not met do
    Sample points x1, . . . , xλ from the random distribution m + σN(0, C)
    mnew ← update_m(x1, . . . , xµ)
    pσ ← update_isotropic_path(pσ, σ, C, m, mnew)
    pc ← update_anisotropic_path(pc, σ, m, mnew)
    C ← update_C(C, pc, m, σ, x1, . . . , xµ)
    σ ← update_sigma(σ, pσ)
    m ← mnew
return m

The resulting behaviour is that, in each step, the likelihood of generating the successful individuals of the previous generation is maximized and thus, with successive iterations, the population converges to a local optimum.
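To illustrate just the sampling and mean-update core of Algorithm 9 (equation (11) with C kept fixed at I), here is a deliberately simplified sketch. The covariance and step-size adaptation mechanisms of real CMA-ES are omitted, a geometric decay of σ stands in for update_sigma, and all names are our own:

```python
import numpy as np

def toy_es(f, m0, sigma0=1.0, lam=12, iters=200, seed=0):
    """Toy evolution strategy: the sampling/selection core of CMA-ES
    with C = I fixed, i.e. without covariance or step-size adaptation."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m0, dtype=float)
    sigma = sigma0
    mu = lam // 2
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()              # positive, decreasing, sum to 1
    for _ in range(iters):
        # x_k ~ m + sigma * N(0, I) for k = 1, ..., lambda  (eq. 11 with C = I)
        xs = m + sigma * rng.standard_normal((lam, m.size))
        xs = xs[np.argsort([f(x) for x in xs])]
        m = weights @ xs[:mu]             # weighted mean of the mu best points
        sigma *= 0.97                     # crude stand-in for step-size control
    return m
```

Even this stripped-down version converges on simple unimodal functions, which shows why the mean update alone already provides much of the local-search behaviour.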


B CD contents

The contents of the attached CD are described in the following table:

Folder        Description
source        Source code of all the used programs
results       The measured results, sorted by used local search algorithms and multi-start strategies
thesis_text   LaTeX source code of this text
thesis.pdf    This text in pdf format

Table 20: CD contents

C Acknowledgements

This thesis was written using a LaTeX template provided by Jan Faigl.
