
Swarm Intelligent Tuning of One-Class ν-SVM Parameters*

Lei Xie

National Key Laboratory of Industrial Control Technology, Institute of Advanced Process Control, Zhejiang University, Hangzhou 310027, P.R. China

[email protected]

Abstract. The problem of kernel parameter selection for the one-class classifier ν-SVM is studied. An improved constrained particle swarm optimization (PSO) is proposed to optimize the RBF kernel parameters of the ν-SVM, and two kinds of flexible RBF kernels are introduced. As a general-purpose swarm intelligent and global optimization tool, PSO does not require the classifier performance criterion to be differentiable or convex. In order to handle the parameter constraints introduced by the ν-SVM, the improved constrained PSO uses a penalty term to provide information on constraint violations. Application studies on an artificial banana dataset demonstrate the efficiency of the proposed method.

Keywords: Swarm intelligence, particle swarm optimization, ν-SVM, radial basis function, hyperparameter tuning.

1 Introduction

Due to their superior statistical properties and successful applications, kernel based one-class classifiers have attracted considerable attention in recent years. Among the kernel based approaches, the ν-Support Vector Machine (ν-SVM) proposed by Vapnik [1] and Support Vector Data Description by Tax [2] are two important and fundamentally equivalent approaches. For kernel based approaches, the choice of the kernel function is a crucial step that requires skill and experience. If the kernel family is predetermined, e.g., the RBF (Radial Basis Function) kernel, the problem reduces to selecting an appropriate set of parameters for the classifier. These kernel parameters, together with the regularization coefficient, are called the hyperparameters.

In practice, the hyperparameters are usually determined by grid search [3]: the hyperparameter space is explored by comparing the performance measure at a fixed set of points and, eventually, the parameter combination with the best performance is selected. Because of its computational complexity, grid search is only suitable for low-dimensional problems. An alternative approach to optimizing the hyperparameters is gradient descent [4], which requires the kernel function and the performance measure to be differentiable with respect to the kernel and regularization parameters. This hampers the use of some reasonable performance criteria such as the number of support vectors. Furthermore, gradient descent relies on the initial guess of the solution and is likely to converge to a local optimum, especially for high-dimensional optimization problems.

* This work is partially supported by the National Natural Science Foundation of China under grants 60421002 and 70471052.

In this paper, we propose a swarm intelligent approach, Particle Swarm Optimization (PSO), for hyperparameter selection to overcome the deficiencies mentioned above. As a swarm intelligent approach and general global optimization tool, PSO was first proposed by Kennedy and Eberhart [5] and simulates simplified models of social behavior. Since PSO has many advantages over other heuristic and evolutionary techniques, e.g., it is easy to implement and has a strong capability of escaping local optima [6], it has been widely applied to engineering problems [7]. The performance criterion to be optimized for the ν-SVM is the weighted average of the misclassification rates on the target set and the outlier set, where the outliers are assumed to be uniformly distributed around the target set. Although only the RBF kernel is considered in this paper, the swarm intelligent tuning methodology is general and can easily be extended to other kernel families.

The rest of this paper is structured as follows. Section 2 introduces the fundamental elements and parameterization of the ν-SVM with RBF kernels, and two flexible RBF kernel formulations are introduced. A short summary of the basic PSO algorithm is given in Section 3. The proposed constrained-PSO based ν-SVM hyperparameter tuning approach is presented in Section 4. Experimental results on an artificial banana dataset are reported in Section 5, followed by a concluding summary in Section 6.

2 ν-SVM and RBF Kernel Parameterization

The main idea of the ν-SVM is (i) to map the input vectors to a feature space and (ii) to find a hyperplane with the largest margin from the origin that separates the mapped data from the rest of the feature space.

Given a data set containing l target training examples, {x_i ∈ ℝ^n, i = 1, 2, ..., l}, the mapping Φ : x → F is carried out implicitly by a given kernel K : ℝ^n × ℝ^n → ℝ which computes the inner product in the feature space, i.e., ⟨Φ(x_i), Φ(x_j)⟩ = K(x_i, x_j). The ν-SVM solves the following optimization problem:

\[
\min_{\mathbf{w},\,\boldsymbol{\xi},\,\rho}\;\; \frac{1}{2}\langle \mathbf{w},\mathbf{w}\rangle \;-\; \rho \;+\; \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i,
\qquad \text{s.t.}\;\; \langle \mathbf{w},\Phi(\mathbf{x}_i)\rangle \ge \rho - \xi_i,\;\; \xi_i \ge 0
\tag{1}
\]

where w and ρ are the normal vector and offset of the separating hyperplane, and the distance between the hyperplane and the origin is ρ/‖w‖. The tuning parameter 0 ≤ ν ≤ 1 controls an upper bound on the fraction of training errors on the target class and a lower bound on the fraction of support vectors. The ξ_i are slack variables that allow some of the training examples to be wrongly classified, which is necessary for problems that are not linearly separable in the feature space.


The dual problem of (1) is formulated as:

\[
\min_{\boldsymbol{\alpha}}\;\; \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j K(\mathbf{x}_i,\mathbf{x}_j),
\qquad \text{s.t.}\;\; 0 \le \alpha_i \le \frac{1}{\nu l},\;\; \sum_{i=1}^{l}\alpha_i = 1
\tag{2}
\]

where α_i are called Lagrange multipliers and the decision function for a data vector z is:

\[
f(\mathbf{z}) =
\begin{cases}
1\ (\text{target class}), & \displaystyle\sum_{i=1}^{l}\alpha_i K(\mathbf{x}_i,\mathbf{z}) \ge \rho \\[4pt]
0\ (\text{outlier}), & \text{otherwise}
\end{cases}
\tag{3}
\]
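To make the workflow concrete, the following minimal sketch (our illustration, not the authors' code) fits a one-class ν-SVM from a precomputed Gram matrix and applies the decision rule of Eq. (3); the dual problem (2) is delegated to scikit-learn's OneClassSVM, and the function names are ours.

```python
from sklearn.svm import OneClassSVM

def train_nu_svm(K_train, nu):
    """Solve the dual problem (2) for a precomputed l x l Gram matrix K_train."""
    clf = OneClassSVM(kernel="precomputed", nu=nu)
    clf.fit(K_train)
    return clf

def classify(clf, K_test_train):
    """Decision rule (3): 1 = target class, 0 = outlier.
    K_test_train[i, j] = K(z_i, x_j) for test point z_i and training point x_j."""
    return (clf.predict(K_test_train) == 1).astype(int)
```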

2.1 RBF Kernel Parameterization

For the ν-SVM, the RBF is the most widely used general-purpose kernel formulation. In the current study, we consider the general RBF kernel with a symmetric positive semidefinite matrix S:

\[
K(\mathbf{x}_i,\mathbf{x}_j) = \exp\!\left(-(\mathbf{x}_i-\mathbf{x}_j)^{\mathrm T} S\,(\mathbf{x}_i-\mathbf{x}_j)\right).
\tag{4}
\]

The ordinary RBF kernel [8] assumes S = sI, i.e., s is the only adjustable parameter. In addition to the ordinary RBF kernel, two more flexible RBF kernels are considered in this paper: (i) the Diagonal Kernel, S = diag[s_1, s_2, ..., s_n], s_i ≥ 0, so that the elements of x can be scaled at different levels; and (ii) the Arbitrary Kernel, S = PᵀΛP with Λ = diag[s_1, s_2, ..., s_n], s_i ≥ 0, where P is an orthogonal rotation matrix, so that the input space is scaled and rotated simultaneously [9][10]. The orthogonal rotation matrix P can be further parameterized as in [10]:

\[
P = \prod_{i=1}^{n-1}\prod_{j=i+1}^{n} R_{i,j}, \qquad R_{i,j} \in \mathbb{R}^{n\times n},
\tag{5}
\]

where the R_{i,j} are elementary rotation matrices determined by angles 0 ≤ θ_{i,j} ≤ π:

\[
R_{i,j} =
\begin{bmatrix}
I_{i-1} & & & & \\
 & \cos\theta_{i,j} & 0 & -\sin\theta_{i,j} & \\
 & 0 & I_{j-i-1} & 0 & \\
 & \sin\theta_{i,j} & 0 & \cos\theta_{i,j} & \\
 & & & & I_{n-j}
\end{bmatrix}.
\tag{6}
\]
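As an illustration of Eqs. (4)-(6), the sketch below (our code, with our own function names) builds the rotation matrix P from the angles θ_{i,j}, forms S = PᵀΛP, and evaluates the generalized RBF kernel matrix; it is a direct reading of the formulas, not an optimized implementation.

```python
import numpy as np

def rotation(n, i, j, theta):
    """Elementary rotation R_{i,j} of Eq. (6), acting in the (i, j) plane (0-based)."""
    R = np.eye(n)
    R[i, i] = np.cos(theta); R[i, j] = -np.sin(theta)
    R[j, i] = np.sin(theta); R[j, j] = np.cos(theta)
    return R

def build_S(scales, thetas):
    """S = P^T diag(scales) P, with P the product of rotations of Eq. (5);
    the angles in `thetas` are ordered by (i, j) with i < j."""
    n = len(scales)
    P = np.eye(n)
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            P = P @ rotation(n, i, j, thetas[k])
            k += 1
    return P.T @ np.diag(scales) @ P

def rbf_kernel_matrix(X, Z, S):
    """K(x, z) = exp(-(x - z)^T S (x - z)) for all pairs of rows of X and Z (Eq. 4)."""
    D = X[:, None, :] - Z[None, :, :]                   # pairwise differences
    return np.exp(-np.einsum("abi,ij,abj->ab", D, S, D))
```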

Together with the parameter ν, Table 1 lists the hyperparameter vectors p of the ν-SVM for the different RBF kernel formulations.

2.2 Performance Criterion of ν-SVM

For a one-class classifier, there are two kinds of error that need to be minimized: (i) the rejection rate of target objects, ε_T− (true negative error), and (ii) the acceptance rate of outliers, ε_F+ (false positive error) [11].


Table 1. Parameter list of the RBF ν-SVM

RBF Kernel Type   | Adjustable Parameters
Ordinary Kernel   | p = [ν, s]ᵀ ∈ ℝ²
Diagonal Kernel   | p = [ν, s_1, s_2, ..., s_n]ᵀ ∈ ℝ^(n+1)
Arbitrary Kernel  | p = [ν, s_1, s_2, ..., s_n, θ_{i,j}]ᵀ ∈ ℝ^(n+1+n(n−1)/2), 1 ≤ i < j ≤ n

The leave-one-out estimate of the former error is the fraction of support vectors, ε_T− = n_SV/l, where n_SV denotes the number of support vectors.

With respect to the second error rate, one has to assume an outlier distribution and generate a set of artificial outliers to estimate ε_F+. Tax [14] proposed the natural assumption that the outliers are distributed uniformly in a hypersphere enclosing the target class. To generate the outliers, the uniform hyperspherical distribution generation method presented by Luban [12] is used, and the fraction of accepted outliers gives an estimate of ε_F+.
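A simple way to draw such artificial outliers is the standard direction-and-radius construction for uniform sampling in a ball, sketched below; it is used here as a stand-in for the generator of Luban and Staunton [12], whose exact algorithm is not reproduced in this paper, and the function name is ours.

```python
import numpy as np

def uniform_hypersphere(n_points, center, radius, rng=None):
    """Draw n_points uniformly from the ball of given radius around `center`."""
    rng = np.random.default_rng() if rng is None else rng
    n = center.shape[0]
    directions = rng.standard_normal((n_points, n))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # uniform on the sphere
    radii = radius * rng.random(n_points) ** (1.0 / n)               # radius law for uniformity in the ball
    return center + radii[:, None] * directions
```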

On the basis of the above estimates of ε_T− and ε_F+, the performance criterion of the ν-SVM on one training set is defined as:

\[
\varepsilon = \lambda\,\varepsilon_{T-} + (1-\lambda)\,\varepsilon_{F+} = \lambda\,\frac{n_{SV}}{l} + (1-\lambda)\,\varepsilon_{F+},
\tag{7}
\]

where 0 ≤ λ ≤ 1 balances the two kinds of error. In practice, in order to prevent overfitting of the RBF parameters on a single training set, the performance criterion can be taken as the average ε over multiple training sets.
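A minimal sketch of evaluating Eq. (7) for one training set, reusing the earlier sketches (train_nu_svm and rbf_kernel_matrix are our names): ε_T− is estimated by the fraction of support vectors and ε_F+ by the fraction of artificial outliers accepted as targets.

```python
import numpy as np

def performance_criterion(clf, K_train, K_out_train, lam=0.5):
    """Eq. (7): lam * nSV/l + (1 - lam) * eps_F+ for a fitted one-class nu-SVM."""
    l = K_train.shape[0]
    eps_t = len(clf.support_) / l                    # leave-one-out estimate nSV / l
    eps_f = np.mean(clf.predict(K_out_train) == 1)   # fraction of accepted outliers
    return lam * eps_t + (1.0 - lam) * eps_f
```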

3 Particle Swarm Optimization

PSO is an algorithm first introduced by Kennedy and Eberhart [5]. In PSO, each candidate solution of the optimization problem, called a particle, flies through the problem search space looking for the optimal position according to its own experience as well as the experience of its neighborhood. The performance of each particle is evaluated using the criterion in Eq. (7). Two factors characterize a particle's status in the m-dimensional search space, its velocity and its position, which are updated according to the following equations at the j-th iteration:

\[
\begin{cases}
\Delta \mathbf{p}_i^{\,j+1} = u\,\Delta \mathbf{p}_i^{\,j} + \varphi_1 r_1^{\,j}\,(\mathbf{p}_{id}^{\,j} - \mathbf{p}_i^{\,j}) + \varphi_2 r_2^{\,j}\,(\mathbf{p}_{gd}^{\,j} - \mathbf{p}_i^{\,j}) \\[4pt]
\mathbf{p}_i^{\,j+1} = \mathbf{p}_i^{\,j} + \Delta \mathbf{p}_i^{\,j+1}
\end{cases}
\tag{8}
\]

where Δp_i^{j+1} ∈ ℝ^m, called the velocity of particle i, represents the position change of this particle from its current position at the j-th iteration, p_i^{j+1} ∈ ℝ^m is the particle position, p_id^j ∈ ℝ^m is the best previous position of particle i, p_gd^j ∈ ℝ^m is the best position reached by all particles so far, φ_1 and φ_2 are positive acceleration coefficients, u is the so-called inertia weight, and r_1^j, r_2^j are uniformly distributed random numbers in [0, 1].
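The sketch below implements one iteration of the update (8) for the whole swarm; the variable names are ours, and the random factors are drawn per dimension, a common variant of the scalar r_1^j, r_2^j in the text.

```python
import numpy as np

def pso_step(positions, velocities, p_best, g_best, u, phi1=2.0, phi2=2.0, rng=None):
    """One velocity/position update of Eq. (8).
    positions, velocities, p_best: arrays of shape (n_swarm, m); g_best: shape (m,)."""
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(positions.shape)
    r2 = rng.random(positions.shape)
    velocities = (u * velocities
                  + phi1 * r1 * (p_best - positions)
                  + phi2 * r2 * (g_best - positions))
    return positions + velocities, velocities
```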


4 PSO Based Hyperparameter Tuning of ν-SVM

Particle Swarm Optimization, in its original form, can only be applied to unconstrained problems, while the RBF ν-SVM introduces a set of constraints on the parameters. In this section, a novel constrained PSO with restart is presented, together with rules for selecting its parameters. A general PSO hyperparameter tuning framework for the ν-SVM is given as well.

4.1 Constrained PSO with Restart

The purpose of hyperparameter tuning is to find an optimal combination of parameters that minimizes the misclassification rate defined by Eq. (7). With N training datasets of the target class at hand, the general formulation of the optimization problem is as follows:

\[
\min\; \varepsilon
\qquad \text{s.t.}\;\; \varepsilon = \frac{1}{N}\sum_{i=1}^{N}\varepsilon_i,\;\; \varepsilon_i \leftarrow \text{Eqs. (2), (7)},
\qquad \mathbf{p}^{L} \le \mathbf{p} \le \mathbf{p}^{U},\;\; \mathbf{p} = [p_1, p_2, \ldots, p_m]^{\mathrm T}.
\tag{9}
\]
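A sketch of the objective in Eq. (9), averaging the criterion of Eq. (7) over the N training sets, is given below. It reuses the earlier sketches; `decode` is a hypothetical helper (passed in by the caller) that splits the hyperparameter vector p into ν and the matrix S according to Table 1, and `datasets` is a list of (targets, artificial outliers) pairs.

```python
import numpy as np

def averaged_criterion(p, datasets, decode, lam=0.5):
    """Mean epsilon over N training sets (Eq. 9); `decode` maps p -> (nu, S)."""
    nu, S = decode(p)
    eps = []
    for X, X_out in datasets:
        K = rbf_kernel_matrix(X, X, S)           # Eq. (4) Gram matrix on the targets
        clf = train_nu_svm(K, nu)                # dual problem (2)
        K_out = rbf_kernel_matrix(X_out, X, S)   # kernel between outliers and targets
        eps.append(performance_criterion(clf, K, K_out, lam))
    return float(np.mean(eps))
```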

To handle the parameter constraints p^L ≤ p ≤ p^U, a penalty term is added to the objective function to provide information on constraint violations. In the current study, the penalty term takes the form:

\[
P_e(\mathbf{p}) = \sum_{i=1}^{m} b_i^2,
\qquad
b_i =
\begin{cases}
B\,(p_i - p_i^{U})^2, & p_i > p_i^{U} \\
0, & p_i^{L} \le p_i \le p_i^{U} \\
B\,(p_i - p_i^{L})^2, & p_i < p_i^{L}
\end{cases}
\tag{10}
\]

where B is a positive constant, e.g. 100. The penalty term P_e(p) decreases to zero if and only if no constraints are violated. Adding the penalty term to the objective function of (9) leads to the following constraint-free formulation:

\[
\min\; \varepsilon + P_e(\mathbf{p})
\qquad \text{s.t.}\;\; \varepsilon = \frac{1}{N}\sum_{i=1}^{N}\varepsilon_i;\;\; \varepsilon_i \leftarrow \text{Eqs. (2), (7)}.
\tag{11}
\]
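A sketch of the penalty of Eq. (10) and the penalized objective of Eq. (11), written exactly as in the text (the b_i are quadratic in the violation and are squared again in the sum); the function names are ours.

```python
import numpy as np

def penalty(p, p_low, p_up, B=100.0):
    """Eq. (10): Pe(p) = sum_i b_i^2 with quadratic violation terms b_i."""
    b = np.where(p > p_up, B * (p - p_up) ** 2,
        np.where(p < p_low, B * (p - p_low) ** 2, 0.0))
    return float(np.sum(b ** 2))

def penalized_objective(p, datasets, decode, p_low, p_up, lam=0.5, B=100.0):
    """Eq. (11): averaged criterion plus penalty term."""
    return averaged_criterion(p, datasets, decode, lam) + penalty(p, p_low, p_up, B)
```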

There are several ways to determine when the PSO algorithm should stop. The most commonly adopted criterion is reaching a maximum number of iterations I_MAX. However, it is pointless for PSO to proceed if the algorithm no longer has any capability of improvement. In this study, an improved PSO algorithm with restart is presented, i.e., a new particle population is generated when the current one has no potential to explore better solutions. This potential is measured with the following criterion, which indicates whether all the particles are clustered around the same spot:

\[
\max_{i,j}\; \|\mathbf{p}_i - \mathbf{p}_j\|_{\Sigma} < \delta, \qquad 1 \le i \le j \le n_{\mathrm{Swarm}},
\tag{12}
\]

where ‖p_i − p_j‖_Σ = √((p_i − p_j)ᵀ Σ (p_i − p_j)) is a weighted vector norm, Σ is a positive weighting matrix, e.g. Σ = diag⁻¹(p^U − p^L), δ is a predefined tolerance, e.g. 10⁻³, and n_Swarm is the population size. With the restart scheme proposed above, the exploring capability of the PSO algorithm is further improved and it is more likely to find the global solution within a limited number of iterations.
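A sketch of the restart test of Eq. (12): the swarm is re-initialized when the largest weighted distance between any two particles drops below δ. The weighting Σ = diag⁻¹(p^U − p^L) follows the text; the function name is ours.

```python
import numpy as np

def swarm_collapsed(positions, p_low, p_up, delta=1e-3):
    """Eq. (12): True if all particles lie within weighted distance delta of each other."""
    w = 1.0 / (p_up - p_low)          # diagonal of Sigma = diag^{-1}(p_up - p_low)
    max_dist = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = positions[i] - positions[j]
            max_dist = max(max_dist, float(np.sqrt(np.sum(w * d * d))))
    return max_dist < delta
```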

4.2 PSO Parameter Settings

Several parameters need to be predefined before PSO is carried out. Some rules of thumb for selecting the PSO parameters have been reported in the literature [5][6].

Population size (n_Swarm). This parameter is crucial for the PSO algorithm. A small population does not create enough interaction for the emergent behavior of PSO to occur. However, the population size is not problem specific; a value ranging from 20 to 50 can guarantee the exploring capability of PSO, especially for unconstrained optimization problems. In our experience, for the ν-SVM hyperparameter tuning problem, which includes a set of constraints, a population size of 40 to 80 is usually sufficient.

Inertia coefficient (u). This parameter controls the influence of the previous velocity on the current one. Generally speaking, a large u facilitates global exploration, whilst a low u facilitates local exploration. Following the suggestion of Parsopoulos and Vrahatis [6], an initial value around 1.2 with a gradual decline towards 0 is a good choice of u.

Acceleration coefficients (ϕ_1, ϕ_2). Proper tuning of these parameters can improve the probability of finding the global optimum and the speed of convergence. The default value is ϕ_1 = ϕ_2 = 2.
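These rules of thumb can be collected into a small configuration sketch; the concrete numbers (swarm size 40, inertia decaying linearly from 1.2 to 0 over I_MAX iterations, ϕ_1 = ϕ_2 = 2) mirror the settings used in Section 5, and the helper name is ours.

```python
N_SWARM = 40          # 40-80 recommended for the constrained tuning problem
I_MAX = 100           # maximum number of iterations
PHI1 = PHI2 = 2.0     # acceleration coefficients

def inertia(iteration, i_max=I_MAX, u_start=1.2, u_end=0.0):
    """Inertia weight u decaying linearly from u_start towards u_end."""
    return u_start + (u_end - u_start) * iteration / max(i_max - 1, 1)
```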

5 Experimental Evaluations

We applied the proposed tuning approach to an artificial 2-dimensional banana-shaped target set. Ten training datasets, each with 100 targets and 1000 artificial outliers uniformly distributed around the targets, were generated to train and evaluate the performance of the candidate RBF kernel parameters.

The mean misclassification rate defined in Eq. (9) over the 10 training datasets, as a function of ν and s (for the ordinary RBF kernel and λ = 0.5), is illustrated in Fig. 1. It reveals that the hyperparameter tuning problem has multiple local optima, and a gradient based method is very likely to converge to such a point and miss the global optimum. In contrast, the PSO based swarm intelligent tuning approach does not suffer from this limitation.

For each of the three kinds of RBF kernels listed in Table 1, PSO with the same parameter settings (n_Swarm = 40, u decreasing from 1.2 to 0, ϕ_1 = ϕ_2 = 2, I_MAX = 100, λ = 0.5, B = 500 and δ = 10⁻³) was performed to find the best hyperparameters. The optimization results over the 10 training sets are listed in Table 2. The first row gives the optimal solution found by PSO; note that a more flexible kernel has more parameters to tune.


Fig. 1. ε values for different combinations of ν and s (Banana)

Fig. 2. Decision boundary of the ν-SVM with the optimal RBF kernel (Banana)

Table 2. Results obtained by PSO over 10 training datasets (Banana)

RBF Kernel Type | Ordinary Kernel            | Diagonal Kernel                           | Arbitrary Kernel
pᵀ              | [ν, s] = [0.0977, 0.2951]  | [ν, s_1, s_2] = [0.0987, 0.2948, 0.3000]  | [ν, s_1, s_2, θ_{1,2}] = [0.1295, 0.2370, 0.3667, 0.6003]
ε_T−            | 0.1751 (0.0471)            | 0.1767 (0.0469)                           | 0.1709 (0.0338)
ε_F+            | 0.3697 (0.0474)            | 0.3681 (0.0473)                           | 0.3308 (0.0412)
ε               | 0.2724 (0.0225)            | 0.2724 (0.0224)                           | 0.2509 (0.0223)

The second row of Table 2 shows the average misclassification rate on the target set, ε_T−, and the third row the average fraction of accepted outliers, ε_F+, both over the 10 datasets. The numbers in brackets give the standard deviations. The last row shows the optimal value of the performance criterion, i.e., the weighted average of the two error rates.

From Table 2, it can be seen that replacing the ordinary kernel with the diagonal kernel does not improve the performance of the ν-SVM; the two kernels give approximately identical solutions (note that s_1 ≈ s_2 for the diagonal kernel). However, when the arbitrary kernel formulation is used, the average misclassification rate decreases from 0.2724 to 0.2509, an improvement of about 8%, so we can conclude that the arbitrary kernel yields significantly better results. The decision boundary of the ν-SVM with the optimal arbitrary RBF kernel for one training dataset is illustrated in Fig. 2. Note that the banana shape is rotated clockwise by about 40° compared with the original one in [14].

6 Conclusion

The purpose of a one-class classifier is to separate the target class from all other possible objects. In this paper, a constrained-PSO based hyperparameter tuning approach is proposed for the ν-SVM, and two flexible RBF kernel formulations are introduced. The application study demonstrates that the more flexible RBF kernels, in particular the arbitrary kernel, can lead to better performance than the ordinary kernel.


References

1. Vapnik, V. N.: The Nature of Statistical Learning Theory. Springer, Berlin (1995)
2. Tax, D. M. J.: One-class Classification. PhD thesis, Delft University of Technology, Delft (2001)
3. Unnthorsson, R., Runarsson, T. P., Jonsson, M. T.: Model selection in one-class ν-SVMs using RBF kernels. http://www.hi.is/~runson/svm/paper.pdf (2003)
4. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Machine Learning 46 (2002) 131-159
5. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proc. IEEE Int. Conf. Neural Networks, Perth (1995) 1942-1948
6. Parsopoulos, K. E., Vrahatis, M. N.: Recent approaches to global optimization problems through particle swarm optimization. Natural Computing 1 (2002) 235-306
7. Xie, X. F., Zhang, W. J., Yang, Z. L.: Overview of particle swarm optimization. Control and Decision 18 (2003) 129-134
8. Schölkopf, B., Smola, A. J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge (2002)
9. Friedrichs, F., Igel, C.: Evolutionary tuning of multiple SVM parameters. Neurocomputing 64 (2005) 107-117
10. Rudolph, G.: On correlated mutations in evolution strategies. In: Parallel Problem Solving from Nature 2 (PPSN II). Elsevier, Amsterdam (1992) 105-114
11. Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge (2004)
12. Luban, M., Staunton, L. P.: An efficient method for generating a uniform distribution of points within a hypersphere. Computers in Physics 2 (1988) 55-60
13. Chang, C., Lin, C.: LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html (2005)
14. Tax, D. M. J., Duin, R. P. W.: Uniform object generation for optimizing one-class classifiers. Journal of Machine Learning Research 2 (2001) 155-173