Feature Selection for SVMs
J. Weston et al., NIPS 2000. Presented by 오장민 (2000/01/04).
Second reference: Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999.
Abstract
A method of feature selection for Support Vector Machines (SVMs).
An efficient wrapper method.
Superior to some standard feature selection algorithms.
1. Introduction
Importance of feature selection in supervised learning: generalization performance, running time requirements, and interpretational issues.
SVM: state of the art in classification and regression.
Objective: to select a subset of features while preserving or improving the discriminative ability of a classifier.
2. The Feature Selection Problem
Definition: find the m << n features with the smallest expected generalization error, which is unknown and must be estimated.
Given a fixed set of functions y = f(x, α), find a preprocessing of the data x ↦ (x ∘ σ), σ ∈ {0,1}^n, where x ∘ σ denotes the elementwise product (x_1 σ_1, ..., x_n σ_n), and the parameters α minimizing

\tau(\sigma, \alpha) = \int V(y, f((x \circ \sigma), \alpha)) \, dF(x, y)

subject to ||σ||_0 = m, where F(x, y) is unknown and V(·,·) is a loss function.
Four Issues in Feature Selection
Starting point: forward / backward / middle.
Search organization: exhaustive search (2^N) / heuristic search.
Evaluation strategy: filter (independent of any learning algorithm) / wrapper (uses the bias of a particular induction algorithm).
Stopping criterion.
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Feature Filters
Forward selection: only addition to the subset.
Backward elimination: only deletion from the subset.
Example: FOCUS [Almuallim and Dietterich 92] finds the minimum combination of features that divides the training data into pure classes; the feature minimizing entropy is the most discriminating feature (its value differs between positive and negative examples).
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Wrapper Methods
Use an induction algorithm to estimate the merit of feature subsets.
Take the inductive bias of the learner into account.
Better results, but slower (the induction algorithm is called repeatedly).
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
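To make the wrapper idea concrete, here is a minimal sketch (my own illustration, not from the slides), assuming scikit-learn and generic (X, y) classification arrays: greedy forward selection that repeatedly calls the induction algorithm, here a linear SVM scored by cross-validation.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=5, cv=5):
    """Greedily add the feature that most improves the CV accuracy of an SVM."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(SVC(kernel="linear", C=1.0), X[:, cols], y, cv=cv).mean()
            scores.append((acc, f))
        acc, f = max(scores)            # best candidate this round
        if acc <= best_score:           # stop when no candidate improves the score
            break
        best_score = acc
        selected.append(f)
        remaining.remove(f)
    return selected, best_score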
Correlation-based Feature Selection
Relevant: a feature V_i is relevant if its values vary systematically with category membership, i.e. p(C = c | V_i = v_i) ≠ p(C = c) for some c and v_i.
Redundant: a feature is redundant if one or more of the other features are correlated with it.
A good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other.
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Pearson Correlation Coefficient

r_{zc} = \frac{k \, \bar{r}_{zi}}{\sqrt{k + k(k-1) \, \bar{r}_{ii}}}

z: the outside variable, c: the composite, k: the number of components, \bar{r}_{zi}: the average correlation between the components and the outside variable, \bar{r}_{ii}: the average inter-correlation between the components.
Higher r_zi → higher r_zc.
Lower r_ii → higher r_zc.
Larger k → higher r_zc.
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
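As a small numeric illustration of this formula (my own sketch, not from the thesis), with r_zi and r_ii taken as the given average correlations:

import math

def composite_correlation(k, r_zi, r_ii):
    """r_zc = k * r_zi / sqrt(k + k*(k-1)*r_ii): correlation between a
    composite of k components and the outside variable."""
    return k * r_zi / math.sqrt(k + k * (k - 1) * r_ii)

print(composite_correlation(5, 0.4, 0.1))   # strong class correlation, low redundancy
print(composite_correlation(5, 0.4, 0.5))   # same r_zi, but redundant components lower the merit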
2. The Feature Selection Problem (cont'd)
Goal: take advantage of the performance gain of wrapper methods while avoiding their computational complexity.
3. Support Vector Learning
Idea: map x into a high-dimensional space Φ(x) and construct an optimal hyperplane in that space.
The mapping Φ(·) is performed implicitly through a kernel function K(·,·).
The optimal hyperplane is the maximum-margin classifier; finding it reduces to a QP problem.
Decision function given by the SVM:
f(x) = w \cdot \Phi(x) + b = \sum_i \alpha_i^0 y_i K(x_i, x) + b

where the coefficients \alpha^0 solve the QP problem

\max_\alpha \; W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

subject to

\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, l.
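For concreteness, a short sketch (my own, assuming scikit-learn and made-up toy data) checking that a fitted SVM's decision function equals the support-vector expansion above:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # toy data
clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, y)

def rbf(A, b, gamma=0.1):
    """RBF kernel between each row of A and the single point b."""
    return np.exp(-gamma * np.sum((A - b) ** 2, axis=-1))

# f(x) = sum_i alpha_i^0 y_i K(x_i, x) + b, summed over the support vectors;
# clf.dual_coef_ already stores the signed products alpha_i^0 * y_i.
x = X[0]
f_manual = clf.dual_coef_[0] @ rbf(clf.support_vectors_, x) + clf.intercept_[0]
print(np.isclose(f_manual, clf.decision_function([x])[0]))  # True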
3. Support Vector Learning
Theorem 1. If images of training data of size l belonging to a sphere of size R are separable with the corresponding margin M, then the expectation of the error probability has the bound

E \, P_{err} \le \frac{1}{l} E\left\{ \frac{R^2}{M^2} \right\} = \frac{1}{l} E\left\{ R^2 W^2(\alpha^0) \right\},

where the expectation is taken over sets of training data of size l, and the images are the mapped points Φ(x_1), ..., Φ(x_l).
Theorem 1 shows that the performance depends on both R and M, where R is controlled by the mapping function Φ(·).
4. Feature Selection for SVMs
Recall the objective: minimize

\tau(\sigma, \alpha) = \int V(y, f((x \circ \sigma), \alpha)) \, dF(x, y)

over σ and α.
Enlarge the set of functions by replacing x with the scaled input x ∘ σ:

f(x, w, b) \;\to\; f(x \circ \sigma, w, b)

K_\sigma(x, y) = K(x \circ \sigma, y \circ \sigma) = \left( \Phi(x \circ \sigma) \cdot \Phi(y \circ \sigma) \right)
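A minimal sketch of the scaled kernel (my own illustration, using an RBF kernel as the base K): the elementwise product with σ is applied to both arguments before the ordinary kernel is evaluated, so setting σ_i = 0 removes feature i from the kernel entirely.

import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Ordinary RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def sigma_kernel(A, B, sigma, gamma=0.5):
    """K_sigma(x, y) = K(x * sigma, y * sigma): features are rescaled by sigma
    before the base kernel is computed."""
    return rbf_kernel(A * sigma, B * sigma, gamma)

# With sigma = (1, 1, 0, 0) the last two features no longer influence the
# kernel, and hence no longer influence an SVM trained with it.
A = np.random.randn(5, 4)
print(sigma_kernel(A, A, np.array([1.0, 1.0, 0.0, 0.0])))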
4. Feature Selection for SVMs
Minimize R^2 W^2 over σ, where

R^2 W^2(\sigma) = R^2(\sigma) \, W^2(\alpha^0, \sigma)

and R(σ), the radius of the smallest sphere containing the mapped (scaled) training points, is computed as

R^2(\sigma) = \max_\beta \; \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j)
\quad \text{subject to} \quad \sum_i \beta_i = 1, \; \beta_i \ge 0, \; i = 1, \dots, l

(Vapnik's statistical learning theory).
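Numerically, the radius problem can be sketched with an off-the-shelf constrained optimizer (my own illustration, assuming SciPy; the paper treats this as a QP):

import numpy as np
from scipy.optimize import minimize

def radius_squared(K):
    """R^2 = max over the simplex of sum_i beta_i K_ii - beta^T K beta,
    where K is the kernel matrix of the (scaled) training data."""
    n = K.shape[0]
    diag = np.diag(K)

    def neg_objective(beta):
        return -(beta @ diag - beta @ K @ beta)

    constraints = ({"type": "eq", "fun": lambda b: b.sum() - 1.0},)
    result = minimize(neg_objective, np.full(n, 1.0 / n), method="SLSQP",
                      bounds=[(0.0, 1.0)] * n, constraints=constraints)
    return -result.fun

# Sanity check with a linear kernel: for points on the unit circle the
# smallest enclosing sphere has radius 1, so R^2 should come out near 1.
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
print(radius_squared(X @ X.T))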
4. Feature Selection for SVMs
Finding the minimum of R^2 W^2 over σ ∈ {0,1}^n requires searching over all possible subsets of n features, which is a combinatorial problem.
Approximate algorithm: approximate the binary-valued vector σ ∈ {0,1}^n with a real-valued vector σ ∈ R^n, and optimize R^2 W^2 by gradient descent:

\frac{\partial R^2 W^2}{\partial \sigma_k} = W^2(\alpha^0, \sigma) \frac{\partial R^2(\sigma)}{\partial \sigma_k} + R^2(\sigma) \frac{\partial W^2(\alpha^0, \sigma)}{\partial \sigma_k}
4. Feature Selection for SVMs
Summary: find the minimum of τ(σ, α) by minimizing the approximate integer program

\min_\sigma \; R^2(\sigma) \, W^2(\alpha^0, \sigma) + \lambda \sum_i \sigma_i^{p} \quad \text{subject to} \quad \sigma_i \ge 0, \; i = 1, \dots, n.

For large enough λ, as p → 0 only m elements of σ will be nonzero, approximating the original optimization problem for τ(σ, α).
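The following is a rough, self-contained sketch of the overall idea (my own simplification, not the paper's exact procedure): it assumes scikit-learn, uses a linear kernel so that W^2 = ||w||^2, replaces the smallest-sphere QP by the largest squared distance to the data centroid, and uses finite differences instead of the analytic gradient.

import numpy as np
from sklearn.svm import SVC

def r2w2(X, y, sigma, C=10.0):
    """Approximate R^2 * W^2 for the data scaled elementwise by sigma."""
    Xs = X * sigma
    clf = SVC(kernel="linear", C=C).fit(Xs, y)
    w2 = float(np.sum(clf.coef_ ** 2))                          # W^2 = ||w||^2
    r2 = float(np.max(np.sum((Xs - Xs.mean(0)) ** 2, axis=1)))  # crude stand-in for R^2
    return r2 * w2

def select_by_scaling(X, y, steps=30, lr=0.05, eps=1e-2):
    """Gradient descent on sigma via finite differences; illustrative only,
    and expensive when the number of features is large."""
    sigma = np.ones(X.shape[1])
    for _ in range(steps):
        base = r2w2(X, y, sigma)
        grad = np.zeros_like(sigma)
        for k in range(len(sigma)):
            s = sigma.copy()
            s[k] += eps
            grad[k] = (r2w2(X, y, s) - base) / eps
        sigma = np.clip(sigma - lr * grad / (np.abs(grad).max() + 1e-12), 0.0, 1.0)
    return sigma   # keep the features with the largest sigma values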
5. Experiments (Toy data)
Comparison: standard SVMs, the feature selection algorithm, and three classical filter methods (Pearson correlation coefficients, Fisher criterion score, Kolmogorov-Smirnov test).
Description: a linear SVM for the linear problem and a second-order polynomial kernel for the nonlinear problem; the 2 best features were selected for comparison.
Fisher criterion score:

F(r) = \left| \frac{\mu_r^{+} - \mu_r^{-}}{\sigma_r^{+2} + \sigma_r^{-2}} \right|

where \mu_r^{\pm} is the mean value and \sigma_r^{\pm 2} the variance of the r-th feature in the positive and negative classes.
Kolmogorov-Smirnov test:

KS_{tst}(r) = \sqrt{l} \, \sup \left| \hat{P}\{X \le f_r\} - \hat{P}\{X \le f_r, \, y_r = 1\} \right|

where f_r denotes the r-th feature from each training example, and \hat{P} is the corresponding empirical distribution.
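A minimal sketch of the Fisher criterion score as written above (my own illustration; labels are assumed to be ±1):

import numpy as np

def fisher_score(X, y):
    """F(r) = |mu_r^+ - mu_r^-| / (var_r^+ + var_r^-) for every feature r."""
    Xp, Xn = X[y == 1], X[y == -1]
    return np.abs(Xp.mean(axis=0) - Xn.mean(axis=0)) / (Xp.var(axis=0) + Xn.var(axis=0))

# The 2 best features are those with the largest scores:
# best_two = np.argsort(fisher_score(X, y))[-2:]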
5. Experiments (Toy data)
Data
Linear data: 6 of 202 features were relevant. The probability of y = 1 or −1 was equal. {x1, x2, x3} were drawn as x_i = y N(i, 1) with probability 0.7 and as x_i = N(0, 1) with probability 0.3; {x4, x5, x6} were drawn as x_i = y N(i−3, 1) with probability 0.7 and as x_i = N(0, 1) with probability 0.3. The other features are noise, x_i = N(0, 20).
Nonlinear data: 2 of 52 features were relevant. If y = −1, {x1, x2} are drawn from N(μ1, Σ) or N(μ2, Σ) with equal probability, with Σ = I. If y = 1, {x1, x2} are drawn from two other normal distributions. The other features are noise, x_i = N(0, 20).
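The linear toy problem can be reproduced with a short generator (a sketch following the description above; exact conventions, such as whether 20 is a variance or a standard deviation, may differ from the paper):

import numpy as np

def linear_toy_data(n_samples, seed=0):
    """202-dimensional linear toy problem: features 1-6 are weakly relevant
    (informative with probability 0.7, standard noise otherwise), the rest
    are high-variance noise N(0, 20)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_samples)               # balanced labels
    X = rng.normal(0.0, 20.0, size=(n_samples, 202))      # noise features
    for s in range(n_samples):
        for i in range(6):
            if rng.random() < 0.7:
                mean = (i + 1) if i < 3 else (i + 1) - 3  # N(i, 1) or N(i-3, 1)
                X[s, i] = y[s] * rng.normal(mean, 1.0)
            else:
                X[s, i] = rng.normal(0.0, 1.0)
    return X, y

X, y = linear_toy_data(100)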
[Figure: (a) the linear problem and (b) the nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points.]
5. Experiments (Real-life Data)
Comparison:
(1) Minimizing R^2 W^2.
(2) Rank ordering features according to how much they change the decision boundary, removing those that cause the least change.
(3) Fisher criterion score.
Method (2) ranks feature r by

I(r) = \sum_{i=1}^{N_{sv}} \sum_{j=1}^{N_{sv}} \alpha_i^0 \alpha_j^0 y_i y_j K_r(x_i, x_j), \qquad \text{where} \quad K_r(x_i, x_j) = K(x_i^{r}, x_j^{r}),

the sums run over the N_sv support vectors, and x^r denotes the example x with the r-th feature removed.
5. Experiments (Real-life Data)
Data
Face detection: training set 2,429 / 13,229; test set 104 / 2M; dimension 1,740.
Pedestrian detection: training set 924 / 10,044; test set 124 / 800K; dimension 1,326.
Cancer morphology classification: training 38, test 34; dimension 7,129. 1 error using all genes; 0 errors for (1) using 20 genes, 5 errors for (2), 3 errors for (3).
[Figure: (a) ROC curves for face detection, the top curves for 725 features and the bottom one for 120 features. (b) ROC curves for pedestrian detection using all features and 120 features. Solid: all features; solid with a circle: (1); dotted and dashed: (2); dotted: (3).]
6. Conclusion
Introduces a wrapper method to perform feature selection for SVMs.
Computationally feasible for high-dimensional datasets, compared to existing wrapper methods.
Superior to some filter methods.