Feature Selection for SVMs
J. Weston et al., NIPS 2000. Presented by 오장민 (2000/01/04).
Second reference: Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999.
Abstract
A method of feature selection for Support Vector Machines (SVMs).
An efficient wrapper method.
Superior to some standard feature selection algorithms.
1. Introduction
Importance of feature selection in supervised learning: generalization performance, running time requirements, and interpretational issues.
SVM: state of the art in classification and regression.
Objective: to select a subset of features while preserving or improving the discriminative ability of a classifier.
2. The Feature Selection Problem
Definition: find the m << n features with the smallest expected generalization error, which is unknown and must be estimated.
Given a fixed set of functions y = f(x, α), find a preprocessing of the data x ↦ (x ∘ σ), σ ∈ {0,1}^n, where x ∘ σ denotes the elementwise product (x_1 σ_1, ..., x_n σ_n), and the parameters α minimizing

\tau(\sigma, \alpha) = \int V(y, f((x \circ \sigma), \alpha)) \, dF(x, y)

subject to ||σ||_0 = m, where F(x, y) is unknown and V(·,·) is a loss function.
Four Issues in Feature Selection
Starting point: forward / backward / middle.
Search organization: exhaustive search (2^N) / heuristic search.
Evaluation strategy: filter (independent of any learning algorithm) / wrapper (uses the bias of a particular induction algorithm).
Stopping criterion.
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Feature Filters
Forward selection: only addition to the subset.
Backward elimination: only deletion from the subset.
Example: FOCUS [Almuallim and Dietterich 92] finds the minimum combination of features that divides the training data into pure classes; the feature minimizing entropy is the most discriminating feature (its value differs between positive and negative examples).
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Wrapper Methods
Use an induction algorithm to estimate the merit of feature subsets.
Take the inductive bias of the learner into account.
Better results, but slower (the induction algorithm is called repeatedly).
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
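To make the wrapper idea concrete, here is a minimal sketch (my own illustration, not from the slides), assuming scikit-learn and generic (X, y) classification arrays: greedy forward selection that repeatedly calls the induction algorithm, here a linear SVM scored by cross-validation.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=5, cv=5):
    """Greedily add the feature that most improves the CV accuracy of an SVM."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(SVC(kernel="linear", C=1.0), X[:, cols], y, cv=cv).mean()
            scores.append((acc, f))
        acc, f = max(scores)            # best candidate this round
        if acc <= best_score:           # stop when no candidate improves the score
            break
        best_score = acc
        selected.append(f)
        remaining.remove(f)
    return selected, best_score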
Correlation-based Feature Selection
Relevant: a feature V_i is relevant if its values vary systematically with category membership, i.e. p(C = c | V_i = v_i) ≠ p(C = c) for some c and v_i.
Redundant: a feature is redundant if one or more of the other features are correlated with it.
A good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other.
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Pearson Correlation Coefficient

r_{zc} = \frac{k \, \bar{r}_{zi}}{\sqrt{k + k(k-1) \, \bar{r}_{ii}}}

z: the outside variable, c: the composite, k: the number of components, \bar{r}_{zi}: the average correlation between the components and the outside variable, \bar{r}_{ii}: the average inter-correlation between the components.
Higher r_zi → higher r_zc.
Lower r_ii → higher r_zc.
Larger k → higher r_zc.
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
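As a small numeric illustration of this formula (my own sketch, not from the thesis), with r_zi and r_ii taken as the given average correlations:

import math

def composite_correlation(k, r_zi, r_ii):
    """r_zc = k * r_zi / sqrt(k + k*(k-1)*r_ii): correlation between a
    composite of k components and the outside variable."""
    return k * r_zi / math.sqrt(k + k * (k - 1) * r_ii)

print(composite_correlation(5, 0.4, 0.1))   # strong class correlation, low redundancy
print(composite_correlation(5, 0.4, 0.5))   # same r_zi, but redundant components lower the merit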
2. The Feature Selection Problem (cont'd)
Goal: take advantage of the performance gain of wrapper methods while avoiding their computational complexity.
3. Support Vector Learning
Idea: map x into a high-dimensional space Φ(x) and construct an optimal hyperplane in that space.
The mapping Φ(·) is performed implicitly through a kernel function K(·,·).
The optimal hyperplane is the maximum-margin classifier; finding it reduces to a QP problem.
Decision function given by the SVM:
f(x) = w \cdot \Phi(x) + b = \sum_i \alpha_i^0 y_i K(x_i, x) + b

where the coefficients \alpha^0 solve the QP problem

\max_\alpha \; W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

subject to

\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, l.
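For concreteness, a short sketch (my own, assuming scikit-learn and made-up toy data) checking that a fitted SVM's decision function equals the support-vector expansion above:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # toy data
clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, y)

def rbf(A, b, gamma=0.1):
    """RBF kernel between each row of A and the single point b."""
    return np.exp(-gamma * np.sum((A - b) ** 2, axis=-1))

# f(x) = sum_i alpha_i^0 y_i K(x_i, x) + b, summed over the support vectors;
# clf.dual_coef_ already stores the signed products alpha_i^0 * y_i.
x = X[0]
f_manual = clf.dual_coef_[0] @ rbf(clf.support_vectors_, x) + clf.intercept_[0]
print(np.isclose(f_manual, clf.decision_function([x])[0]))  # True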
3. Support Vector Learning
Theorem 1. If images of training data of size l belonging to a sphere of size R are separable with the corresponding margin M, then the expectation of the error probability has the bound

E \, P_{err} \le \frac{1}{l} E\left\{ \frac{R^2}{M^2} \right\} = \frac{1}{l} E\left\{ R^2 W^2(\alpha^0) \right\},

where the expectation is taken over sets of training data of size l, and the images are the mapped points Φ(x_1), ..., Φ(x_l).
Theorem 1 shows that the performance depends on both R and M, where R is controlled by the mapping function Φ(·).
4. Feature Selection for SVMs
Recall the objective: minimize

\tau(\sigma, \alpha) = \int V(y, f((x \circ \sigma), \alpha)) \, dF(x, y)

over σ and α.
Enlarge the set of functions by replacing x with the scaled input x ∘ σ:

f(x, w, b) \;\to\; f(x \circ \sigma, w, b)

K_\sigma(x, y) = K(x \circ \sigma, y \circ \sigma) = \left( \Phi(x \circ \sigma) \cdot \Phi(y \circ \sigma) \right)
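A minimal sketch of the scaled kernel (my own illustration, using an RBF kernel as the base K): the elementwise product with σ is applied to both arguments before the ordinary kernel is evaluated, so setting σ_i = 0 removes feature i from the kernel entirely.

import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Ordinary RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def sigma_kernel(A, B, sigma, gamma=0.5):
    """K_sigma(x, y) = K(x * sigma, y * sigma): features are rescaled by sigma
    before the base kernel is computed."""
    return rbf_kernel(A * sigma, B * sigma, gamma)

# With sigma = (1, 1, 0, 0) the last two features no longer influence the
# kernel, and hence no longer influence an SVM trained with it.
A = np.random.randn(5, 4)
print(sigma_kernel(A, A, np.array([1.0, 1.0, 0.0, 0.0])))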
4. Feature Selection for SVMs
Minimize R^2 W^2 over σ, where

R^2 W^2(\sigma) = R^2(\sigma) \, W^2(\alpha^0, \sigma)

and R(σ), the radius of the smallest sphere containing the mapped (scaled) training points, is computed as

R^2(\sigma) = \max_\beta \; \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j)
\quad \text{subject to} \quad \sum_i \beta_i = 1, \; \beta_i \ge 0, \; i = 1, \dots, l

(Vapnik's statistical learning theory).
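Numerically, the radius problem can be sketched with an off-the-shelf constrained optimizer (my own illustration, assuming SciPy; the paper treats this as a QP):

import numpy as np
from scipy.optimize import minimize

def radius_squared(K):
    """R^2 = max over the simplex of sum_i beta_i K_ii - beta^T K beta,
    where K is the kernel matrix of the (scaled) training data."""
    n = K.shape[0]
    diag = np.diag(K)

    def neg_objective(beta):
        return -(beta @ diag - beta @ K @ beta)

    constraints = ({"type": "eq", "fun": lambda b: b.sum() - 1.0},)
    result = minimize(neg_objective, np.full(n, 1.0 / n), method="SLSQP",
                      bounds=[(0.0, 1.0)] * n, constraints=constraints)
    return -result.fun

# Sanity check with a linear kernel: for points on the unit circle the
# smallest enclosing sphere has radius 1, so R^2 should come out near 1.
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
print(radius_squared(X @ X.T))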
4. Feature Selection for SVMs
Finding the minimum of R^2 W^2 over σ ∈ {0,1}^n requires searching over all possible subsets of n features, which is a combinatorial problem.
Approximate algorithm: approximate the binary-valued vector σ ∈ {0,1}^n with a real-valued vector σ ∈ R^n, and optimize R^2 W^2 by gradient descent:

\frac{\partial R^2 W^2}{\partial \sigma_k} = W^2(\alpha^0, \sigma) \frac{\partial R^2(\sigma)}{\partial \sigma_k} + R^2(\sigma) \frac{\partial W^2(\alpha^0, \sigma)}{\partial \sigma_k}
4. Feature Selection for SVMs
Summary: find the minimum of τ(σ, α) by minimizing the approximate integer program

\min_\sigma \; R^2(\sigma) \, W^2(\alpha^0, \sigma) + \lambda \sum_i \sigma_i^{p} \quad \text{subject to} \quad \sigma_i \ge 0, \; i = 1, \dots, n.

For large enough λ, as p → 0 only m elements of σ will be nonzero, approximating the original optimization problem for τ(σ, α).
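The following is a rough, self-contained sketch of the overall idea (my own simplification, not the paper's exact procedure): it assumes scikit-learn, uses a linear kernel so that W^2 = ||w||^2, replaces the smallest-sphere QP by the largest squared distance to the data centroid, and uses finite differences instead of the analytic gradient.

import numpy as np
from sklearn.svm import SVC

def r2w2(X, y, sigma, C=10.0):
    """Approximate R^2 * W^2 for the data scaled elementwise by sigma."""
    Xs = X * sigma
    clf = SVC(kernel="linear", C=C).fit(Xs, y)
    w2 = float(np.sum(clf.coef_ ** 2))                          # W^2 = ||w||^2
    r2 = float(np.max(np.sum((Xs - Xs.mean(0)) ** 2, axis=1)))  # crude stand-in for R^2
    return r2 * w2

def select_by_scaling(X, y, steps=30, lr=0.05, eps=1e-2):
    """Gradient descent on sigma via finite differences; illustrative only,
    and expensive when the number of features is large."""
    sigma = np.ones(X.shape[1])
    for _ in range(steps):
        base = r2w2(X, y, sigma)
        grad = np.zeros_like(sigma)
        for k in range(len(sigma)):
            s = sigma.copy()
            s[k] += eps
            grad[k] = (r2w2(X, y, s) - base) / eps
        sigma = np.clip(sigma - lr * grad / (np.abs(grad).max() + 1e-12), 0.0, 1.0)
    return sigma   # keep the features with the largest sigma values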
5. Experiments (Toy data)
Comparison: standard SVMs, the feature selection algorithm, and three classical filter methods (Pearson correlation coefficients, Fisher criterion score, Kolmogorov-Smirnov test).
Description: a linear SVM for the linear problem and a second-order polynomial kernel for the nonlinear problem; the 2 best features were selected for comparison.
Fisher criterion score:

F(r) = \left| \frac{\mu_r^{+} - \mu_r^{-}}{\sigma_r^{+2} + \sigma_r^{-2}} \right|

where \mu_r^{\pm} is the mean value and \sigma_r^{\pm 2} the variance of the r-th feature in the positive and negative classes.
Kolmogorov-Smirnov test:

KS_{tst}(r) = \sqrt{l} \, \sup \left| \hat{P}\{X \le f_r\} - \hat{P}\{X \le f_r, \, y_r = 1\} \right|

where f_r denotes the r-th feature from each training example, and \hat{P} is the corresponding empirical distribution.
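A minimal sketch of the Fisher criterion score as written above (my own illustration; labels are assumed to be ±1):

import numpy as np

def fisher_score(X, y):
    """F(r) = |mu_r^+ - mu_r^-| / (var_r^+ + var_r^-) for every feature r."""
    Xp, Xn = X[y == 1], X[y == -1]
    return np.abs(Xp.mean(axis=0) - Xn.mean(axis=0)) / (Xp.var(axis=0) + Xn.var(axis=0))

# The 2 best features are those with the largest scores:
# best_two = np.argsort(fisher_score(X, y))[-2:]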
5. Experiments (Toy data)
Data
Linear data: 6 of 202 features were relevant. The probability of y = 1 or −1 was equal. {x1, x2, x3} were drawn as x_i = y N(i, 1) with probability 0.7 and as x_i = N(0, 1) with probability 0.3; {x4, x5, x6} were drawn as x_i = y N(i−3, 1) with probability 0.7 and as x_i = N(0, 1) with probability 0.3. The other features are noise, x_i = N(0, 20).
Nonlinear data: 2 of 52 features were relevant. If y = −1, {x1, x2} are drawn from N(μ1, Σ) or N(μ2, Σ) with equal probability, with Σ = I. If y = 1, {x1, x2} are drawn from two other normal distributions. The other features are noise, x_i = N(0, 20).
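The linear toy problem can be reproduced with a short generator (a sketch following the description above; exact conventions, such as whether 20 is a variance or a standard deviation, may differ from the paper):

import numpy as np

def linear_toy_data(n_samples, seed=0):
    """202-dimensional linear toy problem: features 1-6 are weakly relevant
    (informative with probability 0.7, standard noise otherwise), the rest
    are high-variance noise N(0, 20)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_samples)               # balanced labels
    X = rng.normal(0.0, 20.0, size=(n_samples, 202))      # noise features
    for s in range(n_samples):
        for i in range(6):
            if rng.random() < 0.7:
                mean = (i + 1) if i < 3 else (i + 1) - 3  # N(i, 1) or N(i-3, 1)
                X[s, i] = y[s] * rng.normal(mean, 1.0)
            else:
                X[s, i] = rng.normal(0.0, 1.0)
    return X, y

X, y = linear_toy_data(100)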
[Figure: (a) the linear problem and (b) the nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points.]
5. Experiments (Real-life Data)
Comparison:
(1) Minimizing R^2 W^2.
(2) Rank ordering features according to how much they change the decision boundary, removing those that cause the least change.
(3) Fisher criterion score.
Method (2) ranks feature r by

I(r) = \sum_{i=1}^{N_{sv}} \sum_{j=1}^{N_{sv}} \alpha_i^0 \alpha_j^0 y_i y_j K_r(x_i, x_j), \qquad \text{where} \quad K_r(x_i, x_j) = K(x_i^{r}, x_j^{r}),

the sums run over the N_sv support vectors, and x^r denotes the example x with the r-th feature removed.
5. Experiments (Real-life Data)
Data
Face detection: training set 2,429 / 13,229; test set 104 / 2M; dimension 1,740.
Pedestrian detection: training set 924 / 10,044; test set 124 / 800K; dimension 1,326.
Cancer morphology classification: training 38, test 34; dimension 7,129. 1 error using all genes; 0 errors for (1) using 20 genes, 5 errors for (2), 3 errors for (3).
[Figure: (a) ROC curves for face detection, the top curves for 725 features and the bottom one for 120 features. (b) ROC curves for pedestrian detection using all features and 120 features. Solid: all features; solid with a circle: (1); dotted and dashed: (2); dotted: (3).]
6. Conclusion
Introduces a wrapper method to perform feature selection for SVMs.
Computationally feasible for high-dimensional datasets, compared to existing wrapper methods.
Superior to some filter methods.